
IBM System Blue Gene Solution
Problem Determination Guide

Learn detailed procedures through illustrations

Use sample scenarios as helpful resources

Discover GPFS installation hints and tips

ibm.com/redbooks

Front cover

Octavian Lascu
Peter F Custerson
Marty Fullam
Ravi K Komanduri
Dr. Thanh V Lam
Sean Saunders
Chris Stone
Shinsuke Ueyama
Dino Quintero


International Technical Support Organization

IBM System Blue Gene Solution: Problem Determination Guide

October 2006

SG24-7211-00


Note: Before using this information and the product it supports, read the information in “Notices” on page ix.

First Edition (October 2006)

This edition applies to Version 1, Release 2, Modification 1 of the Blue Gene/L driver (V1R2M1_020_2006-060110).

© Copyright International Business Machines Corporation 2006. All rights reserved.

Note to U.S. Government Users Restricted Rights -- Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.


Contents

Notices
Trademarks

Preface
The team that wrote this redbook
Become a published author
Comments welcome

Chapter 1. Introduction
1.1 Blue Gene/L system overview
1.2 Hardware components of the Blue Gene/L system
1.2.1 Racks
1.2.2 Midplane
1.2.3 Compute (processor) card
1.2.4 I/O card
1.2.5 Node card
1.2.6 Service card
1.2.7 Fan card
1.2.8 Clock card
1.2.9 Bulk power modules/enclosure
1.2.10 Link card
1.2.11 Rack, power module, card, and fan naming conventions
1.3 Blue Gene/L networks
1.3.1 Service network
1.3.2 Functional network
1.3.3 Three dimensional torus (3D torus)
1.3.4 Collective (tree) network
1.3.5 Global barrier and interrupt network
1.4 Service Node
1.4.1 DB2 database
1.4.2 Service Node system processes
1.5 Front-End Node
1.6 External file systems
1.7 Blue Gene/L system software
1.7.1 I/O node kernel
1.7.2 I/O kernel ramdisk
1.7.3 I/O kernel CIOD daemon
1.7.4 Compute Node Kernel



1.7.5 Microloader
1.8 Boot process, job submission, and termination
1.9 System discovery
1.9.1 The bglmaster process
1.9.2 SystemController
1.9.3 Discovery
1.9.4 PostDiscovery
1.9.5 CableDiscovery
1.10 Discovering your system
1.10.1 Discovery logs
1.11 Service actions
1.11.1 PrepareForService
1.11.2 EndServiceAction
1.12 Turning off the system
1.13 Turning on the system

Chapter 2. Problem determination methodology
2.1 Introduction
2.2 Identifying the installed system
2.2.1 Blue Gene/L Web interface (BGWEB)
2.2.2 DB2 select statements of the DB2 database on the SN
2.2.3 Network diagram
2.2.4 Service Node
2.2.5 Front-End Nodes
2.2.6 Control system server logs
2.2.7 File systems (NFS and GPFS)
2.2.8 Job submission
2.2.9 Racks
2.2.10 Midplanes
2.2.11 Clock cards
2.2.12 Service cards
2.2.13 Link cards
2.2.14 Link card chips
2.2.15 Link summary
2.2.16 Node cards
2.2.17 I/O cards
2.2.18 Compute or processor cards
2.3 Sanity checks for installed components
2.3.1 Check the operating system on the SN
2.3.2 Check communication services on the SN
2.3.3 Check that BGWEB is running
2.3.4 Check that DB2 is working
2.3.5 Check that BGLMaster and its child daemons are running



2.3.6 Check the NFS subsystem on the SN
2.3.7 Check that a block can be allocated using mmcs_console
2.3.8 Check that a simple job can run (mmcs_console)
2.3.9 Check the control system server logs
2.3.10 Check remote shell
2.3.11 Check remote command execution with secure shell
2.3.12 Check the network switches
2.3.13 Check the physical Blue Gene/L racks configuration
2.4 Problem determination methodology
2.4.1 Define the problem
2.4.2 Identify the Blue Gene/L system
2.4.3 Identify the problem area
2.5 Identifying core Blue Gene/L system problems
2.6 Identifying IBM LoadLeveler jobs to BGL jobs (CLI)

Chapter 3. Problem determination tools
3.1 Introduction
3.2 Hardware monitor
3.2.1 Collectable information
3.2.2 Starting the tool
3.2.3 Checking the results
3.3 Web interface
3.3.1 Starting the tool
3.3.2 Checking the result
3.4 Diagnostics
3.4.1 Test cases
3.4.2 Starting the tool
3.4.3 Checking the results
3.5 MMCS console
3.5.1 Starting the tool
3.5.2 Checking the results

Chapter 4. Running jobs
4.1 Parallel programming environment
4.2 Compilers
4.2.1 The blrts tool chain
4.2.2 The IBM XLC/XLF compilers
4.3 Submitting jobs using built-in tools
4.3.1 Submitting a job using MMCS
4.3.2 The mpirun program
4.3.3 Example of submitting a job using mpirun
4.4 IBM LoadLeveler
4.4.1 LoadLeveler overview

Contents v


4.4.2 Principles of operation in a Blue Gene/L environment
4.4.3 How LoadLeveler plugs into Blue Gene/L
4.4.4 Configuring LoadLeveler for Blue Gene/L
4.4.5 Making the Blue Gene/L libraries available to LoadLeveler
4.4.6 Setting Blue Gene/L specific environment variables
4.4.7 LoadLeveler and the Blue Gene/L job cycle
4.4.8 LoadLeveler job submission process
4.4.9 LoadLeveler checklist
4.4.10 Updating LoadLeveler in a Blue Gene/L environment

Chapter 5. File systems
5.1 NFS and GPFS
5.1.1 I/O node boot sequence
5.1.2 Additional scripts in I/O node boot sequence
5.2 NFS
5.2.1 How NFS plugs into a Blue Gene/L system
5.2.2 Adding an NFS file system to the Blue Gene/L system
5.2.3 NFS problem determination methodology
5.2.4 NFS checklists
5.3 GPFS
5.3.1 When to use GPFS
5.3.2 Features and concepts of GPFS
5.3.3 GPFS requirements for Blue Gene/L
5.3.4 GPFS supported levels
5.3.5 How GPFS plugs in
5.3.6 Creating the GPFS file system on a remote cluster (gpfsNSD)
5.3.7 Creating a GPFS cluster on Blue Gene/L (bgIO)
5.3.8 Cross mounting the GPFS file system on to Blue Gene/L cluster
5.3.9 GPFS problem determination methodology
5.3.10 GPFS Checklists
5.3.11 References

Chapter 6. Scenarios
6.1 Introduction
6.2 Blue Gene/L core system scenarios
6.2.1 Hardware error: Compute card error
6.2.2 Functional network: Defective cable
6.2.3 Service network: Defective cable
6.2.4 Service Node functional network interface down
6.2.5 SN service network interface down
6.2.6 The /bgl file system full on the SN (no GPFS)
6.2.7 The / file system full on the SN
6.2.8 The /tmp file system is full on the SN



6.2.9 The ciodb daemon is not running on the SN
6.2.10 The idoproxy daemon not running on the SN
6.2.11 The mmcs_server is not running on the SN
6.2.12 DB2 not started on the SN
6.2.13 The bglsysdb user OS password changed (Linux)
6.2.14 Uncontrolled rack power off
6.3 File system scenarios
6.3.1 Port mapper daemon not running on the SN
6.3.2 NFS daemon not running on the SN
6.3.3 GPFS pagepool (wrongly) set to 512MB on bgIO cluster nodes
6.3.4 Secure shell (ssh) is broken
6.3.5 The /bgl file system full (Blue Gene/L uses GPFS)
6.3.6 Installing new Blue Gene/L driver code (driver update)
6.3.7 Duplicate IP addresses in /etc/hosts
6.3.8 Missing I/O node in /etc/hosts
6.3.9 Adding an extra alias for the SN in /etc/hosts
6.4 Job submission scenarios
6.4.1 The mpirun command: scenarios description
6.4.2 The mpirun command: environment variables not set
6.4.3 The mpirun command: incorrect remote command execution (rsh) setup
6.4.4 LoadLeveler: scenarios description
6.4.5 LoadLeveler: job failed
6.4.6 LoadLeveler: job in hold state
6.4.7 LoadLeveler: job disappears
6.4.8 LoadLeveler: Blue Gene/L is absent
6.4.9 LoadLeveler: LoadLeveler cannot start

Chapter 7. Additional topics
7.1 Cluster Systems Management
7.1.1 Overview of CSM
7.1.2 Monitoring the Blue Gene/L database with CSM
7.1.3 Customizing the monitoring capabilities of CSM
7.1.4 Defining your own CSM monitoring constructs
7.1.5 Miscellaneous related information
7.1.6 Conclusion
7.2 Secure shell
7.2.1 Basic cryptography
7.2.2 Secure shell basics
7.2.3 Sample configuration in a cluster environment
7.2.4 Using ssh in a Blue Gene/L environment

Appendix A. Installing and setting up LoadLeveler for Blue Gene/L



Installing LoadLeveler on SN and FENs
Obtaining the rpms
Installing the rpms
Setting up the LoadLeveler cluster
Enabling Blue Gene/L capabilities in LoadLeveler
Setting Blue Gene/L specific environment variables

Appendix B. The sitefs file
The /bgl/dist/etc/rc.d/init.d/sitefs file

Appendix C. The ionode.README file
/bgl/BlueLight/ppcfloor/docs/ionode.README file

Abbreviations and acronyms

Related publications
IBM Redbooks
Other publications
Online resources
How to get IBM Redbooks
Help from IBM

Index



Notices

This information was developed for products and services offered in the U.S.A.

IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service.

IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not give you any license to these patents. You can send license inquiries, in writing, to:

IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-1785 U.S.A.

The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you.

This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice.

Any references in this information to non-IBM Web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this IBM product and use of those Web sites is at your own risk.

IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you.

Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.

This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental.

COPYRIGHT LICENSE:

This information contains sample application programs in source language, which illustrate programming techniques on various operating platforms. You may copy, modify, and distribute these sample programs in any form without payment to IBM, for the purposes of developing, using, marketing or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs. You may copy, modify, and distribute these sample programs in any form without payment to IBM for the purposes of developing, using, marketing, or distributing application programs conforming to IBM's application programming interfaces.



Trademarks

The following terms are trademarks of the International Business Machines Corporation in the United States, other countries, or both:

eServer
ibm.com®
AIX 5L
AIX®
Blue Gene®
DB2®
IBM®
LoadLeveler®
NUMA-Q®
Power PC®
PowerPC®
POWER
POWER4
POWER5
Redbooks
Redbooks (logo)
Sequent®
System p
Tivoli®
1350

The following terms are trademarks of other companies:

Java and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.

Microsoft, Windows, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.

Intel, Intel logo, Intel Inside logo, and Intel Centrino logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States, other countries, or both.

UNIX is a registered trademark of The Open Group in the United States and other countries.

Linux is a trademark of Linus Torvalds in the United States, other countries, or both.

Other company, product, or service names may be trademarks or service marks of others.


Preface

This IBM® Redbook is intended as a problem determination guide for system administrators in a High Performance Computing environment. It can help you find a solution to issues that you encounter on your IBM eServer Blue Gene® system.

This redbook presents an architectural overview of the IBM eServer Blue Gene Solution and some of the principles that were used to design this revolutionary supercomputer. It describes the hardware and software environment that composes this solution, along with a short description of each component and how to identify it in an installed system.

This redbook also includes a problem determination methodology that we developed during our residency, along with the problem determination tools that are available with the basic IBM eServer Blue Gene Solution. It also discusses additional software components that are required for integrating your Blue Gene system in a complex computing environment. These components include file systems (NFS and GPFS) and job submission tools (mpirun and IBM LoadLeveler®).

This redbook also describes a GPFS installation procedure that we used in our test environment and several scenarios, developed following the proposed problem determination methodology, that describe possible issues and their resolution.

Finally, this redbook includes a short introduction to integrating your Blue Gene system in a High Performance Computing environment managed by IBM Cluster Systems Management, as well as an introduction to how you can use secure shell in such an environment.

The team that wrote this redbook

This redbook was produced by a team of specialists from around the world working at the International Technical Support Organization (ITSO), Austin Center.

Octavian Lascu is a Project Leader at the ITSO, Poughkeepsie Center. He writes extensively and teaches IBM classes worldwide on all areas of IBM System p and Linux® clusters. Before joining the ITSO, Octavian worked in IBM Global Services Romania as a software and hardware Services Manager. He holds a Master's Degree in Electronic Engineering from the Polytechnical Institute in Bucharest and is also an IBM Certified Advanced Technical Expert in AIX/PSSP/HACMP. He has worked with IBM since 1992.

Peter F Custerson is a Product Support Specialist based in Farnborough, United Kingdom. He has worked for IBM for nine years, four years for the former Sequent® organization and five years for the IBM UK Unix Support Centre. He currently is the High Performance Computing (HPC) Technical Advisor and concentrates on customer support issues with the HPC product set for our region's HPC customers, including Cluster 1600, 1350, and Blue Gene customers. He holds an honors degree in Computer Studies from the University of Glamorgan in the United Kingdom.

Marty Fullam is a software engineer on the Cluster Systems Management development team. He is the architect and lead developer for the CSM Blue Gene support project. He joined IBM in Poughkeepsie, New York, in 1982, working on electronic design automation tools. He holds a B.E. in Electrical Engineering from SUNY Stonybrook and an M.S. in Computer Engineering from Syracuse University.

Ravi K Komanduri is a software engineer in the High Performance Computing group of IBM Systems and Technology Labs, India. He joined IBM in 2004 and has about 5 1/2 years of total experience in High Performance Computing and Grid technologies. His areas of expertise include parallel programming, benchmarking and performance analysis, and developing software tools on clusters. He is currently involved in functional testing of the Blue Gene control system. He holds a bachelor's degree in Computer Science from Jawaharlal Nehru Technological University, Hyderabad, India.

Dr. Thanh V Lam is a software engineer in Cluster System Test. He leads a team of test engineers in testing cluster software and hardware on multiple platforms. His areas of expertise include LoadLeveler, high performance computing, parallel applications, and Blue Gene. He joined the IBM High Performance Supercomputing Lab, Kingston, New York, in 1988 and started working in service processing for the early scalable parallel systems known as SP. He holds a Doctor of Professional Study in Computing degree from Pace University, White Plains, New York.

Sean Saunders is an HPC Product Support Specialist working for ITS in IBM UK. He joined IBM in 2000, initially working on NUMA-Q® systems, and now mainly supports Cluster 1600, 1350, and Blue Gene systems, including the HPC software stack. He also supports AIX® and Linux. He holds a B.Sc. (Hons) degree in Computer Science from Kingston University, England.



Chris Stone is a Senior IT Specialist based at IBM Hursley Park, United Kingdom. He works as part of the High Performance Computing Services Team based in the UK, mainly on storage systems for customers. He has been working for IBM for over 20 years and in that time has gained a wide range of experience in monitor hardware development, software development, and customer services. He has 7 years of experience in designing and installing GPFS storage systems for customers and has recently gained experience with Blue Gene systems by leading the installation of two Blue Gene systems in Europe. He holds an Honours degree in Electrical and Electronic Engineering from Bristol University in the UK.

Shinsuke Ueyama is a software engineer for Engineering and Technology Services in IBM Japan. He joined IBM in 2005, mainly working on supporting Blue Gene systems, and recently experienced the installation of a Blue Gene system in Japan. He holds a Master's degree in Information Science and Technology from The University of Tokyo, Japan.

Dino Quintero is a Consulting IT Specialist at the ITSO in Poughkeepsie, New York. Before joining the ITSO, he worked as a Performance Analyst for the Enterprise Systems Group and as a Disaster Recovery Architect for IBM Global Services. His areas of expertise include disaster recovery and IBM System p clustering solutions. He is an IBM Certified Professional on IBM System p technologies and is also certified on System p system administration and System p clustering technologies. Currently, he leads technical teams that deliver IBM Redbook solutions on System p clustering technologies and technical workshops worldwide.

Thanks to the following people for their contributions to this project:

Steve Mearns
IBM Portsmouth, UK

Randy A. Brewster
Puneet Chaudhary
Richard Coppinger
Alexander Druyan
Bruce Hempel
Steve Normann
Edwin Varella
IBM Poughkeepsie, NY

Lynn A Boger
Mark Campana
Cathy Cebell
Thomas A. Budnik
Darwin Dumonceaux
Frank Ingram
Randal Massot
Mike Nelson
Jeff Parker
Karl Solie
Mike Woiwood
IBM Rochester, MN

Tom Engelsiepen
IBM San Jose, CA

Mark Mendell
IBM Toronto, Canada

Marc B Dombrowa
David M. Singer
IBM Yorktown Heights, NY

CSM for Blue Gene Development Team:
Ling Gao
IBM Poughkeepsie, NY

ITSO Editing team:
Ella Buslovich
IBM Poughkeepsie, NY
Debbie Willmschen
IBM Raleigh, NC

Become a published author

Join us for a two- to six-week residency program! Help write an IBM Redbook dealing with specific products or solutions, while getting hands-on experience with leading-edge technologies. You'll team with IBM technical professionals, Business Partners or customers.

Your efforts will help increase product acceptance and customer satisfaction. As a bonus, you'll develop a network of contacts in IBM development labs, and increase your productivity and marketability.

Find out more about the residency program, browse the residency index, and apply online at:

ibm.com/redbooks/residencies.html



Comments welcome

Your comments are important to us!

We want our Redbooks to be as helpful as possible. Send us your comments about this or other Redbooks in one of the following ways:

► Use the online Contact us review redbook form found at:
  ibm.com/redbooks

► Send your comments in an e-mail to:
  redbook@us.ibm.com

► Mail your comments to:
  IBM Corporation, International Technical Support Organization
  Dept. HYTD Mail Station P099
  2455 South Road
  Poughkeepsie, NY 12601-5400





Chapter 1. Introduction

This book discusses diagnostic and problem determination methodologies for an IBM eServer Blue Gene Solution (also known as Blue Gene/L). In this chapter, we give a brief overview of Blue Gene/L architecture and describe the networks and the system boot process.



1.1 Blue Gene/L system overview

In this book, we investigate and describe symptoms, techniques, and methodologies for tackling issues that you might encounter while using your Blue Gene/L system. However, we do not describe in detail the hardware or the process of porting and running applications on Blue Gene/L. For that purpose, we recommend the following IBM Redbooks:

► Blue Gene/L: Hardware Overview and Planning, SG24-6796
► Blue Gene/L: System Administration, SG24-7178
► Unfolding the IBM eServer Blue Gene Solution, SG24-6686

To begin using this book and solving any issues that you have with your system, you should have an understanding of your Blue Gene/L hardware, software, network, and boot process, as well as the location codes and LED status reported by the system. This type of information is described in this chapter. Later chapters discuss problem solving and include checklists for the basic Blue Gene/L system as well as other components, such as GPFS and LoadLeveler.

1.2 Hardware components of the Blue Gene/L system

The current configuration of Blue Gene/L is built from dual-core processors placed in pairs on a compute card, together with 1 GB of RAM (512 MB for each dual-core CPU), as shown in Figure 1-1 on page 3. The configuration includes:

► 16 compute cards installed in a node card (32 processors)
► 16 node cards (512 processors) installed into a dual-sided midplane (1/2 rack)
► Two midplanes installed in a rack (1024 processors)

You can link the racks together to a maximum of 64 racks (65536 processors).
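These totals follow directly from the card counts. The following minimal Python sketch (ours, not part of the original publication; the constant names are illustrative only) recomputes them:

# Recompute Blue Gene/L processor and memory totals from the packaging counts
# listed above (each "processor" is one dual-core PowerPC chip with 512 MB RAM).
CHIPS_PER_COMPUTE_CARD = 2
COMPUTE_CARDS_PER_NODE_CARD = 16
NODE_CARDS_PER_MIDPLANE = 16
MIDPLANES_PER_RACK = 2
MAX_RACKS = 64
RAM_PER_CHIP_GB = 0.5  # 512 MB per chip

chips_per_rack = (CHIPS_PER_COMPUTE_CARD * COMPUTE_CARDS_PER_NODE_CARD
                  * NODE_CARDS_PER_MIDPLANE * MIDPLANES_PER_RACK)
print(chips_per_rack)                     # 1024 processors per rack
print(chips_per_rack * MAX_RACKS)         # 65536 processors in a 64-rack system
print(chips_per_rack * RAM_PER_CHIP_GB)   # 512.0 GB of RAM per rack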



Figure 1-1 Blue Gene/L System. The figure shows the packaging hierarchy: chip (2 processors, 2.8/5.6 GF/s, 4 MB cache); compute card (2 chips, 1x2x1, 5.6/11.2 GF/s, 1.0 GB RAM); node card (32 chips, 4x4x2, 16 compute cards and 0-2 I/O cards, 90/180 GF/s, 16 GB RAM); rack (32 node cards, 2.8/5.6 TF/s, 512 GB RAM); system (64 racks, 64x32x32, 180/360 TF/s, 32 TB RAM).

The following paragraphs describe each of the hardware elements in a Blue Gene/L system.

1.2.1 Racks

The hardware that we discuss in this chapter is installed in racks. The current maximum number of racks in a Blue Gene/L system is 64. Each rack has its own location code that is seen in system logs and RAS events. You need to learn these locations to determine problems efficiently.

Figure 1-2 on page 4 and Figure 1-3 on page 5 show where the hardware components are installed in the rack and the location codes. These two figures show a front and a back view for each side of the midplane, and each rack has two midplanes installed.

Note: In the hardware descriptions that we include here, you see two location codes per item. The first code refers to how the Blue Gene/L system software references the location, and the second code is how the hardware is labeled. (Hardware references begin with the letter J.)



Figure 1-2 Card positions in the front of the rack with location names

Figure 1-3 Card positions in the back of the rack with location names



1.2.2 Midplane

As the name suggests, the midplane sits in the middle of the rack (in a vertical position). There are two midplanes per rack: position 0 is at the bottom, and position 1 is at the top. All node, link, service, and fan cards plug into the midplane (see Figure 1-4).

The midplane also provides the communications infrastructure through which the components talk to each other. The network services provided by the node cards, service cards, and midplane are discussed in 1.3, “Blue Gene/L networks” on page 21.

Figure 1-4 Midplane position in the rack



1.2.3 Compute (processor) card

The compute (processor) card, shown in Figure 1-5, is the basic building block of a Blue Gene/L system. It is composed of two 700 MHz 32-bit PowerPC® 440x5 dual-core processors and 1 GB of RAM (512 MB per dual-core CPU).

These nodes are the workhorses of the system, on which applications run. One compute card provides two or four nodes, depending on whether you select to run in communication co-processor (co) or virtual node (vn) mode. In co mode, one core of each PowerPC chip runs the application, while the other handles the message passing. In vn mode, both cores run the application and perform the message passing duties.
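To make the co/vn distinction concrete, here is a short Python illustration (ours, not from the book; the function name is hypothetical) that derives the node count of one compute card from the mode:

def nodes_per_compute_card(mode):
    # Each compute card carries two dual-core chips. In co-processor (co) mode
    # each chip appears as one node (the second core handles message passing);
    # in virtual node (vn) mode both cores run the application.
    chips_per_card = 2
    if mode == "co":
        return chips_per_card * 1
    if mode == "vn":
        return chips_per_card * 2
    raise ValueError("mode must be 'co' or 'vn'")

print(nodes_per_compute_card("co"))  # 2 nodes per compute card
print(nodes_per_compute_card("vn"))  # 4 nodes per compute card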

The node runs the compute node kernel (CNK), a proprietary IBM kernel that is optimized for the Blue Gene/L environment. This kernel is single-user and single-process, supports up to two threads, and does not implement paging (in this configuration, virtual memory is limited to the real memory). It uses a subset of about 40% of the supported Linux system calls. The CNK is also known as the Blue Gene Light Runtime System (BLRTS).

Figure 1-5 Compute (processor) card

1.2.4 I/O card

The I/O card is similar to the compute card, except that these nodes have more memory (currently 2 GB per card, or 1 GB per node) and have a specific use within the system. The I/O card runs a Linux version (a 2.4 uni-processor kernel) known as the mini-control program (MCP). The MCP has been altered from a standard Linux distribution to include support for the Blue Gene/L hardware (including an altered kernel). There is one instance of the MCP per chip, giving two nodes per I/O card.

Important: Even though each chip on the I/O card is a dual-core CPU, only one core is used to run the I/O MCP.



As an application runs on a compute node, its I/O is routed to the outside world through the integrated communication hardware on the I/O card. The I/O card handles I/O operations on behalf of groups of compute nodes. This grouping of compute nodes assigned to an I/O card is known as its processor set, or pset.
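As a rough illustration of the compute-to-I/O ratio (a sketch of ours, assuming an I/O-rich node card with both I/O card slots populated), the pset size can be derived from the card counts given earlier:

# Compute nodes served by each I/O node on one I/O-rich node card
# (assumption: both I/O card slots are populated).
compute_cards_per_node_card = 16
compute_nodes = compute_cards_per_node_card * 2  # one node per chip (co mode)
io_cards_per_node_card = 2                       # 0-2 I/O cards are possible
io_nodes = io_cards_per_node_card * 2            # one MCP instance per chip

print(compute_nodes // io_nodes)  # 8 compute nodes per I/O node in this case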

You can tell the difference between the compute and I/O cards by looking at the edge connectors. The compute card has a total of six edge connectors, one more than the I/O card (see Figure 1-6).

Figure 1-6 I/O card

1.2.5 Node card

The node card is the location into which the compute and I/O cards are inserted. Sixteen compute cards are added to each node card, as well as up to two optional I/O cards. Each card slot inside the node card has its own unique location code, as shown in Figure 1-7 on page 9.

Note: If every I/O node position in the system has a card installed, the system is known as I/O rich.

The node cards are then installed into a midplane, eight in the front and eight in the back. This installation is repeated for the second midplane, giving a total of 32 node cards per rack, each with its own unique location code.



Figure 1-7 Node card block diagram showing location codes

Note: Each of the locations has a specific use. For example:
– J1: Midplane connector
– C0 → CF or J2 → J17: Compute cards
– I0, I1 or J18, J19: I/O cards

The node card has LED status indicators on the front (Figure 1-8 on page 10). The center LED panel shows whether the card is linked to the midplane, whether the IDo link is working, and whether the card itself or something plugged into the card has a fault.



Figure 1-8 Node card center LED panel

Table 1-1 Node card status LEDs

LED name/color   Indication
LINK/Green       ON - There is an active link from the service network into the node card. The card can be monitored and controlled.
                 OFF - The control link is missing. All the other LEDs might contain inaccurate information because the service network connection is used to set them.
4/Green          ON (flashing) - Card is operating.
                 OFF - Card is down. Node cards should only be removed when this LED is OFF.
3/Green          FLASHING - Un-initialized.
2/Green          FLASHING - Un-initialized.
1/Green          FLASHING - Un-initialized.
0/Amber          ON - Card has a fault.
                 OFF - Card has no fault.
                 FLASHING - Card needs human interaction.

Note: If LED 0 is on after everything is initialized, one of the power modules might have failed. In this case, the card is operational but no longer has its redundant power.



In addition to the center LED panel, each RJ45 jack marked 1, 2, 3, or 4 has status LEDs. (Figure 1-8 on page 10 shows two RJ45 jacks, marked 2 and 3; the left green LED in 3 is on.)

Table 1-2 Node card RJ45 LEDs and indications

LED name/color/position                    Indication
RJ45 Gbit ethernet link LED/Green/Left     ON - Active Gbit ethernet link.
                                           OFF - No active Gbit ethernet link.
RJ45 Gbit ethernet link LED/Green/Right    FLASHING - Indicates traffic.
                                           OFF - No traffic.

1.2.6 Service card

The service card provides the major control functions for each midplane:

► Provides an interface to the network that controls the fans and cooling through the I²C network. (See also 1.3.1, “Service network” on page 22.)
► Distributes the clock signal to all node cards on the midplane. The clock card is connected directly to the front of each service card. (See 1.2.8, “Clock card” on page 15.)
► Controls the ethernet switch for the midplane TCP/IP network.
► Controls the boot sequence of each midplane.
► Delivers persistent power to the link cards, so that the torus and collective networks continue to function in the event of a power outage and additional racks in the Blue Gene/L system continue to work. See 1.3.3, “Three dimensional torus (3D torus)” on page 26 and 1.3.4, “Collective (tree) network” on page 26 for more details.
► Connects devices on the midplane to the service node through the service network. Each device that is inserted into the midplane has an IDo chip that acts as a bridge between the network and the other hardware interfaces. (For more details, see 1.3.1, “Service network” on page 22.)

The service card also has its own set of LEDs to help with problem determination, as illustrated in Figure 1-9.

Figure 1-9 Front of the service card



Status LEDs are located on the right-hand side of the service card. Table 1-3 explains these LEDs.

Table 1-3 Service card status LEDs

LED name/color   Indication
Power/Green      ON - 3.3 volt persistent power to the midplane is present. This power is used to run the service card and to run the service path logic on all other link cards, node cards, and fan modules in the midplane. This LED should always be on during normal operations.
                 OFF - 3.3 volt persistent power to the midplane is missing. Service cards should only be removed when this LED is OFF.
4/Green          ON or FLASHING - Card is operating.
                 OFF - Card is down.
3/Amber          OFF - Card initialized or no rack power.
                 FLASHING - Card un-initialized or needs to be brought up after a power cycle.
2/Amber          OFF - Card initialized or no rack power.
                 FLASHING - Card un-initialized or needs to be brought up after a power cycle.
1/Amber          OFF - Card initialized or no rack power.
                 FLASHING - Card un-initialized or needs to be brought up after a power cycle.
0/Amber          ON - Card has a fault.
                 OFF - Card has no fault.
                 FLASHING - Card un-initialized or needs to be brought up after a power cycle.

The service card also hosts 100 Mbps and 1 Gbps network ports. The 100 Mbps port is used in the discovery process, and the 1 Gbps port is used for talking to the hardware when the discovery process has completed. For more information about the discovery process, see 1.9.3, “Discovery” on page 43. Each RJ45 jack on the service card has status LEDs that are similar to the ones on the node card (see Table 1-2 on page 11).



1.2.7 Fan card

The cooling of the Blue Gene/L system is achieved through banks of fans. Each fan unit contains three individual fans and two status LEDs on the front (Figure 1-10).

Figure 1-10 Fan unit front (showing LEDs) and fan unit removed from rack

The fans draw the air through the intake plenum and through the rack, and release it out through the output plenum towards the ceiling (Figure 1-11). It is the plenums that give the Blue Gene/L its unique shape.

Figure 1-11 Rack cooling



There are 20 fan units (clusters) per rack: 10 are installed in the front and 10 in the back, each with its own location code. (For the location codes, see Figure 1-2 on page 4.) Figure 1-12 shows the fan clusters installed in the rack.

Figure 1-12 Fans installed in a rack - exhaust plenum removed for clarity

Table 1-4 shows the significance of the fan status LEDs.

Table 1-4 Fan status LEDs

Fan Good (Green, left)   Fan Fault (Amber, right)   Indication
OFF                      OFF                        Not powered; can be in this mode during the first few seconds of rack power up.
ON                       ON                         Autonomous mode; host communication failure.
ON                       OFF                        Fan is working normally.
OFF                      ON                         One or more fans in the module is operating slowly or not at all. The fan assembly is bad and needs to be looked at by hardware support.
FLASHING                 ON or OFF                  Identification mode, used to identify a specific fan unit.



1.2.8 Clock card

The clock card is attached physically to the bottom of the rack and is connected to the midplanes in the system through special cables that connect to the coaxial ports located on the front of each service card (Figure 1-13). Because Blue Gene/L is a massively parallel system, the clocks on all nodes that are involved in running a job must be synchronized (especially for any MPI job). The clock card logic provides the necessary synchronization for all node cards.

Figure 1-13 Clock card

1.2.9 Bulk power modules/enclosure

The bulk power modules (BPMs) are located on top of each rack. There are seven modules per rack (three in the front and four in the back of the rack), connected in an n+1 redundancy scheme. The modules are inserted into the bulk power enclosure (BPE), as shown in Figure 1-14, which also houses the circuit breaker module that is used to turn the rack on and off.

Figure 1-14 Front view of the BPE (three BPMs and a circuit breaker)



The BPMs also have status LEDs, as shown in Figure 1-15.

Figure 1-15 BPM (LEDs on top left)

Table 1-5 details the significance of these LEDs.

Table 1-5 BPM status LEDs

LED Name/Color/Position   Indication
AC Good/Green/Left        ON - AC input to the power module is present and within specification. OFF - There is a problem with the AC supply.
DC Good/Green/Middle      ON - DC input to the power module is present and within specification. OFF - There is a problem with the DC supply.
Fault/Amber/Right         ON - Replace the power module. OFF - Operating normally (as long as AC Good and DC Good are ON).
All LEDs                  OFF - No power.



1.2.10 Link card

The link card is used to connect the different Blue Gene/L internal networks between compute (processor) cards on different midplanes. It allows the Blue Gene/L to expand beyond the physical midplane limit. The link card also provides connectors for X, Y, and Z dimension cabling through the torus cables that are plugged into it (as shown in Figure 1-16). The Z cables go between midplanes that are located in the same rack, the Y cables go between midplanes in the same row, and the X cables go between midplanes in different rows.

Figure 1-16 Direction of X, Y, and Z cabling

Note: The process of cabling a Blue Gene/L is beyond the scope of this redbook and is covered by the installation team when the system is installed. What you need to know is the locations and socket names on the link card (as illustrated in Figure 1-17).



Figure 1-17 Link card that shows the location codes of X, Y, and Z cable connections



1.2.11 Rack, power module, card, and fan naming conventions

Each card is named according to its position in the rack. Figure 1-18, Figure 1-19, and Figure 1-20 depict the way a unique system location is built for each component.

Racks: Rxx
– Rack row (0-F) and rack column (0-F)

Power modules: Rxx-Px
– Power module (0-7): 0-3 left to right facing the front, 4-7 left to right facing the rear
– Note: Position 0 is reserved for the circuit breaker

Midplanes: Rxx-Mx
– Midplane (0-1): 0 = Bottom, 1 = Top

Clock cards: Rxx-K
– One clock card per rack

Figure 1-18 Hardware naming convention, 1 of 3

Fan assemblies: Rxx-Mx-Ax
– Fan assembly (1-A): 1 = Bottom front, 9 = Top front (odd numbers); 2 = Bottom rear, A = Top rear (even numbers)

Link cards: Rxx-Mx-Lx
– Link card (0-3): 0 = Bottom front, 1 = Top front; 2 = Bottom rear, 3 = Top rear

Fans: Rxx-Mx-Ax-Fx
– Fan (0-2): 0 = Tailstock, 2 = Midplane

Service cards: Rxx-Mx-S
– Service card, one per rack

Figure 1-19 Hardware naming convention, 2 of 3

Node cards: Rxx-Mx-Nx
– Node card (0-F): 0 = Bottom front, 7 = Top front; 8 = Bottom rear, F = Top rear

Compute cards: Rxx-Mx-Nx-C:Jxx
– Compute card (J02-J17): even numbers are on the right, odd numbers on the left; lower numbers are toward the midplane, upper numbers toward the tailstock

I/O cards: Rxx-Mx-Nx-I:Jxx
– I/O card (J18-J19): 18 = Right, 19 = Left

Figure 1-20 Hardware naming convention, 3 of 3

1.3 Blue Gene/L networks

Blue Gene/L has five networks that connect the internal components and the system to the outside world. The networks are:

► Service network
– JTAG
– I²C
► Functional Ethernet
► Torus
► Collective (tree)
► Barrier and Global interrupt



1.3.1 Service network

The service network provides a gateway into the Blue Gene/L system, so that the Service Node can control and monitor the hardware. The Service Node communicates with the Service Card using 100 Mbps or 1 Gbps connections. An Ethernet switch located in the Service Card provides two Fast Ethernet and two Gigabit Ethernet ports on the front of the Service Card. The internal switch also provides 20 additional ports, which are used to connect to the IDo chips located on the 16 node cards and four link cards hosted on the midplane (Figure 1-21).

Figure 1-21 Service Network

The hardware on the node and link cards is unable to talk directly to the Service Network. These components communicate using their own protocols, and the IDo chip acts as a bridge between the service network and these communication protocols. The two protocols are:

► JTAG
► I²C



JTAG

Each processor has an interface onto a control system. The protocol used is called the Joint Test Action Group (JTAG) interface, which is an Institute of Electrical and Electronics Engineers (IEEE) 1149.1 standard. The IDo chip converts the 100 Mbps Ethernet bus into 40 JTAG buses: two for each Compute Node (32 in all, one for control of each processor), two for each of the I/O nodes (4 in all, one for control of each processor), and four for the Gigabit Ethernet transceivers that are associated with the I/O nodes. On Blue Gene/L, JTAG provides:

► Hardware control to turn on the CPU and start its clock.
► Hardware diagnostics and debugging.
► Delivery of the microloader to each CPU to start the boot process.
► A mailbox function for recording messages from the MCP or CNK, which are used for recording RAS events.

Figure 1-22 illustrates node card links to the JTAG network.

Figure 1-22 Node card links to the JTAG network



I²C

The Inter-Integrated Circuit (I²C) bus (and protocol) is used to interface with the fan control logic and temperature sensors in the Blue Gene/L system (Figure 1-23). Data such as fan speed and voltages are also reported using this 2-wire serial protocol.

Figure 1-23 Service network showing JTAG and I²C connections

1.3.2 Functional network

The functional network connects the Blue Gene/L I/O nodes, Service Node, Front-end Nodes (FENs), and file system providers together in one network. The functional network:

► Provides system software and application data access to the I/O and Compute nodes, because there is no persistent data storage inside the Blue Gene/L racks. This access is provided through the I/O nodes over the functional network, in the form of NFS and GPFS mounts from external sources. (For more information, see Chapter 5, "File systems" on page 211.)




► Communicates stdin, stdout, and stderr from the Compute Nodes back to where the application was submitted, through the ciod and ciodb processes that are running on the I/O Node and Service Node (as shown in Figure 1-24). For more information, see 1.4, "Service Node" on page 29.

Figure 1-24 Functional network



1.3.3 Three dimensional torus (3D torus)

The three dimensional torus (3D torus) is the first of the three specialized networks implemented on the Blue Gene/L system to enable high performance parallel computing. The 3D torus network is used for general purpose, point-to-point message passing and for multicast operations to a selected class of nodes when an application is running.

Each processor is directly connected to six neighbor processors, two in each dimension (X, Y, and Z: X+1, X-1, Y+1, Y-1, Z+1, Z-1), as shown in Figure 1-25. The tori are implemented in hardware by the node cards, midplanes, link cards, and torus cables.

Figure 1-25 4x4x4 (64) node torus

For more information about the torus network, see Unfolding the IBM eServer Blue Gene Solution, SG24-6686.

1.3.4 Collective (tree) network

The collective network is the second Blue Gene/L specialized network. It is used for one-to-all, all-to-one, and all-to-all communication. It connects all compute nodes in the shape of a tree, and any node can be the tree root (originating point). Any compute node can send messages up or down the tree structure, and that message can stop at any level (see Figure 1-26).

Figure 1-26 Collective network

In a system with as many processors as Blue Gene/L, it is impractical to provide each processor with its own external connection. Instead, I/O nodes with dedicated external connections handle external I/O operations on behalf of groups of compute nodes. This grouping of compute nodes to an assigned I/O node is known as its processor set, or pset.

The I/O nodes are connected to the collective network but do not participate in global messaging; they just handle I/O requests. All compute nodes exist under their associated I/O node in the collective (tree) network.

1.3.5 Global barrier and interrupt network

The third specialized network is the global barrier and interrupt network. A global barrier is a way of synchronizing groups of compute nodes: when a barrier is raised by an application, the nodes wait until everyone reaches a certain position or condition, so that they can all continue. An interrupt is an asynchronous signal that indicates the need for attention (for example, an error condition).

Every node on the Blue Gene/L system is connected to the global barrier and interrupt network through four inputs and four outputs. Each input and output pair forms a channel in the network. Each of these channels can be independently programmed as a global logical "OR" or "AND" statement, depending on the polarity of the signals.

On each node card, the outputs of eight of the compute connections, together with the outputs of the optional I/O connections, are connected to form a "dot-OR" network. Each node card therefore has four "dot-OR" networks. These are then connected into the node card's IDo chip.

The IDo chip on the Node Card samples all the "dot-OR" networks on that card and passes any signals on to other "dot-OR" networks in the same card or on to other node cards.

The global barrier and interrupt network on a midplane is divided into quadrants, each with four node cards. One of these node cards serves as the head of the quadrant.

The outputs of the three non-head node cards are connected together and fed into the IDo chip of the quadrant head along with the quadrant head's "dot-OR" networks. The output of each quadrant head is connected to the IDo chips on the link cards (see Figure 1-27). A link card handles each global interrupt and barrier channel.

The output signal from a node is called the up signal and carries the node's contribution all the way to the top of the partition. The combined signals are then fed back to all the nodes of the partition. This input signal is called the down signal.



Figure 1-27 Global barrier and interrupt network

1.4 Service Node

The Service Node is the control system for the Blue Gene/L racks. It is an IBM System p system (stand-alone or LPAR) running SUSE Linux Enterprise Server 9 (SLES 9) and has connections to the Service and Functional networks. A DB2 database is installed and running on the Service Node, and it contains the current state of the Blue Gene/L as well as jobs and system configuration. The Blue Gene/L specific set of daemons runs on the Service Node, utilizing the database and performing activities such as booting, job submission, and hardware control.

1.4.1 DB2 database

The Service Node runs a DB2 database, which is used to store the following four categories of data about the Blue Gene/L:



► Configuration data: Includes system configuration data, operational data, and historical data.

– System configuration data, which includes a representation of the physical system in database tables. All system hardware and connections are recorded in this category of data and are only altered when a component fails or is replaced.

Each component of the system is represented in a database table with its relevant information. All the entries in the tables are identified by the hardware component's unique serial number, as follows:

Machine: Generated value
Midplane: Service Card's IDo chip License Plate (unique identifier for each chip)
Node Card: Card's IDo chip License Plate
Processor Card: Processor Card's serial number and position in the Node Card (for example, J03, J04) or ECID (Electronic Chip ID, the unique identifier for each chip)
Node: Compute or I/O Node ECID
Link Card: IDo chip License Plate
Service Card: IDo chip License Plate
Link Chip: Chip's ECID
IDo: Chip License Plate

Note: The configuration database is empty when the system is first installed. It is populated by the discovery process as described in 1.9.3, "Discovery" on page 43.

– Operational data, which includes the status of what is currently in use by applications and the status of the jobs themselves. The Midplane Management Control System (MMCS) interacts with this database to schedule jobs and allocate blocks.

– Historical data, which keeps track of hardware changes on the system and shows the history of what has run, when it has run, and on what hardware it has run.

► Environment data: Periodically records the values for all sensors in the system. Voltage levels, fan speeds, and so forth are recorded here.

► RAS data: This data is very important in terms of problem determination. The system records all Reliability, Availability, and Serviceability (RAS) information. It is critical to monitor RAS information as an indicator of system health. For more information about monitoring RAS, see 3.2, "Hardware monitor" on page 114 and "RAS events" on page 126. (A sample query against this data is shown after this list.)

► Diagnostic data: Contains the results from diagnostic tests on the hardware.
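Because these categories live in ordinary DB2 tables, you can inspect them directly with SQL. The following is only a minimal sketch for pulling recent FATAL RAS events: the BGLEVENTLOG table is referenced later in this chapter, but the SEVERITY and EVENT_TIME column names used here are assumptions that might differ on your driver level, so verify them first with db2 "describe table bgleventlog".

source /discovery/db.src
db2 "select * from bgleventlog where severity = 'FATAL' order by event_time desc fetch first 20 rows only"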

1.4.2 Service Node system processes

The following paragraphs describe the Blue Gene/L system processes that run on the Service Node.

mmcs_db_server
The Midplane Management Control System (MMCS) server process is responsible for the management of blocks. Blocks are partitions (sets of compute and I/O nodes) of the Blue Gene/L in which jobs run. The mmcs_db_server process configures blocks at boot time and identifies what physical hardware should be used and in what configuration. It also polls the database for block actions and starts the boot processes.

ciodb
After the blocks are booted, ciodb manages the job launch to the block. It then handles passing back stdin, stdout, and stderr for each job. The ciodb daemon talks to the ciod process running on the I/O node.

idoproxydb
The idoproxydb daemon handles hardware-related communication through the Service Network. It communicates with the IDo chips located on the Service, Link, and Node Cards.

bglmaster
The bglmaster process is the parent process for the other three system processes. It starts all three of the main system processes (idoproxy, mmcs_db_server, and ciodb) and restarts them if a process ends for any reason. It can also provide information about the latest status of the spawned processes.
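If you need a quick check that bglmaster and the processes it spawns are actually running, standard Linux tools are sufficient. This is only a sketch; the exact process names can vary slightly by driver level.

ps -ef | egrep 'bglmaster|mmcs_db_server|ciodb|idoproxy' | grep -v grep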

Additional software
For the Service Node to function, additional software is required. For more information, see Unfolding the IBM eServer Blue Gene Solution, SG24-6686.



1.5 Front-End Node

A Front-End Node is another IBM System p running SUSE Linux Enterprise Server 9 (SLES 9). Multiple Front-End Nodes can be installed. This is where users log in and submit jobs to the Blue Gene/L. IBM compilers and Blue Gene/L compiler extensions are installed so that users can cross-compile code to run on the Blue Gene/L hardware. The user then submits the job to the system through a job scheduler.

The Front-End Node is needed because compilations and handling job I/O from many users can have a severe effect on the performance of the Service Node. You can find more information about compiling and submitting jobs in Chapter 4, "Running jobs" on page 141.

1.6 External file systems

As we explained in 1.3.2, "Functional network" on page 24, the Blue Gene/L system does not have any persistent storage (disk) directly attached. Storage is provided through external file systems such as NFS or GPFS (as illustrated in Figure 1-28). These file systems are mounted by the I/O nodes, and the Compute nodes perform I/O through the collective network. You can find more information in Chapter 5, "File systems" on page 211.



Figure 1-28 Communication in Blue Gene/L

1.7 Blue Gene/L system software

So far we have talked about the physical hardware and the software that runs on the Service Node. Now, we move on to the software that actually runs on the Blue Gene/L hardware (compute and I/O nodes).

1.7.1 I/O node kernel

When a block is booted, each I/O node within that block receives the same boot image and configuration. This is in fact a port of Linux with specific patches to support the Blue Gene/L hardware. This altered kernel is also known as the Mini-Control Program (MCP). It is seen as linuximg in the block definition and has a small specialized shell, called BusyBox, that provides a subset of commands and command options. The I/O node boot scripts run the BusyBox commands.



BusyBox is open source software. You can learn more about BusyBox at:
http://www.busybox.net/

1.7.2 I/O kernel ramdisk

The ramdisk is a stripped-down UNIX® file system which contains just the root user, configuration files, and binaries for the services that need to be started. This file system is mounted by the MCP at boot time. It is customized dynamically, so updates to startup files and services are seen immediately by the I/O nodes when they are next booted. The ramdisk is specified by ramdiskimg in the block definition.

1.7.3 I/O kernel CIOD daemon

The Compute node IO Daemon (ciod) is started on the I/O node by the initialization scripts. It controls applications on the Compute Nodes and provides I/O services to them. It interacts with ciodb on the Service Node. The ciod daemon waits for a connection on TCP port 7000. When it receives the connection, it reads commands from ciodb on the socket. The commands that are sent are:

VERSION - Establish protocol
LOGIN - Set up job information
LOAD - Load application
START - Start running the job
KILL - End running job

After the application is running, ciod also reports output from the Compute Nodes back to ciodb. CONSOLE_TEXT messages are stdout and stderr output, and CONSOLE_STATUS is the return status of the application.
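When a job launch hangs, one quick (unofficial) check is whether ciod on a booted I/O node is reachable on TCP port 7000 from the Service Node. This is only a sketch: the I/O node address shown is hypothetical, and it assumes nc (netcat) is installed; telnet to the same port works equally well.

# Replace 172.16.1.10 with the functional-network address of one of your I/O nodes
nc -z -w 5 172.16.1.10 7000 && echo "ciod is listening" || echo "no response on port 7000"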

1.7.4 Compute Node Kernel

For information about the Compute Node Kernel, see 1.2.3, "Compute (processor) card" on page 7. It is seen as blrtsimg in the block definition.



1.7.5 Microloader

The microloader is used to boot inactive processors into a state where they can receive the CNK (compute nodes) or MCP (I/O nodes). It is seen as mloaderimg in the block definition. Example 1-1 shows the block output and the boot images specified.

Example 1-1 Block definition showing image definitions

mmcs$ list bglblock R000_128
OK
==> DBBlock record
_blockid = R000_128
_numpsets = 0
_numbps = 0
_owner =
_istorus = 000
_sizex = 0
_sizey = 0
_sizez = 0
_description = Generated via genSmallBlock
_mode = C
_options =
_status = F
_statuslastmodified = 2006-03-31 15:08:11.992582
_mloaderimg = /bgl/BlueLight/ppcfloor/bglsys/bin/mmcs-mloader.rts
_blrtsimg = /bgl/BlueLight/ppcfloor/bglsys/bin/rts_hw.rts
_linuximg = /bgl/BlueLight/ppcfloor/bglsys/bin/zImage.elf
_ramdiskimg = /bgl/BlueLight/ppcfloor/bglsys/bin/ramdisk.elf
_debuggerimg = none
_debuggerparmsize = 0
_createdate = 2006-03-06 17:56:20.600708
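To reproduce this listing on your own system, you can open the MMCS console and list one of your own blocks. A minimal sketch, using the console invocation that appears later in 1.10, "Discovering your system" (R000_128 is just the block from the example above; substitute your own block name):

cd /bgl/BlueLight/ppcfloor/bglsys/bin
source /discovery/dbprofile
./mmcs_db_console
mmcs$ list bglblock R000_128
mmcs$ quit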



1.8 Boot process, job submission, and termination

The following steps outline the boot process, job submission, and termination:

1. When a user submits a job using mpirun, llsubmit, or submit_job, depending on the command, a new block might be created according to the user's specification or an existing block might be reused.
2. The selected block in the BGLBLOCK database table is set to C for configure. An entry is made in the BGLJOB table with the status of Q for queued.
3. The mmcs_db_server process continually polls the BGLBLOCK database table looking for blocks in the C for configure state.
4. When the mmcs_db_server process finds a block in the configure state, the boot process is started by changing the status of the block to A for allocating. The BGLEVENTLOG table is monitored for any FATAL RAS events. If any are recorded during the boot process, the block is D for de-allocated and then F for freed.
5. Block information is updated with the block owner, which is set to the user ID that was used to submit the job. The mmcs_db_server process then uses idoproxy to establish connections to the IDo chips on the cards where the block is to be booted. If any IDo connections fail, a FATAL RAS event is created in the BGLEVENTLOG database table.
6. IDo commands are used to initialize the chips on the I/O and compute cards.
7. If the block spans multiple midplanes, Link Training is performed on the link chips. Signal patterns are sent between the chips. The patterns are used so that the chips can lock onto each other and synchronize when they recognize the signal.

Figure 1-29 illustrates these steps.



Figure 1-29 Booting a block - initial steps



After the chips are initialized, the microloader that is specified in the block definition is loaded onto the SRAM area of the I/O and compute cards (see Figure 1-30). A checksum is performed to show that the code has arrived as expected. Each processor is then started and the microloader executes. Failure to load the microloader generates a FATAL RAS event.

Figure 1-30 Microloader distributed to processors

The microloader then loads, over the Service Network, the MCP and ramdisk onto the I/O nodes and the CNK onto the compute nodes, performing checksums on each. The loads are performed in parallel, and a start command is sent to each microloader when completed. The microloader then gives up control and starts the loaded code. The MCP and CNK boot, and the nodes become active entities.

Figure 1-31 MCP and CNK loaded



The next steps in the process are:

1. After the mmcs_db_server process has finished sending the start commands, it sets the status of the block to B for booting.
2. During this process, the nodes begin Tree training. Collective and Gigabit Ethernet drivers are loaded, and the MCP and CNK establish links to each other. Training involves sending signal patterns for the nodes to identify each other:
– A synchronization pattern is sent out on the torus and collective networks (torus across midplanes, collective between I/O and Compute Nodes)
– Nodes take turns in sending and receiving signals
– Each node looks out for a particular bit stream
– After the stream is found, the nodes synchronize
– After every node has found its counterparts, training completes
– Failure in the process generates a FATAL RAS event
3. I/O nodes start the Gigabit Ethernet connection to the functional network and NFS mount /bgl from the Service Node. Initialization scripts start on the I/O nodes, and ciod starts.
4. The ciod daemon sends a message down the collective network and waits for the Compute Nodes to respond.
5. When the nodes respond, ciod generates the RAS event CIOD initialized.
6. The mmcs_db_server process waits for all I/O nodes to generate CIOD initialized, and then changes the block to I for initialized.

Figure 1-32 illustrates this process.



Figure 1-32 Initializing the block


The next steps in this process are:

1. When the block is set to I for initialized, mpirun changes the status of the job in the BGLJOB table from Q for queued to S for starting.
2. Then, ciodb polls for jobs in status S. When it finds one, it establishes a connection to ciod on the I/O Node on port 7000. ciod receives the LOAD command from ciodb and sends the application over the collective network to the Compute Nodes.
3. When all Compute Nodes have received the application, the START command is issued. The BGLJOB STATUS is set to R for running.
4. Then, ciodb forwards STDIN to ciod, and ciod forwards STDOUT and STDERR back to ciodb. ciod handles the file I/O on behalf of the Compute Nodes. ciodb continues to poll ciod for job completion, which happens when the job completes or is killed.
5. Finally, ciodb marks the BGLJOB STATUS as T for terminated. ciodb then closes the connection to ciod.

Figure 1-33 illustrates these steps.



Figure 1-33 Starting the job

For a list of block and job states, see Table 4-3 on page 162 and Table 4-4 on page 162.

Logs of the various system processes and I/O nodes are described in 2.2.6, "Control system server logs" on page 61.
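If you want to watch these state transitions yourself, you can poll the two tables directly from DB2. The following is only a sketch: BGLBLOCK and BGLJOB are the table names used in the steps above, but the exact column names (BLOCKID, JOBID, STATUS) are assumptions — confirm them with db2 "describe table" on your system.

source /discovery/db.src
# Block states (C, A, B, I, ...)
db2 "select blockid, status from bglblock"
# Job states (Q, S, R, T, ...)
db2 "select jobid, status from bgljob"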




1.9 System discovery

System discovery is the process of finding all the hardware and communication links in a Blue Gene/L system and initializing them. The discovery process is also responsible for populating and updating the configuration database held on the Service Node. If a component is replaced, it is rediscovered and placed in the database. The old entries are marked as missing, allowing tracking of replaced hardware. For discovery to work, you need to start several processes. We discuss these processes in this section.

1.9.1 The bglmaster process

The bglmaster process is the master process that starts the Blue Gene/L system processes, which provide an interface for talking to the hardware after it is discovered, through the idoproxy daemon.

1.9.2 SystemController

When the racks are powered on, they are in an un-initialized state. In an un-initialized state, we are unable to communicate with the system, because the IDo chips do not have an IP address; we are, therefore, unable to communicate over the Service Network. To discover the system, we use the SystemController process. This process finds the IDo chips and allocates to each of them an IP address that was predefined at installation time in the Service Node database. This operation is done over the 100 Mbps Ethernet connections on the Service Card.

1.9.3 Discovery

After the SystemController has initialized the IDo chips, the Discovery process itself can start to talk to them through the 1 Gbps network connections on each Service Card. There are separate Discovery processes for each row of racks in your environment. Discovery turns on the hardware, finds the components, and populates the relevant database tables (or updates them in the case of a power cycle or hardware change, because the entries are already present). If a device fails to respond to discovery, it is marked as missing. When completed, control of each component is passed over to the idoproxy daemon.

1.9.4 PostDiscovery

After Discovery has populated the database, PostDiscovery checks that the data is valid and cleans the database. It then adds location information for each component to the tables.



1.9.5 CableDiscovery

CableDiscovery is run after the Discovery process is completed. As the name suggests, it looks at the current configuration in the database and discovers Link Cards and the connected data cables. This process only needs to be performed on initial system installation or when a data cable or Link Card is replaced.

1.10 Discovering your system

In this section we provide a procedure that you can use to discover information about your Blue Gene/L system. Log on to your Service Node and execute the following steps:

1. Start the idoproxy.
cd /bgl/BlueLight/ppcfloor/bglsys/bin
./bglmaster start

2. Start SystemController.
cd /discovery
./SystemController start

To view SystemController's messages, issue this command:
./SystemController monitor

3. Start a discovery process for each row of BGL racks.
./Discovery0 start //This is for the first row of BGL racks.
./Discovery1 start //This is for the second row of BGL racks.
...

To view Discovery0 messages:
./Discovery0 monitor

To view Discovery1 messages:
./Discovery1 monitor
...

4. Start PostDiscovery.
./PostDiscovery start

To view PostDiscovery messages:
./PostDiscovery monitor

5. Use DB2 queries or the Web page (if available) to verify that all hardware reports in, as described in 2.2, "Identifying the installed system" on page 57. (A sample verification query is shown after these steps.)

6. After you have checked all information, stop discovery for each of the racks:
./Discovery0 stop //This is for the first row of BGL racks.
./Discovery1 stop //This is for the second row of BGL racks.

7. Stop PostDiscovery.
./PostDiscovery stop

8. Restart bglmaster to restart the system processes.
cd /bgl/BlueLight/ppcfloor/bglsys/bin
./bglmaster restart

9. Set the status of the Link Cards to good, ready for CableDiscovery.
source /discovery/dbprofile
./mmcs_db_console
connecting to mmcs_server
connected to mmcs_server
connected to DB2
mmcs$ pgood_linkcards all
OK
mmcs$ quit

10. Start CableDiscovery.
./bglmaster stop
cd /discovery
./CableDiscoveryAll

To view CableDiscovery messages:
./CableDiscovery monitor

CableDiscovery should end with:
DiscoverCables ended

11. Start the Blue Gene/L system processes.
cd /bgl/BlueLight/ppcfloor/bglsys/bin
./bglmaster start
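As a quick way to carry out the verification in step 5 from the command line, you can count what discovery has populated. This is a sketch only; the BGLNODECARD table and its STATUS column appear later in this chapter (see Example 1-4), where 'M' marks missing hardware.

source /discovery/db.src
# Node cards that discovery found and that are not marked missing
db2 "select count(*) from bglnodecard where status <> 'M'"
# Any hardware still marked missing needs investigation
db2 "select location, status from bglnodecard where status = 'M'"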



1.10.1 Discovery logs

SystemController, Discovery, PostDiscovery, and CableDiscovery all keep logs of their activity. These logs are created in /bgl/BlueLight/logs/BGL. Example 1-2 presents the activity logs as we observed them on our system.

Example 1-2 Discovery process logs

supportsn:/bgl/BlueLight/logs/BGL # ls -l | grep CurrentLog
lrwxrwxrwx 1 root root 58 Mar 31 14:04 CurrentLog-Discovery0 -> /bgl/BlueLight/logs/BGL/Discovery0-2006-03-31-14:04:28.log
lrwxrwxrwx 1 root root 61 Mar 31 14:04 CurrentLog-PostDiscovery -> /bgl/BlueLight/logs/BGL/PostDiscovery-2006-03-31-14:04:43.log
lrwxrwxrwx 1 root root 64 Mar 31 14:04 CurrentLog-SystemController -> /bgl/BlueLight/logs/BGL/SystemController-2006-03-31-14:04:32.log
supportsn:/bgl/BlueLight/logs/BGL # ls -l | grep Cable
-rwxrwxr-x 1 root bgladmin 16493 Mar 8 12:04 CableDiscoveryAll-2006-03-08-12:03:22.log
-rw-r--r-- 1 root root 6294 Mar 28 14:09 CableDiscoveryAll-2006-03-28-14:07:47.log
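The CurrentLog-* symlinks shown above always point at the newest log for each process, so a simple way to follow discovery while it runs is to tail them:

cd /bgl/BlueLight/logs/BGL
tail -f CurrentLog-SystemController CurrentLog-Discovery0 CurrentLog-PostDiscovery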

1.11 Service actions

When your system has a problem that requires a part to be replaced or that requires the system to be shut down, you need to run a service action. A service action prepares the specified location by powering it down in a controlled manner. Parts can then be removed from the rack, or the rack itself can be turned off through the circuit breaker in the BPE. When the service action is complete, we can then close the service action, which turns the hardware back on and brings the location back into production. Two commands are provided to allow this to be done: PrepareForService and EndServiceAction.

1.11.1 PrepareForService

The syntax for this command is:

PrepareForService LocationString [Verbose] [FORCE]

Where:

► LocationString is the location string of the part/card that needs to be serviced. Location codes are shown in 1.2.1, "Racks" on page 3.
► Verbose provides extra command output.
► FORCE is a keyword that indicates that a new Service Action should be started even if there is already an existing active service action for the resource.

Supported locations are:

► R00-M1-N0: a specific NodeCard
► R00-M1-N: all NodeCards in the Midplane
► R11-M0: all Node and LinkCards in the Midplane
► R37-M0-Ax: an individual Fan Module
► R01: a whole rack
► R20-P1: a Bulk Power Module
► R00-M1-L3: a LinkCard. This card requires that the tool turns off ALL Link and NodeCards in the neighborhood of the specified LinkCard. The neighborhood is defined as those cards that are in either the same row or the same column as the specified LinkCard.

Figure 1-18 on page 19, Figure 1-19 on page 20, and Figure 1-20 on page 21 illustrate these locations.

At the end of the PrepareForService command, you are given a service action ID that must be noted for later use to return the hardware to production. Example 1-3 shows how PrepareForService is used for a compute node replacement.

Example 1-3 PrepareForService on a Node Card

bglsn:/discovery # ./PrepareForService R00-M0-N0
Logging to /bgl/BlueLight/logs/BGL/PrepareForService-2006-03-31-11:25:09.log
Mar 31 11:25:10.169 EST: PrepareForService started
Mar 31 11:25:35.363 EST: Freed any blocks using R000
Mar 31 11:25:43.288 EST: Disabled this NodeCard's ethernet port on ServiceCard (R00-M0-S, FF:F2:9F:15:1E:4D:00:0D:60:EA:E1:B2), Port (11)
Mar 31 11:25:43.681 EST: Marked NodeCard (203231503833343000000000594c31304b34323630304b35) missing
Mar 31 11:25:43.711 EST: Deleted node hardware attrs for 34 nodes
Mar 31 11:25:43.712 EST: Card has been successfully powered off!
Mar 31 11:25:43.736 EST: Proceed with service on part (mLctn(R00-M0-N0), mCardSernum(203231503833343000000000594c31304b34323630304b35), mLp(FF:F2:9F:16:F0:95:00:0D:60:E9:0F:6A), mIp(10.0.0.25), mType(4))
Mar 31 11:25:43.736 EST:
Mar 31 11:25:43.750 EST: Service Action ID 19
Mar 31 11:25:43.755 EST: MyShutdownHook - Exiting PrepareForService
Mar 31 11:25:43.756 EST: +++ This logfile is closed +++

As Example 1-3 shows, the Node Card is turned off and marked as missing in the database, together with the 34 nodes that are plugged into it (32 Compute Nodes and 2 I/O Nodes). Example 1-4 confirms this by querying the database.

Example 1-4 Checking hardware is removed from the Service Node database

bglsn:/discovery # db2 "select location,status from bglnodecard where location = 'R00-M0-N0'"

LOCATION                         STATUS
-------------------------------- ------
R00-M0-N0                        M

1 record(s) selected.

bglsn:/discovery # db2 "select location,status from bglprocessorcard where location like 'R00-M0-N0%'"

LOCATION                         STATUS
-------------------------------- ------
R00-M0-N0-C:J02                  M
R00-M0-N0-C:J03                  M
R00-M0-N0-C:J04                  M
R00-M0-N0-C:J05                  M
R00-M0-N0-C:J06                  M
R00-M0-N0-C:J07                  M
R00-M0-N0-C:J08                  M
R00-M0-N0-C:J09                  M
R00-M0-N0-C:J10                  M
R00-M0-N0-C:J11                  M
R00-M0-N0-C:J12                  M
R00-M0-N0-C:J13                  M
R00-M0-N0-C:J14                  M
R00-M0-N0-C:J15                  M
R00-M0-N0-C:J16                  M
R00-M0-N0-C:J17                  M
R00-M0-N0-I:J18                  M

17 record(s) selected.



There is a dedicated table in the database for the service actions. Actions on the same component cannot be done until the previous service action is closed. You can obtain open service actions by looking for entries in BGLSERVICEACTION with a status of O (for Open), as shown in Example 1-5.

Example 1-5 Showing open service actions

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # source /discovery/db.src

Database Connection Information
Database server        = DB2/LINUXPPC 8.2.3
SQL authorization ID   = BGLSYSDB
Local database alias   = BGDB0

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # db2 "select ID,LOCATION,STATUS from bglserviceaction where status='O'"

ID          LOCATION                         STATUS
----------- -------------------------------- ------
         19 R00-M0-N0                        O

1 record(s) selected.

If for some reason you end up with a service action for a component that is not really disabled for service, you can complete it by manually updating the database entry for that service action to C (for Closed), as shown in Example 1-6. This situation can happen if a service action is initialized, but the system is then brought up by another method, such as discovery, rather than by using EndServiceAction.

Example 1-6 Completing a service action manually

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # db2 "select ID,LOCATION,STATUS from bglserviceaction where status='O'"

ID          LOCATION                         STATUS
----------- -------------------------------- ------
         19 R00-M0-N0                        O

1 record(s) selected.

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # db2 "update bglserviceaction set STATUS='C' where ID=19"
DB20000I The SQL command completed successfully.

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # db2 "select ID,LOCATION,STATUS from bglserviceaction where ID=19"

ID          LOCATION                         STATUS
----------- -------------------------------- ------
         19 R00-M0-N0                        C

1 record(s) selected.

PrepareForService logs its invocations to /bgl/BlueLight/logs/BGL. The naming convention for log files begins with "PrepareForService" followed by the date and time (as shown in Example 1-7).

Example 1-7 PrepareForService logs

bglsn:/bgl/BlueLight/logs/BGL # ls PrepareForService*
PrepareForService-2006-03-16-10:45:40.log
PrepareForService-2006-03-30-14:06:40.log
PrepareForService-2006-03-16-10:46:09.log
PrepareForService-2006-03-30-14:07:07.log
PrepareForService-2006-03-16-10:47:01.log

1.11.2 EndServiceAction

The syntax for this command is:

EndServiceAction id [Verbose] [Wait / NoWait]

Where:

► id is the service action ID that was reported by PrepareForService.
► Wait indicates that the command should wait for the component and subcomponents to become active.
► NoWait indicates that the command should return after turning on the component, without waiting for it and its subcomponents to become active in the database.

Example 1-8 shows the return to service of the node card that we disabled in 1.11.1, "PrepareForService" on page 46.

Before starting EndServiceAction, you need to start the systemcontroller, discovery, and postdiscovery processes. These processes are required when the hardware comes back online, because the hardware needs to be rediscovered and marked available instead of missing in the database. When all of the expected hardware is back online, you can stop the systemcontroller, discovery, and postdiscovery processes. A consolidated sketch of this sequence follows.
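Pulling together commands that appear elsewhere in this chapter, the sequence around EndServiceAction typically looks like the following sketch (service action ID 19 is the one from Example 1-3; substitute your own ID, and repeat the Discovery commands for each row of racks):

cd /discovery
./SystemController start
./Discovery0 start
./PostDiscovery start
./EndServiceAction 19
# After all of the expected hardware reports back as active:
./Discovery0 stop
./SystemController stop
./PostDiscovery stop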



Example 1-8 EndServiceAction on a Node card

bglsn:/discovery # ./EndServiceAction 19
Mar 31 13:57:41.376 CST: EndServiceAction started
Mar 31 13:57:55.069 CST: Disabled this NodeCard's ethernet port on ServiceCard (R00-M0-S, FF:F2:9F:15:1E:55:00:0D:60:EA:E1:AA), Port (11)
Mar 31 13:57:55.453 CST: Marked NodeCard (203231503833343000000000594c31324b35323135303148) missing
Mar 31 13:57:55.500 CST: Deleted node hardware attrs for 36 nodes
Mar 31 13:57:55.517 CST: Card has been successfully powered off!
Mar 31 13:57:55.596 CST: Powered Off NodeCard (203231503833343000000000594c31324b35323135303148)
-snip-
Mar 31 13:59:21.252 CST: Enabled all of the NodeCard ethernet ports on ServiceCard (R00-M0-S, FF:F2:9F:15:1E:55:00:0D:60:EA:E1:AA)
Mar 31 13:59:21.302 CST: Changed Midplane R00-M0's status from 'E' to 'A'
Mar 31 14:00:21.363 CST: @ Still waiting for 16 NodeCards in R00-M0 to become active
Mar 31 14:07:21.498 CST:
Mar 31 14:08:21.528 CST: All of R00-M0's NodeCards are active
Mar 31 14:09:21.621 CST: @ Still waiting for 16 compute Processor cards in R00-M0 to become active
Mar 31 14:11:21.669 CST: All of R00-M0's NodeCards are active
Mar 31 14:11:21.684 CST: All of R00-M0's compute Processor cards are active
Mar 31 14:11:21.706 CST: All of R00-M0's compute Nodes are active
Mar 31 14:11:21.707 CST:
Mar 31 14:11:35.531 CST: Enabling the PortB on each of this Midplane's LinkCards - to indicate that PortA is powered on
Mar 31 14:11:36.537 CST: Enabling this Midplane's Port B output drivers
Mar 31 14:11:37.726 CST: Enabling this Midplane's Port A input receivers
Mar 31 14:11:38.946 CST: Ended Service Action Id 19 for R00-M0-N0
Mar 31 14:11:38.952 CST: MyShutdownHook - Exiting EndServiceAction
Mar 31 14:11:38.955 CST: +++ This logfile is closed +++



EndServiceAction logs its invocations to /bgl/BlueLight/logs/BGL. The naming convention for log files starts with "EndServiceAction" followed by the date and time, as shown in Example 1-9.

Example 1-9 EndServiceAction logs

bglsn:/bgl/BlueLight/logs/BGL # ls EndServiceAction*
EndServiceAction-2006-03-16-11:06:08.log
EndServiceAction-2006-03-30-14:30:27.log
EndServiceAction-2006-03-16-11:15:04.log
EndServiceAction-2006-03-30-14:31:16.log
EndServiceAction-2006-03-16-11:49:05.log
EndServiceAction-2006-03-30-14:31:45.log

1.12 Turning off the system

When turning off your Blue Gene/L system, you need to be careful and ensure that you do so in a controlled manner. Simply switching off racks can leave the system in a state that might make it difficult to get it back to operational. To turn off your system, follow these steps:

1. Prepare each individual rack for service using the following commands (a scripted sketch follows these steps):
cd /discovery
./PrepareForService Rxx

This preparation has the effect of shutting down all the Blue Gene/L hardware. Repeat the command, changing xx for each rack in your system. After the PrepareForService command is finished, note the serviceactionid that is displayed at the end of the command output for each command.

2. Stop the Control System using the following commands:
cd /bgl/BlueLight/ppcfloor/bglsys/bin
./bglmaster stop

3. Stop DB2® on the Service Node using the following commands:
su - bglsysdb
db2 force application all
db2 terminate
db2stop

4. Shut down and turn off the Service Node.

5. Turn off the racks.
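If you have many racks, step 1 can be scripted. This is only a sketch, assuming a small system with racks R00, R01, R10, and R11 (substitute your own rack list); you still need to record the Service Action ID that each PrepareForService invocation prints:

cd /discovery
for rack in R00 R01 R10 R11; do
    ./PrepareForService $rack | tee -a /tmp/shutdown-service-actions.log
done
# Review /tmp/shutdown-service-actions.log and note each "Service Action ID"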



1.13 Turning on the system

Having turned off your system properly, you can turn it back on using these steps:

1. Turn on the racks.

2. Turn on and boot the Service Node.

3. Check that DB2 has started. If it has not, start it with the following commands:
su - bglsysdb
db2start

For more details on getting DB2 started and starting it automatically when the system boots, see 2.3.4, "Check that DB2 is working" on page 87.

4. Start the processes that rediscover your hardware and initialize it, using the following commands:
cd /bgl/BlueLight/ppcfloor/bglsys/bin
./bglmaster start
cd /discovery
./SystemController start
./Discovery0 start

Repeat the discovery command for each row of racks in your system, changing the last digit to each row number:
./Discovery1 start
./Discovery2 start

5. Start the PostDiscovery process to check the discovered configuration and add position information.
./PostDiscovery start

6. End the service action that was used to shut down the system.
cd /discovery
./EndServiceAction X

Repeat the EndServiceAction for each rack, using the previously saved serviceactionid.

7. Use DB2 queries or a Web page (if available) to verify that all hardware reports in, as described in 2.2, "Identifying the installed system" on page 57.

8. After the last EndServiceAction completes and all the hardware is shown, stop all the processes previously launched for the system discovery using the following commands:
./SystemController stop
./Discovery0 stop

Repeat the discovery command for each row of racks in your system, changing the last digit to each row number.
./Discovery1 stop
./Discovery2 stop

9. Stop PostDiscovery.
./PostDiscovery stop

10. Restart the Blue Gene/L system processes, so that the system can go back into production.
./bglmaster restart



Chapter 2. Problem determination methodology

This chapter discusses how to identify the various components in an IBM System Blue Gene Solution system. It also includes a list of core Blue Gene/L sanity checks that you can use to ensure that your system is working properly.

This chapter also provides a problem determination methodology that can help you identify the cause of Blue Gene/L system problems. Following the methodology helps you quickly find the issue, identify the Blue Gene/L system, and identify the problem area so that you can confirm and fix the problem.



2.1 Introduction

Whenever you have to work with a complex system, it is essential that you obtain the actual system configuration. This chapter provides a list of tasks that enable you to determine the system configuration for your Blue Gene/L system. We also provide a set of checks to ensure that the components in your core Blue Gene/L system are functioning correctly, and we present what we consider the core of the Blue Gene/L system to be.

Due to the numerous components in a Blue Gene/L system, we consider that the best way to approach a problem is to separate it into different problem areas. The methodology that we discuss here provides a process for this approach that includes three distinct areas:

► Defining the problem.
► Identifying the Blue Gene/L system.
► Identifying the problem area.

This approach allows someone with little Blue Gene/L experience to assess quickly where the problem lies and, perhaps more importantly, to determine whether there is a problem at all. Such problem determination is the key to maintaining a successful running system. The methodology then points to particular check lists that show how to practically determine the problem in each of the areas.

In this book we discuss the Blue Gene/L system in two different ways:

1. Core Blue Gene/L
– Blue Gene/L racks
– Service Node, including the Blue Gene/L processes, NFS, and DB2
– Network switches for the Service and Functional Networks

2. Complex Blue Gene/L
– Core Blue Gene/L
– Front-End Nodes
– MPI
– GPFS
– LoadLeveler

Note: For our discussion, we assume that readers already have a working knowledge of the Linux operating system and TCP/IP. This knowledge is a prerequisite for understanding the environment that we use and the tools that we present.



2.2 Identifying the installed system

You can determine the Blue Gene/L system configuration with a combination of the following tools and actions:

► Blue Gene/L Web interface (BGWEB)
► DB2 select statements of the DB2 database on the SN
► Standard operating system (Linux) commands
► Visually inspecting the hardware

We discuss these tools and actions in the following paragraphs.

2.2.1 Blue Gene/L Web interface (BGWEB)<br />

BGWEB is installed on the Service Node (SN). To connect to this service, point<br />

your browser to the following URL:<br />

http:///web/index.php<br />

Figure 2-1 gives an example of the BGWEB home page.<br />

Figure 2-1 BGWEB home page on the SN<br />

Chapter 2. <strong>Problem</strong> determination methodology 57


The Configuration section displays the structure of the system and expands further into more detail for each physical component of the core Blue Gene/L system.

Note: You can run BGWEB from a Front-End Node (FEN). However, this is not supported officially. BGWEB requires a DB2 client to interface with the DB2 database on the SN and a Web server configured on the FEN.

2.2.2 DB2 select statements of the DB2 database on the SN

You can run SQL select statements to query information that is stored in the DB2 database. Select statements are useful for querying the number of different components in the system. Example 2-1 shows a DB2 select statement that displays the number of compute node cards in a system.

Example 2-1 DB2 select command displaying the number of compute nodes

# db2 "select count(*) num_of_compute_node_cards from BGLPROCESSORCARD
  where ISIOCARD = 'F' and STATUS <> 'M'"

NUM_OF_COMPUTE_NODE_CARDS
-------------------------
                       64

  1 record(s) selected.

In this example, the user needs to create the appropriate execution environment by sourcing /discovery/db.src. This script sets up the default database environment and connects to the database (to run DB2 commands). Running this script should produce an output similar to that shown in Example 2-2.

Example 2-2 Sourcing the /discovery/db.src file

bglsn:/bgl/BlueLight/logs/BGL # source /discovery/db.src

   Database Connection Information

 Database server        = DB2/LINUXPPC 8.2.3
 SQL authorization ID   = BGLSYSDB
 Local database alias   = BGDB0

Note: The /discovery/db.src script is a copy of the /bgl/BlueLight/V1R2M1_020_2006-060110/ppc/bglsys/discovery/db.src file. In addition, the directory /bgl/BlueLight/V1R2M1_020_2006-060110 represents the driver version that we used for this redbook.
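If you run this kind of query often, it can be convenient to wrap the environment setup and a few of the counts used in this chapter into one small script. The following is a minimal sketch, assuming the /discovery/db.src location and the database views and tables shown in the examples in this chapter; adapt the paths and views to your installation.

#!/bin/bash
# Hypothetical helper: source the DB2 environment and print a few
# component counts used throughout this chapter.
source /discovery/db.src

echo "Compute node cards:"
db2 "select count(*) from BGLPROCESSORCARD where ISIOCARD = 'F' and STATUS <> 'M'"

echo "I/O node cards:"
db2 "select count(*) from BGLPROCESSORCARD where ISIOCARD = 'T' and STATUS <> 'M'"

echo "Active racks (base partitions):"
db2 "select BPID, STATUS from BGLBASEPARTITION where STATUS <> 'M'"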



2.2.3 Network diagram

The system administrator needs to understand the network configuration of the Blue Gene/L system. (For information about the functions of both the service and functional networks, see 1.3, “Blue Gene/L networks” on page 21. These networks are required in both a core and a complex Blue Gene/L system configuration.) It is important to have an up-to-date diagram of the network that connects the Blue Gene/L system. This diagram should include the IP addresses of the system and network switches.

2.2.4 Service Node

There can be only one SN per installed Blue Gene/L system. (Refer to 1.4.2, “Service Node system processes” on page 31 for a more detailed description.) You can check the IP configuration for the service and functional networks on the SN in the following way:

► Service network

  a. Using the Blue Gene/L Web interface, at the BGWEB home page, click Database Browser at the bottom of the page.

  b. Click the TBGLIDOPROXY database table link. A new page is displayed showing the idoproxy configuration as shown in Figure 2-2.

Figure 2-2 The DB2 database table TBGLIDOPROXY (Service network)

  c. Use the command line on the SN as shown in Example 2-3.

Example 2-3 Showing the service network using the DB2 CLI

# db2 "select PROXYID,PROXYIPADDRESS from TBGLIDOPROXY"

PROXYID          PROXYIPADDRESS
---------------- ------------------------------------------------------
BlueGene1        10.0.0.1

  1 record(s) selected.

Then check the IP range in /etc/hosts:

# grep 10.0.0.1 /etc/hosts
10.0.0.1        bglsn_sn.itso.ibm.com bglsn_sn

► Functional network

  a. Using the Blue Gene/L Web interface, at the BGWEB home page, click Database Browser at the bottom of the page.

  b. Click the TBGLIPPOOL database table link. A new page is displayed showing the IP range for the I/O nodes, as shown in Figure 2-3.

Figure 2-3 Data from DB2 database table TBGLIPPOOL

  c. Using the command line on the SN, you can obtain the addresses for the functional network that are stored in the DB2 database and check the associated IP labels in /etc/hosts, as shown in Example 2-4.

Example 2-4 Functional network info using the DB2 CLI

# db2 "select IPADDRESS from TBGLIPPOOL"

IPADDRESS
-----------------------------------------------------------------------
172.30.2.1
172.30.2.10
..snip..

# grep 172.30.2 /etc/hosts
172.30.2.1      ionode1
172.30.2.2      ionode2
172.30.2.3      ionode3
..snip..

  d. You can then compare these IP addresses to the output from /sbin/ip ad, as shown in Example 2-5.

Example 2-5 Network interface configuration on the SN

# ip ad
..snip..
2: eth0: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether 00:0d:60:4d:28:ea brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.1/16 brd 10.0.255.255 scope global eth0
    inet6 fe80::20d:60ff:fe4d:28ea/64 scope link
       valid_lft forever preferred_lft forever
..snip..
5: eth3: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether 00:11:25:08:30:90 brd ff:ff:ff:ff:ff:ff
    inet 172.30.1.1/16 brd 172.30.255.255 scope global eth3
    inet6 fe80::211:25ff:fe08:3090/64 scope link
       valid_lft forever preferred_lft forever

2.2.5 Front-End Nodes

The Blue Gene/L system can have one or more Front-End Nodes (FENs). There does not seem to be any way to identify FENs apart from knowing that they are separate components within the Blue Gene/L configuration. FENs are nodes where user jobs are submitted through LoadLeveler or mpirun. So, the only way to identify the FENs is to know the components from which the jobs are submitted.

These components should be included in the topology diagram. A possible check would be to see whether MPI support and LoadLeveler are installed using the rpm -qa | grep mpi command. If LoadLeveler is installed, all the FENs should be listed in the llstatus output, as shown in Figure 4-12 on page 172. Refer to 1.5, “Front-End Node” on page 32 for more information.
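The following is a minimal sketch of such a check, run on a node that you suspect is an FEN. It only combines the rpm and llstatus checks mentioned above; the package name patterns are illustrative and may differ on your installation.

#!/bin/bash
# Rough FEN identification: look for MPI support and LoadLeveler,
# and, if LoadLeveler is present, list the machines it knows about.
echo "MPI-related packages:"
rpm -qa | grep -i mpi

echo "LoadLeveler packages (pattern is illustrative):"
rpm -qa | grep -i loadl

if command -v llstatus >/dev/null 2>&1; then
    echo "LoadLeveler machine list (FEN candidates):"
    llstatus
else
    echo "llstatus not found; LoadLeveler does not appear to be installed here."
fi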

2.2.6 Control system server logs

There are a number of logs that are generated on the SN. These logs are called the control system server logs. Table 2-1 on page 62 through Table 2-8 on page 64 show the logs that are generated and their purpose. There are also logs generated for diagnostics. We discuss these logs further in 3.4, “Diagnostics” on page 131.

The default location of the control system server logs is the /bgl/BlueLight/logs directory. In this directory, there are two directories:

► /bgl/BlueLight/logs/BGL

  This directory includes all the logs for BGLMaster and its child daemons. There is also a log for each I/O node in the Blue Gene/L system. These logs are written by the I/O nodes through the /bgl NFS mount because the I/O nodes do not have any persistent storage.

► /bgl/BlueLight/logs/diags

  There is a time-stamped directory for every diagnostic run on the Blue Gene/L system. The directory looks similar to:

  /bgl/BlueLight/logs/diags/2006-0307-17:40:08_R000
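To get a quick overview of these logs on a running system, something like the following can help. This is a minimal sketch that assumes the default log locations and the *-current.log naming convention described in the tables that follow.

#!/bin/bash
# Quick look at the control system server logs on the SN.
cd /bgl/BlueLight/logs/BGL || exit 1

echo "Current log symbolic links for the control system daemons:"
ls -l *-current.log 2>/dev/null

echo "Ten most recently updated logs (daemon and I/O node logs):"
ls -lt | head -11      # first line of ls -lt is the 'total' line

echo "Most recent diagnostics runs:"
ls -dt /bgl/BlueLight/logs/diags/* 2>/dev/null | head -5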

Table 2-1 Control system log for BGLMaster

BGLMaster
Name of log: <hostname>-bglmaster-current.log, a symbolic link to <hostname>-bglmaster-<timestamp>.log
Example: bglsn-bglmaster-2006-0330-14:56:20.log
Description: Shows the initialization of BGLMaster and its child daemons, which involves parsing the /bgl/BlueLight/ppcfloor/bglsys/bin/bglmaster.init file. Also logs the status of its child daemons.

Table 2-2 Control system log for idoproxy

idoproxy (BGLMaster child daemon)
Name of log: <hostname>-idoproxydb-current.log, a symbolic link to <hostname>-idoproxydb-<timestamp>.log
Example: bglsn-idoproxydb-2006-0330-14:56:20.log
Description: Shows the initialization of idoproxy, the complete name of the process, the log generated, the Blue Gene/L driver version, the IP address, and the ports it opens. Also provides information about each block that is booted in the Blue Gene/L system.

Table 2-3 Control system log for ciodb

ciodb (BGLMaster child daemon)
Name of log: <hostname>-ciodb-current.log, a symbolic link to <hostname>-ciodb-<timestamp>.log
Example: bglsn-ciodb-2006-0330-14:56:20.log
Description: Shows the initialization of ciodb, the log generated, the complete name of the process, and the Blue Gene/L driver version. Also provides useful information about submitted jobs, including the Blue Gene/L job ID and the I/O nodes (with IP addresses) used for the job.

Table 2-4 Control system log for mmcs_db_server

mmcs_db_server (BGLMaster child daemon)
Name of log: <hostname>-mmcs_db_server-current.log, a symbolic link to <hostname>-mmcs_db_server-<timestamp>.log
Example: bglsn-mmcs_db_server-2006-0330-14:56:20.log
Description: Shows the initialization of mmcs_db_server, the log generated, the complete name of the process, and the Blue Gene/L driver version. Also provides very useful information associating the booted block with the I/O node location codes, their log files, host names, and IP addresses. It also provides useful runtime debug data.

Table 2-5 Control system log for monitorHW

monitorHW (BGLMaster child daemon)
Name of log: <hostname>-monitorHW-current.log, a symbolic link to <hostname>-monitorHW-<timestamp>.log
Example: bglsn-monitorHW-2006-0323-15:47:47.log
Description: Shows the initialization of monitorHW, the log generated, the complete name of the process, and the Blue Gene/L driver version. Also provides information about the monitoring that has taken place.

Table 2-6 Control system log for perfmon

perfmon (BGLMaster child daemon)
Name of log: <hostname>-perfmon.pl-current.log, a symbolic link to <hostname>-perfmon.pl-<timestamp>.log
Example: bglsn-perfmon.pl-2006-0329-18:24:38.log
Description: Shows the initialization of perfmon, the log generated, the complete name of the process, and the Blue Gene/L driver version. Also provides performance data on running Blue Gene/L jobs. This daemon did not seem to be used at the time this redbook was written.

Table 2-7 Control system log for I/O nodes

I/O node log (one for each I/O node in the Blue Gene/L system)
Name of log: <rack>-<midplane>-<node card>-I:<I/O chip location>.log
Example: R00-M0-N0-I:J18-U01.log
Description: Shows the startup process of the I/O node, including the loading of the MCP, the startup scripts with their output, and the partition the I/O node is associated with at boot. When the I/O node is shut down, all the messages from the shutdown scripts and other shutdown messages are displayed. These files are appended to for every boot and shutdown.

Table 2-8 The updateSchema.pl script log

updateSchema.pl
Name of log: updateSchema-<timestamp>.log
Example: updateSchema-2006-04-01-13:18:29.log
Description: Shows updateSchema.pl updating the schema on the Blue Gene/L database from the new driver version. This is only done when a driver update is applied.

2.2.7 File systems (NFS and GPFS)

Depending on your Blue Gene/L system configuration, you might have more than one file system available to the I/O nodes. Although Network File System (NFS) is required for the system to function (refer to 1.8, “Boot process, job submission, and termination” on page 36), General Parallel File System (GPFS) can also be used by the I/O nodes for reading and writing a job's data. You need to be aware of the components that serve NFS and GPFS on the functional network:

► NFS

  The SN exports (NFS) the /bgl directory over the functional network. Refer to 2.3.6, “Check the NFS subsystem on the SN” on page 90 for more information.

  The I/O nodes can also mount NFS file systems that are exported from servers other than the SN or FENs. In this case, neither the SN nor the FENs are required to mount these file systems. However, the SN needs to know about them because it has to pass the information down to the I/O nodes using the /bgl/dist/etc/rc.d/rc3.d/S10sitefs file. If this file exists, you should check for a line that looks similar to the one shown in Example 2-6.

Example 2-6 Additional NFS file systems to be mounted by the I/O nodes

bglsn_# cat /bgl/dist/etc/rc.d/rc3.d/S10sitefs
..
# Mount a scratch file system...
mountSiteFs $SITEFS /bgscratch /bgscratch tcp,rsize=32768,wsize=32768,async
..

► GPFS

  As previously mentioned, if a sitefs file exists, then it is possible to check whether GPFS has been enabled to run on the I/O nodes. Check for a line that looks similar to the one shown in Example 2-7.

Example 2-7 The sitefs configuration for a Blue Gene/L system using GPFS

bglsn_# cat /bgl/dist/etc/rc.d/rc3.d/S10sitefs
..
# Uncomment the first line to start GPFS.
# Optionally uncomment the other lines to change the defaults for GPFS.
echo "GPFS_STARTUP=1" >> /etc/sysconfig/gpfs
..

  To identify whether GPFS is in use, run /usr/lpp/mmfs/bin/mmlscluster and /usr/lpp/mmfs/bin/mmlsconfig on the SN. The I/O nodes in your Blue Gene/L system as well as your SN should be listed. You must configure GPFS to mount the GPFS file systems automatically when the GPFS daemon is started, on the SN as well as on all I/O nodes that are part of the cluster. For more information about GPFS running on the Blue Gene/L system, refer to Chapter 5, “File systems” on page 211.
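A minimal sketch of such a GPFS check on the SN follows. It only uses the two commands named above plus grep; the I/O node host-name pattern (ionode) is taken from the /etc/hosts example earlier in this chapter and may differ on your system.

#!/bin/bash
# Rough check of whether GPFS is configured for the SN and the I/O nodes.
MMFSBIN=/usr/lpp/mmfs/bin

if [ ! -x $MMFSBIN/mmlscluster ]; then
    echo "GPFS does not appear to be installed on this SN."
    exit 0
fi

echo "GPFS cluster membership (the SN and the I/O nodes should be listed):"
$MMFSBIN/mmlscluster

echo "I/O node entries in the cluster (host-name pattern is illustrative):"
$MMFSBIN/mmlscluster | grep -i ionode

echo "GPFS configuration:"
$MMFSBIN/mmlsconfig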

Note: The /bgl/dist/etc/rc.d/rc3.d/S10sitefs file is a symbolic link to /bgl/dist/etc/rc.d/init.d/sitefs. Be aware that this file might also be found under /bgl/BlueLight/ppcfloor/dist/etc/rc.d, although this is not the advised location for it.

2.2.8 Job submission

You can run jobs on the Blue Gene/L system in three different ways:

► You can run the submit_job command at the mmcs_console on the SN. Only the root user on the SN can run this command.

► You can run the mpirun command from a FEN or the SN. You can run this command as an authorized, non-root user.

► You can run the llsubmit command from a FEN or the SN. This command is a LoadLeveler command, and you can run it as an authorized, non-root user.

Note: It is not very likely that the submit_job command will be used as part of daily activity on the system. Using this command may be appropriate during certain system verification actions; however, we do not encourage its use.

You need to be able to identify how jobs are submitted on your system. It is likely that the jobs will be submitted by MPI or LoadLeveler programs from the FENs. For more information about running jobs with the submit_job command, refer to 2.3.8, “Check that a simple job can run (mmcs_console)” on page 96, and for information about mpirun and llsubmit, refer to Chapter 4, “Running jobs” on page 141.

2.2.9 Racks

Here are two ways to determine the number of racks in the Blue Gene/L system:

► Using the Blue Gene/L Web interface:

  At the BGWEB home page, click Configuration to display the racks that are configured in the IBM System Blue Gene Solution system (Figure 2-4 on page 67).

Figure 2-4 Configuration of Blue Gene/L in BGWEB

Note: If a piece of hardware has a red box around it, this part of the system, or a hardware component within it, is missing or has a hardware error.

► Using a DB2 select statement:

  In Example 2-8, the number of records in the BGLBASEPARTITION view represents the number of racks that are available to the SN since the last system Discovery was performed.

Example 2-8 Listing the active racks in the Blue Gene/L system

# db2 "select BPID,STATUS from BGLBASEPARTITION where STATUS <> 'M'"

BPID STATUS
---- ------
R000 A

  1 record(s) selected.


2.2.10 Midplanes

Note: The select statements and database views in this section query for records with the STATUS field not equal to M (Missing), using the operator <>. These records can also have values of E (Error) and A (Available).

Each rack contains two midplanes, which are only detected when a service card is plugged into the midplane and then connected by Ethernet to the service network. You can determine the number of midplanes that are available to the Blue Gene/L system in two ways:

► Using the Blue Gene/L Web interface:

  At the BGWEB home page, click Configuration to display the midplanes that are configured in the IBM System Blue Gene Solution system (Figure 2-4 on page 67).

► Using a DB2 select statement:

  In Example 2-9, the number of records in the BGLMIDPLANE view represents the number of midplanes that are available to the system since the last system Discovery was performed. This example shows only one midplane being used.

Example 2-9 Listing the midplanes available to the Blue Gene/L system

# db2 "select LOCATION,POSINMACHINE,STATUS from BGLMIDPLANE where
  STATUS <> 'M'"

LOCATION                         POSINMACHINE STATUS
-------------------------------- ------------ ------
R00-M0                           R000         A

  1 record(s) selected.

2.2.11 Clock cards

Each rack has one clock card. Therefore, the number of clock cards is the same as the number of racks in the Blue Gene/L system. There is no way to identify the clock cards that are available to the Blue Gene/L system through BGWEB or DB2. The only way to check the clock cards is to inspect the bottom of each rack manually. For detailed information about the clock card, refer to 1.2.8, “Clock card” on page 15.


2.2.12 Service cards

Blue Gene/L has one service card per midplane. Therefore, there is a maximum of two service cards per rack. You can determine the number of service cards that are available to the Blue Gene/L system in two ways:

► Using the Blue Gene/L Web interface:

  At the BGWEB home page, click Configuration, and then click the midplane link in which you are interested. A new page displays a table of the cards within that midplane. Figure 2-5 shows the page that is loaded. You can see the service card at the bottom of the table that is displayed.

Note: To identify the cards within a given midplane, Blue Gene/L: System Administration, SG24-7178 advises you to query the relevant DB2 table. For example, if you want to find the number of service cards in a midplane, query the TBGLSERVICECARD table and use the midplane serial number to ensure that you only find the cards in that midplane. In addition to the DB2 database tables, there are database views that do some of this work for you.

Figure 2-5 BGWEB table showing the cards within a midplane

► Using a DB2 select statement:

  In Example 2-10, the number of service cards is represented by the NUMSERVICECARDS field in the BGLSERVICECARDCOUNT view. This value is the number of service cards, per midplane, that are available to the system since the last system Discovery was performed. BGLSERVICECARDCOUNT generates its information by querying the database alias BGLSERVICECARD.

Example 2-10 Listing the number of service cards in the Blue Gene/L system

# db2 "select * from BGLSERVICECARDCOUNT"

MIDPLANESERIALNUMBER                                 NUMSERVICECARDS
---------------------------------------------------- ---------------
x'203937503631353900000000594C31304B35303238303036'                1

  1 record(s) selected.

Note: BGLSERVICECARDCOUNT is a database view and does not use the STATUS <> 'M' clause in its SQL statement the way that the other count database views available on the system do. Be aware that there should only be one service card per midplane.

2.2.13 Link cards

A full Blue Gene/L rack has eight link cards (four link cards per midplane). Here are two ways to determine the number of link cards that are available to the system:

► Using the Blue Gene/L Web interface:

  At the BGWEB home page, click Configuration, and then click the midplane links. A new page displays a table of the cards within that midplane. Figure 2-5 on page 69 shows the page that is loaded. You can see the link cards identified in the Type column.

► Using a DB2 select statement:

  Example 2-11 shows the number of link cards, represented by the NUMLINKCARDS field in the BGLLINKCARDCOUNT view. This value is the number of link cards, per midplane, that are available to the system since the last system Discovery was performed.

Example 2-11 Listing the link cards available to the Blue Gene/L system

# db2 "select * from BGLLINKCARDCOUNT"

MIDPLANESERIALNUMBER                                 NUMLINKCARDS
---------------------------------------------------- ------------
x'203937503631353900000000594C31304B35303238303036'             4

  1 record(s) selected.

2.2.14 Link card chips

As previously discussed, there are four link cards per midplane. In addition, each link card contains six link chips. These chips are used to link signals between compute processors on different midplanes. For more information about the link card and chips, refer to Figure 1-17 on page 18.

You can determine the number of link chips that are available to the system in two ways:

► Using the Blue Gene/L Web interface:

  At the BGWEB home page, click Configuration, and then click the midplane links. A new page displays a table of the cards within that midplane. Figure 2-5 on page 69 shows the page that is loaded. You can see the link cards in the table. To view the link card details, click one of the link card hardware names. Figure 2-6 shows the BGWEB link card detail.

Figure 2-6 BGWEB table showing the link chips on a link card

Figure 2-7 shows the remainder of the same BGWEB page that is displayed in Figure 2-6, which gives details on the connection status of each link chip. Figure 2-7 shows an example of a 2x2 system; therefore, it has X, Y, and Z cables between the midplanes and racks that use all six chips.


Figure 2-7 BGWEB table showing cable connection data for a link card

► Using a DB2 select statement:

  In Example 2-12, the number of available link chips is represented by the LINKCHIPS field in the output from the DB2 statement. This value is the number of link chips, per link card, that are available to the system since the last system Discovery was performed. The SERIALNUMBER field represents the serial number of the individual link cards. This example also shows the status of the link cards.

Example 2-12 Listing the link chips per link card on the system

# db2 "SELECT serialNumber, (select count(*) from BGLLinkchip where
  BGLLinkcard.serialNumber=BGLLinkchip.CardSerialNumber and
  BGLLinkcard.status <> 'M') linkchips, status FROM BGLLinkcard where
  STATUS <> 'M'"

SERIALNUMBER                                         LINKCHIPS   STATUS
---------------------------------------------------- ----------- ------
x'203937503438383700000000594C31314B35303630303146'            6 A
x'203937503438393200000000594C31344B3530353930314B'            6 A
x'203937503438393200000000594C31354B35303737303044'            6 A
x'203937503438383700000000594C33304B35323336303034'            6 A

  4 record(s) selected.


2.2.15 Link summary

You can view the configuration of the X, Y, and Z cables on the Blue Gene/L system from the BGWEB interface. This view provides the best way to get a visual idea of the 3D torus cabling of your system. You can view a link summary in two ways:

► Using the Blue Gene/L Web interface:

  a. At the BGWEB home page, click Configuration.

  b. Click Link Summary at the top of the page (Figure 2-8 and Figure 2-9).

Figure 2-8 BGWEB Link Summary output for the X and Y links between racks

Figure 2-9 BGWEB Link Summary output for Z links between midplanes

Note: In our environment for this redbook, we did not have a system with X, Y, and Z cables to test. So, the output in Figure 2-8 and Figure 2-9 is from another system with more than four node cards in its configuration.

► Using a DB2 select statement:

  Example 2-13 shows a DB2 select statement that identifies the Z links in a Blue Gene/L system. This particular example is a complex statement taken from the BGWEB source, but it does give a clear view of the link between the midplanes.

Example 2-13 Displaying the Z links between two midplanes

# db2 "SELECT CHAR(LEFT(SUBSTR(source,4,3),3),3) AS source,
  CHAR(LEFT(SUBSTR(destination,4,3),3),3) AS destination FROM
  bglsysdb.TbglLink WHERE source LIKE 'Z_%' AND sourceport = 'P2' FOR
  READ ONLY WITH UR"

SOURCE DESTINATION
------ -----------
000    001
001    000

  2 record(s) selected.

  Example 2-14 gives a simpler DB2 select statement that shows the links within the Blue Gene/L system from ASIC to ASIC on each link card in use. To determine that the ASIC values 2 and 3 are for the Z cables, refer to Figure 1-17 on page 18.

Example 2-14 Displaying the links within the Blue Gene/L system

# db2 "select
  FROMLCLOCATION,FROMASIC,TOLCLOCATION,TOASIC,NUMBADWIRES,STATUS from
  tbglcable"

FROMLCLOCATION FROMASIC TOLCLOCATION TOASIC NUMBADWIRES STATUS
-------------- -------- ------------ ------ ----------- ------
R00-M1-L0      2        R00-M0-L0    2                0 A
R00-M1-L0      3        R00-M0-L0    3                0 A
R00-M1-L1      2        R00-M0-L1    2                0 A
R00-M1-L1      3        R00-M0-L1    3                0 A
R00-M1-L2      2        R00-M0-L2    2                0 A
R00-M1-L2      3        R00-M0-L2    3                0 A
R00-M1-L3      2        R00-M0-L3    2                0 A
R00-M1-L3      3        R00-M0-L3    3                0 A
R00-M0-L0      2        R00-M1-L0    2                0 A
R00-M0-L0      3        R00-M1-L0    3                0 A
R00-M0-L1      2        R00-M1-L1    2                0 A
R00-M0-L1      3        R00-M1-L1    3                0 A
R00-M0-L2      2        R00-M1-L2    2                0 A
R00-M0-L2      3        R00-M1-L2    3                0 A
R00-M0-L3      2        R00-M1-L3    2                0 A
R00-M0-L3      3        R00-M1-L3    3                0 A

  16 record(s) selected.

2.2.16 Node cards

The number of node cards can vary for each Blue Gene/L system. You can determine the number of node cards that are available in two ways:

► Using the Blue Gene/L Web interface:

  At the BGWEB home page, click Configuration, and then click the midplane links. A new page displays a table of the cards within that midplane. Figure 2-5 on page 69 shows the page that is loaded. You can see the node cards identified in the Type column of the table.

► Using a DB2 select statement:

  In Example 2-15, the number of node cards is represented by the NUMNODECARDS field in the BGLNODECARDCOUNT view. This value is the number of node cards, per midplane, that are available to the system since the last system Discovery was performed.

Example 2-15 Listing the number of node cards per midplane

# db2 "select * from BGLNODECARDCOUNT"

MIDPLANESERIALNUMBER                                 NUMNODECARDS
---------------------------------------------------- ------------
x'203937503631353900000000594C31304B35303238303036'             4

  1 record(s) selected.

Note: There are no DB2 database views that separate the numbers of I/O cards and compute (processor) cards in the Blue Gene/L system. The database view BGLPROCESSORCARDCOUNT only displays the total number of processor cards on the system, including I/O cards.

2.2.17 I/O cards

A node card can have one or two I/O cards installed. If it has two I/O cards installed, it is called an I/O rich node card. You can determine the number of I/O cards that are available to the system in two ways:

► Using the Blue Gene/L Web interface:

  a. At the BGWEB home page, click Configuration.

  b. Click the midplane links. A new page displays a table of the cards within that midplane. Figure 2-5 on page 69 shows the page that is loaded.

  c. Click the individual node cards listed in the table to view the processor cards. Figure 2-10 and Figure 2-11 show the full page. The full page displays the cards, with the I/O nodes identified by a Yes in the Is IO Card column.

Figure 2-10 Top of the page for the Node card view in BGWEB page

Figure 2-11 Node card view in BGWEB page (I/O node card marked as Yes)

Figure 2-12 shows the detailed view of an I/O node, which shows the two chips on the card.

Figure 2-12 I/O node card detail showing the IP addresses of the I/O nodes

► Using a DB2 select statement:

  The I/O nodes in a node card within a particular midplane are linked to the node card by their serial number (license plate). In Example 2-16, the IONODECARD field is the I/O node count per node card. This example also shows the status of the node cards.

Example 2-16 The number of I/O node cards per midplane

# db2 "SELECT serialNumber, (select count(*) from BGLProcessorCard
  where BGLNodeCard.serialNumber=BGLProcessorCard.nodeCardSerialNumber
  and BGLProcessorCard.status <> 'M' and isiocard = 'T')
  ionodecard, status FROM BGLNodeCard where STATUS <> 'M'"

SERIALNUMBER                                         IONODECARD STATUS
---------------------------------------------------- ---------- ------
x'203231503833343000000000594C31304B34323635303134'           1 A
x'203231503833343000000000594C31304B3432383030314B'           1 A
x'203231503833343000000000594C31304B34323630304B35'           1 A
x'203937503538373400000000594C31304B34313534303032'           1 A

  4 record(s) selected.


2.2.18 Compute or processor cards

Compute cards are also referred to as processor cards. Each node card holds 16 compute cards. Thus, you can use the number of node cards in a midplane to determine the number of compute cards using the following equation:

(Number of node cards) x 16 = (number of compute cards)

The number of compute nodes within the system is two times the number of compute cards. Therefore, in a midplane that has four node cards, the number of compute cards is 4 x 16 = 64 compute cards, resulting in 64 x 2 = 128 compute nodes in the system.

Here are two ways to determine the number of compute cards that are available to the system:

► Using the Blue Gene/L Web interface:

  a. At the BGWEB home page, click Configuration, and then click the midplane links. A new page displays a table of the cards within that midplane. Figure 2-5 on page 69 shows the page that is loaded.

  b. Click the individual node cards that are listed in the table to view the compute (processor) cards (Figure 2-10 on page 78 and Figure 2-11 on page 78).

  c. Click the processor card links to expand the detail for each processor card, as shown in Figure 2-13.

Figure 2-13 Processor card detail view through BGWEB interface

► Using a DB2 select statement:

  The compute cards within a particular midplane are linked to their node card by its serial number (license plate). In Example 2-17, the COMPUTECARDS field is the compute card count per node card. This example also shows the status of the node cards.

Example 2-17 The number of compute cards per node card

# db2 "SELECT serialNumber, (select count(*) from BGLProcessorCard
  where BGLNodeCard.serialNumber=BGLProcessorCard.nodeCardSerialNumber
  and BGLProcessorCard.status <> 'M' and isiocard = 'F')
  computecards, status FROM BGLNodeCard where STATUS <> 'M'"

SERIALNUMBER                                         COMPUTECARDS STATUS
---------------------------------------------------- ------------ ------
x'203231503833343000000000594C31304B34323635303134'            16 A
x'203231503833343000000000594C31304B3432383030314B'            16 A
x'203231503833343000000000594C31304B34323630304B35'            16 A
x'203937503538373400000000594C31304B34313534303032'            16 A

  4 record(s) selected.

Note: To find specific STATUS values in the database records, change the not equal to operator (<>), as used with the STATUS field in the DB2 examples, to equals (=).

Another way of quickly checking the number of processor cards that serve as compute nodes is to run the DB2 command shown in Example 2-18.

Example 2-18 Checking the number of processor cards

# db2 "select count(*) num_of_compute_node_cards from BGLPROCESSORCARD
  where ISIOCARD = 'F' and STATUS <> 'M'"

NUM_OF_COMPUTE_NODE_CARDS
-------------------------
                       64

  1 record(s) selected.

To check the number of processor cards that serve as I/O node cards, run the DB2 command shown in Example 2-19.

Example 2-19 Checking the node cards with I/O nodes

# db2 "select count(*) num_of_io_node_cards from BGLPROCESSORCARD where
  ISIOCARD = 'T' and STATUS <> 'M'"

NUM_OF_IO_NODE_CARDS
--------------------
                   4

  1 record(s) selected.
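If you want to capture this inventory in one pass, the counts from 2.2.9 through 2.2.18 can be collected with a small script such as the following. This is a minimal sketch that assumes the DB2 environment is set up by sourcing /discovery/db.src, and it uses only the views and tables shown in the examples above.

#!/bin/bash
# Hypothetical inventory sketch: summarize the hardware that the DB2
# database currently knows about (records with STATUS <> 'M').
source /discovery/db.src

echo "== Racks (base partitions) =="
db2 "select BPID, STATUS from BGLBASEPARTITION where STATUS <> 'M'"

echo "== Midplanes =="
db2 "select LOCATION, POSINMACHINE, STATUS from BGLMIDPLANE where STATUS <> 'M'"

echo "== Service, link, and node cards per midplane =="
db2 "select * from BGLSERVICECARDCOUNT"
db2 "select * from BGLLINKCARDCOUNT"
db2 "select * from BGLNODECARDCOUNT"

echo "== Compute and I/O node cards =="
db2 "select count(*) compute_cards from BGLPROCESSORCARD where ISIOCARD = 'F' and STATUS <> 'M'"
db2 "select count(*) io_cards from BGLPROCESSORCARD where ISIOCARD = 'T' and STATUS <> 'M'"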

2.3 Sanity checks for installed components

If the core components of your Blue Gene/L system are not running properly, your system will not function correctly. We recommend that you follow this list of checks to ensure that the system is in a healthy state:

► Check the operating system on the SN
► Check communication services on the SN
► Check that BGWEB is running
► Check that DB2 is working
► Check that BGLMaster and its child daemons are running
► Check the NFS subsystem on the SN
► Check that a block can be allocated using mmcs_console
► Check that a simple job can run (mmcs_console)
► Check the control system server logs
► Check remote shell
► Check remote command execution with secure shell
► Check the network switches
► Check the physical Blue Gene/L racks configuration

Note: One component in the system can prevent another from running correctly. For example, if DB2 is not running, it might be down because of an operating system or communications issue and, therefore, a job cannot run.


Figure 2-14 illustrates the core Blue Gene/L configuration that we used to provide examples of the system checks in this section.

Figure 2-14 Diagram of a core Blue Gene/L configuration. The Service Node runs SLES9 PPC 64-bit with the Blue Gene/L system processes, DB2, BGWEB, and NFS, and has three network interfaces (eth0, eth1, and eth5) configured as 10.0.0.1/255.255.0.0, 172.30.1.1/255.255.0.0, and 192.168.00.49/255.255.255.0. These interfaces connect through the service network switch, the functional network switch, and a public LAN switch to Blue Gene rack 00 (front half of midplane R00-M0, with its service card, clock card (master/slave), Gbit/ido connections, and node cards 0 through 3).

2.3.1 Check the operating system on the SN

You need to perform two checks for the operating system on the SN:

1. Check the /etc/passwd and /etc/shadow files for the required Blue Gene/L users. In a core Blue Gene/L configuration, without FENs, we only need the users root, bglsysdb, and bgdb2cli, as shown in Example 2-20.

Example 2-20 Blue Gene/L user checking

# egrep "root|bglsysdb|bgdb2cli" /etc/passwd /etc/shadow
/etc/passwd:root:x:0:0:root:/root:/bin/bash
/etc/passwd:bglsysdb:x:1000:1000::/dbhome/bglsysdb:/bin/bash
/etc/passwd:bgdb2cli:x:1001:1001::/dbhome/bgdb2cli:/bin/bash
/etc/shadow:root:$1$5zzzJBvz$XTd9evpJ8d1cVvDw5c3hV/:13210:0:10000::::
/etc/shadow:bglsysdb:$1$SwI1..4e$iGNeJ3bSSOHXD1Dy5TM250:13222:0:99999:7:::
/etc/shadow:bgdb2cli:$1$Lyzz.trF$npMmXlHv5.XPf.ijiBFGC1:13213:0:99999:7:::

2. Check for any full or nearly full file systems using the /bin/df command. In a non-customized SN installation, with DB2 and the Blue Gene/L code installed, the file system layout looks similar to the output shown in Example 2-21. Full file systems can cause problems with many system processes.

Example 2-21 File system checking on the SN

# df -k
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sdb3             70614928   4752552  65862376   7% /
tmpfs                  1898508         8   1898500   1% /dev/shm
/dev/sda4               489452     95872    393580  20% /tmp
/dev/sda1              9766544   4075948   5690596  42% /bgl
/dev/sda2              9766608    712428   9054180   8% /dbhome

If you discover issues during the check of the operating system, you can take the following corrective actions:

1. If a user does not exist in /etc/passwd and /etc/shadow, then you need to create the user.

2. If any of the file systems are full or nearly full, you need to clean that file system (remove unnecessary files) or add disk space to ensure that the operating system, database, and other Blue Gene/L processes are not affected.
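As a simple way to spot file systems that are approaching full, a sketch along the following lines can be run on the SN (the 90% threshold is an arbitrary choice for illustration):

#!/bin/bash
# Warn about any mounted file system that is more than 90% used.
THRESHOLD=90
df -kP | awk -v limit=$THRESHOLD 'NR > 1 {
    use = $5; sub(/%/, "", use)
    if (use + 0 >= limit)
        printf "WARNING: %s (%s) is %s%% full\n", $6, $1, use
}'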

2.3.2 Check communication services on the SN

The service and functional networks are required for the Blue Gene/L system to function correctly. To check these networks, perform the following communication checks:

1. Verify that both the carrier and the network are up with the /usr/sbin/ethtool command. Example 2-22 shows the output for a working interface on the SN. Note the Speed, Duplex, and Link detected fields.

Example 2-22 Ethernet adapter characteristics

# /usr/sbin/ethtool eth0
Settings for eth0:
        Supported ports: [ TP ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Advertised auto-negotiation: Yes
        Speed: 1000Mb/s
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 0
        Transceiver: internal
        Auto-negotiation: on
        Supports Wake-on: umbg
        Wake-on: g
        Current message level: 0x00000007 (7)
        Link detected: yes

2. Check the TCP/IP configuration using the /sbin/ip ad command. Example 2-23 shows the loopback, eth0, and eth1 interfaces configured and up on the SN.

Example 2-23 IP configuration on the SN

# /sbin/ip ad
1: lo: <LOOPBACK,UP> mtu 16436 qdisc noqueue
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 brd 127.255.255.255 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether 00:0d:60:4d:28:ea brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.1/16 brd 10.0.255.255 scope global eth0
    inet6 fe80::20d:60ff:fe4d:28ea/64 scope link
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether 00:11:25:08:30:90 brd ff:ff:ff:ff:ff:ff
    inet 172.30.1.1/16 brd 172.30.255.255 scope global eth1
    inet6 fe80::211:25ff:fe08:3090/64 scope link
       valid_lft forever preferred_lft forever
..snip..

3. Verify that the IP configuration is correct on the network interfaces. Refer to the network diagram for the system as discussed in 2.2.3, “Network diagram” on page 59.


4. Use the /bin/ping command to check network connectivity of the SN interfaces for the functional and service networks from another system.

Note: It is likely that there will be another system on the functional network, such as an FEN. However, for the service network, you might need to connect a test system to perform this check.

5. Check the lights on the network interfaces on the SN.

If you discover issues during the check of the communication services, you can take the following corrective actions:

1. If the /usr/sbin/ethtool output is not as expected, then check the settings on the interfaces and ensure that the Ethernet cables are plugged in correctly at the SN and the switch.

2. The /sbin/ip ad command should show the interfaces as UP. If they are not in the UP state, check the configuration files and activate them using the appropriate commands or scripts (/etc/init.d/network or /sbin/ifconfig).
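To exercise the functional network from the SN side, a quick connectivity sweep over the I/O node addresses defined in /etc/hosts can also be useful. The following is a minimal sketch that assumes the ionode naming convention shown in 2.2.4; adjust the pattern to match your own /etc/hosts entries, and keep in mind that only I/O nodes that are currently booted will answer.

#!/bin/bash
# Ping every I/O node address listed in /etc/hosts (entries named ionode*).
awk '/ionode/ {print $1, $2}' /etc/hosts | while read ip name; do
    if ping -c 1 -w 2 "$ip" >/dev/null 2>&1; then
        echo "OK       $name ($ip)"
    else
        echo "NO REPLY $name ($ip)"   # not booted, or a network problem
    fi
done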

2.3.3 Check that BGWEB is running

As mentioned at the beginning of 2.2, “Identifying the installed system” on page 57, BGWEB is a very useful tool for gathering data on many aspects of the Blue Gene/L system. To check whether BGWEB is working properly, follow these steps:

1. From the SN or a remote system, try to connect to BGWEB using the following URL (substitute your SN host name):

http://<SN hostname>/web/index.php

Note: There might be a firewall between the system where the Web page is loaded and the SN. A tunnel can be created to forward port TCP:80 on the SN to a local port (in this case, the local TCP:5519 port), as sketched after this note. The URL would then need to change to something like:

http://localhost:5519/web/index.php
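A minimal sketch of such a tunnel follows. The SN host name, the user, and the local port 5519 are only illustrative; any user that can log in to the SN and any free local port will do.

# On the workstation where the browser runs, forward local port 5519
# to port 80 on the SN through SSH, then browse to
# http://localhost:5519/web/index.php while the tunnel is open.
ssh -N -L 5519:localhost:80 <user>@<SN hostname>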

2. Check that the apache server processes are running on the SN using the /bin/ps command:

ps -ef | grep httpd

3. Check whether apache is configured to start automatically at boot with the /sbin/chkconfig command:

# chkconfig --list apache
apache    0:off  1:off  2:off  3:on  4:off  5:on  6:off

If you discover issues during the check of BGWEB, you can take the following corrective actions:

1. If apache is not running, then try to start it using the start script /etc/rc.d/apache on the SN:

/etc/rc.d/apache start

2. If apache has not been configured to start when the SN boots, run (as root) the /sbin/chkconfig command:

# chkconfig -s apache 35

3. If there are issues with the apache configuration, check the /etc/httpd/httpd.conf file.

2.3.4 Check that DB2 is working

The DB2 database needs to be running on the SN because the Blue Gene/L system relies on it to operate. To check that DB2 is working properly, follow these steps:

1. Check that you can connect to the database instance as shown in Example 2-2 on page 58.

2. Check that the DB2 user exists by connecting to the SN using the following command:

# ssh bglsysdb@<SN hostname>

Although highly unlikely, if secure shell is not configured, use the /usr/bin/telnet command.

3. Check that the bglsysdb user's password is the same as the password in the /bgl/BlueLight/ppcfloor/bglsys/bin/db.properties file. In Example 2-24, the database_password field in the db.properties file represents the password for the user bglsysdb.

Example 2-24 Output of the /bgl/BlueLight/ppcfloor/bglsys/bin/db.properties file

# cat db.properties
database_name=bgdb0
database_user=bglsysdb
database_password=bglsysdb
...

If you discover issues during the check of DB2, you can take the following corrective actions:

1. If DB2 is not running, you need to start it. Run the commands:

# su - bglsysdb
# /dbhome/bglsysdb/sqllib/adm/db2start

2. If required, update the /bgl/BlueLight/ppcfloor/bglsys/bin/db.properties file to reflect the configuration shown in Example 2-24 on page 87.

3. If DB2 will not start, even though the user exists and the password for bglsysdb is configured correctly in the db.properties file, go through the steps in 2.3.1, “Check the operating system on the SN” on page 83 and 2.3.2, “Check communication services on the SN” on page 84.

4. If DB2 was not started automatically when the SN was booted, then check that the database instance has been set to start automatically with the /dbhome/bglsysdb/sqllib/adm/db2set command:

# /dbhome/bglsysdb/sqllib/adm/db2set -i bglsysdb
DB2COMM=tcpip
DB2AUTOSTART=YES

If the DB2AUTOSTART field has the value YES, then it is set to start automatically. If this is not set, you can enable it using the /opt/IBM/db2/V8.1/instance/db2iauto command:

# /opt/IBM/db2/V8.1/instance/db2iauto -on bglsysdb

Also, check that the db2fmcd entry has not been commented out of or deleted from the /etc/inittab file:

# grep db2fmcd /etc/inittab
fmc:2345:respawn:/opt/IBM/db2/V8.1/bin/db2fmcd #DB2 Fault Monitor Coordinator

Note: The previous DB2 commands can be run as the root user or the DB2 user bglsysdb.

5. If after all these checks DB2 still does not work correctly, you need to call your DB2 support.
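A minimal sketch of a quick end-to-end DB2 check follows. It uses only the database name, user, and table shown earlier in this chapter; the choice of TBGLIDOPROXY as the test table is arbitrary, and it assumes the local instance owner can connect to bgdb0 without supplying a password.

#!/bin/bash
# Verify that the bgdb0 database accepts connections and answers a query.
su - bglsysdb -c '
    db2 connect to bgdb0 &&
    db2 "select count(*) from TBGLIDOPROXY" &&
    db2 connect reset
' || echo "DB2 check failed; see 2.3.4 for corrective actions."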

2.3.5 Check that BGLMaster and its child daemons are running

BGLMaster and three of its child daemons must be running for the Blue Gene/L system to operate properly. These daemons are:

► idoproxy
► ciodb
► mmcs_server

To ensure that BGLMaster is running properly, follow these steps:

1. Check the status of BGLMaster on the SN using the following command:

# /bgl/BlueLight/ppcfloor/bglsys/bin/bglmaster status

Example 2-25 shows the expected output from BGLMaster on a system that is not using the hardware (monitor) or performance (perfmon) monitoring daemons.

Example 2-25 Checking the BGLMaster status

# /bgl/BlueLight/ppcfloor/bglsys/bin/bglmaster status
idoproxy started [18622]
ciodb started [18623]
mmcs_server started [18624]
monitor stopped
perfmon stopped

2. We advise you to double-check that there is only one process for BGLMaster and each of its child daemons by running the following command:

# ps -ef | egrep "BGLMaster|idoproxy|mmcs_db_server|ciodb"

If there is more than one instance of a process, this needs to be cleaned up, because BGLMaster is only aware of the process it most recently started for each daemon name. It is also unaware of daemon processes that were not started by the bglmaster command.

If you discover issues during the check of BGLMaster, you can take the following corrective actions:

1. If you need to start a particular child daemon, use the following command:

# /bgl/BlueLight/ppcfloor/bglsys/bin/bglmaster start <daemon name>

Caution: Restarting BGLMaster terminates all communications between the SN and the racks. This action terminates all running jobs and booted partitions on the Blue Gene/L system.

2. If you need to restart BGLMaster, you can run the following command:

# /bgl/BlueLight/ppcfloor/bglsys/bin/bglmaster restart

3. If there are still issues with BGLMaster, we suggest that you run through this sequence of commands:

# /bgl/BlueLight/ppcfloor/bglsys/bin/bglmaster stop
# ps -ef | egrep "BGLMaster|idoproxy|mmcs_db_server|ciodb"
# /bgl/BlueLight/ppcfloor/bglsys/bin/bglmaster start
# /bgl/BlueLight/ppcfloor/bglsys/bin/bglmaster status

Note: You can edit the file /bgl/BlueLight/ppcfloor/bglsys/bin/bglmaster.init to determine what is started automatically when you run /bgl/BlueLight/ppcfloor/bglsys/bin/bglmaster start. For more information, see Blue Gene/L: System Administration, SG24-7178.
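The two checks above can be combined into a small script. This is a minimal sketch that relies on the bglmaster status output format and the process names shown in the ps example; both may vary slightly between driver levels.

#!/bin/bash
# Confirm the required child daemons report "started" and that there is
# at most one instance of each control system process.
BGLBIN=/bgl/BlueLight/ppcfloor/bglsys/bin

for daemon in idoproxy ciodb mmcs_server; do
    $BGLBIN/bglmaster status | grep "^$daemon " | grep -q started \
        || echo "WARNING: $daemon is not reported as started"
done

for process in BGLMaster idoproxy mmcs_db_server ciodb; do
    count=$(ps -ef | grep "$process" | grep -v grep | wc -l)
    if [ "$count" -gt 1 ]; then
        echo "WARNING: $count processes match $process; check for duplicates"
    fi
done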

2.3.6 Check the NFS subsystem on the SN

NFS is an integral part of the Blue Gene/L boot process, as described in 1.8, “Boot process, job submission, and termination” on page 36. It can also be used for the reading and writing of data by the jobs that are submitted on the system. (This is discussed in 2.2.7, “File systems (NFS and GPFS)” on page 64 and in more depth in Chapter 5, “File systems” on page 211.) Thus, NFS needs to be functioning correctly on the SN; otherwise, a block will not be able to boot.

To check that the NFS subsystem is functioning properly, check that the /bgl file system is exported from the SN over the functional network. The /usr/sbin/showmount -e command shows which file systems are exported and also gives a good indication of whether NFS is working:

# showmount -e bglsn
Export list for bglsn:
/bgl 172.30.0.0/255.255.0.0

If the showmount command returns an error, the message can help identify which part of the NFS subsystem is at fault:

1. showmount -e points to a potential issue with the port mapper service:

# showmount -e
mount clntudp_create: RPC: Port mapper failure - RPC: Unable to receive

Possible fix:

# /etc/init.d/portmap restart
# /etc/init.d/nfsserver restart

2. showmount -e points to a potential issue with rpc.mountd or nfsd:

# showmount -e
mount clntudp_create: RPC: Program not registered

Possible fix:

# /etc/init.d/nfsserver restart

Note: Refer to the checklists in Chapter 5, “File systems” on page 211 for more NFS-related checks.
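A minimal sketch that gathers the same information in one pass is shown below. It assumes the SN host name bglsn used in the examples above and relies only on standard NFS utilities (rpcinfo, showmount, exportfs).

#!/bin/bash
# Quick NFS health check on the SN.
SN=bglsn    # illustrative host name; use your own SN name

echo "RPC services registered with the port mapper (expect mountd and nfs):"
rpcinfo -p "$SN" | egrep "portmapper|mountd|nfs"

echo "Exported file systems (expect /bgl on the functional network):"
showmount -e "$SN"

echo "Exports as seen on the SN itself:"
/usr/sbin/exportfs -v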

2.3.7 Check that a block can be allocated using mmcs_console

Note: The Blue Gene/L blocks are also referred to as partitions.

A good way to ensure that the core Blue Gene/L system is working correctly is to check whether a block can be booted. This check ensures that the communication between the SN and the racks, and the files used during the boot process, are functioning. (These topics are covered in 1.8, “Boot process, job submission, and termination” on page 36.)

To check that a block can be allocated, follow these steps:

1. Connect to the mmcs_db_console on the SN:

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./mmcs_db_console
connecting to mmcs_server
connected to mmcs_server
connected to DB2
mmcs$

2. After you are connected to the console, try to allocate a predefined block. The following is an example of a successful boot of the block called R000_128:

mmcs$ allocate R000_128
OK
mmcs$

It is possible to list the predefined blocks and the currently booted blocks within the mmcs_db_console using the list bglblock command. The list command queries the DB2 database alias BGLBLOCK.

3. Monitor the block from the bglsn-bgdb0-idoproxydb-current.log and bglsn-bgdb0-mmcs_db_server-current.log logs that are located in /bgl/BlueLight/logs/BGL.

4. Check the block state from the mmcs_db_console with the list bglblock command, as shown in Example 2-26.

Example 2-26 Checking a block (R000_128)

mmcs$ list bglblock R000_128
OK
==> DBBlock record
  _blockid = R000_128
  _numpsets = 0
  _numbps = 0
  _owner =
  _istorus = 000
  _sizex = 0
  _sizey = 0
  _sizez = 0
  _description = Generated via genSmallBlock
  _mode = C
  _options =
  _status = I
  _statuslastmodified = 2006-04-01 15:08:30.896873
  _mloaderimg = /bgl/BlueLight/ppcfloor/bglsys/bin/mmcs-mloader.rts
  _blrtsimg = /bgl/BlueLight/ppcfloor/bglsys/bin/rts_hw.rts
  _linuximg = /bgl/BlueLight/ppcfloor/bglsys/bin/zImage.elf
  _ramdiskimg = /bgl/BlueLight/ppcfloor/bglsys/bin/ramdisk.elf
  _debuggerimg = none
  _debuggerparmsize = 0
  _createdate = 2006-03-06 17:56:20.600708


The _status field shows the correct status of the block. (Table 4-3 on<br />

page 162 explains these values.)<br />

You can also gather the block information from the BGWEB interface. At the BGWEB home page, click Runtime to show the currently booting or initialized dynamic or predefined blocks. Predefined block information is displayed regardless of the current status. Figure 2-15 shows an example of both dynamic and predefined blocks.

Figure 2-15 BGWEB Block Information page

An additional check on the core system that you can perform to ensure that a block can be allocated is to verify that the I/O nodes are running and are connected to the functional network. Take the following steps:

1. Go to the point where the predefined block shows that it has been allocated in /bgl/BlueLight/logs/BGL/bglsn-bgdb0-mmcs_db_server-current.log. Example 2-27 shows the text that is generated when block R000_128 is allocated at the mmcs_db_console.

Example 2-27 Text generated in mmcs_db_server log when a block is initialized

..snip..
Apr 05 17:20:57 (I) [1090516192] test1 allocate R000_128
Apr 05 17:20:57 (I) [1090516192] test1 DBMidplaneController::addBlock(R000_128)
..snip..
Apr 05 17:20:57 (I) [1090516192] test1:R000_128 BlockController::connect()
Apr 05 17:20:57 (I) [1090516192] test1:R000_128 I/O node: R00-M0-N0-I:J18-U01 log file: /bgl/BlueLight/logs/BGL/R00-M0-N0-I:J18-U01.log
Apr 05 17:20:57 (I) [1090516192] test1:R000_128 I/O node: R00-M0-N0-I:J18-U11 log file: /bgl/BlueLight/logs/BGL/R00-M0-N0-I:J18-U11.log
..snip..
Apr 05 17:20:57 (I) [1090516192] test1:R000_128 I/O node: R00-M0-N3-I:J18-U01 log file: /bgl/BlueLight/logs/BGL/R00-M0-N3-I:J18-U01.log
Apr 05 17:20:57 (I) [1090516192] test1:R000_128 I/O node: R00-M0-N3-I:J18-U11 log file: /bgl/BlueLight/logs/BGL/R00-M0-N3-I:J18-U11.log
Apr 05 17:20:58 (I) [1090516192] test1:R000_128 BlockController::load_microloader() loading microloader /bgl/BlueLight/ppcfloor/bglsys/bin/mmcs-mloader.rts
Apr 05 17:20:58 (I) [1090516192] test1:R000_128 BlockController::start_microloaders() starting mailbox polling
Apr 05 17:20:58 (I) [1090516192] test1:R000_128 BlockController::start_microloaders() starting microloader boot image on 136 nodes
Apr 05 17:20:58 (I) [1092621536] test1:R000_128 MailboxMonitor thread starting
Apr 05 17:20:58 (I) [1090516192] test1:R000_128 BlockController::start_microloaders() startMicroLoader starting 136 nodes
Apr 05 17:20:59 (I) [1090516192] test1:R000_128 DBBlockController::boot(): making switch settings for block R000_128
Apr 05 17:20:59 (I) [1090516192] test1:R000_128 DBBlockController::boot(): completed switch settings for block R000_128
Apr 05 17:20:59 (I) [1090516192] test1:R000_128 BlockController::load() loading program /bgl/BlueLight/ppcfloor/bglsys/bin/ramdisk.elf
Apr 05 17:21:03 (I) [1090516192] test1:R000_128 BlockController::load() loading program /bgl/BlueLight/ppcfloor/bglsys/bin/zImage.elf
Apr 05 17:21:07 (I) [1090516192] test1:R000_128 BlockController::load() loading program /bgl/BlueLight/ppcfloor/bglsys/bin/rts_hw.rts
Apr 05 17:21:08 (I) [1090516192] test1:R000_128 BlockController::start() starting 136 nodes
Apr 05 17:21:08 (I) [1090516192] test1:R000_128 starting 128 nodes with entry point 0x00000290
Apr 05 17:21:08 (I) [1090516192] test1:R000_128 starting 8 nodes with entry point 0x00800000
Apr 05 17:21:08 (I) [1090516192] test1:R000_128 DBBlockController::waitBoot(R000_128)
Apr 05 17:21:36 (I) [1092621536] test1:R000_128 DBBlockController: contacting I/O node {119} at address 172.30.2.4:7000
Apr 05 17:21:36 (I) [1092621536] test1:R000_128 DBBlockController: contacted I/O node {119} at address 172.30.2.4:7000
..snip..
Apr 05 17:21:36 (I) [1092621536] test1:R000_128 DBBlockController: contacted I/O node {34} at address 172.30.2.5:7000
Apr 05 17:21:38 (I) [1090516192] test1:R000_128 DBBlockController::waitBoot(R000_128) block initialization successful
..snip..

2. Using the date pattern from when the block was allocated (Example 2-27 on page 93), it is possible to create a for loop to identify the I/O nodes that were used for the block and to ensure that they are responding to a ping over the network, as shown in Example 2-28.

Example 2-28 Checking the I/O nodes for the booted blocks

# for i in `grep "R000_128" /bgl/BlueLight/logs/BGL/bglsn-bgdb0-mmcs_db_server-current.log | \
  grep "Apr 05 17:2" | grep contact | awk '{print $14}' | cut -f1 -d':' | uniq | sort`; \
  do echo Pinging ionode IP $i; ping -c1 $i >/dev/null 2>&1; echo Return Code:$?; done
Pinging ionode IP 172.30.2.1
Return Code:0
Pinging ionode IP 172.30.2.2
Return Code:0
Pinging ionode IP 172.30.2.3
Return Code:0
Pinging ionode IP 172.30.2.4
Return Code:0
Pinging ionode IP 172.30.2.5
Return Code:0
Pinging ionode IP 172.30.2.6
Return Code:0
Pinging ionode IP 172.30.2.7
Return Code:0
Pinging ionode IP 172.30.2.8
Return Code:0

Note: This test can be useful when the booting of a block is hanging. If a block cannot boot and there is no indication where the problem might be, try to boot node card sized blocks that are generated by the gensmallblock command to isolate the hardware problem.



2.3.8 Check that a simple job can run (mmcs_console)

If a block can be booted from the mmcs_db_console, then a good test is to see whether you can run a simple job on the block using a program such as hello.rts. You can find an example of this code in Unfolding the IBM eServer Blue Gene Solution, SG24-6686.

To check that a simple job can run, follow these steps:

1. When the block has booted successfully, in this case R000_128, select the block to run the job, using select_block:

mmcs$ select_block R000_128
OK
mmcs$

Note: There is no need to select the block if the previously run command at the mmcs prompt was allocate.

2. Submit the job using submit_job:

mmcs$ submit_job /bglscratch/test1/hello.rts /bglscratch/test1
OK
jobId=257

Note: It is also possible to select the block and submit the job at the mmcs_db_console prompt with the submitjob command.

3. Monitor the job from bglsn-bgdb0-mmcs_db_server-current.log and bglsn-bgdb0-ciodb-current.log, which are located in /bgl/BlueLight/logs/BGL.

4. Check the status of the job from the mmcs_db_console with the list bgljob command. In Example 2-29, job 257 has a status of T (Terminated).

Example 2-29 Checking the status of a job

mmcs$ list bgljob 257
OK
==> DBJob record
_jobid = 257
_entrydate = 2006-03-20 14:55:56.497880
_username = root
_blockid = R000_128
_jobname = Mon Mar 20 14:55:53 2006 .520.1096840416
_executable = /bgl/applications/Examples/hello.rts
_outputdir = /bgl/applications/Examples
_status = T
_errtext =
_action = D
_exitstatus = 0
_mode = C
_starttime = 2006-03-20 14:55:56.258832
_nodesused = 128
_strace = -1
_stdininfo = 1024
_stdoutinfo = 1024
_stderrinfo = 1024

The list bgljob command in the mmcs_db_console queries the DB2 alias BGLJOB. In the same way, you can use the list bgljob_history command to gather data on previously run jobs. You can query the BGLJOB and BGLJOB_HISTORY aliases directly using the /dbhome/bgdb2cli/sqllib/bin/db2 command.
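For example, a minimal sketch of such a query follows. It assumes that the BGLJOB alias exposes the same fields that the console record shows (jobid, username, blockid, status); confirm the exact column names with the Database browser described in Chapter 3:

# /dbhome/bgdb2cli/sqllib/bin/db2 "select jobid, username, blockid, status from bgljob"

The same statement run against BGLJOB_HISTORY returns jobs that have already finished.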

You can also retrieve job information from the BGWEB interface. At the BGWEB home page, click Runtime, and then click Job Information. Figure 2-16 and Figure 2-17 present examples of job information views through the BGWEB interface.



Figure 2-16 BGWEB showing job Information page

Figure 2-17 BGWEB showing detailed Job ID information



An additional check that you can perform is to map the Blue Gene/L job ID to a block using the BGWEB interface:

1. Click Runtime and then Block Information.

2. Select the link for a particular block. It gives details on the block, including the current jobs that are running on it, as shown in Figure 2-18.

Figure 2-18 BGWEB showing block details and jobs for block data

2.3.9 Check the control system server logs

There are certain logs that you can check, depending on what information is required. For information about the control system server logs, see 2.2.6, “Control system server logs” on page 61. There, we include an explanation of the information that is available in each log file.



To check the control system server logs, do the following:

► Block monitoring

The block can be monitored from bglsn-bgdb0-idoproxydb-current.log and bglsn-bgdb0-mmcs_db_server-current.log, which are located in /bgl/BlueLight/logs/BGL. Run the following command on each log:

# tail -F /bgl/BlueLight/logs/BGL/bglsn-bgdb0-idoproxydb-current.log
# tail -F /bgl/BlueLight/logs/BGL/bglsn-bgdb0-mmcs_db_server-current.log

► Job monitoring

When the job is submitted, you can monitor it by checking the bglsn-bgdb0-ciodb-current.log. Use the following command:

# tail -F /bgl/BlueLight/logs/BGL/bglsn-bgdb0-ciodb-current.log
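When scanning these logs after the fact rather than following them live, a simple filter can make problems stand out. This is only a sketch; confirm the exact severity tags used in your own log files (the informational entries shown earlier in this chapter are tagged (I)):

# grep -iE "error|fail" /bgl/BlueLight/logs/BGL/bglsn-bgdb0-mmcs_db_server-current.log | tail -20

The same filter can be applied to the idoproxydb and ciodb logs.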

2.3.10 Check remote shell

In a complex Blue Gene/L system, to execute commands on remote machines (FENs), you can implement rsh and rshd. Here, we show what you need to check to ensure that rsh is configured correctly:

1. Check that the SN and FENs can talk to each other using /usr/bin/rsh in both directions across the functional network. (A loop that automates this check across all FENs is sketched after this checklist.)

– From the SN, use the following commands:

Use rsh date to check the FEN, which should be repeated for each FEN.

# rsh bglfen1_fn date
Mon Apr 3 14:41:44 EDT 2006

Use rsh date to check the SN itself.

# rsh bglsn_fn date
Mon Apr 3 14:41:44 EDT 2006

– From the FENs, use the following commands:

Use rsh date to check the SN.

# rsh bglsn_fn date
Mon Apr 3 14:41:44 EDT 2006

Use rsh date to check the FEN, which should be repeated for each FEN.

# rsh bglfen1_fn date
Mon Apr 3 14:41:44 EDT 2006



2. Check that rsh is enabled with the /sbin/chkconfig command. The following example shows the expected output when rsh is enabled:

# /sbin/chkconfig --list rsh
xinetd based services:
rsh: on

You can double-check that rsh is enabled by looking at the /etc/xinetd.d/rsh file and verifying that the disable field in the stanza is not set to yes. Example 2-30 shows the contents of the /etc/xinetd.d/rsh file with rshd enabled.

Example 2-30 Checking rshd stanza in xinetd configuration

# cat /etc/xinetd.d/rsh
# default: off
# description:
# The rshd server is a server for the rcmd(3) routine and,
# consequently, for the rsh(1) program. The server provides
# remote execution facilities with authentication based on
# privileged port numbers from trusted hosts.
#
service shell
{
    socket_type     = stream
    protocol        = tcp
    flags           = NAMEINARGS
    wait            = no
    user            = root
    group           = root
    log_on_success  += USERID
    log_on_failure  += USERID
    server          = /usr/sbin/tcpd
#   server_args     = /usr/sbin/in.rshd -L
    server_args     = /usr/sbin/in.rshd -aL
    instances       = 200
    disable         = no
}

3. Check the ~/.rhosts file for each user or the /etc/hosts.equiv file. For further details, refer to “Remote command execution setup” on page 151.

Note: If the rsh checks are done after a job has failed, you need to run them as the owner (user) of the job that failed.
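To avoid repeating the step 1 commands by hand on a system with several FENs, a small loop can run the same check against every host. This is a sketch, not part of the original procedure; the host list reuses the functional-network names from the examples above and must be adjusted to your own configuration:

# for h in bglsn_fn bglfen1_fn bglfen2_fn; do echo -n "$h: "; rsh $h date; done

Any host that hangs or returns a permission error points to the name resolution, ~/.rhosts, or /etc/hosts.equiv checks that follow.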



If there is an issue with rsh, then check the following:

1. Check for correct name resolution; host names should resolve identically from all nodes involved. Name resolution should be local (/etc/hosts). A quick way to verify this is sketched after this list.

2. Check that the ~/.rhosts file has the correct IP labels (host names) and users, or that /etc/hosts.equiv has the correct IP labels (host names).

3. Enable rsh on the SN and the FENs, if required, with the /sbin/chkconfig command:

# chkconfig rsh on
# chkconfig --list rsh
xinetd based services:
rsh: on
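As a supplementary check for item 1, getent queries the same name resolution path that rsh uses and can be run on the SN and on each FEN; the output should be identical everywhere. The host names are the ones used in the earlier examples and might differ on your system:

# for h in bglsn_fn bglfen1_fn; do getent hosts $h; done

If a name resolves to different addresses on different nodes, or does not resolve at all, correct /etc/hosts before rechecking rsh.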

2.3.11 Check remote command execution with secure shell

In a complex Blue Gene/L configuration, you can implement secure shell (ssh) for remote command execution. Although you can also use ssh for remote command execution between the SN and I/O nodes, in this section we only cover checks for ssh between the SN and FENs.

Note: For details about setting up secure shell, see “Configuring ssh and scp on SN and I/O nodes” on page 237 and 7.2, “Secure shell” on page 395.

To check remote command execution with ssh, follow these steps:

1. Check that the SN and FENs can talk to each other in both directions through the functional network using /usr/bin/ssh. (A non-interactive test that is useful for scripting is sketched at the end of this section.)

– From the SN:

Use ssh date to check the FEN, which should be repeated for each FEN.

# ssh bglfen1_fn date
Mon Apr 3 14:41:44 EDT 2006

Use ssh date to check the SN itself.

# ssh bglsn_fn date
Mon Apr 3 14:41:44 EDT 2006

– From the FENs:

Use ssh date to check the SN.

# ssh bglsn_fn date
Mon Apr 3 14:41:44 EDT 2006

Use ssh date to check the FEN, which should be repeated for each FEN.

# ssh bglfen1_fn date
Mon Apr 3 14:41:44 EDT 2006

2. Check that sshd is enabled with the /sbin/chkconfig command. The following output shows sshd enabled:

# chkconfig --list sshd
sshd    0:off  1:off  2:off  3:on  4:off  5:on  6:off

3. For further checks, refer to 7.2, “Secure shell” on page 395.

Note: If the checks are being done after a job has failed, they need to be run as the owner (user) of the job that failed.

If there is an issue with ssh, check the following:

1. Ensure that the ~/.ssh/known_hosts and ~/.ssh/authorized_keys files have the correct entries for root and for all users able to submit jobs.

2. Enable sshd on the SN and the FENs, if required, with the /sbin/chkconfig command:

# chkconfig sshd 23
# chkconfig --list sshd
sshd    0:off  1:off  2:on  3:on  4:off  5:off  6:off

For further actions, refer to 7.2, “Secure shell” on page 395 and to the ssh man pages.
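The following sketch is a non-interactive way to confirm that key-based ssh works for a given user without being stopped by a password or host-key prompt. It is a supplementary check, and the host name is taken from the earlier examples:

# ssh -o BatchMode=yes -o ConnectTimeout=5 bglfen1_fn date

If this command fails while an interactive ssh succeeds, the usual causes are a missing entry in ~/.ssh/authorized_keys on the target or an unaccepted host key in ~/.ssh/known_hosts on the source.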

2.3.12 Check the network switches

A good understanding of your network switches and the topology that is used for the Blue Gene/L system configuration is required to solve any problems related to the network switches. (See also 2.2.3, “Network diagram” on page 59.) We recommend that you check the network switches for errors and that you perform a general health check on them. Refer to the network switches documentation and get additional help from your site network administration team.

If there are any issues with the switches, you need to resolve them to ensure reliable connectivity and network performance.



2.3.13 Check the physical Blue Gene/L racks configuration

It is also useful to perform a physical inspection of the Blue Gene/L racks to help identify any obvious problems:

1. Check that the status lights on the service cards and the node cards are in the expected state, as explained in 1.2, “Hardware components of the Blue Gene/L system” on page 2.

2. Check that all cables are connected to the interfaces and are plugged into the correct place.

3. Check the clock card cables on the clock card and the service card.

4. Ensure that the master/slave switch is set to the correct setting.

If there are any issues with the physical configuration of the Blue Gene/L racks, then these issues need to be addressed by qualified service technicians.

Note: Node cards that are connected to a powered switch with a carrier will not show that they have a carrier until they are discovered by the discovery process. Refer to 1.9, “System discovery” on page 43 for more information.

2.4 Problem determination methodology

In this section, we describe a methodology that you can use to locate issues in a Blue Gene/L system. We developed this methodology by first building a core Blue Gene/L system, as listed in “Core Blue Gene/L” on page 56, and injecting errors into that system to simulate faults that might occur. In the process of investigating these errors, we developed this practical problem determination methodology.

We then created a more complex Blue Gene/L system with FENs, LoadLeveler, MPI, and GPFS, as listed in “Complex Blue Gene/L” on page 56. We then evolved the methodology further to ensure that these components were also addressed.



Figure 2-19 illustrates this methodology. In this illustration, the steps of the methodology are in the circles and the check lists are in the squares.

Figure 2-19 Problem determination methodology for Blue Gene/L system

The three steps in our methodology are:

1. Define the problem.
2. Identify the Blue Gene/L system.
3. Identify the problem area.

If you cannot find the problem area in one of the check lists where you think the problem has occurred, then we recommend that you run through the check lists for the other relevant components that you have in your system. If the cause of the problem is not obvious, we recommend that you start with 2.5, “Identifying core Blue Gene/L system problems” on page 108 for help with identifying the problem.


This methodology also allows someone with little experience of a Blue Gene/L system to assess where the problem lies quickly and, perhaps more importantly, to determine whether there is a problem at all.

Following the methodology, you can find the appropriate checks to confirm in more detail where the problem lies. If it is not clear where an issue is in a complex system, then you can break down the problem into smaller chunks and check that each part of the system is working correctly.

We also recommend that your support or administration staff document the diagnosis and resolution steps for any problem that you encounter to help with potential problems in the future.

Problems on the Blue Gene/L system generally fall into the following categories:

► Block initialization
► Runtime issues
► General hardware problems
► Results from diagnostics
► Monitoring tool outputs

You can then check the appropriate components to ensure that they are in a functional state using individual checklists for each component in the current Blue Gene/L environment. This methodology is demonstrated in Chapter 6, “Scenarios” on page 265 through a number of different scenarios.

We recommend that you follow the seven steps in the following list to ensure that you determine the problem, resolve it, and document it correctly. The most important steps are to get a clear description of the problem symptoms and to understand the system that you are dealing with.

We recommend that you:

1. Define the problem.
2. Identify the Blue Gene/L system.
3. Identify the problem area.

The first three points are the three main areas that we cover in this book. However, a normal problem determination methodology extends also to the following steps:

4. If possible, check to see if the issue has occurred before.
5. Generate an action plan to resolve the issue.
6. After you have determined the problem, take corrective measures.
7. Document the problem for future administration purposes.



2.4.1 Define the problem

When you have answered the following questions, you should have enough data to indicate where to start looking for the problem:

► What is the description of the problem?
► Are there logs or output to demonstrate the problem?
► Is the problem reproducible?
► Is this a hardware or a software problem?

2.4.2 Identify the Blue Gene/L system

The Blue Gene/L system might be running within a simple or complex configuration, and it is important that you know the system. A process for understanding which configuration you have is described in 2.2, “Identifying the installed system” on page 57.

2.4.3 Identify the problem area

When you have answered the questions in 2.4.1, “Define the problem” on page 107, go through the following high-level checks to pinpoint the problem area on the Blue Gene/L system:

► Has anything changed recently on the system?

► Where are we seeing the problem?

The questions listed in 2.4.1, “Define the problem” on page 107 should give you a good indication of where to start looking. As mentioned in 2.3, “Sanity checks for installed components” on page 82, the functionality of one part of the Blue Gene/L is dependent on another part functioning correctly. Therefore, even though the initial starting point in trying to determine a problem area might seem obvious, it does not mean this is what is causing the overall problem.

The following list includes the main components that make up a complex Blue Gene/L environment. Each component has a separate checklist to identify whether it is working as expected:

– Software

2.5, “Identifying core Blue Gene/L system problems” on page 108
4.4.9, “LoadLeveler checklist” on page 186
“The mpirun checklist” on page 166
5.3.10, “GPFS Checklists” on page 255
5.2.4, “NFS checklists” on page 218



– Hardware

3.2, “Hardware monitor” on page 114
3.4, “Diagnostics” on page 131
2.3.12, “Check the network switches” on page 103
2.3.13, “Check the physical Blue Gene/L racks configuration” on page 104

2.5 Identifying core Blue Gene/L system problems

Here is a basic checklist that you can use to reveal core Blue Gene/L system-related issues:

► Check what has changed on the system.

► Perform some basic checks from the SN (a quick process-level check that covers several of these items is sketched after this list):

– Check that DB2 is working
– Check that BGWEB is running
– Check that BGLMaster and its child daemons are running
– Check that a block can be allocated using mmcs_console
– Check that a simple job can run (mmcs_console)

► Check the control system server logs.

► Check the RAS Events or use the RAS drill down in the BGWEB for relevant errors from when the issue occurred. See Chapter 3, “Problem determination tools” on page 113 for more information.

► In the BGWEB, check the Runtime link to obtain information about the job that has been run and the blocks in use. Examples are shown in Figure 2-16 on page 98 and Figure 2-17 on page 98. For more information, see Chapter 3, “Problem determination tools” on page 113.

► Check the Configuration link in the BGWEB for hardware issues. Examples are shown in 2.2, “Identifying the installed system” on page 57. For more information, see Chapter 3, “Problem determination tools” on page 113.

► If there is no indication where the problem might be, we recommend that you use the following sequence of checks:

– 2.3.1, “Check the operating system on the SN” on page 83
– 2.3.2, “Check communication services on the SN” on page 84
– 2.3.6, “Check the NFS subsystem on the SN” on page 90
– 2.3.12, “Check the network switches” on page 103
– 2.3.13, “Check the physical Blue Gene/L racks configuration” on page 104
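The process-level check mentioned above can be done with ps before going into the individual procedures. This is only a sketch: the daemon names used here are the ones that appear in the log file names and text of this chapter (db2, mmcs_db_server, idoproxydb, ciodb, bglmaster), and the exact process names should be confirmed against your own installation:

# ps -ef | egrep "db2|mmcs_db_server|idoproxydb|ciodb|bglmaster" | grep -v grep

If one of the expected daemons is missing from the output, start with the corresponding check in the list above.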



2.6 Identifying IBM LoadLeveler jobs to BGL jobs (CLI)

This section gives an overview of how to identify the jobs that are submitted to Blue Gene/L by IBM LoadLeveler. The tasks and actions that we present here are not intended to replace the procedures that are described in the LoadLeveler documentation. Rather, we provide some tips.

We start by setting the ‘-verbose’ option in the LoadLeveler commands file (.cmd), as shown in Example 2-31.

Example 2-31 LoadLeveler job command file with ‘-verbose’ option

bglsn:/bglscratch/test1 # cat ior-gpfs.cmd
#@ job_type = bluegene
#@ executable = /usr/bin/mpirun
#@ bg_size = 128
##@ bg_partition = R000_128
##@ arguments = -verbose 2 -exe /bglscratch/test1/hello-file-2.rts -args 6
#@ arguments = -verbose 2 -exe /bglscratch/test1/applications/IOR/IOR.rts -args -f /bglscratch/test1/applications/IOR/ior-inputs
##@ output = /bgl/loadl/out/hello.$(schedd_host).$(jobid).$(stepid).out
##@ error = /bgl/loadl/out/hello.$(schedd_host).$(jobid).$(stepid).err
#@ output = /bglscratch/test1/ior-gpfs.out
#@ error = /bglscratch/test1/ior-gpfs.err
#@ environment = COPY_ALL
##@ notification = error
##@ notify_user = loadl
#@ class = small
#@ queue

Then, you can check the job’s stderr(2) file for detailed information, as shown in Example 2-32.

Example 2-32 Verbose information in job’s stderr(2) file

..snip..
BE_MPI (Debug): Adding job bglfen1.itso.ibm.com.16.0 to the DB...
BRIDGE (Debug): rm_get_jobs() - Called
BRIDGE (Debug): rm_get_jobs() - Completed Successfully
SCHED_BRIDGE (Debug): Partition RMP24Mr104437181 - No BG/L job assigned to this partition
BRIDGE (Debug): rm_add_job() - Called
BRIDGE (Debug): rm_add_job() - Completed Successfully
BE_MPI (Debug): Job bglfen1.itso.ibm.com.16.0 was successfully added to the DB
BE_MPI (Debug): Quering the DB job ID
BE_MPI (Debug): DB job ID is 199
..snip..

Here, we can see the interaction between LoadLeveler and the Blue Gene/L database:

► LL job: bglfen1.itso.ibm.com.16.0
► Partition ID: RMP24Mr104437181
► BGL job ID: 199

You can map the information in this list to the job information, as reported by the llq command on the Front-End Node, as shown in Example 2-33.

Example 2-33 LoadLeveler queue information (llq command)

test1@bglfen1:/bglscratch/test1/applications/IOR> llq
Id                       Owner      Submitted   ST PRI Class        Running On
------------------------ ---------- ----------- -- --- ------------ -----------
bglfen1.12.0             test1      3/24 09:45  R  50  small        bglsn
bglfen1.13.0             test1      3/24 09:46  R  50  small        bglfen1
bglfen1.15.0             test1      3/24 09:46  R  50  small        bglfen2
bglfen1.16.0             test1      3/24 09:47  R  50  small        bglfen1
bglfen1.17.0             test1      3/24 09:48  I  50  small

Also, you can identify the job that is running (Example 2-33) in the log file ~/loadl/log/StarterLog, from which we extracted the lines shown in Example 2-34.

Example 2-34 Job log file

..snip..
03/24 09:46:23 TI-0 bglfen1.13.0 Prolog not run, no program was specified.
03/24 09:46:23 TI-0 bglfen1.13.0 run_dir = /home/loadl/execute/bglfen1.itso.ibm.com.13.0
03/24 09:46:23 TI-0 bglfen1.13.0 Sending request for executable to Schedd
03/24 09:46:23 TI-0 03/24 09:46:23 TI-0 bglfen1.13.0 User environment prolog not run, no program was specified.
03/24 09:46:23 TI-0 LoadLeveler: 2539-475 Cannot receive command from client bglfen1.itso.ibm.com, errno =2.
03/24 09:46:23 TI-0 bglfen1.13.0 llcheckpriv program exited, termsig = 0, coredump = 0, retcode = 0
03/24 09:46:23 TI-0 bglfen1.itso.ibm.com.13.0 Sending READY status to Startd
03/24 09:46:23 TI-0 bglfen1.13.0 Main task program started (pid=9064 process count=1).
03/24 09:46:23 TI-0 bglfen1.itso.ibm.com.13.0 Sending RUNNING status to Startd
03/24 09:46:24 TI-0 bglfen1.13.0 Blue Gene partition id RMP24Mr104230153.
03/24 09:46:24 TI-0 bglfen1.13.0 Blue Gene Job Name bglfen1.itso.ibm.com.13.0.
..snip..

Here, we have the following relevant information:

► LoadLeveler ID: bglfen1.13.0
► LoadLeveler long job name: bglfen1.itso.ibm.com.13.0
► Partition name: RMP24Mr104230153

To cross-check with the Blue Gene/L database, we can query the TBGLJOB and TBGLJOB_HISTORY tables. The TBGLJOB table shows the currently running jobs, and the TBGLJOB_HISTORY table shows the jobs that have finished.

You can get all the job statistics from these tables, including the BGL job ID, using the DB2 statements shown in Example 2-35.

Example 2-35 Querying the DB2 database for LoadLeveler jobs

bglsn:/bglscratch/test1 # db2 "select jobid,jobname,blockid from TBGLJOB_HISTORY where jobname = 'bglfen1.itso.ibm.com.15.0'"

JOBID       JOBNAME                    BLOCKID
----------- -------------------------- ----------------
        198 bglfen1.itso.ibm.com.15.0  RMP24Mr104232176

1 record(s) selected.





Chapter 3. Problem determination tools

In this chapter, we introduce problem determination tools that are available on the Blue Gene/L system.

We start with an overview of each tool and then provide more detail on each tool in individual sections. We include information about when to use the tool according to the problem determination methodology and how to interpret the output of the tool.

We discuss the following tools:

► Hardware monitor
► Web interface
► Diagnostics
► MMCS console

In addition to these tools, Blue Gene/L provides operational logs, which are explained in 2.2.6, “Control system server logs” on page 61.



3.1 Introduction

This chapter explains how to monitor and analyze your system by using the available problem determination tools. Among these tools, we focus on the hardware monitor, the Web interface, the diagnostics suite, and the MMCS (Midplane Management Control System) console.

Each section contains subsections that illustrate how to start the tool and how to check the results.

3.2 Hardware monitor

The hardware monitor is a tool that is used to capture information about a specified piece of hardware of the Blue Gene/L system. After you start the hardware monitor on the running system, it keeps monitoring and storing the environmental information.

3.2.1 Collectable information

The hardware monitor gathers information about the following Blue Gene/L rack hardware:

► Fans
► Bulk power modules
► Service cards
► Node cards
► Link cards

For each of the devices in the previous list, multiple data items are available. For example, monitoring the fans captures temperature, voltage, speed, and status flags. See Table 3-1 for the complete list of the collectable data. Note that the data for fan modules and bulk power modules is collected by monitoring the service cards.

The hardware monitor stores all of this information in the DB2 database. It is accessible through the Environmental and Database sections in the Web interface, or you can query the database directly using SQL commands.
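If you prefer the SQL route, the DB2 system catalog can be used to find the relevant tables before querying them. This is a sketch only; the TBGL prefix is inferred from the table names shown elsewhere in this book, and the actual names should be confirmed with the Database browser described in 3.3, “Web interface”:

# db2 "select tabname from syscat.tables where tabname like 'TBGL%' order by tabname"

The returned table names can then be queried with ordinary SELECT statements, as shown later in Example 3-3.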



Table 3-1 The complete list of the collectable data

Hardware component    Data collected
Service card          Temperature, voltage, status flags
Link card             Temperature, voltage, power, status flags
Node card             Temperature, voltage, status flags
Fan modules           Temperature, voltage, speed, status flags
Bulk power modules    Temperature, voltage, status flags

Hardware monitor and RAS events

Among the information stored in the DB2 database, the hardware monitor generates a reliability, availability, and serviceability (RAS) event when it finds status information outside the normal range. Those events can be examined by issuing a query to the DB2 database, or they can be viewed using the Web browser interface.

Specifying the MONITOR facility on the RAS Event Query page returns the events generated by the hardware monitor. See 3.3, “Web interface” on page 119 for details about the Web interface.

The hardware monitor also records error messages in its own log file. When it is created, the file is stored in the directory /bgl/BlueLight/logs/BGL as -monitorHW-.log. The hardware monitor writes error messages when it cannot contact or recognize a piece of hardware (for example, when hardware has been removed, added, or replaced).

3.2.2 Starting the tool

Here, we discuss briefly how to start the hardware monitor. For more detailed information, see Blue Gene/L: System Administration, SG24-7178.

You can start the hardware monitor manually from the command line, or you can start it automatically with bglmaster.



Using the GUI

If you have a graphical user interface (GUI) on the Service Node (SN), run the commands shown in Example 3-1 to start the hardware monitor.

Example 3-1 Starting the hardware monitor

cd /bgl/BlueLight/ppcfloor/bglsys/bin
./startmon

This opens a new window, which is shown in Figure 3-1.

Figure 3-1 Initial screen of the hardware monitor

To start monitoring on the hardware:

1. Click Settings → Start Monitoring.
2. In the Start Monitoring window, click the drop-down list box and select ALL LINK CARDS.
3. Click Update.
4. Repeat this procedure for all available cards.


Using the console

You can also start the hardware monitor without a GUI. Issuing the commands in Example 3-2 allows you to start monitoring from a console.

Example 3-2 Start monitoring row 0 from a console

$ cd /bgl/BlueLight/ppcfloor/bglsys/bin
$ ./startmon --row 0 --autostart --nodisplay

Example 3-2 launches the hardware monitor for row 0 of the Blue Gene/L system, monitoring all the cards with the default interval time. The default time interval is 5 minutes for service cards and 30 minutes for the other cards.

Tip: You can use the --row --autostart option for the GUI startup as well. This option opens the window with all the cards monitored with the default interval time.

3.2.3 Checking the results

There are two ways of checking the environmental data that is collected by the hardware monitor. One is by using the Web interface, and the other is by using the GUI window of the hardware monitor. Both methods provide the same information, because they refer to the same DB2 tables.

To check the collected data through the Web interface:

1. Point a Web browser to:

http:///web/index.php

2. Click Environmental.
3. Select one of the information types from the top drop-down list box.
4. Specify the range for the date.
5. Specify the location if needed.
6. Click send.

This procedure returns a table that includes the data that was collected by the hardware monitor in the specified time interval. Figure 3-2 shows an example of querying the temperature of a service card.



Figure 3-2 Querying the environmental data

Also, as previously mentioned, the hardware monitor generates an RAS event in particular circumstances, such as when a hardware component cannot be contacted. Figure 3-3 on page 119 illustrates one of the error messages that is reported on the RAS event pages.

Note: It is a good idea to check both the environmental and the RAS event pages, because some information might come up in only one of these two pages.

Figure 3-3 RAS events generated by the hardware monitor

3.3 Web interface

The Web interface is one of the major tools that you can use for problem determination. It provides a number of ways to analyze your Blue Gene/L system. All the information is stored in the DB2 database that is running on the SN, and a Web browser interface gives an easy way of selecting the necessary information.

The Web interface provides the following basic sections:

► Configuration
► Runtime
► Environmental
► RAS events
► Diagnostics test results
► Database browser



3.3.1 Starting the tool

Figure 3-4 shows the top page of the Web interface.

Figure 3-4 Top page of the Web interface

Tip: Whenever the Blue Gene logo is shown at the top, left corner of a Web page, clicking the logo brings your Web browser back to the top page.

For the Web interface, there is no particular procedure that is required to start or to stop the tool. However, there are certain checks to make sure that the tool is working correctly:

1. Make sure that the DB2 database is running, because all the information accessed through the Web interface is stored in the DB2 database. If required, use the appropriate procedures to start the database, as shown in 2.3.4, “Check that DB2 is working” on page 87.

2. Check whether the Apache Web server is correctly configured and running, including the PHP module. Make sure that the Apache Web server starts when the system restarts and that the name resolution and Apache directories exist as they are defined in the configuration file. For details, see 2.3.3, “Check that BGWEB is running” on page 86.
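A quick command-line confirmation that the Web server answers on the SN can be made with curl, assuming curl is available. This is a supplementary sketch; the host name bglsn is taken from the examples in this book, so substitute your own SN name:

# curl -sI http://bglsn/web/index.php | head -1
HTTP/1.1 200 OK

Any response other than 200 points back to the Apache and PHP checks above.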



3.3.2 Checking the result

When the Web server and DB2 are working correctly, the Web interface is accessible by pointing a Web browser to:

http:///web/index.php

This top page provides links to six basic categories of information with a short description.

Configuration

The Configuration page provides information about the current status of hardware. It includes four subcategories:

► Hardware browser
► Link summary
► Service actions
► Problem monitor

Hardware browser

The hardware browser provides a Web page that gives you a simple and effective way to find which hardware is recognized and which is marked as missing.

If a piece of hardware is in a missing state, then that particular hardware is not available as a system resource. Hardware is marked missing, for example, when a service action is performed (maintenance operation) or when a piece of hardware is in an error state.

Throughout the Web pages, each piece of hardware is represented by its name, based on the naming convention (as listed in 1.2.11, “Rack, power module, card, and fan naming conventions” on page 19). Also, when the hardware browser finds a system resource in a missing state, it highlights the name of the missing system with a red box, as shown in Figure 3-5 on page 122.

Note: If you know the name of the particular hardware in which you are interested, you can enter the name of that hardware in the text box titled Find Hardware, and then press Enter. This jumps to the detailed information page for that hardware.



Figure 3-5 Missing hardware in a red box

The hardware browser provides the exact location when you find a system resource in a missing state and helps you to investigate the cause of that state.

You can also use information that you obtain from the hardware browser with other tools to concentrate on a specific issue. For example, you can use the location of the hardware and search for RAS events that are related to that location.

Link summary

The link summary page gives an overall picture of how the Blue Gene/L system is wired. It shows which midplane is connected to which other midplane in the X, Y, and Z directions.

This page helps you to make sure that all the data cables are connected, detected, and configured correctly.

Service actions

The service actions page shows service actions in progress and the history of completed service actions. A service action consists of two phases: PrepareForService and EndServiceAction. Between these two actions, a hardware service engineer can remove or replace defective or suspect hardware.



An entry of a service action in progress means that PrepareForService has been performed and the ID for the service has been assigned. To complete the service, EndServiceAction must be performed for that ID.

Using service actions allows a specified piece of hardware to be turned off and on for maintenance purposes.

Problem monitor

The problem monitor generates a list of hardware that has been detected with some sort of problem that needs to be solved. It gives the location of the hardware and a short description of the problem.

A location is presented as a link (see Figure 3-6) and helps you to jump quickly to the detailed page for the piece of hardware in question.

Figure 3-6 Problem monitor showing a midplane is in error state

Runtime

The Runtime page provides the current status of jobs and blocks. It includes four subcategories:

► Block information
► Job information
► Midplane information
► Utilization

For our discussion, we focus on the block information and the job information.



Block information

The block information page provides a table of available blocks and their current status (Figure 3-7). The Status and Job Status columns contain the information that might need attention.

Figure 3-7 List of blocks page and related information

For example, if a block is in Booting status for a long time, there might be a problem booting that block. Alternatively, if a job is in the Ready to start status for a long time, this might indicate that there are problems with loading the job.

Tip: Clicking the name of each column title returns a new table that is sorted by the selected column.

For the complete list of the types of status for a block, see Table 4-3 on page 162, and for the complete list of the types of status for a job, see Table 4-4 on page 162.

Note: A block that is created by LoadLeveler acts differently. For a job submitted through LoadLeveler, if a predefined block is not specified, one block is created and allocated dynamically. The dynamically created block remains in the table until the LoadLeveler job ends. Then, it is released (freed), and information about this block is lost.



Clicking a specific block ID reveals detailed information for that block. The Overlapping Blocks table provides useful information, especially if the users of the Blue Gene/L system are using mpirun to submit their jobs (see Figure 3-8).

It is also helpful to check the Hardware used by this block table, especially after you have changed the configuration of the system or you have replaced an I/O card. The table tells you whether a particular block is using the correct hardware.

Figure 3-8 Overlapping blocks and hardware in use

Job information

Job information is provided in two main tables. One includes information for the current jobs, while the other includes the history of jobs submitted. As noted earlier, the status column is also important for the job information table. If a job is in Error status, for example, it is a good idea to click the job ID to see the details of the job. Checking the Error Text can help you to determine the problem.



In addition, the Show RAS events for this job link at the bottom of the page (Figure 3-9) jumps to the RAS event page, showing only the RAS events related to the job in question.

Figure 3-9 Detailed job information

Environmental

You can use the Environmental page to view hardware information that is collected by the hardware monitor, as described in 3.2, “Hardware monitor” on page 114. Refer to that section for more information.

RAS events

The RAS event page allows you to search all RAS events that are stored in the DB2 database. Because the Blue Gene/L system generates a vast amount of events, the RAS event page provides an interface that accepts a number of variables so that you can pick up only the desired information. Specifying a time range, block, job ID, and location are typical examples (Figure 3-10 on page 127).

This RAS event page is one of the most frequently used tools for problem determination, because it gives you an idea of when and where the problem occurred. For example, the Entry Data suggests the cause of the problem.



Figure 3-10 Entry Data in the RAS event query result

Another feature of the RAS event page is the RAS drill down, which is accessible from the link located at the bottom of the page. This drill down provides the number of fatal or advisory RAS events that occurred in a specified period of time for each location.

Clicking the link in front of the location name expands the table to show more information. Compute card, for example, provides a link to a table of RAS events that are specific to the compute card when you fully expand a row in the drill down table, as shown in Figure 3-11 on page 128.



Figure 3-11 RAS drill down showing the number of errors and their location

Diagnostics test results

The Diagnostics test results page shows the results of a diagnostic test in various ways. The first page provides a brief test result for each block. If the last result column is a color other than green (Figure 3-12), you should look into the details by clicking the block ID.

Figure 3-12 The top page for Diagnostics



For detailed information about the diagnostic tests, including how to run the diagnostics, see 3.4, “Diagnostics” on page 131.

Details of the diagnostic test results are provided in the form of a table. Each row in the table represents a single test case with its test results. A number in the result column shows the count of hardware that passed or failed the test. All of the numbers are in the form of a link, and clicking one of the numbers brings you to a more detailed test result that is focused on each piece of hardware (Figure 3-13).

Figure 3-13 Testcase details

Each piece of hardware has individual log files for every test case that is performed in a diagnostic test. Those log files give you an idea of what each test case is looking for. Moreover, if you look into a log file of hardware that is reported as failed, you might find the reason for the failure marked in red, as shown in Figure 3-14.



Figure 3-14 Testcase highlighting the cause of failure

The diagnostic test results page for each block ID also provides two links at the bottom of the page:

► Show all failed tests for 
► Full log for run on 

Although both links are self-explanatory, checking the red and blue lines for the second link is useful. This second link provides a summary of the test at the end and also provides information about the current status of the system. As an example, the diagnostics test can mark a midplane unavailable (or M for missing), depending on the test result.

Database browser

The database browser page provides the complete list of database tables and views that are used in the Blue Gene/L system. All table and view names are presented as links. Clicking them shows you the structure of the definition and the actual data stored in the DB2 database.

Although the Database browser page allows you to look into each table or view, it is not always the best tool to explore the content of the database. For example, if the total number of currently active compute cards for the entire system is in question, there is a better way to obtain this information. Because the table and view names listed in the database browser page are the ones actually used in the system, looking up those names and using DB2 commands directly can help in some situations (see Example 3-3).

Example 3-3 Sample DB2 command with one of the names from database browser

$ db2 "select count(*) from TBGLPROCESSORCARD where status='A'"

1
-----------
         68

1 record(s) selected.



3.4 Diagnostics

The diagnostics suite performs a number of tests on the Blue Gene/L rack to determine system health. The test suite consists of two major sections:

► Blue Gene/L Compute and I/O nodes (BLC ASICs) for a midplane
► Blue Gene/L Link chips (BLL ASICs) for a whole rack

Because all data related to the Blue Gene/L system is stored in the DB2 database, the results of diagnostics are also stored in the database. These results are accessible through the Web interface Diagnostics section. For information about how to view these results, refer to “Diagnostics test results” on page 128.

A system administrator can run diagnostics anytime they are necessary. However, we recommend that you use diagnostics especially in the following circumstances:

► When you encounter a problem and cannot isolate whether the problem is caused by a software or hardware error, run diagnostics. Because the diagnostics suite checks the entire hardware, it helps you to isolate and determine what the problem is. In this case, the diagnostics suite is used for error detection.

► When you replace hardware, run diagnostics immediately after the replacement. The main focus of running the tests is not only on the newly inserted hardware but also on all the hardware in a rack. The diagnostics suite might find a piece of hardware with a few correctable errors across multiple test cases. This type of error might indicate a piece of hardware that requires attention. In this case, the diagnostics suite is used as a precaution.



3.4.1 Test cases

The diagnostics suite includes a number of test cases. Each of the following tables shows a list of test cases. Table 3-2 shows the list for BLC, which is for compute and I/O nodes. Table 3-3 on page 133 shows the list for BLL, which is for link chips. The tables also include a short description of each of the test cases.

Table 3-2 List of available testcases for BLC

Test case: Description

blc_powermodules: Queries the status of each power module on a node card.

blc_voltages: Queries the 1.5V and 2.5V power rails on a node card.

blc_temperatures: Queries all temperature sensors on a node card.

bs_trash_0: Generates random instructions and executes them on PPC instruction unit 0.

bs_trash_1: Same as bs_trash_0 but using unit 1 instead of unit 0.

dgemm160: Tests the floating point unit on the BLC ASIC.

dgemm160e: Extended diagnostics based on dgemm160 for testing the floating point unit.

dgemm3200: Tests both the floating point unit and the memory subsystem on the BLC ASIC for problems not found by the earlier tests.

dgemm3200e: Extended diagnostics based on dgemm3200 for testing the floating point unit and memory subsystem.

dr_bitfail: Writes to and reads from all DDR memory locations and flags all failures. Performs a simple memory test and prints a log description that attempts to identify the failing component. Identifies specific failing bits by ASIC and DRAM pin when possible.

emac_dg: Tests the Ethernet function in loopback on the BLC ASIC. Specifically checks the BLC ASICs, the PHYs, and the connection between the two.

gi_single_chip: Tests whether the global interrupt port is accessible and whether the global interrupt wires can be forced to 0 and 1 using the local global interrupt loopback.

gidcm: Tests whether communication through the global interrupt barrier network is fully functional. One of the compute nodes sends out signals in a specific pattern and the rest of the nodes receive the signals.

linpack: Runs single node Linpack. Aims to examine the hardware health using a program that an ordinary user would submit.

mem_l2_coherency: Exercises all paths from L2 to the connected memory and I/O modules for testing the BLC on-chip cache function.

ms_gen_short: Runs memory pattern tests, a complete memory test that checks the BLC ASIC function and the external DRAM.

power_module_stress: Makes the compute cards use as much power as possible and checks whether the power modules can cope with the power surge.

ti_abist: SRAM ABIST (SRAM Array Built-In Self Test) tests all on-chip SRAM arrays.

ti_edramabist: EDRAM ABIST (EDRAM Array Built-In Self Test) tests all on-chip Embedded DRAM arrays.

ti_lbist: LBIST (Logic Built-In Self Test) tests the on-chip logic for operation at frequency, using random patterns.

tr_connectivity: Checks basic connectivity of the collective network.

tr_loopback: Tests the on-chip functionality of the collective unit on the BLC ASIC.

tr_multinode: Thorough check of the collective network.

ts_connectivity: Checks basic connectivity of the torus network.

ts_loopback_0: Tests the on-chip functionality of the torus unit on BLC ASIC 0.

ts_loopback_1: Same as ts_loopback_0 but using unit 1 instead of unit 0.

ts_multinode: Thorough check of the torus network.

Table 3-3   List of available test cases for BLL

Test case             Description
bll_lbist             LBIST (Logic Built-In Self Test) tests the on-chip logic for operation at frequency, using random patterns.
bll_lbist_linkreset   Resets and re-initializes the link chips.
bll_lbist_pgood       Resets the link chips to a simple state after the system is turned on.
bll_powermodules      Queries the status of each power module on a link card.
bll_temperatures      Queries all temperature sensors on a link card.
bll_voltages          Queries the 1.5V and 2.5V power rails on a link card.


3.4.2 Starting the tool

There are several ways to run the diagnostic suite on the Blue Gene/L system. Among those, we recommend that you use the following scripts:

/bgl/BlueLight/ppcfloor/bglsys/bin/submit_rack_diags.ksh
/bgl/BlueLight/ppcfloor/bglsys/bin/submit_midplane_diags.ksh

Both of these scripts accept a midplane or rack identifier as an argument (see Example 3-4). If none is specified, the scripts use R00 as the default rack identifier and R000 as the default midplane identifier.

Example 3-4   Specifying an identifier for the scripts

$ /bgl/BlueLight/ppcfloor/bglsys/bin/submit_rack_diags.ksh R01
$ /bgl/BlueLight/ppcfloor/bglsys/bin/submit_midplane_diags.ksh R010

The scripts also accept optional arguments. For a list of acceptable arguments, issue the command ./rundiag -h from the /bgl/BlueLight/ppcfloor/bgldiag/ directory, or for detailed instructions (the manual page), issue ./rundiag -help from the same directory.

You can also run a subset of tests from the diagnostic suite by choosing which test to run from a menu in interactive mode, as shown in Example 3-5.

Example 3-5   Invoking the diagnostic suite menu

$ cd /bgl/BlueLight/ppcfloor/bgldiags/common
$ ./rundiag -host localhost -block <blockID>
--- Blue Gene/L System Diagnostic Console ---
1    Power, Packaging & Cooling tests
2    Link chip diagnostic
3    Compute & IO node BIST engine tests
4    Compute & IO single node bootstrap tests
5    Compute & IO single node kernel tests
6    Compute & IO node global interrupt tests
7    Compute & IO multi-node tests
8    IO node only tests
9    Compute & IO node BLADE exercisers (e.g. dgemm3200e)
10   Compute & IO node power modules stress test
11   Compute & IO node BLRTS tests
19   Enter remarks or notes for this run (accepts also: 'remarks' or 'r')
20   Print the current remarks or notes for this run (accepts also: 'printremarks' or 'p')
exit exit (short: e,q)

Note: Running the diagnostics suite using the provided scripts creates its own block to run the tests. The block includes all the available I/O nodes that are installed in the system. For certain configurations, for example when one I/O card is installed per node card but only one Ethernet cable is connected to the node card, the diagnostic tests might cause a problem for this block. Because an I/O card supports two Ethernet cables and the diagnostic block expects to have two Ethernet connections, the block might fail to boot with a no ethernet link error message. Thus, depending on your configuration, some test cases might be unable to execute.

If you have such a configuration, use the rundiag command with the -block option instead of the script. Make sure that you specify a bootable (or valid) block for the option.

Tip: You can run test cases interactively from the menu, or you can add the -batch option to run test cases in batch mode.
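For instance, a batch run on a specific block might look like the following sketch. The -host and -block options are taken from Example 3-5 and -batch from the tip above; R000_128 is simply the block name used elsewhere in this book, and any further arguments that select particular test cases in batch mode are described by ./rundiag -help:

$ cd /bgl/BlueLight/ppcfloor/bgldiags/common
$ ./rundiag -host localhost -block R000_128 -batch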

For more detailed information, refer to the diagnostics documentation that comes with each driver release, which is located in /bgl/BlueLight/ppcfloor/bgldiags/doc.

3.4.3 Checking the results

The results of the diagnostic suite are stored in the DB2 database and in log files. You can use the Web interface for easy access to the content that is stored in the database. (See “Diagnostics test results” on page 128 for more information.)

You can also check the results in the log files. The log files are stored in the /bgl/BlueLight/logs/diags/ directory (on the SN). Each time the diagnostic suite runs, it creates a directory whose name consists of the start time, followed by the block ID and the midplane or rack identifier. Inside this directory, the diagnostics store various log files, scripts, and a directory for each individual test case.

While the Web interface provides a log file for each node in each test case, it is still worth looking at the files in the diags directory. Here is a typical example of such a case.

Assume that one of the test cases has failed and a large number of nodes has reported errors. You might first want to look into the individual error logs through the diagnostics test results page. Although it depends on the problem that you encounter, often some nodes return a different error message from the others. In such a case, instead of clicking each link for a log file on the Web interface, it is sometimes more efficient to look into the files under the logs directory. Some files might have a larger file size, which indicates that the particular node produced a larger amount of error messages.

Because those log files are plain text, if you are looking for a particular message, you can use tools such as awk, sed, grep, and so on to filter messages.
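As a minimal sketch of that approach (the run directory name and the search string below are placeholders, not actual output), you could sort a run’s log files by size and then search them for a suspect message:

# Change into the directory created for a particular diagnostics run; its name
# encodes the start time, the block ID, and the midplane or rack identifier.
$ cd /bgl/BlueLight/logs/diags/<run_directory>

# List the log files largest first; unusually large files often belong to the
# nodes that produced the most error messages.
$ ls -lS

# Show which files contain a particular message (the string is only an example).
$ grep -rl "parity error" .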

3.5 MMCS console

3.5.1 Starting the tool

The Midplane Management Control System (MMCS) console is a tool that provides various commands to control and maintain blocks and jobs (MMCS_DB). Although this tool is not designed for problem determination, it is an important and useful tool for obtaining the current status of the system.

To launch the MMCS console, issue the following commands on the SN:

$ cd /bgl/BlueLight/ppcfloor/bglsys/bin
$ ./mmcs_db_console

These commands open a shell with a prompt that starts with mmcs$. If launching mmcs_db_console does not provide the mmcs$ prompt and instead exits with an error message, check the following:

► mmcs_db_server, idoproxydb, and ciodb must be running on the SN.
► A path to the DB2 client libraries must be included in your PATH environment variable. Running the following command updates the PATH variable:

  $ . ~bgdb2cli/sqllib/db2profile

3.5.2 Checking the results

Most of the commands return their results to the console, just like an ordinary shell. Some of the commands that affect the system status might update the DB2 database as well.

To list all the available commands in mmcs_db_console, type help at the mmcs$ prompt. This command returns a list of commands and their syntax. If you know the command for which you are looking, then help <command> shows the syntax and a short description of the desired command.
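For example, to check the syntax of the allocate command that is used later in Example 3-8, you could type the following (the exact wording of the help output depends on the driver level):

mmcs$ help allocate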

Among the many useful commands that are provided by the MMCS console, we focus on the ones that are particularly helpful for problem determination.

Interacting with nodes

Generally, opening an interactive login shell on an I/O node is not desirable (because of the amount of memory required to open a shell). However, the MMCS console provides a tool, called write_con, which is useful if you have to look into an I/O node. The write_con utility allows you to submit a command to run on the I/O node. By using the target option, nodes and IDo chips can be specified.

Example 3-6 shows how to send the hostname command to all I/O nodes in the block. The same example also demonstrates the usage of the target option. The {i} option specifies all I/O nodes of the selected block. Each node is represented by the number in curly braces. In the example, the first I/O node returned its host name, ionode4, and it was recognized as node 119 in the system.

Example 3-6   Example of using write_con and the target option

mmcs$ redirect R000_128 on
OK
mmcs$ {i} write_con hostname
OK
mmcs$ Apr 06 18:59:41 (I) [1079031008] {119}.0: h
Apr 06 18:59:41 (I) [1079031008] {119}.0: ostname
ionode4
$
Apr 06 18:59:41 (I) [1079031008] {102}.0: h
Apr 06 18:59:41 (I) [1079031008] {102}.0: ostname
ionode3
$
Apr 06 18:59:41 (I) [1079031008] {17}.0: h
Apr 06 18:59:41 (I) [1079031008] {17}.0: ostname
ionode2

The use of the target option also allows you to specify one particular I/O node. For example, to send the hostname command to the I/O node that is recognized as #17, use {17} instead of {i}, as shown in Example 3-7.

Example 3-7   Targeting a specific node

mmcs$ redirect R000_128 on
OK
mmcs$ {17} write_con hostname
OK
mmcs$ Apr 06 19:33:34 (I) [1079031008] {17}.0: h
Apr 06 19:33:34 (I) [1079031008] {17}.0: ostname
ionode2

Tip: If you need to know the physical location of a given node number, the locate command provides a list for the selected block.

The redirect command, also used in Example 3-7, enables the output of the sent command to be displayed in the MMCS console. The output is also recorded in a log file for the specified node. The log file is located in the /bgl/BlueLight/logs/BGL directory.

Booting a block and submitting a job

You can use the MMCS console to check whether a block boots successfully and a job runs correctly. Example 3-8 illustrates booting a block and submitting a simple job, which is a good check when isolating a problem.

Example 3-8   A sequence of submitting a job

$ cd /bgl/BlueLight/ppcfloor/bglsys/bin
$ ./mmcs_db_console
connecting to mmcs_server
connected to mmcs_server
connected to DB2
mmcs$ list_blocks
OK
mmcs$ allocate R000_128
OK
mmcs$ list_blocks
OK
R000_128 root(1) connected
mmcs$ submit_job /bgl/hello/hello.rts /bgl/hello
OK
jobID=285
mmcs$ list_jobs
OK
mmcs$ free R000_128
OK
mmcs$ quit

After issuing the list_jobs command, do not forget to check the output of the program and confirm that it ran as expected.


Chapter 4. Running jobs

This chapter describes the parallel programming, compilers, and job submission environment on the Blue Gene/L system. We also briefly discuss the Message Passing Interface (MPI), which is the foundation for large-scale scientific and engineering applications. The topics covered are:

► Parallel programming environment
► Compilers
► Job submission mechanisms (how they plug into the Blue Gene/L environment)
  – Submit job (submit_job command) from the Midplane Management Control System (MMCS)
  – mpirun command (a stand-alone program for submitting jobs)
  – LoadLeveler, which is the IBM job management for AIX and Linux (batch queuing system)


4.1 Parallel programming environment

The Message Passing Interface (MPI) is a parallel programming environment that has been ported and modified to suit the Blue Gene/L system. In Blue Gene/L terminology, each partition or block is a set of compute nodes and I/O nodes. The Compute Node Kernel (CNK) is a very lightweight kernel that was developed specifically for the Blue Gene/L system and supports a very limited set of system calls (about 35% of the Linux kernel system calls).

All of the I/O calls (such as open, read, write, and so forth) for the compute nodes are shipped to the I/O nodes, which perform the actual work. Also, the compute nodes cannot be reached directly from the external network; they can be reached only through the I/O nodes. At any instant, only a single user can allocate and use a partition, in the sense that no context switching is possible on the compute nodes.

Each compute node is seen as a single process or thread by any application that is executing code on the system. Multiple compute nodes form a virtual communication network when an application executes. This concept is not specific to the Blue Gene/L system; any parallel application forms such a virtual set of executing processes. The MPI_COMM_WORLD argument that is specified in MPI library calls binds all of these processes into one group.

There are three basic communication modes in MPI:

► Point-to-point communication (for example, MPI_Send/MPI_Recv and so forth)
► Collective communication (for example, MPI_Scatter, MPI_Barrier, and so forth)
► Collective communication and computation (for example, MPI_Reduce and so forth)

Note: Certain non-blocking communication calls (for example, MPI_Isend/MPI_Irecv) exist in the MPI library that are out of the scope of our discussion here. These are simply point-to-point communication calls.

Let us briefly look at each of the modes:

► Point-to-point communication mode describes the communication between processes, which in our case takes place between compute nodes (one compute node runs one process at a point in time). In order to achieve fast communication between the compute nodes, a three-dimensional torus (3D torus) network was designed into the Blue Gene/L system. Each compute node is connected to its six neighbor compute nodes using the torus network, as shown in Figure 1-25 on page 26 (see also the short code sketch after this list).

► With collective communication mode, the communication for I/O calls and some MPI calls is passed through the collective network (see Figure 1-26 on page 27). However, one of the MPI calls, MPI_Barrier, uses a different network (see Figure 1-27 on page 29) that is implemented in the Blue Gene/L system. This is because this call specifically requires every process to reach a synchronized state before processing can continue. In summary, in order to achieve high bandwidth and low latency, MPI on the Blue Gene/L system is designed to take advantage of the underlying network topology.

► Collective communication and computation mode is similar in nature to the collective communication mode. However, it depends on the implementation details of MPI, which are out of the scope of our discussion here.
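The following minimal C sketch is not taken from the Blue Gene/L documentation; it simply illustrates, under standard MPI assumptions, the point-to-point and collective calls named in the list above. Rank 0 sends one integer to rank 1, and all ranks then synchronize with MPI_Barrier. It assumes the job runs with at least two processes and would be compiled in the same way as the sample program shown later in this chapter.

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank;
    int value = 0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Point-to-point: rank 0 sends one integer to rank 1 */
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* ... and rank 1 receives it */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("Rank 1 received %d\n", value);
    }

    /* Collective: every rank waits here until all ranks have arrived */
    MPI_Barrier(MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}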

Example 4-1 shows the sample Hello world! program C code, which we use throughout different sections in this chapter when doing compilation and job submission on the system.

Example 4-1   Sample “Hello world!” program in C

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    int numprocs;   /* Number of processors */
    int MyRank;     /* Processor number */

    /* Initialize MPI */
    MPI_Init(&argc, &argv);

    /* Find this processor number */
    MPI_Comm_rank(MPI_COMM_WORLD, &MyRank);

    /* Find the number of processors */
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);

    printf("Hello world! from processor %d out of %d\n", MyRank, numprocs);

    /* Shut down MPI */
    MPI_Finalize();

    printf("I am the Root process\n");
    return 0;
}

The following steps explain the program shown in Example 4-1 for users who do not have experience with MPI:

1. The MPI environment is initialized using the MPI_Init call, by which all the processes recognize each other and start simultaneously.

2. MPI_Comm_rank gives the unique identity of each process within MPI_COMM_WORLD (the set of processes), and MPI_Comm_size gives the information about how many processes are in MPI_COMM_WORLD. MPI_COMM_WORLD is a dynamic parameter that is initialized and updated during the application execution. We explain this in 4.3.2, “The mpirun program” on page 150.

3. Each process prints the Hello world! message along with its identification and the total number of processes for the current execution.

4. The MPI_Finalize call terminates the MPI environment and all processes except the main process that executes this program (the root process).

   Note: The root process (rank zero, also called the master process or thread) of an MPI application should not be confused with the operating system root user ID.

5. The root process executes the printf() function (C programming call), and the program returns.

Note: For details about the MPI programming standard, refer to:
http://www.mcs.anl.gov/mpi


4.2 Compilers

The compiler is one of the major software packages that needs to be discussed before we move on to the execution environment. Depending on your site and the size of your Blue Gene/L system, you can install a number of FENs to balance the load of the user community. (Refer to Unfolding the IBM eServer Blue Gene Solution, SG24-6686, for an overview of the required software on the Front-End Nodes (FENs) and the Service Node (SN).)

To use the XLC/XLF compilers on the FENs (Linux PPC64 platform), you need a formal license agreement. Because Blue Gene/L has a different processor architecture (PowerPC 440) and there is no way to compile a job on the compute node, applications need to be cross-compiled. The XLC/XLF compilers require additional add-on RPMs for the Blue Gene/L system in order to compile user applications. On the SN, the Blue Gene/L control system RPMs are installed under the /bgl directory (a shared file system across the FENs and I/O nodes). The CNK supports applications that are compiled either with the blrts-gnu-gcc/g77 compilers or with the blrts_xlc/xlf compilers.

The cross-compiler environment can be summarized by the following required components:

► A Front-End Node running SuSE SLES 9 on PPC64 (POWER4 and POWER5)
► PowerPC-Linux-GNU to generate PowerPC-blrts-GNU
► The GNU tool chain for Blue Gene/L
► The IBM XL cross compilers for Blue Gene/L

Currently, to build binaries (executables) for Blue Gene/L, the IBM XL compilers require the following:

► Installation of the IBM XLC V7.0/XLF V9.1 compilers for SuSE SLES9 Linux/PPC64
► Installation of the Blue Gene/L add-on that includes Blue Gene/L versions of the XL run-time libraries, compiler scripts, and configuration files.
  – The GNU Blue Gene/L tool chain:
    gcc, g++, and g77 v3.2
    binutils (as, ld, and so forth) v2.13
    GLIBC v2.2.5
  – Blue Gene/L support is supplied through patches. You apply the patches and build the tool chain; IBM supplies scripts to download, patch, and build everything.

Note: For further reference on the compilers, refer to Unfolding the IBM eServer Blue Gene Solution, SG24-6686.

4.2.1 The blrts tool chain

The blrts tool chain is the GNU tool chain, but it has been adapted so that it generates code and programs that run in the Blue Gene/L environment. The following packages (open source and GPL license) are required for building the tool chain:

► binutils-2.13
► gcc-3.2
► gdb-5.3
► glibc-2.2.5
► glibc-linuxthreads-2.2.5

Because these RPMs can be downloaded from open source community Web sites, IBM provides the patches and a script for building the tool chain. This script, /bgl/BlueLight/ppcfloor/toolchain/buildBlrtsToolChain.sh, applies the patches and builds the blrts-gnu directory, where the compilers (powerpc-bgl-blrts-gnu-gcc/g77), debuggers (gdb), and so forth are installed.

The default cross compiler used for generating Blue Gene/L code is powerpc-bgl-blrts-gnu-gcc.

Note: Refer to the readme file for the tool chain, which is available on the customer site, when downloading updated drivers.

Compilation process using blrts-gnu-gcc/g77

You compile and link user applications on the FENs only. In Example 4-2, we compile the sample Hello world! program using the blrts-gnu-gcc compiler. This is a parallel program that requires the MPI libraries and header files to be included while generating the executable code.

Example 4-2   Compiling the “Hello world!” program using blrts-gnu-gcc

test1@bglfen1:~/Examples/codes> ls
hello-world.c
test1@bglfen1:~/Examples/codes> /bgl/BlueLight/ppcfloor/blrts-gnu/bin/powerpc-bgl-blrts-gnu-gcc -o hello-world.rts hello-world.c -I/bgl/BlueLight/ppcfloor/bglsys/include -L/bgl/BlueLight/ppcfloor/bglsys/lib -lmpich.rts -lrts.rts -lmsglayer.rts -ldevices.rts
test1@bglfen1:~/Examples/codes> ls
hello-world.c hello-world.rts
test1@bglfen1:~/Examples/codes>

Tip: To simplify the syntax of the compile command line, you can supply the compiler options, the include files, and the libraries from a makefile.
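A minimal makefile sketch follows; it is not shipped with the driver, and the file and variable names are our own choices. It simply packages the compiler path, include path, and libraries from Example 4-2 so that typing make produces hello-world.rts (note that the recipe lines must begin with a tab character):

CC      = /bgl/BlueLight/ppcfloor/blrts-gnu/bin/powerpc-bgl-blrts-gnu-gcc
CFLAGS  = -I/bgl/BlueLight/ppcfloor/bglsys/include
LDFLAGS = -L/bgl/BlueLight/ppcfloor/bglsys/lib
LIBS    = -lmpich.rts -lrts.rts -lmsglayer.rts -ldevices.rts

hello-world.rts: hello-world.c
	$(CC) $(CFLAGS) -o $@ hello-world.c $(LDFLAGS) $(LIBS)

clean:
	rm -f hello-world.rts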

In this example, we compiled the hello-world.c program using the powerpc-bgl-blrts-gnu-gcc compiler and linked it against the MPI library (libmpich.rts), the run-time library (librts.rts), and the messaging layer and devices layer libraries (libmsglayer.rts, libdevices.rts) that are specific to the Blue Gene/L environment.

Note: The include directory path is required for compiling MPI programs because it contains the mpi.h file (plus additional header files).

Table 4-1 includes a list of the libraries that are required to compile and link applications.

Table 4-1   Libraries and their associated RPMs in the Blue Gene/L driver

Libraries          RPMs (under which they are present)
libmpich.rts.a     bglmpi-2006.1.2-1
libmsglayer.rts    bglmpi-2006.1.2-1
librts.rts.a       bglcnk-2006.1.2-1
libdevices.rts.a   bglcnk-2006.1.2-1

You can also compile and link parallel applications using the mpicc command, as shown in Example 4-3. Because MPI is already installed as a separate package (RPM), installing the Blue Gene/L driver adds the following scripts:

► mpicc (C compiler)
► mpicxx (C++ compiler)
► mpif77 (FORTRAN compiler)

Example 4-3   Compiling the “Hello world!” program using the mpicc command

test1@bglfen1:~/Examples/codes> ls
hello-world.c
test1@bglfen1:~/Examples/codes> mpicc -o hello-world.rts hello-world.c
test1@bglfen1:~/Examples/codes> ls
hello-world.c hello-world.rts
test1@bglfen1:~/Examples/codes> file hello-world.rts
hello-world.rts: ELF 32-bit MSB executable, PowerPC or cisco 4500, version 1 (embedded), statically linked, not stripped

If the compile process is successful, move the executable into the /bgl or /bglscratch file system (depending on which file systems are shared across the SN, FENs, and I/O nodes) in order to make the binary available for job execution. For further information about mpicc, refer to the /bgl/BlueLight/ppcfloor/bglsys/bin/mpicc script.

Note: The versions of the previously mentioned RPMs vary depending on the Blue Gene/L driver release.

4.2.2 The IBM XLC/XLF compilers

This section discusses the add-ons and extra RPMs that are essential for compiling applications using the IBM XLC/XLF compilers. The basic XLC/XLF compilers for SLES9 are installed on the FENs as a set of three RPMs for each of the C and Fortran compilers (with two of the RPMs common to both). The Blue Gene/L add-on RPMs are required to wrap around the generally available XLC/XLF compilers on the FEN systems because they are tuned especially for the Blue Gene/L version of the PowerPC 440.

When applications are compiled with these compilers, they can take advantage of the underlying processor architecture and, in turn, yield good timing and performance results. Also, depending on the application, you can experiment with optimization flags for best results.

The XL compilers require the blrts tool chain in order to function. Some of the blrts tool chain features are:

► The assembler and linker from the blrts tool chain are used to create programs that run in the Blue Gene/L environment.
► The run-time libraries are built with the blrts tool chain.
► Run-time routines from the blrts tool chain (glibc) are used by applications.
► Binary compatibility is maintained with the gcc compiler in the blrts tool chain (that is, you can link .o files from the XL compilers and the gcc compilers and they should run).
► Many of the other tools in the blrts tool chain (that is, gdb, gmon, and so forth) are supported.

There are four XLC/XLF RPMs that you must install on the FENs for compiling applications on the system:

► bgl-vacpp-7.0.0-5.ppc64.rpm (C/C++ compiler for Blue Gene/L)
► bgl-xlf-9.1.0-5.ppc64.rpm (Fortran compiler for Blue Gene/L)
► bgl-xlmass.lib-4.3.0-5.ppc64.rpm (MASS mathematical library for Blue Gene/L)
► bgl-xlsmp.lib-1.5.0-5.ppc64.rpm (dummy SMP library in case the -qsmp option has been used - for pre-compiled code)

You can download these RPMs from the Blue Gene/L customer site.

Tip: The xlmass and xlsmp options are common across the C and Fortran compilers. Because the XLC/XLF compilers support the -qsmp option, the dummy SMP library is provided to avoid errors in programs that require the use of this flag. These versions can be updated at a later stage. Refer to the IBM compiler download site for the latest information about the releases.

Compiling using blrts_xlc/xlf

With the IBM blrts_xlc/xlf compilers, user applications are compiled on the FENs. Then, the executable files are copied to the shared file system (NFS-mounted /bgl or /bglscratch) that is mounted on the I/O nodes. After the executable files are available there, the process follows compiler steps similar to those described in “Compilation process using blrts-gnu-gcc/g77” on page 146 and illustrated in Example 4-4. The advantage of using the blrts_xlc/xlf compilers over the GNU compilers is that the IBM compilers provide numerous additional flags that can optimize and tune large-scale applications on the Blue Gene/L system.

Example 4-4   Compiling the “Hello World!” program using the blrts_xlc compiler

test1@bglfen1:~/Examples/codes> ls
hello-world.c
test1@bglfen1:~/Examples/codes> blrts_xlc -o hello-world.rts hello-world.c -I/bgl/BlueLight/ppcfloor/bglsys/include -L/bgl/BlueLight/ppcfloor/bglsys/lib -lmpich.rts -lrts.rts -lmsglayer.rts -ldevices.rts
test1@bglfen1:~/Examples/codes> ls
hello-world.c hello-world.rts

4.3 Submitting jobs using built-in tools

You can submit a job on a Blue Gene/L system in three ways:

► Using the submit_job command from the Midplane Management Control System (MMCS)
► Using the mpirun command (a stand-alone program for submitting jobs)
► Using LoadLeveler, which is the IBM job management for AIX and Linux (batch queuing system)

4.3.1 Submitting a job using MMCS

The system administrator can submit jobs on Blue Gene/L by logging on to the MMCS console. As shown in 2.2.8, “Job submission” on page 66, the main idea of providing this access for the administrator is to check the status of the partitions, I/O nodes, and IP addresses, and to know which file systems have been mounted on the I/O node(s), and so on. For submitting a job, refer to 2.3.8, “Check that a simple job can run (mmcs_console)” on page 96.

4.3.2 The mpirun program

The mpirun program is a stand-alone program that is used to execute parallel applications on the system. It is a command that is bundled in the MPI RPM of the Blue Gene/L driver set, named bglmpi-2006.1.2-1 (for the Blue Gene/L driver version used in this book). This RPM is also installed along with the control system on the SN. Users are not allowed to log on to the SN, so mpirun is designed so that the way it operates through the FENs and the SN is transparent to the user.

The mpirun program is divided into two components: a front-end mpirun (which executes on the FEN) and a back-end mpirun (which executes on the SN). This distinction is made to ensure secure access to the control system database. Normally, the DB2 client is installed on the SN only.

Let us consider an example in which we submit a job using mpirun on the Front-End Node. This node communicates with the back-end mpirun, which queries the database about the partition state (in our case, a predefined partition). Depending on the query result, it decides whether to go ahead and boot the partition (or, if it has already been allocated, to return a message). See Figure 4-1.

Figure 4-1   Front-end and back-end mpirun communication (the front-end mpirun on a FEN communicates through rsh/ssh with the back-end mpirun on the SN, which runs under rshd/sshd)

Remote command execution setup

In this section, we discuss the two remote execution environments that we used in our sample environment, which is shown in Figure 4-2:

► Remote shell (rsh/rshd)
► Secure shell (ssh/sshd)

Figure 4-2   Sample environment used for this redbook (one SN and two FENs running SLES9 on 64-bit PPC, connected through the service and functional network switches to Blue Gene/L rack 00, midplane R00-M0)

Remote shell (rsh/rshd)

The system administrator must enable the rshd service (part of the xinetd system) on the SN and the FENs so that users can execute remote commands between the SN and the FENs.

The system administrator can set up the remote shell server (rshd) on the SN and FENs by first checking whether rsh is enabled, using the /sbin/chkconfig command as follows:

# /sbin/chkconfig --list rsh
xinetd based services:
    rsh: on

If the status is off, you can enable the service as follows:

# /sbin/chkconfig --set rsh on ; /etc/init.d/xinetd start rsh

Note: xinetd stores the child daemons’ configuration files in the /etc/xinetd.d directory. Alternatively, you can check whether rshd is enabled in the rsh configuration file that is located in this directory. If the disable = no line shown in the following example is present, or if this line is missing altogether, then rshd is enabled.

# cat /etc/xinetd.d/rsh
# default: off
# description:
# The rshd server is a server for the rcmd(3) routine and,
# consequently, for the rsh(1) program. The server provides
# remote execution facilities with authentication based on
# privileged port numbers from trusted hosts.
#
service shell
{
    socket_type     = stream
    protocol        = tcp
    flags           = NAMEINARGS
    wait            = no
    user            = root
    group           = root
    log_on_success  += USERID
    log_on_failure  += USERID
    server          = /usr/sbin/tcpd
#   server_args     = /usr/sbin/in.rshd -L
    server_args     = /usr/sbin/in.rshd -aL
    instances       = 200
    disable         = no
}

To allow remote commands to execute without a password prompt between the SN and FENs, the remote shell server (rshd) must be aware of the identities that are allowed for these operations. We tested two ways to implement this type of execution:

► A system-wide implementation (/etc/hosts.equiv, which could pose a potential security threat)
► A user-based implementation, which requires populating the ~/.rhosts file in the users’ home directories (permission 600)

Example 4-5 shows the contents of the /etc/hosts.equiv file and the ~/.rhosts file that we created for our environment (one SN, two FENs, service network, functional network, and public network). Figure 4-2 on page 151 illustrates this environment.

Example 4-5   The rsh setup using /etc/hosts.equiv

test1@bglfen1:~> cat /etc/hosts.equiv
#
# hosts.equiv    This file describes the names of the hosts which are
#                to be considered "equivalent", i.e. which are to be
#                trusted enough for allowing rsh(1) commands.
#
# hostname
bglfen1
bglfen2
bglfen1_fn
bglfen2_fn
bglsn
bglsn_fn

Example 4-6 shows the ~/.rhosts file for user test1 on the SN and FENs. We recommend this method, because it allows for more access granularity.

Example 4-6   rsh setup on a per-user basis (user test1)

test1@bglfen1:~> cat ~/.rhosts
bglfen1 test1
bglfen2 test1
bglsn test1
bglfen1_fn test1
bglfen2_fn test1
bglsn_fn test1

Note: A similar configuration is performed on the Service Node and the Front-End Nodes (the ~/.rhosts file or the /etc/hosts.equiv file is distributed to the SN and FENs). Best cluster practices require that you set up the same user identities (user ID and name) on all nodes (in our case, test1).

After remote shell is set up, check that it is working as expected. You have to test remote command execution through rsh from each node to every other node. Check that the SN and FENs can talk to each other using /usr/bin/rsh in both directions across the functional network:

► From the SN:
  – Using the date command, check the connection to a FEN. Repeat this for each FEN.
    test1@bglsn_fn:~> rsh bglfen1_fn date
    Mon Apr 3 14:41:44 EDT 2006
  – Using the date command, check the SN itself.
    test1@bglsn_fn:~> rsh bglsn_fn date
    Mon Apr 3 14:41:44 EDT 2006

► From the FENs:
  – Using the date command, check the connection to the SN.
    test1@bglfen1_fn:~> rsh bglsn_fn date
    Mon Apr 3 14:41:44 EDT 2006
  – Using the date command, check the connection to the other FENs. Repeat this for each FEN.
    test1@bglfen1_fn:~> rsh bglfen2_fn date
    Mon Apr 3 14:41:44 EDT 2006

If there is any problem executing the date command across nodes, check the rshd server on the SN and FENs and the /etc/hosts.equiv or ~/.rhosts files.

Secure shell (ssh/sshd)

An alternate way for remote command execution (needed by the job submission process if your environment requires enhanced security) is secure shell (ssh/sshd). You can use secure shell with mpirun: the mpirun command has an option for specifying which remote shell to use when submitting a job (-shell). We recommend that you set up secure shell on the SN and FENs in such a way that remote command execution is allowed (for the designated users) but logging in to the Service Node is not.

Note: This behavior (remote command execution allowed but login not allowed) is desired for security reasons. Remote shell (rsh/rshd) allows this behavior because another daemon, usually either rlogind or telnetd, services the login requests. However, by default, the secure shell server (sshd) allows both remote command execution and login; you can alter this behavior to serve your purpose. Refer to 7.2.4, “Using ssh in a Blue Gene/L environment” on page 406 and the no-pty option in the authorized_keys file.
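As a rough, generic OpenSSH illustration of that technique (this snippet is not taken from the referenced section, and the key material is a shortened placeholder), prefixing a public key with the no-pty option in the user's ~/.ssh/authorized_keys file on the SN allows remote command execution with that key while refusing an interactive terminal:

# ~/.ssh/authorized_keys on the SN (one entry per line; key shortened)
no-pty ssh-rsa AAAAB3NzaC1yc2EAAA...placeholder... test1@bglfen1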

Environment setup for mpirun

You can use the mpirun command to submit (parallel) jobs to the Blue Gene/L system. You must set up the correct environment for mpirun to function properly. This section presents the tasks that you need to accomplish to set up mpirun correctly.

Front-End Nodes (FENs)

You can set up the Front-End Node in two ways:

► Export the MMCS_SERVER_IP variable in ~/.bashrc or ~/.cshrc (depending on the user shell) using one of the following commands:
  – export MMCS_SERVER_IP=SN_IP_Addr   (in ~/.bashrc)
  – setenv MMCS_SERVER_IP SN_IP_Addr   (in ~/.cshrc)

► Use one of the numerous options of the mpirun command. For example, you can use the -host option to specify the SN when submitting a job instead of setting the MMCS_SERVER_IP environment variable.

In addition to MMCS_SERVER_IP, there are other variables that you can set in the user’s shell to control several aspects of job execution (however, most options can be overridden with mpirun command line arguments when submitting jobs from the FENs).

Service Node

There are three basic settings that are required in the user environment in order to execute a job successfully using mpirun:

► BRIDGE_CONFIG_FILE

  The bridge configuration file lists the images that are required to boot a dynamic partition on the Blue Gene/L system. When submitting jobs, you can use either the -partition or the -shape command line argument. You can use the -partition option only if you have a predefined partition stored in the Blue Gene/L database. If you want to specify the shape of the partition (the configuration of the Blue Gene/L internal networks) instead, use the -shape parameter (for a dynamically generated partition).

► DB_PROPERTY

  This setting is used by the mpirun back-end to access the database on the SN when checking the availability of the requested partition and its state. For information about block states, refer to Table 4-3 on page 162.

► Sourcing the db2profile

  The db2profile (script file) sets the DB2 database environment variables (including, but not limited to, the binaries and library path for the back-end mpirun). You should set these variables for every user on the system in their ~/.bashrc or ~/.cshrc files, as shown in Example 4-7.

Example 4-7   Contents of the bridge config file and the db.properties file

test1@bglfen1:~> cat bridge_config_file.txt
BGL_MACHINE_SN
BGL_MLOADER_IMAGE
BGL_BLRTS_IMAGE
BGL_LINUX_IMAGE
BGL_RAMDISK_IMAGE

test1@bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin> cat db.properties
database_name=bgdb0
database_user=bglsysdb
database_password=bglsysdb
# database_password=db24bgls
database_schema_name=bglsysdb
system=BGL
min_pool_connections=1
# Web Console Configuration
mmcs_db_server_ip=127.0.0.1
mmcs_db_server_port=32031
mmcs_max_reply_size=8029
mmcs_max_history_size=2097152
mmcs_redirect_server_ip=default
mmcs_redirect_server_port=32032

The default bridge configuration file is /bgl/BlueLight/ppcfloor/bglsys/bin/bridge.config.

These environment variables are required for every user on the Blue Gene/L system before submitting jobs. Example 4-8 shows a sample ~/.bashrc file for user test1.

Example 4-8   Simple ~/.bashrc file

test1@bglfen1:~> cat ~/.bashrc
hstnm=`hostname`
if [ $hstnm = "bglsn" ]
then
    export BRIDGE_CONFIG_FILE=/bglhome/test1/bridge_config_file.txt
    export DB_PROPERTY=/bgl/BlueLight/ppcfloor/bglsys/bin/db.properties
    source /bgl/BlueLight/ppcfloor/bglsys/bin/db2profile
fi

Job setup validation

After you have configured the remote shell and set up the environment, you can test user applications using the test that is provided with mpirun. Example 4-9 shows an mpirun check across the FENs and SN. An Exit status: 0 message indicates that the environment is configured properly.

Example 4-9   mpirun using the -only_test_protocol argument

test1@bglfen1:/bglscratch/test1> mpirun -only_test_protocol -exe /bglscratch/test1/hello-world.rts -np 128 -verbose 1
FE_MPI (Info) : Initializing MPIRUN
FE_MPI (Info) : Scheduler interface library loaded
FE_MPI (WARN) : ======================================
FE_MPI (WARN) : = Front-End - Only checking protocol =
FE_MPI (WARN) : = No actual usage of the BG/L Bridge =
FE_MPI (WARN) : ======================================
BE_MPI (WARN) : ======================================
BE_MPI (WARN) : = Back-End - Only checking protocol  =
BE_MPI (WARN) : = No actual usage of the BG/L Bridge =
BE_MPI (WARN) : ======================================
BRIDGE (Info) : The machine serial number (alias) is BGL
FE_MPI (Info) : Back-End invoked:
FE_MPI (Info) : - Service Node: bglsn
FE_MPI (Info) : - Back-End pid: 6806 (on service node)
FE_MPI (Info) : Preparing partition
FE_MPI (Info) : Adding job
FE_MPI (Info) : Job added with the following id: 123
FE_MPI (Info) : Starting job 123
FE_MPI (Info) : Waiting for job to terminate
FE_MPI (Info) : BG/L job exit status = (0)
FE_MPI (Info) : Job terminated normally
BE_MPI (Info) : Starting cleanup sequence
BE_MPI (Info) : == BE completed ==
select: Interrupted system call
FE_MPI (Info) : == FE completed ==
FE_MPI (Info) : == Exit status: 0 ==

Note: The mpirun command has numerous command line arguments. To find out about them in detail, refer to the mpirun user guide that was installed during your Blue Gene/L system setup.

Job tracking using the Web interface

After the job environment has been configured, you can proceed to job submission. When you have installed and configured the Blue Gene/L drivers, you can use the Web interface to check your system. You can access the Web interface by pointing your Web browser to the following URL:

http://<SN hostname>/web/index.php

For more details about the Web interface, see 3.3, “Web interface” on page 119.

Note: On the SN, the system administrator should set up the Web interface (Web server configuration) to allow browsing from the local network. For more information, refer to 2.3.3, “Check that BGWEB is running” on page 86.

System administrators define the set of partitions (blocks) based on the Blue Gene/L system configuration at their site. For example, on a four-rack system, the set of predefined blocks can include 32-node blocks, 128-node blocks, 2-rack, and 4-rack blocks, apart from the normal midplane and rack configurations. This setup is very much dependent on the site configuration. When this configuration is defined, the user community can browse the Web site to see the availability of the partitions and their states.

Figure 4-3 shows the Web interface.

Figure 4-3   Web interface

In Figure 4-3, clicking Runtime displays the total number of available predefined partitions and the dynamically created partitions on the system. Each row contains information about a partition’s state, as shown in Figure 4-4.

Figure 4-4   Partitions available on the system

Note: When you submit a job using mpirun with the -shape option or through IBM LoadLeveler, the partition is created dynamically and the name of the allocated block starts with RMP.

Table 4-2 briefly describes the columns shown in Figure 4-4.

Table 4-2   Description of the job information Web page

Column                Description
Block ID              Indicates the name of the partition or block
Owner                 Indicates who created the block
Description           Indicates how the block was created (using LoadLeveler or predefined in the database)
Status                Indicates the status of the block
Time status updated   Indicates the last time the status of the block was modified
Time created          Indicates the time that the block was created
Size                  Indicates the size of the block (32, 128, 512, 1024, 2048, and so forth)
Job status            Indicates the job’s current status (refer to Table 4-4 on page 162)

Before submitting a job, you should check the job and block Web interface to see which partitions are free for use. Another way to check the availability of the blocks is to use the DB2 command line interface on the SN (which can be done only by the system administrator), as shown in Example 4-10.

Example 4-10   The DB2 CLI command to get the block information on the SN

bglsn:~ # db2 'connect to bgdb0 user bglsysdb using bglsysdb'

   Database Connection Information

 Database server        = DB2/LINUXPPC 8.2.3
 SQL authorization ID   = BGLSYSDB
 Local database alias   = BGDB0

bglsn:~ # db2 "select substr(blockid,1,16)blockid,STATUS,OWNER from BGLBLOCK"

BLOCKID          STATUS OWNER
---------------- ------ ---------------------------
R000_128         I      root
R000_J102_32     F
R000_J104_32     F
R000_J106_32     F
R000_J108_32     F

  5 record(s) selected.

The Web page shown in Figure 4-4 also includes links to Job Information and Block Information. Table 4-3 describes the block states.

Table 4-3   Block states on the Blue Gene/L system

Block state      Description
(E)rror          An initialization error has occurred. You must issue an allocate, allocate_block, or free command to reset the block.
(F)ree           The block is available to allocate.
(A)llocated      The block has been allocated by a user. IDo connections have been established, but the block has not been booted.
(C)onfiguring    The block is in the process of being booted. This is an internal state for communicating between LoadLeveler and MMCS DB.
(B)ooting        The block is in the process of being booted but has not yet reached the point at which CIOD has been contacted on all I/O nodes.
(I)nitialized    The block is ready to run application programs. CIODB has contacted CIOD on all I/O nodes.
(D)e-allocating  The block is in the process of being freed. This is an internal state for communicating between LoadLeveler and MMCS DB.

Until now, we have discussed the partition states. Initially, when mpirun boots the partition, the job state is not known. However, after the block is booted, the job status changes from Queued to Start, then to Running and, finally, to Terminated. Table 4-4 describes the job status types after the block has booted.

Table 4-4   Blue Gene/L job states (without IBM LoadLeveler)

Job status     Description
(E)rror        An initialization error occurred. The _errtext field contains the error message.
(Q)ueued       The job has been created in the BGLJOB database table but has not yet started.
(S)tart        The job has been released to start but has not yet started running.
(R)unning      The job is running.
(T)erminated   The job has ended.
(D)ying        The job has been killed but has not yet ended.

Note: The job states shown in Table 4-4 differ from those of jobs submitted through LoadLeveler. Refer to 4.4, “IBM LoadLeveler” on page 167 for more information.

4.3.3 Example of submitting a job using mpirun

Each job or application runs in its own block and takes over the entire block. (A block cannot be shared, entirely or partially, at any time between two different jobs.) Therefore, each job can take anywhere from 32 compute nodes (the smallest block size) up to the entire system.

Note: The details of the job submission process differ depending on whether you are using a job scheduler such as LoadLeveler, directly invoking mpirun, or submitting jobs on the SN through mmcs_db_console.

Submitting a job to the Blue Gene/L system using mpirun requires several command line arguments, as shown in Example 4-11. In this example, we use rsh, which is the default remote command program.

Example 4-11   Job submission using the mpirun command (by default uses rsh)

test1@bglfen1:/bglscratch/test1> mpirun -partition <blockID> -np <number of processes> -exe <executable> -cwd <working directory> -args <program arguments>


Figure 4-5 shows a diagram of the job submission process using mpirun.

Figure 4-5   Job submission process

The job submission process is as follows:<br />

1. The user edits, compiles, and so forth the job code on one of the FENs.<br />

2. The user moves the job to the Cluster Wide File System (CWFS). The CWFS<br />

can be any supported type of file system, such as the NFS or GPFS that is<br />

available to the FENs, the SN, and the I/O nodes. The job and work files can<br />

reside permanently on the CWFS or can be copied there from the FEN’s local<br />

file system.<br />

3. The job scheduler assigns a block to the user.<br />

4. Alternatively to step 3, the system administrator working through the SN can<br />

assign a block to the user.<br />

5. The SN initializes the block (that is, it boots the Compute Nodes and I/O nodes).

6. The block’s I/O nodes load the job code and data from the CWFS and begin<br />

its execution.<br />

7. All job I/O runs to and from the CWFS. The job executes on the Compute<br />

Node. Its data travels across the collective network to the I/O node and<br />

across the Functional Network (Gigabit Ethernet) to the CWFS.<br />



Job runtime information (Web interface)<br />

After the job is submitted using mpirun, you can view its status through the

Web interface. Figure 4-6 shows information about the jobs’ state.<br />

Figure 4-6 Information about each running job<br />

Table 4-5 explains the columns shown in Figure 4-6.<br />

Table 4-5 Job information columns<br />

Column                 Description

Job ID                 The ID of the job currently running on the system

Block ID               The partition on which the job is running

User name              The user who is running the job

Job name               The mpirun job name (indicates from which FEN the job was submitted)

Mode                   The mode under which the job is or will be running (mpirun option)

Status                 The job status (see the discussion of Table 4-4)

Status last modified   The time of the last state change (for example, from Q to S to R to T)



Users and system administrators can check the job status and obtain detailed<br />

information by following the Job ID link. Figure 4-7 shows a sample job (ID 240).<br />

Figure 4-7 Detailed description of a sample job<br />

The following paragraphs provide an overview of the mpirun checklist for the<br />

required options to submit a job on the Blue Gene/L system.<br />

The mpirun checklist<br />

Use this mpirun checklist for the required options to submit a job on the Blue<br />

Gene/L system:<br />

► Check the MMCS_SERVER_IP environment variable on the FEN (echo $MMCS_SERVER_IP). If this variable is empty or is not set to the SN IP address, set it. Refer to the FEN

environment setup as described in “Environment setup for mpirun” on<br />

page 155.<br />



► Check the bridge configuration and database properties files and their corresponding environment variables ($BRIDGE_CONFIG_FILE and $DB_PROPERTY). If these variables are empty or the files have the wrong contents, correct the files and then set the variables.

► Check whether the db2profile file (which is in the<br />

/bgl/BlueLight/ppcfloor/bglsys/bin directory) is sourced on the SN.<br />

► Set and verify the remote command execution environment as described in<br />

“Remote shell (rsh/rshd)” on page 151 or “Secure shell (ssh/sshd)” on<br />

page 154.<br />

► Test the environment by using mpirun with the -only_test_protocol option to check the job submission path through the FEN and SN (refer to Example 4-9 and the sketch that follows).
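For example, the following minimal check (a sketch from our test environment; the partition name R000_128 is only an example, so substitute a partition that exists on your system) verifies the variables and exercises the FEN-to-SN path without booting a partition or running a job:

# Confirm that the Blue Gene/L related variables are set (empty output means not set).
echo "MMCS_SERVER_IP=$MMCS_SERVER_IP"
echo "BRIDGE_CONFIG_FILE=$BRIDGE_CONFIG_FILE"
echo "DB_PROPERTY=$DB_PROPERTY"
# Exercise only the mpirun front-end and back-end communication path.
mpirun -partition R000_128 -only_test_protocol -verbose 2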

4.4 IBM LoadLeveler<br />

IBM LoadLeveler is a software package that provides utilities for job submission<br />

and scheduling (workload management) on various UNIX platforms. Blue<br />

Gene/L is one of the platforms supported by IBM LoadLeveler.<br />

Due to the characteristics of the Blue Gene/L platform, to use LoadLeveler, in<br />

addition to the basic LoadLeveler knowledge, you need information that is<br />

specific to the Blue Gene/L environment. You can find information about<br />

LoadLeveler in the official IBM manual IBM LoadLeveler Using and<br />

Administering <strong>Guide</strong>, SA22-7881. LoadLeveler software comes separately from<br />

the Blue Gene/L software. Depending on the system type, a different version of<br />

LoadLeveler is installed. For the latest documentation, see:<br />

http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp?topic=<br />

/com.ibm.cluster.loadl.doc/llbooks.html<br />

In this section, we present an overview of how LoadLeveler works in the Blue<br />

Gene/L environment. This information is essential for identifying and analyzing<br />

problems.<br />

4.4.1 LoadLeveler overview<br />

A LoadLeveler cluster is a collection of nodes (stand-alone systems and LPARs)<br />

that you can use to run your computing jobs. LoadLeveler manages the defined<br />

resources and schedules jobs based on resource availability and workload<br />

characteristics. Jobs are run on the nodes. LoadLeveler uses a central instance<br />

for managing the entire cluster. Thus, this can be considered a management<br />

domain type of cluster.<br />



A node that is part of a LoadLeveler cluster has a file structure that contains the LoadLeveler code and configuration. Some nodes in the cluster are assigned special functions that are carried out by daemons. The most important daemons are Master, Negotiator (also known as the Central Manager), Schedd, and Startd.

Figure 4-8 illustrates the daemons and their relationship on a single-node<br />

LoadLeveler cluster. The job to be submitted goes to the Central Manager, which<br />

dispatches it to the Scheduler. After analyzing the job characteristics<br />

(requirements) and checking for available resources, the Scheduler sends the<br />

job to the Start daemon. The Start daemon starts the Starter process to run the<br />

job.<br />

In the case of a parallel job, the Start daemon initiates multiple Starters. Each Starter

runs a parallel task. However, this is true only in the case of “classic” parallel<br />

systems, which run their jobs on SMP machines (each node running a full copy<br />

of the operating system), clustered by a traditional clustering infrastructure. For<br />

Blue Gene/L, LoadLeveler works in a different way, because it does not interact<br />

with Compute nodes (which do not provide the kernel support for running<br />

multiple processes at the same time) or I/O nodes.<br />


Figure 4-8 LoadLeveler daemons on a single-node cluster<br />

To expand a single-node cluster into a multi-node cluster, multiple instances of<br />

the daemons can run on each node. Not all the daemons are required to run on<br />

every node. Depending on which daemons are running on a node, the node has<br />

a specific role in a LoadLeveler cluster.<br />
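To see which role a particular node plays, you can check which LoadLeveler daemons are running on it. A minimal check (the LoadL_* process names shown are the usual daemon names and can vary by release):

# List the LoadLeveler daemons running on this node.
ps -ef | grep LoadL_ | grep -v grep
# Typical daemons: LoadL_master, LoadL_negotiator, LoadL_schedd, LoadL_startd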



The most important node is the Central Manager node, which runs the Central<br />

Manager daemon. There is only one Central Manager instance in a LoadLeveler<br />

cluster, even though one (or more) alternate Central Managers can be defined<br />

for failover and recovery purposes. The remainder of the nodes in the cluster<br />

have the option of running the Scheduler daemon, the Start daemon, or both.<br />

Figure 4-9 shows LoadLeveler daemons running on different nodes in a<br />

multi-node cluster.<br />

Figure 4-9 A multi-node LoadLeveler cluster

Note: The LoadLeveler Central Manager daemon serves as the single point of<br />

control, storage, and management of the cluster and job information. This<br />

daemon must be running for the LoadLeveler cluster to function.<br />

It is possible to have a mixed LoadLeveler cluster, that is, a cluster in which nodes run different operating systems. The operating systems supported by LoadLeveler are IBM AIX, SUSE Linux, and Red Hat Linux. For details on versions

that are supported, check the IBM LoadLeveler readme files that come with the<br />

product.<br />

Note: In a mixed cluster, the binary files are not compatible between different<br />

operating systems. You must configure the cluster so that jobs are scheduled<br />

on the appropriate nodes. See IBM LoadLeveler Using and Administering<br />

<strong>Guide</strong>, SA22-7881 for further information.<br />



4.4.2 Principles of operation in a Blue Gene/L environment<br />

In a Blue Gene/L environment, the LoadLeveler code runs on the SN and several FENs. These nodes form a LoadLeveler cluster that consists of Linux nodes with the usual LoadLeveler configuration (and nothing specific to Blue Gene/L).

Because LoadLeveler does not run on the Blue Gene/L system or the I/O nodes,<br />

it relies on the bridge API from the SN to access these nodes. (Figure 4-10<br />

illustrates that the Blue Gene/L system is not actually part of the LoadLeveler<br />

cluster.) LoadLeveler calls the bridge API functions through the control server to<br />

set up the Blue Gene/L partitions. After the partitions are created, the job<br />

information is passed to the mpirun front-end and back-end programs, which<br />

have the role of submitting the job to the Blue Gene/L system.<br />

Figure 4-10 The Blue Gene/L nodes are outside of the LoadLeveler cluster

The LoadLeveler Central Manager daemon (CM or negotiator) has been<br />

enhanced with bridge API calls for Blue Gene/L operations. It uses the bridge<br />

API function calls to query the partition information from the Blue Gene/L<br />

database and uses these function calls to carry out operations such as:<br />

► Adding or removing a partition record<br />

► Sending operations to a service controller<br />

► Checking the status of the partitions<br />



The system administrator has to define the Central Manager on the SN when configuring the LoadLeveler cluster. Thus, the SN must be part of the LoadLeveler cluster. Aside from the Blue Gene/L file structure and the database, the LoadLeveler file structures and locations are set up similarly to a regular LoadLeveler cluster.

Note: Although you can configure an alternate Central Manager (as in a<br />

regular LoadLeveler cluster), failover to a node other than the SN is not<br />

desirable. As a result of a failover, the alternate Central Manager no longer

has access to the Blue Gene/L control server and database.<br />

The LoadLeveler Scheduler daemon (schedd) schedules jobs according to<br />

workloads and resources on each node in the cluster. In a Blue Gene/L system,<br />

schedd treats mpirun jobs as common jobs (not Blue Gene/L specific) that run on<br />

the SN or FENs. However, the SN is reserved for Blue Gene/L administrative<br />

workload. Therefore, it is not desirable that schedd runs on the SN. The system<br />

administrator can choose to run schedd on all or some FENs. Thus, schedd is not<br />

shown in Figure 4-10 and Figure 4-11 on page 171.<br />

The scheduler (schedd) does not schedule the job on the Blue Gene/L compute<br />

nodes. It decides which FEN runs the mpirun and passes the job information to<br />

the LoadLeveler Start daemon (startd) on that node.<br />

Figure 4-11 Central Manager accesses Blue Gene/L nodes through the bridge API

Usually, the FENs are where users submit jobs into the LoadLeveler queue. Not<br />

all of the FENs need to run schedd either. However, the LoadL_admin file<br />

specifies at least one node as the public scheduler (schedd). The LoadLeveler<br />



Start daemon (startd) runs on each FEN. The startd daemon receives mpirun<br />

information from schedd. Then, startd starts the LoadLeveler Starter process<br />

(starter), which starts the mpirun job on the local node.<br />

Figure 4-12 shows the output from the LoadLeveler llstatus command with the<br />

LoadLeveler cluster running.<br />

Figure 4-12 The llstatus output on a Blue Gene/L system<br />

4.4.3 How LoadLeveler plugs into Blue Gene/L<br />

The LoadLeveler software works similarly on many platforms. The following<br />

tasks apply specifically to Blue Gene/L:<br />

► Configuring LoadLeveler for Blue Gene/L<br />

► Making the Blue Gene/L libraries available to LoadLeveler<br />

► Setting Blue Gene/L specific environment variables<br />

► Using Blue Gene/L specific keywords in job command file<br />

In the following sections, we discuss these tasks in more detail.<br />

4.4.4 Configuring LoadLeveler for Blue Gene/L<br />

To enable LoadLeveler to recognize a Blue Gene/L system, the LoadL_config file<br />

must contain specific keywords, as shown in Example 4-13 with the<br />

recommended values. Only the LoadLeveler system administrator can change<br />

these keywords in LoadL_config, and they should not be changed while<br />

LoadLeveler is running. See also IBM LoadLeveler Using and Administering<br />

<strong>Guide</strong>, SA22-7881.<br />



Example 4-13 Blue Gene/L specific configuration keywords<br />

BG_ENABLED = true<br />

BG_CACHE_PARTITIONS = true

BG_ALLOW_LL_JOBS_ONLY = false<br />

BG_MIN_PARTITION_SIZE = 32<br />

The keyword BG_ENABLED is essential. Setting it to true tells LoadLeveler that this<br />

is a Blue Gene/L cluster. LoadLeveler then uses the Blue Gene/L bridge API to<br />

talk with the Blue Gene/L control system.<br />

Setting the keyword BG_CACHE_PARTITIONS to true tells LoadLeveler to reuse<br />

existing partitions which have been previously allocated by LoadLeveler.<br />

The keyword BG_ALLOW_LL_JOBS_ONLY is set to false to allow users to run mpirun

programs without using LoadLeveler.<br />

The keyword BG_MIN_PARTITION_SIZE specifies that the smallest number of<br />

computing nodes allowed in a partition is 32.<br />
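To confirm which values are actually configured, you can inspect the configuration file directly. The following is a sketch; the path assumes the common default location of LoadL_config in the loadl user's home directory, which might differ on your system:

# Display the Blue Gene related keywords from the LoadLeveler configuration.
grep -i "^BG_" /home/loadl/LoadL_config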

4.4.5 Making the Blue Gene/L libraries available to LoadLeveler<br />

The Blue Gene/L bridge API library is provided with the Blue Gene/L code,<br />

together with the DB2 libraries. In order for LoadLeveler to access these libraries,<br />

the system administrator needs to set up appropriate symbolic links.<br />

On the SN or FEN, these libraries are usually referenced in the /usr/lib64 or /usr/lib directories. However, the actual library binary can reside in another location. Instead of copying the binaries to /usr/lib64 or /usr/lib, using symbolic

links avoids the situation where two different binaries of the same library exist in<br />

two separate locations.<br />

Note: On the other hand, symbolic links might be broken when there are changes in software levels. Broken links are sometimes not detected and can cause

problems.<br />

Therefore, the libraries used by LoadLeveler are listed for reference purposes.<br />

You should check the links and the binary checksums when users encounter<br />

errors. A brief description of each library is provided to help identify problems.<br />

Note: The library directory paths and names can be system-specific. You can<br />

use the ldconfig and ldd commands to check library links and dynamic<br />

dependencies.<br />
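For example, the following commands (a sketch that uses the library names listed in this section) confirm that the links resolve and that the dynamic dependencies of the bridge library can be found:

# List the library links; ls -L reports an error for any link whose target is missing.
ls -lL /usr/lib64/libbglbridge.so /usr/lib64/libbgldb.so /usr/lib64/libdb2.so /usr/lib64/libllapi.so
# Show the dynamic dependencies of the bridge library; "not found" entries indicate a problem.
ldd /usr/lib64/libbglbridge.so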



The binaries are located in /bgl/BlueLight/ppcfloor/bglsys/lib64. Their symbolic<br />

links are in /usr/lib64. The following group of libraries belongs to the 64-bit code<br />

version:<br />

libbgldb.so, libtableapi.so   Provide the interface to the database tables on the SN for the bridge API.

libbglmachine.so              Provides the interface to the Blue Gene/L hardware.

libbglbridge.so               Provides a 64-bit bridge API library that is used to access the state of and send orders to the Blue Gene/L system.

libsaymessage.so              Provides a 64-bit library that is used by Blue Gene/L software for producing log messages.

Some libraries come in both 64-bit and 32-bit versions (from both DB2 and<br />

LoadLeveler). Although the library names are the same, you should link the 64-bit version in /usr/lib64 and the 32-bit version in /usr/lib.

We use libdb2.so as an example. You can find the file libdb2.so in<br />

/opt/IBM/db2/V8.1/lib64 and /opt/IBM/db2/V8.1/lib. Create proper links to point to<br />

the appropriate version.<br />

The libdb2.so library is a standard 64-bit DB2 client library. It is used by 64-bit<br />

programs to connect to the DB2 database for queries and updates.<br />

The LoadLeveler libraries that are located in /opt/ibmll/LoadL/full/lib are:<br />

► libllapi.so<br />

► libsched_if.so<br />

► libsched_if32.so<br />

► libllpoe.so<br />

The first two libraries are in 64-bit format and need to be linked to /usr/lib64. The<br />

remaining two libraries are in 32-bit format and need to be linked to /usr/lib.<br />

libllapi.so                        Provides LoadLeveler's 64-bit API library. It is used by the LoadLeveler daemons, commands, and external 64-bit programs that need to access the LoadLeveler API.

libsched_if.so, libsched_if32.so   Include the interfaces between mpirun and LoadLeveler. The mpirun program uses these API calls to get job parameters from LoadLeveler (the partition in which to start, and so on).

libllpoe.so                        Provides a 32-bit version of libllapi.so. Although the binary name is libllpoe.so, point the link to /usr/lib/libllapi.so.



Note: In a conventional LoadLeveler installation (using rpm and the install_ll<br />

script that is provided), libllpoe.so, libsched_if.so, and libsched_if32.so are<br />

copied to the appropriate directories and do not need to be linked.

Example 4-14 shows a sample script that sets up the required links.

Example 4-14 LoadLeveler script to set up required links

DVR_DIR=/bgl/BlueLight/ppcfloor<br />

cd /usr/lib64<br />

ln -f -s /opt/IBM/db2/V8.1/lib64/libdb2.so.1 libdb2.so.1<br />

ln -f -s libdb2.so.1 libdb2.so<br />

ln -f -s $DVR_DIR/bglsys/lib64/libbgldb.so.1 libbgldb.so.1<br />

ln -f -s libbgldb.so.1 libbgldb.so<br />

ln -f -s $DVR_DIR/bglsys/lib64/libtableapi.so.1 libtableapi.so.1<br />

ln -f -s libtableapi.so.1 libtableapi.so<br />

ln -f -s $DVR_DIR/bglsys/lib64/libbglmachine.so.1 libbglmachine.so.1<br />

ln -f -s libbglmachine.so.1 libbglmachine.so<br />

ln -f -s $DVR_DIR/bglsys/lib64/libbglbridge.so.1 libbglbridge.so.1<br />

ln -f -s libbglbridge.so.1 libbglbridge.so<br />

ln -f -s $DVR_DIR/bglsys/lib64/libsaymessage.so.1 libsaymessage.so.1<br />

ln -f -s libsaymessage.so.1 libsaymessage.so<br />

ln -f -s /opt/ibmll/LoadL/full/lib/libllapi.so libllapi.so.1<br />

ln -f -s libllapi.so.1 libllapi.so<br />

cd /usr/lib<br />

ln -f -s /opt/IBM/db2/V8.1/lib/libdb2.so.1 libdb2.so.1<br />

ln -f -s libdb2.so.1 libdb2.so<br />

4.4.6 Setting Blue Gene/L specific environment variables<br />

You can use the following variables in the Blue Gene/L environment. You set<br />

these variables to point to the appropriate configuration files and SN. The<br />

variables are:<br />

► BRIDGE_CONFIG_FILE<br />

► DB_PROPERTY<br />

► MMCS_SERVER_IP<br />



Because these variables are user specific, you can set them in the ~/.bashrc file

under the user’s home directory. Example 4-15 shows the content of the<br />

~/.bashrc that we used in our test environment.<br />

Example 4-15 Setting up environment variables with ~/.bashrc<br />

test1@bglsn_~/>cat ~/.bashrc<br />

# .bashrc<br />

# User specific aliases and functions<br />

# Source global definitions<br />

if [ -f /etc/bashrc ]; then<br />

. /etc/bashrc<br />

fi<br />

source /bgl/BlueLight/ppcfloor/bglsys/bin/db2profile<br />

. /bgl/BlueLight/ppcfloor/bglsys/bin/db.properties<br />

export BRIDGE_CONFIG_FILE=/bgl/BlueLight/ppcfloor/bglsys/bin/bridge.config

export DB_PROPERTY=/bgl/BlueLight/ppcfloor/bglsys/bin/db.properties<br />

export MMCS_SERVER_IP=bglsn.itso.ibm.com<br />

The bridge.config file contains the Blue Gene/L system serial number and the<br />

names and locations of the images to be loaded onto compute and I/O nodes.<br />

The db.properties file contains DB2 database information and the Blue Gene/L<br />

Web console configuration.<br />

Note: These environment variables are required for the LoadLeveler<br />

commands to work properly. Checking the values of these variables and the<br />

contents of the configuration files is an important task in problem<br />

determination.<br />
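A quick way to perform this check is shown in the following sketch, which only confirms that the variables are set, that the files they point to are readable, and that the SN is reachable:

# Verify the environment variables and the files they reference.
echo "BRIDGE_CONFIG_FILE=$BRIDGE_CONFIG_FILE"
test -r "$BRIDGE_CONFIG_FILE" || echo "ERROR: bridge.config missing or unreadable"
echo "DB_PROPERTY=$DB_PROPERTY"
test -r "$DB_PROPERTY" || echo "ERROR: db.properties missing or unreadable"
echo "MMCS_SERVER_IP=$MMCS_SERVER_IP"
ping -c 1 "$MMCS_SERVER_IP"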

4.4.7 LoadLeveler and the Blue Gene/L job cycle<br />

LoadLeveler is a complex job scheduling subsystem. A job can go through<br />

multiple stages from the time it is submitted into the queue until it finishes. A job<br />

in the queue can frequently be seen in one of these states: Idle, Starting,<br />

Running, Held, Remove Pending, and so on. For detailed information, see IBM

LoadLeveler Using and Administering <strong>Guide</strong>, SA22-7881.<br />



Table 4-6 provides a quick overview of the LoadLeveler job states, which are<br />

referred to in the problem determination process.<br />

Table 4-6 Summary of job states in LoadLeveler<br />

(I)dle
  Brief description: The job is being considered to run on a system, although no system has been selected.
  Remarks: The system here is a FEN, not the Blue Gene/L system or nodes.

User (H)old
  Brief description: The job has been put on user hold.
  Remarks: There are many reasons why a job might be put on hold.

(ST)arting
  Brief description: The job is starting: it has been dispatched, received by the target system, and the job environment is being set up.
  Remarks: Job information is being passed to mpirun.

(R)unning
  Brief description: The job is running: it has been dispatched and started on the designated system.
  Remarks: LoadLeveler has completed passing the job information to mpirun. See the (different) Blue Gene/L job states.

Remove Pending (RP)
  Brief description: The job is in the process of being removed.
  Remarks: The mpirun job is being removed according to the Blue Gene/L job status.

A LoadLeveler job in Blue Gene/L goes through different states. During the job<br />

starting process, the job information is passed to the mpirun front end. At this<br />

point, the job is in Running state in the LoadLeveler queue. However, the mpirun<br />

process picks up the job from there and starts the tasks on a Blue Gene/L<br />

partition. Then, the Blue Gene/L job goes through different states in the partition.

Note: The Blue Gene/L job status is different from the LoadLeveler job states<br />

(see Table 4-6 and Table 4-4 on page 162).<br />

The user who submits a job into the LoadLeveler queue usually checks the job

status through LoadLeveler commands. However, using LoadLeveler commands<br />

you cannot see the Blue Gene/L partition and job status. The system<br />

administrator has access to the Blue Gene/L service console to check the Blue<br />

Gene/L database.<br />

The user and system administrator have two different views of the job. Seeing<br />

where the job is in its life cycle can help determine its status. Also, if a job fails,<br />

the user and system administrator have to trace back to determine where the<br />

failure has occurred.<br />



4.4.8 LoadLeveler job submission process<br />

Throughout the steps in the LoadLeveler job submission process, we use a<br />

simple mpirun job running a “Hello world!” program. Example 4-16 lists a sample<br />

job command file that we used in our environment. See “Job command file” on

page 198 for brief descriptions of the keywords that this file uses.<br />

Example 4-16 A sample LoadLeveler job command file named hello.cmd<br />

#@ job_type = bluegene<br />

##@ executable = /usr/bin/mpirun<br />

#@ executable = /bgl/BlueLight/ppcfloor/bglsys/bin/mpirun<br />

#@ bg_size = 128<br />

#@ arguments = -verbose 4 -exe /bgl/hello/hello.rts<br />

#@ output = /bgl/loadl/out/hello.$(schedd_host).$(jobid).$(stepid).out<br />

#@ error = /bgl/loadl/out/hello.$(schedd_host).$(jobid).$(stepid).err<br />

#@ environment = COPY_ALL:BRIDGE_CONFIG_FILE=/bgl/BlueLight/ppcfloor/bglsys/bin/bridge.config:DB_PROPERTY=/bgl/BlueLight/ppcfloor/bglsys/bin/db.properties:MMCS_SERVER_IP=bglsn.itso.ibm.com

#@ notification = error<br />

#@ notify_user = loadl<br />

#@ class = small<br />

#@ queue<br />

Note: In the steps in this process, we assume that the user is logged in to one<br />

of the FENs.<br />

The following steps detail the LoadLeveler job’s life cycle:<br />

Step 1: Submitting the job<br />

A user submits a job from an FEN using the following LoadLeveler command:<br />

llsubmit hello.cmd<br />

The command returns a message that indicates that the job has been submitted.<br />

The job is associated with a LoadLeveler Job ID, and the job information is sent<br />

to the LoadLeveler Central Manager (CM). The CM daemon runs on the SN and<br />

receives job information through IP communication (the IP port on which the CM<br />

is listening). This TCP/IP port is specified in the LoadL_config file.<br />

As shown in Example 4-17, the llq command does not display the Blue Gene/L<br />

job information at this point, because the job has only been queued into<br />

LoadLeveler.<br />



Example 4-17 Output of the llq -b command<br />

loadl@bglfen1:/bgl/loadl> llq -b<br />

Id Owner Submitted LL BG PT Partition Size<br />

________________________ __________ ___________ __ __ __ ________________ _____<br />

bglfen1.47.0 loadl 3/29 12:14 I<br />

1 job step(s) in queue, 1 waiting, 0 pending, 0 running, 0 held, 0 preempted<br />

Figure 4-13 shows the LoadLeveler job submission process on Blue Gene/L.<br />

Figure 4-13 Job submission process

Step 2: Getting Blue Gene/L information<br />

The LoadLeveler CM receives the job information (at this point, the status of the<br />

job is I for idle). Meanwhile, the CM retrieves a snapshot of the Blue Gene/L<br />

system using the bridge API (from the Blue Gene/L database). The information<br />

requested by CM includes:<br />

► A list of nodes<br />

► The locations of those nodes

► The status of those nodes<br />

The bridge API calls are responded to by the control server, which accesses the<br />

DB2 database for this information. The control server daemon also updates the<br />

database with the status of the Blue Gene/L hardware.<br />



Figure 4-14 shows the process of LoadLeveler retrieving information from the<br />

Blue Gene/L database through the bridge API.<br />


Figure 4-14 Getting Blue Gene/L information<br />

Step 3: Deciding on the partition to use<br />

With the information received from the previous step, the CM constructs a 3D<br />

model of the Blue Gene/L system in memory. The CM then uses the 3D model to<br />

determine the partition to use for the job. Part of this decision process is whether<br />

to reuse an existing partition or to create a new partition. Figure 4-15 illustrates<br />

this process.<br />

Note: LoadLeveler has full responsibility for manipulating the partitions that it

creates for running jobs.<br />



Figure 4-15 Deciding on the partition

Step 4: Updating the Blue Gene/L database<br />

The chosen partition for the job is either an existing one or a new one. If it is a<br />

new partition, it is created dynamically. The CM uses the bridge API to insert the<br />

partition record into the database (see Figure 4-16). At this point in the process,<br />

nothing happens (yet) on the Blue Gene/L system.<br />

Figure 4-16 Updating the Blue Gene/L database


Steps 5 and 6: Initializing the partition<br />

For the dynamically created partition, the CM uses the bridge API to change the<br />

partition state to allocating (A). This change in state triggers the booting of the<br />

partition under the control of the Blue Gene/L daemons (see Figure 4-17). The<br />

state of the Blue Gene/L partition now changes to I for initializing and<br />

LoadLeveler, represented by the root user on the SN, is the owner of this<br />

dynamic partition.<br />

Note: The state of the job in the LoadLeveler queue is idle (I), but the state of the

Blue Gene/L partition is initializing (also I). Although the two states are both<br />

abbreviated to I, the meaning of the state is obviously different. At the end of<br />

this step, the job in the LoadLeveler queue changes to ST for “starting,” which is

a transition state, for a very short time.<br />


Figure 4-17 Initializing the partition<br />

Step 7: Launching mpirun<br />

LoadLeveler goes ahead and schedules the job on one of the FENs. The Start<br />

daemon (startd) receives the job and initiates a Starter process. The Starter<br />

launches the mpirun front-end process, which communicates with the mpirun<br />

back-end process on the SN (Figure 4-18).<br />

At this point, the job status in the LoadLeveler queue is changed to running (R).<br />

LoadLeveler is basically done passing the job to mpirun, which now has the

responsibility to run the job.<br />



Note: The state of the mpirun job in the LoadLeveler queue is running (R).

However, the state of the Blue Gene/L partition is now changing from<br />

initializing (I) to ready (R).<br />

This step can be thought of as the point at which a user invokes the mpirun command (without using LoadLeveler). However, the mpirun front-end process needs to perform a small additional check: by calling the LoadLeveler API, it verifies that LoadLeveler is installed and can run. One of the checks is to make sure that the LoadLeveler configuration allows mpirun to run jobs outside of LoadLeveler. See the discussion of the BG_ALLOW_LL_JOBS_ONLY configuration

parameter in 4.4.4, “Configuring LoadLeveler for Blue Gene/L” on page 172.<br />

Figure 4-18 LoadLeveler launching mpirun



Steps 8 and 9: Starting the parallel job<br />

The back-end mpirun process uses the bridge API to set the partition to ready (R)<br />

state in the database, which triggers the control daemon to execute the job on<br />

the partition (see Figure 4-19).<br />

Figure 4-19 Starting the parallel job

Note: Again, the job state in the LoadLeveler queue is running (R), while the partition state that represents the job is ready (also R). Both states are

abbreviated as R but obviously have a different meaning.<br />



Step 10: Waiting for the job to complete<br />

The mpirun back-end process uses the bridge API to poll for the job status until<br />

the job is complete (see Figure 4-20). During this time, mpirun monitors job<br />

activities.<br />

Figure 4-20 Waiting for the job to complete

Note: LoadLeveler is not aware of what is going on with the job that is running<br />

in the Blue Gene/L partition. Thus, the job status in the LoadLeveler queue

remains in the running (R) state during this period.<br />

Steps 11 and 12: Cleaning the partition<br />

After mpirun receives “Job complete” status, the mpirun processes terminate,<br />

and the LoadLeveler Starter process terminates as well. The startd daemon<br />

receives the job status from the Starter process and sends it to schedd, which reports

the job status back to the CM on the SN. The CM uses the bridge API to set the<br />

state of the partition to F for free, which means that it will not be reused. At this<br />

point, the job cycle completes (see Figure 4-21).<br />



Figure 4-21 Cleaning the partition

Note: If LoadLeveler is configured to reuse partitions, then the partition is not freed. Instead, it is marked ready (R) to be reused for the next job (if it fits).

4.4.9 LoadLeveler checklist<br />


You can use the tasks presented in this section for scanning normal LoadLeveler<br />

status with attention to details. You can use these checks to spot any abnormal<br />

aspects and to investigate hard-to-find problems or pitfalls.<br />

The checklist includes:<br />

► LoadLeveler cluster and node status<br />

► LoadLeveler run queue<br />

► Job command file<br />

► LoadLeveler processes, logs, and persistent storage<br />

► LoadLeveler configuration keywords<br />

► Environment variables, network, and library links<br />

LoadLeveler cluster and node status<br />

The llstatus command displays the status of the LoadLeveler cluster.<br />

Figure 4-22 shows the following important information that is provided by the<br />

llstatus command:<br />

1. Blue Gene is present. This message means that LoadLeveler can talk with<br />

the Blue Gene/L control server.<br />



2. The Central Manager is defined on node bglsn.itso.ibm.com. This node is<br />

the Blue Gene/L service node. This message is a good indication that the CM<br />

is up and running.<br />

3. Scheduler daemons (Schedd) are available. They are ready to schedule jobs<br />

on two of the FENs.<br />

Tip: Schedd dispatches the mpirun jobs.

4. The job starting daemons (Startd) are idle. They are ready to start jobs that<br />

come their way. Startd forks a child process called Starter, which then starts<br />

the mpirun front-end process.<br />

Figure 4-22 llstatus command output<br />


loadl@bglfen1:/usr/lib64> llstatus<br />

Name Schedd InQ Act Startd Run LdAvg Idle Arch OpSys<br />

bglfen1.itso.ibm.com Avail 0 0 Idle 0 0.00 1184 PPC64 Linux2<br />

bglfen2.itso.ibm.com Avail 0 0 Idle 0 0.00 9999 PPC64 Linux2<br />

PPC64/Linux2 2 machines 0 jobs 0 running<br />

Total Machines 2 machines 0 jobs 0 running<br />

The Central Manager is defined on bglsn.itso.ibm.com<br />

The BACKFILL scheduler with Blue Gene support is in use<br />

Blue Gene is present<br />

All machines on the machine_list are present.<br />


When comparing Figure 4-22 to Figure 4-23 on page 188, you can see the

following problems in the LoadLeveler cluster:<br />

1. Blue Gene is absent. This message indicates that LoadLeveler cannot talk<br />

to the control server. As a result, jobs are not able to run.<br />

2. The Central Manager is now served from node bglfen2.itso.ibm.com, which is a node other than the service node. As noted earlier, such a failover is not desirable in a Blue Gene/L cluster, so it is worthwhile to pay attention to the error messages.

3. One of the schedd daemons is down, which means that LoadLeveler is having<br />

some problems on node bglfen1.itso.ibm.com. However, this issue could also<br />

be normal if the system administrator previously decided not to run schedd on<br />

this node.<br />



4. One of the start daemons is down (indicated by the idle status), which also means that there are some problems with LoadLeveler on node

bglfen2.itso.ibm.com.<br />

5. One node is absent. Although LoadLeveler can function with missing (or<br />

absent) nodes, the individual nodes might have important roles in the cluster.<br />


loadl@bglfen1:/usr/lib64> llstatus<br />

Name Schedd InQ Act Startd Run LdAvg Idle Arch OpSys<br />

bglfen1.itso.ibm.com Down 0 0 Idle 0 0.00 1184 PPC64 Linux2<br />

bglfen2.itso.ibm.com Avail 0 0 Idle 0 0.00 9999 PPC64 Linux2<br />

PPC64/Linux2 2 machines 0 jobs 0 running<br />

Total Machines 2 machines 0 jobs 0 running<br />

The Central Manager is defined on bglsn.itso.ibm.com, but is unusable<br />

Alternate Central Manager is serving from bglfen2.itso.ibm.com<br />


The BACKFILL scheduler with Blue Gene support is in use<br />

Blue Gene is absent<br />

The following machines are absent<br />

bglsn.itso.ibm.com<br />

Figure 4-23 <strong>Problem</strong>s reported from the llstatus command<br />

In the worst case scenario, the llstatus command does not return any<br />

information but just error messages similar to those in Example 4-18.<br />

Example 4-18 Error messages reported from llstatus regarding LoadL_negotiator errors<br />

loadl@bglfen1:/usr/lib64> llstatus<br />

03/29 16:52:50 llstatus: 2539-463 Cannot connect to bglsn.itso.ibm.com<br />

"LoadL_negotiator" on port 9614. errno = 111<br />

03/29 16:52:50 llstatus: 2539-463 Cannot connect to bglsn.itso.ibm.com<br />

"LoadL_negotiator" on port 9614. errno = 111<br />

llstatus: 2512-301 An error occurred while receiving data from the<br />

LoadL_negotiator daemon on host bglsn.itso.ibm.com.<br />

The error messages should be interpreted accordingly. For example, the following error message could be interpreted in two ways:

2539-463 Cannot connect to bglsn.itso.ibm.com "LoadL_negotiator" on port<br />

9614. errno = 111<br />



This message could mean either that the LoadLeveler Negotiator (Central Manager) daemon is down or that LoadLeveler is not running at all.
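One quick way to narrow this down is to check the daemons and the Central Manager port on the SN, as in the following sketch (port 9614 is the port reported in the error message above and is site specific):

# On the SN, check whether the LoadLeveler daemons are running.
ps -ef | grep LoadL_ | grep -v grep
# Check whether anything is listening on the Central Manager port from the error message.
netstat -an | grep 9614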

The llstatus command in Blue Gene/L<br />

The llstatus command provides Blue Gene/L specific options (command flags),

such as -b, -B, -P, which display information that is related to Blue Gene/L. (See<br />

IBM LoadLeveler Using and Administering <strong>Guide</strong>, SA22-7881, for a complete<br />

reference on the llstatus command.) Using these option flags, you can check<br />

and compare the LoadLeveler Blue Gene/L related information with information<br />

from other sources such as the Web interfaces.<br />

Issuing the llstatus -b command shows the overall dimension of the Blue<br />

Gene/L system and jobs in the queue. In Example 4-19 and Example 4-20, the<br />

llstatus -b is issued on two different Blue Gene/L systems.<br />

Example 4-19 The llstatus -b command on a one-midplane Blue Gene/L system<br />

loadl@bglfen1:~> llstatus -b<br />

Name Base Partitions c-nodes InQ Run<br />

BGL 1x1x1 8x8x8 0 0<br />

Example 4-20 The llstatus -b command on a 20-rack Blue Gene/L system<br />

thanhlam@bgwfen1:~> llstatus -b<br />

Name Base Partitions c-nodes InQ Run<br />

BGL 5x4x2 40x32x16 0 0<br />

Issuing llstatus with the -b and -l options (combined) displays more detail of<br />

the Blue Gene/L system structure, including network switches and cabling.<br />

Example 4-21 shows output from llstatus -b -l on a one-midplane Blue

Gene/L system.<br />

Note: Because this is the minimum configuration possible (one midplane), all<br />

Blue Gene/L specific networks (Torus, Barrier, and Collective) are inside the<br />

midplane. Thus, there is no additional cabling (wiring).<br />

Example 4-21 The llstatus -b -l command on a single midplane system<br />

loadl@bglfen1:~> llstatus -b -l | more<br />

Total Blue Gene Base Partitions 1<br />

Total Blue Gene Compute Nodes 512<br />

Machine Size in Base Partitons X=1 Y=1 Z=1<br />

Machine Size in Compute Nodes X=8 Y=8 Z=8<br />

-- list of base partitions --<br />



Z = 0<br />

=====<br />

+------------+<br />

| R000|<br />

0 | |<br />

| |<br />

+------------+<br />

-- list of switches --<br />

Switch ID: X_R000<br />

Switch State: UP<br />

Base Partition: R000<br />

Switch Dimension: X<br />

Switch Connections: NONE<br />

Switch ID: Y_R000<br />

Switch State: UP<br />

Base Partition: R000<br />

Switch Dimension: Y<br />

Switch Connections: NONE<br />

Switch ID: Z_R000<br />

Switch State: UP<br />

Base Partition: R000<br />

Switch Dimension: Z<br />

Switch Connections: NONE<br />

-- list of wires --<br />

Wire Id: R000X_R000<br />

Wire State: UP<br />

FromComponent=R000 FromPort=MINUS_X<br />

ToComponent=X_R000 ToPort=PORT_S0<br />

PartitionState=NONE Partition=NONE<br />

Wire Id: R000Y_R000<br />

Wire State: UP<br />

FromComponent=R000 FromPort=MINUS_Y<br />

ToComponent=Y_R000 ToPort=PORT_S0<br />

PartitionState=NONE Partition=NONE<br />

Wire Id: R000Z_R000<br />

Wire State: UP<br />

FromComponent=R000 FromPort=MINUS_Z<br />

ToComponent=Z_R000 ToPort=PORT_S0<br />

PartitionState=NONE Partition=NONE<br />

Wire Id: X_R000R000<br />

Wire State: UP<br />

FromComponent=X_R000 FromPort=PORT_S1<br />

ToComponent=R000 ToPort=PLUS_X<br />



PartitionState=NONE Partition=NONE<br />

Wire Id: Y_R000R000<br />

Wire State: UP<br />

FromComponent=Y_R000 FromPort=PORT_S1<br />

ToComponent=R000 ToPort=PLUS_Y<br />

PartitionState=NONE Partition=NONE<br />

Wire Id: Z_R000R000<br />

Wire State: UP<br />

FromComponent=Z_R000 FromPort=PORT_S1<br />

ToComponent=R000 ToPort=PLUS_Z<br />

PartitionState=NONE Partition=NONE<br />

Example 4-22 shows an extract from the output of llstatus -b -l on a 20-rack

Blue Gene/L system.<br />

Note: Because the output is very long, only part of it is captured in this example.

Example 4-22 The llstatus -b -l command on a 20-rack Blue Gene/L system<br />

thanhlam@bgwfen1:~> llstatus -b -l

Total Blue Gene Base Partitions 40<br />

Total Blue Gene Compute Nodes 20480<br />

Machine Size in Base Partitons X=5 Y=4 Z=2<br />

Machine Size in Compute Nodes X=40 Y=32 Z=16<br />

-- list of base partitions --<br />

Z = 1<br />

=====<br />

+----------------------------------------------------------------+<br />

| R011| R110| R311| R411| R210|<br />

3 | READY| READY| READY| READY| READY|<br />

| wR0| wR1| wR21R31| wR411| wR21R31|<br />

|----------------------------------------------------------------|<br />

| R031| R130| R331| R431| R230|<br />

2 | READY| READY| READY| READY| READY|<br />

| wR0| wR1| wR33| wR431| wR23|<br />

|----------------------------------------------------------------|<br />

| R021| R120| R321| R421| R220|<br />

1 | READY| READY| READY| READY| READY|<br />

| wR0| wR1| wR22R32| wR421| wR22R32|<br />

|----------------------------------------------------------------|<br />

| R001| R100| R301| R401| R200|<br />

0 | READY| READY| READY| READY| READY|<br />

| wR0| wR1| wR20R30| wR401| wR20R30|<br />

+----------------------------------------------------------------+<br />



Z = 0<br />

=====<br />

+----------------------------------------------------------------+<br />

| R010| R111| R310| R410| R211|<br />

3 | READY| READY| READY| READY| READY|<br />

| wR0| wR1| wR21R31| wR410| wR21R31|<br />

|----------------------------------------------------------------|<br />

| R030| R131| R330| R430| R231|<br />

2 | READY| READY| READY| READY| READY|<br />

| wR0| wR1| wR33| wR430| wR23|<br />

|----------------------------------------------------------------|<br />

| R020| R121| R320| R420| R221|<br />

1 | READY| READY| READY| READY| READY|<br />

| wR0| wR1| wR22R32| wR420| wR22R32|<br />

|----------------------------------------------------------------|<br />

| R000| R101| R300| R400| R201|<br />

0 | READY| READY| READY| READY| READY|<br />

| wR0| wR1| wR20R30| wR400| wR20R30|<br />

+----------------------------------------------------------------+<br />

-- list of switches --<br />

........ >>>>> Omitted lines


LoadLeveler run queue<br />

LoadLeveler is based on a job queuing principle. To identify a particular job that a<br />

user has submitted, a job identification (job ID) is returned when the job is sent<br />

successfully by the llsubmit command, as shown in Example 4-24.<br />

Example 4-24 A job ID returned by the llsubmit command<br />

loadl@bglfen1:/bgl/loadl> llsubmit hello.cmd<br />

llsubmit: The job "bglfen1.itso.ibm.com.53" has been submitted.<br />

loadl@bglfen1:/bgl/loadl> llq<br />

Id                       Owner      Submitted   ST PRI Class        Running On
------------------------ ---------- ----------- -- --- ------------ -----------
bglfen1.53.0             loadl      4/3 09:32   I  50  small

1 job step(s) in queue, 1 waiting, 0 pending, 0 running, 0 held, 0 preempted<br />

In this example, after the job is submitted successfully, the job number 53 shows<br />

up in the queue. The job ID bglfen1.53.0 is a unique identifier for this job. The<br />

number zero (0) is the job step identifier (jobstepid) in case of a job that contains<br />

more than one step.<br />

Note: The job ID that is returned by llsubmit has the long host name with the<br />

domain name, but the queue displays only the short host name with the job ID<br />

because of the limited number of characters that can fit on a line of the screen.<br />

However, LoadLeveler log files include the job ID with long host names<br />

(FQDN).<br />

In a queue that has hundreds or perhaps thousands of jobs, you can filter the llq<br />

command output with the job ID.<br />
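For example, you can pass the job ID directly to llq so that only that job is displayed (a simple filtering sketch; grep on the full listing works as well):

# Show only the job submitted from bglfen1 with ID 53.
llq bglfen1.53
# Alternatively, filter the complete queue listing.
llq | grep bglfen1.53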

The job ID is also used in other LoadLeveler commands such as llcancel. The<br />

llcancel command tells LoadLeveler to terminate the job if it is running and to<br />

remove it from the queue. For example:<br />

llcancel bglfen1.53<br />



You can use the llq -l command to get more information about a job.<br />

Another useful flag is -s, used as llq -s <job ID>. If a job is in idle (I) state,

using the -s flag tells the llq command to analyze and to display the reasons<br />

that the job cannot run at the moment (see Example 4-25).<br />

Example 4-25 Reasons for job in idle state<br />

loadl@bglfen1:/bgl/loadl> llsubmit hello.cmd<br />

llsubmit: The job "bglfen1.itso.ibm.com.54" has been submitted.<br />

loadl@bglfen1:/bgl/loadl> llq<br />

Id                       Owner      Submitted   ST PRI Class        Running On
------------------------ ---------- ----------- -- --- ------------ -----------
bglfen1.54.0             loadl      4/3 10:10   I  50  small

1 job step(s) in queue, 1 waiting, 0 pending, 0 running, 0 held, 0 preempted<br />

loadl@bglfen1:/bgl/loadl> llq -s bglfen1.54.0<br />

=============== Job Step bglfen1.itso.ibm.com.54.0 ===============<br />

Job Step Id: bglfen1.itso.ibm.com.54.0<br />

Job Name: bglfen1.itso.ibm.com.54<br />

Step Name: 0<br />

Structure Version: 10<br />

Owner: loadl<br />

Queue Date: Mon 03 Apr 2006 10:10:20 AM EDT<br />

Status: Idle<br />

Reservation ID:<br />

Requested Res. ID:<br />

Scheduling Cluster:<br />

Submitting Cluster:<br />

Sending Cluster:<br />

Requested Cluster:<br />

Schedd History:<br />

Outbound Schedds:<br />

Submitting User:<br />

Execution Factor: 1<br />

Dispatch Time:<br />

Completion Date:<br />

Completion Code:<br />

Favored Job: No<br />

User Priority: 50<br />

user_sysprio: 0<br />

class_sysprio: 0<br />

group_sysprio: 0<br />

System Priority: -157448<br />

q_sysprio: -157448<br />



Previous q_sysprio: 0<br />

Notifications: Error<br />

Virtual Image Size: 472 kb<br />

Large Page: N<br />

Checkpointable: no<br />

Ckpt Start Time:<br />

Good Ckpt Time/Date:<br />

Ckpt Elapse Time: 0 seconds<br />

Fail Ckpt Time/Date:<br />

Ckpt Accum Time: 0 seconds<br />

Checkpoint File:<br />

Ckpt Execute Dir:<br />

Restart From Ckpt: no<br />

Restart Same Nodes: no<br />

Restart: yes<br />

Preemptable: no<br />

Preempt Wait Count: 0<br />

Hold Job Until:<br />

RSet: RSET_NONE<br />

Mcm Affinity Options:<br />

Cmd: /bgl/BlueLight/ppcfloor/bglsys/bin/mpirun<br />

Args: -verbose 2 -exe /bgl/hello/hello.rts<br />

Env:<br />

In: /dev/null<br />

Out: /bgl/loadl/out/hello.bglfen1.54.0.out<br />

Err: /bgl/loadl/out/hello.bglfen1.54.0.err<br />

Initial Working Dir: /bgl/loadl<br />

Dependency:<br />

Resources:<br />

Requirements: (Arch == "PPC64") && (OpSys == "Linux2")<br />

Preferences:<br />

Step Type: Blue Gene<br />

Size Requested: 128<br />

Size Allocated:<br />

Shape Requested:<br />

Shape Allocated:<br />

Wiring Requested: MESH<br />

Wiring Allocated:<br />

Rotate: True<br />

Blue Gene Status:<br />

Blue Gene Job Id:<br />

Partition Requested:<br />

Partition Allocated:<br />

Error Text:<br />

Node Usage: shared<br />



Submitting Host: bglfen1.itso.ibm.com<br />

Schedd Host: bglfen1.itso.ibm.com<br />

Job Queue Key:<br />

Notify User: loadl<br />

Shell: /bin/bash<br />

LoadLeveler <strong>Group</strong>: No_<strong>Group</strong><br />

Class: small<br />

Ckpt Hard Limit: undefined<br />

Ckpt Soft Limit: undefined<br />

Cpu Hard Limit: undefined<br />

Cpu Soft Limit: undefined<br />

Data Hard Limit: undefined<br />

Data Soft Limit: undefined<br />

Core Hard Limit: undefined<br />

Core Soft Limit: undefined<br />

File Hard Limit: undefined<br />

File Soft Limit: undefined<br />

Stack Hard Limit: undefined<br />

Stack Soft Limit: undefined<br />

Rss Hard Limit: undefined<br />

Rss Soft Limit: undefined<br />

Step Cpu Hard Limit: undefined<br />

Step Cpu Soft Limit: undefined<br />

Wall Clk Hard Limit: 00:30:00 (1800 seconds)<br />

Wall Clk Soft Limit: 00:30:00 (1800 seconds)<br />

Comment:<br />

Account:<br />

Unix <strong>Group</strong>: loadl<br />

NQS Submit Queue:<br />

NQS Query Queues:<br />

Negotiator Messages:<br />

Bulk Transfer: No<br />

Step Adapter Memory: 0 bytes<br />

Adapter Requirement:<br />

Step Cpus: 0<br />

Step Virtual Memory: 0.000 mb<br />

Step Real Memory: 0.000 mb<br />

================= EVALUATIONS FOR JOB STEP bglfen1.itso.ibm.com.54.0 ================<br />

Not enough resources to start now:<br />

Quarter "Q1" of BP "R000" is busy.<br />

Quarter "Q2" of BP "R000" does not have a nodecard with > 0 io_nodes.<br />

Quarter "Q3" of BP "R000" does not have a nodecard with > 0 io_nodes.<br />

Quarter "Q4" of BP "R000" does not have a nodecard with > 0 io_nodes.<br />



Not enough resources for this step as top-dog:<br />

Quarter "Q1" of BP "R000" is busy.<br />

Quarter "Q2" of BP "R000" does not have a nodecard with > 0 io_nodes.<br />

Quarter "Q3" of BP "R000" does not have a nodecard with > 0 io_nodes.<br />

Quarter "Q4" of BP "R000" does not have a nodecard with > 0 io_nodes.<br />

Not enough resources to start now:<br />

Quarter "Q1" of BP "R000" is busy.<br />

Quarter "Q2" of BP "R000" does not have a nodecard with > 0 io_nodes.<br />

Quarter "Q3" of BP "R000" does not have a nodecard with > 0 io_nodes.<br />

Quarter "Q4" of BP "R000" does not have a nodecard with > 0 io_nodes.<br />

Not enough resources for this step as top-dog:<br />

Quarter "Q1" of BP "R000" is busy.<br />

Quarter "Q2" of BP "R000" does not have a nodecard with > 0 io_nodes.<br />

Quarter "Q3" of BP "R000" does not have a nodecard with > 0 io_nodes.<br />

Quarter "Q4" of BP "R000" does not have a nodecard with > 0 io_nodes.<br />

Blue Gene/L specific information for the run queue<br />

The LoadLeveler job ID is also stored in the Blue Gene/L database with the job<br />

record. You can use the DB2 select command to retrieve job information, as<br />

shown in Example 4-26.<br />

Example 4-26 Retrieving jobid from DB2 database<br />

bglsn:~ # db2 "select jobid,blockid,jobname from tbgljob_history where<br />

username='loadl'"<br />

JOBID BLOCKID JOBNAME<br />

----------- ---------------- ------------------------------------------------<br />

184 RMP22Mr154151027 bglsn.itso.ibm.com.6.0<br />

183 RMP22Mr153314019 bglsn.itso.ibm.com.5.0<br />

185 RMP22Mr160740038 bglsn.itso.ibm.com.7.0<br />

202 RMP24Mr135201017 bglfen1.itso.ibm.com.18.0<br />

203 RMP24Mr151425028 bglfen1.itso.ibm.com.19.0<br />

229 R000_128 mpirun.2954.bglfen1<br />

230 R000_128 mpirun.3140.bglfen1<br />

255 RMP03Ap112830043 bglfen1.itso.ibm.com.53.0<br />

8 record(s) selected.<br />
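If you need to map a single LoadLeveler job step back to its Blue Gene job record, you can narrow the same query. The following is a minimal sketch that reuses the tbgljob_history table and columns shown in Example 4-26; the job step name is the one from the output above and will differ on your system.

bglsn:~ # db2 "select jobid,blockid from tbgljob_history where jobname='bglfen1.itso.ibm.com.53.0'"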

Note: The dynamic conditions of the Blue Gene/L partitions and the LoadLeveler nodes make it difficult for llq to diagnose all possible problems in such an environment. However, the reasons that are displayed can serve as hints and starting points for investigating live problems on the system.



Note: The JOBID column in the Blue Gene/L database is the Blue Gene job identifier. The LoadLeveler job ID is stored in the JOBNAME column of the database table.

You can view the same job information from the Blue Gene/L Web interface.<br />

From the Blue Gene/L home page, click Runtime and then Job Information<br />

(see Figure 4-24).<br />

Figure 4-24 Job information from the Web interface<br />

Job command file<br />

To submit a job using LoadLeveler, a job command file is required by the<br />

llsubmit command. We have used a sample job command file (Example 4-16 on<br />

page 178) throughout the discussions in this chapter. We kept the number of<br />

keywords in this file to a minimum. There are many other keywords available.<br />

See IBM LoadLeveler Using and Administering Guide, SA22-7881, for a complete reference on job command file keywords.



Note: In the job command file, each keyword starts on a new line that begins with two special characters, the number sign (#) and the “at” sign (@). The number sign (#) is the same character that starts a comment in shell scripting (bash). If job_type = serial, LoadLeveler executes the job command file as though it were a shell script. The @ character tells LoadLeveler that the line contains a job command keyword to be evaluated; the shell does not interpret the line because it is considered a comment.

The following list briefly describes each keyword in the file and, where possible, provides hints about what could go wrong with it:

► #@ job_type<br />

LoadLeveler supports three basic types of jobs:

– serial<br />

– parallel<br />

– bluegene<br />

In this book, we only discuss the bluegene job type.<br />

► #@ job_type = bluegene<br />

This line is mandatory in the job command file. Without this keyword,<br />

LoadLeveler does not understand other Blue Gene/L related keywords. In<br />

fact, this keyword tells LoadLeveler to use the bridge API to exchange<br />

information with the Blue Gene/L system. Example 4-27 shows the llsubmit<br />

command returning an error message when the job command file does not<br />

contain the keyword #@ job_type = bluegene.<br />

Example 4-27 Missing job_type in job command file<br />

loadl@bglfen1:/bgl/loadl> llsubmit hello.cmd<br />

llsubmit: 2512-585 The "bg_size" keyword is only valid for "job_type =<br />

BLUEGENE" job steps.<br />

llsubmit: 2512-051 This job has not been submitted to LoadLeveler.<br />

► #@ executable = /usr/bin/mpirun<br />

This directive tells LoadLeveler that the executable for a Blue Gene/L job is<br />

the mpirun command, which is invoked by LoadLeveler to launch an MPI job<br />

on the Blue Gene/L nodes. If this keyword is missing or points to a different<br />

executable, LoadLeveler cannot find the mpirun program (and fails to submit<br />

the job).<br />



► #@ arguments<br />

This keyword contains the arguments that are passed to mpirun. They must<br />

be typed here exactly as they are entered on the mpirun command line. For<br />

example, to run the “Hello world!” program with mpirun on the command line,<br />

we used the following syntax:<br />

mpirun -exe /bglscratch/test1/hello-file.rts -args 10 -verbose 1<br />

In the LoadLeveler job command file, this line is translated to:<br />

#@ executable = /usr/bin/mpirun<br />

#@ arguments = -exe /bglscratch/test1/hello-file.rts -args 10<br />

-verbose 1<br />

Note: LoadLeveler does not validate the syntax of the command string that is passed to mpirun. If there is a problem with an argument or value, mpirun returns a message in the job’s stderr(2).

► #@ environment<br />

This keyword passes any environment variables that the job needs to set<br />

when it is running. The reserved word COPY_ALL copies all the user’s shell<br />

environment variables for the job (as displayed by the commands set or env).<br />

► #@ output and #@ error<br />

These two keywords contain the directory (directories) where the job’s<br />

stdout(1) and stderr(2) are sent. If these two keywords are not specified in<br />

the job command file, LoadLeveler redirects the job’s stderr(2) and<br />

stdout(1) to /dev/null.<br />

Note: These directories have to be writable for the user ID that runs the job. If<br />

the directories do not exist or are not accessible, the job is rejected by<br />

LoadLeveler.<br />

► #@ notification<br />

This keyword consists of a reserved word that indicates when LoadLeveler should notify the user ID specified in the #@ notify_user keyword.

► #@ notify_user<br />

This keyword contains the user ID that is going to receive LoadLeveler’s<br />

notification in case the notification condition is set (#@ notification).<br />

► #@ class<br />

Depending on how the LoadLeveler cluster is configured, the job can choose to run under a LoadLeveler class.



Note: If the class that is specified in the job command file is not defined in the LoadLeveler configuration, the job remains idle (I) in the queue. See “LoadLeveler configuration keywords” on page 205 for information about how to identify class definitions in the LoadLeveler configuration.

► #@ queue<br />

This keyword is usually the last keyword in a job command file. It is not set to<br />

any value. The command llsubmit returns an error if the #@ queue line is<br />

missing from the job command file (as shown in Example 4-28).<br />

Example 4-28 The llsubmit error when missing #@queue keyword<br />

loadl@bglsn:/bgl/loadl> llsubmit hello.cmd<br />

llsubmit: 2512-058 The command file "hello.cmd" does not contain any<br />

"queue" keywords.<br />

llsubmit: 2512-051 This job has not been submitted to LoadLeveler.<br />

Blue Gene/L specific keywords<br />

The following list provides the keywords that are specific to Blue Gene/L:<br />

► #@ bg_size<br />

This keyword contains an integer that specifies the number of compute nodes<br />

that are required for this job to run. This is equivalent to the argument -np<br />

for mpirun.<br />

► #@ bg_shape<br />

This keyword contains the reserved word, mesh or torus. It is equivalent to the<br />

argument flag -shape for mpirun.<br />

► #@ bg_partition<br />

This keyword can be set to a partition name. The partition has to be<br />

predefined. It is equivalent to the argument flag -partition for mpirun.<br />

Note: Use one of these keywords in the job command file instead of passing the mpirun argument flags -np, -shape, or -partition in the #@ arguments keyword. When mpirun receives the job information from LoadLeveler, it either creates a partition with the specified number of compute nodes and shape, or selects the specified predefined partition.

When debugging problems for jobs with a complex command file, start with a simple file as described in this section (a minimal sketch of such a file follows this paragraph). Make sure that the job can run with this file. Use the “Hello world!” program if necessary. Then, add keywords gradually to the same job command file until the problem is observed.
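The following is a minimal sketch of such a simple job command file, assembled from the keywords discussed in this section. The executable and scratch paths are the ones used in the examples in this chapter, and the class name and bg_size value are only illustrative; adjust them to your configuration.

#@ job_type = bluegene
#@ executable = /usr/bin/mpirun
#@ arguments = -exe /bglscratch/test1/hello-file.rts -args 10 -verbose 1
#@ output = /bgl/loadl/hello.out
#@ error = /bgl/loadl/hello.err
#@ notification = never
#@ class = small
#@ bg_size = 32
#@ queue

Submit the file with llsubmit hello.cmd and monitor it with llq. If this simple job runs, the basic environment is sound and the problem is likely in one of the keywords you add afterwards.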



LoadLeveler processes, logs, and persistent storage<br />

As discussed in 4.4.1, “LoadLeveler overview” on page 167, the cluster is<br />

managed and run by different LoadLeveler daemons. Figure 4-8 on page 168<br />

shows all the LoadLeveler daemons running in the background on a node.<br />

However, LoadLeveler is designed in such a way that not all daemons run on every node in the cluster. It is normal to have only a subset of the daemons running on each node.

In order to know for sure which LoadLeveler daemons are running, you can use<br />

the ps -ef command (filtered for LoadLeveler processes), as shown in<br />

Example 4-29. In our case, the Negotiator daemon (or CM) and the Master<br />

daemon are running on the SN.<br />

Example 4-29 LoadLeveler daemons running on Blue Gene/L SN<br />

loadl@bglsn:/bgl/loadl> ps -ef | grep LoadL<br />

loadl 27425 1 0 Apr01 ? 00:00:00<br />

/opt/ibmll/LoadL/full/bin/LoadL_master<br />

loadl 27436 27425 0 Apr01 ? 00:03:17 LoadL_negotiator -f -c<br />

/tmp -C /tmp<br />

loadl 14892 11456 0 15:43 pts/34 00:00:00 grep LoadL<br />

Running the same command on an FEN shows the Master daemon, the<br />

Scheduler daemon, the Start daemon, and the Starter daemon running, as<br />

shown in Example 4-30.<br />

Example 4-30 LoadLeveler daemons running on Blue Gene/L FEN<br />

loadl@bglfen1:~> ps -ef | grep LoadL<br />

loadl 18931 1 0 Apr01 ? 00:00:00<br />

/opt/ibmll/LoadL/full/bin/LoadL_master<br />

loadl 18940 18931 0 Apr01 ? 00:00:00 LoadL_schedd -f -c /tmp<br />

-C /tmp<br />

loadl 18941 18931 0 Apr01 ? 00:04:48 LoadL_startd -f -C /tmp<br />

-c /tmp<br />

loadl 820 18941 0 08:51 ? 00:00:00 LoadL_starter -p 131 -c<br />

/tmp -C /tmp<br />

loadl 1950 1891 0 13:45 pts/6 00:00:00 grep LoadL<br />



The two previous examples match the LoadLeveler cluster configuration as shown by the llstatus command in Example 4-31.

Example 4-31 Matching daemons with llstatus command<br />

loadl@bglfen1:~> llstatus<br />

Name Schedd InQ Act Startd Run LdAvg Idle Arch<br />

OpSys<br />

bglfen1.itso.ibm.com Avail 0 0 Idle 0 0.00 395 PPC64<br />

Linux2<br />

bglfen2.itso.ibm.com Avail 0 0 Idle 0 0.03 9999 PPC64<br />

Linux2<br />

PPC64/Linux2 2 machines 0 jobs 0 running<br />

Total Machines 2 machines 0 jobs 0 running<br />

The Central Manager is defined on bglsn.itso.ibm.com<br />

The BACKFILL scheduler with Blue Gene support is in use<br />

Blue Gene is present<br />

All machines on the machine_list are present.<br />

This configuration is defined in the LoadLeveler configuration files, which are<br />

described in “LoadLeveler configuration keywords” on page 205.<br />

The most important daemon is the Negotiator. If the Negotiator cannot start,<br />

LoadLeveler commands such as llstatus and llq will not work. To determine<br />

which node is supposed to be the CM, check the LoadL_admin file and look for<br />

the following line:<br />

central_manager = true<br />

Example 4-32 shows the lines extracted from the file LoadL_admin, in which CM<br />

is defined on node bglsn.itso.ibm.com.<br />

Example 4-32 CM defined in LoadL_admin<br />

bglsn.itso.ibm.com: type = machine<br />

schedd_host = true<br />

central_manager = true<br />
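A quick way to find the designated CM without opening the file in an editor is to search LoadL_admin directly. This is a minimal sketch; it assumes that LoadL_admin resides in the same /bgl/loadlcfg directory as LoadL_config (adjust the path for your installation). The -B 2 option shows the machine stanza that precedes the matching line.

loadl@bglsn:/bgl/loadlcfg> grep -B 2 "central_manager" LoadL_admin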

Note: In a common LoadLeveler configuration, there can be other nodes defined with central_manager = alt (these are alternate CMs). One of these nodes takes over the role of the CM if the designated CM fails. However, an alternate CM is not supported in a Blue Gene/L environment.



You can configure the LoadLeveler daemons to be respawned in case of abnormal termination. Therefore, to diagnose problems that happened in the past, you have to investigate each daemon’s log file. The log file names and locations are defined in the LoadL_config file, as shown in Example 4-33.

Example 4-33 Log file names and locations<br />

loadl@bglfen1:/bgl/loadlcfg> grep -i _log LoadL_config | grep -v MAX<br />

KBDD_LOG = $(LOG)/KbdLog<br />

STARTD_LOG = $(LOG)/StartLog<br />

SCHEDD_LOG = $(LOG)/SchedLog<br />

NEGOTIATOR_LOG = $(LOG)/NegotiatorLog<br />

GSMONITOR_LOG = $(LOG)/GSmonitorLog<br />

STARTER_LOG = $(LOG)/StarterLog<br />

MASTER_LOG = $(LOG)/MasterLog<br />

Also in LoadL_config, the $(LOG) variable can be defined as:
LOG = $(tilde)/log
where $(tilde) is the home directory of the LoadLeveler administrator or the user ID that starts LoadLeveler.
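To see where the logs actually are on a given node, you can list the resolved log directory and inspect the most recent entries. This is a minimal sketch that assumes the LoadLeveler home directory is /bgl/loadl, as suggested by the command prompts used in this chapter; adjust the path to your installation.

loadl@bglsn:~> ls -l /bgl/loadl/log
loadl@bglsn:~> tail -n 50 /bgl/loadl/log/NegotiatorLog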

If the log file names and locations are not defined in LoadL_config, they are set<br />

to the location that is specified when the command llinit is issued. The syntax<br />

of the llinit command is:<br />

llinit -local /home/loadl<br />

where the option flag -local specifies the location where the following three<br />

directories are created:<br />

► log<br />

► execute<br />

► spool<br />

In addition to the log directory, the llinit command also creates two other<br />

directories: execute and spool. These directories serve as persistent storage for<br />

job data and history information. Therefore, if LoadLeveler is stopped with jobs in the queue, job data and information are saved for the next time that LoadLeveler is started.

Depending on the state of a job at the time LoadLeveler stops, it can be<br />

restarted, resumed, or started when LoadLeveler starts next time.<br />

Note: For complete information regarding LoadLeveler processes and logs, see IBM LoadLeveler Using and Administering Guide, SA22-7881.



LoadLeveler configuration keywords<br />

The LoadL_config file includes the global LoadLeveler configuration keywords.<br />

To determine the location of this file, check the contents of the /etc/LoadL.cfg file,<br />

which contains the basic LoadLeveler configuration. Example 4-34 shows the<br />

contents of this file.<br />

Example 4-34 The contents of the /etc/LoadL.cfg file<br />

loadl@bglsn:~/log.save> cat /etc/LoadL.cfg<br />

LoadLUserid = loadl<br />

LoadLGroupid = loadl

LoadLConfig = /bgl/loadlcfg/LoadL_config<br />

This file resides on a common file system so that you can access it from any<br />

node in the cluster (that NFS mounts the /bgl file system). The most important<br />

Blue Gene/L keywords in the LoadL_config file are described in 4.4.4,<br />

“Configuring LoadLeveler for Blue Gene/L” on page 172. Some of the configuration keywords that define the log file names and locations are also discussed in “LoadLeveler processes, logs, and persistent storage” on page 202.

In addition to the global configuration file, LoadLeveler also uses a local<br />

configuration file that resides on a local file system on each node. This is<br />

specified in the global configuration file with the keyword LOCAL_CONFIG, as follows:
LOCAL_CONFIG = $(tilde)/LoadL_config.local
This local configuration file is needed in case the system administrator wants different nodes to have different configurations for the following:

► LoadLeveler daemons running<br />

► Job classes<br />

► Number of Starters<br />

To specify these parameters, the following keywords are used:<br />

► SCHEDD_RUNS_HERE<br />

If this keyword is set to FALSE, LoadLeveler does not start the Scheduler daemon on this node. If the majority of the nodes in the cluster are defined to run the Scheduler, you can set this keyword to TRUE in the global configuration file and then set it to FALSE in the local configuration file of each node that does not need to run the Scheduler.

Note: The setup value in the local configuration file overrides the one in the<br />

global configuration file.<br />



► STARTD_RUNS_HERE<br />

This keyword specifies whether LoadLeveler should start the Start daemon<br />

on the local node.<br />

Note: It is usually not desirable to run Scheduler and Start daemon on the<br />

Blue Gene/L SN.<br />

► CLASS<br />

To control the types of jobs that run on particular nodes, you can specify the CLASS keyword either in the global or in the local configuration file. For example:
CLASS = small(8) medium(5) large(2)
Unless a default class is defined in the LoadL_admin file, a job has to specify the keyword #@ class in its job command file to be able to run. The keyword is set to one of the class names. See Example 4-16 on page 178 for the use of the keyword #@ class.

► MAX_STARTERS<br />

This configuration keyword sets the maximum number of jobs that can run on the local node. Set it to a value that matches the capacity of the node. A sketch that combines these local configuration keywords follows this list.
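The following is a minimal sketch of what a LoadL_config.local file for an FEN might contain, combining the keywords described above. The values and class names are only illustrative and must match your LoadL_admin file and global configuration.

# LoadL_config.local on an FEN (sketch): run the Schedd and Startd here,
# allow up to two simultaneous starters, and accept small and medium jobs
SCHEDD_RUNS_HERE = TRUE
STARTD_RUNS_HERE = TRUE
MAX_STARTERS = 2
CLASS = small(2) medium(1)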

Blue Gene/L specific configuration keywords<br />

As discussed in 4.4.4, “Configuring LoadLeveler for Blue Gene/L” on page 172, there are special configuration keywords that enable Blue Gene/L functionality in LoadLeveler. It is also recommended that the Scheduler and Start daemons do not run on the Service Node. See 4.4.2, “Principles of operation in a Blue Gene/L environment” on page 170.

Environment variables, network, and library links<br />

This section explains the variables that are critical to the Blue Gene/L LoadLeveler software environment (running jobs).

Environment variables<br />

In 4.4.6, “Setting Blue Gene/L specific environment variables” on page 175, we<br />

discuss the environment variables that you need to set up for a user ID to start<br />

LoadLeveler. A simple way to check these variables in a UNIX shell is to issue<br />

the echo command, as shown in Example 4-35.<br />



Example 4-35 Checking environment variables<br />

loadl@bglfen1:/bgl/loadlcfg> echo $BRIDGE_CONFIG_FILE<br />

/bgl/BlueLight/ppcfloor/bglsys/bin/bridge.config<br />

loadl@bglfen1:/bgl/loadlcfg> echo $DB_PROPERTY<br />

/bgl/BlueLight/ppcfloor/bglsys/bin/db.properties<br />

loadl@bglfen1:/bgl/loadlcfg> echo $MMCS_SERVER_IP<br />

bglsn.itso.ibm.com<br />

If these variables are not set up correctly for the LoadLeveler administrator user ID, LoadLeveler will not start. Other users need to set up these variables once before they can submit jobs. However, note the following:
► The location of the configuration files to which these variables point can be changed for individual users.
► The contents of the configuration files for individual users can also be changed for various needs.

Another common variable is the PATH ($PATH). To access the LoadLeveler commands, users should have the directory where the LoadLeveler binaries are installed in their $PATH. For example:
/opt/ibmll/LoadL/full/bin/
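For example, a user could append that directory to the PATH in the shell profile; a minimal sketch for bash follows (which profile file to edit depends on your environment):

export PATH=$PATH:/opt/ibmll/LoadL/full/bin
which llq        # should now resolve to /opt/ibmll/LoadL/full/bin/llq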

Network and communications<br />

Because LoadLeveler is a cluster-managed subsystem, network communication between nodes is vital to its operations. The daemons use sockets to communicate with each other. Basic TCP/IP and sockets knowledge is helpful in problem determination.

The global LoadL_config file defines the port numbers that the daemons use, as<br />

shown in Example 4-36.<br />

Example 4-36 Network ports defined for LoadLeveler daemons<br />

# Specify port numbers<br />

MASTER_STREAM_PORT = 9616<br />

NEGOTIATOR_STREAM_PORT = 9614<br />

SCHEDD_STREAM_PORT = 9605<br />

STARTD_STREAM_PORT = 9611<br />

COLLECTOR_DGRAM_PORT = 9613<br />

STARTD_DGRAM_PORT = 9615<br />

MASTER_DGRAM_PORT = 9617<br />



For example, knowing how sockets work helps when a socket is closed abruptly: the port can remain unavailable for a certain time until all the traffic has quiesced. This situation might occur when you stop and restart LoadLeveler very quickly, without waiting for a short while (about 1 minute). Example 4-37 shows the NegotiatorLog, which includes messages indicating that the Negotiator cannot start and has to wait on port 9614.

Example 4-37 Negotiator daemon waiting on port 9614<br />

03/19 16:54:30 TI-1 *************************************************<br />

03/19 16:54:30 TI-1 *** LOADL_NEGOTIATOR STARTING UP ***<br />

03/19 16:54:30 TI-1 *************************************************<br />

03/19 16:54:30 TI-1<br />

03/19 16:54:30 TI-1 LoadLeveler: LoadL_negotiator started, pid = 14176<br />

03/19 16:54:30 TI-5 LoadLeveler: 2539-479 Cannot listen on port 9614 for service<br />

LoadL_negotiator.<br />

03/19 16:54:30 TI-5 LoadLeveler: Batch service may already be running on this<br />

machine.<br />

03/19 16:54:30 TI-5 LoadLeveler: Delaying 1 seconds and retrying ...<br />

03/19 16:54:30 TI-5 LoadLeveler: 2539-479 Cannot listen on port 9614 for service<br />

LoadL_negotiator.<br />

03/19 16:54:30 TI-5 LoadLeveler: Batch service may already be running on this<br />

machine.<br />

03/19 16:54:30 TI-5 LoadLeveler: Delaying 2 seconds and retrying ...<br />

03/19 16:54:30 TI-5 LoadLeveler: 2539-479 Cannot listen on port 9614 for service<br />

LoadL_negotiator.<br />

One way to alleviate this problem is to set the TCP recycle attribute with the following command:
/sbin/sysctl -w net.ipv4.tcp_tw_recycle=1

The netstat command is also helpful for understanding the status of the sockets or ports. For example:
netstat -an | grep 9614
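Because all the daemon ports are defined in LoadL_config (Example 4-36), you can check them in one pass. A minimal sketch, assuming the port numbers shown above:

for port in 9605 9611 9613 9614 9615 9616 9617; do
    echo "== port $port =="
    netstat -an | grep ":$port "
done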

Library links<br />

Libraries are another vital resource that LoadLeveler needs to run. As shown in some scenarios, bad or broken links can cause problems with running LoadLeveler and submitting jobs. Verifying the links can be useful as a last resort when everything else fails to find the problem, for the following reasons:
► Links can be removed or broken.
► Libraries can be updated.
► A link can point to a different library for various users.



Blue Gene/L specific links<br />

Library links are used extensively with LoadLeveler setup as described in 4.4.5,<br />

“Making the Blue Gene/L libraries available to LoadLeveler” on page 173. When<br />

in doubt, check and validate all the library links. You can use a script similar to<br />

that shown in Example 4-14 on page 175.<br />

4.4.10 Updating LoadLeveler in a Blue Gene/L environment<br />

In a Blue Gene/L system, LoadLeveler is not part of the Blue Gene/L code<br />

distribution. You can update LoadLeveler code separately on the SN and FENs.<br />

Consider the following recommendations:<br />

► Check the code levels on all the nodes in the cluster to make sure that there<br />

are no version or level mismatches.<br />

► If in doubt, check the libraries and their symbolic links.<br />

► Note that some installation scripts copy the library binaries rather than creating a symbolic link. In this case, the checksum command can help validate the binary files (a small sketch follows this list).
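For example, you can compare the checksum of a copied library on an FEN against the version on the SN over ssh. This is a minimal sketch; the library file name and directory are only an illustration of the idea, so use the libraries and links that your installation actually references.

# on the FEN: checksum the local copy and the copy on the SN, then compare the values
cksum /opt/ibmll/LoadL/full/lib/libllapi.so
ssh bglsn cksum /opt/ibmll/LoadL/full/lib/libllapi.so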





Chapter 5. File systems<br />


This chapter provides an understanding of the steps that are required to fix problems with the file systems (persistent storage) that are supported on Blue Gene/L. Currently, both NFS and GPFS file systems are supported for use with Blue Gene/L, and in this chapter we discuss problem determination for both of these file system types.

For each of the file systems supported, we begin with a general introduction, and<br />

then we describe how the file system plugs in to Blue Gene/L. We discuss both<br />

the concepts and the steps that are required to configure the file systems. For<br />

each file system type, we then present a problem determination methodology<br />

that recommends a specific sequence of checks, including a checklist of steps<br />

that show how to make each of the checks, along with an explanation and<br />

suggested commands.<br />



5.1 NFS and GPFS<br />

In a basic configuration, Blue Gene/L requires an NFS file system, regardless of whether a GPFS file system is also used. Native access to a GPFS file system from a Blue Gene/L I/O node requires an available NFS file system so that the GPFS code on the I/O node can read the centrally held GPFS configuration files during node startup.

NFS is the most convenient to set up, because most operating systems provide the facilities to both create and mount an NFS file system. GPFS provides a more scalable solution for those configurations where high performance and large file system storage are needed (requirements higher than NFS can provide). You can configure GPFS with multiple storage server nodes that work together to provide aggregated performance (unlike NFS, where all the storage that belongs to an NFS file system must be attached physically to a single node).

GPFS currently supports a file size of up to 200 TB. This size limit is not the actual architectural limit (which in GPFS 2.3 is 2^99 bytes); rather, it is the limit of the configurations that have been tested.

Both NFS and GPFS file systems must be mounted on the I/O node<br />

automatically during the node boot process. This mounting is essential because<br />

each time a job is run, a new block might have to be allocated and, therefore, all the I/O nodes that belong to this block will be booted. To be able to run a job immediately, the file system must be available.

Understanding the I/O node boot sequence is key to understanding problem<br />

determination for Blue Gene/L file systems.<br />

5.1.1 I/O node boot sequence<br />

The IBM proprietary Compute Node Kernel (CNK) that runs on the compute node is a single-user, single-process runtime environment that has no paging mechanism. The compute node can communicate with the outside world only through the I/O node, and any executable program that runs on the compute node must be loaded from the I/O node through the Blue Gene/L internal collective network.

Depending on the configuration, a Blue Gene/L system includes a number of I/O<br />

nodes. The I/O node runs the Mini-Control Program (MCP), which is a cut-down Linux OS that runs a 32-bit PPC 2.4 uniprocessor kernel. The Compute Node I/O

daemon (ciod) that is loaded during MCP initialization and runs on the I/O node<br />

is responsible for handling the I/O calls made by the Compute nodes. The MCP,<br />

unlike the Compute Node Kernel, supports TCP/IP communication programs,<br />



such as NFS, ping, and other I/O related system functions that help with problem<br />

determination.<br />

Steps in the I/O node boot sequence<br />

Note: The variables that we use in this section are set during the I/O node<br />

system init (rc.sysinit) when it runs the /etc/rc.dist script that is built into the<br />

RAM disk image. Here are the contents of this file on our system:<br />

export<br />

BGL_DISTDIR="/bgl/BlueLight/V1R2M1_020_2006-060110/ppc/dist"<br />

export BGL_SITEDISTDIR="/bgl/dist"<br />

export BGL_OSDIR="/bgl/OS/4.1"<br />

The steps of the I/O node boot sequence that we discuss in detail further in this<br />

section are:<br />

1. MCP kernel and ramdisk are loaded over the service network.<br />

2. The MCP launches the /sbin/init process from the ramdisk image, which reads /etc/inittab. The system init rule in inittab is coded to run /etc/rc.d/rc.sysinit.

3. The rc.sysinit is invoked from within the MCP ramdisk image. (You can find a<br />

copy of this file in the /bgl/BlueLight/ppcfloor/dist/etc/rc.d directory.) This<br />

script attempts to do the following:<br />

a. NFS mounts the /bgl directory from the Service Node (SN), or the directory that is defined by the BGL_EXPORTDIR variable, if that variable is set.

b. Runs the /etc/rc.dist script from the ramdisk image.<br />

4. The rc.sysinit2 is next invoked from the NFS mounted directory<br />

(/bgl/BlueLight/ppcfloor/dist/etc/rc.d) and does the following:<br />

a. Replaces empty /lib with symbolic link to $BGL_OSDIR/lib.<br />

b. Replaces empty /usr with symbolic link to $BGL_OSDIR/usr.<br />

c. Replaces empty /etc/rc.d/rc3.d with symbolic link under $BGL_DISTDIR.<br />

d. Loads the collective/tree network device drivers.<br />

e. Runs $BGL_DISTDIR/etc/rc.d/rc3.d/S* start scripts.<br />

f. Runs $BGL_SITEDISTDIR/etc/rc.d/rc3.d/S* start scripts.<br />

The scripts that are run by default by these start scripts are found in the<br />

/bgl/BlueLight/ppcfloor/dist/etc/rc.d/init.d directory and are listed below<br />

along with the jobs that they perform:<br />

nfs Starts the portmap daemon.<br />

xntpd Starts the network time protocol daemon.<br />



sshd (optional) Starts the secure shell daemon if required. This<br />

occurs if either GPFS_STARTUP=1 is set in<br />

/etc/sysconfig/gpfs on the I/O node OR the<br />

/etc/sysconfig/ssh is found.<br />

You can find this file on the SN as<br />

/bgl/BlueLight/ppcfloor/dist/etc/sysconfig/ssh<br />

gpfs Starts and mounts GPFS file systems if<br />

GPFS_STARTUP=1 is set in /etc/sysconfig/gpfs.<br />

syslog Starts syslog services.<br />

ciod Starts ciod.<br />

ibmcmp Starts the XL Compiler Environment for the I/O node.
g. Runs the $BGL_SITEDISTDIR/etc/rc.local script.

5. As each I/O node completes its MCP boot process, it looks for additional<br />

scripts to run. These additional scripts can be found in two separate<br />

directories documented in the following paragraphs.<br />

5.1.2 Additional scripts in I/O node boot sequence<br />

You can save scripts that you want invoked during the I/O node boot sequence in<br />

either of the following two directories:<br />

► /bgl/BlueLight/ppcfloor/dist/etc/rc.d/rc3.d - (installation dist directory)<br />

Warning: The /bgl/BlueLight/ppcfloor/ppc/dist/etc/rc.d/rc3.d directory is<br />

part of the installed software. Its contents are lost when you install a new<br />

driver or release.<br />

► /bgl/dist/etc/rc.d/rc3.d - (site dist directory)<br />

To be considered, the script’s file name must begin with the uppercase letter S<br />

(for start) or K (for kill), followed by two decimal digits (for example, S10mynfs,<br />

K10mynfs, and so forth) and a relevant name for the service.<br />

Note: The general rule is that the scripts starting with S are run at system init, and the scripts starting with K are run when the system is shut down.

At system initialization, the list of scripts starting with S is sorted by the subsequent two digits, which specify the relative order in which the I/O node runs the scripts. In a similar way, the kill scripts that start with the letter K are used when the I/O node is shut down as the associated block is freed. The scripts in both directories are sorted into a single list and then run one at a time in that order.




Warning: If a start script in the site dist directory has the same name as a<br />

start script in the installation dist directory, only the script in the installation dist<br />

directory is run.<br />

Let us assume that you have a script named /bgl/dist/etc/rc.d/rc3.d/S10mynfs<br />

that mounts additional file systems that contain your data. Because the script<br />

name begins with S10, it runs before S50ciod, which starts the ciod, and after<br />

S05nfs, which starts the port mapper. This sequence is correct, because your file<br />

systems are mounted before jobs can be started and after NFS is already<br />

running.<br />
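A minimal sketch of what such a hypothetical S10mynfs script could look like is shown below. The server address matches the file server used elsewhere in this chapter, but the export name and mount point are placeholders, and a production script should add retries and error handling as the sitefs example later in this chapter (Example 5-4) does.

#!/bin/sh
# /bgl/dist/etc/rc.d/rc3.d/S10mynfs (hypothetical example)
# Mount an additional NFS file system on the I/O node before ciod starts.
case "$1" in
start)
    mkdir -p /mydata
    mount -o rw,tcp,rsize=32768,wsize=32768 172.30.1.33:/mydata /mydata
    ;;
stop)
    umount /mydata
    ;;
esac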

5.2 NFS

Blue Gene/L requires an NFS file system regardless of whether a GPFS file system is also required. The default NFS file system, the /bgl directory, is exported from the SN.

5.2.1 How NFS plugs into a Blue Gene/L system<br />

It is important that the /bgl directory on the SN is NFS exported, because the I/O nodes must be able to mount this file system when a block is booted. While applications can be run from the /bgl directory, it is recommended that the /bgl file system is preserved for the installed Blue Gene/L code and that another file system is used for user applications. Figure 5-1 shows how NFS plugs in to a Blue Gene/L system.

Note: Even though the system can run without an additional NFS server (the SN provides the basic NFS services and file systems), we strongly recommend that you configure additional NFS servers, both to satisfy application performance and storage requirements and to avoid overloading the SN.



Figure 5-1 shows the NFS server (a p5 system with attached storage, 172.30.1.33/16) connected through the functional Ethernet (172.30.0.0/16) to the pSeries Service Node and to the Blue Gene I/O nodes, which act as NFS clients. The callouts in the figure read: 1. Create the NFS server with storage, and create and export the NFS file system (/bglscratch). 2. Export the NFS file system from the server to the Blue Gene functional network. 3. Check that the NFS file system can be mounted over the functional network to the Service Node. 4. Add a command to the sitefs script to mount the NFS file system (/bglscratch) when the I/O nodes boot.

Figure 5-1 Adding an NFS file system to the Blue Gene/L

5.2.2 Adding an NFS file system to the Blue Gene/L system<br />

This section provides an example of how to make a file system available through NFS to the Blue Gene/L system. The file system can be served by any file server that complies with the NFS V3 protocol. The file system is made available through the functional network and must be mounted by the I/O nodes, the SN, and the FENs that are used to compile and execute the jobs.

In our environment, we used an IBM System p 630 running SUSE SLES 9<br />

connected to an IBM DS4500 storage. Because it is outside the scope of this<br />

book, we do not present the basic operating system, storage (file systems), and<br />

networking configuration steps. Instead, we emphasize the steps to make the file<br />

system available to the Blue Gene/L system.<br />


Important: The File Server name in this section refers to either the SN or<br />

another system that is used to host the NFS file system (NFS-FS) that is used<br />

to run user jobs.


Here are the steps that are required to make an NFS file system available for<br />

running jobs on Blue Gene/L:<br />

1. Create the file system (NFS-FS) on the File Server system (172.30.1.33,<br />

p630n03_fn) and mount it on the File Server system.<br />

mount /bglscratch<br />

The File Server system could be the SN, one FEN, or another system.<br />

2. Export the NFS-FS from the File Server.<br />

Set USE_KERNEL_NFSD_NUMBER="64" in /etc/sysconfig/nfs.<br />

Add the following line to /etc/exports and then activate it:<br />

/bglscratch 172.30.0.0/255.255.0.0(rw,no_root_squash,async)

exportfs -a<br />

Now check the export on the FS by issuing the command:<br />

showmount -e<br />

Check that the NFS server is started.<br />

/etc/init.d/nfsserver status<br />

This should return:<br />

Checking for kernel based NFS server: running<br />

3. Check that this NFS-FS can be mounted and accessed on the SN.<br />

On the SN, issue the following command (172.30.1.33 is the File Server IP address on the functional network):

mount 172.30.1.33:/bglscratch /mnt<br />

Check that you can access the file system on the SN:

cd /mnt; touch foo<br />

4. Update the site customization script (sitefs) to enable the NFS-FS to be<br />

mounted when the I/O nodes boot. Then check that a job can access files on<br />

the NFS-FS when run. See sitefs entries in “Step 3 - Check that the NFS-FS<br />

is mounted when the block boots” on page 221.<br />

5.2.3 NFS problem determination methodology<br />

The methodology that we present here is intended to help with a wide variety of<br />

problems. The first sections cover the basics and the later sections cover the<br />

more unlikely and esoteric problem areas. If you think you already know in which<br />

area the problem lies, then we encourage you to go straight to that section.<br />

However, if you are unsure where the problem lies, we suggest that you use the<br />

methodology in the order presented here, because this approach often uncovers<br />

the simplest problems quickly and easily before you spend a long time looking for a solution to a presumed problem rather than the real one.

Check that the NFS-FS can be mounted on the SN<br />

After each step mentioned here, check whether you can mount the NFS-FS:<br />

► Check that the NFS-FS is exported from the Server, as described in “Step 1 -<br />

Check that NFS-FS is exported from the File Server” on page 218.<br />

► Check that you can ping the FS Server over the functional network.<br />

► If you still cannot mount the file system, then check error messages (screen,<br />

console, system log) in /var/log/messages.<br />

Check that the NFS-FS can be mounted on the I/O nodes<br />

Also check whether you can mount the NFS-FS on the I/O nodes:<br />

► First boot a block that uses the I/O node that has the problem.<br />

► Check if the NFS-FS can be mounted on the I/O node, as described in “Step 2<br />

- Check if the NFS-FS can be mounted on the I/O node” on page 220.<br />

► Check that you can ping the File Server’s IP address from the I/O node.<br />

► Check that the NFS-FS is mounted when the block boots, as described in “Step 3 - Check that the NFS-FS is mounted when the block boots” on page 221.

Check network tuning parameters<br />

Network tuning parameters are unlikely to prevent NFS from mounting. However, if you are experiencing performance or intermittent connection problems, this check might help solve the problem. See “Step 4 - Check the network tuning parameters on the SN” on page 223.

5.2.4 NFS checklists

In addition to the problem determination methodology, the following detailed checklists (steps) can aid in NFS problem determination.

Step 1 - Check that NFS-FS is exported from the File Server<br />

The best way to check if a file system is exported is to use the showmount<br />

command. Example 5-1 issues the showmount command on the SN to check that<br />

the /bgl directory is exported from the SN over the functional network.<br />



Example 5-1 Checking NFS exports<br />

bglsn:/tmp # showmount -e<br />

Export list for bglsn:<br />

/bgl 172.30.0.0/255.255.0.0<br />

Example 5-2 uses the showmount command from the SN to check an additional<br />

NFS server (that holds the user application code and data) to see what NFS file<br />

systems can be mounted on the SN (and on I/O nodes).<br />

Example 5-2 Checking exports for additional servers<br />

bglsn:/tmp # showmount -e p630n03_fn<br />

Export list for p630n03_fn:<br />

/nfs_mnt (everyone)<br />

/bglscratch (everyone)<br />

/bglhome (everyone)<br />

If the showmount command returns the following error then the rpc.mountd or nfsd<br />

services are not running:<br />

'mount clntudp_create: RPC: Program not registered'.<br />

To fix this issue, run the following command:<br />

/etc/init.d/nfsserver restart<br />

Another error returned by the showmount command might be the following<br />

message, which means that the portmap service is not running:<br />

'mount clntudp_create: RPC: Port mapper failure - RPC: Unable to<br />

receive'<br />

To fix this issue run the following:<br />

/etc/init.d/portmap restart<br />

/etc/init.d/nfsserver restart<br />

You can use the following command to check the server:<br />

/etc/init.d/nfsserver status<br />

Checking for kernel based NFS server: running<br />

To check the port mapper service, you can use the following command:<br />

bglsn:/tmp # /etc/init.d/portmap status<br />

Checking for RPC portmap daemon: running<br />

Checking for kernel based NFS server: running<br />



Step 2 - Check if the NFS-FS can be mounted on the I/O node<br />

Use the mmcs_db_console to check which file systems are mounted on a particular I/O node (in our case, ionode4) using the mount command, and then use the same technique to check connectivity (ping) to the SN (172.30.1.1). Example 5-3 shows the commands that we used in a mmcs_db_console session with the write_con command (command lines are shown in bold font).

Example 5-3 Using mmcs_db_console to mount NFS file system on I/O node<br />

mmcs$ allocate R000_J108_32<br />

OK<br />

mmcs$ redirect R000_J108_32 on<br />

OK<br />

mmcs$ {i} write_con hostname<br />

OK<br />

mmcs$ Mar 29 13:42:35 (I) [1079301344] {17}.0: h<br />

Mar 29 13:42:35 (I) [1079301344] {0}.0: h<br />

Mar 29 13:42:35 (I) [1079301344] {0}.0: ostname<br />

ionode3<br />

$<br />

Mar 29 13:42:35 (I) [1079301344] {17}.0: ostname<br />

ionode4<br />

$<br />

mmcs$ {17} write_con hostname<br />

OK<br />

mmcs$ Mar 29 13:48:46 (I) [1079301344] {17}.0: h<br />

Mar 29 13:48:46 (I) [1079301344] {17}.0: ostname<br />

ionode4<br />

mmcs$ {17} write_con mount<br />

OK<br />

mmcs$ Mar 29 13:43:36 (I) [1079301344] {17}.0: m<br />

Mar 29 13:43:36 (I) [1079301344] {17}.0: ount<br />

rootfs on / type rootfs (rw)<br />

/dev/root on / type ext2 (rw)<br />

none on /proc type proc (rw)<br />

172.30.1.1:/bgl on /bgl type nfs<br />

(rw,v3,rsize=8192,wsize=8192,hard,udp,nolock,addr=172.30.1.1)<br />



172.30.1.33:/bglscratch on /bglscratch type nfs<br />

(rw,v3,rsize=32768,wsize=32768,hard,tcp,lock,addr=172.30.1.33)<br />

/dev/bubu_gpfs1 on /bubu type gpfs (rw)<br />

$<br />

mmcs$ {17} write_con ping -c 2 172.30.1.1<br />

OK<br />

mmcs$ Mar 29 13:44:07 (I) [1079301344] {17}.0: p<br />

Mar 29 13:44:07 (I) [1079301344] {17}.0: ing -c 2 172.30.1.1<br />

PING 172.30.1.1 (172.30.1.1) 56(84) bytes of data.<br />

64 bytes from 172.30.1.1: icmp_seq=1 ttl=64 time=0.126 ms<br />

Mar 29 13:44:08 (I) [1079301344] {17}.0: 64 bytes from 172.30.1.1:<br />

icmp_seq=2 ttl=64 time=0.098 ms<br />

Mar 29 13:44:08 (I) [1079301344] {17}.0:<br />

--- 172.30.1.1 ping statistics ---<br />

2 packets transmitted, 2 received, 0% packet loss, time 999ms<br />

rtt min/avg/max/mdev = 0.098/0.112/0.126/0.014 ms<br />

$<br />

From Example 5-3, you can see how to check for mounted file systems on an I/O<br />

node that is booted and also how to ping the SN from that node to check basic<br />

functional network connectivity. This technique (using the mmcs_db_console and<br />

the write_con commands) can also be used to mount the NFS-FS if it is NOT<br />

automatically mounted.<br />
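For example, to mount the scratch file system manually on I/O node {17} from the same console session, you could issue something like the following. This is only a sketch that reuses the server, mount options, and mount point shown in Example 5-3, and it assumes that the mount point already exists on the I/O node; adjust the values for your system.

mmcs$ {17} write_con mount -o rw,tcp,rsize=32768,wsize=32768 172.30.1.33:/bglscratch /bglscratch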

Step 3 - Check that the NFS-FS is mounted when the block boots<br />

You start by checking that the sitefs file has the correct entries (see<br />

Example 5-4). Next, check that the correct links are in place to invoke the sitefs<br />

file when I/O nodes are booted.<br />
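To verify the links, list the site rc3.d directory on the SN and confirm that the start script points at the sitefs file. This is a minimal sketch; the S40 prefix is only an example of the naming convention described in 5.1.2 and might differ on your system.

bglsn:~ # ls -l /bgl/dist/etc/rc.d/rc3.d/
# expect a start script (for example, S40sitefs) that is a symbolic link to ../init.d/sitefs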

The complete sitefs file that we used is shown in Appendix B, “The sitefs file” on<br />

page 423. Example 5-4 shows the relevant lines (in bold) from our sitefs file.<br />

Example 5-4 Sample sitefs file with /bglscratch file system<br />

bglsn:/bgl/dist/etc/rc.d/init.d # ls<br />

. .. sitefs<br />

bglsn:/bgl/dist/etc/rc.d/init.d # cat sitefs<br />

#!/bin/sh<br />

#<br />

# Sample sitefs script.<br />

#<br />



# It mounts a filesystem on /scratch, mounts /home for user files
# (applications), creates a symlink for /tmp to point into some directory
# in /scratch using the IP address of the I/O node as part of the directory
# name to make it unique to this I/O node, and sets up environment
# variables for ciod.

#<br />

. /proc/personality.sh<br />

. /etc/rc.status<br />

#-------------------------------------------------------------------<br />

# Function: mountSiteFs()<br />

#<br />

# Mount a site file system<br />

# Attempt the mount up to 5 times.<br />

# If all attempts fail, send a fatal RAS event so the block fails<br />

# to boot.<br />

#<br />

# Parameter 1: File server IP address<br />

# Parameter 2: Exported directory name<br />

# Parameter 3: Directory to be mounted over<br />

# Parameter 4: Mount options<br />

#-------------------------------------------------------------------
mountSiteFs()
{
#............>...............

#-------------------------------------------------------------------<br />

# Script: sitefs()<br />

#<br />

# Perform site-specific functions during startup and shutdown<br />

#<br />

# Parameter 1: "start" - perform startup functions<br />

# "stop" - perform shutdown functions<br />

#-------------------------------------------------------------------<br />

# Set to ip address of site fileserver<br />

SITEFS=172.30.1.33<br />

# First reset status of this service<br />

rc_reset<br />

# Handle startup (start) and shutdown (stop)<br />

case "$1" in<br />



start)<br />

echo Mounting site filesystems<br />

# Mount a scratch file system...<br />

mountSiteFs $SITEFS /bglscratch /bglscratch<br />

tcp,rsize=32768,wsize=32768,async<br />

##..............>>>>>>>> Omitted lines <<<<<<<<..............##


Step 4 - Check the network tuning parameters on the SN

Example 5-6 Recommended network tuning parameters<br />

# set UDP receive buffer default (and max, below) so that we don't drop packets

net.core.rmem_default = 1024000<br />

net.core.rmem_max = 8388608<br />

net.core.wmem_max = 8388608<br />

net.core.netdev_max_backlog = 3072<br />

# ARP cache area size to avoid Neighbour table overflow messages<br />

# defaults are 128, 512, 1024. For 64 racks they should be 512, 2048, and 4096.

net.ipv4.neigh.default.gc_thresh1 = 256<br />

net.ipv4.neigh.default.gc_thresh2 = 1024<br />

net.ipv4.neigh.default.gc_thresh3 = 2048<br />

# NFS tuning parameters<br />

net.ipv4.tcp_timestamps = 1<br />

net.ipv4.tcp_window_scaling = 1<br />

net.ipv4.tcp_sack = 1<br />

net.ipv4.tcp_rmem = 4096 87380 4194304<br />

net.ipv4.tcp_wmem = 4096 65536 4194304<br />

net.ipv4.ipfrag_low_thresh = 393216<br />

net.ipv4.ipfrag_high_thresh = 524288<br />

Important: If you have changed any of the network parameters, then you<br />

must run /etc/rc.d/boot.sysctl start or sysctl -p /etc/sysctl.conf for the<br />

settings to take effect immediately.<br />

To view the current settings for these parameters use:<br />

sysctl -A | grep net.<br />
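To confirm that the values from Example 5-6 are active, you can also query a few of them explicitly; a minimal sketch:

sysctl net.core.rmem_default net.core.rmem_max net.ipv4.neigh.default.gc_thresh2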

5.3 GPFS

GPFS stands for General Parallel File System. GPFS is a high-performance, scalable file system that is intended primarily for clusters of computers where a large number of processors require access to the same copy of the data (one of the basic requirements for parallel computing environments).

GPFS is based on a client-server model, with the server part responsible for managing the storage and the client part providing application access. The GPFS client software is highly efficient in handling data, so the CPU slice



required to read and write data is typically much less than for other file systems (like NFS).

Unlike NFS, where managing the storage associated with a file system is the<br />

responsibility of a single server (OS image), in GPFS the storage can be<br />

distributed among multiple servers, eliminating the single server bottleneck.<br />

Using GPFS it is easy to create file systems that store the data on many<br />

separate disks connected to many separate servers. In addition to performance,<br />

storage, and server load balancing, GPFS also provides excellent scalability, reliability, and high availability by providing continuous operation while adding or removing nodes, disks, and file systems.

Blue Gene/L is a highly scalable processing engine that is designed for highly<br />

parallel applications. It is likely that many parallel applications that are designed<br />

to run efficiently on Blue Gene/L will also benefit from the increased and scalable<br />

I/O performance that GPFS file systems can provide. By combining the<br />

scalability and performance of the Blue Gene/L processing platform with the<br />

scalability and I/O performance of a GPFS file system it is possible to provide a<br />

highly optimized computing environment to run parallel applications.<br />

This section provides a general overview of GPFS.<br />

5.3.1 When to use GPFS<br />

Because GPFS is a client-server application that requires additional knowledge and system administration skills, consider carefully whether a GPFS implementation is appropriate for your installation. Although the benefits of the product are significant, weigh the specific elements of your environment when deciding whether to implement it. The following considerations can help you make the correct decision:

► If you need a file system that can provide high performance (and a single<br />

server cannot deliver), then GPFS would be a good solution.<br />

► If you need a reliable file system for a cluster that is unaffected by the failure of a storage server or disk, then GPFS can be configured to provide such a system.

► If you want to allow parallel applications running on a cluster of machines to access a single file at the same time with tight control over data integrity (multiple application instances accessing the same data file concurrently), then GPFS has the appropriate architectural features and also a proven track record in providing these functions.



However, if the following conditions apply to your environment, then GPFS is not mandatory:

► File system performance offered by one NFS server system is adequate.<br />

► The files you are using are smaller than 2 GB.<br />

► You have no requirement to run parallel applications that write to the same<br />

file.<br />

► You do not intend to scale up in performance or storage capacity.<br />

5.3.2 Features and concepts of GPFS<br />

Some of the features of GPFS include:<br />

► File consistency<br />

GPFS uses a sophisticated token management system to provide data<br />

consistency while allowing multiple independent paths to the same file by the<br />

same name from anywhere in the cluster.<br />

► High recoverability and increased data availability<br />

Using GPFS replication, it is possible to configure GPFS to keep two copies of the data on separate groups of disks (failure groups); should a single disk or group of disks fail, access to the data is not lost.

GPFS is a journaling file system that creates separate journal files for each<br />

node. These logs record the allocation and modification of metadata, aiding in<br />

fast recovery and the restoration of data consistency in the event of node<br />

failure.<br />

► High I/O performance<br />

GPFS can provide high I/O performance and achieves this partly by striping<br />

the files across all the disks in the file system. Managing its own striping<br />

affords GPFS the control it needs to achieve fault tolerance and to balance<br />

load across adapters, storage controllers, and disks. Large files in GPFS are<br />

divided into equal sized blocks, and consecutive blocks are placed on<br />

different disks in a round-robin fashion.<br />

To exploit disk parallelism when reading a large file from a single-threaded<br />

application, whenever it can recognize a pattern, GPFS prefetches data into<br />

its buffer pool (pagepool), issuing I/O requests in parallel to as many disks as<br />

necessary to achieve the bandwidth of which the switching fabric is capable.<br />

GPFS recognizes sequential, reverse sequential, and various forms of strided<br />

access patterns.<br />

For parallel applications GPFS provides enhanced performance by allowing<br />

multiple processes or applications on all nodes in the cluster simultaneous<br />

access to the same file using standard file system calls. GPFS also allows<br />

concurrent reads and writes from multiple nodes. This is a key concept in<br />



parallel processing. Also useful for parallel applications is GPFS’s support of<br />

byte-range locks on file writes so that multiple clients can write to different<br />

byte-ranges within the same file at the same time.<br />

► Very large file and file system sizes<br />

The currently supported limits for both GPFS file system size and file size are 95 TB for Linux and 100 TB for AIX. These supported limits are confined to those configurations that have been tested. The architectural limit for GPFS, however, is 2 PB.
This is substantially more than most available file systems and can be a key advantage as data volumes and file sizes continue to increase.

► Cross cluster file system access<br />

GPFS allows users shared access to files in either the cluster where the file<br />

system was created, or other (remote) GPFS clusters. Each cluster in the<br />

network is managed as a separate cluster, while allowing shared file system<br />

access. When multiple clusters are configured to access the same GPFS file<br />

system, Open Secure Sockets Layer (OpenSSL) is used to authenticate<br />

cross-cluster network connections.<br />

5.3.3 GPFS requirements for Blue Gene/L<br />

Due to the internal structure of the Blue Gene/L system, adding GPFS support<br />

has been a challenging task. In this section we describe some of the major<br />

challenges and the solutions designed to overcome them.<br />

Tip: The GPFS implementation for Blue Gene/L exploits one of the main features of GPFS: cross-cluster mounting of GPFS file systems.
The configuration consists of two GPFS clusters: a storage “server” cluster (in fact, a common GPFS cluster with storage nodes) and a “client“ cluster, consisting of the Blue Gene/L I/O nodes and the SN.

Challenges for GPFS on Blue Gene/L
Challenges for GPFS on Blue Gene/L systems include:

► bgIO cluster - the SN is the only quorum node
► bgIO has no local storage; another cluster is required for storage (gpfsNSD)
► Blue Gene/L I/O nodes are diskless

GPFS code usually runs on AIX and Linux, on stand-alone machines that have dedicated OS disk(s) from which the system boots. Due to its tight integration with the OS, GPFS has been designed to store its code, configuration, and log files on the local disk(s). On Blue Gene/L, all the I/O traffic is done through the I/O nodes, and these have no boot/OS disks. GPFS, however, needs a place to store the files that it uses. These include:

► GPFS code and utilities
► GPFS configuration files (one individual set per cluster node)
► Console log files (one set per node)
► Syslog files
► Traces (if debug is needed)

In addition, due to the GPFS structure (clustering layer, storage abstraction layer, and file system device driver), each node can assume various roles (file system manager, configuration manager, and so on). Because the availability of the I/O nodes is dynamic (per-job block allocation and release), if one of these nodes were to assume a management role inside the cluster, this would cause huge performance problems. Performance would be affected by two factors:

► Cluster reconfiguration requires GPFS management role takeover, which can suspend I/O during such an operation.
► The additional load induced by the GPFS management roles creates an imbalance between the I/O nodes that are allocated for the same job.

Solutions for GPFS on Blue Gene/L
This section describes the approach that IBM development (GPFS and Blue Gene/L) takes to the challenges for GPFS on Blue Gene/L.

► Problem: Access to GPFS files for each I/O node

Solution: Because the SN is the only node in the bgIO cluster that has disks, the files needed by GPFS must be stored on those disks. The /bgl file system on the SN is used for this purpose. The Blue Gene/L I/O nodes access the GPFS files in this file system by means of NFS mounts.
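Conceptually, each I/O node's view of these files corresponds to an NFS mount of the SN's /bgl file system similar to the following sketch. This is illustrative only: on a real system the I/O node boot scripts perform the mount automatically, and the SN address shown is simply the example address used elsewhere in this chapter.

# sketch only - performed automatically by the I/O node boot scripts
mount -t nfs 172.30.1.1:/bgl /bgl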

► Problem: The I/O node bootup sequence must include GPFS handling

Solution: The /bgl/BlueLight/ppcfloor/dist/etc/rc.d/init.d/gpfs script is provided and is installed when installing the Blue Gene/L RPMs.

When called during I/O node boot up, the $BGL_DISTDIR/etc/rc.d/init.d/gpfs script is responsible for creating the necessary symbolic links that allow the GPFS client code to find all the files that it normally uses.



Example 5-7 presents the directories and links that are created in the gpfs script if GPFS_STARTUP=1 is found in the /etc/sysconfig/gpfs file.

Example 5-7 Excerpt from 'gpfs' script

# Directories ....
/bgl/gpfsvar//var/mmfs/gen
/bgl/gpfsvar//var/mmfs/etc
/bgl/gpfsvar//var/mmfs/tmp
/bgl/gpfsvar//var/adm/ras

# Links......
ln -s /bgl/gpfsvar//var/mmfs /var/mmfs
ln -s /bgl/gpfsvar//var/adm/ras /var/adm/ras
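The empty path component in the excerpt (/bgl/gpfsvar//...) presumably stands for a per-I/O-node subdirectory, consistent with the per-node configuration and log files listed earlier. As a rough illustration only (assuming the per-node component is the node's host name, which is an assumption, not the shipped script's actual logic), the excerpt corresponds to shell commands along these lines:

# minimal sketch, not the shipped gpfs init script
NODE=$(hostname)                      # assumed per-node component
for d in var/mmfs/gen var/mmfs/etc var/mmfs/tmp var/adm/ras; do
    mkdir -p /bgl/gpfsvar/$NODE/$d    # per-node state directories on /bgl (NFS)
done
ln -s /bgl/gpfsvar/$NODE/var/mmfs /var/mmfs        # GPFS expects /var/mmfs locally
ln -s /bgl/gpfsvar/$NODE/var/adm/ras /var/adm/ras  # GPFS logs under /var/adm/ras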

► Problem: Preventing I/O nodes from assuming GPFS management roles (cluster manager, file system/stripe group manager)

Solution: The Blue Gene/L I/O cluster, referred to hereafter as bgIO, does not own any file system; rather, it cross-mounts it from another GPFS cluster (referred to hereafter as gpfsNSD).

This solution also clearly separates the administration of the GPFS file system storage from the administration of the bgIO cluster (SN plus I/O nodes). This has the advantage that both the Blue Gene/L system and the GPFS storage cluster (gpfsNSD) can be scaled independently, if required.

► Problem: The GPFS cluster for the Blue Gene/L system (bgIO) is unusual in that it contains only one quorum node, the SN. If this single quorum node goes down, then the cross-mounted GPFS file system, referred to hereafter as the GPFS-FS, will become unmounted.

Solution: This is consistent with the same dependency that the Blue Gene/L system has on the SN. Thus, if the SN is turned off (or becomes unavailable for any reason), the Blue Gene/L system cannot operate anyway. Therefore, it is acceptable that the GPFS-FS will also be unavailable.

5.3.4 GPFS supported levels

It is important that the GPFS packages that are installed for the Blue Gene/L I/O nodes match the level of the code that is installed for the Blue Gene/L driver itself. The installation levels must match because, for the Blue Gene/L I/O nodes, we do not have to build the portability layer; it is already provided by the Blue Gene/L GPFS RPMs.



Important: The GPFS code that is installed for the Blue Gene/L I/O nodes is different from that installed for the SN. The SN runs Linux on IBM System p 64-bit hardware and, therefore, uses GPFS for SUSE Linux (SLES 9) on IBM System p code. The I/O nodes use a PowerPC 440 CPU, which is 32-bit, so this hardware uses a special version of GPFS code that is specific to this environment.

This Blue Gene/L I/O node specific GPFS code can only be downloaded from the following IBM Web site (which is password protected):

https://www14.software.ibm.com/webapp/iwm/web/preLogin.do?source=BGL-BLUEGENE

Access to this IBM Web site is granted to organizations that have purchased the GPFS for Blue Gene/L code.

When the Blue Gene/L driver code level is updated, the GPFS code for the I/O nodes must be reinstalled at the correct level. The code must be at the correct level because, when the I/O nodes boot, they depend on files that exist under the following directory:

/bgl/BlueLight/ppcfloor/dist/etc/rc.d

When the Blue Gene/L driver is updated, the /bgl/BlueLight/ppcfloor symbolic link points to another directory that does not have the GPFS files that are required during the I/O node boot process. You have to re-install the GPFS for MCP (I/O nodes) code into the new Blue Gene/L driver directory, as sketched below.
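A hedged sketch of that reinstallation, reusing the rpm --root form shown later in "Installing the GPFS code for Blue Gene/L I/O nodes"; the value of NEW_DRIVER_ROOT is an assumption and must be replaced with the root directory that the updated driver level actually uses on your system:

# sketch only - point NEW_DRIVER_ROOT at the new driver tree
NEW_DRIVER_ROOT=/bgl/BlueLight/driver/ppc/bglsys/bin/bglOS
cd /tmp/gpfslpp_for_ionodes
rpm --root $NEW_DRIVER_ROOT --nodeps -ivh gpfs*.rpm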

The GPFS Portability Layer for the SN must be built on the SN and the Front-End Nodes (FENs) after the GPFS for SLES RPMs have been installed. The GPFS Portability Layer for the Blue Gene/L I/O nodes is shipped pre-built, so there is no need to build it for the I/O nodes after installing the GPFS for Blue Gene/L RPMs.

5.3.5 How GPFS plugs in

In this section we describe the steps needed to add a GPFS file system to the Blue Gene/L system. We present only the GPFS commands that enable an existing GPFS file system to be cross-mounted on to the Blue Gene/L system.

This section assumes that you follow the GPFS installation and administration procedures documented in the GPFS product manuals listed at the end of this chapter in 5.3.11, "References" on page 264 to build the GPFS storage cluster (gpfsNSD).



Figure 5-2 presents the three high-level steps needed to make a GPFS file system available on Blue Gene/L. The essential concept that we use to make this possible is the ability of GPFS 2.3 to allow one GPFS cluster to mount a GPFS file system that belongs to a remote GPFS cluster.

While it is possible to add the NSD (Network Shared Disk) storage servers directly to the bgIO cluster and provide a locally owned GPFS file system, this is not recommended, because the dynamic nature of the Blue Gene/L system might cause the GPFS cluster performance problems (see "Challenges for GPFS on Blue Gene/L" on page 227).

Figure 5-2 Plugging in GPFS in steps (diagram: 1. create the GPFS storage cluster gpfsNSD, with p5 servers 172.30.1.31-33 and IBM DS4500 storage, and create a GPFS file system, /gpfs1; 2. create the bgIO GPFS cluster with the Service Node and I/O nodes; 3. cross mount the GPFS file system /gpfs1 from gpfsNSD onto the Blue Gene cluster bgIO over the functional Ethernet, 172.30.0.0/16)

You can find the latest detailed instructions for installing GPFS on Blue Gene/L in the GPFS "How to" document, which is available at:

https://www14.software.ibm.com/webapp/iwm/web/preLogin.do?source=BGL-BLUEGENE

The three major steps that are required to enable a GPFS file system on a Blue Gene/L system are:

1. Create the GPFS file system on a remote cluster (gpfsNSD).
2. Create a GPFS cluster on Blue Gene/L (bgIO). This step creates the bgIO cluster with just the SN.
3. Cross mount the GPFS file system from gpfsNSD on to bgIO. This step includes adding the I/O nodes for Blue Gene/L to the bgIO cluster.



5.3.6 Creating the GPFS file system on a remote cluster (gpfsNSD)

Figure 5-3 shows the GPFS cluster that we built for our test environment. This GPFS cluster uses three nodes running AIX 5L V5.3 and has the GPFS file system (mounted as /gpfs1) using four LUNs that reside on 4+P RAID5 arrays from a DS4500 storage controller.

Figure 5-3 GPFS storage cluster (diagram: nodes p630n01_fn, p630n02_fn, and p630n03_fn on the functional Gbit Ethernet, attached to DS4500 and TotalStorage EXP710 storage; the GPFS storage cluster gpfsNSD owns the GPFS file system mounted as /gpfs1)

You can create the GPFS file system on a cluster using either AIX or Linux nodes. We do not discuss the process of creating this cluster in detail here. However, the remote cluster must conform to the following rules:

► The SN is not included in this cluster.
► All nodes in this storage cluster must be able to access the SN and all Blue Gene/L I/O nodes across the functional network.

5.3.7 Creating a GPFS cluster on Blue Gene/L (bgIO)

The GPFS code that is installed on the SN is different from that installed for the Blue Gene/L I/O nodes. In this section we create the GPFS cluster on the Blue Gene/L system with only one node, which is the SN. Figure 5-4 presents a diagram of our cluster before we installed GPFS and configured the bgIO cluster.

Figure 5-4 Blue Gene/L system before the bgIO cluster is created (diagram: the Service Node bglsn_fn, 172.30.1.1, an OpenPower server running SLES 9 PPC 64-bit with /bgl local, connected over the functional Gbit Ethernet, 172.30.0.0/16, to the I/O nodes, which mount /bgl over NFS)

The I/O nodes can only be added to this GPFS cluster when the block that they serve has been initialized and after the cross-mount of the GPFS file system. Here are the high-level steps that are required to create a GPFS cluster that uses only the Blue Gene/L SN:

1. Install the GPFS code for the SN, as described in "Installing the GPFS code for SN" on page 234.
2. Install the GPFS code for the Blue Gene/L I/O nodes, as described in "Installing the GPFS code for Blue Gene/L I/O nodes" on page 235.
3. Configure ssh and scp on all Blue Gene/L nodes, as described in "Configuring ssh and scp on SN and I/O nodes" on page 237.
4. Create the bgIO cluster, as described in "Creating the bgIO cluster" on page 244.



Installing the GPFS code for SN
Figure 5-5 illustrates the steps to install the GPFS code for the SN.

Figure 5-5 GPFS code install on the Blue Gene/L system (diagram: on the Service Node bglsn_fn, 172.30.1.1, SLES 9 PPC 64-bit, install the GPFS RPMs for PPC64 and compile the portability layer; into the local /bgl tree, served to the I/O nodes over the functional Gbit Ethernet, install the GPFS Blue Gene RPMs for MCP)

To install the code, follow these steps:

1. Create a new directory for the GPFS code for the SN and populate it with the correct GPFS RPMs.

You can do this by copying the self-extracting product image, gpfs_install-2.3.0-0_sles9_ppc64, from the GPFS for Linux on POWER CD-ROM to the new directory. Example 5-8 shows the commands that we used.

Example 5-8 Installing GPFS code on SN

root@bglsn_fn~/> mkdir -p /tmp/gpfslpp_for_servicenode/updates
root@bglsn_fn~/> cd /tmp/gpfslpp_for_servicenode
root@bglsn_fn~/> cp /cdrom/*/gpfs_install-2.3.0-0_sles9_ppc64 .
root@bglsn_fn~/> ./gpfs_install-2.3.0-0_sles9_ppc64 --silent



After you accept the license agreement, the GPFS product installation images reside in the extraction target directory (in our case, /tmp/gpfslpp_for_servicenode). See Example 5-9.

Example 5-9 GPFS RPMs for SN

root@bglsn_fn~/> cd /tmp/gpfslpp_for_servicenode
root@bglsn_fn~/> ls gpfs.*
gpfs.base-2.3.0-11.sles9.ppc64.rpm
gpfs.docs-2.3.0-11.noarch.rpm
gpfs.gpl-2.3.0-11.noarch.rpm
gpfs.msg.en_US-2.3.0-11.noarch.rpm

2. Install the GPFS code for the SN:

cd /tmp/gpfslpp_for_servicenode
rpm -ivh gpfs*.rpm

3. Install any updates that are available. To do this, copy any update RPMs to the /tmp/gpfslpp_for_servicenode/updates directory and then issue the following commands:

cd /tmp/gpfslpp_for_servicenode/updates
rpm -Uvh gpfs*.rpm

4. Create the GPFS portability layer binaries. Follow the instructions in /usr/lpp/mmfs/src/README (on the SN). The files mmfslinux, lxtrace, tracedev, and dumpconv are installed in /usr/lpp/mmfs/bin after you have completed the instructions.

Installing the GPFS code for Blue Gene/L I/O nodes
To install the GPFS code for the Blue Gene/L I/O nodes, follow these steps:

1. Download the GPFS for MCP RPMs on to the SN.

Create a new directory for the GPFS code for the I/O nodes and copy the correct RPMs into it. You can download these RPMs from the secure Blue Gene/L software portal:

mkdir -p /tmp/gpfslpp_for_ionodes/updates

Attention: Make sure you do not mix the RPMs for PPC64 with the ones for the I/O nodes (PPC440, 32-bit)!



You should see a list similar to the following:

cd /tmp/gpfslpp_for_ionodes; ls gpfs.*
gpfs.base-2.3.0-11.ppc.rpm
gpfs.docs-2.3.0-11.noarch.rpm
gpfs.gplbin-2.3.0-11.ppc.rpm
gpfs.msg.en_US-2.3.0-11.noarch.rpm
gpfs.gpl-2.3.0-11.noarch.rpm

2. Install the GPFS code for the I/O nodes:

cd /tmp/gpfslpp_for_ionodes
rpm --root /bgl/BlueLight/driver/ppc/bglsys/bin/bglOS --nodeps -ivh gpfs*.rpm

Note: It is important to note the following rpm command argument:

--root <directory>

This argument forces the specified <directory> to be used as the root directory for the RPM installation.
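To confirm what was installed under that alternate root, rpm can be queried against the same root directory. A hedged example (the grep pattern is just illustrative):

rpm --root /bgl/BlueLight/driver/ppc/bglsys/bin/bglOS -qa | grep gpfs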

3. Install any updates that are available for the code. To do this, copy any update RPMs to the /tmp/gpfslpp_for_ionodes/updates directory and issue the following commands:

cd /tmp/gpfslpp_for_ionodes/updates
rpm -Uvh gpfs*.rpm



Configuring ssh and scp on SN and I/O nodes

Important: In this section, we use the following naming conventions:

► $BGL_SITEDISTDIR: normally points to the /bgl/dist directory.
► $BGL_DISTDIR: normally points to the /bgl/BlueLight/ppcfloor/dist directory.
► $BGL_SNIP is the SN's IP address on the functional network.
► $SN_HOSTNAME is the SN's host name on the functional network.
► $IONODE_IPS is a wildcarded IP address representing all I/O nodes.

For example, if the I/O nodes have IP addresses 172.30.100.1 through 172.30.100.128 and 172.30.101.1 through 172.30.101.128, a reasonable value for $IONODE_IPS would be 172.30.10?.*

► $IONODE_HOSTNAMES is a wildcarded host name representing all I/O nodes.

For example, if the I/O nodes have host names ionode1, ionode2, and so forth, a reasonable value for $IONODE_HOSTNAMES would be ionode*.

In these examples we have chosen to use the RSA2 type for ssh keys. You can choose other key types (RSA1, DSA). However, whichever type you choose, it is strongly recommended that you use the same key type for all ssh keys.

GPFS needs to execute commands and copy configuration files between all nodes in the cluster without being prompted for a password. For GPFS on Linux, the default remote command execution and copy programs are secure shell and secure copy (ssh/scp). This is why we have to prepare the nodes (SN and I/O nodes - see Figure 5-6).



Figure 5-6 Configuring ssh on SN and I/O nodes (for GPFS) (diagram: configure SSH for the Service Node, bglsn_fn, 172.30.1.1, SLES 9 PPC 64-bit with /bgl local, and configure SSH for the I/O nodes, which mount /bgl over NFS, all connected by the functional Gbit Ethernet)

To configure ssh and scp on the SN and I/O nodes, follow these steps:

1. Ensure that the host name that is associated with the SN is unique. Check both the /etc/hosts file and DNS.

Note: To avoid network and name resolution problems, we strongly recommend that you maintain consistent name resolution using local files (/etc/hosts). Even though you can use DNS, it is not useful to add DNS entries for I/O nodes, because they should not be accessible directly by the users.

2. In the /etc/hosts file, add an entry for each I/O node, and check for duplicate IP addresses or IP labels (names).

3. Copy this newly updated /etc/hosts file to the correct directory in the Blue Gene/L tree (to make it available to the I/O nodes):

cp /etc/hosts $BGL_SITEDISTDIR/etc/hosts
chmod 644 $BGL_SITEDISTDIR/etc/hosts



4. Create and verify the directories for the root user on the SN and the I/O nodes, as shown in Example 5-10.

Example 5-10 Directories for ssh client files

root@bglsn_fn~/> chmod 755 $BGL_SITEDISTDIR
root@bglsn_fn~/> mkdir $BGL_SITEDISTDIR/root
root@bglsn_fn~/> chmod 700 $BGL_SITEDISTDIR/root
root@bglsn_fn~/> mkdir $BGL_SITEDISTDIR/root/.ssh
root@bglsn_fn~/> chmod 700 $BGL_SITEDISTDIR/root/.ssh

5. Check the ssh key pair for the root user on the I/O nodes. Check for both the private key file (/bgl/dist/root/.ssh/id_rsa) and the public key file (/bgl/dist/root/.ssh/id_rsa.pub). If the keys have not been created, use the following command:

ssh-keygen -t rsa -b 1024 -f $BGL_SITEDISTDIR/root/.ssh/id_rsa -N ''

6. Check the ssh key pair for the root user on the SN. Check for both the private key file (~/.ssh/id_rsa) and the public key file (~/.ssh/id_rsa.pub). If these have not been created, use the following command:

ssh-keygen -t rsa -b 1024 -f /root/.ssh/id_rsa -N ''

7. Check the ssh key pair for the ssh daemon on the I/O nodes. Check for both the private key file (/bgl/dist/etc/ssh/ssh_host_rsa_key) and the public key file (/bgl/dist/etc/ssh/ssh_host_rsa_key.pub). If these have not been created, use the following command:

ssh-keygen -t rsa -b 1024 -f $BGL_SITEDISTDIR/etc/ssh/ssh_host_rsa_key -N ''

8. Check the ssh key pair for the ssh daemon on the SN. Check for both the private key file (/etc/ssh/ssh_host_rsa_key) and the public key file (/etc/ssh/ssh_host_rsa_key.pub). If these have not been created (most unlikely!), use the following command:

ssh-keygen -t rsa -b 1024 -f /etc/ssh/ssh_host_rsa_key -N ''

9. Create the authorized_keys file for all nodes in the bgIO cluster. Copy the root user's public key file from the SN to a temporary file (/tmp/authorized_keys). Then, append the root user's public key file from the I/O nodes to it:

cat /root/.ssh/id_rsa.pub >> /tmp/authorized_keys
cat $BGL_SITEDISTDIR/root/.ssh/id_rsa.pub >> /tmp/authorized_keys



Having created the /tmp/authorized_keys file, distribute it. Check whether either the SN or the I/O nodes already have an authorized_keys file. If one already exists, then append the /tmp/authorized_keys file to the end of the existing one:

cat /tmp/authorized_keys >> /root/.ssh/authorized_keys
cat /tmp/authorized_keys >> $BGL_SITEDISTDIR/root/.ssh/authorized_keys

10. Create the known_hosts file for all nodes in the bgIO cluster. Create a temporary known_hosts file for both the SN and the I/O nodes. Then combine these two files to create the /tmp/known_hosts_gpfs file, as shown in Example 5-11.

Example 5-11 Creating the known_hosts file

root@bglsn_fn~/> echo "$BGL_SNIP,$SN_HOSTNAME $(cat /etc/ssh/ssh_host_rsa_key.pub)" >> \
/tmp/known_hosts_sn
root@bglsn_fn~/> echo "$IONODE_IPS,$IONODE_HOSTNAMES $(cat \
/bgl/dist/etc/ssh/ssh_host_rsa_key.pub)" >> /tmp/known_hosts_io
root@bglsn_fn~/> cp /tmp/known_hosts_sn /tmp/known_hosts_gpfs
root@bglsn_fn~/> cat /tmp/known_hosts_io >> /tmp/known_hosts_gpfs

Note: The variables $BGL_SNIP, $SN_HOSTNAME, $IONODE_IPS, and $IONODE_HOSTNAMES are explained in the Important box at the beginning of "Configuring ssh and scp on SN and I/O nodes".

Example 5-12 shows one entry in the file that uses the wildcard character (*). This character saves having to add entries for every I/O node individually.

Example 5-12 The known_hosts file entry that uses wild card chars

172.30.2.*,ionode* ssh-rsa
AAAAB3NzaC1yc2EAAAABIwAAAIEA27GK+WllP58rmK//LGhE4NKBHDdb30x4Kvrkb3ibbRs
41eHuLE3/KIV0IQkwi36F4hg5gRBC2vbBINaIJvwiybovpoL2gfpFTeRworWvVI3goBAJh/
/hIeT+J9sm+Iogxe2iQ6Q6TfsdPss4dkq3nvGM/HmUULsohgT3u494vVc= root@bglsn

After creating the /tmp/known_hosts_gpfs file, distribute it (see Example 5-13). Check whether either the SN or the I/O nodes already have a known_hosts file. If one already exists, then append the /tmp/known_hosts_gpfs file to the end of the existing one.



Example 5-13 Distributing known_hosts file

root@bglsn_fn~/> cat /tmp/known_hosts_gpfs >> /root/.ssh/known_hosts
root@bglsn_fn~/> touch $BGL_SITEDISTDIR/root/.ssh/known_hosts
root@bglsn_fn~/> cat /tmp/known_hosts_gpfs >> \
$BGL_SITEDISTDIR/root/.ssh/known_hosts

Attention: If the authorized_keys and known_hosts files already exist and you want to append to these files, check for duplicate entries. If there are duplicate entries, only the first occurrence is considered, so you can run into authentication problems!
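A hedged way to drop exact duplicate lines from these files (a sketch only; sort reorders the entries, which ssh and sshd tolerate, but review the result before relying on it):

sort -u /root/.ssh/authorized_keys -o /root/.ssh/authorized_keys
sort -u /root/.ssh/known_hosts -o /root/.ssh/known_hosts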

11. Test unprompted command execution.

Verify that the ssh files are configured properly by using ssh between all the bgIO cluster nodes without being prompted for a password or host key acceptance. This test requires the sshd daemon to be running on the I/O nodes to be tested. The simplest way to achieve this is to ensure that you have a sitefs file in the $BGL_SITEDISTDIR/etc/rc.d/init.d directory and that this sitefs file includes the following line:

echo "GPFS_STARTUP=1" >> /etc/sysconfig/gpfs

If you do not have a sitefs file, then you can create one using the example found at the end of the following file:

$BGL_SITEDISTDIR/docs/ionode.README

For your convenience, this file is also included in Appendix C, "The ionode.README file" on page 431.

The sitefs script that we used for this book is shown in Appendix B, "The sitefs file" on page 423. The lines that are important to check for in your sitefs script are shown in bold in Example 5-14.

Example 5-14 The sitefs file with GPFS enabled

# Uncomment the first line to start GPFS.
# Optionally uncomment the other lines to change the defaults for
# GPFS.
echo "GPFS_STARTUP=1" >> /etc/sysconfig/gpfs
# echo "GPFS_VAR_DIR=/bgl/gpfsvar" >> /etc/sysconfig/gpfs



When the "GPFS_STARTUP=1" line is included in the sitefs script, and the sitefs script is also linked into the startup script files, the sshd daemon is started on the I/O nodes at bootup by the S16sshd startup script. Check that the following symbolic links are in place so that your sitefs file will be called during I/O node initialization:

ln -s $BGL_SITEDISTDIR/etc/rc.d/init.d/sitefs \
$BGL_SITEDISTDIR/etc/rc.d/rc3.d/S10sitefs
ln -s $BGL_SITEDISTDIR/etc/rc.d/init.d/sitefs \
$BGL_SITEDISTDIR/etc/rc.d/rc3.d/K90sitefs

Now you can boot a block and check that you can connect using ssh to one I/O node and then from this I/O node back to the SN. First, boot a block and establish the IP addresses of the I/O nodes by using the {i} write_con hostname command from the mmcs_db_console (see Example 5-15).

Example 5-15 Using mmcs_db_console to boot a block and check for I/O nodes

mmcs$ list_blocks
OK
R000_128 root(1) connected
mmcs$ select_block R000_128
OK
mmcs$ redirect R000_128 on
OK
mmcs$ {i} write_con hostname
OK
mmcs$ Mar 22 14:47:25 (I) [1079031008] {119}.0: h
Mar 22 14:47:25 (I) [1079031008] {102}.0: h
Mar 22 14:47:25 (I) [1079031008] {17}.0: h
Mar 22 14:47:25 (I) [1079031008] {17}.0: ostname
172.30.2.2
................. >> ...............
Mar 22 14:47:25 (I) [1079031008] {0}.0: ostname
172.30.2.1
$
Mar 22 14:47:25 (I) [1079031008] {34}.0: ostname
172.30.2.5
$

Example 5-15 shows that we have eight I/O nodes with IP addresses from 172.30.2.1 through 172.30.2.8. Now, check whether you can connect using ssh to one of these nodes from the SN and then back again, using both the IP address and the node name as listed in the /etc/hosts file:

bglsn:/tmp # ssh root@172.30.2.1
BusyBox v1.00-rc2 (2006.01.09-19:48+0000) Built-in shell (ash)
Enter 'help' for a list of built-in commands.

This shows that you have connected with ssh to the I/O node with the IP address 172.30.2.1. Now, ssh back to the SN using its IP address:

$ ssh root@172.30.1.1
Last login: Wed Mar 22 14:45:13 2006 from 192.168.100.60
bglsn:~ #
bglsn:~ # exit
logout
Connection to bglsn_fn.itso.ibm.com closed.

Now, show that you can also ssh between an I/O node and the SN using the alias names rather than the IP addresses, as shown in Example 5-16.

Example 5-16 Verifying ssh connection using IP labels

$ hostname
ionode1
$ ssh root@bglsn_fn.itso.ibm.com
Last login: Wed Mar 22 15:31:09 2006 from ionode1
bglsn:~ # ssh root@ionode1
BusyBox v1.00-rc2 (2006.01.09-19:48+0000) Built-in shell (ash)
Enter 'help' for a list of built-in commands.
$ hostname
ionode1
$ exit
Connection to ionode1 closed.
bglsn:~ # exit
logout
Connection to bglsn_fn.itso.ibm.com closed.
$ exit
Connection to ionode1 closed.
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin #

This concludes our ssh tests; we have now confirmed that the ssh setup is correct.



Creating the bgIO cluster
After you have installed the GPFS packages and have configured remote command execution, you can create the GPFS cluster (bgIO). Figure 5-7 illustrates this step.

Figure 5-7 Creating the GPFS cluster named bgIO on the Blue Gene/L system

To create the bgIO cluster, follow these steps:

1. Create a GPFS node file called service.node that contains only the SN entry. Initially we have a single node in the bgIO cluster (Example 5-17).

Example 5-17 GPFS node definition file for bgIO cluster

bglsn:/tmp # echo "$SN_HOSTNAME:quorum" >> /tmp/service.node
bglsn:/tmp # cat service.node
bglsn_fn:quorum
bglsn:/tmp #

2. Use the /tmp/service.node file to create the bgIO cluster. Here is the command to be issued from the SN:

/usr/lpp/mmfs/bin/mmcrcluster -n service.node -p bglsn_fn -C bgIO \
-A -r /usr/bin/ssh -R /usr/bin/scp

Set the pagepool to 128M and any other GPFS configuration parameters that you might want to change at this time. Changing the pagepool from the default of 64 MB to 128 MB improves performance. Values larger than 128 MB can result in GPFS not being able to load (an I/O node has only 2 GB of RAM).

# mmchconfig pagepool=128M
# mmchconfig dataStructureDump=/var/mmfs/tmp



3. Check the bgIO cluster to verify the parameters, as shown in Example 5-18.

Example 5-18 The bgIO cluster configuration

bglsn:/tmp # /usr/lpp/mmfs/bin/mmlscluster

GPFS cluster information
========================
GPFS cluster name: bgIO.itso.ibm.com
GPFS cluster id: 12402351528774401789
GPFS UID domain: bgIO.itso.ibm.com
Remote shell command: /usr/bin/ssh
Remote file copy command: /usr/bin/scp

GPFS cluster configuration servers:
-----------------------------------
Primary server: bglsn_fn.itso.ibm.com
Secondary server: (none)

Node number Node name IP address Full node name Remarks
-----------------------------------------------------------------------------------
1 bglsn_fn 172.30.1.1 bglsn_fn.itso.ibm.com quorum node

4. Start the bgIO GPFS cluster using the mmstartup -a command, and check the /var/adm/ras/mmfs.log.latest file to ensure that GPFS has started (see Example 5-19); look for 'mmfsd ready'.

Example 5-19 The mmfs.log.latest file showing that GPFS is started and ready

bglsn:/mnt/chriss/gpfs/BGL # /usr/lpp/mmfs/bin/mmstartup -a
Mon Mar 20 14:14:30 EST 2006: mmstartup: Starting GPFS ...
bglsn:/mnt/chriss/gpfs/BGL # cat /var/adm/ras/mmfs.log.latest
Mon Mar 20 14:14:31 EST 2006 runmmfs starting
Removing old /var/adm/ras/mmfs.log.* files:
/bin/mv: cannot stat `/var/adm/ras/mmfs.log.previous': No such file or directory
Unloading modules from /usr/lpp/mmfs/bin
Loading modules from /usr/lpp/mmfs/bin
Module Size Used by
mmfslinux 268384 1 mmfs
tracedev 35552 2 mmfs,mmfslinux
Removing old /var/mmfs/tmp files:
Mon Mar 20 14:14:33 2006: mmfsd initializing. {Version: 2.3.0.10
Built: Jan 16 2006 13:07:54} ...
Mon Mar 20 14:14:34 EST 2006 /var/mmfs/etc/gpfsready invoked
Mon Mar 20 14:14:34 2006: mmfsd ready
Mon Mar 20 14:14:34 EST 2006: mmcommon mmfsup invoked
Mon Mar 20 14:14:34 EST 2006: /var/mmfs/etc/mmfsup.scr invoked
bglsn:/mnt/chriss/gpfs/BGL #

5.3.8 Cross mounting the GPFS file system on to Blue Gene/L cluster

Most of this section deals with authentication and the exchange of OpenSSL-based certificates (keys). The following steps are necessary to cross mount the GPFS file system on the Blue Gene/L:

1. Configure GPFS authentication on the gpfsNSD and bgIO clusters.
2. Mount the GPFS file system on the SN.
3. Add all the I/O nodes to the bgIO cluster.
4. Boot a block and check for automatic mount of the GPFS file system.

Configuring authentication on gpfsNSD and bgIO cluster
Figure 5-8 illustrates our environment before the bgIO and gpfsNSD clusters are authenticated with each other.

Figure 5-8 Configure GPFS authentication on both clusters



To configure authentication on the gpfsNSD and bgIO clusters, follow these steps:

1. Generate the SSL keys on the gpfsNSD and bgIO clusters.

On both the gpfsNSD and bgIO clusters, first ensure that GPFS is stopped and that the openssl packages are installed. On one node in the gpfsNSD cluster and on the SN, run the mmauth genkey command, as shown in Example 5-20. This command generates the public/private key pair, which is saved in the /var/mmfs/ssl directory.

Example 5-20 Generating GPFS cluster ssl keys

###### On one node in gpfsNSD cluster (p630n01):
[p630n01][/]> /usr/lpp/mmfs/bin/mmauth genkey
Verifying GPFS is stopped on all nodes ...
Generating RSA private key, 512 bit long modulus
...............++++++++++++
...............++++++++++++
e is 65537 (0x10001)
id_rsa1 100% 497 0.5KB/s 00:00
id_rsa1 100% 497 0.5KB/s 00:00
mmauth: Command successfully completed

####### and on the Service Node:
bglsn:/root # /usr/lpp/mmfs/bin/mmauth genkey
Verifying GPFS is stopped on all nodes ...
Generating RSA private key, 512 bit long modulus
...............++++++++++++
...............++++++++++++
e is 65537 (0x10001)
id_rsa1 100% 497 0.5KB/s 00:00
id_rsa1 100% 497 0.5KB/s 00:00
mmauth: Command successfully completed

2. Set cipherList=AUTHONLY on both clusters.

On both the gpfsNSD and bgIO clusters, ensure that GPFS is stopped. On one node in each cluster, run the mmchconfig cipherList=AUTHONLY command, as shown in Example 5-21. This setting tells GPFS to authenticate and check authorization for network connections, which is required for cross-cluster communications.



Example 5-21 Telling clusters to authenticate cross-cluster connections

bglsn:/root # /usr/lpp/mmfs/bin/mmchconfig cipherList=AUTHONLY
Verifying GPFS is stopped on all nodes ...
mmchconfig: Command successfully completed
mmchconfig: 6027-1371 Propagating the changes to all affected nodes.
This is an asynchronous process.

3. Exchange the SSL public keys between the clusters.

Copy the bgIO cluster public key to one of the nodes in the gpfsNSD cluster, and the gpfsNSD cluster public key to the SN. Then, add the keys to the authorization list on each cluster using the mmauth add command on both clusters, as shown in Example 5-22.

Example 5-22 Authenticating bgIO and gpfsNSD clusters

# On one node in gpfsNSD cluster:
[p630n01][/]> scp bglsn_fn:/var/mmfs/ssl/id_rsa.pub ~/id_rsa.pub.bgIO
[p630n01][/]> /usr/lpp/mmfs/bin/mmauth add bgIO -k ~/id_rsa.pub.bgIO

# and, on the Service Node:
bglsn:/ # scp p630n01_fn:/var/mmfs/ssl/id_rsa.pub ~/id_rsa.pub.gpfsNSD
bglsn:/ # /usr/lpp/mmfs/bin/mmauth add gpfsNSD -k ~/id_rsa.pub.gpfsNSD \
-n p630n01_fn,p630n02_fn

Note: As of the current GPFS version (V2.3), you need to specify the NSD nodes when authorizing the remote cluster; in our case, p630n01_fn and p630n02_fn.

4. Allow bgIO access to the GPFS-FS exported by gpfsNSD.

Start the GPFS daemons on the gpfsNSD cluster, and then use the mmauth grant command to allow bgIO access to the GPFS file system. In the case shown, the device name of the GPFS file system to which bgIO is granted access is gpfs1. The bgIO cluster name used with the mmauth grant command must be the actual name of the cluster, as shown by the mmlscluster command on the SN (in our case, bgIO).

[p630n01][/]> /usr/lpp/mmfs/bin/mmstartup -a
[p630n01][/]> /usr/lpp/mmfs/bin/mmauth grant bgIO -f gpfs1

Note: The cluster name is bgIO.itso.ibm.com, but the short name is allowed.
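If you want to confirm that the grant is in place before moving on, and your GPFS level provides the mmauth show subcommand (an assumption; the subcommand set varies by release), it lists the remote clusters and the file systems they are allowed to access when run on the gpfsNSD cluster:

[p630n01][/]> /usr/lpp/mmfs/bin/mmauth show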



5. Add the remote file system to the bgIO cluster.

On the bgIO cluster, ensure that GPFS is shut down. Then, run the mmremotefs add command. This command tells the bgIO cluster about the remote file system that it can mount. Note that these commands must be issued by the root user. The cluster name used with the mmremotefs add command after the -C parameter must be the actual name of the gpfsNSD cluster, as shown by the mmlscluster command run on the gpfsNSD cluster.

The local device name for the remote GPFS-FS is bubu_gpfs1, and /bubu is the local mount point for the file system:

bglsn:/root # /usr/lpp/mmfs/bin/mmremotefs add bubu_gpfs1 -f gpfs1 \
-C gpfsNSD -T /bubu

Mounting the GPFS file system on the SN
Figure 5-9 illustrates the cross-mounted file system that we provided for our Blue Gene/L system.

Figure 5-9 Cross-mount /gpfs1 from the gpfsNSD cluster on the SN



To mount the GPFS file system on the SN, follow these steps:

1. On the SN, start GPFS and ensure that the remote file system can be mounted, as shown in Example 5-23.

Example 5-23 Mounting the remote file system on bgIO

bglsn:/root # /usr/lpp/mmfs/bin/mmstartup -a
Mon Mar 20 15:00:11 EST 2006: mmstartup: Starting GPFS ...
bglsn:/root # mount /bubu
bglsn:/root # df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sdb3 70614928 4632540 65982388 7% /
tmpfs 1898508 8 1898500 1% /dev/shm
/dev/sda4 489452 50972 438480 11% /tmp
/dev/sda1 9766544 1997804 7768740 21% /bgl
/dev/sda2 9766608 698992 9067616 8% /dbhome
p630n03:/bglscratch 36700160 5952 36694208 1% /bglscratch
p630n03_fn:/nfs_mnt 104857600 11300064 93557536 11% /mnt
/dev/bubu_gpfs1 1138597888 2918400 1135679488 1% /bubu

2. Enable the remote file system from the gpfsNSD cluster (/gpfs1) to mount automatically over the local mount point (/bubu) when GPFS is started on the bgIO cluster. Use the mmremotefs update command, as shown in Example 5-24.

Example 5-24 Changing remote file system to automount at bgIO cluster startup

bglsn:/mnt/chriss/gpfs/BGL # /usr/lpp/mmfs/bin/mmremotefs update bubu_gpfs1 -A yes
bglsn:/mnt/chriss/gpfs/BGL # /usr/lpp/mmfs/bin/mmshutdown -a
......
bglsn:/mnt/chriss/gpfs/BGL # /usr/lpp/mmfs/bin/mmstartup -a
Mon Mar 20 15:08:12 EST 2006: mmstartup: Starting GPFS ...
bglsn:/mnt/chriss/gpfs/BGL # df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sdb3 70614928 4635404 65979524 7% /
tmpfs 1898508 8 1898500 1% /dev/shm
/dev/sda4 489452 50972 438480 11% /tmp
/dev/sda1 9766544 2005192 7761352 21% /bgl
/dev/sda2 9766608 698992 9067616 8% /dbhome
p630n03:/bglscratch 36700160 5952 36694208 1% /bglscratch
p630n03_fn:/nfs_mnt 104857600 11300064 93557536 11% /mnt
/dev/bubu_gpfs1 1138597888 2918400 1135679488 1% /bubu
bglsn:/mnt/chriss/gpfs/BGL #



Adding all the I/O nodes to the bgIO cluster
Figure 5-10 illustrates our complete GPFS on Blue Gene/L configuration. Adding all I/O nodes to the bgIO cluster is the last step before you can actually start using the GPFS file system for running user jobs.

Figure 5-10 Adding I/O nodes to the bgIO GPFS cluster

To add all the I/O nodes to the bgIO cluster, follow these steps:

1. Create a node definition file (ionodes) that contains a list of the I/O nodes to be added (their IP labels), as shown in Example 5-25.

Note: You can choose not to add all I/O nodes to the bgIO cluster. However, this means that certain I/O nodes will not be able to access the GPFS file systems. This is acceptable if you use manual block allocation or if the job submission system (automated scheduler) can be made aware of this configuration.



Example 5-25 Node definition file for I/O nodes

bglsn:/mnt/chriss/gpfs/BGL # cat /tmp/ionodes
ionode1
ionode2
ionode3
ionode4
ionode5
ionode6
ionode7
ionode8

2. Before you add the nodes to the bgIO cluster, ensure that a block is booted that contains all the I/O nodes you want to use with GPFS. Then add the nodes to the bgIO cluster using the mmaddnode command, as shown in Example 5-26.

Example 5-26 Adding the I/O nodes to the bgIO cluster

bglsn:/mnt/chriss/gpfs/BGL # /usr/lpp/mmfs/bin/mmaddnode -n /tmp/ionodes
Mon Mar 20 15:13:34 EST 2006: mmaddnode: Processing node ionode1
Mon Mar 20 15:13:35 EST 2006: mmaddnode: Processing node ionode2
Mon Mar 20 15:13:36 EST 2006: mmaddnode: Processing node ionode3
Mon Mar 20 15:13:37 EST 2006: mmaddnode: Processing node ionode4
Mon Mar 20 15:13:38 EST 2006: mmaddnode: Processing node ionode5
Mon Mar 20 15:13:39 EST 2006: mmaddnode: Processing node ionode6
Mon Mar 20 15:13:40 EST 2006: mmaddnode: Processing node ionode7
Mon Mar 20 15:13:41 EST 2006: mmaddnode: Processing node ionode8
mmaddnode: Command successfully completed
mmaddnode: Propagating the changes to all affected nodes.
This is an asynchronous process.

Attention: If some of the nodes are not available (not booted, bad network connection, ssh not configured, and so forth), you need to correct the situation and retry the mmaddnode command with the respective nodes.

Booting a block and checking automatic mount of the GPFS-FS
You now have to start GPFS on the newly added nodes. The best way to do this is to de-allocate the block and then allocate it again. In this way, you can check that the file system (/bubu) is mounted on the I/O nodes automatically. Example 5-27 shows the check performed for a small block that has just two I/O nodes, using mmcs_db_console.



Example 5-27 Using mmcs_db_console to check the GPFS file system on I/O nodes

mmcs$ allocate R000_J106_32
OK
mmcs$ select_block R000_J106_32
OK
mmcs$ redirect R000_J106_32 on
OK
mmcs$ {i} write_con df | grep bubu
OK
mmcs$ Apr 04 16:09:20 (I) [1083225312] {17}.0: d
Apr 04 16:09:20 (I) [1083225312] {17}.0: f | grep bubu
/dev/bubu_gpfs1 1138597888 3204096 1135393792 1% /bubu
$
Apr 04 16:09:20 (I) [1083225312] {0}.0: d
Apr 04 16:09:20 (I) [1083225312] {0}.0: f | grep bubu
/dev/bubu_gpfs1 1138597888 3204096 1135393792 1% /bubu
$

Finally, Figure 5-11 shows the bgIO cluster with the I/O nodes active and the GPFS file system mounted. The block is now ready for running jobs.

Figure 5-11 GPFS file system cross mounted to bgIO and I/O nodes added



5.3.9 GPFS problem determination methodology

The methodology presented in this section is intended to help with a wide variety of problems. It is set out in sections that first deal with the SN and then with the bgIO cluster. It is intended that you work through the methodology in the order that we present it and, if a check passes, proceed to the next check until you find the problem. If you think you already know in which area the problem lies, then by all means go straight to that section. If, however, you are unsure of exactly where the problem lies, use the methodology in the order presented, because this often uncovers the simplest problems quickly and easily, before you spend a long time looking for a solution to an assumed problem rather than the one you actually have.

Checking that the GPFS-FS can be mounted on the SN
Here is the methodology. All the commands in this section should be run on the SN as root.

► Check that GPFS is started on the SN, as described in "Checking that the GPFS is started" on page 255.
► Check the GPFS log files for problems, as described in "Checking the GPFS log files for problems" on page 256.
► Check that the GPFS-FS can mount on the SN, as described in "Checking that the GPFS-FS can mount on the SN" on page 257.
► Check that the GPFS-FS can mount on the I/O nodes, as described in "Checking that the GPFS-FS can mount on the I/O nodes" on page 257.
► Check that the GPFS-FS is configured on the SN, as described in "Checking that the file system is configured on the SN" on page 259.
► Check that the SN is authorized to mount the GPFS-FS, as described in "Checking that the SN is authorized to mount the GPFS-FS" on page 261.

Checking that the GPFS-FS can be mounted on the gpfsNSD cluster
Here is the methodology. You should run all of the commands in this section on one of the gpfsNSD cluster nodes as root.

► Check that GPFS is started on all nodes of the gpfsNSD cluster, as described in "Checking that the GPFS is started" on page 255.
► Check that the GPFS-FS is configured on the gpfsNSD cluster, as described in "Checking that GPFS-FS is configured on the gpfsNSD cluster" on page 261.
► Check that the GPFS-FS can mount on the gpfsNSD cluster, as described in "Checking that GPFS-FS can mount on the gpfsNSD cluster" on page 262.
► Check that the GPFS-FS disks are available on gpfsNSD, as described in "Checking that the GPFS-FS disks are available on gpfsNSD" on page 263.
► Check that the bgIO cluster is authorized to mount the GPFS-FS, as described in "Checking that bgIO cluster is authorized to mount the GPFS-FS" on page 263.

5.3.10 GPFS Checklists

This section includes the checklists for the GPFS checks referenced above.

Checking that the GPFS is started
Use the mmgetstate command to check whether GPFS is started on either the SN or any node in the gpfsNSD cluster. Example 5-28 shows three (all) active nodes in our gpfsNSD cluster.

Example 5-28 Checking GPFS node status

[p630n01][/]> mmgetstate -a
Node number Node name GPFS state
-----------------------------------------
1 p630n01_fn active
2 p630n02_fn active
3 p630n03_fn active

The same command was run on the SN and shows that GPFS is started on the SN (see Example 5-29).

Example 5-29 The mmgetstate command on the SN

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # mmgetstate -a
Node number Node name GPFS state
-----------------------------------------
1 bglsn_fn active

If this command shows that GPFS is not active on all nodes, you need to start GPFS on all nodes using the mmstartup -a command on any node in the cluster. If only some of the nodes are inactive, start them individually as follows (in this case we started GPFS on the SN only):

bglsn:/tmp # mmstartup
Wed Mar 29 17:40:27 EST 2006: mmstartup: Starting GPFS ...



Checking the GPFS log files for problems
The GPFS log files for nodes with local storage are kept in the /var/adm/ras directory. The latest log is named mmfs.log.latest and shows the history of the messages since the last time that GPFS was started on that node.

Example 5-30 shows the latest GPFS log file on the SN. The 'mmfsd ready' line shows that GPFS was functioning properly on this node. The log also shows that the remote file system (local device bubu_gpfs1) has been mounted from the remote cluster known as gpfsNSD.

Example 5-30 Latest GPFS log on SN

bglsn:/var/adm/ras # cat mmfs.log.latest
Wed Mar 29 17:40:27 EST 2006 runmmfs starting
Removing old /var/adm/ras/mmfs.log.* files:
Unloading modules from /usr/lpp/mmfs/bin
Loading modules from /usr/lpp/mmfs/bin
Module Size Used by
mmfslinux 268384 1 mmfs
tracedev 35552 2 mmfs,mmfslinux
Removing old /var/mmfs/tmp files:
Wed Mar 29 17:40:30 2006: mmfsd initializing. {Version: 2.3.0.10
Built: Jan 16 2006 13:07:54} ...
Wed Mar 29 17:40:30 2006: OpenSSL library loaded
Wed Mar 29 17:40:30 EST 2006 /var/mmfs/etc/gpfsready invoked
Wed Mar 29 17:40:30 2006: mmfsd ready
Wed Mar 29 17:40:30 EST 2006: mmcommon mmfsup invoked
Wed Mar 29 17:40:30 EST 2006: /var/mmfs/etc/mmfsup.scr invoked
Wed Mar 29 17:40:30 EST 2006: mounting /dev/bubu_gpfs1
Wed Mar 29 17:40:31 2006: Waiting to join remote cluster p630n01_fn
Wed Mar 29 17:40:31 2006: Connecting to 172.30.1.31 p630n01_fn
Wed Mar 29 17:40:31 2006: Connected to 172.30.1.31 p630n01_fn
Wed Mar 29 17:40:31 2006: Joined remote cluster gpfsNSD
Wed Mar 29 17:40:31 2006: Command: mount bubu_gpfs1
Wed Mar 29 17:40:31 2006: Connecting to 172.30.1.32 p630n02_fn
Wed Mar 29 17:40:31 2006: Connected to 172.30.1.32 p630n02_fn
Wed Mar 29 17:40:32 2006: Command: err 0: mount p630n01_fn:gpfs1
Wed Mar 29 17:40:32 EST 2006: finished mounting /dev/bubu_gpfs1

If you are experiencing problems with GPFS starting or not mounting the file system, then this is a good place to look.

GPFS log files for the I/O nodes are usually found under the following directory on the SN (or as specified in the $BGL_DISTDIR/etc/rc.d/init.d/gpfs script):

/bgl/gpfsvar//var/adm/ras
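The empty path component presumably stands for a per-I/O-node subdirectory (see the directories created in Example 5-7). A hedged way to check the most recent entries for one I/O node, assuming the per-node component matches the node's host name and that the log file follows the same mmfs.log.latest naming convention as on the SN (both assumptions, not statements from this guide):

IONODE=ionode1            # hypothetical node name from our examples
tail -n 20 /bgl/gpfsvar/$IONODE/var/adm/ras/mmfs.log.latest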



Checking that the GPFS-FS can mount on the SN
If the remote GPFS file system is not mounted, you need to identify it first. To find the remote file system to be mounted, use the mmremotefs command, and then try to mount it, as shown in Example 5-31.

Example 5-31 Checking remote file systems

bglsn:/tmp # mmremotefs show all
Local Name  Remote Name  Cluster name  Mount Point  Mount Options  Automount
bubu_gpfs1  gpfs1        p630n01_fn    /bubu        rw             yes
bglsn:/tmp # mount /bubu
mount: /dev/bubu_gpfs1 already mounted or /bubu busy
mount: according to mtab, /dev/bubu_gpfs1 is already mounted on /bubu

In our example, the /bubu file system was already mounted.

Checking that the GPFS-FS can mount on the I/O nodes

Note: Only attempt this test if the GPFS-FS can be mounted on the SN (see the previous check).

First, boot a block using the mmcs_db_console and check that GPFS has started on the nodes (Example 5-32).

Example 5-32 Checking GPFS on I/O nodes

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./mmcs_db_console
connecting to mmcs_server
connected to mmcs_server
connected to DB2
mmcs$ list_blocks
OK
mmcs$ allocate R000_128
OK
mmcs$ quit
OK
mmcs_db_console is terminating, please wait...
mmcs_db_console: closing database connection
mmcs_db_console: closed database connection
mmcs_db_console: closing console port
mmcs_db_console: closed console port
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # mmgetstate -a
Node number Node name GPFS state
-----------------------------------------
1 bglsn_fn active
2 ionode1 active
3 ionode2 active
4 ionode3 active
5 ionode4 active
6 ionode5 active
7 ionode6 active
8 ionode7 active
9 ionode8 active
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin #

Then, connect using ssh to an I/O node and try to mount the GPFS file system on the I/O node, as shown in Example 5-33.

Example 5-33 Checking GPFS-FS can mount on I/O nodes

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ssh root@ionode4
BusyBox v1.00-rc2 (2006.01.09-19:48+0000) Built-in shell (ash)
Enter 'help' for a list of built-in commands.
$ df
Filesystem 1K-blocks Used Available Use% Mounted on
rootfs 7931 1644 5878 22% /
/dev/root 7931 1644 5878 22% /
172.30.1.1:/bgl 9766544 2450872 7315672 26% /bgl
172.30.1.33:/bglscratch 36700160 71776 36628384 1% /bglscratch
$ /usr/lpp/mmfs/bin/mmgetstate
Node number Node name GPFS state
-----------------------------------------
5 ionode4 active
$ mount /bubu
$ df
Filesystem 1K-blocks Used Available Use% Mounted on
rootfs 7931 1644 5878 22% /
/dev/root 7931 1644 5878 22% /
172.30.1.1:/bgl 9766544 2450872 7315672 26% /bgl
172.30.1.33:/bglscratch 36700160 71776 36628384 1% /bglscratch
/dev/bubu_gpfs1 1138597888 3204096 1135393792 1% /bubu



If you are unable to connect through ssh to an I/O node, or if the mmgetstate -a command shows that no I/O nodes are active, you need to investigate your sitefs script (see also Appendix B, "The sitefs file" on page 423) and ensure that the following two important steps have been executed:

► The following line was added to the /bgl/dist/etc/rc.d/init.d/sitefs file:

echo "GPFS_STARTUP=1" >> /etc/sysconfig/gpfs

► The following symbolic links were added to ensure that sitefs is called at bootup:

bglsn:/bgl/dist/etc/rc.d/rc3.d # ls -als
total 0
0 drwxr-xr-x 2 root root 112 Mar 27 15:16 .
0 drwxr-xr-x 4 root root 96 Mar 27 14:52 ..
0 lrwxrwxrwx 1 root root 16 Mar 27 14:52 K90sitefs -> ../init.d/sitefs
0 lrwxrwxrwx 1 root root 16 Mar 27 14:52 S10sitefs -> ../init.d/sitefs
bglsn:/bgl/dist/etc/rc.d/rc3.d #

Checking that the file system is configured on the SN
To check that GPFS is configured, we ran both the mmlsconfig and the mmlscluster commands. Example 5-34 shows the following important information from the mmlsconfig command:

► Cluster name - bgIO.itso.ibm.com
► pagepool - we have set it to 128M (the maximum recommended value)
► cipherList set to AUTHONLY

Example 5-34 The mmlsconfig output<br />

bglsn:/tmp # mmlsconfig<br />

Configuration data for cluster bgIO.itso.ibm.com:<br />

------------------------------------------------clusterName<br />

bgIO.itso.ibm.com<br />

clusterId 12402351528774401789<br />

clusterType lc<br />

multinode yes<br />

autoload yes<br />

useDiskLease yes<br />

maxFeatureLevelAllowed 813<br />

cipherList AUTHONLY<br />

pagepool 128M<br />

File systems in cluster bgIO.itso.ibm.com:<br />

------------------------------------------<br />

(none)<br />
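If pagepool is not set as expected, it can be changed with the mmchconfig command. The following is a hedged sketch only (the 128M value is the maximum recommended for this configuration, and GPFS typically has to be restarted on the affected nodes before a new pagepool value takes effect in this release):<br />
mmchconfig pagepool=128M    # set the page pool size for the cluster nodes<br />
mmshutdown -a               # restart GPFS so that the new value takes effect<br />
mmstartup -a<br />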



Example 5-35 shows the output of the mmlscluster command and displays the<br />

following relevant information:<br />

► Cluster name - on the SN this is bgIO.itso.ibm.com<br />
► Remote shell command - on the SN this must be /usr/bin/ssh<br />
► Remote copy command - on the SN this must be /usr/bin/scp<br />
► The SN is the only quorum node<br />

Example 5-35 The mmlscluster output<br />

bglsn:/tmp # mmlscluster<br />

GPFS cluster information<br />

========================<br />

GPFS cluster name: bgIO.itso.ibm.com<br />

GPFS cluster id: 12402351528774401789<br />

GPFS UID domain: bgIO.itso.ibm.com<br />

Remote shell command: /usr/bin/ssh<br />

Remote file copy command: /usr/bin/scp<br />

GPFS cluster configuration servers:<br />

-----------------------------------<br />

Primary server: bglsn_fn.itso.ibm.com<br />

Secondary server: (none)<br />

Node number Node name IP address Full node name<br />

Remarks<br />

-----------------------------------------------------------------------<br />

--------<br />

1 bglsn_fn 172.30.1.1 bglsn_fn.itso.ibm.com<br />

quorum node<br />

2 ionode1 172.30.2.1 ionode1<br />

3 ionode2 172.30.2.2 ionode2<br />

4 ionode3 172.30.2.3 ionode3<br />

5 ionode4 172.30.2.4 ionode4<br />

6 ionode5 172.30.2.5 ionode5<br />

7 ionode6 172.30.2.6 ionode6<br />

8 ionode7 172.30.2.7 ionode7<br />

9 ionode8 172.30.2.8 ionode8<br />

Example 5-34 and Example 5-35 also reveal that GPFS is configured correctly. If<br />

you have problems with GPFS cluster configuration, refer to the installation<br />

instructions and to GPFS manuals (see 5.3.11, “References” on page 264).<br />



Checking that the SN is authorized to mount the GPFS-FS<br />

To check that the SN is set up to mount the remote GPFS file system, use the mmremotefs command, as shown in Example 5-36.<br />

Example 5-36 Checking authorized access for the remote file system<br />

bglsn:/tmp # mmremotefs show all<br />

Local Name Remote Name Cluster name Mount Point Mount Options<br />

Automount<br />

bubu_gpfs1 gpfs1 gpfsNSD /bubu rw<br />

yes<br />

This output shows that the SN is configured to mount the remote file system gpfs1 from the cluster named gpfsNSD as the local device bubu_gpfs1 on /bubu, for both read and write operations. If this is not correct, use the mmremotefs command on the SN (and the mmauth command on the storage cluster) to fix it; see also 5.3.8, “Cross mounting the GPFS file system on to Blue Gene/L cluster” on page 246, and the sketch that follows.<br />
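For reference, the remote mount definition on the SN side could be (re)created with mmremotefs along the following lines. This is a hedged sketch based on the values shown in Example 5-36; verify the exact syntax against the GPFS 2.3 Administration and Programming Reference:<br />
mmremotefs add bubu_gpfs1 -f gpfs1 -C gpfsNSD -T /bubu -A yes    # local device, remote device, owning cluster, mount point<br />
mmremotefs show all                                              # confirm the definition<br />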

Checking that GPFS-FS is configured on the gpfsNSD cluster<br />

To check that GPFS is configured on a cluster we run both the mmlsconfig<br />

command and the mmlscluster command. Example 5-37 shows the following<br />

important information from the mmlsconfig command:<br />

► Cluster name - gpfsNSD<br />

► pagepool - if not shown, it means it is set to default (64 MB)<br />

► cipherList set to AUTHONLY<br />

► local file system device name is /dev/gpfs1<br />

Example 5-37 mmlsconfig output on the gpfsNSD cluster<br />

[p630n01][/]> mmlsconfig<br />

Configuration data for cluster gpfsNSD:<br />

-----------------------------------------<br />
clusterName gpfsNSD<br />

clusterId 12402351657622744194<br />

clusterType lc<br />

multinode yes<br />

autoload no<br />

useDiskLease yes<br />

maxFeatureLevelAllowed 813<br />

cipherList AUTHONLY<br />

[p630n01_fn]<br />

File systems in cluster gpfsNSD:<br />

-----------------------------------<br />

/dev/gpfs1<br />



We also ran the mmlscluster command. Example 5-38 shows the following relevant information:<br />

► Cluster name - gpfsNSD<br />

► Remote shell command - /usr/bin/ssh<br />

► Remote copy command - /usr/bin/scp<br />

► Configured nodes - in our cluster all three nodes participate in quorum<br />

decisions<br />

Example 5-38 The mmlscluster output on the gpfsNSD cluster<br />

[p630n01][/]> mmlscluster<br />

GPFS cluster information<br />

========================<br />

GPFS cluster name: gpfsNSD<br />

GPFS cluster id: 12402351657622744194<br />

GPFS UID domain: gpfsNSD<br />

Remote shell command: /usr/bin/ssh<br />

Remote file copy command: /usr/bin/scp<br />

GPFS cluster configuration servers:<br />

-----------------------------------<br />

Primary server: p630n01_fn<br />

Secondary server: p630n02_fn<br />

Node number Node name IP address Full node name Remarks<br />

-----------------------------------------------------------------------------------<br />

1 p630n01_fn 172.30.1.31 p630n01_fn quorum node<br />

2 p630n02_fn 172.30.1.32 p630n02_fn quorum node<br />

3 p630n03_fn 172.30.1.33 p630n03_fn quorum node<br />

As Example 5-38 shows, GPFS is configured correctly. If you have problems with GPFS cluster configuration, refer to the installation and problem determination instructions found in the GPFS manuals (see 5.3.11, “References” on page 264).<br />

Checking that GPFS-FS can mount on the gpfsNSD cluster<br />
To mount the GPFS file system on the gpfsNSD cluster, check the locally configured file system device name from “Checking that GPFS-FS is configured on the gpfsNSD cluster” on page 261, and then use the mount command:<br />

[p630n01][/]> mount /dev/gpfs1<br />

GPFS: 6027-514 Cannot mount /dev/gpfs1 on /gpfs1: Already mounted.<br />

As you can see, in this case the file system was already mounted.<br />



Checking that the GPFS-FS disks are available on gpfsNSD<br />

Use the mmlsdisk command to check that the disks belonging to the GPFS-FS<br />

are available and sane. This is only required if the file system cannot be locally<br />

mounted on the gpfsNSD cluster (Example 5-39).<br />

Example 5-39 Checking disk availability<br />

[p630n01][/]> mmlsdisk /dev/gpfs1<br />

disk driver sector failure holds holds<br />

name type size group metadata data status<br />

availability<br />

------------ -------- ------ ------- -------- ----- -------------<br />

------------<br />

GPFS1_n01_b nsd 512 1 yes yes ready up<br />

GPFS2_n01_a nsd 512 1 yes yes ready up<br />

GPFS3_n02_b nsd 512 2 yes yes ready up<br />

GPFS4_n02_a nsd 512 2 yes yes ready up<br />

As you can see, in this case all the disks allocated for the GPFS-FS are ready<br />

and available. If some of these disks were unavailable (“down”) this would<br />

prevent the GPFS-FS from mounting. To fix this problem, first check that all disks<br />

are properly connected to the servers and available to the operating system, then<br />

use the mmchdisk command to recover the disks to the ready state.<br />
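For example, a down disk could be brought back with something like the following (a sketch only; the device and disk name are taken from Example 5-39):<br />
mmchdisk /dev/gpfs1 start -d "GPFS1_n01_b"    # restart the named NSD<br />
mmlsdisk /dev/gpfs1                           # confirm that it is ready and up again<br />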

Checking that the bgIO cluster is authorized to mount the GPFS-FS<br />

Use the mmauth show all command to check whether the bgIO cluster is authorized to mount the GPFS-FS, as shown in Example 5-40.<br />

Example 5-40 Checking file system is “exported” on storage cluster (gpfsNSD)<br />

[p630n01][/]> mmauth show all<br />

Cluster name: bgIO.itso.ibm.com<br />

Cipher list: AUTHONLY<br />

SHA digest: 22813e0fa7f4aa76982cd33cf705ee8c085b21a0<br />

File system access: gpfs1 (rw, root allowed)<br />

Cluster name: gpfsNSD (this cluster)<br />

Cipher list: AUTHONLY<br />

SHA digest: d02d0d706c8f7f14ce6366e1d6fe8a1a217ae1c5<br />

File system access: (all rw)<br />

As you can see, in this case the bgIO cluster named bgIO.itso.ibm.com is<br />

authorized to mount the locally created GPFS-FS device called gpfs1.<br />
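If this access were missing, it would have to be granted from the storage cluster (gpfsNSD). The following is a hedged sketch only, assuming the mmauth grant syntax of this GPFS release; check the GPFS manuals before running it:<br />
mmauth grant bgIO.itso.ibm.com -f gpfs1 -a rw    # allow the bgIO cluster read/write access to gpfs1<br />
mmauth show all                                  # verify the result<br />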



5.3.11 References<br />

For more information about GPFS commands and concepts, refer to the GPFS V2.3 documentation, which is also available from the Cluster Information Center:<br />

► Concepts, Planning, and Installation <strong>Guide</strong>, GA22-7968-02<br />
http://publib.boulder.ibm.com/infocenter/clresctr/topic/com.ibm.cluster.gpfs.doc/gpfs23/bl1ins10/bl1ins10.html<br />
► Administration and Programming Reference, SA22-7967-02<br />
http://publib.boulder.ibm.com/infocenter/clresctr/topic/com.ibm.cluster.gpfs.doc/gpfs23/bl1adm10/bl1adm10.html<br />
► <strong>Problem</strong> <strong>Determination</strong> <strong>Guide</strong>, GA22-7969-02<br />
http://publib.boulder.ibm.com/infocenter/clresctr/topic/com.ibm.cluster.gpfs.doc/gpfs23/bl1pdg10/bl1pdg10.html<br />
► Documentation updates<br />
http://publib.boulder.ibm.com/infocenter/clresctr/topic/com.ibm.cluster.gpfs.doc/gpfs23_doc_updates/docerrata.html<br />



Chapter 6. Scenarios<br />


This chapter contains a variety of problem determination scenarios that we<br />

captured while running on the Blue Gene/L system that we used in the<br />

development of this redbook. We constructed each scenario based on the<br />

problem determination methodology that we discuss throughout the book.<br />

We approach each problem with similar patterns:<br />

► A description of the problem<br />

► Detailed checks showing how related problems can be revealed<br />

► How to resolve the problem or to transfer the problem to other scenarios<br />

► Lessons learned<br />



6.1 Introduction<br />

In some scenarios, we intentionally inject an error that we assume could happen in real life and cause problems. Creating these scenarios helps to hone the problem determination procedures, and if we can create the error, there is a good chance that a similar problem can occur in the field. The following list contains the error injection scenarios that are presented in this chapter.<br />

Due to design and usability considerations, we have divided a Blue Gene/L<br />

environment into core system and functional components, and the scenarios are<br />

grouped into categories of problems in:<br />

► Blue Gene/L core system<br />

– Hardware (cards, power supplies, cables and so forth)<br />

– Software (DB2 and processes)<br />

– System configuration (remote command execution: ssh, rsh, NFS)<br />

► File system (NFS and GPFS)<br />

► Job submission (mpirun and LoadLeveler)<br />

Depending on past experience and the actual situation, there is usually more than one way to approach a problem. Our intention is that our problem determination methodology leads you to one of these categories. At the beginning of a category, scanning through the list of scenarios can help you spot a similar (if not identical) problem.<br />

When describing the problem at hand, we try not to give away its cause. However, having a hunch about what has just happened or gone wrong can serve as a starting point. This starting point is chosen from the multiple hypotheses listed in the problem description section.<br />

The starting point leads to a checklist. The checklists that we discuss in this chapter are specific to each category. This time, with a problem at hand, detailed and specific checks are performed at every step of identifying the problem, based on the checklist.<br />

At the end of each scenario, the problem might be resolved or just identified.<br />

Otherwise, some pointer is provided to aid in transferring the problem<br />

determination to another scenario. Because all of this process is carried out on a<br />

new system that has just been set up, we might run into pitfalls and unexpected<br />

findings. These findings are included in the section on what we have learned.<br />

When a new problem is discovered, we use the same methodology. A new<br />

scenario is created and added into a category. Odd problems are gathered under<br />

the miscellaneous scenarios.<br />



6.2 Blue Gene/L core system scenarios<br />

This section looks into problem scenarios that will affect the core Blue Gene/L<br />

components. The core system consists of:<br />

► Blue Gene/L racks<br />

► Functional Network<br />

► Service Network<br />

► Service Node<br />

► Blue Gene/L Database (running on Service Node)<br />

► Blue Gene/L system processes<br />

Here is a list of the scenarios that we tested for the core Blue Gene/L system:<br />

1. Hardware error: Compute card error<br />

2. Functional network: Defective cable<br />

3. Service network: Defective cable<br />

4. Service Node functional network interface down<br />

5. SN service network interface down<br />

6. The /bgl file system full on the SN (no GPFS)<br />

7. The / file system full on the SN<br />

8. The /tmp file system is full on the SN<br />

9. The ciodb daemon is not running on the SN<br />

10. The idoproxy daemon not running on the SN<br />
11. The mmcs_server is not running on the SN<br />
12. DB2 not started on the SN<br />
13. The bglsysdb user OS password changed (Linux)<br />
14. Uncontrolled rack power off<br />

In each scenario, we follow the same process. First, we check that the system is currently operational. We do this by allocating a block in mmcs_db_console, using submit_job to run the hello.rts application, and using free_block to de-allocate the block. After we have proved the system works, we inject the scenario that we want to test and then try the job submission again. This method should trigger the problem, which we then investigate.<br />
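A typical verification session looks roughly like the following. This is a sketch only: the block name matches our one-rack system, and the submit_job arguments (executable and working directory) are illustrative rather than an exact transcript:<br />
mmcs$ allocate R000_128<br />
OK<br />
mmcs$ submit_job /bgl/hello/hello.rts /bgl/hello<br />
OK<br />
mmcs$ free_block R000_128<br />
OK<br />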

Each of the scenarios is split into the following sections.<br />

► Error injection<br />

► <strong>Problem</strong> determination<br />

► Lessons learned<br />



6.2.1 Hardware error: Compute card error<br />

In this scenario, we replaced a compute card in a node card with a compute card that has a defective chip.<br />

Error injection<br />

Power off the rack, and replace a compute card with the faulty compute card.<br />

The discovery process successfully detected the compute card and populated<br />

the DB2 database.<br />

<strong>Problem</strong> determination<br />

Because all resources are available, we try and allocate a block which includes<br />

the faulty compute card. This fails with the message shown in Example 6-1.<br />

Example 6-1 A node fails to boot<br />

mmcs$ allocate R000_J104_32<br />

FAIL<br />

Microloader Assertion<br />

Apr 01 16:55:06 (E) [1088451808] test1:R000_J104_32 RAS event:<br />

KERNEL FATAL: Microloader Assertion<br />

Apr 01 16:55:06 (E) [1088451808] test1:R000_J104_32 RAS event:<br />

KERNEL FATAL: VALIDATE_LOAD_IMAGE_CRC_IN_DRAM<br />

Because the log file mentions a RAS event, we then look at the RAS event Web page (shown in Figure 6-1).<br />

Figure 6-1 RAS event indicates a location of the faulty card<br />

From the RAS event we find the location of the failing compute card -<br />

R00-M0-N1-CJ16-U01.<br />



Lessons learned<br />

The error occurred while manipulating a block and was recorded in the mmcs server log file. The RAS event shows detailed information that was not included in the log file.<br />

6.2.2 Functional network: Defective cable<br />

In this scenario, we simulate a cable failure on the functional network. We do this<br />

by removing one external ethernet connection from a Node Card.<br />

Error injection<br />

We physically pulled one of the cables out from the front of R00-M0-N2.<br />

<strong>Problem</strong> <strong>Determination</strong><br />

We try and allocate the block. It fails with:<br />

mmcs$<br />

no ethernet link<br />

Looking in the bglsn-bgdb0-mmcs_db_server-current.log (latest<br />

mmcs_db_server.log) we see:<br />

Mar 09 13:57:53 (E) [1086973152] root:R000_128 RAS event: KERNEL<br />

FATAL: no ethernet link<br />

We can see a RAS event has been raised, so we look for further evidence in the<br />

RAS database. It shows the same message with a location code:<br />

R00-M0-N2-I:J18-U01. We then look at the physical hardware and find the fault.<br />

Lessons learned<br />

The functional network is essential to the operation of Blue Gene/L. The system will not operate even with one link disabled.<br />

6.2.3 Service network: Defective cable<br />

In this scenario, we simulate a cable failure on the service network. We do this by removing a connection to one of the external ports on the Service Card.<br />

Error injection<br />

We physically pulled the GBit cable out from the front of the Service Card.<br />

<strong>Problem</strong> <strong>Determination</strong><br />

We try and allocate the block. It fails with the message shown in Example 6-2.<br />



Example 6-2 mmcs message: service card link failure<br />

mmcs$<br />

idoproxy communication failure: BGLERR_IDO_PKT_TIMEOUT connection lost<br />

to node/link/service card [R00-M0-N0]<br />

ido://FF:F2:9F:16:F0:95:00:0D:60:E9:0F:6A/CTRL<br />

Looking in the latest mmcs_db_server.log we find the message shown in<br />

Example 6-3.<br />

Example 6-3 Message from mmcs_db_server log<br />

Mar 09 14:43:32 (I) [1084843232] root:R000_128 allocate: FAIL;connect:<br />

idoproxy communication failure: BGLERR_IDO_PKT_TIMEOUT connection lost<br />

to node/link/service card [R00-M0-N0]<br />

ido://FF:F2:9F:16:F0:95:00:0D:60:E9:0F:6A/CTRL<br />

We conclude that we cannot talk to the Service Card over the service network<br />

and fix the problem by replacing the cable.<br />

We then went on to see what happens if we plug the service network GBit cable back in and instead pull the IDo link that connects the Service Card to the service network. We found that the block booted and the application ran normally.<br />
We then moved on to using Discovery. Nothing happened when we unplugged the Service Network port marked ‘IDo’ while discovery was running. When we did the same for the GBit port, the hardware was marked M (missing) in the database.<br />

Lessons learned<br />

We learned the following lessons:<br />

► The Service network is needed to boot a block. However, booting a block<br />

does not use the IDo port on the Service Card.<br />

► A simple ping to the Service Card does not work as a method to see whether it is alive (because the card does not use IP/ICMP communication).<br />

► The idoproxy uses the GBit port on the front of the Service Card, as does the<br />

Discovery process.<br />

► We observed that the System Controller uses the IDo link to initialize the<br />

Service Card, after it initializes everything to use the GBit connection.<br />



6.2.4 Service Node functional network interface down<br />

In this scenario we simulate the functional network interface being disabled or<br />

removed from the Service Node (SN).<br />

Error injection<br />

We have used the following command on the SN to disable the functional<br />

network interface:<br />

bglsn:/bgl/BlueLight/logs/BGL # ifconfig eth3 down<br />

<strong>Problem</strong> determination<br />

When we try and boot the block we get the following message:<br />

mmcs$<br />

Error:unable to mount filesystem.<br />

Looking in the latest mmcs_db_server.log we see:<br />

Mar 09 17:37:43 (E) [1086973152] root:R000_128 RAS event: KERNEL<br />

FATAL: Error: unable to mount filesystem<br />

This message is reported for every I/O node. We know that the I/O nodes mount NFS from the SN. The SN checklist directs us to look at the network configuration, which shows us that eth3 is disabled.<br />
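A minimal sketch of this check and the fix (the interface name eth3 is specific to our SN):<br />
ip link show eth3     # the interface is reported DOWN<br />
ifup eth3             # bring the functional network interface back up<br />
ip addr show eth3     # confirm that the IP address is configured again<br />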

Lessons learned<br />

The NFS file system is mounted on the I/O nodes over the functional network and is required to run jobs. Therefore, the functional network is required to run an application.<br />

6.2.5 SN service network interface down<br />

In this scenario we simulate the service network interface being disabled or<br />

removed from the SN.<br />

Error injection<br />

We have used the following command on the SN to disable the service network<br />

interface:<br />

bglsn:/bgl/BlueLight/logs/BGL # ifconfig eth0 down<br />



<strong>Problem</strong> determination<br />

When we booted the block the message shown in Example 6-4 appears in the<br />

mmcs_db_console.<br />

Example 6-4 The mmcs message: no network connection to service card<br />

mmcs$<br />

idoproxy communication failure: BGLERR_IDO_PKT_TIMEOUT connection lost<br />

to node/link/service card [R00-M0-N0]<br />

ido://FF:F2:9F:16:F0:95:00:0D:60:E9:0F:6A/CTRL<br />

Looking in the latest mmcs_db_server.log, we find the message shown in<br />

Example 6-5.<br />

Example 6-5 Failed connection message recorded mmcs_db_server log<br />

Mar 09 17:04:02 (I) [1084843232] root:R000_128 allocate: FAIL;connect:<br />

idoproxy communication failure: BGLERR_IDO_PKT_TIMEOUT connection lost<br />

to node/link/service card [R00-M0-N0]<br />

ido://FF:F2:9F:16:F0:95:00:0D:60:E9:0F:6A/CTRL<br />

Using the knowledge gained from previous scenarios we deduce this is a Service<br />

Network fault. Using the methodology we check the network interfaces on the SN<br />

as shown in Example 6-6.<br />

Example 6-6 Checking interface status<br />

bglsn:/bgl/BlueLight/logs/BGL # ip ad<br />

1: lo: mtu 16436 qdisc noqueue<br />

link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00<br />

inet 127.0.0.1/8 brd 127.255.255.255 scope host lo<br />

inet6 ::1/128 scope host<br />

valid_lft forever preferred_lft forever<br />

2: eth0: mtu 1500 qdisc pfifo_fast qlen 1000<br />

link/ether 00:0d:60:4d:28:ea brd ff:ff:ff:ff:ff:ff<br />

inet 10.0.0.1/16 brd 10.0.255.255 scope global eth0<br />

In this case, eth0 is not up on the SN. We bring it back up using the ifup<br />

command, and check again, as in Example 6-7.<br />

Example 6-7 Bringing up the Service Network interface (eth0)<br />

bglsn:/bgl/BlueLight/logs/BGL # ifup eth0<br />

bglsn:/bgl/BlueLight/logs/BGL # ip ad<br />

1: lo: mtu 16436 qdisc noqueue<br />



link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00<br />

inet 127.0.0.1/8 brd 127.255.255.255 scope host lo<br />

inet6 ::1/128 scope host<br />

valid_lft forever preferred_lft forever<br />

2: eth0: mtu 1500 qdisc pfifo_fast qlen 1000<br />

link/ether 00:0d:60:4d:28:ea brd ff:ff:ff:ff:ff:ff<br />

inet 10.0.0.1/16 brd 10.0.255.255 scope global eth0<br />

inet6 fe80::20d:60ff:fe4d:28ea/64 scope link<br />

valid_lft forever preferred_lft forever<br />

After bringing up the interface we can boot the block.<br />

Lessons learned<br />

The service network is required to boot a block because the idoproxy sends the<br />

microloader through the Service Network (IDo bridge).<br />

6.2.6 The /bgl file system full on the SN (no GPFS)<br />

This is a scenario to see what happens when /bgl is 100% full on the SN.<br />

Note: In this scenario we do not use GPFS.<br />

Error injection<br />

Use dd to create a huge file in /bgl on the SN that fills the file system up to 100%.<br />

A similar error would appear if you run out of inodes on the same /bgl file system.<br />
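A sketch of this injection and the corresponding checks (the paths are those used on our SN; the file name is illustrative):<br />
dd if=/dev/zero of=/bgl/bigfile bs=1M    # runs until /bgl is 100% full<br />
df /bgl                                  # confirm 100% use<br />
df -i /bgl                               # the same symptom appears if inodes run out<br />
rm /bgl/bigfile                          # clean up after the test<br />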

<strong>Problem</strong> determination<br />

We submitted the job but did not receive its output. We got the message shown in Example 6-8.<br />

Example 6-8 Job error due to /bgl file system full<br />

mmcs$ list_jobs<br />

OK<br />

JOBID STATUS USERNAME BLOCKID<br />

EXECUTABLE<br />

28 E root R000_128<br />

/bgl/hello/hello.rts<br />

The job is in an error state. We look at the output for the job:<br />

could not open /bgl/hello/R000_128-28.stdout: No space left on<br />

device<br />



Looking in the system logs we see the ciodb records shown in Example 6-9.<br />

Example 6-9 Extract from ciodb records revealing file system full<br />

Mar 13 14:35:35 (I) [1074048864] Starting Job 28<br />

Mar 13 14:35:35 (I) [1079563488] New thread 1079563488, for jobid 28<br />

Mar 13 14:35:35 (I) [1079563488] Jobid is 28, homedir is /bgl/hello<br />

Mar 13 14:35:35 (E) [1079563488] 0x4058bea8<br />

Mar 13 14:35:35 (E) [1079563488] could not open<br />

/bgl/hello/R000_128-28.stdout: No space left on device<br />

Mar 13 14:35:35 (E) [1079563488] Job 28 set to START_ERROR, exit<br />

status= 255, errtext= could not open /bgl/hello/R000_128-28.stdout: No<br />

space left on device<br />

Mar 13 14:35:35 (I) [1079563488] cleanup job polling thread 1079563488<br />

Correlating the messages, we determine that the /bgl file system is full (this file<br />

system is used for this job’s output) by using df /bgl on the SN.<br />

Lessons learned<br />

We learned the following lessons:<br />

► We need space for jobs to write their output (stdout(1), stderr(2)).<br />

► ciodb records any problems while writing the output file. In Example 6-9, (I) means I/O node output and (E) means error. ciodb talks to the ciod daemons on the I/O nodes (which are doing the file I/O operations), and the ciod daemons report back that they are not able to write on the NFS file system (/bgl exported from the SN).<br />

6.2.7 The / file system full on the SN<br />

This is a scenario that tests what happens when / is 100% full on the SN.<br />

Error injection<br />

Use the dd command to create a huge file in / on the SN that fills the file system<br />

(100% reported by the df / command run on the SN).<br />

<strong>Problem</strong> determination<br />

We submitted the job and it ran OK. There were no error messages related to the job.<br />



Lessons learned<br />

We learned the following lessons:<br />

► The root (“/”) file system is not written to when running a job.<br />

► Having / full does not affect running jobs on Blue Gene/L. However, some OS<br />

functionality (Linux) and the database (DB2) will eventually have problems if<br />

they cannot write to ‘/’, and this in turn will have an effect on Blue Gene/L<br />

system processes.<br />

6.2.8 The /tmp file system is full on the SN<br />

This is a scenario to see what happens when /tmp is 100% full on the SN.<br />

Error injection<br />

Use dd to create a huge file in /tmp on the SN that fills the file system up to 100%.<br />

<strong>Problem</strong> determination<br />

Allocating a block fails with the following message:<br />

mmcs$ allocate R000_128<br />

DBBlockController::allocateBlock failed: invalid XML<br />

Looking in the latest mmcs_db_server.log we see the output shown in<br />

Example 6-10.<br />

Example 6-10 Message when allocating a block fails due to /tmp full<br />

Mar 13 12:12:29 (I) [1084843232] root allocate R000_128<br />

Mar 13 12:12:29 (I) [1084843232] root<br />

DBMidplaneController::addBlock(R000_128)<br />

Mar 13 12:12:32 (I) [1084843232] root:R000_128 allocate:<br />

FAIL;DBBlockController::allocateBlock failed: invalid XML<br />

Mar 13 12:12:32 (I) [1084843232] root:R000_128<br />

DBMidplaneController::removeBlock(R000_128)<br />

Mar 13 12:12:32 (I) [1084843232] root DBBlockController::disconnect()<br />

setBlockState(R000_128, FREE) successful<br />

Using the methodology, we are directed to check whether any file systems are full. We find /tmp 100% full. Freeing up space allows the block to allocate and the job to run. However, some other processes might also be affected (no pty can be created when /tmp is full), so a reboot in maintenance mode might be required for the SN.<br />



Lessons learned<br />

The /tmp directory is used when a block is manipulated. An image of the block in<br />

XML format is written to /tmp. You must have space in /tmp for Blue Gene/L to<br />

work.<br />

6.2.9 The ciodb daemon is not running on the SN<br />

In this scenario, we check what happens when the ciodb daemon is not running and whether this affects starting a job.<br />

Error injection<br />

Even though this scenario might seem unlikely, we considered that this could<br />

actually happen (ciodb daemon not running) due to a bad library (OS upgrade,<br />

Blue Gene/L driver update, and so on).<br />

First we tried to kill the ciodb daemon (kill -9 ciodb_pid), but the bglmaster<br />

daemon respawns the process. Then we decided to attach a debugger to the<br />

process and stop its execution, as shown in Example 6-11.<br />

Example 6-11 Attaching the debugger to the ciodb process<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster start<br />

Starting BGLMaster<br />

Logging to<br />

/bgl/BlueLight/logs/BGL/bglsn-bglmaster-2006-0314-12:29:03.log<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster status<br />

idoproxy started [30625]<br />

ciodb started [30626]<br />

mmcs_server started [30627]<br />

monitor stopped<br />

perfmon stopped<br />

bglsn:/bgl/BlueLight/logs/BGL # gdb --pid=30626<br />

<strong>Problem</strong> determination<br />

We then submit a job and check it for a while, and we see that it just sits there in<br />

starting (S) state (see Example 6-12), JOBID #38.<br />

Example 6-12 Checking the job status<br />

mmcs$ list_blocks<br />

OK<br />

R000_128 root(1) connected<br />

mmcs$ list_jobs<br />

OK<br />



JOBID STATUS USERNAME BLOCKID<br />

EXECUTABLE<br />

38 S root R000_128<br />

/bgl/hello/hello.rts<br />

We further investigate the RAS events, but we cannot find anything relevant.<br />

We looked at the Runtime information in the Web browser and found that the status of job #38 was listed as “Ready to start” and the block was in the initialized (I) state.<br />

We further checked the Configuration information in the Web browser and found<br />

no disabled hardware (everything OK).<br />

Looking in the system logs for ciodb, we see the messages shown in<br />

Example 6-13.<br />

Example 6-13 The ciodb log messages<br />

03/14/06-12:29:03 ./startciodb STARTED<br />

03/14/06-12:29:03 RUN CIODB ./ciodb --useDatabase BGL --dbproperties<br />

db.properties --nortschecking<br />

03/14/06-12:29:03 logging to<br />

/bgl/BlueLight/logs/BGL/bglsn-ciodb-2006-0314-12:29:03.log<br />

Mar 14 12:29:03 (I) [1074052960] ciodb[30626]: started: $Name:<br />

V1R2M1_020_2006 $<br />

Mar 14 12:29:03 (E) [1074052960] No running job records in the database<br />

We look further for the job with JOBID #38, but there is no record for this JOBID. Instead, we see the messages for the job with JOBID #37, which looks OK, as shown in Example 6-14.<br />

Example 6-14 Records from ciodb showing correct execution of Job #37<br />

Mar 14 12:27:02 (I) [1074052960] Starting Job 37<br />

Mar 14 12:27:02 (I) [1079567584] New thread 1079567584, for jobid 37<br />

Mar 14 12:27:02 (I) [1079567584] Jobid is 37, homedir is /bgl/hello/<br />

contacting control node 0 at 172.30.2.1:7000...ok<br />

contacting control node 1 at 172.30.2.2:7000...ok<br />

contacting control node 2 at 172.30.2.5:7000...ok<br />

contacting control node 3 at 172.30.2.6:7000...ok<br />

contacting control node 4 at 172.30.2.7:7000...ok<br />

contacting control node 5 at 172.30.2.8:7000...ok<br />

contacting control node 6 at 172.30.2.3:7000...ok<br />

contacting control node 7 at 172.30.2.4:7000...Mar 14 12:27:04 (I)<br />

[1079567584] Job loaded: 37<br />



Mar 14 12:27:04 (I) [1079567584] About to launch /bgl/hello/hello.rts<br />

Mar 14 12:27:04 (I) [1079567584] Job 37 set to RUNNING<br />

Mar 14 12:27:04 (E) [1079567584] Job 37 set to TERMINATED, exit status=<br />

0, errtext=<br />

Mar 14 12:27:04 (I) [1079567584] cleanup job polling thread 1079567584<br />

We can conclude that ciodb does not seem to be handling our job submission. At this point we suspect ciodb. We try to stop and restart the bglmaster (see Example 6-15).<br />

Example 6-15 Stopping and checking the bglmaster<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster stop<br />

Stopping BGLMaster<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster status<br />

idoproxy started [30625]<br />

ciodb started [30626]<br />

mmcs_server started [30627]<br />

monitor stopped<br />

perfmon stopped<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster status<br />

Timed out on socket connection to BGLMaster daemon at 127.0.0.1:32035<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # netstat -a | grep 32035<br />

tcp 0 0 localhost:32035 *:*<br />

LISTEN<br />

tcp 1 0 localhost:32035 localhost:42999<br />

CLOSE_WAIT<br />

tcp 0 0 localhost:42999 localhost:32035<br />

FIN_WAIT2<br />

First, we can see that the bglmaster does not stop: the bglmaster status command returns a socket timeout message for port localhost:32035. We then check with netstat -a | grep 32035 and see a connection in FIN_WAIT2 status.<br />



The connection should be closed on this port for the bglmaster to be able to<br />

finish the stopping request. We use the ps -ef command to see the status of the<br />

system processes, bglmaster, ciodb, idoproxy and mmcs_db_server, as shown<br />

in Example 6-16.<br />

Example 6-16 Checking the bglmaster process<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ps -ef | grep -i bglmaster|grep -v grep<br />

root 30619 1 0 12:29 ? 00:00:00 ./BGLMaster --consoleip 127.0.0.1<br />

--consoleport 32035 --configfile bglmaster.init --autorestart y --db2profile<br />

/dbhome/bgdb2cli/sqllib/ db2profile --dbproperties db.properties<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ps -ef | grep -i idoproxy|grep -v grep<br />

root 21551 26793 0 14:59 pts/5 00:00:00 grep -i idoproxy<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ps -ef | grep -i ciodb |grep -v grep<br />

root 30626 1 0 12:29 ? 00:00:00 [ciodb] &lt;defunct&gt;<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ps -ef | grep -i mmcs_server |grep -v grep<br />

root 21601 26793 0 15:00 pts/5 00:00:00 grep -i mmcs_server<br />

We can see that idoproxy and mmcs_db_server are not running. However, because bglmaster would not stop and we still see a defunct ciodb process, we first kill the bglmaster and then the ciodb process:<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # kill -9 30619; kill -9<br />

30626<br />

Now that we have removed the bad processes, we can try and restart them, as<br />

shown in Example 6-17.<br />

Example 6-17 Restarting the bglmaster after cleanup<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster start<br />

Starting BGLMaster<br />

Logging to<br />

/bgl/BlueLight/logs/BGL/bglsn-bglmaster-2006-0314-15:01:17.log<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster status<br />

idoproxy started [21809]<br />

ciodb started [21810]<br />

mmcs_server started [21811]<br />

monitor stopped<br />

perfmon stopped<br />

When the bglmaster started clean, our job (#38) that was stuck in (S)tarting<br />

status ran correctly.<br />



Lessons learned<br />

We learned the following lessons:<br />

► The ciodb daemon controls the submission of the job to the Blue Gene/L. If it<br />

is not running jobs do not start.<br />

► If your jobs are not running, it is a good idea to check your system’s Blue Gene/L daemons using the ps -ef command and to look for any &lt;defunct&gt; Blue Gene/L system processes (a sample check follows).<br />
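For example, a quick check could look like this (a sketch; the process names are those started by bglmaster on our SN):<br />
ps -ef | egrep 'bglmaster|idoproxy|ciodb|mmcs_server' | grep -v grep<br />
ps -ef | grep defunct | grep -v grep    # any zombie system process shows up here<br />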

6.2.10 The idoproxy daemon not running on the SN<br />

In this scenario, we stop the idoproxy daemon to check the impact on starting and running a job.<br />

Error injection<br />

Again, we use the same debugger technique as we did for ciodb: attach the gdb<br />

debugger to the idoproxy process, as in Example 6-18.<br />

Example 6-18 Attaching gdb to the idoproxy process<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster status<br />

idoproxy started [12455]<br />

ciodb started [14394]<br />

mmcs_server started [14395]<br />

monitor stopped<br />

perfmon stopped<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ps -ef | grep -i ido<br />

root 12455 14324 0 10:42 ? 00:00:00 ./idoproxydb<br />

-enableflush -loguserinfo db.properties BlueGene1<br />

bglsn:/bgl/BlueLight/logs/BGL # gdb --pid=12455<br />

<strong>Problem</strong> determination<br />

When we try to allocate the block we get the following error in the<br />

mmcs_db_console:<br />

connect: idoproxy communication failure: socket recv timeout<br />

The following message was also found in the mmcs server log:<br />

Mar 14 10:48:33 (I) [1084843232] root:R000_128 allocate:<br />

FAIL;connect: idoproxy communication failure: socket recv timeout<br />



No other message was found in any of the remaining system logs. Moreover,<br />

bglmaster shows everything is running, as shown in Example 6-19.<br />

Example 6-19 Bglmaster status with idoproxy stopped<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster status<br />

idoproxy started [12455]<br />

ciodb started [14394]<br />

mmcs_server started [14395]<br />

monitor stopped<br />

perfmon stopped<br />

Because we suspect an issue with idoproxy (which is controlled by the<br />

bglmaster), we then try to restart the bglmaster:<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster restart<br />

Stopping BGLMaster<br />

....<br />

The restart command hangs while stopping the bglmaster, waiting for the socket on port 32035 (owned by idoproxy) to close the connection (so that it can finish the actual stopping process). However, this does not happen, so we have to open another terminal, kill the “hanging” process, and then restart the bglmaster, as in Example 6-20.<br />

Example 6-20 Recovering from a “hanging” idoproxy daemon<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ps -ef | grep -i bglmaster| grep -v grep<br />

root 14324 1 0 Mar09 ? 00:00:00 ./BGLMaster --consoleip 127.0.0.1<br />

--consoleport 32035 --configfile bglmaster.init --autorestart y --db2profile<br />

/dbhome/bgdb2cli/sqllib/ db2profile --dbproperties db.properties<br />

## >><br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster stop<br />

Stopping BGLMaster<br />

## ><br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster status<br />

Timed out on socket connection to BGLMaster daemon at 127.0.0.1:32035<br />

## ><br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ps -ef | grep -i idoproxy|grep -v grep<br />

root 12455 1 0 12:29 ? 00:00:00 [idoproxy] &lt;defunct&gt;<br />



bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # kill -9 12455<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ps -ef | grep -i ciodb |grep -v grep<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ps -ef | grep -i mmcs_server |grep -v grep<br />

root 14395 14395 0 15:00 pts/5 00:00:00 grep -i mmcs_server<br />

## ><br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster stop<br />

Stopping BGLMaster<br />

## ><br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster start<br />

Starting BGLMaster<br />

Logging to /bgl/BlueLight/logs/BGL/bglsn-bglmaster-2006-0406-17:07:53.log<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster status<br />

idoproxy started [8260]<br />

ciodb started [8261]<br />

mmcs_server started [8262]<br />

monitor stopped<br />

perfmon stopped<br />

Finally, we can now start mmcs_db_console and run a job.<br />

Lessons learned<br />

We learned the following lessons:<br />

► mmcs_db_console requires a connection to idoproxy to work.<br />

► If you are having problems with Blue Gene/L system processes (controlled by<br />

bglmaster), ensure that all old instances of the system processes are<br />

properly cleaned up before restarting bglmaster.<br />

6.2.11 The mmcs_server is not running on the SN<br />

In this scenario, we stop mmcs_db_server to check the impact on starting or<br />

running a job.<br />

Error injection<br />

We use the same technique as we used for ciodb and idoproxy (attach the<br />

debugger, then stop the process), as shown in Example 6-21.<br />



Example 6-21 Attaching the debugger to the mmcs_server process<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster status<br />

idoproxy started [21809]<br />

ciodb started [26636]<br />

mmcs_server started [28222]<br />

monitor stopped<br />

perfmon stopped<br />

bglsn:/bgl/BlueLight/logs/BGL # gdb --pid=28222<br />

<strong>Problem</strong> determination<br />

We try to connect to the mmcs_server using the mmcs_db_console:<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./mmcs_db_console<br />

connecting to mmcs_server<br />

The mmcs_db_console hangs and never returns a prompt (which means we<br />

cannot submit a job), thus we connect to the SN using another terminal, and start<br />

the problem determination.<br />

First, we checked the RAS events log, but we could not find anything relevant.<br />

Next, we looked at the “Runtime” and “Configuration” information in the Blue<br />

Gene/L Web browser, but we could not find anything related to this issue.<br />

Next, we check the status of the Blue Gene/L system processes:<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster status<br />

Timed out on socket connection to BGLMaster daemon at<br />

127.0.0.1:32035<br />

We follow the same procedure as in 6.2.9, “The ciodb daemon is not running on<br />

the SN” on page 276 (ciodb) and 6.2.10, “The idoproxy daemon not running on<br />

the SN” on page 280 (idoproxy). We have to kill all the bglmaster spawned<br />

processes, then we restart the bglmaster.<br />

We can now start mmcs_db_console. We are able to submit and run a job.<br />

Lessons learned<br />

We learned the following lessons:<br />

► The mmcs_server daemon is essential to the running of Blue Gene/L.<br />

► mmcs_db_console is an interface to this process.<br />

► If you are having problems with system processes, ensure that all old system<br />

processes are properly cleaned up before restarting bglmaster.<br />



6.2.12 DB2 not started on the SN<br />

In this scenario, we reboot the SN on a system where the Blue Gene/L DB2 instance is not set to start automatically. This could happen if, during DB2 installation on the SN, automatic restart of the DB2 instances was not selected.<br />

Error injection<br />

We turned off automatic DB2 start at system startup using the command shown<br />

in Example 6-22.<br />

Example 6-22 Turning off automatic DB2 start<br />

bglsn:~ # su -l bglsysdb<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin #<br />

/opt/IBM/db2/V8.1/instance/db2iauto -off bglsysdb<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # db2set -i bglsysdb<br />

DB2COMM=tcpip<br />

Note: If the instance was set to autostart, we would see:<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin #<br />

/opt/IBM/db2/V8.1/instance/db2iauto -on bglsysdb<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # db2set -i bglsysdb<br />

DB2COMM=tcpip<br />

DB2AUTOSTART=YES<br />

We then rebooted the SN.<br />

<strong>Problem</strong> determination<br />

After the SN rebooted, we try to start the Blue Gene/L system processes, as<br />

shown in Example 6-23.<br />

Example 6-23 Starting bglmaster daemon when DB2 is stopped<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster start<br />

Starting BGLMaster<br />

Logging to /bgl/BlueLight/logs/BGL/bglsn-bglmaster-2006-0314-18:35:29.log<br />

./BGLMaster: error while loading shared libraries: libdb2.so.1: cannot open shared<br />

object file: No such file or directory<br />

bglmaster start command failed: ./BGLMaster --consoleip 127.0.0.1 --consoleport 32035<br />

--configfile bglmaster.init --autorestart y --db2profile ~bgdb2cli/sqllib/db2profile<br />

--dbproperties db.properties 2>&1<br />

>/bgl/BlueLight/logs/BGL/bglsn-bglmaster-2006-0314-18:35:29.log<br />

## ><br />



bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # . /discovery/db.src<br />

SQL30081N A communication error has been detected. Communication protocol<br />

being used: "TCP/IP". Communication API being used: "SOCKETS". Location<br />

where the error was detected: "127.0.0.1". Communication function detecting<br />

the error: "connect". Protocol specific error code(s): "111", "*", "*".<br />

SQLSTATE=08001<br />

The db.src file contains a DB2 connect statement (db2 connect to bgdb0 user bglsysdb using bglsysdb), which returns the error message shown in Example 6-23 (SQL30081N); thus, we can conclude that DB2 was not started.<br />
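A quick way to confirm whether the DB2 instance is up is to look for the DB2 engine process and to attempt a manual connection (a sketch; the instance and database names are those used on our SN):<br />
ps -ef | grep db2sysc | grep -v grep           # the DB2 engine process is absent while the instance is down<br />
su - bglsysdb -c "db2 connect to bgdb0"        # fails while DB2 is stopped<br />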

Next, we tried to see what the RAS drill down shows, and we found the message<br />

shown in Example 6-24 at the top of the Web page.<br />

Example 6-24 Message returned to the Web page when DB2 not running<br />

Warning: odbc_connect(): SQL error: [IBM][CLI Driver] SQL30081N A<br />

communication error has been detected. Communication protocol being<br />

used: "TCP/IP". Communication API being used: "SOCKETS". Location where<br />

the error was detected: "127.0.0.1". Communication function detecting<br />

the error: "connect". Protocol specific error code(s): "111", "*", "*".<br />

SQLSTATE=08001 , SQL state 08001 in SQLConnect in<br />

/bgl/BlueLight/V1R2M1_020_2006-060110/ppc/web/ras.php on line 110<br />

Again, the same SQL message (SQL30081N) indicates that the Web interface<br />

cannot communicate with DB2 either.<br />

We then executed the command shown in Example 6-25 to confirm that DB2 is<br />

not running.<br />

Example 6-25 Checking the DB2 processes<br />

bglsn:~ # ps -ef | grep db2<br />

root 7711 1 0 18:24 ? 00:00:00 /opt/IBM/db2/V8.1/bin/db2fmcd<br />

root 8593 1 0 18:30 pts/0 00:00:00 /dbhome/bgdb2cli/sqllib/bin/db2bp<br />

8468A0 5 A<br />

bglsysdb 10119 1 0 18:32 pts/0 00:00:00 /dbhome/bglsysdb/sqllib/bin/db2bp<br />

9735A1000 5 A<br />

root 10725 1 0 18:35 pts/4 00:00:00 /dbhome/bgdb2cli/sqllib/bin/db2bp<br />

10277A0 5 A<br />

root 12387 12339 0 18:46 pts/7 00:00:00 grep db2<br />



Now we start DB2 on the SN; after DB2 starts, we also start the bglmaster daemon, as shown in Example 6-26.<br />

Example 6-26 Starting DB2 and bglmaster<br />

bglsysdb@bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin> db2start<br />

03/14/2006 19:36:50 0 0 SQL1063N DB2START processing was<br />

successful.<br />

SQL1063N DB2START processing was successful.<br />

bglsysdb@bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin> # ./bglmaster start<br />

bglsysdb@bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin> # ./bglmaster status<br />

idoproxy started [21262]<br />

ciodb started [21263]<br />

mmcs_server started [21264]<br />

monitor stopped<br />

perfmon stopped<br />

We submit a job, which starts and runs to completion. Finally, we make sure that DB2 is going to start automatically the next time the system is started, as shown in Example 6-27.<br />

Example 6-27 Turning on automatic DB2 start<br />

bglsn:~ # su -l bglsysdb<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin #<br />

/opt/IBM/db2/V8.1/instance/db2iauto -on bglsysdb<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # db2set -i bglsysdb<br />

DB2COMM=tcpip<br />

DB2AUTOSTART=YES<br />

Lessons learned<br />

We learned the following lessons:<br />

► DB2 is the core element of Blue Gene/L. Nothing works if it is not started.<br />

► DB2 should be configured to automatically start on reboot.<br />



6.2.13 The bglsysdb user OS password changed (Linux)<br />

Because the DB2 user (bglsysdb) password authentication is set to “unix” during the installation of the SN, we decided to change the UNIX password and test the effect.<br />

Error injection<br />

We changed the OS password for the bglsysdb user:<br />

bglsn:/bgl/ # passwd bglsysdb<br />

<strong>Problem</strong> determination<br />

We then try to allocate a block so we can run our job. The mmcs_db_console returns the following error:<br />

mmcs$ allocate R000_128<br />

FAIL<br />

lost connection to mmcs_server<br />

use mmcs_server_connect to reconnect<br />

When we check the system logs we find the following messages in the mmcs and<br />

ciodb logs (see Example 6-28).<br />

Example 6-28 Error message in system logs when bglsysdb password changed<br />

--CLI ERROR--------------<br />
cliRC = -1<br />

line = 167<br />

file = DBConnection.cc<br />

SQLSTATE = 08001<br />

Native Error Code = -30082<br />

[IBM][CLI Driver] SQL30082N Attempt to establish connection failed<br />

with security reason "24" ("USERNAME AND/OR PASSWORD INVALID").<br />

SQLSTATE=08001<br />

-------------------------<br />

Values used for connection: database is {bgdb0}, user is {bglsysdb},<br />

schema is {bglsysdb}<br />

Unable to connect, aborting...<br />

We can see from this that we cannot connect to the database as bglsysdb; the message is “USERNAME AND/OR PASSWORD INVALID”. However, as superuser (root) we can switch to bglsysdb (su - bglsysdb) and confirm that the user context is fine, so it must be the password that changed, and we need to update the db.properties file with the new password.<br />



After changing db.properties with the new password, we need to restart the bglmaster, as shown in Example 6-29.<br />

Example 6-29 Changing the password in the db.properties file<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # vi db.properties<br />

... >; save and exit...<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster restart<br />

Stopping BGLMaster<br />

Starting BGLMaster<br />

Logging to<br />

/bgl/BlueLight/logs/BGL/bglsn-bglmaster-2006-0315-14:56:25.log<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster status<br />

idoproxy started [14561]<br />

ciodb started [14562]<br />

mmcs_server started [14563]<br />

monitor stopped<br />

perfmon stopped<br />

We can now use the mmcs_db_console to allocate a block, as shown in<br />

Example 6-30.<br />

Example 6-30 Allocating a block<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./mmcs_db_console<br />

connecting to mmcs_server<br />

connected to mmcs_server<br />

connected to DB2<br />

mmcs$ allocate R000_128<br />

OK<br />

Finally, we run a job.<br />

Lessons learned<br />

DB2 user password control is tied to UNIX authentication. The password that the Blue Gene/L system processes use to talk to DB2 is contained in the db.properties file. If you change the DB2 user password, you MUST update this file.<br />



6.2.14 Uncontrolled rack power off<br />

This time, we decided to power off the rack without doing any of the PrepareForService preparation beforehand.<br />

Error injection<br />

We switched the rack circuit breaker off, waited 20 seconds, and powered back on. This could happen in real life when there are power line fluctuations.<br />

<strong>Problem</strong> <strong>Determination</strong><br />

We try to run a job, but block allocation fails twice in a row—first with<br />

communication failure and next time with initchip error invalid JtagID, as<br />

shown in Example 6-31.<br />

Example 6-31 Trying to allocate a block after a rack power glitch<br />

mmcs$ allocate R000_128<br />

FAIL<br />

connect: idoproxy communication failure: BGLERR_IDO_PKT_TIMEOUT<br />

connection lost to node/link/service card [R00-M0-N0]<br />

ido://FF:F2:9F:16:F0:95:00:0D:60:E9:0F:6A/CTRL<br />

mmcs$ allocate R000_128<br />

FAIL<br />

connect: initchip error invalid JtagID<br />

As we can see, idoproxy cannot talk to the Service Card.<br />

We next check the system logs. The mmcs_server log shows the messages in Example 6-32.<br />

Example 6-32 Message in mmcs_server log<br />

Mar 15 17:04:29 (I) [1086321888] root allocate R000_128<br />

Mar 15 17:04:29 (I) [1086321888] root<br />

DBMidplaneController::addBlock(R000_128)<br />

Mar 15 17:04:29 (I) [1086321888] root:R000_128<br />

BlockController::connect()<br />

Mar 15 17:04:33 (I) [1086321888] root:R000_128<br />

BlockController::disconnect() releasing node and ido connections<br />

Mar 15 17:04:33 (I) [1086321888] root:R000_128 allocate: FAIL;connect:<br />

idoproxy communication failure: BGLERR_IDO_PKT_TIMEOUT connection lost<br />

to node/link/service card [R00-M0-N0]<br />

ido://FF:F2:9F:16:F0:95:00:0D:60:E9:0F:6A/CTRL<br />



Mar 15 17:04:33 (I) [1086321888] root:R000_128<br />

DBMidplaneController::removeBlock(R000_128)<br />

The idoproxy log shows the messages in Example 6-33.<br />

Example 6-33 The idoproxy messages<br />

Mar 15 17:04:29 (I) [1102423264]<br />

root:ido://FF:F2:9F:16:F0:95:00:0D:60:E9:0F:6A/CTRL OPEN(s)<br />

Mar 15 17:04:33 (E) [1084568800] Send Timeout... IPAddr=10.0.0.18<br />

IPMask=255.255.0.0 LicensePlate=ff:f2:9f:16:f0:95:00:0d:60:e9:0f:6a<br />

Mar 15 17:04:33 (E) [1084568800] packet failure -1... IPAddr=10.0.0.18<br />

IPMask=255.255.0.0 LicensePlate=ff:f2:9f:16:f0:95:00:0d:60:e9:0f:6a<br />

Mar 15 17:04:33 (I) [1102423264]<br />

root:ido://FF:F2:9F:16:F0:95:00:0D:60:E9:0F:6A/CTRL CLOSE<br />

► The RAS errors show this for all Node Cards.<br />

► System processes seem to be running fine, apart from the logged messages (Example 6-32 and Example 6-33).<br />

► Checking DB2 reveals it is up and running.<br />

Because we have communication errors, we then check out the Service Network<br />

and the Functional Network (eth0 and eth3 in our case), using the ifconfig and<br />

ethtool commands, as shown in Example 6-34.<br />

Example 6-34 Checking network interfaces<br />

bglsn:/ # ifconfig<br />

eth0 Link encap:Ethernet HWaddr 00:0D:60:4D:28:EA<br />

inet addr:10.0.0.1 Bcast:10.0.255.255 Mask:255.255.0.0<br />

inet6 addr: fe80::20d:60ff:fe4d:28ea/64 Scope:Link<br />

UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1<br />

RX packets:35228867 errors:0 dropped:0 overruns:0 frame:0<br />

TX packets:35220594 errors:0 dropped:0 overruns:0 carrier:0<br />

collisions:0 txqueuelen:1000<br />

RX bytes:4595599148 (4382.7 Mb) TX bytes:6839888628 (6523.0<br />

Mb)<br />

Base address:0xe800 Memory:f8120000-f8140000<br />

..... >....<br />

eth3 Link encap:Ethernet HWaddr 00:11:25:08:30:90<br />

inet addr:172.30.1.1 Bcast:172.30.255.255 Mask:255.255.0.0<br />

inet6 addr: fe80::211:25ff:fe08:3090/64 Scope:Link<br />

UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1<br />

RX packets:14054383 errors:0 dropped:0 overruns:0 frame:0<br />

TX packets:24871645 errors:0 dropped:0 overruns:0 carrier:0<br />



collisions:0 txqueuelen:1000<br />

RX bytes:7954413066 (7585.9 Mb) TX bytes:20968793643<br />

(19997.3 Mb)<br />

Base address:0xec00 Memory:c0080000-c00a0000<br />

..... >....<br />

bglsn:/ # ethtool eth0<br />

Settings for eth0:<br />

Supported ports: [ TP ]<br />

Supported link modes: 10baseT/Half 10baseT/Full<br />

100baseT/Half 100baseT/Full<br />

1000baseT/Full<br />

Supports auto-negotiation: Yes<br />

Advertised link modes: 10baseT/Half 10baseT/Full<br />

100baseT/Half 100baseT/Full<br />

1000baseT/Full<br />

Advertised auto-negotiation: Yes<br />

Speed: 1000Mb/s<br />

Duplex: Full<br />

Port: Twisted Pair<br />

PHYAD: 0<br />

Transceiver: internal<br />

Auto-negotiation: on<br />

Supports Wake-on: umbg<br />

Wake-on: g<br />

Current message level: 0x00000007 (7)<br />

Link detected: yes<br />

bglsn:/ # ethtool eth3<br />

Settings for eth3:<br />

Supported ports: [ TP ]<br />

Supported link modes: 10baseT/Half 10baseT/Full<br />

100baseT/Half 100baseT/Full<br />

1000baseT/Full<br />

Supports auto-negotiation: Yes<br />

Advertised link modes: 10baseT/Half 10baseT/Full<br />

100baseT/Half 100baseT/Full<br />

1000baseT/Full<br />

Advertised auto-negotiation: Yes<br />

Speed: 1000Mb/s<br />

Duplex: Full<br />

Port: Twisted Pair<br />

PHYAD: 0<br />

Transceiver: internal<br />

Auto-negotiation: on<br />

Supports Wake-on: umbg<br />

Wake-on: g<br />



Current message level: 0x00000007 (7)<br />

Link detected: yes<br />

► From Example 6-34 we can also see that the IP configuration is correct on the<br />

network interfaces.<br />

► Next, we use the ping command over the Functional and Service Networks, and<br />

check the link lights of the RJ45 jacks for the functional and service interfaces.<br />

These verifications do not reveal any problem.<br />

► Following this, we check the lights of the Service and Node Cards. The Node Card<br />

lights are out and the Service Card lights are cycling. This means that the cards<br />

are uninitialized, and we need to run the discovery process to rediscover the<br />

system.<br />

► For this, we do a PrepareForService operation to get the system into a<br />

good state, as shown in Example 6-35:<br />

Example 6-35 Running PrepareForService on the system<br />

bglsn:/discovery # ./PrepareForService R00<br />

Logging to<br />

/bgl/BlueLight/logs/BGL/PrepareForService-2006-03-16-10:47:01.log<br />

Mar 16 10:47:02.912 EST: PrepareForService started<br />

Mar 16 10:47:03.125 EST: Preparing 1 Midplanes in rack R00<br />

Mar 16 10:47:03.702 EST: @ killMidplaneBlocks - kill_midplane_jobs R000<br />

failed (FAIL;command?)<br />

Mar 16 10:47:06.213 EST: Freed any blocks using R000<br />

Mar 16 10:47:06.222 EST: @ killMidplaneBlocks - Retried 1 time(s)<br />

before we were able to 'free any blocks using this midplane' -<br />

Midplane(R000)!<br />

Mar 16 10:47:06.222 EST:<br />

Mar 16 10:47:11.403 EST: @ buildServiceCardObj - Exception occurred<br />

while building an iDo for ServiceCard(mLctn(R00-M0-S),<br />

mCardSernum(2033394a3033373900000000594c31304b3530363130304a),<br />

mLp(FF:F2:9F:15:1E:4D:00:0D:60:EA:E1:B2), mIp(10.0.0.12), mType(2))<br />

Mar 16 10:47:11.403 EST: @ buildServiceCardObj - Exception was<br />

(java.io.IOException: Could not contact iDo with<br />

LP=FF:F2:9F:15:1E:4D:00:0D:60:EA:E1:B2 and IP=/10.0.0.12 because<br />

java.lang.RuntimeException: Communication error: (DirectIDo for<br />

Uninitialized DirectIDo for<br />

FF:F2:9F:15:1E:4D:00:0D:60:EA:E1:B2@/10.0.0.12:0 is in state =<br />

COMMUNICATION_ERROR, sequenceNumberIsOk = false, ExpectedSequenceNumber<br />

= 0, Reply Sequence Number = -1, timedOut = true, retries = 5, timeout<br />

= 1000, Expected Op Command = 5, Actual Op Reply = -1, Expected Sync<br />

Command = 10, Actual Sync Reply = -1))<br />

..... > .....<br />



Mar 16 10:48:11.470 EST: @ buildServiceCardObj - Exception occurred<br />

while building an iDo for ServiceCard(mLctn(R00-M0-S),<br />

mCardSernum(2033394a3033373900000000594c31304b3530363130304a),<br />

mLp(FF:F2:9F:15:1E:4D:00:0D:60:EA:E1:B2), mIp(10.0.0.12), mType(2))<br />

Mar 16 10:48:11.471 EST: @ buildServiceCardObj - Exception was<br />

(java.io.IOException: Could not contact iDo with<br />

LP=FF:F2:9F:15:1E:4D:00:0D:60:EA:E1:B2 and IP=/10.0.0.12 because<br />

java.lang.RuntimeException: Communication error: (DirectIDo for<br />

Uninitialized DirectIDo for<br />

FF:F2:9F:15:1E:4D:00:0D:60:EA:E1:B2@/10.0.0.12:0 is in state =<br />

COMMUNICATION_ERROR, sequenceNumberIsOk = false, ExpectedSequenceNumber<br />

= 0, Reply Sequence Number = -1, timedOut = true, retries = 5, timeout<br />

= 1000, Expected Op Command = 5, Actual Op Reply = -1, Expected Sync<br />

Command = 10, Actual Sync Reply = -1))<br />

..... > .....<br />

► From the error messages (bold in Example 6-35) we can see that the system<br />

cannot talk to the IDo chips, which means that the Service Card is NOT<br />

initialized.<br />

► We now run discovery on the system:<br />

cd /discovery<br />

./SystemController start<br />

./Discovery0 start<br />

./PostDiscovery start<br />

► The lights of the Service Card return to their normal state, but the Node Card<br />

lights do not.<br />

► We try to run PrepareForService again, but it fails because of the previously<br />

failed attempt. We need to close the previous service action manually:<br />

db2 "update bglserviceaction set status = 'C' where id = 2"<br />

► We then run PrepareForService again on the rack.<br />

Note: Unfortunately, this attempt also failed, because our test system does not have<br />

the hardware expected in a typical Blue Gene/L rack (at least 1/2 rack).<br />

To overcome this, we added the -FORCE option and the ServiceAction<br />

completed as expected.<br />



As EndServiceAction did not complete successfully, we decided to manually<br />

mark all the IDo chips as missing (M) in the database, so the discovery process<br />

would pick them up:<br />

db2 "update bglidochip set status = 'M' where ipaddress like<br />

'10.0%'"<br />

Discovery found all the existing hardware. We then stopped the discovery<br />

process and restarted the Blue Gene/L system processes. We could then<br />

allocate a block and submit a job.<br />

Lessons learned<br />

We learned the following lessons:<br />

► If a rack is power cycled without preparation, the rack goes into an<br />

uninitialized state. Without the discovery processes running (especially<br />

SystemController), the Service Card does not get initialized and the system<br />

cannot talk to the IDo chips through the switch on the Service Card. You<br />

should always use PrepareForService and EndServiceAction and do a<br />

controlled power down (a minimal sketch of this sequence follows this list).<br />

► If you have an unplanned rack power outage, leave the rack off and start the<br />

discovery process, which marks all hardware as missing. When this is<br />

complete, power up the rack and let discovery find and initialize the<br />

hardware. When discovery completes, stop it and start the system processes<br />

to bring the system back into production.<br />
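
A minimal sketch of the controlled power-down sequence described above, assuming the same rack (R00) and /discovery scripts used in this scenario; the EndServiceAction invocation shown here is an assumption modeled on the PrepareForService syntax in Example 6-35:<br />

bglsn:/discovery # ./PrepareForService R00<br />

# (power the rack down, perform the work, then power it back up)<br />

bglsn:/discovery # ./EndServiceAction R00<br />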

6.3 File system scenarios<br />

This section addresses problem determination for issues related to both NFS<br />

and GPFS.<br />

Here is the list of file system scenarios that we ran. In each of these scenarios,<br />

we injected a problem manually. We show the steps taken to determine the<br />

problem and the resolution.<br />

1. Port mapper daemon not running on the SN.<br />

2. NFS daemon not running on the SN.<br />

3. GPFS pagepool (wrongly) set to 512 MB on bgIO cluster nodes.<br />

4. Secure shell (ssh) is broken (interactive authentication required).<br />

5. The /bgl file system becomes full.<br />

6. Installation of new Blue Gene/L driver code.<br />

7. Duplicate IP in /etc/hosts.<br />



8. Missing IO node in /etc/hosts.<br />

9. Duplicate entries (additional aliases) for the SN in /etc/hosts.<br />

In each of the scenarios, we first run the system to prove that it is<br />

working. To do this, we use LoadLeveler to run the IOR application, which writes<br />

data to the GPFS file system from all nodes. After this has run successfully, we<br />

inject the problem and rerun the same job. Each of the scenarios is split into<br />

the following sections:<br />

► Error injection<br />

► Problem determination<br />

► Lessons learned<br />

6.3.1 Port mapper daemon not running on the SN<br />

In this scenario, we are not using GPFS or LoadLeveler, just the core system<br />

loading the application from the /bgl file system. We cause the<br />

problem by killing the portmap daemon that is required by the NFS server.<br />

Error injection<br />

The error injected here was to kill the portmap process running on the SN.<br />

Problem determination<br />

Here we ran the “Hello world!” program from an mmcs_db_console session. Here is<br />

the error that we received:<br />

allocate R000_128 == failed with :<br />

mmcs_console : Error: unable to mount filesystem<br />

Example 6-36 presents mmcs_db_server.log when portmapd is not running.<br />

Example 6-36 Messages in mmcs_db_server.log (portmapd not running)<br />

Mar 13 16:02:02 (I) [1084843232] root:R000_128 DBBlockController::waitBoot(R000_128)<br />

Mar 13 16:02:13 (E) [1086973152] root:R000_128 RAS event: KERNEL FATAL: Error:<br />

unable to mount filesystem<br />

Mar 13 16:02:16 (E) [1086973152] root:R000_128 RAS event: KERNEL FATAL: Error:<br />

unable to mount filesystem<br />

Mar 13 16:02:16 (E) [1086973152] root:R000_128 RAS event: KERNEL FATAL: Error:<br />

unable to mount filesystem<br />

Mar 13 16:02:16 (E) [1086973152] root:R000_128 RAS event: KERNEL FATAL: Error:<br />

unable to mount filesystem<br />

Mar 13 16:02:17 (I) [1084843232] root:R000_128 allocate: FAIL;Error: unable to mount<br />

filesystem<br />



Mar 13 16:02:17 (I) [1084843232] root:R000_128<br />

DBMidplaneController::removeBlock(R000_128)<br />

Mar 13 16:02:17 (I) [1084843232] root BlockController::quiesceMailbox() waiting for<br />

ras events and I/O node shutdown<br />

Mar 13 16:02:17 (I) [1097856224] mmcs DatabaseCommandThread started: block<br />

R000_128, user root, action 3<br />

Mar 13 16:02:17 (I) [1097856224] mmcs setusername root<br />

Mar 13 16:02:17 (I) [1097856224] root db_free R000_128<br />

Mar 13 16:02:17 (I) [1097856224] root DBMidplaneController::addBlock(R000_128)<br />

Mar 13 16:02:17 (I) [1097856224] root:R000_128 DBBlockController::freeBlock()<br />

setBlockState(R000_128, TERMINATING) successful<br />

As we can see, the file system cannot be mounted. Following the problem<br />

determination methodology, we go straight to the NFS checks:<br />

► Check the export list on the SN:<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # showmount -e<br />

mount clntudp_create: RPC: Port mapper failure - RPC: Unable to<br />

receive<br />

The RPC: Port mapper failure - RPC: Unable to receive error indicates a<br />

port mapper daemon issue. To fix this problem, we ran the following command:<br />

/etc/init.d/portmap restart && /etc/init.d/nfsserver restart<br />
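
As a hedged follow-up (not captured in our logs), the restart can be verified by listing the registered RPC programs and the export list on the SN:<br />

bglsn:/ # rpcinfo -p localhost | egrep 'portmapper|mountd|nfs'<br />

bglsn:/ # showmount -e<br />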

Lessons learned<br />

We learned the following lessons:<br />

► If the port mapper daemon dies on the SN, we get the following error in either<br />

the mmcs_db_console or the mmcs_db_server error log:<br />

Error: unable to mount filesystem<br />

► To diagnose the problem, we can run the showmount -e command on the SN.<br />

6.3.2 NFS daemon not running on the SN<br />

In this scenario we use only the core system (SN, racks, networks) loading the<br />

application from the /bgl file system exported by the SN.<br />

Error injection<br />

The error injected here was to kill the nfsd process running on the SN.<br />



Problem determination<br />

Here we ran the “Hello world!” program from an mmcs_db_console session. Here is<br />

the error that we received:<br />

allocate R000_128 == failed with :<br />

mmcs_console : Error: unable to mount filesystem<br />

Now looking at the mmcs_db_server log we see the error:<br />

KERNEL FATAL: Error: unable to mount filesystem<br />

Using the problem determination methodology, we go straight to the NFS checks:<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # showmount -e<br />

mount clntudp_create: RPC: Program not registered<br />

This error indicates a problem with the NFS server. To fix this, we ran the<br />

following command:<br />

/etc/init.d/nfsserver restart<br />
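
A hedged sketch of a quick verification after the restart (not captured in our logs): confirm that the nfs and mountd programs are registered again with the port mapper:<br />

bglsn:/ # rpcinfo -u localhost nfs<br />

bglsn:/ # rpcinfo -u localhost mountd<br />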

Lessons learned<br />

We learned the following lessons:<br />

► If the nfsd daemon dies on the SN, we get the following error in either the<br />

mmcs_db_console or the mmcs_db_server error log when allocating a block:<br />

Error: unable to mount filesystem<br />

► To diagnose the problem, we can run the showmount -e command on the SN.<br />

6.3.3 GPFS pagepool (wrongly) set to 512 MB on bgIO cluster nodes<br />

In this scenario, we change the GPFS pagepool for the I/O nodes in the bgIO<br />

cluster. The pagepool is pinned kernel memory used for file and metadata caching by the<br />

GPFS daemon. Due to the limited memory (RAM) on the I/O nodes, a large<br />

pagepool prevents applications from running (or even prevents the GPFS<br />

daemon from starting on the I/O nodes).<br />

Error injection<br />

Example 6-37 shows the error we injected (no blocks were allocated at this time).<br />

Example 6-37 Changing the GPFS pagepool<br />

bglsn:/bgl/BlueLight/logs/BGL # mmchconfig pagepool=512M<br />

mmchconfig: Command successfully completed<br />

mmchconfig: Propagating the changes to all affected nodes.<br />

This is an asynchronous process.<br />



Problem determination<br />

In this scenario we use IBM LoadLeveler to submit a job (for details see 4.4, “IBM<br />

LoadLeveler” on page 167). We leave LoadLeveler to automatically allocate a<br />

block. We observe that the job did not produce any I/O files after five minutes<br />

(usually, this job runs in under one minute). As a starting point, we check the<br />

mmcs_db_server log:<br />

bglsn:/bgl/BlueLight/logs/BGL # view<br />

bglsn-mmcs_db_server-2006-0316-15:42:04.log<br />

Initially, we did not find any relevant message, so we decided to log on to an I/O<br />

node and check from there (see Example 6-38).<br />

Example 6-38 Checking GPFS on one I/O node<br />

bglsn:/bgl/BlueLight/logs/BGL # ssh root@ionode5<br />

BusyBox v1.00-rc2 (2006.01.09-19:48+0000) Built-in shell (ash)<br />

Enter 'help' for a list of built-in commands.<br />

$ df<br />

Filesystem 1K-blocks Used Available Use% Mounted on<br />

rootfs 7931 1644 5878 22% /<br />

/dev/root 7931 1644 5878 22% /<br />

172.30.1.1:/bgl 9766544 2449008 7317536 26% /bgl<br />

172.30.1.33:/bglscratch<br />

36700160 64448 36635712 1% /bglscratch<br />

$ /usr/lpp/mmfs/bin/mmgetstate<br />

Node number Node name GPFS state<br />

-----------------------------------------<br />

6 ionode5 down<br />

Note: We could also use the mmgetstate -a command on the SN. This would<br />

return the status of all nodes in the bgIO GPFS cluster. However, we have<br />

chosen to go directly to one of the nodes allocated for the LoadLeveler job.<br />

We then look into the GPFS log (mmfs.log.latest) on the respective I/O node, as<br />

shown in Example 6-39.<br />

Example 6-39 GPFS log on the I/O node<br />

$ cat /var/mmfs/gen/mmfslog<br />

/bin/cat: /proc/kallsyms: No such file or directory<br />

Tue Mar 28 15:56:47 2006: mmfsd initializing. {Version: 2.3.0.10<br />

Built: Jan 16 2006 13:08:25} ...<br />



Tue Mar 28 15:56:48 2006: Not enough memory to allocate internal data<br />

structure.<br />

Tue Mar 28 15:56:48 2006: The mmfs daemon is shutting down abnormally.<br />

Tue Mar 28 15:56:48 2006: mmfsd is shutting down.<br />

Tue Mar 28 15:56:48 2006: Reason for shutdown: LOGSHUTDOWN called<br />

Tue Mar 28 15:56:49 EST 2006 runmmfs starting<br />

Removing old /var/adm/ras/mmfs.log.* files:<br />

Tue Mar 28 15:56:49 EST 2006 runmmfs: respawn 9 waiting 336 seconds<br />

before restarting mmfsd<br />

From the GPFS log we can see that the mmfs daemon is respawning and we also<br />

see the following message, which indicates where the problem might be:<br />

Not enough memory to allocate internal data structure.<br />
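
As an additional hedged check (not part of our original capture), the configured pagepool can be confirmed from the I/O node itself, because mmlsconfig reads the local copy of the GPFS configuration:<br />

$ /usr/lpp/mmfs/bin/mmlsconfig | grep pagepool<br />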

Moreover, 10 minutes after the job submission, we also get the messages shown<br />

in Example 6-40 in the application (job) log file.<br />

Example 6-40 Application messages<br />

test1@bglfen1:/bglscratch/test1> view ior-gpfs.out<br />

IOR-2.8.3: MPI Coordinated Test of Parallel I/O<br />

Run began: Tue Mar 28 15:51:26 2006<br />

Command line used: /bglscratch/test1/applications/IOR/IOR.rts -f<br />

/bglscratch/test1/applications/IOR/ior-inputs<br />

Machine: Blue Gene L ionode3<br />

Summary:<br />

api = MPIIO (version=2, subversion=0)<br />

test filename = /bubu/Examples/IOR/IOR-output-MPIIO-1<br />

access = single-shared-file<br />

clients = 128 (16 per node)<br />

repetitions = 4<br />

xfersize = 32768 bytes<br />

blocksize = 1 MiB<br />

aggregate filesize = 128 MiB<br />

delaying 1 seconds . . .<br />

** error **<br />

** error **<br />

** error **<br />

** error **<br />

** error **<br />

** error **<br />

** error **<br />



** error **<br />

** error **<br />

** error **<br />

** error **<br />

** error **<br />

** error **<br />

** error **<br />

ERROR in aiori-MPIIO.c (line 130): cannot open file.<br />

** error **<br />

** error **<br />

ERROR in aiori-MPIIO.c (line 130): cannot open file.<br />

ERROR in aiori-MPIIO.c (line 130): cannot open file.<br />

ERROR in aiori-MPIIO.c (line 130): cannot open file.<br />

ERROR in aiori-MPIIO.c (line 130): cannot open file.<br />

ERROR in aiori-MPIIO.c (line 130): cannot open file.<br />

ERROR in aiori-MPIIO.c (line 130): cannot open file.<br />

** error **<br />

ERROR in aiori-MPIIO.c (line 130): cannot open file.<br />

ERROR in aiori-MPIIO.c (line 130): cannot open file.<br />

MPI File does not exist, error stack:<br />

MPI File does not exist, error stack:<br />

MPI File does not exist, error stack:<br />

MPI File does not exist, error stack:<br />

ADIOI_BGL_OPEN(54): File /bubu/Examples/IOR/IOR-output-MPIIO-1 does not<br />

exist<br />

ADIOI_BGL_OPEN(54): File /bubu/Examples/IOR/IOR-output-MPIIO-1 does not<br />

exist<br />

ADIOI_BGL_OPEN(54): File /bubu/Examples/IOR/IOR-output-MPIIO-1 does not<br />

exist<br />

** exiting **<br />

** exiting **<br />

** exiting **<br />

** exiting **<br />

From the application output file, we can see that the application is unable to find<br />

the output file in GPFS, which is also a good indication that there is a problem<br />

with GPFS.<br />

Note: The GPFS file system (/bubu) is available on the SN; thus, the<br />

application output file can be found inside the file system. It is only the I/O<br />

nodes that cannot write to (or read from) the GPFS file system.<br />



We can also check one of the I/O node logs in the /bgl/BlueLight/logs/BGL<br />

directory on the SN, as in Example 6-41.<br />

Example 6-41 I/O node log messages<br />

Mar 28 15:41:15 (I) [1088451808] root:RMP28Mr154042300 {0}.0: Starting<br />

GPFS<br />

Mar 28 15:41:15 (I) [1088451808] root:RMP28Mr154042300 {0}.0:<br />

Disabling protocol version 1. Could not load host key<br />

Mar 28 15:41:15 (I) [1088451808] root:RMP28Mr154042300 {0}.0:<br />

Mar 28 15:51:19 (I) [1088451808] root:RMP28Mr154042300 {0}.0:<br />

R00-M0-N0-I:J18-U01 /var/etc/rc.d/rc3.d/S40gpfs: GPFS did not come up<br />

on I/O nod<br />

Mar 28 15:51:19 (I) [1088451808] root:RMP28Mr154042300 {0}.0:<br />

/var/etc/rc.d/rc3.d/S40gpfs: GPFS did not come up on I/O node ionode1 :<br />

172.30.2.1<br />

Mar 28 15:51:19 (I) [1088451808] root:RMP28Mr154042300 {0}.0:<br />

[ciod:initialized]<br />

Mar 28 15:51:20 (I) [1088451808] root:RMP28Mr154042300 {0}.0: e<br />

ionode1 : 172.30.2.1<br />

Starting syslog services<br />

Starting ciod<br />

Starting XL Compiler Environment for I/O node<br />

ciod: version "Jan 10 2006 16:25:12"<br />

ciod: running in virtual node mode with 32 processors<br />

Mar 28 15:51:20 (I) [1088451808] root:RMP28Mr154042300 {0}.0:<br />

BusyBox v1.00-rc2 (2006.01.09-19:48+0000) Built-in shell (ash)<br />

Enter 'help' for a list of built-in commands.<br />

/bin/sh: can't access tty; job control turned off<br />

$<br />

Mar 28 15:51:24 (I) [1088451808] root:RMP28Mr154042300 {0}.0:<br />

Switching to coprocessor mode<br />

Mar 28 15:51:24 (I) [1088451808] root:RMP28Mr154042300 {0}.0:<br />

Mar 28 15:52:03 (I) [1088451808] root:RMP28Mr154042300 {0}.0: Mar 28<br />

15:52:02<br />

Mar 28 15:52:03 (I) [1088451808] root:RMP28Mr154042300 {0}.0: ionode1<br />

mmfs: mmfsd: Error=MMFS_ABNORMAL_SHUTDOWN, ID=0x1FB9260D, Tag=0<br />

Mar 28 15:52:02 ionode1 mmfs: Shutting down abnormally due to error in<br />

/project/sprelcs3b/build/rcs3bs010a/src/avs/fs/mmfs/ts/bufmgr/linux/buf<br />

mgr-plat.C line 170 retCode -1, reasonCode 0<br />

Mar 28 15:56:48 (I) [1088451808] root:RMP28Mr154042300 {0}.0: Mar 28<br />

15:56:47<br />



Mar 28 15:56:48 (I) [1088451808] root:RMP28Mr154042300 {0}.0: ionode1<br />

mmfs: mmfsd: Error=MMFS_ABNORMAL_SHUTDOWN, ID=0x1FB9260D, Tag=0<br />

Mar 28 15:56:47 ionode1 mmfs: Shutting down abnormally due to error in<br />

/project/sprelcs3b/build/rcs3bs010a/src/avs/fs/mmfs/ts/bufmgr/linux/buf<br />

mgr-plat.C line 170 retCode -1, reasonCode 0<br />

Mar 28 16:02:30 (I) [1088451808] root:RMP28Mr154042300 {0}.0: Mar 28<br />

16:02:30<br />

Mar 28 16:02:30 (I) [1088451808] root:RMP28Mr154042300 {0}.0: ionode1<br />

mmfs: mmfsd: Error=MMFS_ABNORMAL_SHUTDOWN, ID=0x1FB9260D, Tag=0<br />

Mar 28 16:02:30 ionode1 mmfs: Shutting down abnormally due to error in<br />

/project/sprelcs3b/build/rcs3bs010a/src/avs/fs/mmfs/ts/bufmgr/linux/buf<br />

mgr-plat.C line 170 retCode -1, reasonCode 0<br />

bglsn:/bgl/BlueLight/logs/BGL #<br />

From Example 6-41 we can see that GPFS was started at 15:41:15 and it is not<br />

until 15:51:19 that we get the following message from S40gpfs:<br />

: GPFS did not come up on I/O node ionode1 :<br />

Looking in the file /bgl/BlueLight/ppcfloor/dist/etc/rc.d/init.d/gpfs, we can see why<br />

from the code shown in Example 6-42.<br />

Example 6-42 Excerpt from GPFS startup script for I/O nodes<br />

# This file will be created by mmfsup.scr to signal that GPFS startup is<br />

# complete<br />

upfile=/tmp/mmfsup.done<br />

# Create mmfsup script that will run when GPFS is ready<br />

cat<br />

..... > .....<br />


then ras_advisory "$0: GPFS did not come up on I/O node $HOSTID"<br />

exit 1<br />

fi<br />

done<br />

rm -f $upfile<br />

echo "$0: GPFS is ready on I/O node $HOSTID_LOC"<br />

;;<br />

stop)<br />

# Set defaults for GPFS configuration variables<br />

GPFS_STARTUP=0<br />

# Obtain overrides from config file<br />

GPFSFILE=/etc/sysconfig/gpfs<br />

[ -r $GPFSFILE ] && . $GPFSFILE<br />

The wait loop in the script explains the timeout: 300 iterations x 2 seconds =<br />

600 seconds. Thus, the 10-minute wait if GPFS fails to come up.<br />

Lessons learned<br />

Always make sure that you follow the Blue Gene/L documentation when changing any<br />

GPFS parameters. Because the I/O nodes have a very particular configuration, you<br />

need to be extra careful with GPFS.<br />
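
Recovery for this scenario is not shown in our logs; as a hedged sketch, it amounts to setting the pagepool back to a value that the I/O nodes can accommodate (128M matches the value used later in Example 6-49), then rebooting the block so that GPFS starts cleanly:<br />

bglsn:/ # mmchconfig pagepool=128M<br />

bglsn:/ # mmgetstate -a<br />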

6.3.4 Secure shell (ssh) is broken<br />

For this scenario, we remove the following files so that it should be impossible for<br />

the root user to communicate between GPFS nodes in the bgIO cluster without<br />

interactive authentication (key acceptance or password prompting):<br />

► /bgl/dist/root/.ssh/known_hosts<br />

► /bgl/dist/root/.ssh/authorized_keys<br />

► /root/.ssh/known_hosts<br />

► /root/.ssh/authorized_keys<br />

Error injection<br />

Example 6-43 shows the way that we injected the error.<br />

Example 6-43 Removing ssh authentication files<br />

bglsn:/bgl/BlueLight/ppcfloor/dist/etc/rc.d/init.d # cd /root/.ssh<br />

bglsn:~/.ssh # ls -lrt<br />

total 38<br />



-rw-r--r-- 1 root root 220 Mar 19 17:12 id_rsa.pub<br />

-rw------- 1 root root 887 Mar 19 17:12 id_rsa<br />

-rw-r--r-- 1 root root 2976 Mar 19 19:12 known_hosts.b4gpfs<br />

-rw-r--r-- 1 root root 440 Mar 28 13:40 authorized_keys<br />

-rw-r--r-- 1 root root 4140 Mar 28 17:31 known_hosts<br />

drwx------ 2 root root 280 Mar 28 17:31 .<br />

drwxr-xr-x 27 root root 1768 Mar 30 14:54 ..<br />

bglsn:~/.ssh # mv known_hosts known_hosts.orig<br />

bglsn:~/.ssh # mv authorized_keys authorized_keys.orig<br />

bglsn:~/.ssh # cd /bgl/dist/root/.ssh/<br />

bglsn:/bgl/dist/root/.ssh # ls -lrt<br />

total 32<br />

drwx------ 3 root root 72 Mar 17 15:57 ..<br />

-rw-r--r-- 1 root root 220 Mar 19 17:05 id_rsa.pub<br />

-rw------- 1 root root 887 Mar 19 17:05 id_rsa<br />

-rw-r--r-- 1 root root 4091 Mar 19 19:14 known_hosts_gpfs<br />

-rw-r--r-- 1 root root 440 Mar 28 13:55 authorized_keys<br />

-rw-r--r-- 1 root root 940 Mar 28 17:33 known_hosts<br />

drwx------ 2 root root 272 Mar 28 17:33 .<br />

bglsn:/bgl/dist/root/.ssh # mv known_hosts known_hosts.orig<br />

bglsn:/bgl/dist/root/.ssh # mv authorized_keys authorized_keys.orig<br />

bglsn:/bgl/dist/root/.ssh #<br />

Problem determination<br />

We check the status of LoadLeveler, as shown in Example 6-44.<br />

Example 6-44 Checking the LoadLeveler status before submitting the job<br />

test1@bglfen1:/bglscratch/test1> llq<br />

llq: There is currently no job status to report.<br />

test1@bglfen1:/bglscratch/test1> llstatus<br />

Name Schedd InQ Act Startd Run LdAvg Idle Arch OpSys<br />

bglfen1.itso.ibm.com Avail 0 0 Idle 0 0.02 8 PPC64 Linux2<br />

bglfen2.itso.ibm.com Avail 0 0 Idle 0 0.00 9999 PPC64 Linux2<br />

PPC64/Linux2 2 machines 0 jobs 0 running<br />

Total Machines 2 machines 0 jobs 0 running<br />

The Central Manager is defined on bglsn.itso.ibm.com<br />

The BACKFILL scheduler with Blue Gene support is in use<br />

Blue Gene is present<br />

All machines on the machine_list are present.<br />



Next, we submit the LoadLeveler job named ior-gpfs.cmd (see Example 6-45).<br />

Example 6-45 Submitting the LoadLeveler job<br />

test1@bglfen1:/bglscratch/test1> set -o vi<br />

test1@bglfen1:/bglscratch/test1> ls<br />

applications hello-file-2.rts hello.rts ior-gpfs.out hello128.cmd<br />

ello-file.rts ior-gpfs.cmd ior-gpfs.out.ciod-hung-scenario hello.cmd<br />

hello-gpfs.err ior-gpfs.err hello-file-1.rts hello-gpfs.out<br />

ior-gpfs.err.ciod-hung-scenario<br />

test1@bglfen1:/bglscratch/test1> llsubmit ior-gpfs.cmd<br />

llsubmit: The job "bglfen1.itso.ibm.com.37" has been submitted.<br />

test1@bglfen1:/bglscratch/test1> llq<br />

Id Owner Submitted ST PRI Class<br />

Running On<br />

------------------------ ---------- ----------- -- --- ------------<br />

----------bglfen1.37.0<br />

test1 3/28 10:50 I 50 small<br />

1 job step(s) in queue, 1 waiting, 0 pending, 0 running, 0 held, 0<br />

Example 6-46 shows that the LoadLeveler job is actually running.<br />

Example 6-46 LoadLeveler queue shows job is “Running”<br />

test1@bglfen1:/bglscratch/test1> llq<br />

Id Owner Submitted ST PRI Class<br />

Running On<br />

------------------------ ---------- ----------- -- --- ------------<br />

---------bglfen1.37.0<br />

test1 3/28 10:50 R 50 small<br />

bglfen1<br />

Now, we check the IOR job output file (see Example 6-47).<br />

Example 6-47 IOR job output file<br />

test1@bglfen1:/bglscratch/test1> tail -f ior-gpfs.out<br />

IOR-2.8.3: MPI Coordinated Test of Parallel I/O<br />

Run began: Tue Mar 28 12:11:48 2006<br />

Command line used: /bglscratch/test1/applications/IOR/IOR.rts -f<br />

/bglscratch/test1/applications/IOR/ior-inputs<br />

Machine: Blue Gene L ionode3<br />



Summary:<br />

api = MPIIO (version=2, subversion=0)<br />

test filename = /bubu/Examples/IOR/IOR-output-MPIIO-1<br />

access = single-shared-file<br />

clients = 128 (16 per node)<br />

repetitions = 10<br />

xfersize = 32768 bytes<br />

blocksize = 1 MiB<br />

aggregate filesize = 128 MiB<br />

delaying 1 seconds . . .<br />

access bw(MiB/s) block(KiB) xfer(KiB) open(s) wr/rd(s) close(s) iter<br />

------ --------- ---------- --------- -------- -------- -------- ---write<br />

7.27 1024.00 32.00 0.104517 17.00 1.59 0<br />

delaying 1 seconds . . .<br />

read 75.77 1024.00 32.00 0.004636 1.68 0.002146 0<br />

delaying 1 seconds . . .<br />

write 7.15 1024.00 32.00 0.027025 17.35 1.51 1<br />

delaying 1 seconds . . .<br />

read 76.39 1024.00 32.00 0.004542 1.67 0.002301 1<br />

delaying 1 seconds . . .<br />

bglsn:/bubu/Examples/IOR # ls -lrt<br />

total 245888<br />

-rw-r--r-- 1 test1 itso 33554432 Mar 21 17:14 IOR-output<br />

drwxrwxrwx 3 root root 32768 Mar 22 10:00 ..<br />

-rw-r--r-- 1 test1 itso 134217728 Mar 23 10:37 IOR-output-MPIIO<br />

drwxr-xr-x 2 test1 itso 32768 Mar 28 12:14 .<br />

-rw-r--r-- 1 test1 itso 133922816 Mar 28 12:14 IOR-output-MPIIO-1<br />

test1@bglfen1:/bglscratch/test1> llq<br />

llq: There is currently no job status to report.<br />

test1@bglfen1:/bglscratch/test1><br />

Apparently, the job ran correctly. Assuming that we are not application<br />

specialists, we need to consult the application owner to verify that the output is<br />

what was expected.<br />



We get confirmation that the application’s output is fine; however, we decide to<br />

perform additional checks. First, we do some basic checks on the SN, as shown in<br />

Example 6-48.<br />

Example 6-48 Additional node checks<br />

##### GPFS checks:<br />

bglsn:/bubu/Examples/IOR # df<br />

Filesystem 1K-blocks Used Available Use% Mounted on<br />

/dev/sdb3 70614928 4743632 65871296 7% /<br />

tmpfs 1898508 8 1898500 1% /dev/shm<br />

/dev/sda4 489452 95272 394180 20% /tmp<br />

/dev/sda1 9766544 2448544 7318000 26% /bgl<br />

/dev/sda2 9766608 699636 9066972 8% /dbhome<br />

p630n03:/bglscratch 36700160 64384 36635776 1% /bglscratch<br />

p630n03_fn:/nfs_mnt 104857600 11339904 93517696 11% /mnt<br />

/dev/bubu_gpfs1 1138597888 3269632 1135328256 1% /bubu<br />

p630n03:/bglhome 2621440 98464 2522976 4% /bglhome<br />

bglsn:/bubu/Examples/IOR # touch /bubu/foo<br />

bglsn:/bubu/Examples/IOR # mmgetstate<br />

Node number Node name GPFS state<br />

-----------------------------------------<br />

1 bglsn_fn active<br />

bglsn:/bubu/Examples/IOR # mmgetstate -a<br />

The authenticity of host 'ionode6 (172.30.2.6)' can't be established.<br />

RSA key fingerprint is 0a:0b:9b:76:dc:f1:a0:55:69:9f:74:31:e1:e9:54:0b.<br />

Are you sure you want to continue connecting (yes/no)? The authenticity<br />

of host 'ionode7 (172.30.2.7)' can't be established.<br />

RSA key fingerprint is 0a:0b:9b:76:dc:f1:a0:55:69:9f:74:31:e1:e9:54:0b.<br />

Are you sure you want to continue connecting (yes/no)? The authenticity<br />

of host 'ionode5 (172.30.2.5)' can't be established.<br />

RSA key fingerprint is 0a:0b:9b:76:dc:f1:a0:55:69:9f:74:31:e1:e9:54:0b.<br />

Are you sure you want to continue connecting (yes/no)? The authenticity<br />

of host 'ionode3 (172.30.2.3)' can't be established.<br />

RSA key fingerprint is 0a:0b:9b:76:dc:f1:a0:55:69:9f:74:31:e1:e9:54:0b.<br />

Are you sure you want to continue connecting (yes/no)? The authenticity<br />

of host 'ionode1 (172.30.2.1)' can't be established.<br />

RSA key fingerprint is 0a:0b:9b:76:dc:f1:a0:55:69:9f:74:31:e1:e9:54:0b.<br />

Are you sure you want to continue connecting (yes/no)? The authenticity<br />

of host 'ionode4 (172.30.2.4)' can't be established.<br />

RSA key fingerprint is 0a:0b:9b:76:dc:f1:a0:55:69:9f:74:31:e1:e9:54:0b.<br />



Are you sure you want to continue connecting (yes/no)? The authenticity<br />

of host 'ionode2 (172.30.2.2)' can't be established.<br />

RSA key fingerprint is 0a:0b:9b:76:dc:f1:a0:55:69:9f:74:31:e1:e9:54:0b.<br />

Are you sure you want to continue connecting (yes/no)? The authenticity<br />

of host 'ionode8 (172.30.2.8)' can't be established.<br />

RSA key fingerprint is 0a:0b:9b:76:dc:f1:a0:55:69:9f:74:31:e1:e9:54:0b.<br />

Are you sure you want to continue connecting (yes/no)? The authenticity<br />

of host 'bglsn_fn.itso.ibm.com (172.30.1.1)' can't be established.<br />

RSA key fingerprint is 52:81:f5:75:be:d6:8e:cf:65:a4:b5:23:46:1a:c6:94.<br />

Are you sure you want to continue connecting (yes/no)? no<br />

^C<br />

mmgetstate: Interrupt received.==============================<br />

Because we should not be prompted to accept the host identity, we realize that<br />

there is a problem. We double-check using the mmdsh and mmchconfig commands,<br />

as shown in Example 6-49.<br />

Example 6-49 Checking GPFS remote command execution<br />

bglsn:/bubu/Examples/IOR # export WCOLL=/tmp/ionodes<br />

bglsn:/bubu/Examples/IOR # mmdsh date<br />

mmdsh: ionode1 rsh process had return code 1.<br />

mmdsh: ionode2 rsh process had return code 1.<br />

mmdsh: ionode3 rsh process had return code 1.<br />

mmdsh: ionode4 rsh process had return code 1.<br />

mmdsh: ionode5 rsh process had return code 1.<br />

ionode1: ionode1: Connection refused<br />

ionode2: ionode2: Connection refused<br />

ionode3: ionode3: Connection refused<br />

ionode4: ionode4: Connection refused<br />

ionode5: ionode5: Connection refused<br />

ionode6: ionode6: Connection refused<br />

ionode7: ionode7: Connection refused<br />

ionode8: ionode8: Connection refused<br />

bglsn:/bubu/Examples/IOR # mmchconfig pagepool=128M -i<br />

mmchconfig: Command successfully completed<br />

The authenticity of host 'bglsn_fn.itso.ibm.com (172.30.1.1)' can't be<br />

established.<br />

RSA key fingerprint is 52:81:f5:75:be:d6:8e:cf:65:a4:b5:23:46:1a:c6:94.<br />

Are you sure you want to continue connecting (yes/no)? The authenticity<br />

of host 'ionode2 (172.30.2.2)' can't be established.<br />

RSA key fingerprint is 0a:0b:9b:76:dc:f1:a0:55:69:9f:74:31:e1:e9:54:0b.<br />

Are you sure you want to continue connecting (yes/no)? The authenticity<br />

of host 'ionode5 (172.30.2.5)' can't be established.<br />



RSA key fingerprint is 0a:0b:9b:76:dc:f1:a0:55:69:9f:74:31:e1:e9:54:0b.<br />

Are you sure you want to continue connecting (yes/no)? The authenticity<br />

of host 'ionode7 (172.30.2.7)' can't be established.<br />

RSA key fingerprint is 0a:0b:9b:76:dc:f1:a0:55:69:9f:74:31:e1:e9:54:0b.<br />

Are you sure you want to continue connecting (yes/no)? The authenticity<br />

of host 'ionode8 (172.30.2.8)' can't be established.<br />

RSA key fingerprint is 0a:0b:9b:76:dc:f1:a0:55:69:9f:74:31:e1:e9:54:0b.<br />

Are you sure you want to continue connecting (yes/no)? The authenticity<br />

of host 'ionode4 (172.30.2.4)' can't be established.<br />

RSA key fingerprint is 0a:0b:9b:76:dc:f1:a0:55:69:9f:74:31:e1:e9:54:0b.<br />

Are you sure you want to continue connecting (yes/no)? The authenticity<br />

of host 'ionode6 (172.30.2.6)' can't be established.<br />

RSA key fingerprint is 0a:0b:9b:76:dc:f1:a0:55:69:9f:74:31:e1:e9:54:0b.<br />

Are you sure you want to continue connecting (yes/no)? The authenticity<br />

of host 'ionode1 (172.30.2.1)' can't be established.<br />

RSA key fingerprint is 0a:0b:9b:76:dc:f1:a0:55:69:9f:74:31:e1:e9:54:0b.<br />

Are you sure you want to continue connecting (yes/no)? The authenticity<br />

of host 'ionode3 (172.30.2.3)' can't be established.<br />

RSA key fingerprint is 0a:0b:9b:76:dc:f1:a0:55:69:9f:74:31:e1:e9:54:0b.<br />

Are you sure you want to continue connecting (yes/no)? no<br />

no<br />

no<br />

no<br />

Please type 'yes' or 'no': no<br />

Please type 'yes' or 'no': no<br />

Please type 'yes' or 'no': no<br />

no<br />

no<br />

Please type 'yes' or 'no': 'NO'<br />

Please type 'yes' or 'no': no<br />

^C<br />

mmchconfig: Interrupt received: changes not propagated.<br />

From Example 6-49, we can conclude that we have GPFS authentication<br />

problems. However, this does not affect the job, because authentication is only<br />

required when GPFS executes commands to start/stop daemons and modify<br />

GPFS cluster configuration.<br />



Now that we know that the GPFS configuration commands have a problem, we<br />

check that the block is booted, and try to connect interactively to an I/O node<br />

using ssh, as in Example 6-50.<br />

Example 6-50 Allocating a block w/ GPFS and ssh authentication broken<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./mmcs_db_console<br />

connecting to mmcs_server<br />

connected to mmcs_server<br />

connected to DB2<br />

mmcs$ list_blocks<br />

OK<br />

RMP28Mr121102270 root(0) connected<br />

mmcs$ redirect RMP28Mr121102270 on<br />

OK<br />

mmcs$ {i} write_con hostname<br />

FAIL<br />

block not selected<br />

mmcs$ select_block RMP28Mr121102270<br />

OK<br />

$<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ssh root@ionode5<br />

The authenticity of host 'ionode5 (172.30.2.5)' can't be established.<br />

RSA key fingerprint is 0a:0b:9b:76:dc:f1:a0:55:69:9f:74:31:e1:e9:54:0b.<br />

Are you sure you want to continue connecting (yes/no)? yes<br />

Warning: Permanently added 'ionode5,172.30.2.5' (RSA) to the list of<br />

known hosts.<br />

root@ionode5's password:<br />

Permission denied, please try again.<br />

root@ionode5's password:<br />

Permission denied, please try again.<br />

root@ionode5's password:<br />

Permission denied (publickey,password,keyboard-interactive).<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin #<br />

Because the root user on the I/O nodes does not have a password, ssh does not<br />

allow us to log in, so we need to check the ssh key authentication.<br />

Before we leave this scenario, let us check whether GPFS can be stopped and<br />

started with the ssh broken, as shown in Example 6-51.<br />

Example 6-51 Checking GPFS can be stopped/started<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./mmcs_db_console<br />

connecting to mmcs_server<br />

connected to mmcs_server<br />



connected to DB2<br />

mmcs$ list_blocks<br />

OK<br />

RMP28Mr121102270 root(0) connected<br />

mmcs$ free RMP28Mr121102270<br />

OK<br />

mmcs$<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # mmshutdown<br />

Tue Mar 28 13:33:58 EST 2006: mmshutdown: Starting force unmount of<br />

GPFS file Tue Mar 28 13:34:13 EST 2006: mmshutdown: Shutting down GPFS<br />

daemons<br />

Tue Mar 28 13:34:43 EST 2006: mmshutdown: Finished<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # mmgetstate<br />

Node number Node name GPFS state<br />

-----------------------------------------<br />

1 bglsn_fn down<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # mmstartup<br />

Tue Mar 28 13:35:46 EST 2006: mmstartup: Starting GPFS ...<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # df<br />

Filesystem 1K-blocks Used Available Use% Mounted on<br />

/dev/sdb3 70614928 4732208 65882720 7% /<br />

tmpfs 1898508 8 1898500 1% /dev/shm<br />

/dev/sda4 489452 95276 394176 20% /tmp<br />

/dev/sda1 9766544 2448564 7317980 26% /bgl<br />

/dev/sda2 9766608 699636 9066972 8% /dbhome<br />

p630n03:/bglscratch 36700160 64416 36635744 1% /bglscratch<br />

p630n03_fn:/nfs_mnt 104857600 11339904 93517696 11% /mnt<br />

p630n03:/bglhome 2621440 98464 2522976 4% /bglhome<br />

/dev/bubu_gpfs1 1138597888 3269632 1135328256 1% /bubu<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin #<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # mmgetstate<br />

Node number Node name GPFS state<br />

-----------------------------------------<br />

1 bglsn_fn active<br />

Even though it might look strange, GPFS started correctly and mounted the<br />

remote file system. This is expected GPFS behavior (see the following note).<br />



Lessons learned<br />

From this scenario, we learn that the ssh authentication within the bgIO cluster is<br />

only required for changes to GPFS configuration and does not affect data traffic<br />

or the stopping and starting of GPFS on a local node.<br />

Note: Because GPFS is started on each node individually when the node is<br />

booted, the fact that ssh authentication is broken does not affect the GPFS<br />

cluster as long as the GPFS configuration file for each node (mmsdrfs) does<br />

not change.<br />
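
To restore the configuration after this scenario, a minimal sketch is simply to move the saved key files back into place (these are the same files and directories shown in Example 6-43):<br />

bglsn:~/.ssh # mv known_hosts.orig known_hosts<br />

bglsn:~/.ssh # mv authorized_keys.orig authorized_keys<br />

bglsn:/bgl/dist/root/.ssh # mv known_hosts.orig known_hosts<br />

bglsn:/bgl/dist/root/.ssh # mv authorized_keys.orig authorized_keys<br />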

6.3.5 The /bgl file system becomes full (Blue Gene/L uses GPFS)<br />

This scenario analyzes what happens when the /bgl file system fills up to 100%. We<br />

ran this scenario again to see how it affects a complex Blue Gene/L system.<br />

Note: In this scenario, our Blue Gene/L system USES GPFS.<br />

Error injection<br />

We used the dd command to create a file called /bgl/largefile that filled the /bgl<br />

file system (100%).<br />
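
A hedged sketch of the injection (the exact dd invocation was not captured; dd keeps writing zeros until the file system reports "No space left on device"):<br />

bglsn:/ # dd if=/dev/zero of=/bgl/largefile bs=1M<br />

bglsn:/ # df /bgl<br />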

Problem determination<br />

We started by running the LoadLeveler job (ior-gpfs.cmd). While LoadLeveler<br />

shows the job as running, we see that the block is still in the process<br />

of booting. Looking at the mmcs_db_server log reveals the GPFS messages shown<br />

in bold in Example 6-52.<br />

Example 6-52 GPFS cannot be started<br />

Mar 27 17:34:33 (I) [1096078560] {0}.0: Mon Mar 27 17:34:32 EST 2006<br />

mmautoload: GPFS is waiting for cluster data reposi<br />

Mar 27 17:34:33 (I) [1096078560] {0}.0: tory<br />

Mar 27 17:37:39 (I) [1096078560] {17}.0: tory<br />

mmautoload: The GPFS environment cannot be initialized.<br />

mmautoload: Correct the problem and use mmstartup to start GPFS.<br />

mmautoload: The GPFS environment cannot be initialized.<br />

Mar 27 17:37:39 (I) [1096078560] {0}.0: Mon Mar 27 17:37:38 EST 2006<br />

mmautoload: GPFS is waiting for cluster data reposi<br />

Mar 27 17:37:39 (I) [1096078560] {0}.0: tory<br />



mmautoload: The GPFS environment cannot be initialized.<br />

mmautoload: Correct the problem and use mmstartup to start GPFS.<br />

mmautoload: The GPFS environment cannot be initialized.<br />

We now check the block status as shown by the mmcs_console (see<br />

Example 6-53).<br />

Example 6-53 Block status<br />

mmcs_console:<br />

list bglblock <br />

==> DBBlock record<br />

_blockid = RMP27Mr173143250<br />

_numpsets = 0<br />

_numbps = 0<br />

_owner = root<br />

_istorus = 000<br />

_sizex = 0<br />

_sizey = 0<br />

_sizez = 0<br />

_description = LoadLeveler Partition<br />

_mode = C<br />

_options =<br />

_status = B<br />

_statuslastmodified = 2006-03-27 17:31:54.131411<br />

_mloaderimg =<br />

/bgl/BlueLight/ppcfloor/bglsys/bin/mmcs-mloader.rts<br />

_blrtsimg =<br />

/bgl/BlueLight/ppcfloor/bglsys/bin/rts_hw.rts<br />

_linuximg =<br />

/bgl/BlueLight/ppcfloor/bglsys/bin/zImage.elf<br />

_ramdiskimg =<br />

/bgl/BlueLight/ppcfloor/bglsys/bin/ramdisk.elf<br />

_debuggerimg = none<br />

_debuggerparmsize = 0<br />

_createdate = 2006-03-27 17:31:43.673653<br />



We can see that the partition is still in B (booting) state. Next, we check the<br />

LoadLeveler cluster and queue (job) status as shown in Example 6-54.<br />

Example 6-54 LoadLeveler cluster and job status<br />

test1@bglfen1:/bglscratch/test1/applications/IOR> llstatus<br />

Name Schedd InQ Act Startd Run LdAvg Idle Arch OpSys<br />

bglfen1.itso.ibm.com Avail 1 1 Run 1 0.00 30 PPC64 Linux2<br />

bglfen2.itso.ibm.com Avail 0 0 Idle 0 0.00 9999 PPC64 Linux2<br />

PPC64/Linux2 2 machines 1 jobs 1 running<br />

Total Machines 2 machines 1 jobs 1 running<br />

The Central Manager is defined on bglsn.itso.ibm.com<br />

The BACKFILL scheduler with Blue Gene support is in use<br />

Blue Gene is present<br />

All machines on the machine_list are present.<br />

test1@bglfen1:/bglscratch/test1/applications/IOR> llq<br />

Id Owner Submitted ST PRI Class Running On<br />

------------------------ ---------- ----------- -- --- ------------ ----------bglfen1.35.0<br />

test1 3/27 16:15 R 50 small bglfen1<br />

1 job step(s) in queue, 0 waiting, 0 pending, 1 running, 0 held, 0 preempted<br />

Next, we test the file system status on the SN (see Example 6-55).<br />

Example 6-55 SN file system status<br />

test1@bglfen1:/bglscratch/test1/applications/IOR> df<br />

Filesystem 1K-blocks Used Available Use% Mounted on<br />

/dev/sdc3 70614928 15459236 55155692 22% /<br />

tmpfs 3955528 8 3955520 1% /dev/shm<br />

p630n03:/nfs_mnt 104857600 11339904 93517696 11% /nfs_mnt<br />

p630n03:/bglscratch 36700160 64384 36635776 1% /bglscratch<br />

bglsn_fn:/bgl 9766560 9766560 0 100% /bgl<br />

p630n03:/bglhome 2621440 98464 2522976 4% /bglhome<br />



As we can see in Example 6-54, the job with the ID bglfen1.35.0 appears to be<br />

in Running state, so we look for more details (Example 6-56).<br />

Example 6-56 Details on LoadLeveler job bglfen1.35.0<br />

test1@bglfen1:/bglscratch/test1/applications/IOR> llq -s bglfen1.35.0<br />

=============== Job Step bglfen1.itso.ibm.com.35.0 ===============<br />

Job Step Id: bglfen1.itso.ibm.com.35.0<br />

Job Name: bglfen1.itso.ibm.com.35<br />

Step Name: 0<br />

Structure Version: 10<br />

Owner: test1<br />

Queue Date: Mon 27 Mar 2006 04:15:20 PM EST<br />

Status: Running<br />

Reservation ID:<br />

Requested Res. ID:<br />

Scheduling Cluster:<br />

Submitting Cluster:<br />

Sending Cluster:<br />

Requested Cluster:<br />

Schedd History:<br />

Outbound Schedds:<br />

Submitting User:<br />

Execution Factor: 1<br />

Dispatch Time: Mon 27 Mar 2006 05:31:44 PM EST<br />

Completion Date:<br />

Completion Code:<br />

Favored Job: No<br />

User Priority: 50<br />

user_sysprio: 0<br />

class_sysprio: 0<br />

group_sysprio: 0<br />

System Priority: -275512<br />

q_sysprio: -275512<br />

Previous q_sysprio: 0<br />

Notifications: Complete<br />

Virtual Image Size: 472 kb<br />

Large Page: N<br />

Checkpointable: no<br />

Ckpt Start Time:<br />

Good Ckpt Time/Date:<br />

Ckpt Elapse Time: 0 seconds<br />

Fail Ckpt Time/Date:<br />

Ckpt Accum Time: 0 seconds<br />

Checkpoint File:<br />



Ckpt Execute Dir:<br />

Restart From Ckpt: no<br />

Restart Same Nodes: no<br />

Restart: yes<br />

Preemptable: no<br />

Preempt Wait Count: 0<br />

Hold Job Until:<br />

RSet: RSET_NONE<br />

Mcm Affinity Options:<br />

Cmd: /usr/bin/mpirun<br />

Args: -exe /bglscratch/test1/applications/IOR/IOR.rts<br />

-args "-f /bglscratch/test1/applications/IOR/ior-inputs"<br />

Env:<br />

In: /dev/null<br />

Out: /bglscratch/test1/ior-gpfs.out<br />

Err: /bglscratch/test1/ior-gpfs.err<br />

Initial Working Dir: /bglscratch/test1/applications/IOR<br />

Dependency:<br />

Resources:<br />

Requirements: (Arch == "PPC64") && (OpSys == "Linux2")<br />

Preferences:<br />

Step Type: Blue Gene<br />

Size Requested: 128<br />

Size Allocated: 128<br />

Shape Requested:<br />

Shape Allocated:<br />

Wiring Requested: MESH<br />

Wiring Allocated: MESH<br />

Rotate: True<br />

Blue Gene Status:<br />

Blue Gene Job Id:<br />

Partition Requested:<br />

Partition Allocated: RMP27Mr173143250<br />

Error Text:<br />

Node Usage: shared<br />

Submitting Host: bglfen1.itso.ibm.com<br />

Schedd Host: bglfen1.itso.ibm.com<br />

Job Queue Key:<br />

Notify User: test1@bglfen1.itso.ibm.com<br />

Shell: /bin/bash<br />

LoadLeveler Group: No_Group<br />

Class: small<br />

Ckpt Hard Limit: undefined<br />

Ckpt Soft Limit: undefined<br />

Cpu Hard Limit: undefined<br />



Cpu Soft Limit: undefined<br />

Data Hard Limit: undefined<br />

Data Soft Limit: undefined<br />

Core Hard Limit: undefined<br />

Core Soft Limit: undefined<br />

File Hard Limit: undefined<br />

File Soft Limit: undefined<br />

Stack Hard Limit: undefined<br />

Stack Soft Limit: undefined<br />

Rss Hard Limit: undefined<br />

Rss Soft Limit: undefined<br />

Step Cpu Hard Limit: undefined<br />

Step Cpu Soft Limit: undefined<br />

Wall Clk Hard Limit: 00:30:00 (1800 seconds)<br />

Wall Clk Soft Limit: 00:30:00 (1800 seconds)<br />

Comment:<br />

Account:<br />

Unix Group: itso<br />

NQS Submit Queue:<br />

NQS Query Queues:<br />

Negotiator Messages:<br />

Bulk Transfer: No<br />

Step Adapter Memory: 0 bytes<br />

Adapter Requirement:<br />

Step Cpus: 0<br />

Step Virtual Memory: 0.000 mb<br />

Step Real Memory: 0.000 mb<br />

==================== EVALUATIONS FOR JOB STEP bglfen1.itso.ibm.com.35.0<br />

====================<br />

The status of job step is : Running<br />

Since job step status is not Idle, Not Queued, or Deferred, no attempt<br />

has been made to determine why this job step has not been started.<br />

Important: Because job step status is not Idle, Queued, or Deferred, no<br />

attempt has been made to determine why this job step has not been started.<br />

However, as the job stays in Running state for quite some time, we decide to<br />

investigate further. We know from the LoadLeveler job command file that a<br />

GPFS file system is used for storing the files related to running this job. Thus, we<br />

decide to investigate GPFS on the I/O nodes.<br />



We realize that we can find neither the I/O node boot messages nor the GPFS log<br />

messages (which are written to the /bgl file system, NFS exported from the SN),<br />

and we realize that the /bgl file system is full (100%). We identify and remove<br />

/bgl/largefile, and the block continues to boot. After the block booted, we logged<br />

in to one of the I/O nodes (using ssh) and got the messages shown in Example 6-57.<br />

Example 6-57 Checking the GPFS file system<br />

$ df /bubu<br />

Mar 27 17:47:39 (I) [1096078560] {119}.0: [ciod:initialized]<br />

Mar 27 17:47:39 (I) [1096078560] {102}.0: [ciod:initialized]<br />

Mar 27 17:47:39 (I) [1096078560] {17}.0: R00-M0-N0-I:J18-U11<br />

/var/etc/rc.d/rc3.d/S40gpfs: GPFS did not come up on I/O nod<br />

Mar 27 17:47:39 (I) [1096078560] {0}.0: R00-M0-N0-I:J18-U01<br />

/var/etc/rc.d/rc3.d/S40gpfs: GPFS did not come up on I/O nod<br />

Mar 27 17:47:39 (I) [1096078560] {0}.0: [ciod:initialized]<br />

Mar 27 17:47:39 (I) [1096078560] {0}.0: e ionode1 : 172.30.2.1<br />

Starting syslog services<br />

Starting ciod<br />

Starting XL Compiler Environment for I/O node<br />

ciod: version "Jan 10 2006 16:25:12"<br />

ciod: running in virtual node mode with 32 processors<br />

BusyBox v1.00-rc2 (2006.01.09-19:48+0000) Built-in shell (ash)<br />

Enter 'help' for a list of built-in commands.<br />

/bin/sh: can't access tty; job control turned off<br />

$ df /bubu<br />

df: `/bubu': No such file or directory<br />

$ mount<br />

rootfs on / type rootfs (rw)<br />

/dev/root on / type ext2 (rw)<br />

none on /proc type proc (rw)<br />

172.30.1.1:/bgl on /bgl type nfs (rw,v3,rsize=8192,wsize=8192,<br />

Mar 27 17:47:39 (I) [1096078560] {85}.0: [ciod:initialized]<br />

Mar 27 17:47:39 (I) [1096078560] {68}.0: rootfs on / type rootfs (rw)<br />



Mar 27 17:47:39 (I) [1096078560] {51}.0: [ciod:initialized]<br />

Mar 27 17:47:39 (I) [1096078560] {51}.0: df:<br />

Mar 27 17:47:39 (I) [1096078560] {34}.0: [ciod:initialized]<br />

Mar 27 17:47:39 (I) [1096078560] {17}.0: [ciod:initialized]<br />

Mar 27 17:47:39 (I) [1096078560] {0}.0: /var/etc/rc.d/rc3.d/S40gpfs:<br />

GPFS did not come up on I/O node ionode1 : 172.30.2.1<br />

Mar 27 17:47:39 (I) [1096078560] {0}.0:<br />

hard,udp,nolock,addr=172.30.1.1)<br />

172.30.1.33:/bglscratch on /bglscratch type nfs<br />

(rw,v3,rsize=32768,wsize=32768,hard,tcp,lock,addr=172.30.1.33)<br />

$ mount<br />

rootfs on / type rootfs (rw)<br />

/dev/root on / type ext2 (rw)<br />

none on /proc type proc (rw)<br />

172.30.1.1:/bgl on /bgl type nfs<br />

(rw,v3,rsize=8192,wsize=8192,hard,udp,nolock,addr=172.30.1.1)<br />

172.30.1.33:/bglscratch on /bglscratch type nfs<br />

(rw,v3,rsize=32768,wsize=32768,hard,tcp,lock,addr=172.30.1.33)<br />

$<br />

Mar 27 17:47:39 (I) [1096078560] {85}.0: df:<br />

Mar 27 17:47:39 (I) [1096078560] {85}.0: `/bubu': No such file or<br />

directory<br />

Because the GPFS file system (/bubu) cannot be found, we check the 10-minute<br />

GPFS start timeout and see that it has expired. So, we have to reboot the<br />

block for GPFS to be able to start and allow the job to run.<br />

Lessons learned<br />

We learned the following lessons:<br />

► If the /bgl file system becomes full, then GPFS has a problem starting and we<br />

see the following message in the mmcs_db_server log (Example 6-52):<br />

The GPFS environment cannot be initialized.<br />

► After the 10-minute timeout period has expired and after fixing the problem<br />

(making some space available in /bgl), we need to reboot the block.<br />



Note: Even though we can manually correct the problem and bring up GPFS,<br />

it is better to reboot the block to allow GPFS to cleanly start and mount the file<br />

system.<br />
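
As a hedged sketch, because the block in this scenario was created by LoadLeveler, rebooting it amounts to freeing the block from mmcs_db_console and resubmitting the job (the block name is the one shown in Example 6-53):<br />

mmcs$ free RMP27Mr173143250<br />

test1@bglfen1:/bglscratch/test1> llsubmit ior-gpfs.cmd<br />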

6.3.6 Installing new Blue Gene/L driver code (driver update)<br />

This scenario documents the process of upgrading the Blue Gene/L code. It is<br />

designed to show what happens if the GPFS code is not also updated at the<br />

same time.<br />

Error injection<br />

The error injected is the update of the Blue Gene/L code itself. This section is<br />

rather long because we present all the checks that we went through before<br />

updating the code, as well as the actions that are required after the update RPMs are<br />

installed.<br />

First we check the code levels:<br />

1. Check the correct code levels and the process that we intend to use.<br />

Example 6-58 shows the Blue Gene/L driver levels.<br />

Example 6-58 Current Blue Gene/L code levels<br />

bglsn:~ # rpm -qa | grep bgl<br />

libglade-0.17-230.1<br />

bglmpi-2006.1.2-1<br />

bglbaremetal-2006.1.2-1<br />

bglos-4.1-0<br />

bglcmcs-2006.1.2-1<br />

libglade2-2.0.1-501.3<br />

bglblrtstool-2006.1.2-1<br />

bglcnk-2006.1.2-1<br />

bgldiag-2006.1.2-1<br />

bglmcp-2006.1.2-1<br />

Example 6-59 shows the code levels for the GPFS packages installed on the SN.<br />

Example 6-59 GPFS for Blue Gene/L<br />

bglsn:~ # rpm -qa | grep gpfs<br />

gpfs.docs-2.3.0-10<br />

gpfs.gpl-2.3.0-10<br />

gpfs.msg.en_US-2.3.0-10<br />

gpfs.base-2.3.0-10<br />



Example 6-60 lists the installed code levels for the GPFS packages for the I/O<br />

nodes.<br />

Note: You do not need to compile the GPL portability layer for the MCP. The<br />

gpfs.gplbin package contains the compiled modules for the portability code.<br />

Example 6-60 GPFS for PPC440 (I/O nodes processor) levels<br />

bglsn:/bgl/BlueLight # rpm --root \<br />

/bgl/BlueLight/V1R2M1_020_2006-060110/ppc/bglsys/bin/bglOS -qa | grep gpfs<br />

gpfs.base-2.3.0-10<br />

gpfs.gpl-2.3.0-10<br />

gpfs.msg.en_US-2.3.0-10<br />

gpfs.docs-2.3.0-10<br />

gpfs.gplbin-2.3.0-11<br />

2. Check for the currently installed version of the Blue Gene/L driver code, as<br />

shown in Example 6-61.<br />

Example 6-61 Current Blue Gene/L driver level<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./mmcs_db_console<br />

connecting to mmcs_server<br />

connected to mmcs_server<br />

connected to DB2<br />

mmcs$ version<br />

OK<br />

mmcs_db_server $Name: V1R2M1_020_2006 $ Jan 10 2006 16:23:15<br />

mmcs$<br />

3. Ensure that we have all the required update packages downloaded onto the<br />

SN. Again, we need two sets of packages for GPFS: one for the SN (PPC64,<br />

64-bit) and one for the I/O nodes’ MCP (PPC440, 32-bit), as shown in Example 6-62.<br />

Example 6-62 GPFS for PPC440 (MCP - I/O nodes OS)<br />

bglsn:/mnt/LPP/GPFS # cd PPC64_PTF11<br />

bglsn:/mnt/LPP/GPFS/PPC64_PTF11 # ls -lrt<br />

total 4882<br />

-rw-r--r-- 1 root root 4161526 Mar 31 19:21<br />

gpfs.base-2.3.0-11.ppc64.rpm<br />

-rw-r--r-- 1 root root 113285 Mar 31 19:21<br />

gpfs.docs-2.3.0-11.noarch.rpm<br />

-rw-r--r-- 1 root root 331349 Mar 31 19:21<br />

gpfs.gpl-2.3.0-11.noarch.rpm<br />



-rw-r--r-- 1 root root 58882 Mar 31 19:21<br />

gpfs.msg.en_US-2.3.0-11.noarch.rpm<br />

bglsn:/mnt/LPP/GPFS/PPC64_PTF11 # cd ../MCP_PTF11<br />

bglsn:/mnt/LPP/GPFS/MCP_PTF11 # ls -lrt<br />

total 5196<br />

-rw-r--r-- 1 root root 4161526 Mar 31 19:21 gpfs.base-2.3.0-11.ppc.rpm<br />

-rw-r--r-- 1 root root 113285 Mar 31 19:21<br />

gpfs.docs-2.3.0-11.noarch.rpm<br />

-rw-r--r-- 1 root root 331349 Mar 31 19:21<br />

gpfs.gpl-2.3.0-11.noarch.rpm<br />

-rw-r--r-- 1 root root 644241 Mar 31 19:21<br />

gpfs.gplbin-2.3.0-11.ppc.rpm<br />

-rw-r--r-- 1 root root 58882 Mar 31 19:21<br />

gpfs.msg.en_US-2.3.0-11.noarch.rpm<br />

We now review the install process from the readme file that we downloaded from<br />

the Blue Gene/L Web site along with the new driver. Here we present a list of the<br />

actions that are relevant for our configuration:<br />

1. Download RPMs into appropriate directories.<br />

Note: You can use a naming convention of your choice for the download<br />

directories. You just need to make sure you install the right code into the<br />

right directories.<br />

2. Install RPMs and rebuild BLRTS tool chain.<br />

3. If you have any customized scripts that mount your file systems stored in<br />

/bgl/BlueLight/ppcfloor/dist/etc/rc.d/rc3.d directory, you have to re-create<br />

them with each driver upgrade.<br />

4. Stop the control system jobs running on the SN.<br />

5. Update the following symbolic link:<br />

rm /bgl/BlueLight/ppcfloor<br />

ln -s /bgl/BlueLight/V1R2M1_020_2006-060110/ppc<br />

/bgl/BlueLight/ppcfloor<br />

6. Determine the home directory of user bgdb2cli in the file<br />

/bgl/BlueLight/ppcfloor/bglsys/bin/db2profile:<br />

echo ~bgdb2cli<br />

Change "INSTHOME=/u/bgdb2cli" to "INSTHOME=X", where X = the result<br />

from the echo ~bgdb2cli command.<br />



In the file /bgl/BlueLight/ppcfloor/bglsys/bin/db2cshrc change "setenv<br />

INSTHOME /u/bgdb2cli" to "setenv INSTHOME X", where X = the result from the<br />

echo ~bgdb2cli command.<br />

Check the new settings:<br />

bglsn:~ # echo ~bgdb2cli<br />

/dbhome/bgdb2cli<br />

bglsn:~ # grep INSTHOME=<br />

/bgl/BlueLight/ppcfloor/bglsys/bin/db2profile<br />

INSTHOME=/dbhome/bgdb2cli<br />

7. Rebind the new jar file.<br />

8. Update discovery directory.<br />

Everyone should run these two commands:<br />

cp /bgl/BlueLight/&lt;driver directory&gt;/ppc/bglsys/discovery/runPopIpPool /discovery
cp /bgl/BlueLight/&lt;driver directory&gt;/ppc/bglsys/discovery/ServiceNetwork.cfg /discovery

9. Update the database schema. If you are upgrading from V1R1M1, run:<br />

/bgl/BlueLight/ppcfloor/bglsys/schema/updateSchema.pl<br />

--dbproperties XX --driver V1R2M1<br />

where XX = your db.properties file (for example,<br />

/bgl/BlueLight/ppcfloor/bglsys/bin/db.properties).<br />

10.The Blue Gene/L upgrade is now complete. When ready, start the control<br />

system to resume any jobs that you stopped in Step 4.<br />

Having reviewed these steps, we first verify that the system can still run a job using LoadLeveler before starting the upgrade:
test1@bglfen1:/bglscratch/test1> llsubmit ior-gpfs.cmd
The output file shows that the job ran OK.



Step 1<br />

Now we install the new Blue Gene/L RPMs as shown in Example 6-63.<br />

Example 6-63 Installing new Blue Gene/L driver RPMs<br />

bglsn:/mnt/BGL/BGL_V1R2M2 # rpm -ivh bgl*.rpm<br />

Preparing... ########################################### [100%]<br />

1:bglos ########################################### [ 11%]<br />

2:bglblrtstool ########################################### [ 22%]<br />

=====================================================================================<br />

=<br />

=== RPM has installed but automatic building of the blrts toolchain not successful<br />

===<br />

=== Unable to attempt blrts toolchain build, /bgl/downloads not found<br />

===<br />

=== Follow the manual instructions for building the toolchain<br />

===<br />

=====================================================================================<br />

=<br />

error: %post(bglblrtstool-2006.1.2-2) scriptlet failed, exit status 1<br />

3:bgliontool ########################################### [ 33%]<br />

===================================================================================<br />

=== RPM has installed but automatic building of the IO toolchain not successful ===<br />

=== Unable to attempt blrts toolchain build, /bgl/downloads not found ===<br />

=== Follow the manual instructions to build the IO node toolchain ===<br />

===================================================================================<br />

error: %post(bgliontool-2006.1.2-2) scriptlet failed, exit status 1<br />

4:bglmpi ########################################### [ 44%]<br />

5:bglbaremetal ########################################### [ 56%]<br />

6:bglcmcs ########################################### [ 67%]<br />

7:bglcnk ########################################### [ 78%]<br />

8:bgldiag ########################################### [ 89%]<br />

9:bglmcp ########################################### [100%]<br />

bglsn:/mnt/BGL/BGL_V1R2M2 #<br />

The output from the rpm install command shows that we now need to rebuild the<br />

BLRTS toolchain.<br />

Step 2<br />

To rebuild the BLRTS toolchain, we first have to edit the retrieveToolChains.sh script located in the /bgl/BlueLight/V1R2M2_120_2006-060328/ppc/toolchain/ directory, because its original curl -O commands would not work on our system due to our firewall configuration, which only allows Web traffic (HTTP protocol).



We got around this by copying the file to the /bgl/downloads directory and substituting wget commands for the curl -O commands. Example 6-64 shows the modified script.

Example 6-64 Modified retrieveToolChains.sh script<br />

bglsn:/bgl/downloads # cat retrieveToolChains.sh<br />

#! /bin/bash<br />

###################################################################<br />

# Product(s): */<br />

# 5733-BG1 */<br />

# */<br />

# (C) Copyright IBM Corp. 2004, 2004 */<br />

# All rights reserved. */<br />

# US Government Users Restricted Rights - */<br />

# Use, duplication or disclosure restricted */<br />

# by GSA ADP Schedule Contract with IBM Corp. */<br />

# */<br />

# Licensed Materials-Property of IBM */<br />

# */<br />

##################################################################<br />

# This script is to help facilitate the retrieval of the GNU<br />

# components necessary to build toolchains<br />

#<br />

# The script utilizes curl to go to the ftp.gnu.org site and<br />

# ftp the appropriate tarballs to your system. These<br />

# files will be installed in the CWD.<br />

#<br />

# To utilize the script, you will need to insure curl is<br />

# installed on your sytem and your system has ftp access to<br />

# outside sites.<br />

#<br />

# Once you run this script to download the appropriate<br />

# packages, you can install the BlueGene/L patches and<br />

# commence on building the Toolchain.<br />

##############################################################<br />

# Grab all of the Toolchain components necessary to build both<br />

# the blrts and the linux toolchains for BlueGene/L<br />

wget ftp://ftp.gnu.org/gnu/binutils/binutils-2.13.tar.gz<br />

wget ftp://ftp.gnu.org/gnu/gcc/gcc-3.2/gcc-3.2.tar.gz<br />

wget ftp://ftp.gnu.org/gnu/gdb/gdb-5.3.tar.gz<br />

wget ftp://ftp.gnu.org/gnu/glibc/glibc-2.2.5.tar.gz<br />

wget ftp://ftp.gnu.org/gnu/glibc/glibc-linuxthreads-2.2.5.tar.gz<br />
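If you prefer not to edit the script by hand, a substitution along the following lines achieves the same result. This is only a sketch; it assumes GNU sed and that the copied script still contains the original curl -O commands at the start of their lines:

   cd /bgl/downloads
   sed -i 's/^curl -O /wget /' retrieveToolChains.sh   # replace curl -O with wget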



We used the updated script to retrieve the toolchain, as shown in Example 6-65.<br />

Example 6-65 Retrieving the BLRTS tool chain<br />

bglsn:/bgl/downloads # ./retrieveToolChains.sh<br />

--12:02:45-- ftp://ftp.gnu.org/gnu/binutils/binutils-2.13.tar.gz<br />

=> `binutils-2.13.tar.gz'<br />

Resolving ftp.gnu.org... 199.232.41.7<br />

Connecting to ftp.gnu.org[199.232.41.7]:21... connected.<br />

Logging in as anonymous ... Logged in!<br />

==> SYST ... done. ==> PWD ... done.<br />

==> TYPE I ... done. ==> CWD /gnu/binutils ... done.<br />

==> PASV ... done. ==> RETR binutils-2.13.tar.gz ... done.<br />

Length: 12,790,277 (unauthoritative)<br />

100%[==================================================================<br />

====>] 12,790,277 1.67M/s ETA 00:00<br />

12:02:52 (1.67 MB/s) - `binutils-2.13.tar.gz' saved [12790277]<br />

--12:02:52-- ftp://ftp.gnu.org/gnu/gcc/gcc-3.2/gcc-3.2.tar.gz<br />

=> `gcc-3.2.tar.gz'<br />

Resolving ftp.gnu.org... 199.232.41.7<br />

Connecting to ftp.gnu.org[199.232.41.7]:21... connected.<br />

Logging in as anonymous ... Logged in!<br />

==> SYST ... done. ==> PWD ... done.<br />

==> TYPE I ... done. ==> CWD /gnu/gcc/gcc-3.2 ... done.<br />

==> PASV ... done. ==> RETR gcc-3.2.tar.gz ... done.<br />

Length: 26,963,731 (unauthoritative)<br />

100%[==================================================================<br />

====>] 26,963,731 1.73M/s ETA 00:00<br />

12:03:08 (1.69 MB/s) - `gcc-3.2.tar.gz' saved [26963731]<br />

--12:03:08-- ftp://ftp.gnu.org/gnu/gdb/gdb-5.3.tar.gz<br />

=> `gdb-5.3.tar.gz'<br />

Resolving ftp.gnu.org... 199.232.41.7<br />

Connecting to ftp.gnu.org[199.232.41.7]:21... connected.<br />

Logging in as anonymous ... Logged in!<br />

==> SYST ... done. ==> PWD ... done.<br />

==> TYPE I ... done. ==> CWD /gnu/gdb ... done.<br />

==> PASV ... done. ==> RETR gdb-5.3.tar.gz ... done.<br />

Length: 14,707,600 (unauthoritative)<br />



100%[==================================================================<br />

====>] 14,707,600 1.72M/s ETA 00:00<br />

12:03:16 (1.67 MB/s) - `gdb-5.3.tar.gz' saved [14707600]<br />

--12:03:16-- ftp://ftp.gnu.org/gnu/glibc/glibc-2.2.5.tar.gz<br />

=> `glibc-2.2.5.tar.gz'<br />

Resolving ftp.gnu.org... 199.232.41.7<br />

Connecting to ftp.gnu.org[199.232.41.7]:21... connected.<br />

Logging in as anonymous ... Logged in!<br />

==> SYST ... done. ==> PWD ... done.<br />

==> TYPE I ... done. ==> CWD /gnu/glibc ... done.<br />

==> PASV ... done. ==> RETR glibc-2.2.5.tar.gz ... done.<br />

Length: 16,657,505 (unauthoritative)<br />

100%[==================================================================<br />

====>] 16,657,505 1.74M/s ETA 00:00<br />

12:03:26 (1.68 MB/s) - `glibc-2.2.5.tar.gz' saved [16657505]<br />

--12:03:26--<br />

ftp://ftp.gnu.org/gnu/glibc/glibc-linuxthreads-2.2.5.tar.gz<br />

=> `glibc-linuxthreads-2.2.5.tar.gz'<br />

Resolving ftp.gnu.org... 199.232.41.7<br />

Connecting to ftp.gnu.org[199.232.41.7]:21... connected.<br />

Logging in as anonymous ... Logged in!<br />

==> SYST ... done. ==> PWD ... done.<br />

==> TYPE I ... done. ==> CWD /gnu/glibc ... done.<br />

==> PASV ... done. ==> RETR glibc-linuxthreads-2.2.5.tar.gz ...<br />

done.<br />

Length: 226,543 (unauthoritative)<br />

100%[==================================================================<br />

====>] 226,543 1.15M/s<br />

12:03:26 (1.15 MB/s) - `glibc-linuxthreads-2.2.5.tar.gz' saved [226543]<br />



Example 6-66 shows the files that were downloaded into the /bgl/downloads<br />

directory.<br />

Example 6-66 Downloaded files for the tool chain<br />

bglsn:/bgl/downloads # ls -lrt<br />

total 69751<br />

-rwxr-xr-x 1 root root 1913 Apr 1 12:02 retrieveToolChains.sh<br />

-rw-r--r-- 1 root root 12790277 Apr 1 12:02 binutils-2.13.tar.gz<br />

-rw-r--r-- 1 root root 26963731 Apr 1 12:03 gcc-3.2.tar.gz<br />

-rw-r--r-- 1 root root 14707600 Apr 1 12:03 gdb-5.3.tar.gz<br />

-rw-r--r-- 1 root root 16657505 Apr 1 12:03 glibc-2.2.5.tar.gz<br />

-rw-r--r-- 1 root root 226543 Apr 1 12:03<br />

glibc-linuxthreads-2.2.5.tar.gz<br />

We can now proceed to rebuild the BLRTS toolchain. This process can take some time (it took us about half an hour), so in Example 6-67 we show only the last part of it. Because we could not tell from the output alone whether the process was successful, we tested the return code to verify that the tool chain had in fact been built correctly.

Example 6-67 BLRTS tool chain built<br />

bglsn:/bgl/downloads #<br />

/bgl/BlueLight/V1R2M2_120_2006-060328/ppc/toolchain/buildBlrtsToolChain<br />

.sh<br />

..... > .....<br />

make[4]: Entering directory<br />

`/bgl/downloads/gnu/build-powerpc-bgl-blrts-gnu/gdb-5.3-build/gdb/gdbse<br />

rver'<br />

n=`echo gdbserver | sed 's,x,x,'`; \<br />

if [ x$n = x ]; then n=gdbserver; else true; fi; \<br />

/bin/sh /bgl/downloads/gnu/gdb-5.3/install-sh -c gdbserver<br />

/bgl/BlueLight/V1R2M2_120_2006-060328/ppc/blrts-gnu/bin/$n; \<br />

/bin/sh /bgl/downloads/gnu/gdb-5.3/install-sh -c -m 644<br />

/bgl/downloads/gnu/gdb-5.3/gdb/gdbserver/gdbserver.1<br />

/bgl/BlueLight/V1R2M2_120_2006-060328/ppc/blrts-gnu/man/man1/$n.1<br />

make[4]: Leaving directory<br />

`/bgl/downloads/gnu/build-powerpc-bgl-blrts-gnu/gdb-5.3-build/gdb/gdbse<br />

rver'<br />

make[3]: Leaving directory<br />

`/bgl/downloads/gnu/build-powerpc-bgl-blrts-gnu/gdb-5.3-build/gdb'<br />

make[2]: Leaving directory<br />

`/bgl/downloads/gnu/build-powerpc-bgl-blrts-gnu/gdb-5.3-build/gdb'<br />



make[1]: Leaving directory<br />

`/bgl/downloads/gnu/build-powerpc-bgl-blrts-gnu/gdb-5.3-build'<br />

bglsn:/bgl/downloads # echo $?<br />

0<br />

Step 3<br />

We saved our customized scripts (containing our local NFS and GPFS configuration) so that we can re-create them after the upgrade, as sketched below.
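A minimal sketch of this backup step follows. The backup directory name is our own choice, and we assume that the only site-customized script is our sitefs file in the rc3.d directory mentioned in the readme; adapt the file names to your site:

   mkdir -p /bgl/local-rc3.d-backup
   cp -p /bgl/BlueLight/ppcfloor/dist/etc/rc.d/rc3.d/sitefs /bgl/local-rc3.d-backup/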

Step 4<br />

Because the tool chain build was successful, we stop the control system<br />

(Example 6-68).<br />

Example 6-68 Stopping the bglmaster<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster status<br />

idoproxy started [11220]<br />

ciodb started [11221]<br />

mmcs_server started [11222]<br />

monitor stopped<br />

perfmon stopped<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster stop<br />

Stopping BGLMaster<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster status<br />

idoproxy started [11220]<br />

ciodb started [11221]<br />

mmcs_server started [11222]<br />

monitor stopped<br />

perfmon stopped<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster status<br />

BGLMaster is not started<br />

We also check the LoadLeveler status and stop it, as shown in Example 6-69.

Example 6-69 Stopping LoadLeveler<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # llstatus<br />

Name Schedd InQ Act Startd Run LdAvg Idle Arch OpSys<br />

bglfen1.itso.ibm.com Avail 0 0 Idle 0 0.00 4318 PPC64 Linux2<br />

bglfen2.itso.ibm.com Avail 0 0 Idle 0 0.00 9999 PPC64 Linux2<br />

PPC64/Linux2 2 machines 0 jobs 0 running<br />

Total Machines 2 machines 0 jobs 0 running<br />

The Central Manager is defined on bglsn.itso.ibm.com<br />



The BACKFILL scheduler with Blue Gene support is in use<br />

Blue Gene is present<br />

All machines on the machine_list are present.<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # llctl -g stop<br />

llctl: Sent stop command to host bglsn.itso.ibm.com<br />

llctl: Sent stop command to host bglfen1.itso.ibm.com<br />

llctl: Sent stop command to host bglfen2.itso.ibm.com<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # llstatus<br />

04/01 12:49:39 llstatus: 2539-463 Cannot connect to bglsn.itso.ibm.com<br />

"LoadL_negotiator" on port 9614. errno = 111<br />

04/01 12:49:39 llstatus: 2539-463 Cannot connect to bglsn.itso.ibm.com<br />

"LoadL_negotiator" on port 9614. errno = 111<br />

llstatus: 2512-301 An error occurred while receiving data from the LoadL_negotiator<br />

daemon on host bglsn.itso.ibm.com.<br />

Step 5<br />

We now update the ppcfloor symbolic link:<br />

bglsn:/ # rm /bgl/BlueLight/ppcfloor<br />

bglsn:/ # ln -s /bgl/BlueLight/V1R2M2_120_2006-060328/ppc<br />

/bgl/BlueLight/ppcfloor<br />

Step 6<br />

Next, we check and update db2profile and db2cshrc files, as shown in<br />

Example 6-70.<br />

Example 6-70 Updating the DB2 environment files<br />

bglsn:/ # vi /bgl/BlueLight/ppcfloor/bglsys/bin/db2profile<br />

# Also in this file ..<br />

bglsn:/ # vi /bgl/BlueLight/ppcfloor/bglsys/bin/db2cshrc<br />

# Check the original password in the old driver db.properties<br />

database_name=bgdb0<br />

database_user=bglsysdb<br />

database_password=bglsysdb<br />

# database_password=db24bgls<br />

database_schema_name=bglsysdb<br />

system=BGL<br />

min_pool_connections=1<br />

# Web Console Configuration<br />

mmcs_db_server_ip=127.0.0.1<br />



mmcs_db_server_port=32031<br />

mmcs_max_reply_size=8029<br />

mmcs_max_history_size=2097152<br />

mmcs_redirect_server_ip=default<br />

mmcs_redirect_server_port=32032<br />
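As an alternative to editing both files with vi, the INSTHOME update can also be scripted. This is only a sketch; it assumes GNU sed (as shipped with SLES) and uses the home directory reported by echo ~bgdb2cli:

   NEWHOME=$(echo ~bgdb2cli)
   sed -i "s|^INSTHOME=.*|INSTHOME=${NEWHOME}|" /bgl/BlueLight/ppcfloor/bglsys/bin/db2profile
   sed -i "s|^setenv INSTHOME .*|setenv INSTHOME ${NEWHOME}|" /bgl/BlueLight/ppcfloor/bglsys/bin/db2cshrc
   grep INSTHOME /bgl/BlueLight/ppcfloor/bglsys/bin/db2profile   # verify the new setting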

Now we update the DB2 user password in the db.properties file that is located in<br />

the /bgl/BlueLight/ppcfloor/bglsys/bin/ directory (use vi) and use this updated<br />

password to rebuild the jar file.<br />

Step 7<br />

We rebind the new jar file.<br />

bglsn:/ # java -cp /bgl/BlueLight/ppcfloor/bglsys/bin/ido.jar<br />

com.ibm.db2.jcc.DB2Binder -url jdbc:db2://127.0.0.1:50001/bgdb0<br />

-user bglsysdb -password bglsysdb -size 200<br />

Step 8<br />

We update the /discovery directory saving the original configuration files, as<br />

shown in Example 6-71.<br />

Example 6-71 Updating the /discovery directory<br />

bglsn:/ # cp /bgl/BlueLight/V1R2M1_020_2006-060110/ppc/bglsys/discovery/runPopIpPool<br />

/discovery<br />

bglsn:/ # cp<br />

/bgl/BlueLight/V1R2M1_020_2006-060110/ppc/bglsys/discovery/ServiceNetwork.cfg<br />

/discovery<br />

Step 9<br />

We also update the database schema, as shown in Example 6-72.<br />

Example 6-72 Updating the Blue Gene/L database schema<br />

bglsn:/ # /bgl/BlueLight/ppcfloor/bglsys/schema/updateSchema.pl<br />

--dbproperties /bgl/BlueLight/ppcfloor/bglsys/bin/db.properties<br />

--driver V1R2M1<br />

BlueGene/L database schema update utility, version V1R2M1.<br />

database bgdb0<br />

schema bglsysdb<br />

username bglsysdb<br />

previous driver 020-6<br />

target driver 080-6<br />

verbose level: 0<br />



Log will be written to<br />

/bgl/BlueLight/logs/BGL/updateSchema-2006-04-01-13:18:29.log.<br />

Connected to database bgdb0 as user bglsysdb<br />

Updating to driver 080-6 schema.<br />

Finished updating database schema to driver 080-6.<br />

We have now completed the upgrade process (which constitutes the error<br />

injection for this scenario).<br />

<strong>Problem</strong> determination<br />

To determine the problem, we followed these steps:<br />

1. We start the control system as shown in Example 6-73.<br />

Example 6-73 Starting the bglmaster process<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster start<br />

Starting BGLMaster<br />

Logging to<br />

/bgl/BlueLight/logs/BGL/bglsn-bglmaster-2006-0401-13:25:44.log<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster status<br />

idoproxy started [26694]<br />

ciodb started [26695]<br />

mmcs_server started [26696]<br />

monitor stopped<br />

perfmon stopped<br />

2. We start LoadLeveler (see Example 6-74).<br />

Example 6-74 Starting LoadLeveler<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # llctl -g start<br />

llctl: Attempting to start LoadLeveler on host bglsn.itso.ibm.com<br />

LoadL_master 3.3.2.0 rmer0611a 2006/03/15 SLES 9.0 130<br />

CentralManager = bglsn.itso.ibm.com<br />

llctl: Attempting to start LoadLeveler on host bglfen1.itso.ibm.com<br />

LoadL_master 3.3.2.0 rmer0611a 2006/03/15 SLES 9.0 130<br />

CentralManager = bglsn.itso.ibm.com<br />

llctl: Attempting to start LoadLeveler on host bglfen2.itso.ibm.com<br />

LoadL_master 3.3.2.0 rmer0611a 2006/03/15 SLES 9.0 130<br />

CentralManager = bglsn.itso.ibm.com<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # llstatus<br />

Name Schedd InQ Act Startd Run LdAvg Idle Arch OpSys<br />

bglfen1.itso.ibm.com Avail 0 0 Idle 0 0.00 6608 PPC64 Linux2<br />



glfen2.itso.ibm.com Avail 0 0 Idle 0 0.00 9999 PPC64 Linux2<br />

PPC64/Linux2 2 machines 0 jobs 0 running<br />

Total Machines 2 machines 0 jobs 0 running<br />

The Central Manager is defined on bglsn.itso.ibm.com<br />

The BACKFILL scheduler with Blue Gene support is in use<br />

Blue Gene is present<br />

All machines on the machine_list are present.<br />

3. We check that the upgrade was successful (Example 6-75).<br />

Example 6-75 Checking the driver version<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./mmcs_db_console<br />

connecting to mmcs_server<br />

connected to mmcs_server<br />

connected to DB2<br />

mmcs$ version<br />

OK<br />

mmcs_db_server $Name: V1R2M2_120_2006 $ Mar 28 2006 17:54:10<br />

mmcs$<br />

4. We now try and boot a block:<br />

OK<br />

mmcs_db_server $Name: V1R2M2_120_2006 $ Mar 28 2006 17:54:10<br />

mmcs$ allocate R000_128<br />

This command apparently hung for 10 minutes: the sitefs file still has the entry to start GPFS, and the ~ppcfloor/dist/etc/rc.d/init.d/gpfs script waits through its 10 minute timeout because GPFS cannot start.

Example 6-76 shows the output from an I/O node log before the 10 minutes<br />

GPFS load timeout.<br />

Example 6-76 I/O node log messages showing GPFS cannot start<br />

Apr 01 13:29:00 (I) [1088451808] root:R000_128 {51}.0: Starting GPFS<br />

Apr 01 13:29:00 (I) [1088451808] root:R000_128 {51}.0:<br />

/var/etc/rc.d/rc3.d/S40gpfs: 2: /usr/lpp/mmfs/bin/mmautoload: not found<br />

Apr 01 13:29:01 (I) [1088451808] root:R000_128 {51}.0: Disabling<br />

protocol version 1. Could not load host key<br />

Apr 01 13:29:01 (I) [1088451808] root:R000_128 {51}.0:<br />

bglsn:/bgl/BlueLight/logs/BGL #<br />



However, we were able to ssh to the I/O node and run the df command (see<br />

Example 6-77).<br />

Example 6-77 Running the df command on an I/O node<br />

$ df<br />

Filesystem 1K-blocks Used Available Use% Mounted on<br />

rootfs 7931 1640 5882 22% /<br />

/dev/root 7931 1640 5882 22% /<br />

172.30.1.1:/bgl 9766544 4008736 5757808 42% /bgl<br />

172.30.1.33:/bglscratch 36700160 2411424 34288736 7% /bglscratch<br />

This check proves that the sshd daemon was started. Using the ps -ef command on that I/O node during the 10 minute GPFS timeout, we confirmed that the S40gpfs process was still running (with a sleep thread active).

After the 10 minutes, we see the block booted, but the /bubu file system<br />

(GPFS) is not available (see Example 6-78).<br />

Example 6-78 GPFS startup timeout has expired<br />

Apr 01 13:29:00 (I) [1088451808] root:R000_128 {17}.0: Starting SSH<br />

Apr 01 13:29:00 (I) [1088451808] root:R000_128 {17}.0: Starting GPFS<br />

Apr 01 13:29:00 (I) [1088451808] root:R000_128 {17}.0:<br />

/var/etc/rc.d/rc3.d/S40gpfs: 2: /usr/lpp/mmfs/bin/mmautoload: not found<br />

Apr 01 13:29:01 (I) [1088451808] root:R000_128 {17}.0: Disabling<br />

protocol version 1. Could not load host key<br />

Apr 01 13:29:01 (I) [1088451808] root:R000_128 {17}.0:<br />

Apr 01 13:39:01 (I) [1088451808] root:R000_128 {17}.0:<br />

R00-M0-N0-I:J18-U11 /var/etc/rc.d/rc3.d/S40gpfs: GPFS did not come up<br />

on I/O nod<br />

Apr 01 13:39:02 (I) [1088451808] root:R000_128 {17}.0:<br />

[ciod:initialized]<br />

Apr 01 13:39:02 (I) [1088451808] root:R000_128 {17}.0: e ionode2 :<br />

172.30.2.2<br />

Starting syslog services<br />

Starting ciod<br />

Starting XL Compiler Environment for I/O node<br />

BusyBox v1.00-rc2 (2006.03.24-21:52+0000) Built-in shell (ash)<br />

Enter 'help' for a list of built-in commands.<br />

/bin/sh: can't access tty; job control turned off<br />

$ ciod: version "Mar 28 2006 17:56:13"<br />

ciod: running in virtual node mode with 32 processors<br />



Apr 01 13:39:02 (I) [1088451808] root:R000_128 {17}.0:<br />

/var/etc/rc.d/rc3.d/S40gpfs: GPFS did not come up on I/O node ionode2 :<br />

172.30.2.2<br />

5. We switch the ppcfloor link back to the old driver to see whether we can immediately recover the situation and begin running jobs. First, however, we stop the control system before making the change, as shown in Example 6-79.

Example 6-79 Stopping bglmaster to prepare reverting to previous driver<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster stop<br />

Stopping BGLMaster<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster status<br />

BGLMaster is not started<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin #<br />

6. We switch the symbolic link, as shown in Example 6-80.<br />

Example 6-80 Reverting to the previous driver version (ppcfloor link)<br />

bglsn:/ # #ln -s /bgl/BlueLight/V1R2M2_120_2006-060328/ppc /bgl/BlueLight/ppcfloor<br />

bglsn:/ # ln -s /bgl/BlueLight/V1R2M1_020_2006-060110/ppc /bgl/BlueLight/ppcfloor<br />

bglsn:/ # ls -als /bgl/BlueLight/ppcfloor<br />

0 lrwxrwxrwx 1 root root 41 Apr 1 12:53 /bgl/BlueLight/ppcfloor -><br />

/bgl/BlueLight/V1R2M2_120_2006-060328/ppc<br />

bglsn:/ # ln -fs /bgl/BlueLight/V1R2M1_020_2006-060110/ppc /bgl/BlueLight/ppcfloor<br />

bglsn:/ # ls -als /bgl/BlueLight/ppcfloor<br />

0 lrwxrwxrwx 1 root root 41 Apr 1 12:53 /bgl/BlueLight/ppcfloor -><br />

/bgl/BlueLight/V1R2M2_120_2006-060328/ppc<br />

bglsn:/ # rm /bgl/BlueLight/ppcfloor<br />

bglsn:/ # ln -s /bgl/BlueLight/V1R2M1_020_2006-060110/ppc /bgl/BlueLight/ppcfloor<br />

bglsn:/ # ls -als /bgl/BlueLight/ppcfloor<br />

0 lrwxrwxrwx 1 root root 41 Apr 1 14:31 /bgl/BlueLight/ppcfloor -><br />

/bgl/BlueLight/V1R2M1_020_2006-060110/ppc<br />
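Note that the first two ln attempts in Example 6-80 did not actually change the link: without the -n (--no-dereference) option, ln follows the existing ppcfloor symbolic link and creates the new link inside the directory it points to, which is why the link had to be removed and re-created. A one-step alternative, assuming GNU ln as shipped with SLES, is sketched here:

   ln -sfn /bgl/BlueLight/V1R2M1_020_2006-060110/ppc /bgl/BlueLight/ppcfloor
   ls -als /bgl/BlueLight/ppcfloor    # verify that the link now points to the old driver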

7. We then restart the control system (Example 6-81).<br />

Example 6-81 Restarting bglmaster after reverting to previous driver<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster statut<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster status<br />

idoproxy started [28068]<br />

ciodb started [28069]<br />

mmcs_server started [28070]<br />

monitor stopped<br />

perfmon stopped<br />



8. We try to run a job. We start by allocating a block as in Example 6-82.<br />

Example 6-82 Allocating a block (mixed bgl driver versions)<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./mmcs_db_console<br />

connecting to mmcs_server<br />

connected to mmcs_server<br />

connected to DB2<br />

mmcs$ list_blocks<br />

OK<br />

mmcs$ allocate R000_128<br />

OK<br />

mmcs$ quit<br />

OK<br />

mmcs_db_console is terminating, please wait...<br />

mmcs_db_console: closing database connection<br />

mmcs_db_console: closed database connection<br />

mmcs_db_console: closing console port<br />

mmcs_db_console: closed console port<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ssh root@ionode1<br />

BusyBox v1.00-rc2 (2006.01.09-19:48+0000) Built-in shell (ash)<br />

Enter 'help' for a list of built-in commands.<br />

$ df<br />

Filesystem 1K-blocks Used Available Use% Mounted on<br />

rootfs 7931 1644 5878 22% /<br />

/dev/root 7931 1644 5878 22% /<br />

172.30.1.1:/bgl 9766544 4008760 5757784 42% /bgl<br />

172.30.1.33:/bglscratch 36700160 2411424 34288736 7% /bglscratch<br />

/dev/bubu_gpfs1 1138597888 3304448 1135293440 1% /bubu<br />



Here we can see that the GPFS file system is again working at the old Blue Gene/L driver level. Now we try to run the job, as shown in Example 6-83.

Example 6-83 Running a job (mixed driver versions)<br />

test1@bglfen1:/bglscratch/test1> llsubmit ior-gpfs.cmd<br />

llsubmit: The job "bglfen1.itso.ibm.com.50" has been submitted.<br />

test1@bglfen1:/bglscratch/test1> llq<br />

Id Owner Submitted ST PRI Class Running On<br />

------------------------ ---------- ----------- -- --- ------------ ----------bglfen1.50.0<br />

test1 4/1 12:56 R 50 small bglfen1<br />

1 job step(s) in queue, 0 waiting, 0 pending, 1 running, 0 held, 0 preempted<br />

test1@bglfen1:/bglscratch/test1> ls -lrt<br />

total 1598796<br />

-rwxr-xr-x 1 test1 itso 7553497 2006-03-21 11:33 hello-file-1.rts<br />

-rwxr--r-- 1 test1 itso 1826286 2006-03-21 15:53 hello.rts<br />

-rwxr-xr-x 1 test1 itso 7553579 2006-03-22 11:05 hello-file-2.rts<br />

drwxr-xr-x 4 test1 itso 4096 2006-03-23 10:16 applications<br />

-rw-r--r-- 1 test1 itso 638 2006-03-23 11:08 hello.cmd<br />

-rw-r--r-- 1 test1 itso 639 2006-03-23 11:16 hello128.cmd<br />

-rw-r--r-- 1 test1 itso 635 2006-03-24 16:48 ior-gpfs.cmd<br />

-rw-r--r-- 1 test1 itso 2643 2006-03-24 16:53 ior-gpfs.out.ciod-hung-scenario<br />

-rw-r--r-- 1 test1 itso 9435 2006-03-24 17:20 ior-gpfs.err.ciod-hung-scenario<br />

-rw-r--r-- 1 test1 itso 637 2006-03-28 17:41 cds_ior-gpfs.cmd<br />

-rwxr-xr-x 1 test1 itso 7546841 2006-03-29 11:47 hello-file.rts<br />

-rw-r--r-- 1 test1 itso 3755 2006-03-29 11:49 core.0<br />

-rw-r--r-- 1 root root 0 2006-03-29 19:36 R000_128-233.stdout<br />

-rw-r--r-- 1 root root 0 2006-03-29 19:36 R000_128-233.stderr<br />

-rw-r--r-- 1 root root 0 2006-03-29 19:38 R000_128-235.stdout<br />

-rw-r--r-- 1 root root 0 2006-03-29 19:38 R000_128-235.stderr<br />

-rw-r--r-- 1 test1 itso 1536 2006-03-29 19:44 hello-gpfs.out<br />

-rw-r--r-- 1 test1 itso 10415 2006-03-29 19:44 hello-gpfs.err<br />

-rwxr-xr-x 1 test1 itso 7539442 2006-03-30 10:02 hello-world.rts<br />

-rw-r--r-- 1 test1 itso 1605021456 2006-03-30 19:05 hello.out<br />

-rw-r--r-- 1 test1 itso 9339 2006-03-30 19:06 hello.err<br />

-rw-r--r-- 1 test1 itso 0 2006-04-01 15:48 ior-gpfs.out<br />

-rw-r--r-- 1 test1 itso 4284 2006-04-01 15:48 ior-gpfs.err<br />

test1@bglfen1:/bglscratch/test1> tail -f ior-gpfs.out<br />

IOR-2.8.3: MPI Coordinated Test of Parallel I/O<br />

Run began: Sat Apr 1 14:41:26 2006<br />

Command line used: /bglscratch/test1/applications/IOR/IOR.rts -f<br />

/bglscratch/test1/applications/IOR/ior-inputs<br />

Machine: Blue Gene L ionode3<br />



Summary:<br />

api = MPIIO (version=2, subversion=0)<br />

test filename = /bubu/Examples/IOR/IOR-output-MPIIO-0<br />

access = single-shared-file<br />

clients = 128 (16 per node)<br />

repetitions = 5<br />

xfersize = 32768 bytes<br />

blocksize = 1 MiB<br />

aggregate filesize = 128 MiB<br />

delaying 1 seconds . . .<br />

access bw(MiB/s) block(KiB) xfer(KiB) open(s) wr/rd(s) close(s) iter<br />

------ --------- ---------- --------- -------- -------- -------- ---write<br />

6.96 1024.00 32.00 0.108306 17.77 1.71 0<br />

delaying 1 seconds . . .<br />

read 76.59 1024.00 32.00 0.004671 1.67 0.001274 0<br />

delaying 1 seconds . . .<br />

write 6.61 1024.00 32.00 0.026399 19.02 1.35 1<br />

delaying 1 seconds . . .<br />

read 77.22 1024.00 32.00 0.004509 1.65 0.001283 1<br />

delaying 1 seconds . . .<br />

write 6.65 1024.00 32.00 0.024646 18.50 1.85 2<br />

delaying 1 seconds . . .<br />

read 76.83 1024.00 32.00 0.004666 1.66 0.001285 2<br />

delaying 1 seconds . . .<br />

This test proves that the job did indeed run as expected. Now, we switch the<br />

ppcfloor link back to the new driver, as shown in Example 6-84.<br />

Example 6-84 Switching driver versions<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster stop<br />

Stopping BGLMaster<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # cd /<br />

bglsn:/ # ls -als /bgl/BlueLight/ppcfloor<br />

0 lrwxrwxrwx 1 root root 41 Apr 1 14:31 /bgl/BlueLight/ppcfloor -><br />

/bgl/BlueLight/V1R2M1_020_2006-060110/ppc<br />

bglsn:/ # rm /bgl/BlueLight/ppcfloor<br />

bglsn:/ # ln -s /bgl/BlueLight/V1R2M2_120_2006-060328/ppc<br />

/bgl/BlueLight/ppcfloor<br />

bglsn:/ # ls -als /bgl/BlueLight/ppcfloor<br />

0 lrwxrwxrwx 1 root root 41 Apr 1 14:46 /bgl/BlueLight/ppcfloor -><br />

/bgl/BlueLight/V1R2M2_120_2006-060328/ppc<br />

bglsn:/ # ./bglmaster start<br />



-bash: ./bglmaster: No such file or directory<br />

bglsn:/ # cd -<br />

/bgl/BlueLight/ppcfloor/bglsys/bin<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster start<br />

Starting BGLMaster<br />

Logging to<br />

/bgl/BlueLight/logs/BGL/bglsn-bglmaster-2006-0401-14:47:00.log<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster status<br />

idoproxy started [28946]<br />

ciodb started [28947]<br />

mmcs_server started [28948]<br />

monitor stopped<br />

perfmon stopped<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin #<br />

9. Having switched back to the latest Blue Gene/L driver, we now upgrade or install the compatible GPFS code just for the I/O nodes to see whether they can then access the GPFS file system (see Example 6-85).

Example 6-85 Installing or updating GPFS for I/O nodes (for new driver)<br />

bglsn:/mnt/LPP/GPFS/MCP_PTF11 # rpm --root<br />

/bgl/BlueLight/V1R2M2_120_2006-060328/ppc/bglsys \ /bin/bglOS --nodeps -ivh<br />

gpfs*.rpm<br />

Preparing... ########################################### [100%]<br />

1:gpfs.msg.en_US ########################################### [ 20%]<br />

2:gpfs.base ########################################### [ 40%]<br />

3:gpfs.docs ########################################### [ 60%]<br />

4:gpfs.gpl ########################################### [ 80%]<br />

5:gpfs.gplbin ########################################### [100%]<br />

10.We check that the installed or upgraded GPFS for the I/O nodes works (see<br />

Example 6-86).<br />

Example 6-86 Checking the new GPFS for I/O node installation<br />

bglsn:/ # rpm --root<br />

/bgl/BlueLight/V1R2M2_120_2006-060328/ppc/bglsys/bin/bglOS -qa | grep<br />

gpfs<br />

gpfs.base-2.3.0-11<br />

gpfs.gpl-2.3.0-11<br />

gpfs.msg.en_US-2.3.0-11<br />

gpfs.docs-2.3.0-11<br />

gpfs.gplbin-2.3.0-11<br />



11.We test that we can boot a block and that the GPFS file system becomes<br />

available, as shown in Example 6-87.<br />

Example 6-87 Booting a block and checking GPFS availability<br />

mmcs$ allocate R000_128<br />

OK<br />

mmcs$ quit<br />

OK<br />

mmcs_db_console is terminating, please wait...<br />

mmcs_db_console: closing database connection<br />

mmcs_db_console: closed database connection<br />

mmcs_db_console: closing console port<br />

mmcs_db_console: closed console port<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ssh root@ionode5<br />

BusyBox v1.00-rc2 (2006.03.24-21:52+0000) Built-in shell (ash)<br />

Enter 'help' for a list of built-in commands.<br />

$ df<br />

Filesystem 1K-blocks Used Available Use% Mounted on<br />

rootfs 7931 1644 5878 22% /<br />

/dev/root 7931 1644 5878 22% /<br />

172.30.1.1:/bgl 9766544 4025264 5741280 42% /bgl<br />

172.30.1.33:/bglscratch 36700160 2411424 34288736 7% /bglscratch<br />

/dev/bubu_gpfs1 1138597888 3304448 1135293440 1% /bubu<br />

We can see that the GPFS file system (/bubu) is available again on the I/O nodes simply by installing the new GPFS code.

12.We try to run a LoadLeveler job, but first we free up the block, as shown in<br />

Example 6-88.<br />

Example 6-88 Unallocating the block for preparing the LoadLeveler run<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./mmcs_db_console<br />

connecting to mmcs_server<br />

connected to mmcs_server<br />

connected to DB2<br />

mmcs$ list_blocks<br />

OK<br />

R000_128 root(0) connected<br />

mmcs$ free R000_128<br />

OK<br />

test1@bglfen1:/bglscratch/test1> llq<br />

llq: There is currently no job status to report.<br />

test1@bglfen1:/bglscratch/test1> llstatus<br />



Name Schedd InQ Act Startd Run LdAvg Idle Arch OpSys<br />

bglfen1.itso.ibm.com Avail 0 0 Idle 0 0.00 1397 PPC64 Linux2<br />

bglfen2.itso.ibm.com Avail 0 0 Idle 0 0.00 9999 PPC64 Linux2<br />

PPC64/Linux2 2 machines 0 jobs 0 running<br />

Total Machines 2 machines 0 jobs 0 running<br />

The Central Manager is defined on bglsn.itso.ibm.com<br />

The BACKFILL scheduler with Blue Gene support is in use<br />

Blue Gene is present<br />

All machines on the machine_list are present.<br />

test1@bglfen1:/bglscratch/test1> llsubmit ior-gpfs.cmd<br />

llsubmit: The job "bglfen1.itso.ibm.com.51" has been submitted.<br />

test1@bglfen1:/bglscratch/test1> llq<br />

Id Owner Submitted ST PRI Class Running On<br />

------------------------ ---------- ----------- -- --- ------------ ----------bglfen1.51.0<br />

test1 4/1 13:26 R 50 small bglfen2<br />

1 job step(s) in queue, 0 waiting, 0 pending, 1 running, 0 held, 0 preempted<br />

test1@bglfen1:/bglscratch/test1> tail -f ior-gpfs.out<br />

IOR-2.8.3: MPI Coordinated Test of Parallel I/O<br />

Run began: Sat Apr 1 15:11:14 2006<br />

Command line used: /bglscratch/test1/applications/IOR/IOR.rts -f<br />

/bglscratch/test1/applications/IOR/ior-inputs<br />

Machine: Blue Gene L ionode3<br />

Summary:<br />

api = MPIIO (version=2, subversion=0)<br />

test filename = /bubu/Examples/IOR/IOR-output-MPIIO-0<br />

access = single-shared-file<br />

clients = 128 (16 per node)<br />

repetitions = 5<br />

xfersize = 32768 bytes<br />

blocksize = 1 MiB<br />

aggregate filesize = 128 MiB<br />

delaying 1 seconds . . .<br />

access bw(MiB/s) block(KiB) xfer(KiB) open(s) wr/rd(s) close(s) iter<br />

------ --------- ---------- --------- -------- -------- -------- ---write<br />

7.46 1024.00 32.00 0.100233 16.60 1.51 0<br />

delaying 1 seconds . . .<br />

read 77.16 1024.00 32.00 0.004662 1.65 0.002156 0<br />



delaying 1 seconds . . .<br />

write 6.61 1024.00 32.00 0.034111 18.72 1.87 1<br />

delaying 1 seconds . . .<br />

read 76.49 1024.00 32.00 0.004613 1.67 0.002134 1<br />

delaying 1 seconds . . .<br />

write 6.25 1024.00 32.00 0.024525 20.08 1.67 2<br />

delaying 1 seconds . . .<br />

read 77.15 1024.00 32.00 0.004607 1.65 0.001303 2<br />

delaying 1 seconds . . .<br />

write 6.88 1024.00 32.00 0.023640 18.38 1.27 3<br />

delaying 1 seconds . . .<br />

read 76.10 1024.00 32.00 0.004649 1.68 0.001303 3<br />

delaying 1 seconds . . .<br />

This series of steps proves that a LoadLeveler-initiated job ran successfully as well.

Lessons learned<br />

We learned the following lessons:<br />

► When upgrading the Blue Gene/L driver version, we must also re-install the GPFS code for the I/O nodes.
► If we upgrade the Blue Gene/L driver version and forget to upgrade the GPFS code, or do not have it available for installation, then Blue Gene/L operation can be restored by simply switching the ppcfloor symbolic link back to the original version.
► Again, we found that if there is a problem, GPFS startup on the I/O nodes holds the block boot for its 10 minute timeout.



6.3.7 Duplicate IP addresses in /etc/hosts<br />

This scenario deals with errors in the /etc/hosts file on the SN (duplicate IP<br />

addresses).<br />

Error injection<br />

Here the error is injected with no blocks allocated. We changed the /etc/hosts file on the SN by altering the IP address listed for ionode2 so that it duplicates the address of ionode1, as shown in Example 6-89.

Example 6-89 Duplicate IP address in /etc/hosts<br />

# BGL I/O nodes<br />

172.30.2.1 ionode1<br />

172.30.2.1 ionode2<br />

172.30.2.3 ionode3<br />

172.30.2.4 ionode4<br />

172.30.2.5 ionode5<br />

172.30.2.6 ionode6<br />

172.30.2.7 ionode7<br />

<strong>Problem</strong> determination<br />

When we ran the job this time using the mpirun command, it ran without any<br />

problems, as shown in Example 6-90.<br />

Example 6-90 Job running with mpirun (duplicate I/O node IP address)<br />

test1@bglfen1:/bglscratch/test1> mpirun -partition R000_128 -np 128 -exe<br />

/bglscratch/test1/applications/IOR/IOR.rts -args "-f<br />

/bglscratch/test1/applications/IOR/ior-inputs" -cwd<br />

/bglscratch/test1/applications/IOR<br />

+ mpirun -partition R000_128 -np 128 -exe /bglscratch/test1/applications/IOR/IOR.rts<br />

-args '-f /bglscratch/test1/applications/IOR/ior-inputs' -cwd<br />

/bglscratch/test1/applications/IOR<br />

IOR-2.8.3: MPI Coordinated Test of Parallel I/O<br />

Run began: Wed Apr 5 15:23:49 2006<br />

Command line used: /bglscratch/test1/applications/IOR/IOR.rts -f<br />

/bglscratch/test1/applications/IOR/ior-inputs<br />

Machine: Blue Gene L ionode3<br />

Summary:<br />

api = MPIIO (version=2, subversion=0)<br />

test filename = /bubu/Examples/IOR/IOR-output-MPIIO-0<br />

access = single-shared-file<br />



clients = 128 (16 per node)<br />

repetitions = 5<br />

xfersize = 32768 bytes<br />

blocksize = 1 MiB<br />

aggregate filesize = 128 MiB<br />

delaying 1 seconds . . .<br />

access bw(MiB/s) block(KiB) xfer(KiB) open(s) wr/rd(s) close(s) iter<br />

------ --------- ---------- --------- -------- -------- -------- ---write<br />

6.97 1024.00 32.00 0.122974 17.73 1.55 0<br />

delaying 1 seconds . . .<br />

read 76.59 1024.00 32.00 0.004633 1.66 0.002440 0<br />

delaying 1 seconds . . .<br />

write 7.46 1024.00 32.00 0.026446 16.61 1.59 1<br />

delaying 1 seconds . . .<br />

read 76.72 1024.00 32.00 0.004517 1.66 0.001283 1<br />

delaying 1 seconds . . .<br />

write 6.14 1024.00 32.00 0.023408 20.57 1.39 2<br />

delaying 1 seconds . . .<br />

read 75.71 1024.00 32.00 0.004603 1.69 0.001307 2<br />

delaying 1 seconds . . .<br />

write 5.99 1024.00 32.00 0.025519 21.04 1.97 3<br />

delaying 1 seconds . . .<br />

read 75.01 1024.00 32.00 0.004542 1.70 0.001311 3<br />

delaying 1 seconds . . .<br />

write 6.90 1024.00 32.00 0.025010 18.18 1.71 4<br />

delaying 1 seconds . . .<br />

read 75.54 1024.00 32.00 0.004614 1.69 0.001288 4<br />

Max Write: 7.46 MiB/sec (7.83 MB/sec)<br />

Max Read: 76.72 MiB/sec (80.45 MB/sec)<br />

Run finished: Wed Apr 5 15:25:45 2006<br />

test1@bglfen1:/bglscratch/test1><br />

We now login through ssh into an I/O node and check the /etc/hosts file:<br />

$ cat /etc/hosts<br />

172.30.2.1 ionode1<br />

172.30.2.2 ionode2<br />

172.30.2.3 ionode3<br />

As we can see, the /etc/hosts file on the I/O nodes was not affected by the<br />

injected error.<br />



We now investigate possible effects on GPFS (see Example 6-91).<br />

Example 6-91 Investigating GPFS<br />

bglsn:/etc # mmgetstate -a<br />

Node number Node name GPFS state<br />

-----------------------------------------<br />

1 bglsn_fn active<br />

2 ionode1 active<br />

2 ionode1 active<br />

4 ionode3 active<br />

5 ionode4 active<br />

6 ionode5 active<br />

7 ionode6 active<br />

8 ionode7 active<br />

9 ionode8 active<br />

bglsn:/etc #<br />

We also see that the output of the mmgetstate -a command on the SN still reports all nodes as active.

Lessons learned<br />

Duplicate IP addresses in the /etc/hosts file on the SN have little effect on running jobs. Even though we can assume GPFS is going to be impacted, this is revealed only when a GPFS cluster change occurs. A simple check, such as the one sketched below, can catch duplicates early.
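The following one-line check is a sketch of such a test; it lists any IP address that appears more than once in /etc/hosts (comment lines are skipped):

   awk '!/^#/ && NF {print $1}' /etc/hosts | sort | uniq -d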

6.3.8 Missing I/O node in /etc/hosts<br />

This scenario shows the effect of missing I/O node entries in the /etc/hosts file on the SN.

Error injection<br />

Example 6-92 shows the error injected with no blocks allocated.<br />

Example 6-92 Missing I/O node in /etc/hosts on the SN<br />

bglsn:/ # cat /etc/hosts<br />

.... > ....<br />

172.30.2.1 ionode1<br />

#172.30.2.2 ionode2<br />

172.30.2.3 ionode3<br />

172.30.2.4 ionode4<br />

172.30.2.5 ionode5<br />

172.30.2.6 ionode6<br />

.... > ....<br />



<strong>Problem</strong> determination<br />

When we ran the job using the mpirun command there were no problems. So we<br />

further investigate the state of GPFS on the SN (see Example 6-93).<br />

Example 6-93 The mmgetstate command reveals a missing node<br />

bglsn:/tmp # mmgetstate -a<br />

Node number Node name GPFS state<br />

-----------------------------------------<br />

1 bglsn_fn active<br />

2 ionode1 active<br />

4 ionode3 active<br />

5 ionode4 active<br />

6 ionode5 active<br />

7 ionode6 active<br />

8 ionode7 active<br />

9 ionode8 active<br />

mmgetstate: The following nodes could not be reached: ionode2<br />

Here, we see that ionode2 is not listed in the output, so we check the /etc/hosts file entries to get the IP address of ionode2:

bglsn:/tmp # grep ionode2 /etc/hosts<br />

#172.30.2.2 ionode2<br />

bglsn:/tmp # grep ionode2 /etc/hosts<br />

We then try to log in to ionode2 using ssh to check the GPFS status, as shown in Example 6-94.

Example 6-94 GPFS status<br />

bglsn:/tmp # ssh root@172.30.2.2<br />

BusyBox v1.00-rc2 (2006.01.09-19:48+0000) Built-in shell (ash)<br />

Enter 'help' for a list of built-in commands.<br />

$ df<br />

Filesystem 1K-blocks Used Available Use% Mounted on<br />

rootfs 7931 1644 5878 22% /<br />

/dev/root 7931 1644 5878 22% /<br />

172.30.1.1:/bgl 9766544 4076160 5690384 42% /bgl<br />

172.30.1.33:/bglscratch 36700160 2411424 34288736 7% /bglscratch<br />

/dev/bubu_gpfs1 1138597888 3353600 1135244288 1% /bubu<br />

$ hostname<br />

ionode2<br />



Lessons learned<br />

We learned the following lessons:<br />

► The mmgetstate command relies on system name resolution (in this case, the /etc/hosts file) to be able to query the state of the remote nodes.
► Missing entries in the /etc/hosts file do not necessarily mean that GPFS is unable to load on the I/O nodes. Because the GPFS configuration has not been altered, the outcome actually depends on which node initiated the GPFS communication.

Note: We emphasize again the importance of correct and identical name<br />

resolution for all nodes in all inter-connected GPFS clusters.<br />
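A simple loop such as the one sketched below can be used on the SN to confirm that every I/O node name resolves; the node names are the ones used in our configuration:

   for n in ionode1 ionode2 ionode3 ionode4 ionode5 ionode6 ionode7 ionode8
   do
       getent hosts $n > /dev/null || echo "$n does not resolve"
   done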

6.3.9 Adding an extra alias for the SN in /etc/hosts<br />

This scenario adds an extra alias for the SN in the /etc/hosts file on the SN.

Error injection<br />

Here the error is injected with no blocks allocated.<br />

We changed the /etc/hosts file by adding an SN alias:<br />

# Service Node interfaces<br />

10.0.0.1 bglsn_sn.itso.ibm.com bglsn_sn<br />

#172.30.1.1 bglsn_fn.itso.ibm.com bglsn_fn<br />

172.30.1.1 fluff bglsn_fn.itso.ibm.com bglsn_fn<br />

<strong>Problem</strong> determination<br />

When we ran the job this time using the mpirun command, it ran without any<br />

problems. So, while the job was running, we investigate the state of GPFS from<br />

the SN (see Example 6-95).<br />

Example 6-95 Checking GPFS node status<br />

bglsn:/tmp # mmgetstate -a<br />

Node number Node name GPFS state<br />

-----------------------------------------<br />

1 bglsn_fn active<br />

2 ionode1 active<br />

3 ionode2 active<br />

4 ionode3 active<br />

5 ionode4 active<br />

6 ionode5 active<br />



7 ionode6 active<br />

8 ionode7 active<br />

9 ionode8 active<br />

We ran the job, and it was successful. We then stopped GPFS, freed the block,<br />

and restarted GPFS, as shown in Example 6-96.<br />

Example 6-96 Restarting the bgIO GPFS cluster<br />

bglsn:/tmp # mmstartup -a<br />

Wed Apr 5 15:49:25 EDT 2006: mmstartup: Starting GPFS ...<br />

mmdsh: ionode7 rsh process had return code 1.<br />

mmdsh: ionode4 rsh process had return code 1.<br />

mmdsh: ionode1 rsh process had return code 1.<br />

mmdsh: ionode8 rsh process had return code 1.<br />

mmdsh: ionode6 rsh process had return code 1.<br />

mmdsh: ionode3 rsh process had return code 1.<br />

mmdsh: ionode5 rsh process had return code 1.<br />

mmdsh: ionode2 rsh process had return code 1.<br />

ionode7: ssh: connect to host ionode7 port 22: Connection refused<br />

ionode4: ssh: connect to host ionode4 port 22: Connection refused<br />

ionode1: ssh: connect to host ionode1 port 22: Connection refused<br />

ionode8: ssh: connect to host ionode8 port 22: Connection refused<br />

ionode3: ssh: connect to host ionode3 port 22: Connection refused<br />

ionode6: ssh: connect to host ionode6 port 22: Connection refused<br />

ionode5: ssh: connect to host ionode5 port 22: Connection refused<br />

ionode2: ssh: connect to host ionode2 port 22: Connection refused<br />

However, the errors from the mmstartup command were expected because no block was initialized at the time:

bglsn:/tmp # mmgetstate -a<br />

Node number Node name GPFS state<br />

-----------------------------------------<br />

1 bglsn_fn active<br />

bglsn:/tmp #<br />

Lessons learned<br />

We learn that the mmgetstate command relies on entries in the /etc/hosts file to be able to query the state of the remote nodes, but the addition of alias entries in the /etc/hosts file does not mean that GPFS will be unable to load on the I/O nodes (name resolution still works correctly).



6.4 Job submission scenarios<br />

This section presents scenarios related to mpirun and IBM LoadLeveler. We try to address the common problems observed by users of the system; in the real world there might be other problems that we have not addressed in this chapter. Our basic goal is to introduce a common problem determination methodology for the job running environment on Blue Gene/L.

Depending on the job submission method used (mpirun or LoadLeveler), we have divided this section into two subsections, one for running jobs using mpirun and one for LoadLeveler. In each subsection we first make sure that we can submit a job that executes successfully. We then inject the errors, resulting in a job failure, and perform problem determination following the methodology described in Chapter 2, “Problem determination methodology” on page 55.

6.4.1 The mpirun command: scenarios description<br />

In this section, we discuss problems generally encountered while submitting jobs to a Blue Gene/L system. Job submission has several mandatory prerequisites: the Blue Gene/L database (DB2) and the control system processes must be up and running (at the higher end), appropriate partitions must be available (number of nodes, partition shape), and the file systems must be available (at the lower end).

In our scenarios we do not deal with application errors (wrong libraries, execution arguments, bad programming, and so forth); rather, we focus on the job submission action itself and how the job interacts with the system. We tested the following list of scenarios, concentrating on what we consider basic errors observed while submitting jobs on the system:

1. Environment variables not set before submitting jobs<br />

– $MMCS_SERVER_IP variable on FEN<br />

– $BRIDGE_CONFIG_FILE variable on SN<br />

– $DB_PROPERTY variable on SN<br />

– Database user profile not sourced (db2profile)<br />

2. Remote command execution not set correctly (rsh)<br />

Each scenario consists of the following topics:<br />

► Error injection<br />

► <strong>Problem</strong> determination<br />

► Lessons learned<br />



6.4.2 The mpirun command: environment variables not set<br />

As previously mentioned, in this scenario we deal with unset environment<br />

variables.<br />

Scenario 1<br />

Environment variable MMCS_SERVER_IP is not set on the Front-End Node (FEN).<br />

Error injection<br />

We omit the line that sets MMCS_SERVER_IP in the user's profile (~/.bashrc or ~/.cshrc):

test1@bglfen1:~> env |grep MMCS_SERVER_IP<br />

test1@bglfen1:~><br />

As we can see, the environment variable is missing (not set).<br />

<strong>Problem</strong> determination<br />

By default, the mpirun command writes standard output (1) and error (2) on to the<br />

console from where the job was submitted, unless specifically redirected in the<br />

application. In our scenario, the job uses the console, as shown in Example 6-97.<br />

Example 6-97 Job’s stderr(2) on the console (missing $MMCS_SERVER_IP)<br />

test1@bglfen1:/bglscratch/test1> mpirun -partition R000_J108_32 -exe<br />

/bglscratch/test1/hello.rts -cwd /bglscratch/test1<br />

FE_MPI (ERROR): BG/L Service Node not set:<br />

FE_MPI (ERROR): Please set 'MMCS_SERVER_IP'<br />

env. variable or<br />

FE_MPI (ERROR): specify the Service Node<br />

with '-host' argument.<br />

Usage:<br />

or<br />

mpirun [options]<br />

mpirun [options] binary [arg1 arg2 ...]<br />

Try "mpirun -h" for more details.<br />

We can see the reason for the failure from the command output: we omitted both the $MMCS_SERVER_IP variable and the -host argument. By referring to “The mpirun checklist” on page 166 and following the checks mentioned there, we find the missing environment variable (MMCS_SERVER_IP).
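To fix the problem, either set the variable in the user's profile on the FEN or pass the -host argument to mpirun. The following is only a sketch; bglsn is the Service Node host name in our environment:

   # In ~/.bashrc on the FEN
   export MMCS_SERVER_IP=bglsn    # host name (or IP address) of the SN

   # Or specify the Service Node on the command line instead
   mpirun -host bglsn -partition R000_J108_32 -exe /bglscratch/test1/hello.rts -cwd /bglscratch/test1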



Scenario 2<br />

Environment variable BRIDGE_CONFIG_FILE is not set on the SN.<br />

Error injection<br />

We omit the line with BRIDGE_CONFIG_FILE in user’s profile (~/.bashrc or ~/.cshrc).<br />

Example 6-98 shows the ~/.bashrc file for user test1.<br />

Note: Normally, users (other than the system administrator, root) are not allowed to log in to the SN.

Example 6-98 Environment variable BRIDGE_CONFIG_FILE not set on SN<br />

test1@bglfen1:~> cat ~/.bashrc<br />

if [ $hstnm = "bglsn" ]<br />

then<br />

#export BRIDGE_CONFIG_FILE=/bglhome/test1/bridge_config_file.txt<br />

export DB_PROPERTY=/bgl/BlueLight/ppcfloor/bglsys/bin/db.properties<br />

source /bgl/BlueLight/ppcfloor/bglsys/bin/db2profile<br />

fi<br />

<strong>Problem</strong> determination<br />

By default, the mpirun command writes standard output (1) and error (2) on to the<br />

console from where the job was submitted, unless specifically redirected in the<br />

application. In our scenario, the job uses the console, as shown in Example 6-99.<br />

Example 6-99 Jobs stderr(2) for missing BRIDGE_CONFIG_FILE on SN<br />

test1@bglfen1:/bglscratch/test1> mpirun -partition R000_128 -np 128<br />

-exe /bglscratch/test1/hello-world.rts -verbose 1<br />

FE_MPI (Info) : Initializing MPIRUN<br />

FE_MPI (Info) : Scheduler interface library<br />

loaded<br />

BE_MPI (ERROR): The environment parameter<br />

"BRIDGE_CONFIG_FILE" is not set, set it to point to the configuration<br />

file<br />

BE_MPI (Info) : Starting cleanup sequence<br />

BE_MPI (Info) : BG/L Job alredy terminated /<br />

hasn't been added<br />

BE_MPI (Info) : Partition wasn't allocated by<br />

the mpirun - No need to remove<br />

FE_MPI (Info) : Back-End invoked:<br />

FE_MPI (Info) : - Service Node: bglsn<br />

FE_MPI (Info) : - Back-End pid: 6979 (on<br />

service node)<br />



BE_MPI (Info) : == BE completed ==<br />

FE_MPI (ERROR): Failure list:<br />

FE_MPI (ERROR): - 1. Failed to get machine<br />

serial number (bridge configuration file not found?) (failure #15)<br />

FE_MPI (Info) : == FE completed ==<br />

FE_MPI (Info) : == Exit status: 1 ==<br />

We can see from the command output the reason for the failure. Also, for clarity, we used the -verbose 1 option. By referring to “The mpirun checklist” on page 166 and following the checks mentioned there, we find a missing environment variable (BRIDGE_CONFIG_FILE not set).

Scenario 3<br />

Environment variable DB_PROPERTY is not set on the SN.<br />

Error injection<br />

We omit the line with DB_PROPERTY in user’s profile (~/.bashrc or ~/.cshrc).<br />

Example 6-100 shows the ~/.bashrc file for user test1.<br />

Example 6-100 Environment variable DB_PROPERTY not set on SN<br />

test1@bglsn:~> cat ~/.bashrc<br />

if [ $hstnm = "bglsn" ]<br />

then<br />

export BRIDGE_CONFIG_FILE=/bglhome/test1/bridge_config_file.txt<br />

#export DB_PROPERTY=/bgl/BlueLight/ppcfloor/bglsys/bin/db.properties<br />

source /bgl/BlueLight/ppcfloor/bglsys/bin/db2profile<br />

fi<br />

Problem determination

By default, the mpirun command writes standard output (1) and error (2) onto the console from where the job was submitted, unless specifically redirected in the application. In our scenario, the job uses the console, as shown in Example 6-101.

Example 6-101 Job’s stderr(2) for missing DB_PROPERTY variable on SN<br />

test1@bglfen1:/bglscratch/test1> mpirun -partition R000_J108_32 -exe<br />

/bglscratch/test1/hello.rts -cwd /bglscratch/test1 -verbose 1<br />

FE_MPI (Info) : Initializing MPIRUN<br />

FE_MPI (Info) : Scheduler interface library<br />

loaded<br />

BE_MPI (ERROR): db.properties file not found<br />



BE_MPI (ERROR): If it is not in the CWD,<br />

please set DB_PROPERTY env. var. to point to the file's location.<br />

BE_MPI (Info) : Starting cleanup sequence<br />

BE_MPI (Info) : BG/L Job alredy terminated /<br />

hasn't been added<br />

BE_MPI (Info) : Partition wasn't allocated by<br />

the mpirun - No need to remove<br />

FE_MPI (Info) : Back-End invoked:<br />

FE_MPI (Info) : - Service Node: bglsn<br />

FE_MPI (Info) : - Back-End pid: 13613 (on<br />

service node)<br />

BE_MPI (Info) : == BE completed ==<br />

FE_MPI (ERROR): Failure list:<br />

FE_MPI (ERROR): - 1. Failed to locate<br />

db.properties file (failure #14)<br />

FE_MPI (Info) : == FE completed ==<br />

FE_MPI (Info) : == Exit status: 1 ==<br />

We can see from the command output the reason for the failure. Also, for clarity, we used the -verbose 1 option. By referring to “The mpirun checklist” on page 166 and following the checks mentioned there, we find a missing environment variable (DB_PROPERTY not set).
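The checks for Scenarios 2 and 3 can be combined into one small test on the SN. The following is a minimal sketch; the variable names are the ones used above, and the test only verifies that each variable is set and points to a readable file:

# Verify the bridge and database configuration variables on the SN
for var in BRIDGE_CONFIG_FILE DB_PROPERTY; do
    val=$(eval echo \$$var)
    if [ -z "$val" ]; then
        echo "$var is not set"
    elif [ ! -r "$val" ]; then
        echo "$var points to a missing or unreadable file: $val"
    else
        echo "$var is OK: $val"
    fi
done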

Scenario 4<br />

Database user profile not sourced (db2profile) on the SN.<br />

Error injection<br />

We omit sourcing the db2profile in the user's profile (~/.bashrc or ~/.cshrc).

Example 6-102 shows the ~/.bashrc file for user test1.<br />

Example 6-102 db2profile not sourced on the SN<br />

test1@bglsn:~> cat ~/.bashrc<br />

if [ $hstnm = "bglsn" ]<br />

then<br />

export BRIDGE_CONFIG_FILE=/bglhome/test1/bridge_config_file.txt<br />

export DB_PROPERTY=/bgl/BlueLight/ppcfloor/bglsys/bin/db.properties<br />

# source /bgl/BlueLight/ppcfloor/bglsys/bin/db2profile<br />

fi<br />



Problem determination

By default, the mpirun command writes standard output (1) and error (2) onto the console from where the job was submitted, unless specifically redirected in the application. In our scenario, the job uses the console, as shown in Example 6-103.

Example 6-103 db2profile not set on the SN<br />

test1@bglfen1:/bglscratch/test1/applications/IOR> mpirun -partition<br />

R000_128 -np 128 -exe /bglscratch/test1/applications/IOR/IOR.rts -args<br />

"-f /bglscratch/test1/applications/IOR/ior-inputs" -cwd<br />

/bglscratch/test1/applications/IOR<br />

-CLI INVALID HANDLE----cliRC<br />

= -2<br />

line = 148<br />

file = DBConnection.cc<br />

-CLI INVALID HANDLE----cliRC<br />

= -2<br />

line = 167<br />

file = DBConnection.cc<br />

Values used for connection: database is {bgdb0}, user is {bglsysdb},<br />

schema is {bglsysdb}<br />

-CLI INVALID HANDLE----cliRC<br />

= -2<br />

line = 167<br />

file = DBConnection.cc<br />

Values used for connection: database is {bgdb0}, user is {bglsysdb},<br />

schema is {bglsysdb}<br />

-CLI INVALID HANDLE----cliRC<br />

= -2<br />

line = 167<br />

file = DBConnection.cc<br />

Values used for connection: database is {bgdb0}, user is {bglsysdb},<br />

schema is {bglsysdb}<br />

-CLI INVALID HANDLE----cliRC<br />

= -2<br />

line = 167<br />

file = DBConnection.cc<br />

Values used for connection: database is {bgdb0}, user is {bglsysdb},<br />

schema is {bglsysdb}<br />



-CLI INVALID HANDLE----cliRC<br />

= -2<br />

line = 167<br />

file = DBConnection.cc<br />

Values used for connection: database is {bgdb0}, user is {bglsysdb},<br />

schema is {bglsysdb}<br />

Unable to connect, aborting...<br />

FE_MPI (ERROR): blk_receive_incoming_message()<br />

- !<br />

FE_MPI (ERROR): blk_receive_incoming_message()<br />

- ! Receiver thread exited<br />

FE_MPI (ERROR): blk_receive_incoming_message()<br />

- ! Error code = 11<br />

FE_MPI (ERROR): blk_receive_incoming_message()<br />

- !<br />

FE_MPI (ERROR): blk_receive_incoming_message()<br />

- Switching to cleanup sequence...<br />

FE_MPI (ERROR): Failure list:<br />

FE_MPI (ERROR): - 1. One or more threads<br />

died (failure #57)<br />

We can see from the command output the reason for the failure. By referring to “The mpirun checklist” on page 166 and following the checks mentioned there, we find that the db2profile was not sourced on the SN.
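A rough way to test for this condition on the SN before submitting jobs is to check whether the DB2 environment has been initialized in the current shell. The following sketch assumes that db2profile exports the DB2INSTANCE variable and puts the db2 command in the PATH, and it uses the db2profile path of our installation:

# Rough check of whether db2profile has been sourced in this shell
if [ -z "$DB2INSTANCE" ] || ! type db2 >/dev/null 2>&1; then
    echo "DB2 environment not initialized - sourcing db2profile"
    . /bgl/BlueLight/ppcfloor/bglsys/bin/db2profile
fi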

6.4.3 The mpirun command: incorrect remote command<br />

execution (rsh) setup<br />

In this scenario, we simulate bad rsh authentication between one FEN and the SN. This prevents the mpirun front-end (running on the FEN) from talking to the mpirun back-end (running on the SN).

Error injection<br />

The error has been injected by commenting out (with “#”) the line corresponding to user test1@bglfen1 in the ~/.rhosts file of user test1@bglsn (see Example 6-104).

Example 6-104 User’s .rhosts file<br />

test1@bglfen1:~> rsh bglsn ls<br />

Permission denied.<br />

test1@bglsn:~> cat ~/.rhosts<br />



# bglfen1 test1<br />

bglfen2 test1<br />

bglsn test1<br />

Problem determination

We submit a job and check the console (by default, the terminal from where the job was submitted). We see the stderr(2) messages shown in Example 6-105.

Example 6-105 remote command execution fails (FEN to SN)<br />

test1@bglfen1:~> mpirun -np 32 -partition R000_J106_32 -exe<br />

/bglscratch/test1/hello-world.rts -cwd /bglscratch/test1 -verbose 1<br />

FE_MPI (Info) : Initializing MPIRUN<br />

FE_MPI (Info) : Scheduler interface library<br />

loaded<br />

Permission denied.<br />

FE_MPI (ERROR): waitForBackendConnections() -<br />

child process died unexpectedly<br />

FE_MPI (ERROR): Failed to get control and data<br />

connections from service node<br />

FE_MPI (ERROR): Failure list:<br />

FE_MPI (ERROR): - 1. Failed to execute<br />

Back-End mpirun on service node (failure #13)<br />

FE_MPI (Info) : == FE completed ==<br />

FE_MPI (Info) : == Exit status: 1 ==<br />

The error shown in Example 6-105 is observed on the console when a job is submitted using mpirun. The key error messages are Permission denied and Failed to execute Back-End mpirun on service node.

Because we got the error from the mpirun command, and there is little information (stderr(2)) about the reason for the failure (Permission denied), we check according to “The mpirun checklist” on page 166. Going through the steps one by one, we can see that the environment variables are set correctly on the FEN

and SN. We drill down and use the -only_test_protocol flag for mpirun, which<br />

tests the remote shell environment across the FENs and the SN, as shown in<br />

Example 6-106.<br />

Example 6-106 The mpirun argument of only_test_protocol<br />

test1@bglfen1:~> mpirun -np 32 -partition R000_J106_32 -exe<br />

/bglscratch/test1/hello-world.rts -cwd /bglscratch/test1 -verbose 1<br />

-only_test_protocol<br />

FE_MPI (Info) : Initializing MPIRUN<br />



FE_MPI (Info) : Scheduler interface library<br />

loaded<br />

FE_MPI (WARN) :<br />

======================================<br />

FE_MPI (WARN) : = Front-End - Only checking<br />

protocol =<br />

FE_MPI (WARN) : = No actual usage of the BG/L<br />

Bridge =<br />

FE_MPI (WARN) :<br />

======================================<br />

Permission denied.<br />

FE_MPI (ERROR): waitForBackendConnections() -<br />

child process died unexpectedly<br />

FE_MPI (ERROR): Failed to get control and data<br />

connections from service node<br />

FE_MPI (ERROR): Failure list:<br />

FE_MPI (ERROR): - 1. Failed to execute<br />

Back-End mpirun on service node (failure #13)<br />

FE_MPI (Info) : == FE completed ==<br />

FE_MPI (Info) : == Exit status: 1 ==<br />

The highlighted messages in Example 6-106 indicate the same set of errors that we observed during our initial job run. This points us to a problem with the remote command execution. Depending on the remote shell used (rsh in this case), we proceed to the respective checklist. Because rsh is used by mpirun by default, we decide to execute a simple date command from the FEN to the SN:

test1@bglfen1:~> rsh bglsn date<br />

Permission denied.<br />

This clearly states Permission denied. Therefore, we first check the ~/.rhosts file of the test1 user and detect the missing line for test1@bglfen1. We correct the file and re-submit the job successfully.
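A small sketch that can be run from each FEN before submitting jobs is shown below; it checks that passwordless rsh to the SN works and flags the Permission denied condition. The host name bglsn is from our test environment:

# Verify passwordless rsh from this FEN to the Service Node
out=$(rsh bglsn date 2>&1)
case "$out" in
    *denied*|*refused*) echo "rsh to bglsn failed: $out"
                        echo "Check ~/.rhosts on the SN for a line: $(hostname) $USER" ;;
    *)                  echo "rsh to bglsn is OK: $out" ;;
esac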

Lessons learned<br />

We learned the following lessons:<br />

► Environment variables on the FEN and SN should be set appropriately before submitting jobs on the Blue Gene/L system.
► The users' remote shell environment (rsh in this case) needs to be set up appropriately in order to successfully submit a job on the system.



6.4.4 LoadLeveler: scenarios description<br />

The scenarios in this section deal with problems that you can encounter when running IBM LoadLeveler in a Blue Gene/L environment. We assume that you have basic knowledge of IBM LoadLeveler. See the IBM LoadLeveler Using and Administering Guide, SA22-7881, for reference.

To better understand LoadLeveler problems on Blue Gene/L, a basic knowledge<br />

of Blue Gene/L system administration is also required. The following sections<br />

present the scenarios that we developed and tested for this book.<br />

6.4.5 LoadLeveler: job failed<br />

In this section, we analyze some possible reasons for a LoadLeveler job to fail.<br />

This scenario does not contain manual error injection, because we encountered<br />

these issues during our testing.<br />

Problem description

There are many reasons why a job could fail. In this scenario, we do not look into application-specific failures. Based on the discussion in 4.4.3, “How LoadLeveler plugs into Blue Gene/L” on page 172, an MPI job submitted to the LoadLeveler queue might fail if one of the following happens:

► LoadLeveler cannot talk to mpirun processes<br />

► mpirun processes cannot talk to LoadLeveler<br />

► Some files or file systems are not accessible<br />

► Problems with Blue Gene/L partitions

Detailed checking<br />

If any of the hypothesized problems above occur, there might be error messages recorded in the job stderr(2) file. Therefore, the first place to start is the job output.

To determine where the job saves those files, the job command file is one place to look. The file used in this scenario is listed in Example 4-16 on page 178. It contains the following two lines:

#@ output =<br />

/bgl/loadl/out/hello.$(schedd_host).$(jobid).$(stepid).out<br />

#@ error =<br />

/bgl/loadl/out/hello.$(schedd_host).$(jobid).$(stepid).err<br />



LoadLeveler replaces the variables $(schedd_host), $(jobid), and $(stepid) with their values; combined, they make up the job identifier. The two file names are:

/bgl/loadl/out/hello.bglfen2.2.0.out<br />

/bgl/loadl/out/hello.bglfen2.2.0.err<br />

Another way to determine where the output files are stored is to use the llq -l command, which displays a long list of job attributes, as shown in Example 6-107. The output of this command is relevant only if the job is still in the Running (R) state.

Example 6-107 Finding job stdout and stderr(2) from llq -l command<br />

loadl@bglfen1:/bgl/loadlcfg> llq
Id                       Owner      Submitted   ST PRI Class        Running On
------------------------ ---------- ----------- -- --- ------------ -----------
bglfen2.2.0              test1      3/30 14:17  R  50  small        bglfen1

1 job step(s) in queue, 0 waiting, 0 pending, 1 running, 0 held, 0 preempted

loadl@bglfen1:/bgl/loadlcfg> llq -l bglfen2.2.0<br />

...<br />

Out: /bglscratch/test1/ior-gpfs.out<br />

Err: /bglscratch/test1/ior-gpfs.err<br />

...<br />

Investigating the stderr(2) and stdout(1) files reveals that the stdout(1) file is empty, which means the job either did not start or did not advance far in execution before it failed. However, there is information in the stderr(2) file; its contents are listed in Example 6-108.

Example 6-108 Job stderr(2) contents<br />

FE_MPI (Info) : Initializing MPIRUN<br />

FE_MPI (Debug): Started, pid=2983,<br />

exe=/bgl/BlueLight/V1R2M1_020_2006-060110/ppc/bglsys/bin/mpirun,<br />

cwd=/bgl/loadlcfg<br />

FE_MPI (Debug): Collecting arguments ...<br />

FE_MPI (Debug): Checking for arguments in the<br />

environment<br />

FE_MPI (Debug): Parsing command line arguments<br />

FE_MPI (Debug): Checking usage...<br />



FE_MPI (Info) : Scheduler interface library<br />

loaded<br />

FE_MPI (Debug): Collecting arguments from<br />

external source (scheduler)<br />

FE_MPI (Debug): Calling external source to<br />

fill parameters...<br />

FE_MPI (Debug): Setting failure #12<br />

FE_MPI (Debug): Closing messaging queues<br />

FE_MPI (Debug): FE threads are down with the<br />

following return codes:<br />

FE_MPI (Debug): FE Sender : N/A (not<br />

initialized)<br />

FE_MPI (Debug): FE Receiver : N/A (not<br />

initialized)<br />

FE_MPI (Debug): FE Input : N/A (not<br />

initialized)<br />

FE_MPI (Debug): FE Output : N/A (not<br />

initialized)<br />

FE_MPI (Debug): Child shell process never<br />

started.<br />

FE_MPI (ERROR): Failure list:<br />

FE_MPI (ERROR): - 1. Front-End<br />

initialization failed (failure #12)<br />

FE_MPI (Info) : == FE completed ==<br />

FE_MPI (Info) : == Exit status: 1 ==<br />

In Example 6-108 the two ERROR messages are:<br />

FE_MPI (ERROR): Failure list:<br />

FE_MPI (ERROR): - 1. Front-End<br />

initialization failed (failure #12)<br />

These messages come from mpirun (FE_MPI). They indicate a Front-End initialization failure (failure #12).

From the following line in the job command file, we can see that the job was run with the mpirun option flag -verbose 2:

#@ arguments = -verbose 2 -exe /bgl/hello/hello.rts

The next thing to try is to increase the verbosity to a higher value, such as -verbose 4, to get more details on the ERROR messages. In our case, the -verbose 4 option produced the messages shown in Example 6-108.



From here we get another clue: the option flag value set on the #@ arguments<br />

line in the job command file does not seem to get passed to mpirun, which<br />

means that mpirun cannot talk to LoadLeveler.<br />

Drilling down the LoadLeveler checklist, we scan through the following items:<br />

► LoadLeveler run queue shows normal operation.<br />

► Simple job submission: This is a simple "Hello world!" job that fails. Therefore, the simple job submission check is not relevant.

► Job command file: The file used for this job has been checked and altered as<br />

previously mentioned (-verbose 4) with little impact.<br />

► LoadLeveler processes/daemons seem to be working as expected (llstatus<br />

and llq commands).<br />

► LoadLeveler logs: This is a job failure, so the log files to check are StartLog and StarterLog. The StartLog file contains the messages shown in Example 6-109.

Example 6-109 StartLog file contents<br />

03/23 10:23:24 TI-57 JOB_STATUS: Job bglsn.itso.ibm.com.21.0 Status =<br />

READY<br />

03/23 10:23:24 TI-57 QUEUED_STATUS: Step bglsn.itso.ibm.com.21.0<br />

queueing status to schedd at bglsn.itso.ibm.com<br />

03/23 10:23:24 TI-59 JOB_STATUS: Job bglsn.itso.ibm.com.21.0 Status =<br />

RUNNING<br />

03/23 10:23:24 TI-59 QUEUED_STATUS: Step bglsn.itso.ibm.com.21.0<br />

queueing status to schedd at bglsn.itso.ibm.com<br />

03/23 10:23:24 TI-61 Notification of user tasks termination received<br />

from Starter for job step bglsn.itso.ibm.com.21.0<br />

03/23 10:23:24 TI-62 JOB_STATUS: Job bglsn.itso.ibm.com.21.0 Status =<br />

COMPLETED<br />

03/23 10:23:24 TI-62 QUEUED_STATUS: Step bglsn.itso.ibm.com.21.0<br />

queueing status to schedd at bglsn.itso.ibm.com<br />

03/23 10:23:24 TI-62 Cleanup_dir: Job<br />

/root/execute/bglsn.itso.ibm.com.21.0 Removing directory.<br />



The log messages indicate that the job completed; there is no failure information. Looking into the StarterLog file, however, reveals some error messages, even though it also indicates that the job completed, as shown in Example 6-110.

Example 6-110 StarterLog file contents<br />

03/23 10:23:24 TI-0 bglsn.21.0 llcheckpriv program exited, termsig = 0,<br />

coredump = 0, retcode = 0<br />

03/23 10:23:24 TI-0 bglsn.itso.ibm.com.21.0 Sending READY status to<br />

Startd<br />

03/23 10:23:24 TI-0 bglsn.21.0 Main task program started (pid=11862<br />

process count=1).<br />

03/23 10:23:24 TI-0 bglsn.itso.ibm.com.21.0 Sending RUNNING status to<br />

Startd<br />

03/23 10:23:24 TI-0 LoadLeveler: 2539-453 Illegal protocol (121),<br />

received from another process on this machine - bglsn.itso.ibm.com.<br />

This daemon "LoadL_starter" is running protocol version (130).<br />

03/23 10:23:24 TI-0 bglsn.21.0 Task exited, termsig = 0, coredump = 0,<br />

retcode = 1<br />

03/23 10:23:24 TI-0 bglsn.21.0 User environment epilog not run, no<br />

program was specified.<br />

03/23 10:23:24 TI-0 bglsn.21.0 Epilog not run, no program was<br />

specified.<br />

03/23 10:23:24 TI-0 bglsn.itso.ibm.com.21.0 Sending COMPLETED status to<br />

Startd<br />

03/23 10:23:24 TI-0 bglsn.21.0 ********** STARTER exiting<br />

***************<br />

The ERROR message Illegal protocol (121)... indicates that some libraries are mismatched. Because all other checks have been performed, the last step is to make sure that the environment variables and library links are set up correctly on the SN. Example 6-111 shows how we checked the variables.

Example 6-111 Checking Blue Gene/L environment variables<br />

loadl@bglsn:~> echo $BRIDGE_CONFIG_FILE<br />

/bgl/BlueLight/ppcfloor/bglsys/bin/bridge.config<br />

loadl@bglsn:~> echo $DB_PROPERTY<br />

/bgl/BlueLight/ppcfloor/bglsys/bin/db.properties<br />

loadl@bglsn:~> echo $MMCS_SERVER_IP<br />

bglsn.itso.ibm.com<br />



Checking the library links is more subtle, because this problem might occur if a link was removed (easy to detect) or a library binary was replaced (not obvious; requires further investigation). The following error message from the LoadLeveler Starter log does not explicitly point to a specific library:

03/23 10:23:24 TI-0 LoadLeveler: 2539-453 Illegal protocol (121),<br />

received from another process on this machine - bglsn.itso.ibm.com.<br />

This daemon "LoadL_starter" is running protocol version (130).<br />

This message means that you need to check all LoadLeveler-related libraries on both the SN and the FENs. The script shown in Example 4-14 on page 175 can be used for this checking.
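If that script is not at hand, a simpler alternative is to compare checksums of the LoadLeveler shared libraries across the nodes. The following is a minimal sketch; the library directory and the host names are assumptions based on our environment and might differ on your system:

# Compare LoadLeveler library checksums on the SN and the FENs (run from the SN)
LIBDIR=/opt/ibmll/LoadL/full/lib        # assumed LoadLeveler library directory
for host in bglsn bglfen1 bglfen2; do
    echo "== $host =="
    rsh $host "md5sum $LIBDIR/*.so* 2>/dev/null" | sort
done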

In our scenario, we replaced two libraries (binaries) on all nodes, and the problem was resolved.

Lessons learned<br />

We learned the following lessons:<br />

► A simple "Hello world!" job can serve as a LoadLeveler checking tool. If this<br />

job fails, it’s a good indication that mpirun and basic job submission are OK.<br />

Nevertheless, investigating the job stderr(2) and stdout(1) files is a good<br />

place to start. See 4.3, “Submitting jobs using built-in tools” on page 149.<br />

► When the error messages from the job are not explicit, there are usually other<br />

ways to get more debugging information. Because the job goes through<br />

different components such as the LoadLeveler daemons and mpirun<br />

processes, their respective log files should be investigated to find traces (and<br />

errors) of the job. The various logs might reflect a different status for the job, depending on its state. Understanding the job life cycle at the various levels is key to finding errors in the job submission process. See 4.3.3, “Example of submitting a job using mpirun” on page 163.

► Checking the libraries used by LoadLeveler is an easily overlooked step when investigating a job failure. However, when such protocol errors occur, we found that mismatched libraries are often the cause.

6.4.6 LoadLeveler: job in hold state<br />

In this section, we investigate some possible causes of a job being held in the queue for a long time. We do not perform any error injection, because we already encountered some of these issues while working with LoadLeveler.



Problem description

A job submitted to LoadLeveler is going to run in batch mode. Depending on the<br />

resources available, the job might not be able to run immediately after<br />

submission. It is important to check the status of the job in the LoadLeveler<br />

queue. In this scenario, the job is seen in the Hold state. There are many hypotheses. Some of them are as follows:
► A dynamic problem occurs on the node, which the Scheduler is not aware of at the time it dispatches the job.
► When the Starter starts the job, it encounters a problem and the job cannot be started.
► The job could be held by its owner or an administrator for some reason.

Detailed checking<br />

Typically, the job's owner notices that the job has not run for a while after being submitted and that no job output has been produced. The LoadLeveler administrator could also spot the job in the Hold state in the queue. See Example 6-112.

Example 6-112 Job in Hold state<br />

test1@bglfen1:/bgl/loadl> llq
Id                       Owner      Submitted   ST PRI Class        Running On
------------------------ ---------- ----------- -- --- ------------ -----------
bglfen1.23.0             test1      3/26 14:06  H  50  small

1 job step(s) in queue, 0 waiting, 0 pending, 0 running, 1 held, 0 preempted

A quick check on LoadLeveler shows the cluster working normally. See<br />

Example 6-113.<br />

Example 6-113 Normal LoadLeveler cluster status<br />

test1@bglfen1:/bgl/loadl> llstatus<br />

Name Schedd InQ Act Startd Run LdAvg Idle Arch<br />

OpSys<br />

bglfen1.itso.ibm.com Avail 1 0 Idle 0 0.02 109 PPC64<br />

Linux2<br />

bglfen2.itso.ibm.com Avail 0 0 Idle 0 0.07 9999 PPC64<br />

Linux2<br />

PPC64/Linux2 2 machines 1 jobs 0 running<br />



Total Machines 2 machines 1 jobs 0 running<br />

The Central Manager is defined on bglsn.itso.ibm.com<br />

The BACKFILL scheduler with Blue Gene support is in use<br />

Blue Gene is present<br />

All machines on the machine_list are present.<br />

When we verify the run queue (llq command), we can see that the queue is OK. If the job is in the Hold state, it might be because a user or an administrator put it there. Example 6-114 shows the queue status.

Example 6-114 Job in Hold state by user<br />

loadl@bglfen1:/bgl/loadl> llq -l bglfen1.3.0 | more<br />

=============== Job Step bglfen1.itso.ibm.com.3.0 ===============<br />

Job Step Id: bglfen1.itso.ibm.com.3.0<br />

Job Name: bglfen1.itso.ibm.com.3<br />

Step Name: 0<br />

Structure Version: 10<br />

Owner: loadl<br />

Queue Date: Thu 06 Apr 2006 10:01:18 AM EDT<br />

Status: User Hold<br />

Reservation ID:<br />

Requested Res. ID:<br />

Scheduling Cluster:<br />

....<br />

Next, we check in the LoadL_config file for the following keywords<br />

(Example 6-115):<br />

► MAX_JOB_REJECT<br />

► ACTION_ON_MAX_REJECT<br />

Example 6-115 MAX_JOB_REJECT and ACTION_ON_MAX_REJECT<br />

# The MAX_JOB_REJECT value determines how many times a job can be
# rejected before it is canceled or put on hold. The default value
# is 0, which indicates a rejected job will immediately be canceled
# or placed on hold. MAX_JOB_REJECT may be set to unlimited rejects
# by specifying a value of -1.
#
MAX_JOB_REJECT = 0
#
# When ACTION_ON_MAX_REJECT is HOLD, jobs will be put on user hold
# when the number of rejects reaches the MAX_JOB_REJECT value. When
# ACTION_ON_MAX_REJECT is CANCEL, jobs will be canceled when the
# number of rejects reaches the MAX_JOB_REJECT value. The default
# value is HOLD.
#
ACTION_ON_MAX_REJECT = HOLD

If the job is already in the Hold state, the llq -s command cannot analyze its status. To work around this, we released the job using the llhold -r command (putting it back into the Idle state) and then quickly checked it with llq -s bglfen1.23.0, as shown in Example 6-116.

Example 6-116 Releasing held job to check for reasons<br />

test1@bglfen1:/bgl/loadl> llq
Id                       Owner      Submitted   ST PRI Class        Running On
------------------------ ---------- ----------- -- --- ------------ -----------
bglfen1.23.0             test1      3/26 14:06  I  50  small        (alloc)

1 job step(s) in queue, 1 waiting, 0 pending, 0 running, 0 held, 0 preempted

test1@bglfen1:/bgl/loadl>llq -s bglfen1.23.0<br />

...<br />

...<br />

==================== EVALUATIONS FOR JOB STEP bglfen1.itso.ibm.com.23.0<br />

====================<br />

Step waiting on following partitions to become FREE:<br />

RMP26Mr151607127<br />

Step waiting on following partitions to become FREE:<br />

RMP26Mr151607127<br />

test1@bglfen1:/bgl/loadl> llq
Id                       Owner      Submitted   ST PRI Class        Running On
------------------------ ---------- ----------- -- --- ------------ -----------
bglfen1.23.0             test1      3/26 14:06  H  50  small

1 job step(s) in queue, 0 waiting, 0 pending, 0 running, 1 held, 0 preempted

From Example 6-116, we can see that the Scheduler allocates the node resources to the job. The status shown by the llq -s command is normal.

The next check is to look into the LoadLeveler StartLog and StarterLog files to<br />

see if there is any information about jobid bglfen1.23.0, as shown in<br />

Example 6-117.<br />

Example 6-117 Job REJECTED according to StartLog<br />

03/26 14:28:59 TI-1667 JOB_START: Received start order from<br />

bglfen1.itso.ibm.com.<br />

03/26 14:28:59 TI-1667 inside StartdStep constructor for 269709664<br />

03/26 14:28:59 TI-1667 shutdown_active_count increasing from 0 to 1<br />

03/26 14:28:59 TI-1667 JOB_START: Step bglfen1.itso.ibm.com.23.0:<br />

Starting. Starter process id = 19333<br />

03/26 14:28:59 TI-4 Starter Table:<br />

StarterPid = 19333, ClientMachine = bglfen1.itso.ibm.com, user = test1,<br />

State = 2048, Flags = 8196, sid = bglfen1.itso.ibm.com.23.0, StateTimer<br />

= 1143401339<br />

03/26 14:28:59 TI-1671 Notification of user tasks termination received<br />

from Starter for job step bglfen1.itso.ibm.com.23.0<br />

03/26 14:28:59 TI-1672 JOB_STATUS: Job bglfen1.itso.ibm.com.23.0 Status<br />

= REJECTED<br />

03/26 14:28:59 TI-1672 QUEUED_STATUS: Step bglfen1.itso.ibm.com.23.0<br />

queueing status to schedd at bglfen1.itso.ibm.com<br />

03/26 14:28:59 TI-1672 Cleanup_dir: Job<br />

/home/loadl/execute/bglfen1.itso.ibm.com.23.0 Removing directory.<br />

03/26 14:28:59 TI-1672 ruid(0) euid(5001)<br />

03/26 14:28:59 TI-1672 shutdown_active_count decreasing from 1 to 0<br />



From the StartLog, we can see that the job is rejected but there is no further<br />

information provided. The next log to check is StarterLog, shown in<br />

Example 6-118.<br />

Example 6-118 Error found in StarterLog: permission denied<br />

03/26 14:07:09 TI-0 ********** STARTER starting up<br />

***********<br />

03/26 14:07:09 TI-0 LoadLeveler: LoadL_starter started, pid = 19333<br />

03/26 14:07:09 TI-0 Sending starter pid 19333.<br />

03/26 14:28:59 TI-0 bglfen1.23.0 Prolog not run, no program was<br />

specified.<br />

03/26 14:28:59 TI-0 bglfen1.23.0 run_dir =<br />

/home/loadl/execute/bglfen1.itso.ibm.com.23.0<br />

03/26 14:28:59 TI-0 bglfen1.23.0 Sending request for executable to<br />

Schedd<br />

03/26 14:28:59 TI-0 03/26 14:28:59 TI-0 bglfen1.23.0 User environment<br />

prolog not run, no program was specified.<br />

03/26 14:28:59 TI-0 LoadLeveler: 2539-475 Cannot receive command from<br />

client bglfen1.itso.ibm.com, errno =2.<br />

03/26 14:28:59 TI-0 bglfen1.23.0 llcheckpriv program exited, termsig =<br />

0, coredump = 0, retcode = -2<br />

03/26 14:28:59 TI-0 bglfen1.23.0 LoadL_starter: Cannot open stdout<br />

file. /bgl/loadl/out/hello.bglfen1.23.0.out: Permission denied (13)<br />

03/26 14:28:59 TI-0 bglfen1.23.0 User environment epilog not run, no<br />

program was specified.<br />

03/26 14:28:59 TI-0 bglfen1.23.0 cleanupStdErr: cannot stat<br />

/bgl/loadl/out/hello.bglfen1.23.0.err. rc=-1 errno=2 [No such file or<br />

directory]<br />

03/26 14:28:59 TI-0 bglfen1.23.0 Epilog not run, no program was<br />

specified.<br />

03/26 14:28:59 TI-0 bglfen1.itso.ibm.com.23.0 Sending REJECTED status<br />

to Startd<br />

03/26 14:28:59 TI-0 bglfen1.23.0 ********** STARTER exiting<br />

***************<br />

In the StarterLog, we can see that a “Permission denied” problem is associated with the stdout(1) file: /bgl/loadl/out/hello.bglfen1.23.0.out.

Note: There might not be enough information to pinpoint on which node the job was started. The StartLog and StarterLog files of multiple nodes in the LoadLeveler cluster might have to be checked.



In our case, it turned out that the output directory belongs to user ID loadl, while user ID test1's job is trying to write its stderr(2) and stdout(1) files into it. We corrected the permissions and solved the problem.
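A minimal sketch of the corresponding check is shown below; the output directory is the one from our job command file, and the right fix (ownership change or group write permission) depends on how user IDs are organized on your system:

# Run as the job owner: verify write access to the LoadLeveler output directory
ls -ld /bgl/loadl/out
touch /bgl/loadl/out/.writetest 2>/dev/null \
    && { rm -f /bgl/loadl/out/.writetest; echo "write access is OK"; } \
    || echo "no write access - correct the directory ownership or permissions as root"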

Lessons learned<br />

We learned the following lessons:<br />

► A job can be put into Hold state for many reasons. The job ID is essential for<br />

checking its status.<br />

► The job needs to be released in order for llq -s to assess the<br />

reasons (and get additional clues).<br />

► LoadLeveler NegotiatorLog might not have any information about this job. In<br />

this case, the job was rejected by a Starter on one of the (LoadLeveler)<br />

nodes.<br />

► The most difficult part is to look for the job ID in the StartLog or StarterLog on every node, as in the sketch below.
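The following sketch can ease that search; the host names and the LoadLeveler log directory are taken from our environment and are assumptions elsewhere:

# Search all LoadLeveler nodes for traces of a given job ID
JOBID=bglfen1.itso.ibm.com.23.0
LOGDIR=/home/loadl/log                  # assumed LoadLeveler log directory
for host in bglsn bglfen1 bglfen2; do
    echo "== $host =="
    rsh $host "grep -l $JOBID $LOGDIR/Start*Log* 2>/dev/null"
done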

6.4.7 LoadLeveler: job disappears<br />

In this section, we investigate some possible causes of a job disappearing from<br />

the queue.<br />

Error injection<br />

We pass a wrong argument to the mpirun command in the LoadLeveler job command file hello.cmd. The wrong argument is -verbose 5 (not supported by mpirun). This causes the job to fail quickly and disappear from the LoadLeveler queue.

Problem description

After the job is submitted, if it is a small job and there are no other jobs in the<br />

queue, it might run quickly and disappear from the queue. Sometimes, the job<br />

owner does not get a chance to see the job in Running state in the LoadLeveler<br />

queue. We could identify the following reasons:<br />

► The job might have run quickly (finished successfully) and nothing is wrong.<br />

► The job runs and fails quickly.<br />

► The job can be canceled by a user or an administrator.



Detailed checking<br />

We start by checking the job stderr(2) and stdout(1) files, following the job<br />

command checklist in 4.4.9, “LoadLeveler checklist” on page 186. The job<br />

command file that we used for this job is hello.cmd. It includes the following lines:<br />

#@ output =<br />

/bgl/loadl/out/hello.$(schedd_host).$(jobid).$(stepid).out<br />

#@ error =<br />

/bgl/loadl/out/hello.$(schedd_host).$(jobid).$(stepid).err<br />

The two files (stdout(1) and stderr(2)) are located in /bgl/loadl/out/ directory, as<br />

shown in Example 6-119.<br />

Example 6-119 Checking job output files<br />

test1@bglfen1:/bgl/loadl/out> ls -ltr<br />

-rw-r--r-- 1 loadl loadl 0 Mar 26 16:15 hello.bglfen1.24.0.out<br />

-rw-r--r-- 1 loadl loadl 193 Mar 26 16:15 hello.bglfen1.24.0.err<br />

Notice that the stdout(1) file is empty, thus we check the stderr(2) contents (listed<br />

in Example 6-120).<br />

Example 6-120 Job error in stderr(2) file<br />

FE_MPI (ERROR): Incorrect verbose option: '5'<br />

Usage:
mpirun [options]
or
mpirun [options] binary [arg1 arg2 ...]
Try "mpirun -h" for more details.

The error message comes from the mpirun front-end and indicates that there is a usage error when mpirun is invoked. This leads back to the job command file hello.cmd. The keyword to check is #@ arguments:

#@ arguments = -verbose 5 -exe /bglscratch/test1/hello.rts

That concludes this scenario, because the -verbose option flag is set to 5, which is an invalid value.

If the job is canceled by another user or by the system (LoadLeveler) administrator, the stderr(2) and stdout(1) files show different types of error messages; in that case, the next step would be to check the StartLog and StarterLog files on the node that ran the job.



Lessons learned<br />

We learned the following lessons:<br />

► When a job runs and fails quickly, you might not see it in the queue.<br />

► It is essential to set the output and error keywords in the job command file.<br />

► Check the job stdout(1) and stderr(2) for errors first.<br />

6.4.8 LoadLeveler: Blue Gene/L is absent<br />

In this section, we analyze what happens if LoadLeveler cannot talk to the

Blue Gene/L database. As LoadLeveler uses the bridge API to interrogate and<br />

submit requests to the Blue Gene/L Control system, the libraries needed to<br />

perform these tasks are critical for LoadLeveler.<br />

Error injection<br />

We intentionally remove the symbolic link to a Blue Gene/L provided library that<br />

LoadLeveler uses.<br />

Problem description

If the llstatus command displays the message Blue Gene is absent,<br />

LoadLeveler is not able to communicate with the Blue Gene/L control server<br />

through the bridge API. On the Blue Gene/L side, there are several scenarios<br />

that might render the bridge API unavailable. Those are described in the core<br />

system scenarios, in 6.2, “Blue Gene/L core system scenarios” on page 267. In<br />

this scenario, we focus on the libraries that LoadLeveler uses to communicate<br />

with the bridge API. Even though this can rarely happen, the following can cause<br />

problems with libraries:<br />

► A library binary file can be removed or corrupted.<br />

► A symbolic link pointing to the library can be broken.<br />

► A LoadLeveler or Blue Gene/L upgrade can be performed on some of the nodes, but not on all.

► An upgrade of libraries might also alter the symbolic links to point to the<br />

wrong binaries.<br />

Detailed checking<br />

Before injecting the error, we perform some basic checking to make sure<br />

LoadLeveler is operational and a job is submitted and run successfully (see<br />

Example 6-121).<br />



Example 6-121 LoadLeveler basic checks<br />

loadl@bglfen1:~> llstatus<br />

Name Schedd InQ Act Startd Run LdAvg Idle Arch<br />

OpSys<br />

bglfen1.itso.ibm.com Avail 0 0 Idle 0 0.00 27 PPC64<br />

Linux2<br />

bglfen2.itso.ibm.com Avail 0 0 Idle 0 0.00 9999 PPC64<br />

Linux2<br />

PPC64/Linux2 2 machines 0 jobs 0 running<br />

Total Machines 2 machines 0 jobs 0 running<br />

The Central Manager is defined on bglsn.itso.ibm.com<br />

The BACKFILL scheduler with Blue Gene support is in use<br />

Blue Gene is present<br />

All machines on the machine_list are present.<br />

loadl@bglfen1:~> llctl -g stop<br />

llctl: Sent stop command to host bglsn.itso.ibm.com<br />

llctl: Sent stop command to host bglfen1.itso.ibm.com<br />

llctl: Sent stop command to host bglfen2.itso.ibm.com<br />

loadl@bglfen1:~><br />

loadl@bglfen1:~> llctl -g start<br />

llctl: Attempting to start LoadLeveler on host bglsn.itso.ibm.com<br />

LoadL_master 3.3.2.0 rmer0611a 2006/03/15 SLES 9.0 130<br />

CentralManager = bglsn.itso.ibm.com<br />

llctl: Attempting to start LoadLeveler on host bglfen1.itso.ibm.com<br />

LoadL_master 3.3.2.0 rmer0611a 2006/03/15 SLES 9.0 130<br />

CentralManager = bglsn.itso.ibm.com<br />

llctl: Attempting to start LoadLeveler on host bglfen2.itso.ibm.com<br />

LoadL_master 3.3.2.0 rmer0611a 2006/03/15 SLES 9.0 130<br />

CentralManager = bglsn.itso.ibm.com<br />

loadl@bglfen1:~><br />

The link to a library binary is then removed. LoadLeveler is started and no errors<br />

are returned. See Example 6-122.<br />

Example 6-122 Starting LoadLeveler<br />

loadl@bglfen1:~> llctl -g start<br />

llctl: Attempting to start LoadLeveler on host bglsn.itso.ibm.com<br />

LoadL_master 3.3.2.0 rmer0611a 2006/03/15 SLES 9.0 130<br />

CentralManager = bglsn.itso.ibm.com<br />



llctl: Attempting to start LoadLeveler on host bglfen1.itso.ibm.com<br />

LoadL_master 3.3.2.0 rmer0611a 2006/03/15 SLES 9.0 130<br />

CentralManager = bglsn.itso.ibm.com<br />

llctl: Attempting to start LoadLeveler on host bglfen2.itso.ibm.com<br />

LoadL_master 3.3.2.0 rmer0611a 2006/03/15 SLES 9.0 130<br />

CentralManager = bglsn.itso.ibm.com<br />

A user or the administrator checks the status of LoadLeveler. The llstatus command shows Blue Gene is absent, as shown in Example 6-123.

Example 6-123 llstatus message “Blue Gene is absent”<br />

loadl@bglfen1:~> llstatus<br />

Name Schedd InQ Act Startd Run LdAvg Idle Arch<br />

OpSys<br />

bglfen1.itso.ibm.com Avail 0 0 Idle 0 0.07 2 PPC64<br />

Linux2<br />

bglfen2.itso.ibm.com Avail 0 0 Idle 0 0.04 9999 PPC64<br />

Linux2<br />

PPC64/Linux2 2 machines 0 jobs 0 running<br />

Total Machines 2 machines 0 jobs 0 running<br />

The Central Manager is defined on bglsn.itso.ibm.com<br />

The BACKFILL scheduler with Blue Gene support is in use<br />

Blue Gene is absent<br />

All machines on the machine_list are present.<br />

Any jobs submitted at this time will stay in Idle state, as shown in<br />

Example 6-124.<br />

Example 6-124 Job in Idle state<br />

loadl@bglfen1:/bgl/loadl> llq
Id                       Owner      Submitted   ST PRI Class        Running On
------------------------ ---------- ----------- -- --- ------------ -----------
bglfen1.19.0             loadl      3/24 13:43  I  50  small

1 job step(s) in queue, 1 waiting, 0 pending, 0 running, 0 held, 0 preempted



We first check the LoadLeveler NegotiatorLog, which shows an error opening a library, as shown in Example 6-125.

Example 6-125 The loadBridgeLibrary error in NegotiatorLog

03/28 18:04:48 TI-1 Machine Number index is 0, adapter list size is 0<br />

03/28 18:04:48 TI-1 Machine Number index is 1, adapter list size is 0<br />

03/28 18:04:48 TI-1 Machine Number index is 2, adapter list size is 0<br />

03/28 18:04:48 TI-1 BG: int BgManager::loadBridgeLibrary() - start<br />

03/28 18:04:48 TI-1 int BgManager::loadBridgeLibrary(): Failed to open<br />

library, /usr/lib64/libbglbridge.so, errno=25 (libtableapi.so.1: cannot<br />

open shared object file: No such file or directory)<br />

03/28 18:04:48 TI-1 int BgManager::initializeBg(BgMachine*): Failed to<br />

load Bridge API library<br />

03/28 18:04:48 TI-1 *************************************************<br />

03/28 18:04:48 TI-1 *** LOADL_NEGOTIATOR STARTING UP ***<br />

03/28 18:04:48 TI-1 *************************************************<br />

03/28 18:04:48 TI-1<br />

03/28 18:04:48 TI-1 LoadLeveler: LoadL_negotiator started, pid = 19594<br />

03/28 18:04:48 TI-1 void<br />

LlMachine::queueStreamMaster(OutboundTransAction*): Set destination to<br />

master. Transaction route flag is now central manager sending<br />

transaction CMnotifyCmd to Master<br />

03/28 18:04:48 TI-6 LoadLeveler: Listening on port 9614 service<br />

LoadL_negotiator<br />

The message points to the library /usr/lib64/libbglbridge.so, which fails to load because its dependency, the shared object file libtableapi.so.1, can no longer be resolved through the symbolic link that we removed earlier.

Note: Should this error occur in real life, we would check both the link and the<br />

binary file.<br />
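A minimal sketch of such a check is shown below; the library names are the ones from the error message, and any other Blue Gene/L libraries that your level of LoadLeveler links to can be added:

# Inspect the bridge library, its symbolic link targets, and its runtime dependencies
ls -l /usr/lib64/libbglbridge.so* /usr/lib64/libtableapi.so* 2>&1
ldd /usr/lib64/libbglbridge.so | grep "not found"    # no output means all dependencies resolve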

In this scenario, after the link is restored, LoadLeveler detects this, and the next<br />

llstatus command shows the message:<br />

Blue Gene is present.<br />

Jobs then run normally.



Lessons learned<br />

We learned the following lessons:<br />

► As LoadLeveler creates multiple links to a binary shared library, we have to<br />

remove all links to create the problem.<br />

► Without access to the Blue Gene/L bridge API, the jobs in LoadLeveler queue<br />

will stay in Idle state. They cannot be run.<br />

► It is essential to check whether Blue Gene/L is present from LoadLeveler’s<br />

perspective.<br />

6.4.9 LoadLeveler: LoadLeveler cannot start<br />

In this section we investigate possible causes for LoadLeveler daemons not<br />

starting. These daemons include the Central Manager (Master) and the<br />

Negotiator.<br />

Problem description

When LoadLeveler cannot be started or the commands return errors, it means<br />

either the LoadLeveler Master or Negotiator daemons cannot start. This scenario<br />

focuses on the Negotiator problems, which could be one of the following:<br />

► LoadLeveler configuration files cannot be accessed<br />

► The Master daemon cannot start the Negotiator daemon on the Central<br />

Manager node<br />

► The Negotiator daemon does not start, or crashes shortly after starting

► LoadLeveler cannot access the necessary libraries<br />

A similar scenario can be followed to check problems with other daemons as well. However, the libraries, error messages, and log files will be different.

Detailed checking<br />

The common sign of Negotiator problems is that the LoadLeveler commands return error messages similar to the ones shown in Example 6-126.

Example 6-126 LoadLeveler Negotiator error messages<br />

loadl@bglsn:~/cmd> llq<br />

03/21 09:35:52 llq: 2539-463 Cannot connect to bglsn.itso.ibm.com<br />

"LoadL_negotiator" on port 9614. errno = 111<br />

03/21 09:35:52 llq: 2539-463 Cannot connect to bglsn.itso.ibm.com<br />

"LoadL_negotiator" on port 9614. errno = 111<br />

llq: 2512-301 An error occurred while receiving data from the<br />

LoadL_negotiator daemon on host bglsn.itso.ibm.com.<br />



The error messages shown in Example 6-126 are the starting point for this<br />

scenario. We have observed that the LoadLeveler commands do not display<br />

normal information or status. Therefore, the checks in “LoadLeveler cluster and<br />

node status” on page 186 cannot be performed.<br />

This is not a problem related to jobs; thus, the checks related to “LoadLeveler run queue” on page 193 and “Job command file” on page 198 can be skipped for now.

We have to determine which node is the Central Manager node, and which one<br />

has the LoadL_negotiator daemon running. As we have seen, when Blue Gene/L<br />

is involved, this is always the SN. In this case, we can perform the checks in<br />

“LoadLeveler configuration keywords” on page 205 which pinpoint the Central<br />

Manager node.<br />

When on the SN, we focus on the checks in “LoadLeveler processes, logs, and<br />

persistent storage” on page 202. The ps command shows if the Negotiator<br />

daemon is running. Example 6-127 shows that the LoadL_negotiator process is<br />

running. Even so, the process could be starting and crashing repeatedly. Comparing the Negotiator's process ID between two invocations of the ps command might reveal this, if the process ID has changed.

Example 6-127 Looking for the LoadL_negotiator process<br />

bglsn:~ # ps -ef | grep LoadL<br />

loadl 20189 1 0 15:40 ? 00:00:00<br />

/opt/ibmll/LoadL/full/bin/LoadL_master<br />

loadl 20199 20189 0 15:40 ? 00:00:05 LoadL_negotiator -f -c<br />

/tmp -C /tmp<br />

root 24894 24792 0 18:02 pts/22 00:00:00 grep LoadL<br />

If the Negotiator process starts and crashes, a LoadLeveler command should display the same error messages as in Example 6-126. If this happens quickly, the Negotiator might not have enough time to write log information, in which case the MasterLog should be checked first.



As highlighted in Example 6-128, the Master daemon has recorded some errors while starting LoadL_negotiator. As we can see, the Negotiator died shortly after it started.

Example 6-128 Negotiator error messages in MasterLog<br />

04/05 15:23:17 TI-1 CentralManager = bglsn.itso.ibm.com<br />

04/05 15:23:17 TI-1 Inode monitoring will not be performed on<br />

/home/loadl/log because it is a Reiser Filesystem which does not limit<br />

the number of inodes.<br />

04/05 15:23:17 TI-1 LoadLeveler: LoadL_master started, pid = 17689<br />

04/05 15:23:17 TI-11 LoadLeveler: 2539-463 Cannot connect to<br />

bglsn.itso.ibm.com "LoadL_negotiator" on port 9614. errno = 111<br />

04/05 15:23:17 TI-10 LoadL_negotiator started, pid = 17710<br />

04/05 15:24:21 TI-10 "LoadL_negotiator" on "bglsn.itso.ibm.com" died<br />

due to signal 11, attempting to restart<br />

04/05 15:24:21 TI-10 LoadL_negotiator started, pid = 18330<br />

04/05 15:24:27 TI-21 LoadLeveler: 2539-463 Cannot connect to<br />

bglsn.itso.ibm.com "LoadL_negotiator" on port 9614. errno = 111<br />

04/05 15:30:28 TI-10 "LoadL_negotiator" on "bglsn.itso.ibm.com" died<br />

due to signal 11, attempting to restart<br />

04/05 15:30:28 TI-10 LoadL_negotiator started, pid = 18506<br />

04/05 15:30:28 TI-28 Got SHUTDOWN command from "loadl" with uid = 7001<br />

on machine "bglsn.itso.ibm.com"<br />

04/05 15:30:28 TI-28 Master shutting down now.<br />

04/05 15:30:28 TI-30 LoadLeveler: 2539-463 Cannot connect to<br />

bglsn.itso.ibm.com "LoadL_negotiator" on port 9614. errno = 111<br />

Another error message in the MasterLog also indicates a problem with<br />

connecting to TCP port 9614. Investigating the NegotiatorLog reveals the same<br />

error messages (see Example 6-129).<br />

Example 6-129 Port 9614 error messages in NegotiatorLog<br />

04/05 15:30:28 TI-1 *************************************************<br />

04/05 15:30:28 TI-1 *** LOADL_NEGOTIATOR STARTING UP ***<br />

04/05 15:30:28 TI-1 *************************************************<br />

04/05 15:30:28 TI-1<br />

04/05 15:30:28 TI-1 LoadLeveler: LoadL_negotiator started, pid = 18506<br />

04/05 15:30:28 TI-6 LoadLeveler: 2539-479 Cannot listen on port 9614<br />

for service LoadL_negotiator.<br />

04/05 15:30:28 TI-1 void<br />

LlMachine::queueStreamMaster(OutboundTransAction*): Set destination to<br />

master. Transaction route flag is now central manager sending<br />

transaction CMnotifyCmd to Master<br />



04/05 15:30:28 TI-6 LoadLeveler: Batch service may already be running<br />

on this machine.<br />

04/05 15:30:28 TI-6 LoadLeveler: Delaying 1 seconds and retrying ...<br />

04/05 15:30:28 TI-7 LoadLeveler: Listening on port 9612 service<br />

LoadL_negotiator_collector<br />

04/05 15:30:28 TI-6 LoadLeveler: 2539-479 Cannot listen on port 9614<br />

for service LoadL_negotiator.<br />

04/05 15:30:28 TI-6 LoadLeveler: Batch service may already be running<br />

on this machine.<br />

04/05 15:30:28 TI-6 LoadLeveler: Delaying 2 seconds and retrying ...<br />

04/05 15:30:28 TI-8 LoadLeveler: Listening on path<br />

/tmp/negotiator_unix_stream_socket<br />

04/05 15:30:28 TI-10 Dispatching.<br />

Because there is evidently an issue with TCP port 9614, we use the netstat command to check its status, as shown in Example 6-130. The command output shows that TCP port 9614 is in the LISTEN state, which is OK for now. However, at the time the Negotiator had the problem, this port could have been in a different state, such as FIN_WAIT, and thus not available, because the Negotiator starts and terminates rapidly.

Example 6-130 Checking the state of a port or socket<br />

bglsn:/tmp # netstat -an | grep 9614<br />

tcp 0 0 0.0.0.0:9614 0.0.0.0:*<br />

LISTEN<br />

Additional error messages pointing to problems with creating a socket are<br />

revealed in SchedLog and StartLog, as shown in Example 6-131.<br />

Example 6-131 Socket error messages in StartLog and SchedLog<br />

Startlog:<br />

04/05 13:00:18 TI-10 LoadLeveler: 2539-484 Cannot start unix socket on<br />

path /tmp/startd_unix_dgram_socket. errno = 98<br />

04/05 13:00:18 TI-4 Starter Table:<br />

04/05 13:00:18 TI-16 LoadLeveler: 2539-463 Cannot connect to<br />

bglsn.itso.ibm.com "LoadL_negotiator_collector" on port 9612. errno =<br />

111<br />

SchedLog:<br />

04/05 13:22:31 TI-99 LoadLeveler: 2539-475 Cannot receive command from<br />

client bglsn.itso.ibm.com, errno =25.<br />

LoadLeveler: Failed to route task_resource_req_list (43008) in virtual<br />

int Task::encode(LlStream&)<br />



LoadLeveler: Failed to route node_tasks (34006) in virtual int<br />

Node::encode(LlStream&)<br />

LoadLeveler: Failed to route step_nodes (40033) in virtual int<br />

Step::encode(LlStream&)<br />

LoadLeveler: Failed to route StepList Steps (41002) in virtual int<br />

StepList::encode(LlStream&)<br />

LoadLeveler: Failed to route job_steps (22009) in virtual int<br />

Job::encode(LlStream&)<br />

04/05 13:22:31 TI-97 LoadLeveler: 2539-475 Cannot receive command from<br />

client bglsn.itso.ibm.com, errno =25.<br />

LoadLeveler: Failed to route job_environment_vectors (22008) in virtual<br />

int Job::encode(LlStream&)<br />

As it turns out, the key to this problem is that, in a Linux environment, certain daemons open their temporary communication sockets in the /tmp directory, and most LoadLeveler daemons do.

In normal situations, a daemon opens the socket in /tmp and then removes it on clean exit. However, if environment or network problems occur and the daemon exits abnormally, the temporary socket file might be left behind.

The next time the daemon is started, it might not be able to overwrite the same file, especially when it is started under a different user ID. In LoadLeveler's case, this could happen if the loadl user starts LoadLeveler after it was previously started (and did not exit cleanly) by the root user.

As a result, the file left over by the root user cannot be overwritten by another user (loadl). These files need to be cleaned up manually from the /tmp directory on every node before LoadLeveler can be started again.

Note: The socket files created by different daemons have different names, as<br />

shown in Example 6-132.<br />

Example 6-132 Daemons’ socket files in /tmp<br />

bglsn:/tmp # ls -l *socket<br />

srwxrwxrwx 1 loadl loadl 0 Apr 5 15:40 negotiator_unix_stream_socket<br />

bglfen1:/tmp # ls -l *socket<br />

srwxrwxrwx 1 loadl loadl 0 Apr 5 13:33 startd_unix_dgram_socket<br />

srwxrwxrwx 1 loadl loadl 0 Apr 5 13:33 startd_unix_stream_socket<br />



To detect the left-over socket files in /tmp of every node, you need to stop<br />

LoadLeveler on all nodes. Use the ps command to make sure no LoadLeveler<br />

daemons are still around. Then, check the /tmp directory on every node for<br />

socket files with names similar to the ones in Example 6-132.<br />

In our case, this was the issue, and we solved it by stopping LoadLeveler on all nodes, cleaning up the leftover socket files, and restarting LoadLeveler.
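A hedged sketch of this cleanup is shown below; the host names are from our environment, and the commands should be run only after verifying that no LoadLeveler daemons are left running:

# Remove leftover LoadLeveler socket files from /tmp on every node
for host in bglsn bglfen1 bglfen2; do
    echo "== $host =="
    rsh $host "ps -ef | grep -c '[L]oadL_'"          # should report 0 remaining daemons
    rsh $host "ls -l /tmp/*_unix_*_socket 2>/dev/null && rm -f /tmp/*_unix_*_socket"
done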

If manual removal of the socket files does not resolve the problem, the next step<br />

is to look into the core files generated by Negotiator (on abnormal exit). We can<br />

identify the core files from the following error messages:<br />

04/05 15:24:21 TI-10 "LoadL_negotiator" on "bglsn.itso.ibm.com" died<br />

due to signal 11, attempting to restart<br />

These errors can be found in the MasterLog shown in Example 6-128 on<br />

page 377.<br />

On a Blue Gene/L system, it is usual to set up processes to redirect their core files into the common directory /bgl/cores (/bgl is NFS mounted on all nodes: SN, FENs, and I/O Nodes).

However, LoadLeveler on Linux has one requirement to be able to generate core files: LoadLeveler has to be started under the root user ID.

If LoadLeveler is not set up to run as the root user, this needs to be changed first. See Appendix A, “Installing and setting up LoadLeveler for Blue Gene/L” on page 409 for a procedure to set up the loadl user ID, then issue the command llctl -g start as root. If the Negotiator daemon crashes with signal 11 again, this time a core file is generated in /bgl/cores/.

Then, the core file can be investigated using the debugger (gdb). Depending on the information revealed by the stack trace of the memory dump, different procedures can be followed. If the stack trace does not have enough debugging information, a debugger-enabled version of the LoadL_negotiator binary has to be used to re-create and generate the core file again.
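As a minimal sketch, the core file can be opened as follows; the binary path (/opt/ibmll/LoadL/full/bin) matches a default installation and the core file name is only an example.

cd /bgl/cores
gdb /opt/ibmll/LoadL/full/bin/LoadL_negotiator ./core.12345
# At the (gdb) prompt:
#   bt             # print the stack trace of the crashed thread
#   info threads   # list the threads captured in the dump
#   quit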

Additional checking can continue with validation of the libraries and their symbolic links. See “Environment variables, network, and library links” on page 206.



Lessons learned
We learned the following lessons:
► If a daemon process exits abnormally, a new process might be spawned automatically. This is defined in the LoadL_config file with the keyword #@RESTART_PER_HOUR.
► A temporary socket file is generated in the /tmp directory by the LoadLeveler daemons. If a daemon exits abnormally, the socket file might be left behind.
► A Negotiator process exiting with signal 11 should generate a core file for debugging purposes. On Linux systems, this core file is generated in the /tmp directory by default. However, on a Blue Gene/L system, this is usually configured by the system administrator to /bgl/cores/.



Chapter 7. Additional topics

This chapter presents two additional topics of interest for a Blue Gene/L environment:
► Cluster Systems Management
► Secure shell

Although a basic Blue Gene/L system can function without either Cluster Systems Management or secure shell, these products are needed for the centralized management and integration of the Blue Gene/L system in your computing environment.



7.1 Cluster Systems Management
This section provides a high-level introduction to Cluster Systems Management (CSM) and its use with Blue Gene/L. We do not provide the detailed information that you need to plan, install, and run CSM. For this, refer to the CSM product documentation. CSM support for Blue Gene/L was released as part of CSM 1.5 in November 2005. The information in this section pertains to this level of CSM. However, be sure to check for updates in the latest CSM documentation, which is available at:
http://www14.software.ibm.com/webapp/set2/sas/f/csm/home.html

7.1.1 Overview of CSM
The Blue Gene Service Node software records configuration, RAS, and environmental information in the Blue Gene DB2 database. It also provides a Web interface and several CLI tools for working with the information that is stored. However, it is up to the Blue Gene administrators and users to watch or check the database, to determine when a problem occurs, and then to take appropriate action.

CSM is an IBM licensed software product that is used to manage clusters consisting of AIX and Linux systems, from one to thousands. CSM has many capabilities, but in this section we focus on just one: its rich event monitoring and automated response capability. With CSM, you can specify what constitutes an event, and what should happen, automatically, if and when that event occurs.

By installing CSM on your Blue Gene SN, you can automate the task of watching the database and taking corrective actions. For example, you could use CSM to watch the status of all the midplanes. If a midplane is marked in error or is marked as missing, the CSM software can detect this event and take whatever action you have specified, automatically. Perhaps you want to be paged or to receive an urgent e-mail when the event occurs. Or perhaps a special script should be run. Or perhaps all of these responses should happen simultaneously. CSM allows you to set up whatever monitoring and automated responses you need.

You might be thinking, “This monitoring and automated response stuff sounds interesting, but I don’t get it. Why drag a cluster management product like CSM into the picture?” Well, one good reason is precisely the monitoring and automated response capabilities of CSM. These capabilities are very powerful and customizable. You can ignore everything else about CSM if you choose.



Over time, as you become more familiar with the other capabilities of CSM and begin to view your Blue Gene/L systems, such as the SN, Front-End Nodes (FENs), and file servers, as a set of systems that you would like to manage from a single point of control, you might find more and more reasons to use other features of CSM.

Moreover, if you have a raft of other IBM systems around (running Linux and AIX), these could be centrally managed along with your Blue Gene system from the same management server. For now, however, we concentrate on a one-node CSM cluster: your Blue Gene SN.

To use CSM with your Blue Gene in the simplest manner possible, begin with a fully installed, fully operational Blue Gene/L system, including SN, FENs, and file servers. Next, obtain CSM through your IBM sales representative, or just grab the free, full-featured 60-day try-and-buy version from:
http://www14.software.ibm.com/webapp/set2/sas/f/csm/home.html

Follow the instructions in the manual CSM for AIX 5L and Linux V1.5 Planning and Installation Guide, SA23-1344-01, to install and configure the CSM management server software on your SN. Then follow the instructions in the same book for adding the optional CSM support for Blue Gene. When you have done this, your Blue Gene SN will double as a CSM management server and as the lone managed node in the CSM cluster.

Note: The CSM software (server and client) is installed on the SN. If you want to configure your FENs and file servers as managed nodes too, you can do so by installing the CSM client software, but that is optional. However, no CSM software is installed on the Blue Gene I/O or Compute Nodes.

In the sections that follow, we discuss the monitoring and automated response capabilities of CSM and how to use them. You can find more information in the following CSM and Reliable Scalable Cluster Technology (RSCT) product publications (all available at the link previously mentioned):
► CSM for AIX 5L and Linux V1.5 Administration Guide, SA23-1343-01
► CSM for AIX 5L and Linux V1.5 Command and Technical Reference, SA23-1345-01
► RSCT Administration Guide, SA22-7889-10
► RSCT for Linux Technical Reference, SA22-7893-10



7.1.2 Monitoring the Blue Gene/L database with CSM
When CSM is installed and configured on your SN (including the optional Blue Gene support), you can monitor the Blue Gene database for events of interest using a few simple commands.

Assuming we have a condition (BGNodeErr), first you need to associate the condition with a response (BroadcastEventsAnyTime) by running the following command:
# mkcondresp BGNodeErr BroadcastEventsAnyTime

Then, you have to start monitoring the condition by running the command:
# startcondresp BGNodeErr

► A condition is a persistent CSM monitoring construct that identifies what to monitor, and what to monitor for. Basically, BGNodeErr is concerned with the Blue Gene database table TBGLNode, and in particular, with TBGLNode row updates that set the status column to E (Error) or M (Missing).
► A response is a persistent CSM monitoring construct that identifies an action to take. In this example, BroadcastEventsAnyTime is a response that puts up a wall message for each event passed to it.
► An event is a dynamic CSM monitoring construct generated by CSM when a monitored condition's event expression evaluates true (that is, when whatever is being monitored for occurs).

By running startcondresp BGNodeErr, you are effectively telling CSM to monitor the Blue Gene database table TBGLNode for row updates that set the status column to E or M. This means that CSM is expected to generate an event whenever either type of row update occurs. Because you ran mkcondresp BGNodeErr BroadcastEventsAnyTime before that, CSM also knows that it should pass all such events to the BroadcastEventsAnyTime response, which, by design, puts up a wall message for each event passed to it.
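As a minimal sketch, the whole sequence on the management server looks like the following; the lscondresp verification step is our optional addition.

# On the CSM management server (the Blue Gene SN)
mkcondresp BGNodeErr BroadcastEventsAnyTime   # associate condition with response
startcondresp BGNodeErr                       # start monitoring the condition
lscondresp BGNodeErr                          # optional: the pair should show as active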

BGNodeErr is a predefined condition. CSM provides many predefined conditions. To get a list, simply run lscondition. To learn what a particular condition is for, run lscondition condition_name, as shown in Example 7-1.

Example 7-1 Displaying condition information
# lscondition BGNodeErr
Displaying condition information:
condition 1:
Name = "BGNodeErr"
Node = "c96m5sn02"
MonitorStatus = "Not monitored"
ResourceClass = "IBM.Sensor"
EventExpression = "SD.Uint32>0"
EventDescription = "An event will be generated when the node status is \"Error\" or \"Missing\" for an I/O or Compute Node in the Blue Gene system."
RearmExpression = ""
RearmDescription = ""
SelectionString = "Name==\"BGNodeErr\""
Severity = "c"
NodeNames = {}
MgtScope = "m"

BroadcastEventsAnyTime is a predefined response. CSM provides many predefined responses. To get a list, simply run lsresponse. To learn what a particular response does, run lsresponse response_name, as shown in Example 7-2.

Example 7-2 Displaying response information
# lsresponse BroadcastEventsAnyTime
Displaying response information:
ResponseName = "BroadcastEventsAnyTime"
Node = "c96m5sn02"
Action = "wallEvent"
DaysOfWeek = 1-7
TimeOfDay = 0000-2400
ActionScript = "/usr/sbin/rsct/bin/wallevent"
ReturnCode = 0
CheckReturnCode = "n"
EventType = "b"
StandardOut = "n"
EnvironmentVars = ""
UndefRes = "n"

7.1.3 Customizing the monitoring capabilities of CSM
The monitoring capabilities and the predefined conditions and responses that come with CSM are powerful and useful. However, CSM is also customizable. Before talking about the commands that you can use to customize the monitoring capabilities of CSM, we need to describe in more detail how CSM monitoring of the Blue Gene database works.



To simplify the earlier monitoring discussion, we purposely neglected to mention a few things. If you examine the output of lscondition BGNodeErr shown in Example 7-1, you will notice that there is no mention of the TBGLNode table, or of our interest in row updates that set the status column to E or M.

So where is this encoded? And how does the monitoring of the Blue Gene database really work? Let us consider the diagram in Figure 7-1.

Figure 7-1 Blue Gene CSM monitoring diagram (components, bottom to top: the Blue Gene database (DB2) and the Service Node software, a database trigger and a database stored procedure, and the CSM sensor, condition, and response)

At the base of the diagram are the Blue Gene database and the SN software that writes to the database. Everything above that comes with CSM or is created by CSM when you run various commands. At the top are two CSM monitoring constructs that we have talked about already: a response and a condition.

Below these is a CSM monitoring construct called a sensor, which we explain later. And below the sensor are two DB2 constructs: a stored procedure and a trigger. The sensor, condition, and response used are up to you. You can use predefined ones, ones that you define, or a mix of the two types. We discuss how to define your own later. The trigger and stored procedure are created automatically for you by CSM when you start monitoring with the startcondresp command mentioned in the previous section.

The large up-pointing arrow on the right simply indicates the overall flow. In layman's terms, the trigger watches the database for the thing of interest to happen. If and when that thing happens, the trigger gathers pertinent data and passes it to the stored procedure. The stored procedure is just a middleman that passes the data to the sensor.

When the condition becomes aware of new data in the sensor, the condition evaluates its EventExpression, which is based on sensor data. If the EventExpression evaluates true, an event is generated and passed to the response. The response then does its thing: it puts up a wall message, sends an e-mail, or runs a script, whatever it is defined to do.

Whoa! That is pretty complicated! Can’t you just insert your own database trigger and stored procedure into the database and initiate a desired action directly? The answer is "Yes," but you would need strong DB2 database administrator and programmer skills and the willingness to invest the time and effort to develop and test the necessary DB2 constructs and code.

Before introducing the CSM commands used to create custom monitoring constructs, let’s take a closer look at the construct called a sensor. As the diagram shows, a condition is paired with a sensor. Look again at the output of lscondition BGNodeErr shown in Example 7-1. Two attributes specify the sensor with which the condition BGNodeErr is paired:
ResourceClass = "IBM.Sensor"
SelectionString = "Name==\"BGNodeErr\""

That is, condition BGNodeErr is paired with a sensor of the same name. You can obtain sensor details by running the lssensor sensor_name command, as shown in Example 7-3.

Example 7-3 Displaying sensor information
# lssensor BGNodeErr
Name = BGNodeErr
ActivePeerDomain =
Command = /opt/csm/csmbin/bgmanage_trigger -t TBGLNODE -C STATUS -o u -x "n.STATUS = 'E' OR n.STATUS = 'M'" -p LOCATION,STATUS BGNodeErr
ConfigChanged = 0
ControlFlags = 5
Description = This sensor is updated when the node status is "Error" or "Missing" in the Blue Gene system. Use "SD.Uint32>0" as the event expression in all corresponding conditions.
ErrorExitValue = 1
ExitValue = 0
Float32 = 0
Float64 = 0
Int32 = 0
Int64 = 0
NodeNameList = {c96m5sn02}
RefreshInterval = 0
SD = [,0,0,0,0,0,0]
SavedData =
String =
Uint32 = 0
Uint64 = 0
UserName = bglsysdb

Here, you see that the sensor BGNodeErr's Command attribute identifies TBGLNode as the Blue Gene table of interest, and n.STATUS = 'E' OR n.STATUS = 'M' as the column values to watch for. When you start monitoring, the sensor's Command (bgmanage_trigger) is called with the arguments shown. It is bgmanage_trigger's job to create the required DB2 trigger and stored procedure.

The purpose of this exposé is to point out that three main CSM monitoring constructs are involved in monitoring the Blue Gene/L database: a sensor, a condition, and a response. There are many predefined ones that you can use, but if these do not meet your needs entirely, you can define your own.

7.1.4 Defining your own CSM monitoring constructs
In general, you define custom sensors with the CSM mksensor command. However, for Blue Gene/L sensors, you should use the CSM bgmksensor command instead, because it understands the Blue Gene/L database and provides all the flags and options that are needed for creating a Blue Gene/L database sensor. You define custom conditions and responses using the standard CSM mkcondition and mkresponse commands.



For example, suppose that you want to monitor the TBGLFanEnvironment table for high fan temperatures. CSM provides no predefined sensor for this type of monitoring. However, it is easy to define your own. On the SN, you can run this command:
# bgmksensor -t TBGLFanEnvironment -o i -x "n.temperature>35" -p location,temperature BGFanTempHi

This translates to:
Create a Blue Gene sensor named BGFanTempHi. Whenever a row is inserted in the TBGLFanEnvironment table with a temperature above 35 degrees Celsius, the sensor caches the fan location and temperature and notifies all conditions that care. (The values to cache are specified with the -p flag, and are the values passed to the response in the generated event notification.)

You then create a condition to monitor this new sensor. For example, run a command similar to the following on your CSM management server:
# mkcondition -d 'Generate an event when the temperature of a Blue Gene fan module rises above 35 degrees Celsius.' -r IBM.Sensor -s 'Name="BGFanTempHi"' -m m -e "SD.Uint32>0" BGFanTempHi

This command translates to:
Create a condition named BGFanTempHi to monitor the sensor with the same name. (The -m and -e flags must simply be set as shown for all conditions created to monitor Blue Gene sensors.)

To start monitoring, run the following commands on your CSM management server:
# mkcondresp BGFanTempHi MsgEventsToRootAnyTime "E-mail root off-shift" LogCSMEventsAnyTime
# startcondresp BGFanTempHi

After you have done all this, CSM creates the necessary DB2 trigger and stored procedure for you and turns monitoring on for condition BGFanTempHi automatically.
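Putting the pieces together, the complete customization sequence on the SN can be sketched as follows; the commands are the ones shown above, and the final lscondresp line is an optional verification step that we add here.

# Sketch: define and activate a custom high fan temperature monitor (run on the SN)
bgmksensor -t TBGLFanEnvironment -o i -x "n.temperature>35" \
           -p location,temperature BGFanTempHi
mkcondition -d 'Generate an event when the temperature of a Blue Gene fan module rises above 35 degrees Celsius.' \
            -r IBM.Sensor -s 'Name="BGFanTempHi"' -m m -e "SD.Uint32>0" BGFanTempHi
mkcondresp BGFanTempHi MsgEventsToRootAnyTime "E-mail root off-shift" LogCSMEventsAnyTime
startcondresp BGFanTempHi
lscondresp BGFanTempHi      # optional check: the condition/response pairs should be active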



To verify that all is working properly, check the following items:
► As root, run the following command:
lsaudrec -l
Near or at the bottom of the output, you should see the information shown in Example 7-4.

Example 7-4 Checking monitoring conditions
# lsaudrec -l
.....>>> Omitted lines <<<.....


With monitoring on, each time a row is added to the TBGLFanEnvironment table with a temperature above 35 degrees Celsius, an event is generated. Due to the mkcondresp command (see “Monitoring the Blue Gene/L database with CSM” on page 386), the event kicks off three responses on the management server:
– A message to root announcing the event (Example 7-7).

Example 7-7 Message to root user when event BGFanTempHi happens
Message from root@c96m5sn02 on at 15:01 ...
Critical Event occurred:
Condition: BGFanTempHi
Node: c96m5sn02.ppd.pok.ibm.com
Resource: BGFanTempHi
Resource Class: Sensor
Resource Attribute: SD
Attribute Type: CT_SD_PTR
Attribute Value: [location=J302 temperature=3.6E1,0,1,0,0,0,0]
Time: Friday 07/21/06 15:00:59

– An e-mail to the root user (on the CSM management server) with the event details, if the event occurs off-shift.
– Logging of the event in the /var/log/csm/systemEvents file.

The e-mail and log entry are shown in Example 7-8.

Example 7-8 The e-mail and log entry for the BGFanTempHi event
Friday 07/21/06 15:00:59
Condition Name: BGFanTempHi
Severity: Critical
Event Type: Event
Expression: SD.Uint32>0
Resource Name: BGFanTempHi
Resource Class: IBM.Sensor
Data Type: CT_SD_PTR
Data Value: ["location=J302 temperature=3.6E1",0,1,0,0,0,0]
Node Name: c96m5sn02.ppd.pok.ibm.com
Node NameList: {c96m5sn02.ppd.pok.ibm.com}
Resource Type: 0



7.1.5 Miscellaneous related information
If you are interested in examining the DB2 trigger and stored procedure that CSM creates for you, run bgmksensor with the -v flag (bgmksensor formulates the SQL statements used to create the trigger and stored procedure before they are actually needed so that it can test them), or query the Blue Gene database directly (after you have started monitoring with the startcondresp command).

The database trigger name is derived from the sensor name by appending the _CSM suffix (for example, for a sensor named BGFanTempHi, there will be a trigger named BGFanTempHi_CSM when that sensor is actively used in monitoring).

The database stored procedure story is actually more complicated than we have described. We created two stored procedures, one named COMMON_BGP and the other COMMON_BGP_ext. The trigger calls COMMON_BGP, which in turn calls COMMON_BGP_ext. COMMON_BGP exists to catch any SQL exceptions that occur. COMMON_BGP_ext calls a utility named refresh_sensor in a shared library named bgrefresh_sensor.so. refresh_sensor writes the data into the sensor.

We created yet another DB2 construct that we have not mentioned because it plays a minor role: a DB2 sequence. Its name is also derived from the sensor name by appending _CSM (for example, for a sensor named BGFanTempHi, there will be a sequence named BGFanTempHi_CSM). The trigger uses the sequence to obtain a new number for each event it forwards to COMMON_BGP.

After you have started monitoring, you can log on to the SN as bglsysdb and run the commands shown in Example 7-9.

Example 7-9 Examining the constructs created by CSM in the Blue Gene database
bglsysdb@bglsn~> db2 connect to bgdb0
bglsysdb@bglsn~> db2 "select text from syscat.triggers where trigname = 'BGFANTEMPHI_CSM'"
bglsysdb@bglsn~> db2 "select procname,text from syscat.procedures where procname like 'COMMON_BGP%'"
bglsysdb@bglsn~> db2 "select * from syscat.sequences where seqname = 'BGFANTEMPHI_CSM'"

7.1.6 Conclusion
CSM brings powerful and customizable monitoring and automated response capabilities to the Blue Gene/L environment. By exploiting them, you can minimize or eliminate much of the manual problem determination work that often faces the Blue Gene/L system administrator.

Furthermore, because these capabilities are extensions to existing CSM capabilities, you can easily monitor and set up automated responses for non-Blue Gene/L database problems as well. For example, get paged when the /var file system on your SN fills up; send an urgent e-mail to the appropriate person when the number of users on a FEN crosses some threshold; run a script when the network adapter on a File Server is being overwhelmed. There are many other examples. Moving beyond monitoring, CSM offers a wealth of other capabilities that could help you manage your Blue Gene/L systems, such as distributed command execution, configuration file management, software maintenance, and so forth.

7.2 Secure shell
This section begins with a short introduction to cryptographic techniques in a computing environment, continues with a secure shell overview, and ends with an example of how to use secure shell in a clustering environment.

7.2.1 Basic cryptography
One of the biggest problems in a networking environment is to design and implement a security mechanism that allows the whole computing environment to function properly, without interruptions, and that also ensures reliable data manipulation (data you can trust). Depending on the information (data) travelling across networks, you need to make sure that:
► The data gets from sender to receiver unaltered: data integrity.
► The data cannot be interpreted (understood) by anyone eavesdropping on the communication channel: data privacy.
► The data arriving at the receiver really comes from whom the receiver thinks it is coming from: data authenticity.

For all these reasons and more, a series of cryptographic techniques has been developed and implemented. These techniques are based on mathematical algorithms translated (programmed) into computer language. Thus, security has become an important component of a highly available and reliable computing environment.



The cryptographic techniques are employed for data integrity, data privacy, and data authenticity in various forms and complexities, depending on the data security required:
► Authentication (verifying identities)
The parties communicating need to know and verify (with a reasonable level of trust) each other's identity.
► Authorization (access control)
After the parties' identities have been established, this establishes what actions we allow someone to perform.
► Data signing
In addition to establishing the identities at the beginning of the communication session, it is also necessary to avoid the "man in the middle" attack.
► Data encryption
If we also want to prevent someone else (a third party) from understanding the data, we need to encrypt it in such a way that it can only be decrypted at the destination.
► Accountability
For multiple reasons, we need to be able to trace back any system activity.

It is also important to establish the effective level of security, in order to ensure an acceptable performance level. Sometimes too much security can be disruptive, because the systems can spend more time enforcing security than computing data.

The encryption-based mechanisms used in securing network communication are:
► Symmetric key
► Public/private key pair (also known as PKI, Public Key Infrastructure)
► Hash functions
► Combinations of these techniques
► Hardware cryptography



Symmetric key cryptography
Symmetric key cryptography (Figure 7-2) uses a single (secret) key known by both parties in the communication. It is relatively fast, but it has a drawback: the shared secret (key) must somehow be distributed to the communicating parties. If this shared secret is distributed over the (insecure) network, additional precautions must be taken.

Figure 7-2 Symmetric key encryption (Nina sends a message to Dan over an unsecured network; both parties use Dan and Nina's identical key)

Algorithms used for symmetric key implementations include:
► Data Encryption Standard (DES) algorithm
The most commonly used bulk cipher or block cipher algorithm, which was developed by IBM.
► Commercial Data Masking Facility (CDMF)
A method to shrink a 56-bit DES key to a 40-bit key suitable for export, which was also designed by IBM.
► Triple DES
The same as DES, but the information is encrypted three times in a row, using a different key each time.
► RC2/RC4 algorithms
RC2 is a block cipher (similar to DES); RC4 is a stream cipher (possibly with a 40-bit key). Both were developed by Ron Rivest (RSA Data Security) and permit variable-length keys.
► International Data Encryption Algorithm (IDEA)
Has a 128-bit key and is not a government-imposed standard. Pretty Good Privacy (PGP) uses IDEA and is freely available.

Public Key Infrastructure
Public Key Infrastructure (PKI) is based on a pair of asymmetric keys for each party in the communication:
► A private key, which never leaves the host where it was generated.
► A public key, which is sent over the network to the corresponding party for various reasons, such as digital signature, authentication, and so forth.

In PKI, information encrypted with one key (either public or private) can only be decrypted with its pair. There is no need to securely share a secret between the sender and receiver; however, this mechanism is much less efficient than symmetric key encryption, thus it is not suitable for bulk data encryption.

The only widely used general-purpose public key mechanism is the Rivest, Shamir, and Adleman (RSA) algorithm, which relies on the factorization of large numbers and is the property of RSA Data Security, Inc.

Figure 7-3 presents a simplified diagram of how PKI is used. In this case, Nina is sending an encrypted message (data) to Dan. For this, Nina uses Dan's public key, thus ensuring that the data can only be decrypted by Dan. We do not show here other mechanisms, such as how to make sure that Nina has received Dan's public key, or how Dan can be sure that the message comes from Nina and that it has not been altered while travelling through the communication channel.



Figure 7-3 PKI - Nina sending a message to Dan (Nina encrypts with Dan's public key; Dan decrypts using his own private key; an initial key exchange, from Dan to Nina only, sends Dan's public key over the unsecured network)

Secure shell is based on PKI and uses several of these techniques to make sure that the data that gets from Nina to Dan can be trusted.

7.2.2 Secure shell basics

Secure shell is a client-server tool that makes secure system administration operations possible across complex networks. Secure shell employs a number of cryptographic techniques and provides a series of facilities (functions, protocols, and so forth) for system administrators to use in their job.

Besides basic remote login, it also provides functions such as remote command execution, remote file copy, and more sophisticated techniques, such as tunnelling (an encrypted communication channel) for various other applications.

In this section, we provide basic information about the secure shell server and client.



Secure shell server
The secure shell server runs as a service (daemon) on the host system and provides the infrastructure that allows incoming clients to connect and perform various administrative actions.

Figure 7-4 shows the major dependencies of the secure shell server (sshd). Usually, sshd is started as a service at system boot time using the /etc/rc.d/init.d/sshd script (depending on the init runtime mode).

Figure 7-4 Secure shell server dependencies (the sshd daemon, listening on TCP port 22, depends on the server keys (rsa1, rsa2, dsa), the server configuration files, the SSL libraries, the server binaries and commands, the user authentication configuration, and the system authentication files and libraries)

These dependencies are:
► Secure socket layer (SSL) libraries, commands, and headers
These provide the cryptographic functions and commands for various operations (key generation, encryption/decryption).
► Server binaries and scripts
This is the actual server code (executables, libraries, and scripts).


► Server configuration files
These are used by the server to create its runtime environment and are usually located in the /etc/ssh directory.
► System authentication files and libraries
The secure shell server uses these files to pass authentication requests to the system (user/password).
► Server keys
As a daemon (service), the ssh server runs under the root user ID and represents an entity with its own identity; thus it has its own pair of keys (actually, it has three pairs of keys: rsa1, rsa2, and dsa), also known as the ssh host keys.
► User authentication files
These are actually located in each user's ~/.ssh directory, and they represent the identities (users) known to the server that can be authenticated using their public keys (the ssh server "knows" a client's identity based on the client's public key).
► TCP port 22
This is the default port used by the ssh server to listen for incoming requests. It is configurable in /etc/ssh/sshd_config.
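For illustration only, a minimal /etc/ssh/sshd_config excerpt looks like the following; the values shown are common defaults, not settings taken from the systems described in this book.

# /etc/ssh/sshd_config (excerpt) - illustrative values only
Port 22                      # TCP port on which sshd listens
Protocol 2                   # accept only SSH protocol version 2
PermitRootLogin yes          # needed here because root runs remote commands
PubkeyAuthentication yes     # allow public key (unprompted) authentication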

Secure shell client
The secure shell client provides (among others) a set of tools used for executing administrative tasks on remote systems. On a Linux (and AIX) system these are:
► /usr/bin/scp
Secure remote copy program; replaces /usr/bin/rcp.
► /usr/bin/sftp
Secure file transfer program; replaces /usr/bin/ftp. However, it has fewer features than classic ftp.
► /usr/bin/slogin
Secure remote login program; replaces /usr/bin/rlogin.
► /usr/bin/ssh
Secure shell program; replaces /usr/bin/rsh.



Secure shell client programs also depend on the SSL libraries to perform data encryption/decryption. The client software has a general configuration file, /etc/ssh/ssh_config, and a directory for each user (~/.ssh) for storing the user's pair(s) of keys and authentication files:
► known_hosts
This file stores the public keys of the servers the user has connected to. Besides the keys, you can also specify certain options for connections to remote servers.
► authorized_keys
Even though this file is stored in the user's ~/.ssh directory, it is actually used by the local ssh daemon, and it contains the public keys of the remote identities (user@host) allowed to execute commands on the local operating system. It can also contain various options, such as command redirection, login parameters, and so forth.

Note: The ~/.ssh/authorized_keys (or authorized_keys2) file is actually used by the local ssh server (sshd).
► The user's ssh client configuration file (optional), if you want to specify additional configuration parameters (besides the ones in /etc/ssh/ssh_config).

7.2.3 Sample configuration in a cluster environment
In a cluster environment, because the nodes (systems) have to work together, some type of remote command execution must be set up between the nodes. Generally, such a program must allow known identities (users and services) to access remote services across the network based on unprompted authentication, that is, establishing the remote identity without an interactive prompt for a user name or password.



Figure 7-5 presents the files involved in remote command execution using secure shell with unprompted authentication.

Figure 7-5 The ssh files involved in un-prompted authentication (on node1, the ssh client, user root@node1 uses /usr/bin/ssh together with ~/.ssh/id_rsa, ~/.ssh/id_rsa.pub, and ~/.ssh/known_hosts; on node2, the ssh server, sshd listens on TCP port 22 and uses /etc/ssh/ssh_host_rsa_key, /etc/ssh/ssh_host_rsa_key.pub, and root@node2's ~/.ssh/authorized_keys)

In the diagram, we assume that user root@node1 wants to execute a command (date) on node2, also as user root (root@node2):
root@node1_# ssh root@node2 date

We assume that no previous configuration was done and that no initial trust has been established. Thus, the following happens:
► User root@node1 sends its identity to the server, specifying that it wants to connect as root@node2.
► The server (sshd) on node2 sends its public key (/etc/ssh/ssh_host_rsa_key.pub) to the terminal that was used to initiate the connection, asking the user (root@node1) whether to accept the key (you have to explicitly type in "yes").
► When accepted, the key is stored in the ~/.ssh/known_hosts file of the root@node1 user, and a session key is generated. The session key is used to encrypt the information transmitted during this session (until the connection closes).

Note: At this point in time, the ~/.ssh directory might not even exist. When you accept the server's key, the directory is created along with the known_hosts file.
► The server looks for the root@node2 user's ~/.ssh/authorized_keys file.
► If this file exists, the server checks it for root@node1's public key (which has not yet been created on node1).
► Because it cannot authenticate the user, it passes control to the system authentication, which prompts for the root@node2 password.
► When the password is typed in correctly, the date command is executed and its result is returned to the root@node1 user's terminal.

However, if the configuration and initial trust have been performed, the date command is executed without prompting the user for a password.

To achieve this, you have to do the following:
► On node1, as root, generate the ssh client keys, type rsa, no passphrase:
root@node1 #/usr/bin/ssh-keygen -t rsa -f ~/.ssh/id_rsa -N ''
► On node1, as the root user, grab node2's ssh server key and store it in the local root user's known_hosts file:
root@node1 #/usr/bin/ssh-keyscan -t rsa node2 >> ~/.ssh/known_hosts
► On node1, send the root user's public key previously generated to root@node2:
root@node1 #/usr/bin/scp ~/.ssh/id_rsa.pub root@node2:~/.ssh/node1_rsa.pub
► On node2, add the public key received to the root user's authorized_keys file:
root@node2 #cat ~/.ssh/node1_rsa.pub >> ~/.ssh/authorized_keys

Now, you can execute remote commands as root@node2, from root@node1, without a password.
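The same four steps, run from node1 only, can be sketched as follows; this assumes that root can still type a password for the two prompted steps, that the ~/.ssh directory already exists on node2, and that the helper file name node1_rsa.pub is the one used in the steps above.

#!/bin/bash
# Sketch: one-way unprompted ssh from root@node1 to root@node2
NODE=node2
/usr/bin/ssh-keygen -t rsa -f ~/.ssh/id_rsa -N ''          # client key pair
/usr/bin/ssh-keyscan -t rsa $NODE >> ~/.ssh/known_hosts    # trust node2's host key
/usr/bin/scp ~/.ssh/id_rsa.pub root@$NODE:~/.ssh/node1_rsa.pub
/usr/bin/ssh root@$NODE \
    "cat ~/.ssh/node1_rsa.pub >> ~/.ssh/authorized_keys"   # last prompted step
/usr/bin/ssh root@$NODE date                               # should no longer prompt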

Attention: For simplicity, our example refers to a one-way configuration, that is, the root user, using the ssh client on node1, executes a command (without interactive authentication) as root on node2.

If you want user root@node2 to be able to run a remote command (without interactive authentication) as root on node1, you need to configure symmetric files on both nodes.

Objective
Because many of today's clusters are connected through multiple networks, we would like our cluster to be as secure as possible and, at the same time, to allow access to remote resources in the cluster as seamlessly as possible. For this purpose, secure shell offers a reasonable solution with a good compromise for the security/administrative overhead ratio; thus, it is the de facto standard tool for remote administration tasks. However, some basic security knowledge and a good understanding of the secure shell client-server implementation are required to achieve good results.

Our objective is to set up a cluster for remote command execution for the root user on all nodes without interactive authentication, using a single set of keys for all ssh servers (daemons) running on every node in the cluster.

For our exercise, we used the sample cluster shown in Figure 7-6.

Figure 7-6 Sample cluster configuration (three nodes: p630n01 at 172.16.1.31, p630n02 at 172.16.1.32, and p630n03 at 172.16.1.33)

The three nodes run AIX 5.3 TL4, openssl 0.9.7-2g, and openssh 4.1. As it is outside the scope of this material, we do not explain how to install the software or configure the basic OS.

Checking and setting up the configuration
We started from scratch with ssh (no previous connection, nor any authentication configuration). We performed the following steps:
► On node p630n01, we checked the server configuration for the location of the server keys, then generated three pairs of keys for the secure shell server (rsa1, rsa2, and dsa):
root@p630n01_#/usr/bin/ssh-keygen -t rsa1 -f /etc/ssh/ssh_host_key -N ''
root@p630n01_#/usr/bin/ssh-keygen -t rsa -f /etc/ssh/ssh_host_rsa_key -N ''
root@p630n01_#/usr/bin/ssh-keygen -t dsa -f /etc/ssh/ssh_host_dsa_key -N ''
► We restarted the ssh daemon (sshd) on p630n01 so that it picks up the new keys.
► We propagated all three pairs of keys in the /etc/ssh directory to p630n02 and p630n03 (using scp).



► We restarted the ssh daemons on p630n02 and p630n03.

Note: If you work remotely from a single node, it is a good idea to restart the entire node if you cannot restart just the ssh daemon.
► To avoid any conflicts, we wiped out the contents of the root user's ~/.ssh/known_hosts file on p630n01:
root@p630n01_# > ~/.ssh/known_hosts
► We gathered the ssh server public keys from all nodes into a new known_hosts file. For this, we used a file (my_nodes) containing the host names (IP labels) of nodes p630n01, p630n02, and p630n03 (a text file, one host name per line):
root@p630n01_# /usr/bin/ssh-keyscan -t rsa -f my_nodes > ~/.ssh/known_hosts

Note: The format of the known_hosts file uses one line per known host. Using the ssh-keyscan command assumes that all nodes are up and running (sshd as well). However, if not all nodes are up, or if you have tens of nodes (or more) in your cluster and you plan to use a single pair of keys for all machines in your cluster, you can use wildcard characters to specify the host part of the key in the known_hosts file, using one line (and one key) to cover all hosts in a specified IP range. For details, check the ssh man pages.
► We generated the root user's pair of keys. For this, we chose the rsa2 (which is, in fact, the rsa type) algorithm:
root@p630n01_#/usr/bin/ssh-keygen -t rsa -f ~/.ssh/id_rsa -N ''
► We copied the public key into a fresh authorized_keys file (if you already have such a file, you need to check it for duplicates and correct entries):
root@p630n01_# cat ~/.ssh/id_rsa.pub > ~/.ssh/authorized_keys
► We propagated the entire contents of the ~/.ssh directory to all nodes in the cluster (p630n01, p630n02, and p630n03).
► We tested the authentication.
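A minimal sketch of the propagation and test steps follows; it assumes the my_nodes file described above, that the host keys in /etc/ssh and root's ~/.ssh directory have already been prepared on p630n01, and that the first pass might still prompt for passwords until the keys are in place. The sshd restart method varies by operating system; the AIX SRC commands shown are an example.

#!/bin/bash
# Sketch: push the common ssh host keys and root's ~/.ssh to every node, then test
for node in $(cat my_nodes); do
    scp /etc/ssh/ssh_host_*key* root@$node:/etc/ssh/   # same host keys everywhere
    scp -r ~/.ssh root@$node:~/                        # known_hosts, id_rsa*, authorized_keys
    ssh root@$node "stopsrc -s sshd; startsrc -s sshd" # restart sshd (AIX SRC example)
done
for node in $(cat my_nodes); do
    ssh root@$node date                                # must not prompt for a password
done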

7.2.4 Using ssh in a Blue Gene/L environment
In a Blue Gene/L environment, secure shell is used to allow remote command execution between the SN and the I/O Nodes (mainly, but not only, for GPFS). It can also be used for mpirun (between the FENs and the SN).

As a particularity in Linux, when using remote command execution between the FENs and the SN, the rsh/rshd pair does not allow you to log in interactively to the SN, because the rlogind (remote login daemon) and telnetd are not usually running on the SN (the default configuration in SUSE SLES9). However, if you plan to use ssh as the remote command execution program between the FENs and the SN, you need a special configuration for the authorized_keys file of the users allowed to execute remote commands on the SN.

As previously mentioned, the format of the authorized_keys file allows you to customize the behavior of remote command execution with ssh (see the ssh man pages).

Tip: Using the no-pty option in front of the public key in the authorized_keys file allows you to execute remote commands without being prompted for a password. However, you will not get a pseudo-terminal (pty) if you try to open a shell (ssh without any command).
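For illustration, such an authorized_keys entry is a single line similar to the following; the key material is abbreviated and the trailing comment field is only an example.

# ~/.ssh/authorized_keys on the SN (one entry per line; key shortened here)
no-pty ssh-rsa AAAAB3NzaC1yc2EAAA...truncated...== root@bglfen1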

For detailed information about Open Secure Shell, credits, and the latest news, see:
http://www.openssh.org


Appendix A. Installing and setting up LoadLeveler for Blue Gene/L

This appendix describes the steps for installing and setting up LoadLeveler for the Blue Gene/L system built during the writing of this redbook. It shows one way of setting up and configuring a LoadLeveler cluster. Special attention is given to Blue Gene/L specific items and procedures, such as:
► The additional 32-bit library RPM
► The specific option flag when running the installation script
► Blue Gene/L configuration keywords
► Blue Gene/L environment variables

At the end of the setup process, the message Blue Gene is present is a key indication of a successful installation and configuration of LoadLeveler.



Installing LoadLeveler on the SN and FENs
This section explains how to install LoadLeveler on the SN and FENs.

Obtaining the rpms
The RPMs can come from the provided CD-ROM. The upgrade rpms can also be downloaded from the IBM service support Web site:
http://www14.software.ibm.com/webapp/set2/sas/f/loadleveler/download/unix.html

There are five required rpms for LoadLeveler. Example A-1 lists the rpms for the IBM System p platform running SLES9.
Example: A-1 LoadLeveler RPMs<br />

LoadL-full-SLES9-PPC64-3.3.1.0-3.ppc64.rpm<br />

LoadL-full-license-SLES9-PPC64-3.3.1.0-3.ppc64.rpm<br />

LoadL-so-SLES9-PPC64-3.3.1.0-3.ppc64.rpm<br />

LoadL-so-license-SLES9-PPC64-3.3.1.0-3.ppc64.rpm<br />

LoadL-full-lib-SLES9-PPC-3.3.1.0-3.ppc.rpm<br />

Note: LoadL-full-lib-SLES9-PPC-3.3.1.0-3.ppc.rpm is required for Blue<br />

Gene/L only.<br />

On the download Web site, click View to see information about the upgrade<br />

(Figure A-1).<br />

Figure A-1 Downloading LoadLeveler rpms from the Web<br />

In addition to the five RPMs, the Java rpm has to be present in the same<br />

directory. Although the system has a later version of Java installed, the following<br />

rpm is required:<br />

IBMJava2-JRE-ppc64-1.4.2-0.0.ppc64.rpm<br />



Note: Without this Java rpm, the install_ll script does not run. See IBM TWS LoadLeveler Installation Guide, GI10-0763-02, for further information.

Installing the rpms
First, the following command installs the LoadLeveler license in the directory /opt/ibmll/LoadL/:
rpm -ivh LoadL-full-license-SLES9-PPC64-3.3.1.0-3.ppc64.rpm

Then, locate the install_ll script in /opt/ibmll/LoadL/sbin/ and invoke it as follows:
./install_ll -y -b -d <rpm_directory>
where <rpm_directory> is the directory containing the LoadLeveler rpms.

Note: The option flag -b is important; it tells install_ll to install the LoadL-full-lib-SLES9-PPC-3.3.1.0-3.ppc.rpm. Without this flag specified, install_ll installs only the rpms for a regular SLES9 node.

The same installation procedure has to be repeated on all nodes. If available, dsh can be used to perform the installation on remote nodes.
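A minimal loop-based sketch of repeating the installation on the Front-End Nodes from the SN is shown below; the host names match the environment described later in this appendix, and the rpm directory path is only an example.

#!/bin/bash
# Sketch: repeat the LoadLeveler installation on the FENs over ssh
RPMDIR=/bgl/loadl_rpms      # directory holding the LoadLeveler (and Java) rpms - example path
for node in bglfen1 bglfen2; do
    ssh $node "rpm -ivh $RPMDIR/LoadL-full-license-SLES9-PPC64-3.3.1.0-3.ppc64.rpm"
    ssh $node "/opt/ibmll/LoadL/sbin/install_ll -y -b -d $RPMDIR"
done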

Setting up the LoadLeveler cluster
This section explains how to set up the LoadLeveler cluster.

LoadLeveler user and group IDs
LoadLeveler requires one user ID to be the administrator. It is also recommended that a group ID be created for the purpose of running LoadLeveler. If a group ID does not already exist, create a loadl group with the following command:
groupadd -g 7000 loadl

Note: The group ID 7000 is arbitrary. However, it has to be the same on all nodes in the cluster.

If a user ID does not already exist, create a loadl user ID with the following command:
useradd -d /home/loadl -g loadl -u 7001 -p loadl loadl



Create the home directory for user loadl. This is a local file system:
mkdir /home/loadl

LoadLeveler configuration
Create a common (NFS mounted) directory to contain the LoadLeveler configuration files:
mkdir /bgl/loadlcfg

Copy the two provided sample files, LoadL_admin and LoadL_config, from /opt/ibmll/LoadL/full/samples/ to /bgl/loadlcfg and make the appropriate changes.

Create the file /etc/LoadL.cfg with content similar to Example A-2.

Example: A-2 Content of /etc/LoadL.cfg
LoadLUserid = loadl
LoadLGroupid = loadl
LoadLConfig = /bgl/loadlcfg/LoadL_config

Create the LoadLeveler directories, such as log, spool, and execute, under /home/loadl by issuing the following command as user loadl:
/opt/ibmll/LoadL/full/bin/llinit -local /home/loadl

Note: The llinit command also creates a link to the LoadLeveler bin directory from /home/loadl, so that user loadl can invoke the LoadLeveler commands such as llctl, llstatus, llq, and so forth, without requiring the directory to be added to the $PATH variable.

Again, the llinit command has to be run on all nodes.

Enable rsh on all nodes. Then, start LoadLeveler with the following command:
llctl -g start



Enabling Blue Gene/L capabilities in LoadLeveler
Up to this point, we have a regular LoadLeveler cluster on these Linux nodes. LoadLeveler does not know anything about Blue Gene/L yet.

To tell LoadLeveler about the Blue Gene/L system, add these Blue Gene/L specific keywords to the global configuration file /bgl/loadlcfg/LoadL_config:
BG_ENABLED = true
BG_ALLOW_LL_JOBS_ONLY = false
BG_MIN_PARTITION_SIZE = 32
BG_CACHE_PARTITIONS = true

Now, stop and start LoadLeveler, and it recognizes Blue Gene/L. However, it might say "Blue Gene is absent", because LoadLeveler cannot find the relevant Blue Gene/L libraries yet. See Figure 4-12 on page 172 for the message Blue Gene is present displayed by the llstatus command.

To create the appropriate symbolic links for the libraries that LoadLeveler needs, run the following script as the root user:
/home/loadl/bglinks

The system administrator has to create this file and run it once on every node. The contents of the script are described in this redbook; see 4.4.5, “Making the Blue Gene/L libraries available to LoadLeveler” on page 173.

Setting Blue Gene/L specific environment variables
Set up the following environment variables in the .bashrc for user loadl:
export BRIDGE_CONFIG_FILE=/bgl/BlueLight/ppcfloor/bglsys/bin/bridge.config
export DB_PROPERTY=/bgl/BlueLight/ppcfloor/bglsys/bin/db.properties
export MMCS_SERVER_IP=bglsn.itso.ibm.com

Note: The actual directory paths vary on different systems.

Now stop and restart LoadLeveler. It should say Blue Gene is present. See Figure 4-12 on page 172 and 4.4.9, “LoadLeveler checklist” on page 186 for detailed descriptions of LoadLeveler status checking.
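As a minimal check sequence (the commands are the ones used throughout this appendix; the grep filter is our addition):

# Restart LoadLeveler and confirm that Blue Gene support is active
su - loadl -c "llctl -g stop"
su - loadl -c "llctl -g start"
su - loadl -c "llstatus" | grep -i "blue gene"   # should report: Blue Gene is present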



Example A-3 presents a sample LoadL_admin file we used in our environment.

Example: A-3 Sample LoadL_admin file
# LoadL_admin file: Remove comments and edit this file to suit your installation.
# This file consists of machine, class, user, group and adapter stanzas.
# Each stanza has defaults, as specified in a "defaults:" section.
# Default stanzas are used to set specifications for fields which are
# not specified.
# Class, user, group, and adapter stanzas are optional. When no adapter
# stanzas are specified, LoadLeveler determines adapters dynamically. Refer to
# Using and Administering LoadLeveler for detailed information about
# keywords and associated values. Also see LoadL_admin.1 in the
# ~loadl/samples directory for sample stanzas.
#############################################################################
# DEFAULTS FOR MACHINE, CLASS, USER, AND GROUP STANZAS:
# Remove initial # (comment), and edit to suit.
#
default: type = machine

default: type = class                     # default class stanza
         wall_clock_limit = 30:00         # default wall clock limit

default: type = user                      # default user stanza
         default_class = No_Class         # default class = No_Class (not optional)
         default_group = No_Group         # default group = No_Group (not optional)
         default_interactive_class = inter_class

default: type = group                     # default group stanza
#        priority = 0                     # default GroupSysprio
#        maxjobs = -1                     # default maximum jobs group is allowed
#                                         # to run simultaneously (no limit)
#        maxqueued = -1                   # default maximum jobs group is allowed
#                                         # on system queue (no limit). Does not
#                                         # limit jobs submitted.
#############################################################################
# MACHINE STANZAS:
# These are the machine stanzas; the first machine is defined as
# the central manager. mach1:, mach2:, etc. are machine name labels -
# revise these placeholder labels with the names of the machines in the
# pool, and specify any schedd_host and submit_only keywords and values
# (true or false), if required.
#############################################################################
bglsn.itso.ibm.com:   type = machine
                      schedd_host = true
                      central_manager = true
bglfen1.itso.ibm.com: type = machine
                      schedd_host = true
bglfen2.itso.ibm.com: type = machine
                      schedd_host = true

Appendix A. Installing and setting up LoadLeveler for Blue Gene/L 415


Example A-4 presents a sample LoadL_config file we used in our environment.<br />

Example: A-4 Sample LoadL_config file<br />

#<br />

# Machine Description<br />

#<br />

ARCH = PPC64<br />

#<br />

# Blue Gene Specific Settings<br />

#<br />

BG_ENABLED = true<br />

BG_ALLOW_LL_JOBS_ONLY = false<br />

#BG_ALLOW_LL_JOBS_ONLY = true<br />

BG_MIN_PARTITION_SIZE = 32<br />

BG_CACHE_PARTITIONS = true<br />

#<br />

# Specify LoadLeveler Administrators here:<br />

#<br />

LOADL_ADMIN = loadl root<br />

#<br />

# Default to starting LoadLeveler daemons when requested<br />

#<br />

START_DAEMONS = TRUE<br />

#<br />

# Machine authentication<br />

#<br />

# If TRUE, only connections from machines in the ADMIN_LIST are accepted.
# If FALSE, connections from any machine are accepted. Default if not
# specified is FALSE.

#<br />

MACHINE_AUTHENTICATE = FALSE<br />

#<br />

# Specify which daemons run on each node<br />

#<br />

SCHEDD_RUNS_HERE = True<br />

STARTD_RUNS_HERE = True<br />

# Specify pathnames<br />



#<br />

RELEASEDIR = /opt/ibmll/LoadL/full<br />

LOCAL_CONFIG = $(tilde)/LoadL_config.local<br />

ADMIN_FILE = /bgl/loadlcfg/LoadL_admin<br />

LOG = $(tilde)/log<br />

SPOOL = $(tilde)/spool<br />

EXECUTE = $(tilde)/execute<br />

HISTORY = $(SPOOL)/history<br />

RESERVATION_HISTORY = $(SPOOL)/reservation_history<br />

BIN = $(RELEASEDIR)/bin<br />

LIB = $(RELEASEDIR)/lib<br />

#<br />

# Specify port numbers<br />

#<br />

MASTER_STREAM_PORT = 9616<br />

NEGOTIATOR_STREAM_PORT = 9614<br />

SCHEDD_STREAM_PORT = 9605<br />

STARTD_STREAM_PORT = 9611<br />

COLLECTOR_DGRAM_PORT = 9613<br />

STARTD_DGRAM_PORT = 9615<br />

MASTER_DGRAM_PORT = 9617<br />

#<br />

# Specify a scheduler type: LL_DEFAULT, API, BACKFILL, GANG<br />

# API specifies that internal LoadLeveler scheduling algorithms be<br />

# turned off and LL_DEFAULT specifies that the original internal<br />

# LoadLeveler scheduling algorithm be used.<br />

#<br />

SCHEDULER_TYPE = BACKFILL<br />

#<br />

# Specify accounting controls<br />

# To turn reservation data recording on, add the flag A_RES to ACCT<br />

#<br />

ACCT = A_OFF A_RES<br />

ACCT_VALIDATION = $(BIN)/llacctval<br />

GLOBAL_HISTORY = $(SPOOL)<br />

#<br />

# Specify checkpointing intervals<br />

#<br />

MIN_CKPT_INTERVAL = 900<br />

MAX_CKPT_INTERVAL = 7200<br />

# perform cleanup of checkpoint files once a day<br />

# 24 hrs x 60 min/hr x 60 sec/min = 86400 sec/day<br />



CKPT_CLEANUP_INTERVAL = 86400<br />

# sample source for the ckpt file cleanup program is shipped with LoadLeveler

# and is found in: /usr/lpp/LoadL/full/samples/llckpt/rmckptfiles.c<br />

#<br />

# compile the source and indicate the location of the executable<br />

# as shown in the following example<br />

CKPT_CLEANUP_PROGRAM = /u/mylladmin/bin/rmckptfiles<br />

# LoadL_KeyboardD Macros<br />

#<br />

KBDD = $(BIN)/LoadL_kbdd<br />

KBDD_LOG = $(LOG)/KbdLog<br />

MAX_KBDD_LOG = 64000<br />

KBDD_DEBUG =<br />

#<br />

# Specify whether to start the keyboard daemon<br />

#<br />

#if HAS_X<br />

X_RUNS_HERE = True<br />

#else<br />

X_RUNS_HERE = False<br />

#endif<br />

#<br />

# LoadL_StartD Macros<br />

#<br />

STARTD = $(BIN)/LoadL_startd<br />

STARTD_LOG = $(LOG)/StartLog<br />

MAX_STARTD_LOG = 64000<br />

STARTD_DEBUG =<br />

POLLING_FREQUENCY = 5<br />

POLLS_PER_UPDATE = 24<br />

JOB_LIMIT_POLICY = 120<br />

JOB_ACCT_Q_POLICY = 300<br />

PROCESS_TRACKING = FALSE<br />

PROCESS_TRACKING_EXTENSION = $(BIN)<br />

#ifdef KbdDeviceName<br />

KBD_DEVICE = KbdDeviceName<br />

#endif<br />

#ifdef MouseDeviceName<br />



MOUSE_DEVICE = MouseDeviceName<br />

#endif<br />

#<br />

# LoadL_SchedD Macros<br />

#<br />

SCHEDD = $(BIN)/LoadL_schedd<br />

SCHEDD_LOG = $(LOG)/SchedLog<br />

MAX_SCHEDD_LOG = 64000<br />

SCHEDD_DEBUG =<br />

SCHEDD_INTERVAL = 120<br />

CLIENT_TIMEOUT = 30<br />

#<br />

# Negotiator Macros<br />

#<br />

NEGOTIATOR = $(BIN)/LoadL_negotiator<br />

NEGOTIATOR_DEBUG = D_NEGOTIATE D_FULLDEBUG<br />

NEGOTIATOR_LOG = $(LOG)/NegotiatorLog<br />

MAX_NEGOTIATOR_LOG = 64000<br />

NEGOTIATOR_INTERVAL = 60<br />

MACHINE_UPDATE_INTERVAL = 300<br />

NEGOTIATOR_PARALLEL_DEFER = 300<br />

NEGOTIATOR_PARALLEL_HOLD = 300<br />

NEGOTIATOR_REDRIVE_PENDING = 90<br />

NEGOTIATOR_RESCAN_QUEUE = 90<br />

NEGOTIATOR_REMOVE_COMPLETED = 0<br />

NEGOTIATOR_CYCLE_DELAY = 0<br />

NEGOTIATOR_CYCLE_TIME_LIMIT = 0<br />

#<br />

# Sets the interval between recalculation of the SYSPRIO values<br />

# for all the jobs in the queue<br />

#<br />

NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL = 0<br />

#<br />

# GSmonitor Macros<br />

#<br />

GSMONITOR = $(BIN)/LoadL_GSmonitor<br />

GSMONITOR_DEBUG =<br />

GSMONITOR_LOG = $(LOG)/GSmonitorLog<br />

MAX_GSMONITOR_LOG = 64000<br />

#<br />

# Starter Macros<br />



#<br />

STARTER = $(BIN)/LoadL_starter<br />

STARTER_DEBUG =<br />

STARTER_LOG = $(LOG)/StarterLog<br />

MAX_STARTER_LOG = 64000<br />

#<br />

# LoadL_Master Macros<br />

#<br />

MASTER = $(BIN)/LoadL_master<br />

MASTER_LOG = $(LOG)/MasterLog<br />

MASTER_DEBUG =<br />

MAX_MASTER_LOG = 64000<br />

RESTARTS_PER_HOUR = 12<br />

PUBLISH_OBITUARIES = TRUE<br />

OBITUARY_LOG_LENGTH = 25<br />

#<br />

# Specify whether log files are truncated when opened<br />

#<br />

TRUNC_MASTER_LOG_ON_OPEN = False<br />

TRUNC_STARTD_LOG_ON_OPEN = False<br />

TRUNC_SCHEDD_LOG_ON_OPEN = False<br />

TRUNC_KBDD_LOG_ON_OPEN = False<br />

TRUNC_STARTER_LOG_ON_OPEN = False<br />

TRUNC_NEGOTIATOR_LOG_ON_OPEN = False<br />

TRUNC_GSMONITOR_LOG_ON_OPEN = False<br />

#<br />

# Machine control expressions and macros<br />

#<br />

OpSys : "$(OPSYS)"<br />

Arch : "$(ARCH)"<br />

Machine : "$(HOST).$(DOMAIN)"<br />

#<br />

# Expressions used to control starting and stopping of foreign jobs

#<br />

MINUTE = 60<br />

HOUR = (60 * $(MINUTE))<br />

StateTimer = (CurrentTime - EnteredCurrentState)<br />

BackgroundLoad = 0.7<br />

HighLoad = 1.5<br />

StartIdleTime = 15 * $(MINUTE)<br />

ContinueIdleTime = 5 * $(MINUTE)<br />



MaxSuspendTime = 10 * $(MINUTE)<br />

MaxVacateTime = 10 * $(MINUTE)<br />

KeyboardBusy = KeyboardIdle < $(POLLING_FREQUENCY)<br />

CPU_Idle = LoadAvg <= $(BackgroundLoad)
CPU_Busy = LoadAvg >= $(HighLoad)

#<br />

# See Using and Administering LoadLeveler for an explanation of these<br />

# control expressions<br />

#<br />

# START : $(CPU_Idle) && KeyboardIdle > $(StartIdleTime)<br />

# SUSPEND : $(CPU_Busy) || $(KeyboardBusy)<br />

# CONTINUE : $(CPU_Idle) && KeyboardIdle > $(ContinueIdleTime)<br />

# VACATE : $(StateTimer) > $(MaxSuspendTime)<br />

# KILL : $(StateTimer) > $(MaxVacateTime)<br />

START : T<br />

SUSPEND : F<br />

CONTINUE : T<br />

VACATE : F<br />

KILL : F<br />

#<br />

# The following (default) expression for SYSPRIO creates a FIFO job queue.

#<br />

SYSPRIO: 0 - (QDate)<br />

#MACHPRIO: 0 - (1000 * (LoadAvg / (Cpus * Speed)))<br />

#<br />

# The following (default) expression for MACHPRIO orders<br />

# machines by load average.<br />

#<br />

MACHPRIO: 0 - (LoadAvg)<br />

#<br />

# The MAX_JOB_REJECT value determines how many times a job can be
# rejected before it is canceled or put on hold. The default value
# is 0, which indicates a rejected job will immediately be canceled
# or placed on hold. MAX_JOB_REJECT may be set to unlimited rejects
# by specifying a value of -1.



#<br />

MAX_JOB_REJECT = 0<br />

#<br />

# When ACTION_ON_MAX_REJECT is HOLD, jobs will be put on user hold<br />

# when the number of rejects reaches the MAX_JOB_REJECT value. When
# ACTION_ON_MAX_REJECT is CANCEL, jobs will be canceled when the

# number of rejects reaches the MAX_JOB_REJECT value. The default<br />

# value is HOLD.<br />

#<br />

ACTION_ON_MAX_REJECT = HOLD<br />

# Filesystem Monitor Interval and Thresholds
# Monitoring interval is in minutes and should be set according to how
# fast the filesystem grows
FS_INTERVAL = 30
# File System space thresholds are specified in Bytes. Scaling factors
# such as K, M and G are allowed.
FS_NOTIFY = 750KB,1MB
FS_SUSPEND = 500KB,750KB
FS_TERMINATE = 100MB,100MB
# File System inode thresholds are specified in number of inodes. Scaling
# factors such as K, M and G are allowed.
INODE_NOTIFY = 1K,1.1K
INODE_SUSPEND = 500,750
INODE_TERMINATE = 50,50



Appendix B. The sitefs file

This appendix includes the sitefs file that we used. We created it from the
example sitefs file that is shown in the file:

/bgl/BlueLight/ppcfloor/docs/ionode.README

We added the lines that we needed and saved the file in the following directory
on the Service Node so that it survived any upgrades to the Blue Gene/L driver:

/bgl/dist/etc/rc.d/init.d

The /bgl/dist/etc/rc.d/init.d/sitefs file

The following listing shows the sitefs file as we used it in our environment.

# US Government Users Restricted Rights -<br />

# Use, duplication or disclosure restricted<br />

# by GSA ADP Schedule Contract with IBM Corp.<br />

#<br />

# Licensed Materials-Property of IBM<br />

# -------------------------------------------------------------<br />

#<br />

# -------------------------------------------------------------<br />

# NOTE: The PATH environment variable is set to the following<br />

# upon entry to this script:<br />

# /bin.rd:/sbin.rd:/usr/bin:/bin:/usr/sbin:/sbin<br />

# The /bin.rd and /sbin.rd directories contain many of<br />

# the busybox commands and Blue Gene specific programs.<br />

# The /bin, /sbin, and /usr directories are symbolic<br />

# links to the NFS-mounted MCP.<br />

#--------------------------------------------------------------<br />

#<br />

#<br />

# /etc/init.d/syslog<br />

# Default-Stop:<br />

# Description: Start the system logging daemons<br />

### END INIT INFO<br />

# Source config file, if it exists.<br />

test -f /etc/sysconfig/syslog || exit 6<br />

. /etc/sysconfig/syslog<br />

BINDIR=/sbin.rd<br />

case "$SYSLOG_DAEMON" in<br />

syslog-ng)<br />

syslog=syslog-ng<br />

config=/etc/syslog-ng/syslog-ng.conf<br />

params="$SYSLOG_NG_PARAMS"<br />

;;<br />

*)<br />

syslog=syslogd<br />

config=/etc/syslog.conf<br />

params="$SYSLOGD_PARAMS"<br />

# Add additional sockets to SYSLOGD_PARAMS<br />

# Extract the names of the variables beginning with SYSLOGD_ADDITIONAL_SOCKET
SYSLOGD_ADDITIONAL_SOCKET_LIST=`grep SYSLOGD_ADDITIONAL_SOCKET /etc/sysconfig/syslog | sed s/=.*//`

for variable in ${SYSLOGD_ADDITIONAL_SOCKET_LIST}; do<br />

eval value=\$$variable<br />

test -n "${value}" && test -d ${value%/*} && \<br />

params="$params -a $value"<br />

done<br />

;;<br />

esac<br />

syslog_pid="/var/run/${syslog}.pid"<br />

# check config and programs<br />

test -x ${BINDIR}/$syslog || exit 5<br />

test -x ${BINDIR}/klogd || exit 5<br />

# If there is no config file in the ramdisk, create a simple one<br />

# that logs important messages to /dev/console<br />

if [ ! -e ${config} ]; then<br />

# Note: "*.warn" produces numerous boot messages that may affect<br />

boot performance.<br />

# Therefore, warnings (and higher log levels) are not logged,<br />

by default.<br />

echo "authpriv.none;*.emerg;*.alert;*.crit;*.err /dev/console" >><br />

${config}<br />

fi<br />

#<br />

# Do not translate symbol addresses for 2.6 kernel<br />

#<br />

case `uname -r` in<br />

0.*|1.*|2.[0-4].*)<br />

#!/bin/sh<br />

# -------------------------------------------------------------<br />

# Product(s):<br />

# 5733-BG1<br />

#<br />

# (C)Copyright IBM Corp. 2004, 2005<br />

# All rights reserved.<br />

# US Government Users Restricted Rights -<br />

# Use, duplication or disclosure restricted<br />

# by GSA ADP Schedule Contract with IBM Corp.<br />

#<br />

# Licensed Materials-Property of IBM<br />

# -------------------------------------------------------------<br />



# -------------------------------------------------------------<br />

# NOTE: The PATH environment variable is set to the following<br />

#!/bin/sh<br />

# -------------------------------------------------------------<br />

# Product(s):<br />

# 5733-BG1<br />

#<br />

# (C)Copyright IBM Corp. 2004, 2005<br />

# All rights reserved.<br />

# US Government Users Restricted Rights -<br />

# Use, duplication or disclosure restricted<br />

chmod +x /var/mmfs/etc/mmfsup.scr<br />

# Start GPFS and wait for it to come up<br />

rm -f $upfile<br />

/usr/lpp/mmfs/bin/mmautoload<br />

retries=300<br />

until test -e $upfile<br />

do sleep 2<br />

let retries=$retries-1<br />

if [ $retries -eq 0 ]<br />

then ras_advisory "$0: GPFS did not come up on I/O node<br />

$HOSTID"<br />

exit 1<br />

fi<br />

done<br />

#!/bin/sh<br />

#<br />

# Sample sitefs script.<br />

#<br />

# It mounts a filesystem on /scratch, mounts /home for user files
# (applications), creates a symlink for /tmp to point into some directory
# in /scratch using the IP address of the I/O node as part of the directory
# name to make it unique to this I/O node, and sets up environment
# variables for ciod.

#<br />

. /proc/personality.sh<br />

. /etc/rc.status<br />

#-------------------------------------------------------------------<br />



# Function: mountSiteFs()<br />

#<br />

# Mount a site file system<br />

# Attempt the mount up to 5 times.<br />

# If all attempts fail, send a fatal RAS event so the block fails<br />

# to boot.<br />

#<br />

# Parameter 1: File server IP address<br />

# Parameter 2: Exported directory name<br />

# Parameter 3: Directory to be mounted over<br />

# Parameter 4: Mount options<br />

#-------------------------------------------------------------------
mountSiteFs()

{<br />

# Make the directory to be mounted over<br />

[ -d $3 ] || mkdir $3<br />

# Set to attempt the mount 5 times<br />

ATTEMPT_LIMIT=5<br />

ATTEMPT=1<br />

# Attempt the mount up to 5 times.<br />

# Echo a message to the I/O node log for each failed attempt.<br />

until test $ATTEMPT -gt $ATTEMPT_LIMIT || mount $1:$2 $3 -o $4; do<br />

echo "Attempt number $ATTEMPT to mount $1:$2 on $3 failed"<br />

sleep $ATTEMPT<br />

let ATTEMPT=$ATTEMPT+1<br />

done<br />

# If all attempts failed, send a fatal RAS event so the block fails to
# boot. If the mount worked, echo a message to the I/O node log.

if test $ATTEMPT -gt $ATTEMPT_LIMIT; then<br />

echo "Mounting $1:$2 on $3 failed" > /proc/rasevent/kernel_fatal<br />

exit<br />

else<br />

echo "Attempt number $ATTEMPT to mount $1:$2 on $3 worked"<br />

fi<br />

}<br />

#-------------------------------------------------------------------<br />

# Script: sitefs()<br />

#<br />

# Perform site-specific functions during startup and shutdown<br />

#<br />



# Parameter 1: "start" - perform startup functions<br />

# "stop" - perform shutdown functions<br />

#-------------------------------------------------------------------<br />

# Set to ip address of site fileserver<br />

SITEFS=172.30.1.33<br />

# First reset status of this service<br />

rc_reset<br />

# Handle startup (start) and shutdown (stop)<br />

case "$1" in<br />

start)<br />

echo Mounting site filesystems<br />

# Mount a scratch file system...
mountSiteFs $SITEFS /bglscratch /bglscratch tcp,rsize=32768,wsize=32768,async

# Mount a home file system...<br />

# mountSiteFs $SITEFS /home /home tcp,rsize=8192,wsize=8192<br />

# Arrange something for tmp...<br />

# - make a unique directory in the scratch file system.<br />

# - rename the original /tmp to /tmp.original.<br />

# - point /tmp to the unique directory.<br />

# tmpdir=/scratch/ionodetmp/$BGL_IP<br />

# [ -d $tmpdir ] || mkdir -p $tmpdir<br />

# mv /tmp /tmp.original<br />

# ln -s $tmpdir /tmp<br />

# Setup environment variables for ciod
# echo "export CIOD_RDWR_BUFFER_SIZE=262144" >> /etc/sysconfig/ciod
# echo "export DEBUG_SOCKET_STARTUP=ALL" >> /etc/sysconfig/ciod
# Uncomment the following line to not start the NTP daemon
# rm /etc/sysconfig/xntp
# Uncomment the following line to not start the syslogd and klogd
# daemons
# rm /etc/sysconfig/syslog
# Uncomment the first line to start GPFS.
# Optionally uncomment the other lines to change the defaults for
# GPFS.
echo "GPFS_STARTUP=1" >> /etc/sysconfig/gpfs

# echo "GPFS_VAR_DIR=/bgl/gpfsvar" >> /etc/sysconfig/gpfs<br />

# echo "GPFS_CONFIG_SERVER=\"$BGL_SNIP\"" >><br />

/etc/sysconfig/gpfs<br />

rc_status -v<br />

;;<br />

stop)<br />

echo Unmounting site filesystems<br />

# Put /tmp back<br />

# rm /tmp<br />

# mv /tmp.original /tmp<br />

# Unmount the scratch and home file systems<br />

umount -f /bglscratch<br />

# umount -f /home<br />

;;<br />

restart)<br />

;;<br />

status)<br />

;;<br />

esac<br />

rc_exit<br />

#------End of script---------<br />





Appendix C. The ionode.README file

This file might well change between releases. You can find the current version of
the ionode.README file that is compatible with the version of the code currently
in use on your Blue Gene/L system at:

/bgl/BlueLight/ppcfloor/docs/ionode.README

This appendix includes the ionode.README file. This file contains useful
information about the I/O node bootup sequence. It is installed when the
bglmcp*.rpm is installed. The version shown here is compatible with release
V2R2M1.


/bgl/BlueLight/ppcfloor/docs/ionode.README file

Copyright Notice<br />

All Rights Reserved Legend<br />

US Government Users Restricted Rights Notice<br />

I/O Node Startup and Shutdown Scripts<br />

=====================================<br />

This file contains a summary of the startup and shutdown scripts executed
on an I/O node.

"Dist" directories<br />

------------------<br />

There are three "distribution" directories involved during startup and<br />

shutdown.<br />

$BGL_DISTDIR<br />

The "dist" subdir located at the top of the Blue Gene driver install<br />

tree.<br />

This is referred to as the "system dist" or just "dist". This is<br />

normally<br />

located at "/bgl/BlueLight//ppc/dist".<br />

$BGL_SITEDISTDIR<br />

The "/bgl/dist" subdir. This is referred to as the "site dist", as it<br />

is<br />

intended for site-specific customization of the I/O node startup and<br />

shutdown.<br />

$BGL_OSDIR<br />

The Mini-Control Program (MCP) subdir. This contains many of the<br />

executables needed by programs that run in the I/O node. The I/O node<br />

/bin, /sbin, /lib, and /usr directories are links to directories within<br />

this subdir. This is normally located at "/bgl/OS/x.y", where "x.y" is<br />

the<br />

version and modification level of the MCP.<br />

All of these are treated as the root of a filesystem even though they<br />

are<br />

NOT explicitly exported and mounted.<br />

432 IBM System Blue Gene Solution: <strong>Problem</strong> <strong>Determination</strong> <strong>Guide</strong>


Contents of the ramdisk<br />

-----------------------<br />

The ramdisk contains a subset of files located in the system dist<br />

directories. These files remain in the system dist tree for the<br />

convenience of an administrator to read.<br />

/bgl NFS mount point of /bgl for bootstrap<br />

/bin/* shell commands implemented by busybox<br />

/sbin/* shell commands implemented by busybox<br />

/dev/* special device files<br />

/lib empty on startup (busybox is static linked)<br />

/proc mounted proc filesystem<br />

/proc/personality binary personality config (read by ciod)<br />

/proc/personality.sh shell personality config<br />

/tmp temp in ramdisk (very small)<br />

/etc/fstab minimal fstab<br />

/etc/group minimal group (note: ciod does not need group names defined)

/etc/passwd minimal passwd (note: ciod does not need names defined)

/etc/inittab inittab for starting startup and shutdown scripts and console shell

/etc/protocols minimal protocols<br />

/etc/rpc minimal rpc<br />

/etc/services minimal services<br />

/etc/sysconfig/xntp minimal NTP options file<br />

/etc/ntp.conf minimal NTP configuration file<br />

/etc/sysconfig/syslog minimal syslog options file<br />

/etc/syslog.conf minimal syslog configuration file<br />

/etc/rc.dist defines $BGL_DISTDIR, $BGL_SITEDISTDIR, and $BGL_OSDIR

/etc/rc.shutdown first stage shutdown script<br />

/etc/rc.d/rc.sysinit first stage sysinit script<br />

/etc/rc.d/rc.ras verbose, ras_advisory and ras_fatal functions<br />

/etc/rc.d/rc.network bring up network and default route<br />

/etc/rc.d/rc3.d dir of start (S*) and shutdown (K*) scripts<br />

The last few "rc" scripts are of interest in this document.<br />

Startup flow between rc scripts
-------------------------------
The I/O node startup begins with /sbin/init which reads /etc/inittab. The
sysinit rule in inittab is coded to run /etc/rc.d/rc.sysinit. From here
the flow is as follows. Note that BGL_SITEDISTDIR is normally /bgl/dist.

/etc/rc.d/rc.sysinit
  mounts /proc
  includes /proc/personality.sh (e.g. $BGL_IP, $BGL_FS, etc)
  includes /etc/rc.d/rc.ras
  run /etc/rc.d/rc.network to bring up network and default route
  test for ethernet link status
  mount /bgl
  includes /etc/rc.dist to define $BGL_DISTDIR, $BGL_SITEDISTDIR, and $BGL_OSDIR
  run second stage startup from $BGL_DISTDIR/etc/rc.d/rc.sysinit2

$BGL_DISTDIR/etc/rc.d/rc.sysinit2
  NOTE: this second stage startup exists only in NFS
  replace empty /lib with symlink to $BGL_OSDIR/lib
  replace empty /usr with symlink to $BGL_OSDIR/usr
  replace empty /etc/rc.d/rc3.d with symlink under $BGL_DISTDIR
  load the tree device driver
  run $BGL_DISTDIR/etc/rc.d/rc3.d/S* start scripts and
      $BGL_SITEDISTDIR/etc/rc.d/rc3.d/S* start scripts
  run $BGL_SITEDISTDIR/etc/rc.local

Note that the start scripts run by rc.sysinit2 are selected from both the
installation dist directory as well as the site dist directory and run in
numeric order. Therefore the site dist directory can contain scripts that
start before Blue Gene system software such as ciod. If a start script in
the site dist directory has the same name as a start script in the
installation dist directory, only the script in the installation dist
directory is run. Start scripts having the same number are run
alphabetically.
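The merge-and-order rule described above can be pictured with a small shell sketch. This is not the actual rc.sysinit2 code, only an illustration of the documented behavior (the system dist copy wins when both directories contain a script with the same name, and the merged set runs in sorted name order):

#!/bin/sh
# Illustrative only: emulate the documented start-script selection order
for name in `ls $BGL_DISTDIR/etc/rc.d/rc3.d/S* \
                $BGL_SITEDISTDIR/etc/rc.d/rc3.d/S* 2>/dev/null \
             | xargs -n1 basename | sort -u`; do
    if [ -x $BGL_DISTDIR/etc/rc.d/rc3.d/$name ]; then
        $BGL_DISTDIR/etc/rc.d/rc3.d/$name start      # system dist copy wins
    else
        $BGL_SITEDISTDIR/etc/rc.d/rc3.d/$name start  # site-only script
    fi
done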

Shutdown flow between rc scripts
--------------------------------
When a block is freed, the shutdown rule in /etc/inittab is coded to run
/etc/rc.shutdown. From here, the flow is as follows:

/etc/rc.shutdown
  run $BGL_SITEDISTDIR/etc/rc.local.shutdown
  run $BGL_DISTDIR/etc/rc.d/rc3.d/K* shutdown scripts and
      $BGL_SITEDISTDIR/etc/rc.d/rc3.d/K* shutdown scripts
      in numeric order
  unmount /bgl and any remaining mounted file systems

Shell variables during startup and shutdown
-------------------------------------------
The special file /proc/personality.sh is a shell script that sets shell
variables to node specific values. These are used by the rc.* scripts
and may also be useful within site-written start and shutdown scripts.
These are not exported variables so each script will need to source this
file (or export the variables to other scripts). See the example script
below.

The file /etc/rc.dist contains three additional variable definitions not
found in /proc/personality.sh and may also be useful. These are marked
with an (*) below.

BGL_MAC Ethernet mac address of this I/O node<br />

BGL_IP IP address of this I/O node<br />

BGL_NETMASK Netmask to be used for this I/O node<br />

BGL_BROADCAST Broadcast address for this I/O node<br />

BGL_GATEWAY Gateway address for the default route for this I/O node<br />

BGL_MTU MTU for this I/O node<br />

BGL_FS Fileserver IP address for /bgl<br />

BGL_EXPORTDIR Fileserver export directory for /bgl<br />

BGL_LOCATION Location string for this I/O node<br />

BGL_PSETNUM PSet number for this I/O node (0..$BGL_NUMPSETS-1)<br />

BGL_NUMPSETS Total number of PSets in this block (i.e. # of I/O nodes)

BGL_NODESINPSET Number of compute nodes served by this I/O node<br />

BGL_{X,Y,Z}SIZE Size of block in nodes<br />

BGL_VIRTUALNM 1 if the block is running in virtual node mode<br />



BGL_BLOCKID Name of block<br />

BGL_VERBOSE Defined if the block is created with the "io_node_verbose" option

BGL_SNIP Functional network IP address of the service node<br />

BGL_MEMSIZE Bytes of RAM for this I/O node<br />

BGL_VERSION Blue Gene personality version<br />

BGL_DISTDIR(*) Path to top of dist directory in NFS<br />

BGL_SITEDISTDIR(*) Path to top of site dist directory in NFS (/bgl/dist)

BGL_OSDIR(*) Path to I/O node MCP subdir (/bgl/OS/x.y)<br />
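As a minimal illustration of how a site-written start script can use these values (the variable names are those documented above; the message written to the console is only an example), the script sources the personality file before referring to them:

#!/bin/sh
# Example only: pick up node-specific values, then use them
. /proc/personality.sh
echo "I/O node $BGL_IP at $BGL_LOCATION serves $BGL_NODESINPSET compute nodes" > /dev/console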

Environment Variables used by ciod<br />

----------------------------------<br />

The following environment variables are used by ciod:<br />

CIOD_RDWR_BUFFER_SIZE=value
This value specifies the size, in bytes, of each buffer used by ciod to
issue read and write system calls. One buffer is allocated for each
compute node CPU associated with this I/O node. The default size, if not
specified, is 87600 bytes. A larger size may improve performance. If you
are using a GPFS file server, it is recommended that this buffer size match
the GPFS block size. Experiment with different sizes, such as 262144
(256K) or 524288 (512K), to get the best performance.

DEBUG_SOCKET_STARTUP={ ALL, }
When specified, this variable causes ciod to start up sockets for debugging
all blocks, or the specified block.

These variables can be specified in the ciod start script,
$BGL_DISTDIR/etc/rc.d/rc3.d/S50ciod. It is recommended that you create
a /etc/sysconfig/ciod file in the ramdisk that specifies these variables.
You can create this file in your $BGL_SITEDISTDIR/etc/rc.d/rc3.d/Sxxxxxx
script that runs before ciod is started. Refer to "Typical Site
Customization" in this README for an example.

Support for Network Time Protocol (NTP)<br />



---------------------------------------<br />

During I/O node startup, the $BGL_DISTDIR/etc/rc.d/rc3.d/S15xntpd script
runs to set the time and date, and to start the NTP daemon (NTPD). There
are three things you can do to configure this support:

1. Set up your timezone file. The I/O node looks for the timezone file<br />

/etc/localtime. This is a symbolic link to<br />

$BGL_SITEDISTDIR/etc/localtime. Thus, to set up your timezone file,<br />

copy your service node's /etc/localtime file to<br />

$BGL_SITEDISTDIR/etc/localtime. If the timezone file is not found,<br />

the<br />

time is assumed to be UTC.<br />

2. Set up options used by the S15xntpd script. These options are<br />

located<br />

in file /etc/sysconfig/xntp. The default file contains two options:<br />

XNTPD_INITIAL_NTPDATE="AUTO-2"<br />

Specifies which NTP servers will be queried for the initial time<br />

and<br />

date.<br />

"AUTO" Query all of the NTP servers listed in the NTPD<br />

configuration file<br />

"AUTO-n" Query the first "n" NTP servers listed in the NTPD<br />

configuration file<br />

"" Don't perform the initial query at all<br />

"address1 address2 ..." Query the specified NTP servers<br />

The NTPD configuration file is /etc/ntp.conf.<br />

The default is "AUTO-2".<br />

XNTPD_OPTIONS=""<br />

Parameters for the /sbin/ntpd NTP daemon. The following<br />

parameters<br />

are supported:<br />

/sbin/ntpd [ -abdgmnqx ] [ -c config_file ] [ -e e_delay ]<br />

[ -f freq_file ] [ -k key_file ] [ -l log_file ]<br />

[ -p pid_file ] [ -r broad_delay ] [ -s statdir ]<br />

[ -t trust_key ] [ -v sys_var ] [ -V<br />

default_sysvar ]<br />

[ -P fixed_process_priority ]<br />

The default is no parameters.<br />

Normally, you should not have to change these options. If you wish to
change them, you need to do this in your site customization script
(S10sitefs - refer to "Typical Site Customization" in this README) by
adding lines similar to the following:

rm /etc/sysconfig/xntp<br />

echo "XNTPD_INITIAL_NTPDATE=\"AUTO-2\"" >> /etc/sysconfig/xntp<br />

echo "XNTPD_OPTIONS=\"-c /etc/ntp.conf\"" >> /etc/sysconfig/xntp<br />

3. Set up configuration information for the NTP daemon. This is stored in
   the /etc/ntp.conf file. The default file is created in the ramdisk by
   the S15xntpd script and contains the following:

   restrict default nomodify
   server $BGL_SNIP

   The default NTP server is the service node. Normally, you should not
   need to change these options. If you wish to supply your own file, you
   need to create it in your site customization script (S10sitefs -
   refer to "Typical Site Customization" in this README) by adding lines
   similar to the following:

echo "restrict default nomodify" >> /etc/ntp.conf<br />

echo "server $BGL_SNIP" >> /etc/ntp.conf<br />

For more information on NTP, refer to http://www.ntp.org/<br />

If you do not want NTP started on the I/O nodes, you can remove the
/etc/sysconfig/xntp file from the ramdisk in your site customization script
(S10sitefs - refer to "Typical Site Customization" in this README) by
adding the following line to that script:

rm /etc/sysconfig/xntp<br />

Support for Syslog<br />

------------------<br />

During I/O node startup, the $BGL_DISTDIR/etc/rc.d/rc3.d/S45syslog script
runs to start up the syslogd and klogd daemons. There are three things you
can do to configure this support:

1. Set up options used by the S45syslog script. These options are<br />

located<br />

in file /etc/sysconfig/syslog. The default file contains the<br />

following<br />

options:<br />

KERNEL_LOGLEVEL=1<br />

Specifies the default logging level for kernel messages (0-7).<br />

Kernel<br />

messages having a priority level equal to or more severe than<br />

(less<br />

than) this value are logged by syslog. The logging levels are as<br />

follows:<br />

KERN_EMERG 0 System is unusable<br />

KERN_ALERT 1 Action must be taken immediately<br />

KERN_CRIT 2 Critical conditions<br />

KERN_ERR 3 Error conditions<br />

KERN_WARNING 4 Warning conditions<br />

KERN_NOTICE 5 Normal but significant condition<br />

KERN_INFO 6 Informational<br />

KERN_DEBUG 7 Debug-level messages<br />

The default is "1".<br />

SYSLOGD_PARAMS=""<br />

Parameters for the /sbin/syslogd daemon. The following parameters<br />

are<br />

supported:<br />

/sbin/syslog [ -a socket ] [ -d ] [ -h ]<br />

[ -p socket ] [ -f config file ] [ -n ]<br />

[ -l hostlist ] [ -m interval ] [ -r ]<br />

[ -t ] [ -s domainlist ] [ -v ]<br />

The default is no parameters. However, the S45syslog script<br />

always<br />

specifies "-n" because it is required when starting syslogd from<br />

the<br />

init process.<br />

KLOGD_PARAMS="-2"<br />

Parameters for the /sbin/klogd daemon. The following parameters<br />

are<br />

supported:<br />

/sbin/klogd [ -c n ] [ -d ] [ -f fname ] [ -iI ] [ -n ] [ -o<br />

]<br />

Appendix C. The ionode.README file 439


]<br />

to<br />

by<br />

440 IBM System Blue Gene Solution: <strong>Problem</strong> <strong>Determination</strong> <strong>Guide</strong><br />

[ -k fname ] [ -v ] [ -x ] [ -2 ] [ -s ] [ -p<br />

The default is "-2" and "-c KERNEL_LOGLEVEL".<br />

SYSLOG_DAEMON="syslogd"<br />

The name of the syslog daemon. The default is "syslogd".<br />

Normally, you should not have to change these options. If you wish to
change them, you need to do this in your site customization script
(S10sitefs - refer to "Typical Site Customization" in this README) by
adding lines similar to the following:

rm /etc/sysconfig/syslog<br />

echo "KERNEL_LOGLEVEL=1" >> /etc/sysconfig/syslog<br />

echo "SYSLOGD_PARAMS=\"\"" >> /etc/sysconfig/syslog<br />

echo "KLOGD_PARAMS=\"-2\"" >> /etc/sysconfig/syslog<br />

echo "SYSLOG_DAEMON=\"syslogd\"" >> /etc/sysconfig/syslog<br />

2. Set up configuration information for the syslog daemon. This is stored
   in the /etc/syslog.conf file. The default file is created in the
   ramdisk by the S45syslog script and contains the following:

   authpriv.none;*.emerg;*.alert;*.crit;*.err; /dev/console

   This default file logs messages having priority level 0, 1, 2, or 3 to
   the I/O node console (the I/O node log file). If you wish to supply
   your own file, you need to create it in your site customization script
   (S10sitefs - refer to "Typical Site Customization" in this README) by
   adding lines similar to the following:

   echo "authpriv.none;*.emerg;*.alert;*.crit;*.err; /dev/console" \
        >> /etc/syslog.conf

3. If you want to send I/O node messages to a remote syslog server (using
   @server-name in the /etc/syslog.conf file), consider the following:
   a. When starting syslogd on the remote server, specify "-r -m 0" to
      enable it.
   b. If the remote server is logging to files, be aware that the files
      can get large. Consider using a utility to manage these files, such
      as logrotate.


For more information, refer to the man pages for syslog, syslogd, klogd,
and logrotate.

If you do not want syslog started on the I/O nodes, you can remove the
/etc/sysconfig/syslog file from the ramdisk in your site customization
script (S10sitefs - refer to "Typical Site Customization" in this README)
by adding the following line to that script:

rm /etc/sysconfig/syslog<br />

Support for GPFS<br />

----------------<br />

During I/O node startup, the GPFS client may optionally be started.
Assuming the appropriate GPFS setup has been done, the following describes
the steps for configuring the I/O node scripts to start GPFS and explains
the flow of these scripts.

In your site customization script (S10sitefs - refer to "Typical Site<br />

Customization" in this README), add the following line:<br />

echo "GPFS_STARTUP=1" >> /etc/sysconfig/gpfs<br />

This creates the /etc/sysconfig/gpfs file in the I/O node ramdisk and<br />

specifies that you want the GPFS client to be started.<br />

You may specify the following additional options in the same manner (the
default values are shown here, so you only need to specify these if you
want different values):

echo "GPFS_VAR_DIR=/bgl/gpfsvar" >> /etc/sysconfig/gpfs<br />

echo "GPFS_CONFIG_SERVER=\"$BGL_SNIP\"" >> /etc/sysconfig/gpfs<br />

GPFS_VAR_DIR specifies the pathname to the "var" directory to be used by
the GPFS client. A per-I/O node directory is created under GPFS_VAR_DIR
for storing log files and configuration information. Note that if the
GPFS_VAR_DIR resides in an NFS exported file system, that export should
specify "no_root_squash" so that the I/O node root user can write to the
GPFS_VAR_DIR directory.

GPFS_CONFIG_SERVER specifies the host name or IP address of the primary<br />

GPFS cluster configuration server node. The GPFS configuration file<br />

(/var/mmfs/gen/mmsdrfs) for an I/O node is retrieved from this node, if<br />

necessary.<br />

When you specify to start the GPFS client, as described above, the<br />

following occurs during I/O node startup:<br />

1) The SSH daemon is started by the $BGL_DISTDIR/etc/rc.d/rc3.d/S16sshd
   script. SSH is used by GPFS to communicate among the I/O nodes and the
   service node. This script will also use the $BGL_SITEDISTDIR/etc/hosts
   file to set the hostname of this I/O node.

2) The GPFS client is started by the $BGL_DISTDIR/etc/rc.d/rc3.d/S40gpfs
   script.

Typical Site Customization<br />

--------------------------<br />

A Blue Gene site will normally have a script that customizes the startup
and shutdown. This script could perform the following functions:

1. Mount (during startup) and unmount (during shutdown) high performance
   file servers for application use. To ensure the file servers are
   available to applications,
   - the mount must occur after portmap is started by the S05nfs script
     and before ciod is started by the S50ciod script.
   - if using S45syslog to start syslog that logs to files in a mounted
     file system, the mount must occur before syslog is started.
   - the unmount must occur after ciod is ended by the K50ciod script and
     before portmap is ended by the K95nfs script.

   Recommendations for NFS mounts:
   a. The mount should be retried several times to accommodate a busy file
      server.


   b. Mount options:
      tcp - This provides automatic flow control when it detects that packets
            are being dropped due to network congestion or server overload.
      rsize,wsize - These specify read and write NFS buffer sizes. 8192 is
            the minimum recommended size, and 32768 is the maximum
            recommended size. If your server is not built with a kernel
            compiled with a 32768 size, it will negotiate down to what it
            can support. In general, the larger the size, the better the
            performance. However, depending on the capacity of your network
            and server, 32768 may be too large and cause excessive slowdowns
            or hangs during times of heavy I/O. This is something that each
            site needs to tune.
      async - Specifying this option may improve write performance, although
            there is greater risk for losing data if the file server crashes
            and is unable to get the data written to disk.

2. Point /tmp to a larger file system. The default /tmp is in a small
   ramdisk. Your applications may require a larger /tmp.

3. Set ciod environment variables. Refer to "Environment Variables used
   by ciod" in this README for details.

4. Set NTP parameters. Refer to "Support for Network Time Protocol (NTP)"
   in this README for details.

5. Set syslog parameters. Refer to "Support for Syslog" in this README for
   details.

6. Set GPFS parameters. Refer to "Support for GPFS" in this README for
   details.

Your site customization script should be in the



$BGL_SITEDISTDIR/etc/rc.d/rc3.d directory. For example, call it "sitefs".
Then, in order to properly place it in the startup and shutdown sequence,
symbolic links should be created to this script as follows:

ln -s $BGL_SITEDISTDIR/etc/rc.d/rc3.d/sitefs \
      $BGL_SITEDISTDIR/etc/rc.d/rc3.d/S10sitefs
ln -s $BGL_SITEDISTDIR/etc/rc.d/rc3.d/sitefs \
      $BGL_SITEDISTDIR/etc/rc.d/rc3.d/K90sitefs

The S10sitefs and K90sitefs links are sequenced such that they meet the
mount requirements described above.

During startup, S10sitefs will be called with the "start" parameter.<br />

During shutdown, K90sitefs will be called with the "stop" parameter.<br />

The following is an example of such a site customization script:<br />

#!/bin/sh<br />

#<br />

# Sample sitefs script.<br />

#<br />

# It mounts a filesystem on /scratch, mounts /home for user files<br />

# (applications), creates a symlink for /tmp to point into some<br />

directory<br />

# in /scratch using the IP address of the I/O node as part of the<br />

directory<br />

# name to make it unique to this I/O node, and sets up environment<br />

# variables for ciod.<br />

#<br />

. /proc/personality.sh<br />

. /etc/rc.status<br />

#-------------------------------------------------------------------<br />

# Function: mountSiteFs()<br />

#<br />

# Mount a site file system<br />

# Attempt the mount up to 5 times.<br />

# If all attempts fail, send a fatal RAS event so the block fails<br />

# to boot.<br />

#<br />

# Parameter 1: File server IP address<br />

# Parameter 2: Exported directory name<br />



# Parameter 3: Directory to be mounted over<br />

# Parameter 4: Mount options<br />

#------------------------------------------------------------------mountSiteFs()<br />

{<br />

# Make the directory to be mounted over<br />

[ -d $3 ] || mkdir $3<br />

# Set to attempt the mount 5 times<br />

ATTEMPT_LIMIT=5<br />

ATTEMPT=1<br />

# Attempt the mount up to 5 times.<br />

# Echo a message to the I/O node log for each failed attempt.<br />

until test $ATTEMPT -gt $ATTEMPT_LIMIT || mount $1:$2 $3 -o $4; do<br />

echo "Attempt number $ATTEMPT to mount $1:$2 on $3 failed"<br />

sleep $ATTEMPT<br />

let ATTEMPT=$ATTEMPT+1<br />

done<br />

# If all attempts failed, send a fatal RAS event so the block fails<br />

to<br />

# boot. If the mount worked, echo a message to the I/O node log.<br />

if test $ATTEMPT -gt $ATTEMPT_LIMIT; then<br />

echo "Mounting $1:$2 on $3 failed" > /proc/rasevent/kernel_fatal<br />

exit<br />

else<br />

echo "Attempt number $ATTEMPT to mount $1:$2 on $3 worked"<br />

fi<br />

}<br />

#-------------------------------------------------------------------<br />

# Script: sitefs()<br />

#<br />

# Perform site-specific functions during startup and shutdown<br />

#<br />

# Parameter 1: "start" - perform startup functions<br />

# "stop" - perform shutdown functions<br />

#-------------------------------------------------------------------<br />

# Set to ip address of site fileserver<br />

SITEFS=172.32.1.1<br />

# First reset status of this service<br />

rc_reset<br />



# Handle startup (start) and shutdown (stop)<br />

case "$1" in<br />

start)<br />

echo Mounting site filesystems<br />

# Mount a scratch file system...
mountSiteFs $SITEFS /scratch /scratch tcp,rsize=32768,wsize=32768,async

# Mount a home file system...<br />

mountSiteFs $SITEFS /home /home tcp,rsize=8192,wsize=8192<br />

# Arrange something for tmp...<br />

# - make a unique directory in the scratch file system.<br />

# - rename the original /tmp to /tmp.original.<br />

# - point /tmp to the unique directory.<br />

tmpdir=/scratch/ionodetmp/$BGL_IP<br />

[ -d $tmpdir ] || mkdir -p $tmpdir<br />

mv /tmp /tmp.original<br />

ln -s $tmpdir /tmp<br />

# Setup environment variables for ciod
echo "export CIOD_RDWR_BUFFER_SIZE=262144" >> /etc/sysconfig/ciod
echo "export DEBUG_SOCKET_STARTUP=ALL" >> /etc/sysconfig/ciod
# Uncomment the following line to not start the NTP daemon
# rm /etc/sysconfig/xntp
# Uncomment the following line to not start the syslogd and klogd
# daemons
# rm /etc/sysconfig/syslog
# Uncomment the first line to start GPFS.
# Optionally uncomment the other lines to change the defaults for
# GPFS.

# echo "GPFS_STARTUP=1" >> /etc/sysconfig/gpfs<br />

# echo "GPFS_VAR_DIR=/bgl/gpfsvar" >> /etc/sysconfig/gpfs<br />

# echo "GPFS_CONFIG_SERVER=\"$BGL_SNIP\"" >><br />

/etc/sysconfig/gpfs<br />

rc_status -v

;;<br />

stop)<br />

echo Unmounting site filesystems<br />

# Put /tmp back<br />

rm /tmp<br />

mv /tmp.original /tmp<br />

# Unmount the scratch and home file systems<br />

umount -f /scratch<br />

umount -f /home<br />

;;<br />

restart)<br />

;;<br />

status)<br />

;;<br />

esac<br />

rc_exit<br />

#------End of script---------<br />

A more advanced script could select a fileserver based on the I/O node
location string (e.g. R01-M0-NE-I:J18-U01). Note that BGL_LOCATION comes
from sourcing /proc/personality.sh.

case "$BGL_LOCATION" in<br />

R00-*) SITEFS=172.32.1.1;; # rack 0 NFS server<br />

R01-*) SITEFS=172.32.1.2;; # rack 1 NFS server<br />

R02-*) SITEFS=172.32.1.3;; # rack 2 NFS server<br />

R03-*) SITEFS=172.32.1.4;; # rack 3 NFS server<br />

R04-*) SITEFS=172.32.1.5;; # rack 4 NFS server<br />

R05-*) SITEFS=172.32.1.6;; # rack 5 NFS server<br />

esac<br />

The script could use the $BGL_IP (ip addr of the I/O node) to compute the
fileserver(s). It also can use $BGL_BLOCKID so that different blocks can
mount special fileservers.



Support for Subnets<br />

-------------------<br />

Each midplane or group of midplanes can belong to a different subnet. To
take advantage of this, there are two things that need to be done:

1. Specify IP addresses for specific I/O nodes. This is done in the
   BglIpPool database table.
   The I/O nodes in the midplane(s) belonging to a particular subnet must
   have IP addresses belonging to that subnet. When the BglIpPool table is
   populated, the I/O node locations must be specified along with the IP
   addresses as in the following example:

db2 "INSERT INTO BglIpPool<br />

(location,machineSerialNumber,ipAddress)<br />

VALUES('R01-M0-N4-I:J18-U01','BGL', '172.30.100.244') "<br />

   If the BglIpPool table already exists, and subnet support is being
   added, the following steps must be done:
   a. The BglIpPool table must be deleted and recreated with the correct
      location-ipAddress pairings, as described above.

b. The IpAddress field within the BglNode table must be set to NULL:<br />

db2 "update bglnode set IpAddress = NULL"<br />

c. PostDiscovery must be re-run.<br />

2. Specify subnet information for each midplane in the BglMidplaneSubnet
   table, as in the following example:

   db2 "INSERT INTO BglMidplaneSubnet \
        (posInMachine,ipAddress,broadcast,nfsIpAddress) \
        VALUES('R000','172.27.138.254','172.27.138.255','172.27.96.117') "

   FIELD           DESCRIPTION                                     IONODE SCRIPT VARIABLE
   posInMachine    The midplane specification
   ipAddress       The IP address of the gateway for the subnet    BGL_GATEWAY
   broadcast       The broadcast address for the subnet            BGL_BROADCAST
   nfsIpAddress    The IP address of the /bgl file server for      BGL_FS
                   this midplane

   The ipAddress and broadcast fields are related to subnet support. The
   nfsIpAddress field enables multiple file servers to host /bgl, for
   improved performance.

Extracting the ramdisk from ramdisk.elf<br />

---------------------------------------<br />

For those who may wish to examine the contents of the I/O node's ramdisk,
here are instructions for extracting a ramdisk.img.gz from the ramdisk.elf
that is shipped with the Blue Gene driver, and mounting the ramdisk.

The ramdisk.img.gz is stored in the .data section of a standard ELF
image. It is stored as raw data with an 8-byte header that is needed at
runtime, which needs to be stripped off with the dd command.

cd <br />

objcopy --only-section .data --output-target binary \<br />

/bgl/BlueLight//ppc/bglsys/bin/ramdisk.elf image.tmp<br />

dd if=image.tmp of=ramdisk.img.gz bs=8 skip=1<br />

rm -f image.tmp<br />

The ramdisk image has been extracted from ramdisk.elf.<br />

Uncompress it.<br />

gunzip ramdisk.img.gz<br />

If gunzip works, skip this step and go to the "loopback mount" step.<br />

If gunzip fails with "unexpected end of file", truncate 1 byte from<br />

ramdisk.img.gz using the following commands, where xxxxxx is the size<br />

of ramdisk.img.gz minus 1 (as calculated from "ls -l ramdisk.img.gz"<br />

output):<br />

dd if=ramdisk.img.gz of=ramdisk.img2.gz count=1 bs=xxxxxx<br />



mv ramdisk.img2.gz ramdisk.img.gz<br />

Loopback mount ramdisk.img on directory "r".<br />

mkdir r<br />

mount -o loop ramdisk.img r<br />

Run commands to examine the ramdisk under directory "r".<br />

For example:<br />

find r -ls<br />
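When you have finished examining the contents, the loopback mount can be cleaned up again. This follow-up step is a generic suggestion rather than part of the README text:

umount r
rmdir r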




Abbreviations and acronyms<br />

BPM Bulk Power Module<br />

AIX Advanced Interactive Executive<br />

API Application Programming Interface<br />

ARP Address Resolution Protocol<br />

ASIC Application Specific Integrated Circuit

BGL Blue Gene Light<br />

BIST Built-In Self Test<br />

BLRTS Blue Gene Run Time System<br />

BPE Bulk Power Enclosure<br />

CDMF Commercial Data masking Facility<br />

CLI Command Line Interface<br />

CN Compute Node<br />

CNK Compute Node Kernel<br />

CPU Central Processing Unit<br />

CSM Cluster Systems Management

CWD Current Working Directory<br />

DES Data Encryption System<br />

DNS Domain Name System<br />

EOF End Of File<br />

FEN front-end Node<br />

FIFO First-In-First-Out<br />

FQDN Fully Qualified Domain Name<br />

GPFS General Parallel File System<br />

GPL GNU Public License<br />

GSA Global Storage Architecture<br />

GUI Graphical User Interface<br />

HPC High Performance computing<br />

HTML Hyper Text Markup Language<br />

IBM International Business Machines Corporation
IDEA International Data Encryption Algorithm
IEEE Institute of Electrical and Electronics Engineers

IO Input Output<br />

IP Internet Protocol<br />

ICMP Internet Control Message Protocol<br />

ITSO International Technical Support Organization
JTAG Joint Test Action Group

LED Light Emitting Diode<br />

LPAR Logical Partition<br />

LUN Logical Unit Number<br />

MCP Mini-Control Program<br />

MMCS Midplane Management Control System

MPI Message Passing Interface<br />

MSB Most Significant Bit<br />

MTU Maximum Transmission Unit<br />

NFS Network File System<br />

NSD Network Shared Disk<br />

NTP Network Time Protocol<br />

NUMA Non-Uniform Memory Access<br />

PGP Pretty Good Privacy<br />

PKI Public Key Infrastructure<br />

PPC Power PC®<br />

RAM Random Access Memory<br />

RAS Reliability Availability Serviceability<br />

RPC Remote Procedure Call<br />

RPM RedHat Package Manager<br />

RSA Rivest Shamir and Adelman<br />

RSCT Reliable Scalable Clustering Technology


RSH Remote Shell<br />

SHA Secure Hash Algorithm<br />

SLES SUSE Linux Enterprise Server<br />

SMP Symmetric Multi-Processing<br />

SN Service Node<br />

SP System Parallel<br />

SQL Structured Query Language<br />

SRAM Static Random Access Memory<br />

SSH Secure Shell<br />

SSL Secure Socket Layer<br />

TCP Transmission Control Protocol<br />

TCP/IP Transmission Control Protocol /<br />

Internet Protocol<br />

TWS Tivoli Workload Scheduler<br />

UDP User Datagram Protocol

UID User ID<br />

URL Universal Resource Locator<br />

VLSI Very Large Scale Integration<br />

XML Extensible Markup Language



Related publications<br />

IBM Redbooks<br />

The publications that we list in this section are considered particularly suitable for<br />

a more detailed discussion of the topics that we cover in this redbook.<br />

For information about ordering these publications, see “How to get IBM<br />

Redbooks” on page 454. Note that some of the documents referenced here<br />

might be available in softcopy only.<br />

► Unfolding the IBM eServer Blue Gene Solution, SG24-6686<br />

► Blue Gene/L: System Administration, SG24-7178<br />

Other publications<br />

These publications are also relevant as further information sources:<br />

► IBM General Parallel File System Concepts, Planning, and Installation Guide,
  GA22-7968-02
► IBM General Parallel File System Administration and Programming
  Reference, SA22-7967-02
► IBM General Parallel File System Problem Determination Guide,
  GA22-7969-02
► IBM LoadLeveler Using and Administering Guide, SA22-7881
► CSM for AIX 5L and Linux V1.5 Planning and Installation Guide,
  SA23-1344-01
► CSM for AIX 5L and Linux V1.5 Administration Guide, SA23-1343-01
► CSM for AIX 5L and Linux V1.5 Command and Technical Reference,
  SA23-1345-01
► RSCT Administration Guide, SA22-7889-10
► RSCT for Linux Technical Reference, SA22-7893-10


Online resources<br />

These Web sites and URLs are also relevant as further information sources:<br />

► GPFS Concepts, Planning, and Installation Guide, GA22-7968-02
  http://publib.boulder.ibm.com/infocenter/clresctr/topic/com.ibm.cluster.gpfs.doc/gpfs23/bl1ins10/bl1ins10.html
► GPFS Administration and Programming Reference, SA22-7967-02
  http://publib.boulder.ibm.com/infocenter/clresctr/topic/com.ibm.cluster.gpfs.doc/gpfs23/bl1adm10/bl1adm10.html
► GPFS Problem Determination Guide, GA22-7969-02
  http://publib.boulder.ibm.com/infocenter/clresctr/topic/com.ibm.cluster.gpfs.doc/gpfs23/bl1pdg10/bl1pdg10.html
► GPFS Documentation updates
  http://publib.boulder.ibm.com/infocenter/clresctr/topic/com.ibm.cluster.gpfs.doc/gpfs23_doc_updates/docerrata.html
► Cluster Systems Management documentation and updates
  http://www14.software.ibm.com/webapp/set2/sas/f/csm/home.html

How to get IBM Redbooks<br />
You can search for, view, or download Redbooks, Redpapers, Hints and Tips, draft publications, and Additional materials, as well as order hardcopy Redbooks or CD-ROMs, at this Web site:<br />
ibm.com/redbooks<br />
Help from IBM<br />
IBM Support and downloads<br />
ibm.com/support<br />
IBM Global Services<br />
ibm.com/services<br />



Index<br />

Symbols<br />

I/O node 218<br />

~/.rhosts 153–154<br />

A<br />

accountability 396<br />

allocate_block 162<br />

alternate Central Manager 171, 203<br />

Apache 120<br />

authentication 396, 398<br />

authorization 396<br />

authorized_keys 303, 402<br />

B<br />

back-end mpirun 150, 184<br />

Barrier 189<br />

Barrier and Global interrupt 21<br />

BG_ALLOW_LL_JOBS_ONLY 173, 183<br />

BG_CACHE_PARTIONS 173<br />

BG_CACHE_PARTITIONS 173<br />

BG_ENABLED 173<br />

BG_MIN_PARTITION_SIZE 173<br />

bgIO cluster 244, 251, 263<br />

BGL_DISTDIR 213<br />

BGL_EXPORTDIR 213<br />

BGL_OSDIR 213<br />

BGL_SITEDISTDIR 213<br />

BGLBASEPARTITION 67<br />

BGLBLOCK 36, 92<br />

BGLEVENTLOG 36<br />

BGLJOB 41<br />

bglmaster 31, 43, 62, 89, 115, 281, 283<br />

BGLMIDPLANE 68<br />

BGLNODECARDCOUNT 77<br />

BGLPROCESSORCARDCOUNT 77<br />

BGLSERVICEACTION 49<br />

BGLSERVICECARDCOUNT 70<br />

bglsysdb 287<br />

bgmksensor 390, 394<br />

BGWEB 57–59, 68, 71, 77, 80, 86, 108<br />

blc_powermodules 132<br />

blc_temperatures 132<br />

blc_voltages 132<br />

bll_lbist 133<br />

bll_lbist_linkreset 133<br />

bll_lbist_pgood 133<br />

bll_powermodules 134<br />

bll_temperatures 134<br />

bll_voltages 134<br />

block 93, 158, 162–163, 218<br />

Block Information 162<br />

Block information 123–124<br />

Block initialization 106<br />

Block monitoring 100<br />

blrts 7, 145<br />

blrts tool chain 145, 148, 322<br />

Blue Gene Light Runtime System 7<br />

Blue Gene/L 15, 199<br />

Blue Gene/L driver 339, 342<br />

Blue Gene/L libraries 173<br />

Blue Gene/L processes 56<br />

Blue Gene/L, core 56, 82–83<br />

Boot process 36<br />

booting a block 138<br />

BPE 15<br />

BPM 15, 451<br />

bridge API 170, 179–180<br />

bridge.config 176<br />

BRIDGE_CONFIG_FILE 155, 167, 175, 349, 351<br />

bs_trash_0 132<br />

bs_trash_1 132<br />

Bulk power enclosure 15<br />

Bulk power module 15<br />

Bulk power modules 114<br />

BusyBox 33<br />

C<br />

CableDiscovery 44, 46<br />

CDMF 397<br />

Central Manager 168–170, 178, 203, 375–376<br />

chkconfig 152<br />

ciod 34, 39, 162, 212, 274<br />

ciodb 31, 34, 41, 89, 136, 274, 280<br />

cipherList 259<br />

clock card 15, 68<br />



clock signal 11<br />

cluster 168<br />

Cluster Systems Management. See CSM.<br />

Cluster Wide File System 164<br />

CNK 7, 35, 38, 145, 212<br />

coaxial port 15<br />

collective communication 142–143, 189<br />

collective communication and computation 142<br />

collective network 21, 26, 32, 41<br />

compilers 141, 144<br />

compilers, IBM 145<br />

compute card 2, 7–8, 80, 127, 268<br />

Compute node kernel 7, 34, 212<br />

COMPUTENODES 81<br />

condition 386, 388<br />

control system server logs 99<br />

co-processor 7<br />

cryptographic techniques 395<br />

CSM 384<br />

CSM client 385<br />

CSM cluster 385<br />

D<br />

data encryption 396<br />

data signing 396<br />

database 29, 60<br />

database browser 119, 130<br />

database view 70<br />

db.properties 176, 287<br />

DB_PROPERTY 156, 167, 175, 349, 352<br />

DB2 29, 58, 87<br />

DB2 commands 161<br />

DB2 database 69, 119, 384<br />

DB2 statements 111<br />

DB2 stored procedure 391<br />

DB2 trigger 391<br />

db2cshrc 330<br />

db2iauto 88<br />

db2profile 156, 167, 330, 349<br />

db2set 88<br />

DES 397<br />

dgemm160 132<br />

dgemm160e 132<br />

dgemm3200 132<br />

dgemm3200e 132<br />

diagnostic data 31<br />

diagnostic tests 31<br />

diagnostics 106, 113, 119, 128, 131–132, 135<br />


digital signature 398<br />

discovery 12, 30, 43, 46, 50, 67<br />

discovery process 104<br />

dr_bitfail 132<br />

driver update 320<br />

dumpconv 235<br />

E<br />

emac_dg 132<br />

EndServiceAction 46, 49, 53, 122, 294<br />

environment data 30<br />

environmental information 114, 118, 126<br />

ethtool 84, 290<br />

event 386<br />

F<br />

fan card 13<br />

fan unit 14<br />

fans 114<br />

FEN (Front-End Node) 24, 32, 61, 103, 144<br />

File Server 217<br />

file systems 32, 64<br />

free_block 267<br />

front-end mpirun 150, 183<br />

Functional Ethernet 21<br />

functional network 24, 29, 56, 60, 267, 269<br />

G<br />

gcc compiler 148<br />

General Parallel File System. See GPFS.<br />

gensmallblock 95<br />

gi_single_chip 132<br />

gidcm 132<br />

global barrier and interrupt network 27–28<br />

GPFS 24, 56, 65, 211–212, 226, 230, 234, 244,<br />

266, 294, 303, 329, 346<br />

GPFS access patterns 226<br />

GPFS cross-cluster authentication 246–247<br />

GPFS Portability Layer 230, 321<br />

GPFS_STARTUP 229<br />

H<br />

hardware browser 121<br />

hardware monitor 113–114<br />

hardware problems 106<br />

Hash function 396<br />

hosts.equiv 153–154


I<br />

I/O card 77, 135<br />

I/O node 7–8, 24, 27, 33, 39, 62, 77, 95, 145, 149,<br />

168, 212, 237–238, 318<br />

I/O node boot sequence 212–213<br />

I²C 21–22, 24<br />

IBM compilers 145<br />

IBM XL compilers 145<br />

ibmcmp 214<br />

IDEA 398<br />

IDo 9, 22, 28, 36<br />

IDo chip 28, 30–31, 43<br />

IDo link 270<br />

idoproxy 36, 43, 89, 280–281, 283, 289<br />

idoproxydb 31, 136<br />

ifconfig 290<br />

Initializing the partition 182<br />

installation dist 214<br />

intake plenum 13<br />

Inter-Integrated Circuit 24<br />

IRecv 142<br />

J<br />

job command file 317<br />

job cycle 176<br />

job ID 193, 197–198<br />

job information 123, 125, 162<br />

job monitoring 100<br />

job runtime information 165<br />

job setup 157<br />

job state 363<br />

job states 163<br />

job status 96, 161–162<br />

job submission 36, 66, 141, 149, 158, 178, 349<br />

job tracking 158<br />

job_type 199<br />

JOBNAME 198<br />

jobstepid 193<br />

journaling file system 226<br />

JTAG 21–22<br />

K<br />

known_hosts 303, 402<br />

L<br />

ldconfig 173<br />

ldd 173<br />

link card 17, 22, 114<br />

link card chips 71<br />

linpack 133<br />

linuximg 33<br />

list_jobs 139<br />

llcancel 193<br />

llhold 366<br />

llinit 204<br />

llq 178, 193, 197, 203, 365<br />

llstatus 61, 172, 186, 188–189, 192, 203, 371<br />

llsubmit 36, 66, 178, 193, 198–199<br />

LoadL_admin 171, 203, 206<br />

LoadL_config 172, 178, 204, 207<br />

LoadLeveler 56, 66, 109, 124, 141, 162, 167–168,<br />

170, 205, 295, 358<br />

daemons 202<br />

job classes 205<br />

job command file 198<br />

job state 177<br />

job submission 178–179<br />

queue 182<br />

run queue 193<br />

LOCAL_CONFIG 205<br />

location 47<br />

location codes 3<br />

LocationString 46<br />

lscondition 386<br />

lsresponse 387<br />

lssensor 389<br />

lxtrace 235<br />

M<br />

massive parallel system 15<br />

Master 375<br />

MCP 7, 33, 35, 38, 212–213<br />

mem_l2_coherency 133<br />

Message Passing Interface 141–142<br />

methodology 2, 56, 106<br />

microloader 35, 38<br />

midplane 6, 8, 17, 22, 68–70, 75, 158, 189<br />

midplane information 123<br />

Midplane Management Control System 30, 149<br />

mini-control program 212<br />

mkcondition 390<br />

mkcondresp 386<br />

mkresponse 390<br />

mksensor 390<br />



mmauth 247–248, 261<br />

mmchconfig 247, 308<br />

mmchdisk 263<br />

MMCS 30, 141, 162<br />

MMCS console 113, 136<br />

mmcs_console 66, 96<br />

mmcs_db_console 91, 96, 136, 163, 220, 242, 252,<br />

267, 282–283, 295<br />

mmcs_db_server 31, 36, 39, 136, 282<br />

mmcs_server 89, 283<br />

MMCS_SERVER_IP 155, 166, 175, 349<br />

mmdsh 308<br />

mmfslinux 235<br />

mmgetstate 255, 298, 347–348<br />

mmlscluster 259<br />

mmremotefs 249–250, 257<br />

mmstartup 245<br />

monitoring tools 106<br />

MPI 56, 141, 147<br />

MPI library 142, 146<br />

MPI_Barrier 142–143<br />

MPI_Comm_rank 144<br />

MPI_Comm_size 144<br />

MPI_COMM_WORLD 142, 144<br />

MPI_Finalize 144<br />

MPI_Init 144<br />

MPI_IRecv 142<br />

MPI_Isend 142<br />

MPI_Recv 142<br />

MPI_Reduce 142<br />

MPI_Scatter 142<br />

MPI_Send 142<br />

mpicc 147<br />

mpicxx 147<br />

mpif77 147<br />

mpirun 36, 41, 66, 141, 149–150, 155, 158, 160,<br />

162–165, 170–171, 177, 183, 187, 200–201, 266,<br />

343, 347, 350<br />

only_test_protocol 356<br />

-verbose 370<br />

mpirun checklist 166<br />

ms_gen_short 133<br />

N<br />

naming convention 19<br />

Negotiator 168, 170, 189, 375<br />

NegotiatorLog 208<br />

netstat 208<br />


network switches 103<br />

network tuning parameters 223<br />

NFS 65, 90, 211–212, 215, 225, 266, 294, 296, 329<br />

NFS servers 215<br />

nfsd 219<br />

node card 8, 30, 48, 76, 104, 114<br />

node definition file 251<br />

node startup 212<br />

no-pty 155<br />

NSD 231<br />

NUMNODECARDS 77<br />

NUMSERVICECARDS 70<br />

O<br />

operational logs 113<br />

P<br />

pagepool 226, 244, 259, 261, 294, 297<br />

parallel 199<br />

parallel programming environment 141–142<br />

partition 180, 358<br />

partitions 158, 180, 197<br />

point-to-point 142<br />

point-to-point communication 142<br />

port mapper 294<br />

portmap 219<br />

PostDiscovery 43, 46, 54<br />

postdiscovery 50<br />

power_module_stress 133<br />

PowerPC 144<br />

PowerPC 440 7, 148<br />

PPC64 144<br />

ppcfloor 335<br />

predefined partitions 160<br />

PrepareForService 46–47, 50, 122, 289, 292, 294<br />

printf () 144<br />

problem definition 56<br />

problem determination methodology 55, 104, 254<br />

problem monitor 123<br />

processor card 80, 82<br />

Public Key Infrastructure (PKI) 396, 398<br />

Q<br />

quorum 244<br />

R<br />

rack 3, 104, 131, 158, 267


racks 66<br />

ramdisk 38<br />

RAS 30<br />

RAS data 30<br />

RAS event 36, 38, 108, 115, 118–119, 126–127,<br />

268<br />

Redbooks Web site 454<br />

Contact us xv<br />

Reliable Scalable Cluster Technology 385<br />

remote command execution 266, 399<br />

remote execution environment 151<br />

remote file copy 399<br />

remote shell 100, 151<br />

response 386, 388<br />

rpc.mountd 219<br />

RPM 147<br />

RSCT 385<br />

rsh 100<br />

rshd 153<br />

rundiag 135<br />

runtime 123<br />

S<br />

schedd 171, 185, 187<br />

scheduler 168<br />

scp 401<br />

secure shell 102, 151, 154, 383, 399<br />

secure socket layer 400<br />

sensor 388<br />

serial 199<br />

server logs 61<br />

service action 46, 121–122<br />

service card 11, 69, 114<br />

Service Network 21, 29, 267, 269–270, 273<br />

Service Node 22, 29, 32, 34, 44, 56, 59, 83, 103,<br />

135, 144, 155, 232, 237–238, 267, 273<br />

Service Node database 58<br />

serviceactionid 53<br />

sftp 401<br />

showmount 91, 218–219, 296<br />

site configuration 158<br />

site dist 214<br />

sitefs 217, 221, 259<br />

slogin 401<br />

SRAM 38<br />

ssh 401<br />

sshd 103, 155, 214, 334<br />

SSL 400<br />

Start daemon 168<br />

startcondresp 386, 389, 394<br />

startd 171–172, 182<br />

starter 172<br />

Starter process 168<br />

StarterLog 368<br />

status LED 12<br />

stderr 25<br />

stdin 25<br />

stdout 25<br />

submit_job 36, 66, 96, 141, 149, 267<br />

Symmetric key 396<br />

sysctl.conf 223<br />

system processes 267<br />

systemcontroller 43, 46, 50<br />

T<br />

TBGLIPPOOL 60<br />

TBGLJOB 111<br />

TBGLJOB_HISTORY 111<br />

TBLSERVICECARD 69<br />

ti_abist 133<br />

ti_edramabist 133<br />

ti_lbist 133<br />

Torus 17, 21, 26, 189<br />

tr_connectivity 133<br />

tr_loopback 133<br />

tr_multinode 133<br />

tracedev 235<br />

Triple DES 397<br />

ts_connectivity 133<br />

ts_loopback_0 133<br />

ts_loopback_1 133<br />

ts_multinode 133<br />

V<br />

virtual node 7<br />

W<br />

Web interface 113<br />

write_con 137, 221, 242<br />

X<br />

xinetd 152<br />



XL compilers 148<br />

XLC/XLF 148<br />

XLC/XLF compilers 148<br />

xlmass 149<br />

xlsmp 149<br />





IBM System Blue Gene Solution<br />
Problem Determination Guide<br />
Learn detailed procedures through illustrations<br />
Use sample scenarios as helpful resources<br />
Discover GPFS installation hints and tips<br />

SG24-7211-00 ISBN 0738496766<br />

Back cover<br />

This IBM Redbook is intended as a problem determination guide for system administrators in a High Performance Computing environment. It can help you find a solution to issues that you encounter on your IBM eServer Blue Gene system.<br />
This redbook includes a problem determination methodology along with the problem determination tools that are available with the basic IBM eServer Blue Gene Solution. It also discusses additional software components that are required for integrating your Blue Gene system in a complex computing environment.<br />
This redbook also describes a GPFS installation procedure that we used in our test environment, as well as several scenarios, developed following the proposed problem determination methodology, that describe possible issues and their resolution.<br />
Finally, this redbook includes a short introduction to integrating your Blue Gene system in a High Performance Computing environment managed by IBM Cluster Systems Management, as well as to using secure shell in such an environment.<br />

INTERNATIONAL TECHNICAL SUPPORT ORGANIZATION®<br />

BUILDING TECHNICAL INFORMATION BASED ON PRACTICAL EXPERIENCE<br />
IBM Redbooks are developed by the IBM International Technical Support Organization. Experts from IBM, Customers and Partners from around the world create timely technical information based on realistic scenarios. Specific recommendations are provided to help you implement IT solutions more effectively in your environment.<br />
For more information:<br />
ibm.com/redbooks
