Problem Determination Guide - Systems Group
Front cover

IBM System Blue Gene Solution:
Problem Determination Guide

Learn detailed procedures through illustrations

Use sample scenarios as helpful resources

Discover GPFS installation hints and tips

ibm.com/redbooks
Octavian Lascu
Peter F Custerson
Marty Fullam
Ravi K Komanduri
Dr. Thanh V Lam
Sean Saunders
Chris Stone
Shinsuke Ueyama
Dino Quintero
International Technical Support Organization

IBM System Blue Gene Solution: Problem Determination Guide

October 2006

SG24-7211-00
Note: Before using this information and the product it supports, read the information in "Notices" on page ix.

First Edition (October 2006)

This edition applies to Version 1, Release 2, Modification 1 of the Blue Gene/L driver (V1R2M1_020_2006-060110).

© Copyright International Business Machines Corporation 2006. All rights reserved.

Note to U.S. Government Users Restricted Rights -- Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
Contents

Notices . . . . . . . . . . . . . . . ix
Trademarks . . . . . . . . . . . . . . . x

Preface . . . . . . . . . . . . . . . xi
The team that wrote this redbook . . . . . . . . . . . . . . . xi
Become a published author . . . . . . . . . . . . . . . xiv
Comments welcome . . . . . . . . . . . . . . . xv

Chapter 1. Introduction . . . . . . . . . . . . . . . 1
1.1 Blue Gene/L system overview . . . . . . . . . . . . . . . 2
1.2 Hardware components of the Blue Gene/L system . . . . . . . . . . . . . . . 2
1.2.1 Racks . . . . . . . . . . . . . . . 3
1.2.2 Midplane . . . . . . . . . . . . . . . 6
1.2.3 Compute (processor) card . . . . . . . . . . . . . . . 7
1.2.4 I/O card . . . . . . . . . . . . . . . 7
1.2.5 Node card . . . . . . . . . . . . . . . 8
1.2.6 Service card . . . . . . . . . . . . . . . 11
1.2.7 Fan card . . . . . . . . . . . . . . . 13
1.2.8 Clock card . . . . . . . . . . . . . . . 15
1.2.9 Bulk power modules/enclosure . . . . . . . . . . . . . . . 15
1.2.10 Link card . . . . . . . . . . . . . . . 17
1.2.11 Rack, power module, card, and fan naming conventions . . . . . . . . . . . . . . . 19
1.3 Blue Gene/L networks . . . . . . . . . . . . . . . 21
1.3.1 Service network . . . . . . . . . . . . . . . 22
1.3.2 Functional network . . . . . . . . . . . . . . . 24
1.3.3 Three-dimensional torus (3D torus) . . . . . . . . . . . . . . . 26
1.3.4 Collective (tree) network . . . . . . . . . . . . . . . 26
1.3.5 Global barrier and interrupt network . . . . . . . . . . . . . . . 27
1.4 Service Node . . . . . . . . . . . . . . . 29
1.4.1 DB2 database . . . . . . . . . . . . . . . 29
1.4.2 Service Node system processes . . . . . . . . . . . . . . . 31
1.5 Front-End Node . . . . . . . . . . . . . . . 32
1.6 External file systems . . . . . . . . . . . . . . . 32
1.7 Blue Gene/L system software . . . . . . . . . . . . . . . 33
1.7.1 I/O node kernel . . . . . . . . . . . . . . . 33
1.7.2 I/O kernel ramdisk . . . . . . . . . . . . . . . 34
1.7.3 I/O kernel CIOD daemon . . . . . . . . . . . . . . . 34
1.7.4 Compute Node Kernel . . . . . . . . . . . . . . . 34
1.7.5 Microloader . . . . . . . . . . . . . . . 35
1.8 Boot process, job submission, and termination . . . . . . . . . . . . . . . 36
1.9 System discovery . . . . . . . . . . . . . . . 43
1.9.1 The bglmaster process . . . . . . . . . . . . . . . 43
1.9.2 SystemController . . . . . . . . . . . . . . . 43
1.9.3 Discovery . . . . . . . . . . . . . . . 43
1.9.4 PostDiscovery . . . . . . . . . . . . . . . 43
1.9.5 CableDiscovery . . . . . . . . . . . . . . . 44
1.10 Discovering your system . . . . . . . . . . . . . . . 44
1.10.1 Discovery logs . . . . . . . . . . . . . . . 46
1.11 Service actions . . . . . . . . . . . . . . . 46
1.11.1 PrepareForService . . . . . . . . . . . . . . . 46
1.11.2 EndServiceAction . . . . . . . . . . . . . . . 50
1.12 Turning off the system . . . . . . . . . . . . . . . 52
1.13 Turning on the system . . . . . . . . . . . . . . . 53

Chapter 2. Problem determination methodology . . . . . . . . . . . . . . . 55
2.1 Introduction . . . . . . . . . . . . . . . 56
2.2 Identifying the installed system . . . . . . . . . . . . . . . 57
2.2.1 Blue Gene/L Web interface (BGWEB) . . . . . . . . . . . . . . . 57
2.2.2 DB2 select statements of the DB2 database on the SN . . . . . . . . . . . . . . . 58
2.2.3 Network diagram . . . . . . . . . . . . . . . 59
2.2.4 Service Node . . . . . . . . . . . . . . . 59
2.2.5 Front-End Nodes . . . . . . . . . . . . . . . 61
2.2.6 Control system server logs . . . . . . . . . . . . . . . 61
2.2.7 File systems (NFS and GPFS) . . . . . . . . . . . . . . . 64
2.2.8 Job submission . . . . . . . . . . . . . . . 66
2.2.9 Racks . . . . . . . . . . . . . . . 66
2.2.10 Midplanes . . . . . . . . . . . . . . . 68
2.2.11 Clock cards . . . . . . . . . . . . . . . 68
2.2.12 Service cards . . . . . . . . . . . . . . . 69
2.2.13 Link cards . . . . . . . . . . . . . . . 70
2.2.14 Link card chips . . . . . . . . . . . . . . . 71
2.2.15 Link summary . . . . . . . . . . . . . . . 74
2.2.16 Node cards . . . . . . . . . . . . . . . 76
2.2.17 I/O cards . . . . . . . . . . . . . . . 77
2.2.18 Compute or processor cards . . . . . . . . . . . . . . . 80
2.3 Sanity checks for installed components . . . . . . . . . . . . . . . 82
2.3.1 Check the operating system on the SN . . . . . . . . . . . . . . . 83
2.3.2 Check communication services on the SN . . . . . . . . . . . . . . . 84
2.3.3 Check that BGWEB is running . . . . . . . . . . . . . . . 86
2.3.4 Check that DB2 is working . . . . . . . . . . . . . . . 87
2.3.5 Check that BGLMaster and its child daemons are running . . . . . . . . . . . . . . . 89
2.3.6 Check the NFS subsystem on the SN . . . . . . . . . . . . . . . 90
2.3.7 Check that a block can be allocated using mmcs_console . . . . . . . . . . . . . . . 91
2.3.8 Check that a simple job can run (mmcs_console) . . . . . . . . . . . . . . . 96
2.3.9 Check the control system server logs . . . . . . . . . . . . . . . 99
2.3.10 Check remote shell . . . . . . . . . . . . . . . 100
2.3.11 Check remote command execution with secure shell . . . . . . . . . . . . . . . 102
2.3.12 Check the network switches . . . . . . . . . . . . . . . 103
2.3.13 Check the physical Blue Gene/L racks configuration . . . . . . . . . . . . . . . 104
2.4 Problem determination methodology . . . . . . . . . . . . . . . 104
2.4.1 Define the problem . . . . . . . . . . . . . . . 107
2.4.2 Identify the Blue Gene/L system . . . . . . . . . . . . . . . 107
2.4.3 Identify the problem area . . . . . . . . . . . . . . . 107
2.5 Identifying core Blue Gene/L system problems . . . . . . . . . . . . . . . 108
2.6 Identifying IBM LoadLeveler jobs to BGL jobs (CLI) . . . . . . . . . . . . . . . 109

Chapter 3. Problem determination tools . . . . . . . . . . . . . . . 113
3.1 Introduction . . . . . . . . . . . . . . . 114
3.2 Hardware monitor . . . . . . . . . . . . . . . 114
3.2.1 Collectable information . . . . . . . . . . . . . . . 114
3.2.2 Starting the tool . . . . . . . . . . . . . . . 115
3.2.3 Checking the results . . . . . . . . . . . . . . . 117
3.3 Web interface . . . . . . . . . . . . . . . 119
3.3.1 Starting the tool . . . . . . . . . . . . . . . 120
3.3.2 Checking the result . . . . . . . . . . . . . . . 121
3.4 Diagnostics . . . . . . . . . . . . . . . 131
3.4.1 Test cases . . . . . . . . . . . . . . . 132
3.4.2 Starting the tool . . . . . . . . . . . . . . . 134
3.4.3 Checking the results . . . . . . . . . . . . . . . 135
3.5 MMCS console . . . . . . . . . . . . . . . 136
3.5.1 Starting the tool . . . . . . . . . . . . . . . 136
3.5.2 Checking the results . . . . . . . . . . . . . . . 137

Chapter 4. Running jobs . . . . . . . . . . . . . . . 141
4.1 Parallel programming environment . . . . . . . . . . . . . . . 142
4.2 Compilers . . . . . . . . . . . . . . . 144
4.2.1 The blrts tool chain . . . . . . . . . . . . . . . 145
4.2.2 The IBM XLC/XLF compilers . . . . . . . . . . . . . . . 148
4.3 Submitting jobs using built-in tools . . . . . . . . . . . . . . . 149
4.3.1 Submitting a job using MMCS . . . . . . . . . . . . . . . 149
4.3.2 The mpirun program . . . . . . . . . . . . . . . 150
4.3.3 Example of submitting a job using mpirun . . . . . . . . . . . . . . . 163
4.4 IBM LoadLeveler . . . . . . . . . . . . . . . 167
4.4.1 LoadLeveler overview . . . . . . . . . . . . . . . 167
4.4.2 Principles of operation in a Blue Gene/L environment . . . . . . . . . . . . . . . 170
4.4.3 How LoadLeveler plugs into Blue Gene/L . . . . . . . . . . . . . . . 172
4.4.4 Configuring LoadLeveler for Blue Gene/L . . . . . . . . . . . . . . . 172
4.4.5 Making the Blue Gene/L libraries available to LoadLeveler . . . . . . . . . . . . . . . 173
4.4.6 Setting Blue Gene/L specific environment variables . . . . . . . . . . . . . . . 175
4.4.7 LoadLeveler and the Blue Gene/L job cycle . . . . . . . . . . . . . . . 176
4.4.8 LoadLeveler job submission process . . . . . . . . . . . . . . . 178
4.4.9 LoadLeveler checklist . . . . . . . . . . . . . . . 186
4.4.10 Updating LoadLeveler in a Blue Gene/L environment . . . . . . . . . . . . . . . 209

Chapter 5. File systems . . . . . . . . . . . . . . . 211
5.1 NFS and GPFS . . . . . . . . . . . . . . . 212
5.1.1 I/O node boot sequence . . . . . . . . . . . . . . . 212
5.1.2 Additional scripts in I/O node boot sequence . . . . . . . . . . . . . . . 214
5.2 NFS . . . . . . . . . . . . . . . 215
5.2.1 How NFS plugs into a Blue Gene/L system . . . . . . . . . . . . . . . 215
5.2.2 Adding an NFS file system to the Blue Gene/L system . . . . . . . . . . . . . . . 216
5.2.3 NFS problem determination methodology . . . . . . . . . . . . . . . 217
5.2.4 NFS checklists . . . . . . . . . . . . . . . 218
5.3 GPFS . . . . . . . . . . . . . . . 224
5.3.1 When to use GPFS . . . . . . . . . . . . . . . 225
5.3.2 Features and concepts of GPFS . . . . . . . . . . . . . . . 226
5.3.3 GPFS requirements for Blue Gene/L . . . . . . . . . . . . . . . 227
5.3.4 GPFS supported levels . . . . . . . . . . . . . . . 229
5.3.5 How GPFS plugs in . . . . . . . . . . . . . . . 230
5.3.6 Creating the GPFS file system on a remote cluster (gpfsNSD) . . . . . . . . . . . . . . . 232
5.3.7 Creating a GPFS cluster on Blue Gene/L (bgIO) . . . . . . . . . . . . . . . 232
5.3.8 Cross mounting the GPFS file system on to the Blue Gene/L cluster . . . . . . . . . . . . . . . 246
5.3.9 GPFS problem determination methodology . . . . . . . . . . . . . . . 254
5.3.10 GPFS checklists . . . . . . . . . . . . . . . 255
5.3.11 References . . . . . . . . . . . . . . . 264

Chapter 6. Scenarios . . . . . . . . . . . . . . . 265
6.1 Introduction . . . . . . . . . . . . . . . 266
6.2 Blue Gene/L core system scenarios . . . . . . . . . . . . . . . 267
6.2.1 Hardware error: Compute card error . . . . . . . . . . . . . . . 268
6.2.2 Functional network: Defective cable . . . . . . . . . . . . . . . 269
6.2.3 Service network: Defective cable . . . . . . . . . . . . . . . 269
6.2.4 Service Node functional network interface down . . . . . . . . . . . . . . . 271
6.2.5 SN service network interface down . . . . . . . . . . . . . . . 271
6.2.6 The /bgl file system full on the SN (no GPFS) . . . . . . . . . . . . . . . 273
6.2.7 The / file system full on the SN . . . . . . . . . . . . . . . 274
6.2.8 The /tmp file system is full on the SN . . . . . . . . . . . . . . . 275
6.2.9 The ciodb daemon is not running on the SN . . . . . . . . . . . . . . . 276
6.2.10 The idoproxy daemon not running on the SN . . . . . . . . . . . . . . . 280
6.2.11 The mmcs_server is not running on the SN . . . . . . . . . . . . . . . 282
6.2.12 DB2 not started on the SN . . . . . . . . . . . . . . . 284
6.2.13 The bglsysdb user OS password changed (Linux) . . . . . . . . . . . . . . . 287
6.2.14 Uncontrolled rack power off . . . . . . . . . . . . . . . 289
6.3 File system scenarios . . . . . . . . . . . . . . . 294
6.3.1 Port mapper daemon not running on the SN . . . . . . . . . . . . . . . 295
6.3.2 NFS daemon not running on the SN . . . . . . . . . . . . . . . 296
6.3.3 GPFS pagepool (wrongly) set to 512MB on bgIO cluster nodes . . . . . . . . . . . . . . . 297
6.3.4 Secure shell (ssh) is broken . . . . . . . . . . . . . . . 303
6.3.5 The /bgl file system full (Blue Gene/L uses GPFS) . . . . . . . . . . . . . . . 312
6.3.6 Installing new Blue Gene/L driver code (driver update) . . . . . . . . . . . . . . . 320
6.3.7 Duplicate IP addresses in /etc/hosts . . . . . . . . . . . . . . . 343
6.3.8 Missing I/O node in /etc/hosts . . . . . . . . . . . . . . . 345
6.3.9 Adding an extra alias for the SN in /etc/hosts . . . . . . . . . . . . . . . 347
6.4 Job submission scenarios . . . . . . . . . . . . . . . 349
6.4.1 The mpirun command: scenarios description . . . . . . . . . . . . . . . 349
6.4.2 The mpirun command: environment variables not set . . . . . . . . . . . . . . . 350
6.4.3 The mpirun command: incorrect remote command execution (rsh) setup . . . . . . . . . . . . . . . 355
6.4.4 LoadLeveler: scenarios description . . . . . . . . . . . . . . . 358
6.4.5 LoadLeveler: job failed . . . . . . . . . . . . . . . 358
6.4.6 LoadLeveler: job in hold state . . . . . . . . . . . . . . . 363
6.4.7 LoadLeveler: job disappears . . . . . . . . . . . . . . . 369
6.4.8 LoadLeveler: Blue Gene/L is absent . . . . . . . . . . . . . . . 371
6.4.9 LoadLeveler: LoadLeveler cannot start . . . . . . . . . . . . . . . 375

Chapter 7. Additional topics . . . . . . . . . . . . . . . 383
7.1 Cluster Systems Management . . . . . . . . . . . . . . . 384
7.1.1 Overview of CSM . . . . . . . . . . . . . . . 384
7.1.2 Monitoring the Blue Gene/L database with CSM . . . . . . . . . . . . . . . 386
7.1.3 Customizing the monitoring capabilities of CSM . . . . . . . . . . . . . . . 387
7.1.4 Defining your own CSM monitoring constructs . . . . . . . . . . . . . . . 390
7.1.5 Miscellaneous related information . . . . . . . . . . . . . . . 394
7.1.6 Conclusion . . . . . . . . . . . . . . . 394
7.2 Secure shell . . . . . . . . . . . . . . . 395
7.2.1 Basic cryptography . . . . . . . . . . . . . . . 395
7.2.2 Secure shell basics . . . . . . . . . . . . . . . 399
7.2.3 Sample configuration in a cluster environment . . . . . . . . . . . . . . . 402
7.2.4 Using ssh in a Blue Gene/L environment . . . . . . . . . . . . . . . 406

Appendix A. Installing and setting up LoadLeveler for Blue Gene/L . . . . . . . . . . . . . . . 409
Installing LoadLeveler on the SN and FENs . . . . . . . . . . . . . . . 410
Obtaining the rpms . . . . . . . . . . . . . . . 410
Installing the rpms . . . . . . . . . . . . . . . 411
Setting up the LoadLeveler cluster . . . . . . . . . . . . . . . 411
Enabling Blue Gene/L capabilities in LoadLeveler . . . . . . . . . . . . . . . 413
Setting Blue Gene/L specific environment variables . . . . . . . . . . . . . . . 413

Appendix B. The sitefs file . . . . . . . . . . . . . . . 423
The /bgl/dist/etc/rc.d/init.d/sitefs file . . . . . . . . . . . . . . . 424

Appendix C. The ionode.README file . . . . . . . . . . . . . . . 431
The /bgl/BlueLight/ppcfloor/docs/ionode.README file . . . . . . . . . . . . . . . 432

Abbreviations and acronyms . . . . . . . . . . . . . . . 451

Related publications . . . . . . . . . . . . . . . 453
IBM Redbooks . . . . . . . . . . . . . . . 453
Other publications . . . . . . . . . . . . . . . 453
Online resources . . . . . . . . . . . . . . . 454
How to get IBM Redbooks . . . . . . . . . . . . . . . 454
Help from IBM . . . . . . . . . . . . . . . 454

Index . . . . . . . . . . . . . . . 455
Notices

This information was developed for products and services offered in the U.S.A.

IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service.

IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not give you any license to these patents. You can send license inquiries, in writing, to:

IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-1785 U.S.A.

The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions; therefore, this statement may not apply to you.

This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice.

Any references in this information to non-IBM Web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this IBM product and use of those Web sites is at your own risk.

IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you.

Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.

This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental.

COPYRIGHT LICENSE:

This information contains sample application programs in source language, which illustrate programming techniques on various operating platforms. You may copy, modify, and distribute these sample programs in any form without payment to IBM, for the purposes of developing, using, marketing or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs.
Trademarks

The following terms are trademarks of the International Business Machines Corporation in the United States, other countries, or both:

eServer
ibm.com®
AIX 5L
AIX®
Blue Gene®
DB2®
IBM®
LoadLeveler®
NUMA-Q®
Power PC®
PowerPC®
POWER
POWER4
POWER5
Redbooks
Redbooks (logo)
Sequent®
System p
Tivoli®
1350

The following terms are trademarks of other companies:

Java, and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.

Microsoft, Windows, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.

Intel, Intel logo, Intel Inside logo, and Intel Centrino logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States, other countries, or both.

UNIX is a registered trademark of The Open Group in the United States and other countries.

Linux is a trademark of Linus Torvalds in the United States, other countries, or both.

Other company, product, or service names may be trademarks or service marks of others.
Preface<br />
This IBM® Redbook is intended as a problem determination guide for system<br />
administrators in a High Performance Computing environment. It can help you<br />
find a solution to issues that you encounter on your IBM eServer Blue Gene®<br />
system.<br />
This redbook presents an architectural overview of the IBM eServer Blue Gene<br />
Solution, with some of the principles that have been used to design this<br />
revolutionary supercomputer, a description of the hardware and software<br />
environment that compose this solution, along with a short description of each<br />
component and how to identify them in an installed system.<br />
This redbook also includes a problem determination methodology that we<br />
developed during our residency, along with the problem determination tools that<br />
are available with the basic IBM eServer Blue Gene Solution. It also discusses<br />
additional software components that are required for integrating your Blue Gene<br />
system in a complex computing environment. These components include file<br />
systems (NFS and GPFS) and job submission tools (mpirun and IBM<br />
LoadLeveler®).<br />
This redbook also describes a GPFS installation procedure that we used in our<br />
test environment, as well as several scenarios, developed by following the<br />
proposed problem determination methodology, that describe possible issues<br />
and their resolution.<br />
Finally, this redbook includes a short introduction to integrating your<br />
Blue Gene system into a High Performance Computing environment managed by<br />
IBM Cluster Systems Management, as well as an introduction to using<br />
secure shell in such an environment.<br />
The team that wrote this redbook<br />
This redbook was produced by a team of specialists from around the world<br />
working at the International Technical Support Organization (ITSO), Austin<br />
Center.<br />
Octavian Lascu is a Project Leader at the ITSO, Poughkeepsie Center. He<br />
writes extensively and teaches IBM classes worldwide on all areas of IBM<br />
System p and Linux® clusters. Before joining the ITSO, Octavian worked in<br />
IBM Global Services Romania as a software and hardware Services Manager.<br />
He holds a Master's Degree in Electronic Engineering from the Polytechnical<br />
Institute in Bucharest and is also an IBM Certified Advanced Technical Expert in<br />
AIX/PSSP/HACMP. He has worked with IBM since 1992.<br />
Peter F Custerson is a Product Support Specialist based in Farnborough,<br />
United Kingdom. He has worked for IBM for nine years, four years for the former<br />
Sequent® organization and five years for the IBM UK Unix Support Centre. He<br />
currently is the High Performance Computing (HPC) Technical Advisor, and<br />
concentrates on customer support issues with the HPC product set for our<br />
region's HPC customers. This includes Cluster 1600, Cluster 1350, and Blue Gene<br />
customers. He holds an honors degree in Computer Studies from the University<br />
of Glamorgan in the United Kingdom.<br />
Marty Fullam is a software engineer on the Cluster Systems Management<br />
development team. He is the architect and lead developer for the CSM Blue<br />
Gene support project. He joined IBM in Poughkeepsie, New York, in 1982,<br />
working on electronic design automation tools. He holds a B.E. in Electrical<br />
Engineering from SUNY Stony Brook and an M.S. in Computer Engineering from<br />
Syracuse University.<br />
Ravi K Komanduri is a software engineer in the High Performance Computing<br />
group of IBM Systems and Technology Labs, India. He joined IBM in 2004 and<br />
has about five and a half years of experience in High Performance Computing<br />
and Grid technologies. His areas of expertise include parallel programming,<br />
benchmarking and performance analysis, and developing software tools on<br />
clusters. He is currently involved in functional testing of the Blue Gene control<br />
system. He holds a bachelor's degree in Computer Science from Jawaharlal Nehru<br />
Technological University, Hyderabad, India.<br />
Dr. Thanh V Lam is a software engineer in Cluster System Test. He leads a<br />
team of test engineers in testing cluster software and hardware on multiple<br />
platforms. His areas of expertise include LoadLeveler, high performance<br />
computing, parallel applications, and Blue Gene. He joined the IBM High<br />
Performance Supercomputing Lab, Kingston, New York, in 1988 and started<br />
working in service processing for the early scalable parallel systems known as<br />
SP. He holds a Doctor of Professional Studies degree in Computing from Pace<br />
University, White Plains, New York.<br />
Sean Saunders is an HPC Product Support Specialist working for ITS in IBM<br />
UK. He joined IBM in 2000, initially working on NUMA-Q® systems and now<br />
mainly supporting Cluster 1600, 1350 and Blue Gene systems, including the<br />
HPC software stack. He also supports AIX® and Linux. He holds a B.Sc.(Hons)<br />
degree in Computer Science from Kingston University, England.<br />
Chris Stone is a Senior IT Specialist based at IBM Hursley Park, United Kingdom.<br />
He works as part of the High Performance Computing Services Team based in<br />
the UK mainly on storage systems for customers. He has been working for IBM<br />
for over 20 years and in that time has gained a wide range of experience in<br />
monitor hardware development, software development, and customer services.<br />
He has 7 years of experience in designing and installing GPFS storage systems<br />
for customers and has recently gained experience with Blue Gene systems by<br />
leading the installation of two Blue Gene systems in Europe. He holds an<br />
Honours degree in Electrical and Electronic engineering from Bristol University in<br />
the UK.<br />
Shinsuke Ueyama is a software engineer for Engineering and Technology<br />
Services in IBM Japan. He joined IBM in 2005 and works mainly on supporting<br />
Blue Gene systems; he recently participated in the installation of a Blue Gene<br />
system in Japan. He holds a Master's degree in Information Science and<br />
Technology from The University of Tokyo, Japan.<br />
Dino Quintero is a Consulting IT Specialist at the ITSO in Poughkeepsie, New<br />
York. Before joining the ITSO, he worked as a Performance Analyst for the<br />
Enterprise Systems Group, and as a Disaster Recovery Architect for IBM Global<br />
Services. His areas of expertise include disaster recovery and IBM System p<br />
clustering solutions. He is an IBM Certified Professional on IBM System p<br />
technologies and also certified on System p system administration and System p<br />
clustering technologies. Currently, he leads technical teams that deliver IBM<br />
Redbook solutions on System p clustering technologies and technical workshops<br />
worldwide.<br />
Thanks to the following people for their contributions to this project:<br />
Steve Mearns<br />
IBM Portsmouth, UK<br />
Randy A. Brewster<br />
Puneet Chaudhary<br />
Richard Coppinger<br />
Alexander Druyan<br />
Bruce Hempel<br />
Steve Normann<br />
Edwin Varella<br />
IBM Poughkeepsie, NY<br />
Lynn A Boger<br />
Mark Campana<br />
Cathy Cebell<br />
Thomas A. Budnik<br />
Darwin Dumonceaux<br />
Frank Ingram<br />
Randal Massot<br />
Mike Nelson<br />
Jeff Parker<br />
Karl Solie<br />
Mike Woiwood<br />
IBM Rochester, MN<br />
Tom Engelsiepen<br />
IBM San Jose, CA<br />
Mark Mendell<br />
IBM Toronto, Canada<br />
Marc B Dombrowa<br />
David M. Singer<br />
IBM Yorktown Heights, NY<br />
CSM for Blue Gene Development Team:<br />
Ling Gao<br />
IBM Poughkeepsie, NY<br />
ITSO Editing team:<br />
Ella Buslovich<br />
IBM Poughkeepsie, NY<br />
Debbie Willmschen<br />
IBM Raleigh, NC<br />
Become a published author<br />
Join us for a two- to six-week residency program! Help write an IBM Redbook<br />
dealing with specific products or solutions, while getting hands-on experience<br />
with leading-edge technologies. You'll team with IBM technical professionals,<br />
Business Partners or customers.<br />
Your efforts will help increase product acceptance and customer satisfaction. As<br />
a bonus, you'll develop a network of contacts in IBM development labs, and<br />
increase your productivity and marketability.<br />
Find out more about the residency program, browse the residency index, and<br />
apply online at:<br />
ibm.com/redbooks/residencies.html<br />
Comments welcome<br />
Your comments are important to us!<br />
We want our Redbooks to be as helpful as possible. Send us your comments<br />
about this or other Redbooks in one of the following ways:<br />
► Use the online Contact us review redbook form found at:<br />
ibm.com/redbooks<br />
► Send your comments in an e-mail to:<br />
redbook@us.ibm.com<br />
► Mail your comments to:<br />
IBM Corporation, International Technical Support Organization<br />
Dept. HYTD Mail Station P099<br />
2455 South Road<br />
Poughkeepsie, NY 12601-5400<br />
Chapter 1. Introduction<br />
This book discusses diagnostic and problem determination methodologies for an<br />
IBM eServer Blue Gene Solution (also known as Blue Gene/L). In this chapter,<br />
we give a brief overview of Blue Gene/L architecture and describe the networks<br />
and the system boot process.<br />
1.1 Blue Gene/L system overview<br />
In this book, we investigate and describe symptoms, techniques, and<br />
methodologies for tackling issues that you might encounter while using your Blue<br />
Gene/L system. However, we do not describe in detail the hardware or<br />
application porting and running process for Blue Gene/L. For that purpose, we<br />
recommend the following IBM Redbooks:<br />
► Blue Gene/L: Hardware Overview and Planning, SG24-6796<br />
► Blue Gene/L: System Administration, SG24-7178<br />
► Unfolding the IBM eServer Blue Gene Solution, SG24-6686<br />
To begin using this book and solving any issues that you have with your system,<br />
you should have an understanding of your Blue Gene/L hardware, software,<br />
network, and boot process, as well as the location codes and LED status<br />
reported by the system. This type of information is described in this chapter.<br />
Later chapters discuss problem solving and include check lists for the basic Blue<br />
Gene/L system as well as other components, such as GPFS and LoadLeveler.<br />
1.2 Hardware components of the Blue Gene/L system<br />
The current configuration of Blue Gene/L is built from dual core processors<br />
placed in pairs on a Compute Card together with 1 GB of RAM (512 MB for each<br />
dual core CPU), as shown in Figure 1-1 on page 3. The configuration includes:<br />
► 16 Compute Cards installed in a node card (32 processors)<br />
► 16 Node Cards (512 processors) installed into a dual-sided midplane (1/2<br />
rack)<br />
► Two midplanes installed in a rack (1024 processors)<br />
You can link the racks together to a maximum of 64 racks (65536 processors).<br />
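The packaging arithmetic above can be multiplied out; the following sketch (illustrative only, not part of any Blue Gene tooling) confirms the processor counts stated in the list:

```python
# Multiply out the Blue Gene/L packaging hierarchy described above.
CHIPS_PER_COMPUTE_CARD = 2        # two dual-core PowerPC chips per compute card
COMPUTE_CARDS_PER_NODE_CARD = 16
NODE_CARDS_PER_MIDPLANE = 16
MIDPLANES_PER_RACK = 2
MAX_RACKS = 64

def processors(racks: int) -> int:
    """Number of processor chips in a system of `racks` racks."""
    per_rack = (CHIPS_PER_COMPUTE_CARD
                * COMPUTE_CARDS_PER_NODE_CARD
                * NODE_CARDS_PER_MIDPLANE
                * MIDPLANES_PER_RACK)
    return racks * per_rack

print(processors(1))         # 1024 processors per rack
print(processors(MAX_RACKS)) # 65536 processors in a full 64-rack system
```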
1.2.1 Racks<br />
Figure 1-1 Blue Gene/L system (packaging build-up):<br />
► Chip: 2 processors, 2.8/5.6 GF/s, 4 MB cache<br />
► Compute card: 2 chips (1x2x1), 5.6/11.2 GF/s, 1.0 GB RAM<br />
► Node card: 32 chips (4x4x2), 16 compute and 0-2 I/O cards, 90/180 GF/s, 16 GB RAM<br />
► Rack: 32 node cards, 2.8/5.6 TF/s, 512 GB RAM<br />
► System: 64 racks (64x32x32), 180/360 TF/s, 32 TB RAM<br />
The following paragraphs describe each of the hardware elements in a Blue<br />
Gene/L system.<br />
The hardware that we discuss in this chapter is installed in racks. The current<br />
maximum number of racks in a Blue Gene/L system is 64. Each rack has its<br />
own location code that is seen in system logs and RAS events. You need to learn<br />
these locations to determine problems efficiently.<br />
Figure 1-2 on page 4 and Figure 1-3 on page 5 show where the hardware<br />
components are installed in the rack and the location codes. These two figures<br />
show a front and a back view for each side of the midplane, and each rack has<br />
two midplanes installed.<br />
Note: In the hardware descriptions that we include here, you see two location<br />
codes per item. The first code refers to how the Blue Gene/L system software<br />
references the location, and the second code is how the hardware is labeled.<br />
(Hardware references begin with the letter J.)<br />
Figure 1-2 Card positions in the front of the rack with location names<br />
Figure 1-3 Card positions in the back of the rack with location names<br />
1.2.2 Midplane<br />
As the name suggests, the midplane sits in the middle of the rack (in vertical<br />
position). There are two midplanes per rack. Position 0 is at the bottom, and<br />
position 1 is at the top. All nodes, link, service, and fan cards plug into the<br />
midplane (see Figure 1-4).<br />
The midplane also provides the communications infrastructure by which the<br />
components talk to each other. The network services provided by the node,<br />
service cards, and midplane are discussed in 1.3, “Blue Gene/L networks” on<br />
page 21.<br />
Figure 1-4 Midplane position in the rack<br />
1.2.3 Compute (processor) card<br />
1.2.4 I/O card<br />
The compute (processor) card, shown in Figure 1-5, is the basic building block of<br />
a Blue Gene/L system. It comprises two 700 MHz 32-bit PowerPC® 440x5<br />
dual-core processors and 1 GB of RAM (512 MB per dual-core CPU).<br />
These nodes are the workhorses of the system, on which applications run. One<br />
compute card provides two or four nodes, depending on whether you select to<br />
run in communication co-processor (co) or virtual node (vn) mode. In co mode,<br />
one core of each PowerPC chip runs the application, while the other handles the<br />
message passing. In vn mode, both cores run the application and perform the<br />
message passing duties.<br />
The node runs the compute node kernel (CNK), which is a proprietary IBM kernel<br />
that is optimized for the Blue Gene/L environment. This kernel is single-user and<br />
single-process, supports up to two threads, and does not implement paging (in<br />
this configuration, virtual memory is limited to real memory). It supports a<br />
subset (about 40%) of the Linux system calls. The CNK is also<br />
known as the Blue Gene Light Runtime System (BLRTS).<br />
Figure 1-5 Compute (processor) card<br />
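The node counts for the two modes described above can be sketched as follows. The 256 MB-per-node figure for vn mode is an assumption that the card's memory splits evenly between the two cores of each chip; the mode names are the co/vn abbreviations used in the text:

```python
# Nodes and memory per compute card in the two execution modes.
RAM_PER_CARD_MB = 1024   # 1 GB per compute card
CHIPS_PER_CARD = 2

def nodes_per_card(mode: str) -> int:
    # co: one node per chip (the second core handles message passing)
    # vn: both cores of each chip run the application as separate nodes
    return {"co": CHIPS_PER_CARD, "vn": 2 * CHIPS_PER_CARD}[mode]

def ram_per_node_mb(mode: str) -> int:
    # Assumes the card's memory divides evenly among its nodes.
    return RAM_PER_CARD_MB // nodes_per_card(mode)

print(nodes_per_card("co"), ram_per_node_mb("co"))  # 2 nodes, 512 MB each
print(nodes_per_card("vn"), ram_per_node_mb("vn"))  # 4 nodes, 256 MB each
```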
The I/O card is similar to the compute card except that these nodes have more<br />
memory (currently 2 GB per card, or 1 GB per node) and have a specific use<br />
within the system. The I/O card runs a Linux version (2.4 uni-processor kernel)<br />
known as the mini-control program (MCP). The MCP has been altered from a<br />
standard Linux distribution to include support for the Blue Gene/L hardware<br />
(including an altered kernel). There is one instance of the MCP per chip, giving<br />
two nodes per I/O card.<br />
Important: Even though the I/O card is a dual core CPU, only one processor<br />
is used to run the I/O MCP.<br />
1.2.5 Node card<br />
As an application runs on a compute node, its I/O is routed to the outside world<br />
through the integrated communication hardware on the I/O card. The I/O card<br />
handles I/O operations on behalf of compute nodes (groups of compute nodes).<br />
This grouping of compute nodes to an assigned I/O card is known as its<br />
processor set or pset.<br />
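The pset grouping just described can be sketched as a simple partition. The helper name and the node labels are hypothetical, not Blue Gene/L APIs, and the actual pset ratio depends on how many I/O cards are installed:

```python
# Illustrative sketch of the pset idea: each I/O node serves a fixed-size
# group of compute nodes.
def build_psets(compute_nodes, io_nodes):
    """Partition compute nodes evenly across I/O nodes (one pset per I/O node)."""
    if len(compute_nodes) % len(io_nodes):
        raise ValueError("compute nodes must divide evenly among I/O nodes")
    size = len(compute_nodes) // len(io_nodes)
    return {io: compute_nodes[i * size:(i + 1) * size]
            for i, io in enumerate(io_nodes)}

# A node card with 32 compute nodes and 2 I/O nodes gives a 16:1 pset ratio.
psets = build_psets([f"C{n:02d}" for n in range(32)], ["I0", "I1"])
print(len(psets["I0"]))  # 16
```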
You can tell the difference between the compute and the I/O card by looking at the<br />
edge connectors. The compute card has a total of six edge connectors, which is<br />
one more than the I/O card (see Figure 1-6).<br />
Figure 1-6 I/O card<br />
The node card is the location into which the compute and I/O cards are inserted.<br />
Sixteen compute cards are added to each node card, as well as up to two optional I/O<br />
cards. Each card slot inside the node card has its own unique location code, as<br />
shown in Figure 1-7 on page 9.<br />
Note: If every I/O node position in the system has a card installed, the system<br />
is known as I/O rich.<br />
The node cards are then installed into a midplane — eight in the front and eight<br />
in the back. This installation is repeated for the second midplane, giving a total of<br />
32 node cards per rack, each with its own unique location code.<br />
Figure 1-7 Node card block diagram showing location codes<br />
Note: Each of the locations has a specific use. For example:<br />
– J1: Midplane connector<br />
– C0 → CF or J2 → J17: Compute cards<br />
– I0, I1 or J18, J19: I/O cards<br />
The node card has LED status indicators on the front (Figure 1-8 on page 10).<br />
The center LED panel shows whether the card is linked to the midplane, whether<br />
the IDo link is working, and whether the card itself or something plugged into the<br />
card has a fault.<br />
Figure 1-8 Node card center LED panel<br />
Table 1-1 Node card status LEDs<br />
Ethernet LEDs<br />
Led Name/Color Indication<br />
LINK/Green ON - An active link from the Service Network into the<br />
node card. Card can be monitored and controlled.<br />
OFF - Control link is missing. All the other LEDs might<br />
contain inaccurate information because the Service<br />
Network connection is used to set them.<br />
4/Green ON/FLASHING - Card is operating<br />
OFF - Card is down. Node Cards should only be<br />
removed when this LED is OFF.<br />
3/Green Flashing - Un-initialized<br />
2/Green Flashing - Un-initialized<br />
1/Green Flashing - Un-initialized<br />
0/Amber ON - Card has a fault<br />
OFF - Card has no fault<br />
FLASHES - Card needs human interaction<br />
Note: If LED 0 is on after everything is initialized, one of the power modules<br />
might have failed. In this case, the card is operational but no longer has its<br />
redundant power.<br />
1.2.6 Service card<br />
In addition to the center LED panel, each RJ45 jack marked 1, 2, 3, or 4 has<br />
status LEDs (Figure 1-8 on page 10 shows two RJ45 jacks marked 2 and 3. The<br />
left green LED in 3 is on.)<br />
Table 1-2 Node card RJ45 LEDs and indications<br />
Led Name/Color/Position Indication<br />
RJ45 Gbit ethernet link LED/Green/Left ON - Active Gbit ethernet link<br />
OFF - No active Gbit ethernet link<br />
RJ45 Gbit ethernet link LED/Green/Right FLASHING - Indicates traffic<br />
OFF - No traffic<br />
The service card provides the major control functions for each midplane:<br />
► Provides an interface to the network that controls the fans and cooling<br />
through the I²C network. (See also 1.3.1, “Service network” on page 22.)<br />
► Distributes the clock signal to all node cards on the midplane. The clock card<br />
is connected directly to the front of each service card. (See 1.2.8, “Clock card”<br />
on page 15.)<br />
► Controls the ethernet switch for the midplane TCP/IP network.<br />
► Controls the boot sequence of each midplane.<br />
► Delivers persistent power to link cards, so that the torus and collective networks<br />
continue to function in the event of a power outage in one rack, allowing the other racks in<br />
the Blue Gene/L system to continue to work. See 1.3.3, “Three dimensional<br />
torus (3D torus)” on page 26 and 1.3.4, “Collective (tree) network” on page 26<br />
for more details.<br />
► Connects devices on the midplane to the service node through the service<br />
network. Each device that is inserted into the midplane has an IDo chip that<br />
acts as a bridge between the network and the other hardware interfaces. (For<br />
more details see 1.3.1, “Service network” on page 22.)<br />
The service card also has its own set of LEDs to help with problem<br />
determination, as illustrated in Figure 1-9.<br />
Figure 1-9 Front of the service card<br />
Status LEDs are located on the right-hand side of the service card. Table 1-3<br />
explains these LEDs.<br />
Table 1-3 Service card status LEDs<br />
Led Name/Color Indication<br />
Power/Green ON - 3.3 volt persistent power to the Midplane present. This<br />
power is used to run the service card and to run the service<br />
path logic on all other Link, Node Cards and Fan modules in the<br />
midplane. This LED should always be on during normal<br />
operations.<br />
OFF - 3.3 volt persistent power to the midplane missing.<br />
Service Cards should only be removed when this LED is OFF<br />
4 /Green ON/FLASHING - Card is operating<br />
OFF - Card is down<br />
3/Amber OFF - Card initialized or no rack power<br />
FLASHING - Card un-initialized or needs to be brought up after<br />
a power cycle<br />
2/Amber OFF - Card initialized or no rack power<br />
FLASHING - Card un-initialized or needs to be brought up after<br />
a power cycle<br />
1/Amber OFF - Card initialized or no rack power<br />
FLASHING - Card un-initialized or needs to be brought up after<br />
a power cycle<br />
0 - Amber ON - Card has a fault<br />
OFF - Card has no fault<br />
FLASHING - Card un-initialized or needs to be brought up after<br />
a power cycle<br />
The service card also hosts 100 Mbps and 1 Gbps network ports. The 100 Mbps<br />
port is used in the discovery process, and the 1 Gbps port is used for talking to<br />
the hardware when the discovery process has completed. For more information<br />
about the discovery process, see 1.9.3, “Discovery” on page 43. Each RJ45 jack<br />
on the service card has status LEDs that are similar to the ones on the node card<br />
(see Table 1-2 on page 11).<br />
1.2.7 Fan card<br />
The cooling on the Blue Gene/L system is achieved through banks of fans. Each<br />
fan unit contains three individual fans and two status LEDs on the front<br />
(Figure 1-10).<br />
Figure 1-10 Fan unit front (showing LEDs) and fan unit removed from rack<br />
The fans draw the air through the intake plenum, through the rack, and release it<br />
out through the output plenum towards the ceiling (Figure 1-11). It is the plenums<br />
that give the Blue Gene/L its unique shape.<br />
Figure 1-11 Rack cooling<br />
There are 20 fan units (clusters) per rack — 10 are installed on the front and 10<br />
in the back, each with their own location code. (For the locations codes, see<br />
Figure 1-2 on page 4.) Figure 1-12 shows the fan clusters installed in the rack.<br />
Figure 1-12 Fans installed in a rack - exhaust plenum removed for clarity<br />
Table 1-4 explains the significance of the fan status LEDs.<br />
Table 1-4 Fan status LEDs<br />
Fan Good - Green - Left Fan Fault - Amber - Right Indication<br />
OFF OFF Not powered, can be in this<br />
mode during the first few<br />
seconds of rack power up.<br />
ON ON Autonomous mode. Host<br />
communication failure.<br />
ON OFF Fan is working normally.<br />
OFF ON One or more fans in the<br />
module are operating slowly<br />
or not at all. The fan assembly<br />
is bad and needs to be<br />
looked at by hardware<br />
support.<br />
FLASHING ON or OFF Identification mode, used<br />
to identify a specific fan<br />
unit.<br />
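The two fan LEDs in Table 1-4 form a small truth table; a decoder can be sketched as below (a hypothetical helper for illustration, not a Blue Gene utility):

```python
# Decode the fan-module LED pair from Table 1-4.
def fan_status(green: str, amber: str) -> str:
    """green/amber: 'ON', 'OFF', or (green only) 'FLASHING'."""
    if green == "FLASHING":
        # Identification mode overrides the amber LED state.
        return "identification mode"
    table = {
        ("OFF", "OFF"): "not powered (normal during first seconds of power-up)",
        ("ON", "ON"): "autonomous mode: host communication failure",
        ("ON", "OFF"): "working normally",
        ("OFF", "ON"): "one or more fans slow or stopped: replace assembly",
    }
    return table[(green, amber)]

print(fan_status("ON", "OFF"))       # working normally
print(fan_status("FLASHING", "ON"))  # identification mode
```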
1.2.8 Clock card<br />
The clock card is attached physically to the bottom of the rack and is connected<br />
to the midplanes in the system through special cables that connect to the coaxial<br />
ports that are located on the front of each service card (Figure 1-13). Because<br />
Blue Gene/L is a massively parallel system, the clocks on all nodes that are<br />
involved in running a job must be synchronized (especially for any MPI job). The<br />
clock card logic provides the necessary synchronization for all node cards.<br />
Figure 1-13 Clock card<br />
1.2.9 Bulk power modules/enclosure<br />
The bulk power modules (BPM) are located on top of each rack. There are seven<br />
modules per rack (three in the front and four in the back of the rack). These<br />
modules are connected in an n+1 redundancy scheme. The modules are<br />
inserted into the bulk power enclosure (BPE), as shown in Figure 1-14, which<br />
also houses the circuit breaker module that is used to turn on and off the rack.<br />
Figure 1-14 Front view of the BPE (three BPMs and a circuit breaker)<br />
The BPMs also have status LEDs, as shown in Figure 1-15.<br />
Figure 1-15 BPM (LEDs on top left)<br />
Table 1-5 details the significance of these LEDs.<br />
Table 1-5 BPM Status LEDs<br />
LED Name/Color/Position Indication<br />
AC Good/Green/Left ON - AC Input to power module is present and<br />
within specification<br />
OFF - There is a problem with the AC supply<br />
DC Good/Green/Middle ON - DC Input to power module is present and<br />
within specification<br />
OFF - There is a problem with the DC supply<br />
Fault/Amber/Right ON - Replace power module<br />
OFF - Operating normally (as long as AC Good and<br />
DC Good are ON)<br />
All LEDs OFF - No Power<br />
1.2.10 Link card<br />
The link card is used to connect different Blue Gene/L internal networks between<br />
compute (processor) cards on different midplanes. It allows the Blue Gene/L to<br />
expand beyond the physical midplane limit. The link card also provides<br />
connectors for X, Y, and Z dimension cabling through torus cables which are<br />
plugged into it (as shown in Figure 1-16). The Z cables go between midplanes<br />
that are located in the same rack, the Y cables go between midplanes in the<br />
same row, and the X cables go between midplanes in different rows.<br />
Figure 1-16 Direction of X,Y and Z cabling<br />
Note: The process of cabling a Blue Gene/L is beyond the scope of this<br />
redbook and is covered by the installation team when the system is installed.<br />
What you need to know is the locations and socket names on the link card (as<br />
illustrated in Figure 1-17).<br />
Figure 1-17 Link card that shows the location codes of X, Y, and Z cable connections<br />
1.2.11 Rack, power module, card, and fan naming conventions<br />
Each card is named according to its position in the rack. Figure 1-18,<br />
Figure 1-19, and Figure 1-20 depict the way a unique system location is built for<br />
each component.<br />
Racks: Rxx<br />
– First x: rack row (0-F); second x: rack column (0-F)<br />
Power modules: Rxx-Px<br />
– P: power module (0-7); 0-3 left to right facing the front, 4-7 left to right facing the rear<br />
– Note: Position 0 is reserved for the circuit breaker<br />
Midplanes: Rxx-Mx<br />
– M: midplane (0-1); 0 = bottom, 1 = top<br />
Clock cards: Rxx-K<br />
– One clock card per rack<br />
Figure 1-18 Hardware naming convention, 1 of 3<br />
Fan assemblies: Rxx-Mx-Ax<br />
– A: fan assembly (1-A); 1 = bottom front, 9 = top front (odd numbers); 2 = bottom rear, A = top rear (even numbers)<br />
Link cards: Rxx-Mx-Lx<br />
– L: link card (0-3); 0 = bottom front, 1 = top front, 2 = bottom rear, 3 = top rear<br />
Fans: Rxx-Mx-Ax-Fx<br />
– F: fan (0-2); 0 = tailstock, 2 = midplane<br />
Service cards: Rxx-Mx-S<br />
– One service card per midplane<br />
Figure 1-19 Hardware naming convention, 2 of 3<br />
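The location codes in this naming convention can be composed programmatically. The helper below is a hypothetical sketch for illustration, not part of the Blue Gene control system:

```python
# Compose Blue Gene/L location codes from the naming convention above.
def location(rack_row, rack_col, midplane=None, card=None):
    """e.g. location(0, 3) -> 'R03'; location(0, 3, 1, 'N5') -> 'R03-M1-N5'."""
    # Rack row and column are single hex digits (0-F).
    parts = [f"R{rack_row:X}{rack_col:X}"]
    if midplane is not None:
        parts.append(f"M{midplane}")
    if card is not None:
        parts.append(card)
    return "-".join(parts)

print(location(0, 0, 0, "S"))    # R00-M0-S   (service card, bottom midplane)
print(location(1, 10, 1, "L3"))  # R1A-M1-L3  (link card 3, top midplane)
```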
Node cards: Rxx-Mx-Nx<br />
– N: node card (0-F); 0 = bottom front, 7 = top front, 8 = bottom rear, F = top rear<br />
Compute cards: Rxx-Mx-Nx-C:Jxx<br />
– Compute card (J02-J17); even numbers on the right, odd numbers on the left; lower numbers toward the midplane, higher numbers toward the tailstock<br />
I/O cards: Rxx-Mx-Nx-I:Jxx<br />
– I/O card (J18-J19); 18 = right, 19 = left<br />
Figure 1-20 Hardware naming convention, 3 of 3<br />
1.3 Blue Gene/L networks<br />
Blue Gene/L has five networks that connect the internal components and the<br />
system to the outside world. The networks are:<br />
► Service network<br />
– JTAG<br />
– I²C<br />
► Functional Ethernet<br />
► Torus<br />
► Collective (tree)<br />
► Barrier and Global interrupt<br />
1.3.1 Service network<br />
The service network provides a gateway into the Blue Gene/L system, so that the<br />
Service Node can control and monitor the hardware. The Service Node<br />
communicates to the Service Card using 100 Mbps or 1 Gbps connections. An<br />
ethernet switch located in the Service Card provides two Fast and two Gigabit<br />
ethernet ports that are located on the front of the Service Card. The internal<br />
switch also provides 20 additional ports, which are used to connect to the IDo<br />
chips that are located on the 16 node cards and four link cards hosted on the<br />
midplane (Figure 1-21).<br />
Figure 1-21 Service Network<br />
The hardware on the node and link cards is unable to talk directly to the Service<br />
Network. These components communicate using their own protocols. The IDo<br />
chip acts as a bridge between the service network and these communication<br />
protocols. The two protocols are:<br />
► JTAG<br />
► I²C<br />
JTAG<br />
Each processor has an interface onto a control system. The protocol used is<br />
called the Joint Test Action Group (JTAG) interface, which is an Institute<br />
of Electrical and Electronic Engineers (IEEE) 1149.1 standard. The IDo chip<br />
converts the 100 Mbps ethernet bus into 40 JTAG buses: two for each compute<br />
card (32 in all, one for control of each processor), two for each of the I/O cards<br />
(4 in all, one for control of each processor), and four for the gigabit ethernet<br />
transceivers that are associated with the I/O nodes. On Blue Gene/L, JTAG<br />
provides:<br />
► Hardware control to turn on the CPU and start its clock.<br />
► Hardware diagnostic and debugging.<br />
► Delivery of the microloader to each CPU to start the boot process.<br />
► A mailbox function for recording messages from the MCP or CNK, which are<br />
used for recording RAS events.<br />
Figure 1-22 illustrates node cards links to the JTAG network.<br />
Figure 1-22 Node cards links to the JTAG network<br />
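The 40-bus figure quoted above can be checked with simple arithmetic; the sketch below just multiplies out the counts from the text:

```python
# Check the JTAG bus count: 40 buses per node-card IDo chip.
COMPUTE_CARDS = 16    # per node card; one JTAG bus per processor, 2 per card
IO_CARDS = 2          # up to two I/O cards; again one bus per processor
GBE_TRANSCEIVERS = 4  # gigabit ethernet transceivers for the I/O nodes

jtag_buses = COMPUTE_CARDS * 2 + IO_CARDS * 2 + GBE_TRANSCEIVERS
print(jtag_buses)  # 40
```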
I²C<br />
The Inter-Integrated Circuit (I²C) bus (and protocol) is used to interface with the<br />
fan control logic and temperature sensors in the Blue Gene/L system<br />
(Figure 1-23). Data such as fan speed and voltages are also reported using this<br />
2-wire serial protocol.<br />
Figure 1-23 Service network showing JTAG and I²C connections<br />
1.3.2 Functional network<br />
The functional network connects the Blue Gene/L I/O nodes, Service Node,<br />
Front-end Nodes (FENs), and file system providers together in one network. The<br />
functional network:<br />
► Provides system software and application data access to the I/O and<br />
Compute nodes, because there is no persistent data storage inside the Blue<br />
Gene/L racks. This is done through the I/O nodes over the functional network,<br />
in the form of NFS and GPFS mounts from external sources. (For more<br />
information, see Chapter 5, “File systems” on page 211.)<br />
24 IBM System Blue Gene Solution: <strong>Problem</strong> <strong>Determination</strong> <strong>Guide</strong><br />
► Communicates stdin, stdout, and stderr from the Compute Nodes back<br />
to where the application was submitted, through the ciod and ciodb<br />
processes that are running on the I/O Node and Service Node (as shown in<br />
Figure 1-24). For more information, see 1.4, “Service Node” on page 29.<br />
Figure 1-24 Functional network<br />
1.3.3 Three dimensional torus (3D torus)<br />
The three-dimensional torus (3D torus) is the first of the three specialized<br />
networks implemented on the Blue Gene/L system to enable high performance<br />
parallel computing. The 3D torus network is used for general purpose,<br />
point-to-point message passing and multicast operations to a selected class of<br />
nodes when an application is running.<br />
Each processor is directly connected to its six nearest neighbors, one in each<br />
direction along the X, Y, and Z dimensions (X+1, X-1, Y+1, Y-1, Z+1,<br />
Z-1), as shown in Figure 1-25. The tori are implemented in hardware by the node<br />
card, midplane, link cards, and torus cables.<br />
Figure 1-25 4x4x4 (64) node torus<br />
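The wraparound behind the 4x4x4 torus in Figure 1-25 can be made concrete with a small<br />
sketch. This is illustrative shell only, not Blue Gene tooling; the 4x4x4 size matches the<br />
figure, and the neighbors helper is our own construction:<br />

```shell
#!/bin/sh
# Neighbors of a node in a 4x4x4 torus: one hop in each direction
# (X+1, X-1, Y+1, Y-1, Z+1, Z-1), with coordinates wrapping modulo
# the torus size. The wraparound is what distinguishes a torus from a mesh.
SIZE=4
neighbors() {  # args: x y z
  x=$1; y=$2; z=$3
  echo "$(( (x + 1) % SIZE )) $y $z"
  echo "$(( (x + SIZE - 1) % SIZE )) $y $z"
  echo "$x $(( (y + 1) % SIZE )) $z"
  echo "$x $(( (y + SIZE - 1) % SIZE )) $z"
  echo "$x $y $(( (z + 1) % SIZE ))"
  echo "$x $y $(( (z + SIZE - 1) % SIZE ))"
}
neighbors 0 0 0   # X-1 of node (0,0,0) wraps around to x=3, and so on
```

Every node therefore has exactly six neighbors, even the "corner" nodes.<br />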
For more information about the torus network, see Unfolding the IBM eServer<br />
Blue Gene Solution, SG24-6686.<br />
1.3.4 Collective (tree) network<br />
The collective network is the second Blue Gene/L specialized network. It is used<br />
for one-to-all, all-to-one, and all-to-all communication. It connects all compute<br />
nodes in the shape of a tree, and any node can be the tree root (originating<br />
point). Any compute node can send messages up or down the tree structure, and<br />
that message can stop at any level (see Figure 1-26).<br />
Figure 1-26 Collective network (C – Compute Node, I/O – I/O Node)<br />
In a system with as many processors as Blue Gene/L, it is impractical to provide<br />
each processor with its own external connection. Instead, I/O nodes with<br />
dedicated external connections handle external I/O operations on behalf of<br />
groups of compute nodes. This grouping of compute nodes to an assigned I/O<br />
node is known as a processor set, or pset.<br />
The I/O nodes are connected to the collective network but do not participate in<br />
global messaging. They just handle I/O requests. All compute nodes exist under<br />
their associated I/O node in the collective (tree) network.<br />
1.3.5 Global barrier and interrupt network<br />
The third specialized network is the global barrier and interrupt network. A global<br />
barrier is a way of synchronizing groups of compute nodes. When a barrier is<br />
raised by an application, the nodes wait until everyone reaches a certain position<br />
or condition, so that they can all continue. An interrupt is an asynchronous signal<br />
that indicates the need for attention (for example, an error condition).<br />
Every node on the Blue Gene/L system is connected to the global barrier and<br />
interrupt network through four inputs and four outputs. Each input and output pair<br />
forms a channel in the network. Each of these channels can be independently<br />
programmed as a global logical “OR” or “AND” statement, depending on the<br />
polarity of the signals.<br />
On each node card, the outputs of eight of the compute connections are<br />
connected together with the outputs of the optional I/O connections to form a<br />
“dot-OR” network. Each node card therefore has four “dot-OR” networks. These<br />
are then connected into the node card’s IDo chip.<br />
The IDo chip on the Node Card samples all the “dot-OR” networks on that card<br />
and passes any signals onto other “dot-OR” networks in the same card or onto<br />
other node cards.<br />
The global barrier and interrupt network on a midplane is divided into quadrants,<br />
each with four node cards. One of these node cards serves as the head of the<br />
quadrant.<br />
The outputs of the three non-head node cards are connected together and fed<br />
into the IDo chip of the quadrant head along with the quadrant head “dot-OR”<br />
networks. The output of each quadrant head is connected to the IDo chips on the<br />
link cards (see Figure 1-27). A link card handles each global interrupt and barrier<br />
channel.<br />
The output signal from a node is called the up signal and carries the node<br />
contribution all the way to the top of the partition. The combined signals are then<br />
fed back to all the nodes of the partition. This input signal is called the down<br />
signal.<br />
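As a toy illustration (not Blue Gene code), an OR-configured channel can be modeled as<br />
follows: each node contributes an up signal, the channel ORs the contributions, and the<br />
combined value is what every node sees on the down signal:<br />

```shell
#!/bin/sh
# Toy model of an OR-configured global barrier/interrupt channel.
# The "up" value at the top of the partition is the logical OR of every
# node's contribution; that combined value is fed back "down" to all nodes.
channel_up() {
  for s in "$@"; do
    [ "$s" = "1" ] && { echo 1; return; }
  done
  echo 0
}
channel_up 0 0 1 0   # one node still raising its signal -> 1
channel_up 0 0 0 0   # all nodes clear -> 0
```

An AND-configured channel is the same idea with the polarity reversed, which is how the<br />
same hardware serves both barrier and interrupt roles.<br />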
Figure 1-27 Global barrier and interrupt network<br />
1.4 Service Node<br />
The Service Node is the control system for the Blue Gene/L racks. It is an IBM<br />
System p system (stand-alone or LPAR) running SUSE Linux Enterprise Server<br />
9 (SLES 9) and has connections to the Service and Functional networks. A DB2<br />
database is installed and running on the Service Node; it contains the<br />
current state of the Blue Gene/L as well as job and system configuration data.<br />
The Blue Gene/L specific set of daemons runs on the Service Node, utilizing the<br />
database and performing activities such as booting, job submission, and<br />
hardware control.<br />
1.4.1 DB2 database<br />
The service node runs a DB2 database, which is used to store the following four<br />
data categories about the Blue Gene/L:<br />
► Configuration data: Includes system configuration data, operational data,<br />
and historical data.<br />
– System configuration data, which includes a representation of the<br />
physical system in database tables. All system hardware and connections<br />
are recorded in this category of data and are only altered when a<br />
component fails or is replaced.<br />
Each component of the system is represented in a database table with its<br />
relevant information. All the entries in the tables are identified by the<br />
hardware components’ unique serial numbers as follows:<br />
Machine Generated value<br />
Midplane Service Card’s IDo chip License Plate (the unique<br />
identifier for each chip)<br />
Node Card Card’s IDo chip License Plate<br />
Processor Card Processor Card’s serial number and position in the<br />
Node Card (for example, J03, J04) or ECID<br />
(Electronic Chip ID, the unique identifier for each<br />
chip)<br />
Node Compute or I/O Node ECID<br />
Link Card IDo chip License Plate<br />
Service Card IDo chip License Plate<br />
Link Chip Chip’s ECID<br />
IDo Chip License Plate<br />
Note: The configuration database is empty when the system is first<br />
installed. It is populated by the discovery process as described in 1.9.3,<br />
“Discovery” on page 43.<br />
– Operational data, which includes the status of what is currently in use by<br />
applications and the status of the jobs themselves. The Midplane<br />
Management Control System (MMCS) interacts with this database to<br />
schedule jobs and allocate blocks.<br />
– Historical data, which keeps track of hardware changes on the system<br />
and shows the history of what has run, when it has run, and on what<br />
hardware it has run.<br />
► Environment data: Periodically records the values for all sensors in the<br />
system. Voltage levels, fan speeds, and so forth are recorded here.<br />
► RAS data: This data is very important for problem determination.<br />
The system records all Reliability, Availability, and Serviceability (RAS)<br />
information. It is critical to monitor RAS information as an indicator of system<br />
health. For more information about monitoring RAS, see 3.2, “Hardware<br />
monitor” on page 114 and “RAS events” on page 126.<br />
► Diagnostic data: Contains the results from diagnostic tests on the hardware.<br />
1.4.2 Service Node system processes<br />
The following paragraphs describe Blue Gene/L system processes that run on<br />
the Service Node.<br />
mmcs_db_server<br />
The Midplane Management Control System (MMCS) server process is<br />
responsible for the management of blocks. Blocks are partitions (sets of compute<br />
and I/O nodes) of the Blue Gene/L in which jobs run. The mmcs_db_server<br />
process configures blocks at boot time and identifies what physical hardware<br />
should be used and in what configuration. It also polls the database for block<br />
actions and starts the boot processes.<br />
ciodb<br />
After the blocks are booted, ciodb manages the job launch to the block. It then<br />
handles passing back stdin, stdout, and stderr for each job. The ciodb daemon<br />
talks to the ciod process running on the I/O node.<br />
idoproxydb<br />
The idoproxydb daemon handles hardware-related communication<br />
through the Service Network. It communicates with the IDo chips<br />
located on the Service, Link, and Node Cards.<br />
bglmaster<br />
The bglmaster process is the parent process for the other three system<br />
processes. It starts all three of the main system processes (idoproxy,<br />
mmcs_db_server, and ciodb) and restarts them if a process ends for any<br />
reason. It can also provide information about the latest status of the spawned<br />
processes.<br />
Additional software<br />
For the Service Node to function, additional software is required. For more<br />
information, see Unfolding the IBM eServer Blue Gene Solution, SG24-6686.<br />
1.5 Front-End Node<br />
A Front-End Node is another IBM System p running SUSE Linux Enterprise<br />
Server 9 (SLES 9). Multiple Front-End Nodes can be installed. Users log in to a<br />
Front-End Node to submit jobs to the Blue Gene/L. IBM compilers and Blue<br />
Gene/L compiler extensions are installed so that the user can cross-compile<br />
code to run on the Blue Gene/L hardware. The user then submits the job to<br />
the system through a job scheduler.<br />
The Front-End Node is needed because compilations and handling job I/O from<br />
many users can have a severe effect on the performance of the Service Node.<br />
You can find more information about compiling and submitting jobs in Chapter 4,<br />
“Running jobs” on page 141.<br />
1.6 External file systems<br />
As we explained in 1.3.2, “Functional network” on page 24, the Blue Gene/L<br />
system does not have any persistent storage (disk) directly attached. Storage is<br />
provided through external file systems such as NFS or GPFS (as illustrated in<br />
Figure 1-28). These file systems are then mounted by the I/O nodes, and the<br />
Compute nodes perform I/O through the collective network. You can find more<br />
information in Chapter 5, “File systems” on page 211.<br />
Figure 1-28 Communication in Blue Gene/L<br />
1.7 Blue Gene/L system software<br />
So far we have talked about the physical hardware and the software that runs on<br />
the Service Node. Now, we move on to the software that actually runs on the Blue<br />
Gene/L hardware (compute and I/O nodes).<br />
1.7.1 I/O node kernel<br />
When a block is booted, each I/O node within that block receives the same boot<br />
image and configuration. This is in fact a port of Linux with specific patches to<br />
When a block is booted, each I/O node within that block receives the same boot<br />
image and configuration. This is in fact a port of Linux with specific patches to<br />
support the Blue Gene/L hardware. This altered kernel is also known as the<br />
Mini-Control Program (MCP). It is seen as linuximg in the block definition and<br />
has a small specialized shell that provides a subset of commands and command<br />
options called BusyBox. The I/O node boot scripts run the BusyBox commands.<br />
BusyBox is open source software. You can learn more about BusyBox at:<br />
http://www.busybox.net/<br />
1.7.2 I/O kernel ramdisk<br />
The ramdisk is a stripped-down UNIX® file system that contains just the root<br />
user, configuration files, and binaries for the services that need to be started.<br />
This file system is mounted by the MCP at boot time. It is customized<br />
dynamically, so updates to startup files and services are seen immediately by the<br />
I/O nodes when they are next booted. The ramdisk is specified by ramdiskimg in<br />
the block definition.<br />
1.7.3 I/O kernel CIOD daemon<br />
The Compute Node I/O Daemon (ciod) is started on the I/O node by the<br />
initialization scripts. It controls applications on the Compute Nodes and provides<br />
I/O services to them. It interacts with ciodb on the Service Node. The ciod<br />
daemon waits for a connection on TCP port 7000. When it receives a connection,<br />
it reads commands from ciodb from the socket. The commands that are sent are:<br />
VERSION Establish protocol<br />
LOGIN Set up job information<br />
LOAD Load application<br />
START Start running the job<br />
KILL End running job<br />
After the application is running, ciod also reports output from the Compute<br />
Nodes back to ciodb. CONSOLE_TEXT messages are stdout and stderr output, and<br />
CONSOLE_STATUS is the return status of the application.<br />
1.7.4 Compute Node Kernel<br />
For information about the Compute Node Kernel, see 1.2.3, “Compute<br />
(processor) card” on page 7. It is seen as blrtsimg in the block definition.<br />
1.7.5 Microloader<br />
The microloader is used to boot inactive processors into a state where they can<br />
receive the CNK (compute nodes) or MCP (I/O nodes). It is seen as mloaderimg<br />
in the block definition. Example 1-1 shows the block output and the boot images<br />
specified.<br />
Example 1-1 Block definition showing image definitions<br />
mmcs$ list bglblock R000_128<br />
OK<br />
==> DBBlock record<br />
_blockid = R000_128<br />
_numpsets = 0<br />
_numbps = 0<br />
_owner =<br />
_istorus = 000<br />
_sizex = 0<br />
_sizey = 0<br />
_sizez = 0<br />
_description = Generated via genSmallBlock<br />
_mode = C<br />
_options =<br />
_status = F<br />
_statuslastmodified = 2006-03-31 15:08:11.992582<br />
_mloaderimg =<br />
/bgl/BlueLight/ppcfloor/bglsys/bin/mmcs-mloader.rts<br />
_blrtsimg =<br />
/bgl/BlueLight/ppcfloor/bglsys/bin/rts_hw.rts<br />
_linuximg =<br />
/bgl/BlueLight/ppcfloor/bglsys/bin/zImage.elf<br />
_ramdiskimg =<br />
/bgl/BlueLight/ppcfloor/bglsys/bin/ramdisk.elf<br />
_debuggerimg = none<br />
_debuggerparmsize = 0<br />
_createdate = 2006-03-06 17:56:20.600708<br />
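When checking which images a block will boot, it can help to pull just the image fields out of<br />
a saved list bglblock dump. The following sketch is our own (it is not a Blue Gene tool); it<br />
works on text in the shape of Example 1-1, and the sample lines are abridged from it:<br />

```shell
#!/bin/sh
# Extract the boot-image fields from a captured "list bglblock" dump,
# where each "_...img =" line is followed by the image path on its own line.
cat > /tmp/block.txt <<'EOF'
_blockid = R000_128
_mloaderimg =
/bgl/BlueLight/ppcfloor/bglsys/bin/mmcs-mloader.rts
_blrtsimg =
/bgl/BlueLight/ppcfloor/bglsys/bin/rts_hw.rts
_linuximg =
/bgl/BlueLight/ppcfloor/bglsys/bin/zImage.elf
_ramdiskimg =
/bgl/BlueLight/ppcfloor/bglsys/bin/ramdisk.elf
EOF
# For each field name ending in "img =", read the next line as its path.
awk '/img =$/ { name = $1; getline path; print name, path }' /tmp/block.txt
```

This prints one "field path" pair per image, which is convenient when comparing the images<br />
of a failing block against a known-good one.<br />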
1.8 Boot process, job submission, and termination<br />
The following steps outline the boot process, job submission, and termination<br />
steps:<br />
1. When a user submits a job using mpirun, llsubmit, or submit_job, depending<br />
on the command, a new block might be created according to the user’s<br />
specification or an existing block might be reused.<br />
2. The selected block in the BGLBLOCK database table is set to C for configure.<br />
An entry is made in the BGLJOB table with the status of Q for queued.<br />
3. The mmcs_db_server process continually polls the BGLBLOCK database table<br />
looking for blocks in the C for configure state.<br />
4. When the mmcs_db_server process finds a block in the configure state, the<br />
boot process is started by changing the status of the block to A for allocating.<br />
The BGLEVENTLOG table is monitored for any FATAL RAS events. If any are<br />
recorded during the boot process, the block is set to D for de-allocated and<br />
then F for freed.<br />
5. Block information is updated with the block owner, which is set to the user ID<br />
that was used to submit the job. The mmcs_db_server process then uses<br />
idoproxy to establish connections to the IDo chips on the cards where the<br />
block is to be booted. If any IDo connections fail, a FATAL RAS event is<br />
created in the BGLEVENTLOG database table.<br />
6. IDo commands are used to initialize the chips on the I/O and compute cards.<br />
7. If the block spans multiple midplanes, Link Training is performed on the link<br />
chips. Signal patterns are sent between the chips. The patterns are used so<br />
that the chips can lock onto each other and synchronize when they recognize<br />
the signal.<br />
Figure 1-29 illustrates these steps.<br />
Figure 1-29 Booting a block - initial steps<br />
After the chips are initialized, the microloader that is specified in the block<br />
definition is loaded onto the SRAM area of the I/O and compute cards (see<br />
Figure 1-30). A checksum is performed to verify that the code has arrived as<br />
expected. Each processor is then started and the microloader executes. Failure<br />
to load the microloader generates a FATAL RAS event.<br />
Figure 1-30 Microloader distributed to processors<br />
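The checksum step is conceptually just "recompute and compare." The following stand-alone<br />
illustration uses cksum(1) on a stand-in file; it is not the actual microloader protocol:<br />

```shell
#!/bin/sh
# Illustration of verify-after-transfer: record a checksum of the image
# before sending, recompute it on arrival, and compare the two.
printf 'stand-in boot image bytes' > /tmp/image.bin
expected=$(cksum < /tmp/image.bin)
# ...the image would be transferred to the node's SRAM here...
actual=$(cksum < /tmp/image.bin)
if [ "$actual" = "$expected" ]; then
  echo "checksum OK"            # safe to start the processor
else
  echo "checksum MISMATCH"      # would generate a FATAL RAS event
fi
```

On Blue Gene/L, a failed comparison at this stage is what surfaces as the FATAL RAS event<br />
described above, rather than an error message on a console.<br />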
The microloader then loads over the Service Network and performs checksums<br />
for the MCP and ramdisk on the I/O nodes and CNK on compute nodes (see<br />
Figure 1-31). The loads are performed in parallel, and a start command is sent to<br />
each microloader when completed. The microloader then gives up control and<br />
starts the loaded code. The MCP and CNK boot, and the nodes become active entities.<br />
Figure 1-31 MCP and CNK loaded<br />
The next steps in the process are:<br />
1. After the mmcs_db_server process has finished sending the start commands, it<br />
sets the status of the block to B for booting.<br />
2. During this process, the nodes begin Tree training. Collective and GB<br />
ethernet drivers are loaded. MCP and CNK establish links to each other.<br />
Training involves sending signal patterns for the nodes to identify each other:<br />
– A synchronization pattern is sent out on the torus and collective networks:<br />
on the torus network across midplanes, and on the collective network<br />
between I/O and Compute Nodes.<br />
– Nodes take turns in sending and receiving signals.<br />
– Each node looks out for a particular bit stream.<br />
– After the stream is found, the nodes synchronize.<br />
– After all nodes have found their counterparts, training completes.<br />
– Failure in the process generates a FATAL RAS event.<br />
3. I/O nodes start the GB Ethernet connection to the functional network and NFS<br />
mount /bgl from the Service Node. Initialization scripts start on the I/O nodes,<br />
and ciod starts.<br />
4. The ciod daemon sends a message down the collective network and waits for<br />
the Compute Nodes to respond.<br />
5. When the nodes respond, ciod generates the RAS event CIOD initialized.<br />
6. The mmcs_db_server process waits for all I/O nodes to generate CIOD<br />
initialized, and then changes the block to I for initialized.<br />
Figure 1-32 illustrates this process.<br />
Figure 1-32 Initializing the block<br />
The next steps in this process are:<br />
1. When the block is set to I for initialized, mpirun changes the status of the job<br />
in BGLJOB table from Q for queued to S for starting.<br />
2. Then, ciodb polls for jobs in status S. When it finds these, it establishes a<br />
connection to ciod on the I/O Node on port 7000. The ciod receives the LOAD<br />
command from ciodb and sends the application over the collective network to<br />
the Compute Nodes.<br />
3. When all Compute Nodes have received the application, the START<br />
command is issued. BGLJOB STATUS is set to R for running.<br />
4. Then, ciodb forwards STDIN to ciod and ciod forwards STDOUT and STDERR<br />
back to ciodb. ciod handles the file I/O on behalf of the Compute Nodes.<br />
ciodb continues to poll ciod for job completion, which happens when the job<br />
completes or is killed.<br />
5. Finally, ciodb marks BGLJOB STATUS as T for terminated. ciodb then closes<br />
the connection to ciod.<br />
Figure 1-33 illustrates these steps.<br />
Figure 1-33 Starting the job<br />
For a list of block and job states, see Table 4-3 on page 162 and Table 4-4 on<br />
page 162.<br />
Logs of the various system processes and I/O nodes are described in 2.2.6,<br />
“Control system server logs” on page 61.<br />
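The single-letter state codes used throughout this walkthrough can be summarized in a<br />
small helper. The letter-to-meaning pairs below are taken from the steps above; the<br />
authoritative lists are the tables cited:<br />

```shell
#!/bin/sh
# Informal glosses for the single-letter status codes that appear in the
# BGLBLOCK and BGLJOB tables during block boot and job submission.
block_state() {
  case "$1" in
    C) echo "configure" ;;
    A) echo "allocating" ;;
    B) echo "booting" ;;
    I) echo "initialized" ;;
    D) echo "de-allocated" ;;
    F) echo "freed" ;;
    *) echo "unknown" ;;
  esac
}
job_state() {
  case "$1" in
    Q) echo "queued" ;;
    S) echo "starting" ;;
    R) echo "running" ;;
    T) echo "terminated" ;;
    *) echo "unknown" ;;
  esac
}
block_state I   # -> initialized
job_state R     # -> running
```

A healthy boot-plus-job cycle therefore walks the block through C, A, B, I and the job<br />
through Q, S, R, T.<br />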
1.9 System discovery<br />
System discovery is the process of finding all the hardware and communication<br />
links in a Blue Gene/L system and initializing them. The discovery process is also<br />
responsible for populating and updating the configuration database held on the<br />
Service Node. If a component is replaced, it is rediscovered and placed in the<br />
database. The old entries are marked as missing, allowing tracking of replaced<br />
hardware. For discovery to work, you need to start several processes. We discuss<br />
these processes in this section.<br />
1.9.1 The bglmaster process<br />
The bglmaster process is the master process that starts the Blue Gene/L system<br />
processes, which provide an interface (through the idoproxy daemon) for talking<br />
to the hardware after it is discovered.<br />
1.9.2 SystemController<br />
When the racks are powered on, they are in an uninitialized state. In an<br />
uninitialized state, we are unable to communicate with the system, because the<br />
IDo chips do not have an IP address. We are, therefore, unable to communicate<br />
over the Service Network. To discover the system, we use the SystemController<br />
process. This process finds the IDo chips and allocates to each of them an IP<br />
address that was predefined at installation time in the Service Node database.<br />
This operation is done over the 100 Mb ethernet connections on the Service Card.<br />
1.9.3 Discovery<br />
After the SystemController has initialized the IDo chips, the Discovery process<br />
itself can then start to talk to them through the 1 Gb network connections on each<br />
Service Card. There are separate Discovery processes for each row of racks in<br />
your environment. Discovery turns on the hardware, finds the components, and<br />
populates (or updates, in the case of a power cycle or hardware change, because<br />
the entries are already present) the relevant database tables. If a device fails to<br />
respond to discovery, it is marked as missing. When completed, control of each<br />
component is passed over to the idoproxy daemon.<br />
1.9.4 PostDiscovery<br />
After Discovery has populated the database, PostDiscovery checks that the data<br />
is valid and cleans the database. It then adds the location information of each<br />
component to the tables.<br />
1.9.5 CableDiscovery<br />
CableDiscovery is run after the Discovery process is completed. As the name<br />
suggests, it looks at the current configuration in the database and discovers Link<br />
Cards and the connected data cables. This process only needs to be performed<br />
on initial system installation or when a data cable or Link Card is replaced.<br />
1.10 Discovering your system<br />
In this section we provide a procedure that you can use to discover information<br />
about your Blue Gene/L system. Log on to your Service Node and execute the<br />
following steps:<br />
1. Start the idoproxy.<br />
cd /bgl/BlueLight/ppcfloor/bglsys/bin<br />
./bglmaster start<br />
2. Start SystemController.<br />
cd /discovery<br />
./SystemController start<br />
To view SystemController's messages, issue this command:<br />
./SystemController monitor<br />
3. Start a Discovery process for each row of BGL racks.<br />
./Discovery0 start //This is for the first row of BGL racks.<br />
./Discovery1 start //This is for the second row of BGL racks.<br />
................................<br />
To view Discovery0 messages:<br />
./Discovery0 monitor<br />
To view Discovery1 messages:<br />
./Discovery1 monitor<br />
................................<br />
4. Start PostDiscovery.<br />
./PostDiscovery start<br />
To view PostDiscovery messages:<br />
./PostDiscovery monitor<br />
5. Use DB2 queries or Web page (if available) to verify all hardware reports as<br />
described in 2.2, “Identifying the installed system” on page 57.<br />
6. After you have checked all information, stop discovery for each of the racks:<br />
./Discovery0 stop //This is for the first row of BGL racks.<br />
./Discovery1 stop //This is for the second row of BGL racks.<br />
7. Stop PostDiscovery.<br />
./PostDiscovery stop<br />
8. Restart bglmaster to restart the system processes.<br />
cd /bgl/BlueLight/ppcfloor/bglsys/bin<br />
./bglmaster restart<br />
9. Set the status of the Link Cards to good, ready for CableDiscovery.<br />
source /discovery/dbprofile<br />
./mmcs_db_console<br />
connecting to mmcs_server<br />
connected to mmcs_server<br />
connected to DB2<br />
mmcs$ pgood_linkcards all<br />
OK<br />
mmcs$ quit<br />
10.Start CableDiscovery.<br />
./bglmaster stop<br />
cd /discovery<br />
./CableDiscoveryAll<br />
To view CableDiscovery messages:<br />
./CableDiscovery monitor<br />
CableDiscovery should end with:<br />
DiscoverCables ended<br />
11.Start the Blue Gene/L system processes.<br />
cd /bgl/BlueLight/ppcfloor/bglsys/bin<br />
./bglmaster start<br />
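The manual sequence above (start-up in steps 1 through 4, teardown in steps 6 through 8)<br />
can be condensed into one script. This is a sketch only: the paths are taken from the<br />
procedure, the run helper and DRYRUN flag are our own additions, and on a real system<br />
you would still pause at step 5 to verify the hardware before tearing down:<br />

```shell
#!/bin/sh
# Sketch of the discovery start-up/teardown sequence from the procedure
# above. With DRYRUN=1 the commands are only printed, never executed.
run() {
  if [ "${DRYRUN:-0}" = "1" ]; then echo "+ $*"; else "$@"; fi
}
DRYRUN=1   # set to 0 (or remove) on a real Service Node
run /bgl/BlueLight/ppcfloor/bglsys/bin/bglmaster start    # step 1: idoproxy
run /discovery/SystemController start                     # step 2
run /discovery/Discovery0 start                           # step 3: first row
run /discovery/PostDiscovery start                        # step 4
# step 5: verify all hardware reports (DB2 queries or Web page) here
run /discovery/Discovery0 stop                            # step 6
run /discovery/PostDiscovery stop                         # step 7
run /bgl/BlueLight/ppcfloor/bglsys/bin/bglmaster restart  # step 8
```

The dry-run mode makes it easy to review the exact command sequence before running it<br />
against live hardware.<br />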
1.10.1 Discovery logs<br />
SystemController, Discovery, PostDiscovery, and CableDiscovery all have logs<br />
of their activity. These logs are created in /bgl/BlueLight/logs/BGL. Example 1-2<br />
presents the activity logs as we observed on our system.<br />
Example 1-2 Discovery process logs<br />
supportsn:/bgl/BlueLight/logs/BGL # ls -l | grep CurrentLog<br />
lrwxrwxrwx 1 root root 58 Mar 31 14:04 CurrentLog-Discovery0<br />
-> /bgl/BlueLight/logs/BGL/Discovery0-2006-03-31-14:04:28.log<br />
lrwxrwxrwx 1 root root 61 Mar 31 14:04<br />
CurrentLog-PostDiscovery -><br />
/bgl/BlueLight/logs/BGL/PostDiscovery-2006-03-31-14:04:43.log<br />
lrwxrwxrwx 1 root root 64 Mar 31 14:04<br />
CurrentLog-SystemController -><br />
/bgl/BlueLight/logs/BGL/SystemController-2006-03-31-14:04:32.log<br />
supportsn:/bgl/BlueLight/logs/BGL # ls -l | grep Cable<br />
-rwxrwxr-x 1 root bgladmin 16493 Mar 8 12:04<br />
CableDiscoveryAll-2006-03-08-12:03:22.log<br />
-rw-r--r-- 1 root root 6294 Mar 28 14:09<br />
CableDiscoveryAll-2006-03-28-14:07:47.log<br />
1.11 Service actions<br />
When your system has a problem that requires a part to be replaced or that<br />
requires the system to be shut down, you need to run a service action. A service<br />
action prepares the specified location by powering it down in a controlled<br />
manner. Parts can then be removed from the rack, or the rack itself can be turned<br />
off through the circuit breaker in the BPE. When the servicing is complete, you<br />
can then close the service action, which powers the location back on and brings<br />
it back into production. Two commands are provided to do this:<br />
PrepareForService and EndServiceAction.<br />
1.11.1 PrepareForService<br />
The syntax for this command is:<br />
PrepareForService LocationString [Verbose] [FORCE]<br />
Where:<br />
► LocationString is the location string of the part/card that needs to be<br />
serviced. Location codes are shown in 1.2.1, “Racks” on page 3.<br />
► Verbose provides extra command output.<br />
► FORCE is a keyword that indicates that a new Service Action should be started<br />
even if there is already an existing active service action for the resource.<br />
Supported locations are:<br />
► R00-M1-N0: specific NodeCard<br />
► R00-M1-N : all NodeCards in the Midplane<br />
► R11-M0: all Node and LinkCards in the Midplane<br />
► R37-M0-Ax: individual Fan Module<br />
► R01: Whole rack<br />
► R20-P1: Bulk Power Module<br />
► R00-M1-L3: LinkCard. This card requires that the tool turns off ALL Link and<br />
NodeCards in the neighborhood of the specified LinkCard. Neighborhood is<br />
defined as those cards that are in either the same row or the same column as<br />
the specified LinkCard.<br />
Figure 1-18 on page 19, Figure 1-19 on page 20, and Figure 1-20 on page 21<br />
illustrate these locations.<br />
At the end of the PrepareForService command, you are given a service action ID
that you must note in order to return the hardware to production later.
Example 1-3 shows how PrepareForService is used for a compute node
replacement.
Example 1-3 PrepareForService on a Node Card<br />
bglsn:/discovery # ./PrepareForService R00-M0-N0<br />
Logging to<br />
/bgl/BlueLight/logs/BGL/PrepareForService-2006-03-31-11:25:09.log<br />
Mar 31 11:25:10.169 EST: PrepareForService started<br />
Mar 31 11:25:35.363 EST: Freed any blocks using R000<br />
Mar 31 11:25:43.288 EST: Disabled this NodeCard's ethernet port on<br />
ServiceCard (R00-M0-S, FF:F2:9F:15:1E:4D:00:0D:60:EA:E1:B2), Port (11)<br />
Mar 31 11:25:43.681 EST: Marked NodeCard<br />
(203231503833343000000000594c31304b34323630304b35) missing<br />
Mar 31 11:25:43.711 EST: Deleted node hardware attrs for 34 nodes<br />
Mar 31 11:25:43.712 EST: Card has been successfully powered off!<br />
Mar 31 11:25:43.736 EST: Proceed with service on part<br />
(mLctn(R00-M0-N0),<br />
mCardSernum(203231503833343000000000594c31304b34323630304b35),<br />
mLp(FF:F2:9F:16:F0:95:00:0D:60:E9:0F:6A), mIp(10.0.0.25), mType(4))<br />
Mar 31 11:25:43.736 EST:<br />
Mar 31 11:25:43.750 EST: Service Action ID 19<br />
Mar 31 11:25:43.755 EST: MyShutdownHook - Exiting PrepareForService<br />
Mar 31 11:25:43.756 EST: +++ This logfile is closed +++<br />
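If you script part replacements, the service action ID printed near the end of the log can be captured automatically. The following sketch parses a sample log excerpt with awk; the embedded lines stand in for a real /bgl/BlueLight/logs/BGL log file:

```shell
# Extract the service action ID from a PrepareForService log so that it
# can be passed to EndServiceAction later. A sample excerpt stands in
# for a real log file.
log=$(mktemp)
cat > "$log" <<'EOF'
Mar 31 11:25:43.736 EST: Proceed with service on part
Mar 31 11:25:43.750 EST: Service Action ID 19
Mar 31 11:25:43.755 EST: MyShutdownHook - Exiting PrepareForService
EOF
# The ID is the last field of the "Service Action ID" line.
sa_id=$(awk '/Service Action ID/ {print $NF}' "$log")
echo "service action to close later: $sa_id"
rm -f "$log"
```

The same one-liner can be pointed at the newest PrepareForService log on a real system.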
As Example 1-4 shows, the Node Card is now marked as missing in the database,
together with the 34 nodes that are inserted into it (32 Compute Nodes and
2 I/O Nodes).
Example 1-4 Checking hardware is removed from the Service Node database<br />
bglsn:/discovery # db2 "select location,status from bglnodecard where<br />
location = 'R00-M0-N0'"<br />
LOCATION STATUS<br />
-------------------------------- ------<br />
R00-M0-N0 M<br />
1 record(s) selected.<br />
bglsn:/discovery # db2 "select location,status from bglprocessorcard<br />
where location like 'R00-M0-N0%'"<br />
LOCATION STATUS<br />
-------------------------------- ------<br />
R00-M0-N0-C:J02 M<br />
R00-M0-N0-C:J03 M<br />
R00-M0-N0-C:J04 M<br />
R00-M0-N0-C:J05 M<br />
R00-M0-N0-C:J06 M<br />
R00-M0-N0-C:J07 M<br />
R00-M0-N0-C:J08 M<br />
R00-M0-N0-C:J09 M<br />
R00-M0-N0-C:J10 M<br />
R00-M0-N0-C:J11 M<br />
R00-M0-N0-C:J12 M<br />
R00-M0-N0-C:J13 M<br />
R00-M0-N0-C:J14 M<br />
R00-M0-N0-C:J15 M<br />
R00-M0-N0-C:J16 M<br />
R00-M0-N0-C:J17 M<br />
R00-M0-N0-I:J18 M<br />
17 record(s) selected.<br />
There is a dedicated table in the database for service actions. A new action
on a component cannot be started until the previous service action on it is
closed. You can list open service actions by looking for entries in
BGLSERVICEACTION with a status of O (for Open), as shown in Example 1-5.
Example 1-5 Showing open service actions<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # source /discovery/db.src<br />
Database Connection Information<br />
Database server = DB2/LINUXPPC 8.2.3<br />
SQL authorization ID = BGLSYSDB<br />
Local database alias = BGDB0<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # db2 "select<br />
ID,LOCATION,STATUS from bglserviceaction where status='O'"<br />
ID LOCATION STATUS<br />
----------- -------------------------------- ------<br />
19 R00-M0-N0 O<br />
1 record(s) selected.<br />
If for some reason you end up with an open service action for a component that
is not really disabled for service, you can complete it by manually updating
the database entry for that service action to C (for Closed), as shown in
Example 1-6. This situation can occur if a service action is initialized, but
the system is then brought up by another method, such as discovery, rather
than by EndServiceAction.
Example 1-6 Completing a service action manually<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # db2 "select<br />
ID,LOCATION,STATUS from bglserviceaction where status='O'"<br />
ID LOCATION STATUS<br />
----------- -------------------------------- ------<br />
19 R00-M0-N0 O<br />
1 record(s) selected.<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # db2 "update bglserviceaction<br />
set STATUS='C' where ID=19"<br />
DB20000I The SQL command completed successfully.<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # db2 "select
ID,LOCATION,STATUS from bglserviceaction where ID=19"<br />
ID LOCATION STATUS<br />
----------- -------------------------------- ------<br />
19 R00-M0-N0 C<br />
1 record(s) selected.<br />
PrepareForService logs its invocations to /bgl/BlueLight/logs/BGL. The log
file names begin with “PrepareForService” followed by the date and time, as
shown in Example 1-7.
Example 1-7 PrepareForService logs<br />
bglsn:/bgl/BlueLight/logs/BGL # ls PrepareForService*<br />
PrepareForService-2006-03-16-10:45:40.log<br />
PrepareForService-2006-03-30-14:06:40.log<br />
PrepareForService-2006-03-16-10:46:09.log<br />
PrepareForService-2006-03-30-14:07:07.log<br />
PrepareForService-2006-03-16-10:47:01.log<br />
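Because the timestamp embedded in each log name is zero-padded, lexical order matches chronological order, so the newest log can be located with a plain sort. A small sketch over sample names (not a real directory listing):

```shell
# Find the most recent PrepareForService log by its timestamped name.
# Zero-padded YYYY-MM-DD-HH:MM:SS timestamps sort correctly as text.
latest=$(printf '%s\n' \
    "PrepareForService-2006-03-16-10:45:40.log" \
    "PrepareForService-2006-03-30-14:06:40.log" \
    "PrepareForService-2006-03-16-10:47:01.log" | sort | tail -n 1)
echo "$latest"
```

On a live Service Node, `ls PrepareForService* | sort | tail -n 1` in the log directory gives the same result.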
1.11.2 EndServiceAction<br />
The syntax for this command is:<br />
EndServiceAction id [Verbose] [Wait / NoWait]<br />
Where:<br />
► id is the service action ID that was returned by PrepareForService
► Wait indicates that it should wait for the component and subcomponents to<br />
become active<br />
► NoWait indicates that it should return after turning on the component but not<br />
wait for it and its subcomponents to become active in the database<br />
Example 1-8 shows the return to service of the node card that we disabled in<br />
1.11.1, “PrepareForService” on page 46.<br />
Before starting EndServiceAction, you need to start the systemcontroller,<br />
discovery, and postdiscovery processes. These processes are required when<br />
the hardware comes back online, because the hardware needs to be<br />
rediscovered and marked available instead of missing in the database. When all<br />
of the expected hardware is back online, you can stop the systemcontroller,<br />
discovery, and postdiscovery processes.<br />
Example 1-8 EndServiceAction on a Node card<br />
bglsn:/discovery # ./EndServiceAction 19<br />
Mar 31 13:57:41.376 CST: EndServiceAction started<br />
Mar 31 13:57:55.069 CST: Disabled this NodeCard's ethernet port on<br />
ServiceCard (R00-M0-S, FF:F2:9F:15:1E:55:00:0D:60:EA:E1:AA), Port (11)<br />
Mar 31 13:57:55.453 CST: Marked NodeCard<br />
(203231503833343000000000594c31324b35323135303148) missing<br />
Mar 31 13:57:55.500 CST: Deleted node hardware attrs for 36 nodes<br />
Mar 31 13:57:55.517 CST: Card has been successfully powered off!<br />
Mar 31 13:57:55.596 CST: Powered Off NodeCard<br />
(203231503833343000000000594c31324b35323135303148)<br />
-snip-<br />
Mar 31 13:59:21.252 CST: Enabled all of the NodeCard ethernet ports on
ServiceCard (R00-M0-S, FF:F2:9F:15:1E:55:00:0D:60:EA:E1:AA)<br />
Mar 31 13:59:21.302 CST: Changed Midplane R00-M0's status from 'E' to<br />
'A'<br />
Mar 31 14:00:21.363 CST: @ Still waiting for 16 NodeCards in R00-M0 to<br />
become active<br />
Mar 31 14:07:21.498 CST:<br />
Mar 31 14:08:21.528 CST: All of R00-M0's NodeCards are active<br />
Mar 31 14:09:21.621 CST: @ Still waiting for 16 compute Processor cards<br />
in R00-M0 to become active<br />
Mar 31 14:11:21.669 CST: All of R00-M0's NodeCards are active<br />
Mar 31 14:11:21.684 CST: All of R00-M0's compute Processor cards are<br />
active<br />
Mar 31 14:11:21.706 CST: All of R00-M0's compute Nodes are active<br />
Mar 31 14:11:21.707 CST:<br />
Mar 31 14:11:35.531 CST: Enabling the PortB on each of this<br />
Midplane's LinkCards - to indicate that PortA is powered on<br />
Mar 31 14:11:36.537 CST: Enabling this Midplane's Port B output<br />
drivers<br />
Mar 31 14:11:37.726 CST: Enabling this Midplane's Port A input<br />
receivers<br />
Mar 31 14:11:38.946 CST: Ended Service Action Id 19 for R00-M0-N0<br />
Mar 31 14:11:38.952 CST: MyShutdownHook - Exiting EndServiceAction<br />
Mar 31 14:11:38.955 CST: +++ This logfile is closed +++<br />
EndServiceAction logs its invocations to /bgl/BlueLight/logs/BGL. The log
file names begin with “EndServiceAction” followed by the date and time, as
shown in Example 1-9.
Example 1-9 EndServiceAction logs<br />
bglsn:/bgl/BlueLight/logs/BGL # ls EndServiceAction*<br />
EndServiceAction-2006-03-16-11:06:08.log<br />
EndServiceAction-2006-03-30-14:30:27.log<br />
EndServiceAction-2006-03-16-11:15:04.log<br />
EndServiceAction-2006-03-30-14:31:16.log<br />
EndServiceAction-2006-03-16-11:49:05.log<br />
EndServiceAction-2006-03-30-14:31:45.log<br />
1.12 Turning off the system<br />
When turning off your Blue Gene/L system, be careful to do so in a controlled
manner. Simply switching off racks can leave the system in a state that might
make it difficult to return to operation. To turn off your system, follow
these steps:
1. Prepare each individual rack for service using the following commands:<br />
cd /discovery<br />
./PrepareForService Rxx<br />
This preparation has the effect of shutting down all the Blue Gene/L<br />
hardware.<br />
Repeat the command, changing xx for each rack in your system.
After the PrepareForService command is finished, note the serviceactionid<br />
that is displayed at the end of the command output for each command.<br />
2. Stop the Control System using the following commands:<br />
cd /bgl/BlueLight/ppcfloor/bglsys/bin<br />
./bglmaster stop<br />
3. Stop DB2® on the Service Node using the following commands:<br />
su - bglsysdb<br />
db2 force application all
db2 terminate
db2stop<br />
4. Shut down and turn off the Service Node.
5. Turn off the racks.<br />
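The command-line part of the sequence above can be sketched as a script. The two-rack list is an assumption for illustration, and each command is echoed rather than executed on a Service Node:

```shell
# Sketch of the controlled shutdown sequence. RACKS is invented for a
# two-rack system; echo stands in for actually running each command.
RACKS="R00 R01"
out=$(
    for rack in $RACKS; do
        echo "cd /discovery && ./PrepareForService $rack   # note the service action ID"
    done
    echo "cd /bgl/BlueLight/ppcfloor/bglsys/bin && ./bglmaster stop"
    echo "su - bglsysdb -c 'db2 force application all; db2 terminate; db2stop'"
)
printf '%s\n' "$out"
```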
1.13 Turning on the system<br />
Having turned off your system properly, you can turn it on again using these steps:
1. Turn on the racks.<br />
2. Turn on and boot the Service Node.<br />
3. Check that DB2 has started. If it has not, start it with the following commands:<br />
su - bglsysdb<br />
db2start
For more details on getting DB2 started and to start it automatically when the<br />
system boots, see 2.3.4, “Check that DB2 is working” on page 87.<br />
4. Start the processes that rediscover your hardware and initialize it, using
the following commands:
cd /bgl/BlueLight/ppcfloor/bglsys/bin<br />
./bglmaster start<br />
cd /discovery<br />
./SystemController start<br />
./Discovery0 start<br />
Repeat the discovery command for each row of racks in your system,<br />
changing the last digit to each row number:<br />
./Discovery1 start<br />
./Discovery2 start<br />
5. Start the PostDiscovery process to check the discovered configuration and<br />
add position information.<br />
./PostDiscovery start<br />
6. End the service action that was used to shut down the system.<br />
cd /discovery<br />
./EndServiceAction X<br />
Repeat the EndServiceAction for each rack, using the previously saved
serviceactionid.
7. Use DB2 queries or a Web page (if available) to verify that all hardware
reports in, as described in 2.2, “Identifying the installed system” on page 57.
8. After the last EndServiceAction completes and all the hardware is shown,<br />
stop all the processes previously launched for the system discovery using the<br />
following commands:<br />
./SystemController stop<br />
./Discovery0 stop<br />
Repeat the discovery command for each row of racks in your system,<br />
changing the last digit to each row number.<br />
./Discovery1 stop<br />
./Discovery2 stop<br />
9. Stop PostDiscovery.<br />
./PostDiscovery stop<br />
10. Restart the Blue Gene/L system processes so that the system can go back
into production.
./bglmaster restart<br />
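Step 6 above can be scripted once the per-rack service action IDs saved at shutdown are known. A hedged sketch (the rack-to-ID map is invented for illustration, and echo stands in for execution on a Service Node):

```shell
# Close the per-rack service actions recorded during shutdown.
# Each entry is "rack:serviceactionid"; both values are assumptions.
saved="R00:12 R01:13"
out=$(
    for pair in $saved; do
        rack=${pair%%:*}     # text before the colon
        said=${pair##*:}     # text after the colon
        echo "cd /discovery && ./EndServiceAction $said   # rack $rack"
    done
)
printf '%s\n' "$out"
```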
Chapter 2. Problem determination methodology
This chapter discusses how to identify the various components in an IBM System<br />
Blue Gene Solution system. It also includes a list of core Blue Gene/L sanity<br />
checks that you can use to ensure that your system is working properly.<br />
This chapter also provides a problem determination methodology that can help<br />
you identify the cause of Blue Gene/L system problems. Following the<br />
methodology helps you quickly find the issue, identify the Blue Gene/L system,<br />
and identify the problem area so that you can confirm and fix the problem.<br />
2.1 Introduction<br />
Whenever you have to work with a complex system, it is essential that you know
the actual system configuration. This chapter provides a list of tasks that
enable you to determine the system configuration of your Blue Gene/L system.
We also provide a set of checks to ensure that the components in your core
Blue Gene/L system are functioning correctly, and we explain what we consider
the core of the Blue Gene/L system to be.
Due to the numerous components in a Blue Gene/L system, we consider that the<br />
best way to approach a problem is to separate it into different problem areas.<br />
The methodology that we discuss here provides a process for this approach that<br />
includes three distinct areas:<br />
► Defining the problem.<br />
► Identifying the Blue Gene/L system.<br />
► Identifying the problem area.<br />
This approach allows someone with little Blue Gene/L experience to assess<br />
where the problem lies quickly, and perhaps more importantly, to determine<br />
whether there is a problem at all. Such problem determination is the key to<br />
maintaining a successful running system. The methodology then points to<br />
particular check lists to show how to practically determine the problem for each<br />
of the areas.<br />
In this book we discuss the Blue Gene/L system in two different ways:<br />
1. Core Blue Gene/L<br />
– Blue Gene/L racks<br />
– Service Node, including the Blue Gene/L processes, NFS, and DB2.<br />
– Network switches for Service and Functional Network<br />
2. Complex Blue Gene/L<br />
– Core Blue Gene/L<br />
– Front-End Nodes<br />
– MPI<br />
– GPFS<br />
– LoadLeveler<br />
Note: For our discussion, we assume that readers already have a working<br />
knowledge of the Linux operating system and TCP/IP. This knowledge is a<br />
prerequisite for understanding the environment that we use and tools that we<br />
present.<br />
2.2 Identifying the installed system<br />
You can determine the Blue Gene/L system configuration with a combination of<br />
the following tools and actions:<br />
► Blue Gene/L Web interface (BGWEB)<br />
► DB2 select statements of the DB2 database on the SN<br />
► Standard operating system (Linux) commands<br />
► Visually inspecting the hardware<br />
We discuss these tools and actions in the following paragraphs.<br />
2.2.1 Blue Gene/L Web interface (BGWEB)<br />
BGWEB is installed on the Service Node (SN). To connect to this service, point<br />
your browser to the following URL:<br />
http:///web/index.php<br />
Figure 2-1 gives an example of the BGWEB home page.<br />
Figure 2-1 BGWEB home page on the SN<br />
The Configuration section displays the structure of the system and expands<br />
further into more detail for each physical component of the core Blue Gene/L<br />
system.<br />
Note: You can run BGWEB from a Front-End Node (FEN). However, this is<br />
not supported officially. BGWEB requires a DB2 client to interface with the<br />
DB2 database on the SN and a Web server configured on the FEN.<br />
2.2.2 DB2 select statements of the DB2 database on the SN<br />
You can run SQL select statements to query information that is stored in the
DB2 database. Select statements are useful for querying the number of
different components in the system. Example 2-1 shows a DB2 select statement
that displays the number of compute node cards in a system.
Example 2-1 DB2 select command displaying the number of compute nodes<br />
# db2 "select count(*) num_of_compute_node_cards from BGLPROCESSORCARD \
where ISIOCARD = 'F' and STATUS <> 'M'"
NUM_OF_COMPUTE_NODE_CARDS<br />
-------------------------<br />
64<br />
1 record(s) selected.<br />
In this example, the user first needs to create the appropriate execution
environment by sourcing /discovery/db.src. This script sets up the default
database environment and connects to the database (so that DB2 commands can
be run). Running this script should produce output similar to that shown in
Example 2-2.
Example 2-2 Sourcing the /discovery/db.src file<br />
bglsn:/bgl/BlueLight/logs/BGL # source /discovery/db.src<br />
Database Connection Information<br />
Database server = DB2/LINUXPPC 8.2.3<br />
SQL authorization ID = BGLSYSDB<br />
Local database alias = BGDB0<br />
Note: The /discovery/db.src script is a copy of the<br />
/bgl/BlueLight/V1R2M1_020_2006-060110/ppc/bglsys/discovery/db.src file. In<br />
addition, the directory /bgl/BlueLight/V1R2M1_020_2006-060110 represents<br />
the driver version that we used for this redbook.<br />
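The filter in Example 2-1 can be illustrated without a live database. The sketch below applies the same ISIOCARD = 'F' and STATUS &lt;&gt; 'M' conditions with awk over invented sample rows (location, isiocard, status):

```shell
# Count "compute node cards that are present" the way Example 2-1 does:
# isiocard must be F and status must not be M (Missing).
# The rows below are sample data, not real BGLPROCESSORCARD contents.
count=$(awk '$2 == "F" && $3 != "M" { n++ } END { print n+0 }' <<'EOF'
R00-M0-N0-C:J02 F A
R00-M0-N0-C:J03 F A
R00-M0-N0-I:J18 T A
R00-M0-N1-C:J02 F M
EOF
)
echo "compute node cards present: $count"
```

The I/O card row (isiocard T) and the missing card (status M) are excluded, so two rows survive the filter.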
2.2.3 Network diagram
The system administrator needs to understand the network configuration of the
Blue Gene/L system. (For information about the functions of both the service
and functional networks, see 1.3, “Blue Gene/L networks” on page 21. These
networks are required in both a core and a complex Blue Gene/L system
configuration.)
It is important to have an up-to-date diagram of the network that connects the
Blue Gene/L system. This diagram should include the IP addresses of the
system and the network switches.
2.2.4 Service Node
There can be only one SN per installed Blue Gene/L system. (Refer to 1.4.2,
“Service Node system processes” on page 31 for a more detailed description.)
You can check the IP configuration for the service and functional networks on the
SN in the following way:
► Service network<br />
a. Using the Blue Gene/L Web interface, at the BGWEB home page, click<br />
Database Browser at the bottom of the page.<br />
b. Click the TBGLIDOPROXY database table link. A new page is displayed<br />
showing the idoproxy configuration as shown in Figure 2-2.<br />
Figure 2-2 The DB2 database table TBGLIDOPROXY (Service network)<br />
c. Use the command line on the SN as shown in Example 2-3.<br />
Example 2-3 Showing service network using DB2 CLI<br />
# db2 "select PROXYID,PROXYIPADDRESS from TBGLIDOPROXY"<br />
PROXYID PROXYIPADDRESS<br />
-----------------------------------------------------------------------<br />
BlueGene1 10.0.0.1<br />
1 record(s) selected.<br />
Then check the IP range in the /etc/hosts:<br />
# grep 10.0.0.1 /etc/hosts<br />
10.0.0.1 bglsn_sn.itso.ibm.com bglsn_sn<br />
► Functional network<br />
a. Using the Blue Gene/L Web interface, at the BGWEB home page, click<br />
Database Browser at the bottom of the page.<br />
b. Click the TBGLIPPOOL database table link. A new page is displayed<br />
showing the IP range for the I/O nodes, as shown in Figure 2-3.<br />
Figure 2-3 Data from DB2 database table TBGLIPPOOL<br />
c. Using the command line on the SN, you can obtain the addresses for the<br />
functional network that are stored in the DB2 database and check the IP<br />
labels that are associated in /etc/hosts, as shown in Example 2-4.<br />
Example 2-4 Functional network info using DB2 CLI<br />
# db2 "select IPADDRESS from TBGLIPPOOL"<br />
IPADDRESS<br />
-----------------------------------------------------------------------<br />
172.30.2.1<br />
172.30.2.10<br />
..snip..<br />
# grep 172.30.2 /etc/hosts<br />
172.30.2.1 ionode1<br />
172.30.2.2 ionode2<br />
172.30.2.3 ionode3<br />
..snip..<br />
d. You can then compare these IP addresses to the output from /sbin/ip ad,
as shown in Example 2-5.
Example 2-5 Network interface configuration on the SN<br />
# ip ad<br />
..snip..<br />
2: eth0: mtu 1500 qdisc pfifo_fast qlen 1000<br />
link/ether 00:0d:60:4d:28:ea brd ff:ff:ff:ff:ff:ff<br />
inet 10.0.0.1/16 brd 10.0.255.255 scope global eth0<br />
inet6 fe80::20d:60ff:fe4d:28ea/64 scope link<br />
valid_lft forever preferred_lft forever<br />
..snip..<br />
5: eth3: mtu 1500 qdisc pfifo_fast qlen 1000<br />
link/ether 00:11:25:08:30:90 brd ff:ff:ff:ff:ff:ff<br />
inet 172.30.1.1/16 brd 172.30.255.255 scope global eth3<br />
inet6 fe80::211:25ff:fe08:3090/64 scope link<br />
valid_lft forever preferred_lft forever<br />
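The database, /etc/hosts, and interface views of the functional network should agree. The following sketch cross-checks a list of I/O node IP addresses against a hosts file, using embedded sample data in place of the db2 query and the real /etc/hosts:

```shell
# Verify that every I/O node IP from TBGLIPPOOL has an /etc/hosts entry.
# Both the IP list and the hosts file are sample data for illustration.
db_ips="172.30.2.1
172.30.2.2
172.30.2.3"
hosts=$(mktemp)
cat > "$hosts" <<'EOF'
172.30.2.1 ionode1
172.30.2.2 ionode2
172.30.2.3 ionode3
EOF
missing=0
for ip in $db_ips; do
    # Anchor the match at line start so 172.30.2.1 does not match .10.
    grep -q "^$ip[[:space:]]" "$hosts" || { echo "missing from hosts: $ip"; missing=1; }
done
[ "$missing" -eq 0 ] && echo "all I/O node IPs resolved"
rm -f "$hosts"
```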
2.2.5 Front-End Nodes<br />
The Blue Gene/L system can have one or more Front-End Nodes (FENs). There
is no direct way to identify FENs other than knowing that they are separate
components within the Blue Gene/L configuration. FENs are the nodes from
which user jobs are submitted through LoadLeveler or mpirun, so the only way
to identify them is to know which components jobs are submitted from.
These components should be included in the topology diagram. A possible check
is to see whether MPI support and LoadLeveler are installed, using the
rpm -qa | grep mpi command. If LoadLeveler is installed, all the FENs should be
listed in the llstatus output, as shown in Figure 4-12 on page 172. Refer to 1.5,
“Front-End Node” on page 32 for more information.
2.2.6 Control system server logs<br />
There are a number of logs that are generated on the SN. These logs are called<br />
the control system server logs. Table 2-1 on page 62 through Table 2-8 on<br />
page 64 show the logs that are generated and their purpose. There are also logs<br />
generated for diagnostics. We discuss these logs further in 3.4, “Diagnostics” on<br />
page 131.<br />
The default location for control system server logs is the /bgl/BlueLight/logs
directory, which contains two subdirectories:
► /bgl/BlueLight/logs/BGL<br />
This directory includes all the logs for BGLMaster and its child daemons.
There is also a log for each I/O node in the Blue Gene/L system. These logs
are written by the I/O nodes through the /bgl NFS mount, because the I/O
nodes do not have any persistent storage.
► /bgl/BlueLight/logs/diags<br />
There is a time stamped directory for every diagnostic run on the Blue Gene/L<br />
system. The directory looks similar to:<br />
/bgl/BlueLight/logs/diags/2006-0307-17:40:08_R000<br />
Table 2-1 Control system log for BGLMaster<br />
BGLMaster<br />
Name of log <hostname>-bglmaster-current.log, a symbolic link to
<hostname>-bglmaster-<timestamp>.log
Example bglsn-bglmaster-2006-0330-14:56:20.log<br />
Description Shows the initialization of BGLMaster and its child daemons,<br />
which involves parsing the<br />
/bgl/BlueLight/ppcfloor/bglsys/bin/bglmaster.init file. Also, logs<br />
the status of its child daemons.<br />
Table 2-2 Control system log for idoproxy<br />
idoproxy (BGLMaster child daemon)<br />
Name of log <hostname>-idoproxydb-current.log, a symbolic link to
<hostname>-idoproxydb-<timestamp>.log
Example bglsn-idoproxydb-2006-0330-14:56:20.log
Description Shows the initialization of idoproxy, the complete name of the
process, the log generated, the Blue Gene/L driver version, and the IP
address and ports that it opens. Also provides information about
each block that is booted in the Blue Gene/L system.
Table 2-3 Control system log for ciodb<br />
ciodb (BGLMaster child daemon)<br />
Name of log <hostname>-ciodb-current.log, a symbolic link to
<hostname>-ciodb-<timestamp>.log
Example bglsn-ciodb-2006-0330-14:56:20.log<br />
Description Shows initialization of ciodb, the log generated, the complete<br />
name of the process and the Blue Gene/L driver version. Also<br />
provides useful information about submitted jobs including the<br />
Blue Gene/L job ID and the I/O nodes (with IP addresses) used<br />
for the job.<br />
Table 2-4 Control system log for mmcs_db_server<br />
mmcs_db_server (BGLMaster child daemon)<br />
Name of log <hostname>-mmcs_db_server-current.log, a symbolic link to
<hostname>-mmcs_db_server-<timestamp>.log
Example bglsn-mmcs_db_server-2006-0330-14:56:20.log<br />
Description Shows initialization of mmcs_db_server, the log generated, the<br />
complete name of the process and the Blue Gene/L driver<br />
version. Also provides very useful information associating the<br />
booted block with the I/O node location codes, their log files,<br />
hostnames and IP addresses. It also provides useful runtime<br />
debug data.<br />
Table 2-5 Control system log for monitorHW<br />
monitorHW (BGLMaster child daemon)<br />
Name of log <hostname>-monitorHW-current.log, a symbolic link to
<hostname>-monitorHW-<timestamp>.log
Example bglsn-monitorHW-2006-0323-15:47:47.log<br />
Description Shows initialization of monitorHW, the log generated, the<br />
complete name of the process and the Blue Gene/L driver<br />
version. Also provides information about the monitoring that has<br />
taken place.<br />
Table 2-6 Control system log for perfmon<br />
perfmon (BGLMaster child daemon)<br />
Name of log <hostname>-perfmon.pl-current.log, a symbolic link to
<hostname>-perfmon.pl-<timestamp>.log
Example bglsn-perfmon.pl-2006-0329-18:24:38.log
Description Shows the initialization of perfmon, the log generated, the
complete name of the process, and the Blue Gene/L driver version.
Also provides performance data on running Blue Gene/L jobs. This
did not appear to be in use at the time this redbook was written.
Table 2-7 Control system log for I/O nodes<br />
I/O node log (One for each I/O node in the Blue Gene/L system)<br />
Name of log <nodecard location>-I:<Jxx>-<Uxx>.log
Example R00-M0-N0-I:J18-U01.log
Description Shows the startup process of the I/O node, including the loading
of the MCP, the startup scripts with their output, and the partition
the node is associated with at boot. When the I/O node is shut
down, all the messages from the shutdown scripts and other
shutdown messages are displayed. These files are appended to on
every boot and shutdown.
Table 2-8 The updateSchema.pl script log<br />
updateSchema.pl<br />
Name of log updateSchema-<timestamp>.log
Example updateSchema-2006-04-01-13:18:29.log
Description Shows updateSchema.pl updating the schema on the Blue Gene/L
database from the new driver version. This is done only when a
driver update is applied.
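The naming conventions in the tables above can be reproduced in shell when you need to locate a specific daemon's log; the host name and timestamp used here are assumptions:

```shell
# Assemble a control-system log name following the convention in the
# tables above. Host name, daemon, and timestamp are example values.
host=bglsn
daemon=bglmaster
ts="2006-0330-14:56:20"
logname="${host}-${daemon}-${ts}.log"
echo "$logname"
```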
2.2.7 File systems (NFS and GPFS)<br />
Depending on your Blue Gene/L system configuration, more than one file system
might be available to the I/O nodes. Although Network File System (NFS) is
required for the system to function (refer to 1.8, “Boot process, job
submission, and termination” on page 36), General Parallel File System (GPFS)
can also be used by the I/O nodes for reading and writing a job's data. You
need to be aware of the components that serve NFS and GPFS on the functional
network:
► NFS<br />
The SN exports the /bgl directory over NFS on the functional network. Refer to
2.3.6, “Check the NFS subsystem on the SN” on page 90 for more
information.
The I/O nodes can also mount NFS file systems that are exported from servers
other than the SN or the FENs. In this case, neither the SN nor the FENs
are required to mount these file systems. However, the SN needs to know about
these file systems, because it passes the information down to the I/O nodes
using the /bgl/dist/etc/rc.d/rc3.d/S10sitefs file. If this file exists, check
for a line similar to the one shown in Example 2-6.
Example 2-6 Additional NFS file systems to be mounted by the I/O nodes<br />
bglsn_# cat /bgl/dist/etc/rc.d/rc3.d/S10sitefs<br />
..<br />
# Mount a scratch file system...<br />
mountSiteFs $SITEFS /bgscratch /bgscratch<br />
tcp,rsize=32768,wsize=32768,async<br />
..<br />
► GPFS<br />
As previously mentioned, if a sitefs file exists, you can check whether GPFS
has been enabled to run on the I/O nodes. Check for a line similar to the one
shown in Example 2-7.
Example 2-7 The sitefs configuration for a Blue Gene/L system using GPFS<br />
bglsn_# cat /bgl/dist/etc/rc.d/rc3.d/S10sitefs<br />
..<br />
# Uncomment the first line to start GPFS.<br />
# Optionally uncomment the other lines to change the defaults for<br />
# GPFS.<br />
echo "GPFS_STARTUP=1" >> /etc/sysconfig/gpfs<br />
..<br />
To identify whether GPFS is in use, run /usr/lpp/mmfs/bin/mmlscluster and
/usr/lpp/mmfs/bin/mmlsconfig on the SN. The I/O nodes in your Blue
Gene/L system, as well as your SN, should be listed. You must configure
GPFS to mount the GPFS file systems automatically when the GPFS daemon
is started, on the SN as well as on all I/O nodes that are part of the cluster. For
more information about GPFS running on the Blue Gene/L system, refer to
Chapter 5, “File systems” on page 211.
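Whether a site has enabled GPFS on the I/O nodes can be checked mechanically from the sitefs script. The sketch below greps a sample file standing in for /bgl/dist/etc/rc.d/rc3.d/S10sitefs:

```shell
# Detect whether the sitefs script enables GPFS startup on the I/O
# nodes, as in Example 2-7. A temporary sample file replaces the real
# S10sitefs script.
f=$(mktemp)
cat > "$f" <<'EOF'
# Uncomment the first line to start GPFS.
echo "GPFS_STARTUP=1" >> /etc/sysconfig/gpfs
EOF
if grep -q '^echo "GPFS_STARTUP=1"' "$f"; then
    gpfs_enabled=yes
else
    gpfs_enabled=no
fi
echo "GPFS enabled on I/O nodes: $gpfs_enabled"
rm -f "$f"
```

A commented-out line (`# echo "GPFS_STARTUP=1" ...`) would not match the anchored pattern, so the check distinguishes enabled from disabled.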
Note: The /bgl/dist/etc/rc.d/rc3.d/S10sitefs file is a symbolic link to
/bgl/dist/etc/rc.d/init.d/sitefs. Be aware that it might also be found under
/bgl/BlueLight/ppcfloor/dist/etc/rc.d, although this is not the advised
location for it.
2.2.8 Job submission
You can run jobs on the Blue Gene/L system in three different ways:
► You can run the submit_job command at the mmcs_console on the SN. Only
the root user on the SN can run this command.
► You can run the mpirun command from a FEN or the SN. You can run this
command as an authorized, non-root user.
► You can run the llsubmit command from a FEN or the SN. This is a
LoadLeveler command, and you can run it as an authorized, non-root user.
Note: It is not very likely that the submit_job command will be used as part of
daily activity on the system. Using this command may be appropriate during
certain system verification actions; however, we do not encourage its use.
You need to be able to identify how jobs are submitted on your system. It is
likely that jobs will be submitted through MPI or LoadLeveler from the FENs.
For more information about running jobs with the submit_job command, refer to
2.3.8, “Check that a simple job can run (mmcs_console)” on page 96, and for
information about mpirun and llsubmit, refer to Chapter 4, “Running jobs” on
page 141.
2.2.9 Racks
Here are two ways to determine the number of racks in the Blue Gene/L system:
► Using the Blue Gene/L Web interface:<br />
At the BGWEB home page, click Configuration to display the racks that are<br />
configured in the IBM System Blue Gene Solution system (Figure 2-4 on<br />
page 67).<br />
66 IBM System Blue Gene Solution: <strong>Problem</strong> <strong>Determination</strong> <strong>Guide</strong>
Figure 2-4 Configuration of Blue Gene/L in BGWEB<br />
Note: If a piece of hardware has a red box around it, this part of the<br />
system, or hardware within it, is missing or has a hardware error.<br />
► Using a DB2 select statement:<br />
In Example 2-8, the number of records in the BGLBASEPARTITION view<br />
represents the number of racks that are available to the SN since the last<br />
system Discovery was performed.<br />
Example 2-8 Listing the active racks in the Blue Gene/L system<br />
# db2 "select BPID,STATUS from BGLBASEPARTITION where STATUS 'M'"<br />
BPID STATUS<br />
---- ------<br />
R000 A<br />
1 record(s) selected.<br />
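The record count returned by this query can also be extracted with a small filter. The following sketch assumes the Rxxx BPID naming and the two-column output shown in Example 2-8.<br />

```shell
# Sketch: count available (STATUS = A) racks in BGLBASEPARTITION-style
# query output read on stdin; header and separator lines are ignored.
count_available_racks() {
  awk '$1 ~ /^R[0-9][0-9][0-9]$/ && $2 == "A" { n++ } END { print n+0 }'
}
# Typical use on the SN:
#   db2 "select BPID,STATUS from BGLBASEPARTITION where STATUS <> 'M'" \
#     | count_available_racks
```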
2.2.10 Midplanes<br />
Note: The select statements and database views in this section query for<br />
records with the STATUS field not equal to M (Missing), using the operator <>.<br />
The STATUS field can also have the values E (Error) and A (Available).<br />
Each rack contains two midplanes that are only detected when a service card is<br />
plugged into the midplane and then connected by Ethernet to the service<br />
network. You can determine the number of midplanes that are available to the<br />
Blue Gene/L system in two ways:<br />
► Using the Blue Gene/L Web interface:<br />
At the BGWEB home page, click Configuration to display the midplanes that<br />
are configured in the IBM System Blue Gene Solution system (Figure 2-4 on<br />
page 67).<br />
► Using a DB2 select statement:<br />
In Example 2-9, the number of records in the BGLMIDPLANE view<br />
represents the number of midplanes that are available to the system since the<br />
last system Discovery was performed. This example only shows one<br />
midplane being used.<br />
Example 2-9 Listing the midplanes available to the Blue Gene/L system<br />
# db2 "select LOCATION,POSINMACHINE,STATUS from BGLMIDPLANE where<br />
STATUS <> 'M'"<br />
LOCATION POSINMACHINE STATUS<br />
-------------------------------- ------------ ------<br />
R00-M0 R000 A<br />
1 record(s) selected.<br />
2.2.11 Clock cards<br />
Each rack has one clock card. Therefore, the number of clock cards is the same<br />
as the number of racks in the Blue Gene/L system. There is no way to identify the<br />
clock cards that are available to the Blue Gene/L system through the BGWEB or<br />
DB2. The only way is to check the bottom of each rack manually.<br />
For detailed information about the clock card, refer to 1.2.8, “Clock card” on<br />
page 15.<br />
2.2.12 Service cards<br />
Blue Gene/L has one service card per midplane. Therefore, there are a<br />
maximum of two service cards per rack. You can determine the number of<br />
service cards that are available to the Blue Gene/L system in two ways:<br />
► Using the Blue Gene/L Web interface:<br />
At the BGWEB home page, click Configuration, and then click the midplane<br />
link in which you are interested. A new page displays a table of the cards<br />
within that midplane. Figure 2-5 shows the page that is loaded. You can see<br />
the service card at the bottom of the table that is displayed.<br />
Note: To identify the cards within a given midplane, Blue Gene/L: <strong>Systems</strong><br />
Administration, SG24-7178 advises you to query the relevant DB2 table.<br />
For example, if you want to find the number of service cards in a midplane,<br />
query the TBLSERVICECARD table and use the midplane serial number<br />
to ensure that you only find the cards in that midplane. In addition to the<br />
DB2 database tables, there are database views that do some of this work<br />
for you.<br />
Figure 2-5 BGWEB table showing the cards within a midplane<br />
► Using a DB2 select statement:<br />
In Example 2-10, the number of service cards is represented by the<br />
NUMSERVICECARDS field in the BGLSERVICECARDCOUNT view. This<br />
value is the number of service cards, per midplane, that are available to the<br />
system since the last system Discovery was performed.<br />
BGLSERVICECARDCOUNT generates information by querying the database<br />
alias BGLSERVICECARD.<br />
Example 2-10 Listing the number of service cards in the Blue Gene/L system<br />
# db2 "select * from BGLSERVICECARDCOUNT"<br />
MIDPLANESERIALNUMBER NUMSERVICECARDS<br />
--------------------------------------------------- ---------------<br />
x'203937503631353900000000594C31304B35303238303036' 1<br />
1 record(s) selected.<br />
Note: BGLSERVICECARDCOUNT is a database view and does not use the<br />
STATUS <> 'M' clause in its SQL statement the way that other count<br />
database views that are available on the system do. You need to be aware<br />
that there should only be one service card per midplane.<br />
2.2.13 Link cards<br />
A full Blue Gene/L rack has eight link cards—four link cards per midplane. Here<br />
are two ways to determine the number of link cards that are available to the<br />
system:<br />
► Using the Blue Gene/L Web interface:<br />
At the BGWEB home page, click Configuration, and then click the midplane<br />
links. A new page displays a table of the cards within that midplane.<br />
Figure 2-5 on page 69 shows the page that is loaded. You can see the link<br />
cards identified in the Type column.<br />
► Using a DB2 select statement:<br />
Example 2-11 shows the number of link cards that are represented by the<br />
NUMLINKCARDS field in the BGLLINKCARDCOUNT view. This value is the<br />
number of link cards, per midplane, that are available to the system since the<br />
last system Discovery was performed.<br />
Example 2-11 Listing the link cards available to the Blue Gene/L system<br />
# db2 "select * from BGLLINKCARDCOUNT"<br />
MIDPLANESERIALNUMBER NUMLINKCARDS<br />
--------------------------------------------------- ------------<br />
x'203937503631353900000000594C31304B35303238303036' 4<br />
1 record(s) selected.<br />
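Because a midplane always holds four link cards, the counts returned by this view can be checked mechanically. The sketch below assumes the two-column serial number/count layout shown in Example 2-11.<br />

```shell
# Sketch: warn about any midplane whose link-card count is not 4.
# Reads BGLLINKCARDCOUNT-style rows (hex serial number, count) on stdin.
check_link_cards() {
  awk '$1 ~ /^x/ && $2 ~ /^[0-9]+$/ && $2 != 4 {
         print "WARNING:", $1, "has", $2, "link cards"
       }'
}
# Typical use on the SN:
#   db2 "select * from BGLLINKCARDCOUNT" | check_link_cards
```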
2.2.14 Link card chips<br />
As previously discussed, there are four link cards per midplane. In addition, each<br />
link card contains six link chips. These chips are used to link signals between<br />
compute processors on different midplanes. For more information about the link<br />
card and chips, refer to Figure 1-17 on page 18.<br />
You can determine the number of link chips that are available to the system in<br />
two ways:<br />
► Using the Blue Gene/L Web interface:<br />
At the BGWEB home page, click Configuration, and then click the midplane<br />
links. A new page displays a table of the cards within that midplane.<br />
Figure 2-5 on page 69 shows the page that is loaded. You can see the link<br />
cards in the table. To view the link card details, click one of the link card<br />
hardware names. Figure 2-6 shows the BGWEB link card detail.<br />
Figure 2-6 BGWEB table showing the link chips on a link card<br />
Figure 2-7 shows the remainder of the same BGWEB page that is displayed<br />
in Figure 2-6, which gives details on the connection status of each link chip.<br />
Figure 2-7 shows an example of a 2x2 system. Therefore, it has X, Y, and Z<br />
cables between the midplanes and racks that use all six chips.<br />
Figure 2-7 BGWEB table showing cable connection data for a link card<br />
► Using a DB2 select statement:<br />
In Example 2-12, the number of available link chips is represented by the<br />
LINKCHIPS field in the output from the DB2 statement. This value is the<br />
number of link chips, per link card, that are available to the system since the<br />
last system Discovery was performed. The SERIALNUMBER field represents<br />
the serial number of the individual link cards. This also shows the status of the<br />
link cards.<br />
Example 2-12 Listing the link chips per link card on the system<br />
# db2 " SELECT serialNumber, (select count(*) from BGLLinkchip where /<br />
BGLLinkcard.serialNumber=BGLLinkchip.CardSerialNumber and<br />
BGLLinkcard.status / 'M' ) linkchips,status FROM BGLLinkcard where<br />
STATUS 'M'"<br />
SERIALNUMBER LINKCHIPS STATUS<br />
--------------------------------------------------- ----------- -----x'203937503438383700000000594C31314B35303630303146'<br />
6 A<br />
x'203937503438393200000000594C31344B3530353930314B' 6 A<br />
x'203937503438393200000000594C31354B35303737303044' 6 A<br />
x'203937503438383700000000594C33304B35323336303034' 6 A<br />
4 record(s) selected.<br />
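Each link card should report all six link chips with status A. The following sketch, which assumes the three-column layout shown in Example 2-12, flags anything else.<br />

```shell
# Sketch: flag link cards that do not report six link chips, or whose
# status is not A. Reads (serial, linkchips, status) rows on stdin.
check_link_chips() {
  awk '$1 ~ /^x/ {
         if ($2 != 6)   print "WARNING:", $1, "reports", $2, "link chips"
         if ($3 != "A") print "WARNING:", $1, "status is", $3
       }'
}
```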
2.2.15 Link summary<br />
You can view the configuration of the X, Y, and Z cables on the Blue Gene/L<br />
systems from the BGWEB interface. This view provides the best way to get a<br />
visual idea of the 3D torus cabling of your system. You can view a link summary<br />
in two ways:<br />
► Using the Blue Gene/L Web interface:<br />
a. At the BGWEB home page, click Configuration.<br />
b. Click Link Summary at the top of the page (Figure 2-8 and Figure 2-9).<br />
Figure 2-8 BGWEB Link Summary output for the X and Y links between racks<br />
Figure 2-9 BGWEB Link Summary output for Z links between midplanes<br />
Note: In our environment for this redbook, we did not have a system with X, Y,<br />
and Z cables to test. So, the output in Figure 2-8 and Figure 2-9 is from<br />
another system with more than four node cards in its configuration.<br />
► Using a DB2 select statement:<br />
Example 2-13 shows a DB2 select statement that identifies the Z links in a<br />
Blue Gene/L system. This particular example is a complex statement from the<br />
BGWEB source, but it does give a clear view of the link between the<br />
midplanes.<br />
Example 2-13 Displaying the Z links between two midplanes<br />
# db2 "SELECT CHAR(LEFT(SUBSTR(source,4,3),3),3) AS source, /<br />
CHAR(LEFT(SUBSTR(destination,4,3),3),3) AS destination FROM<br />
bglsysdb.TbglLink / WHERE source LIKE 'Z_%' AND sourceport = 'P2' FOR<br />
READ ONLY WITH UR "<br />
SOURCE DESTINATION<br />
------ -----------<br />
000 001<br />
001 000<br />
2 record(s) selected.<br />
Example 2-14 gives a simpler DB2 select statement that shows the links within<br />
the Blue Gene/L system from ASIC-to-ASIC on each link card in use. In order<br />
to determine that the ASIC values 2 and 3 are for the Z cables, refer to<br />
Figure 1-17 on page 18.<br />
Example 2-14 Displaying the links within the Blue Gene/L system<br />
# db2 "select<br />
FROMLCLOCATION,FROMASIC,TOLCLOCATION,TOASIC,NUMBADWIRES,STATUS from<br />
tbglcable"<br />
FROMLCLOCATION FROMASIC TOLCLOCATION TOASIC NUMBADWIRES STATUS<br />
-------------- -------- ------------ ------ ----------- ------<br />
R00-M1-L0 2 R00-M0-L0 2 0 A<br />
R00-M1-L0 3 R00-M0-L0 3 0 A<br />
R00-M1-L1 2 R00-M0-L1 2 0 A<br />
R00-M1-L1 3 R00-M0-L1 3 0 A<br />
R00-M1-L2 2 R00-M0-L2 2 0 A<br />
R00-M1-L2 3 R00-M0-L2 3 0 A<br />
R00-M1-L3 2 R00-M0-L3 2 0 A<br />
R00-M1-L3 3 R00-M0-L3 3 0 A<br />
R00-M0-L0 2 R00-M1-L0 2 0 A<br />
R00-M0-L0 3 R00-M1-L0 3 0 A<br />
R00-M0-L1 2 R00-M1-L1 2 0 A<br />
R00-M0-L1 3 R00-M1-L1 3 0 A<br />
R00-M0-L2 2 R00-M1-L2 2 0 A<br />
R00-M0-L2 3 R00-M1-L2 3 0 A<br />
R00-M0-L3 2 R00-M1-L3 2 0 A<br />
R00-M0-L3 3 R00-M1-L3 3 0 A<br />
16 record(s) selected.<br />
2.2.16 Node cards<br />
The number of node cards can vary for each Blue Gene/L system. You can<br />
determine the number of node cards that are available in two ways:<br />
► Using the Blue Gene/L Web interface:<br />
At the BGWEB home page, click Configuration, and then click the midplane<br />
links. A new page displays a table of the cards within that midplane.<br />
Figure 2-5 on page 69 shows the page that is loaded. You can see the node<br />
cards identified in the Type column of the table.<br />
► Using a DB2 select statement:<br />
In Example 2-15, the number of node cards is represented by the<br />
NUMNODECARDS field in the BGLNODECARDCOUNT view. This value is<br />
the number of node cards, per midplane, that are available to the system<br />
since the last system Discovery was performed.<br />
Example 2-15 Listing the number of node cards per midplane<br />
# db2 "select * from BGLNODECARDCOUNT"<br />
MIDPLANESERIALNUMBER NUMNODECARDS<br />
--------------------------------------------------- ------------<br />
x'203937503631353900000000594C31304B35303238303036' 4<br />
1 record(s) selected.<br />
2.2.17 I/O cards<br />
Note: There are no DB2 database views to help separate the number of I/O<br />
and compute/processor cards in the Blue Gene/L system. Database view<br />
BGLPROCESSORCARDCOUNT only displays the total number of processor<br />
cards on the system, including I/O cards.<br />
A node card can have one or two I/O cards installed. If it has two I/O cards<br />
installed, it is called an I/O rich node card. You can determine the number of I/O<br />
cards that are available to the system in two ways:<br />
► Using the Blue Gene/L Web interface:<br />
a. At the BGWEB home page, click Configuration.<br />
b. Click the midplane links. A new page displays a table of the cards within<br />
that midplane. Figure 2-5 on page 69 shows the page that is loaded.<br />
c. Click the individual node cards listed in the table to view the processor<br />
cards. Figure 2-10 and Figure 2-11 show the full page. The full page<br />
displays the cards with the I/O nodes identified by a Yes in the Is IO Card<br />
column.<br />
Figure 2-10 Top of the page for the Node card view in BGWEB page<br />
Figure 2-11 Node card view in BGWEB page (I/O node card marked as Yes)<br />
Figure 2-12 shows the detailed view of an I/O node card, including the two chips on<br />
the card.<br />
Figure 2-12 I/O node card detail showing the IP addresses of the I/O nodes<br />
► Using a DB2 select statement:<br />
The I/O nodes in a node card within a particular midplane are linked to the<br />
node card by their serial number (license plate). In Example 2-16, the<br />
IONODECARD field is the I/O node count per node card. This example also<br />
shows the status of the node cards.<br />
Example 2-16 The number of I/O node cards per midplane<br />
# db2 " SELECT serialNumber, (select count(*) from BGLProcessorCard<br />
where / BGLNodeCard.serialNumber=BGLProcessorCard.nodeCardSerialNumber<br />
and / BGLProcessorCard.status 'M' and isiocard = 'T')<br />
ionodecard,status FROM / BGLNodeCard where STATUS 'M'"<br />
SERIALNUMBER IONODECARD STATUS<br />
--------------------------------------------------- ----------- -----x'203231503833343000000000594C31304B34323635303134'<br />
1 A<br />
x'203231503833343000000000594C31304B3432383030314B' 1 A<br />
x'203231503833343000000000594C31304B34323630304B35' 1 A<br />
x'203937503538373400000000594C31304B34313534303032' 1 A<br />
4 record(s) selected.<br />
2.2.18 Compute or processor cards<br />
Compute cards are also referred to as processor cards. Each node card holds 16<br />
compute cards. Thus, you can use the number of node cards in a midplane to<br />
determine the number of compute cards using the following equation:<br />
(Number of Node cards) x 16 = (number of compute cards)<br />
The number of compute nodes within the system is two times the number of<br />
compute cards. Therefore, in a midplane that has four node cards, the number of<br />
compute cards is 4 x 16 = 64 compute cards, resulting in (64 x 2) = 128 compute<br />
nodes in the system.<br />
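The counting rule above can be expressed directly in the shell; the node-card count of 4 matches the example midplane.<br />

```shell
# Worked example of the counting rule: 16 compute cards per node card,
# and 2 compute nodes per compute card. The value 4 is illustrative.
node_cards=4
compute_cards=$((node_cards * 16))
compute_nodes=$((compute_cards * 2))
echo "$node_cards node cards -> $compute_cards compute cards -> $compute_nodes compute nodes"
```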
Here are two ways to determine the number of compute cards that are available<br />
to the system:<br />
► Using the Blue Gene/L Web interface:<br />
a. At the BGWEB home page, click Configuration, and then click the<br />
midplane links. A new page displays a table of the cards within that<br />
midplane. Figure 2-5 on page 69 shows the page that is loaded.<br />
b. Click the individual node cards that are listed in the table to view the<br />
compute (processor) cards (Figure 2-10 on page 78 and Figure 2-11 on<br />
page 78).<br />
c. Click the processor card links to expand the detail for each processor<br />
card, as shown in Figure 2-13.<br />
Figure 2-13 Processor card detail view through BGWEB interface<br />
► Using a DB2 select statement:<br />
The compute cards in a node card within a particular midplane are linked to<br />
the node card by its serial number (license plate). In Example 2-17, the<br />
COMPUTECARDS field is the compute card count per node card. This field<br />
also shows the status of the node cards.<br />
Example 2-17 The number of compute cards per node card<br />
# db2 "SELECT serialNumber, (select count(*) from BGLProcessorCard<br />
where / BGLNodeCard.serialNumber=BGLProcessorCard.nodeCardSerialNumber<br />
and / BGLProcessorCard.status 'M' and isiocard = 'F')<br />
computecards,status FROM / BGLNodeCard where STATUS 'M'"<br />
SERIALNUMBER COMPUTECARDS STATUS<br />
--------------------------------------------------- ------------ -----x'203231503833343000000000594C31304B34323635303134'<br />
16 A<br />
x'203231503833343000000000594C31304B3432383030314B' 16 A<br />
x'203231503833343000000000594C31304B34323630304B35' 16 A<br />
x'203937503538373400000000594C31304B34313534303032' 16 A<br />
4 record(s) selected.<br />
Note: To find records with a specific STATUS value, change the not-equal<br />
operator (<>), as used with the STATUS field in the DB2 examples, to<br />
equals (=).<br />
Another way of quickly checking the number of processor cards that serve as<br />
compute nodes is to run the DB2 command shown in Example 2-18.<br />
Example 2-18 Checking the number of processor cards<br />
# db2 "select count (*)num_of_compute_node_cards from BGLPROCESSORCARD<br />
/ where ISIOCARD ='F' and STATUS 'M'"<br />
NUM_OF_COMPUTE_NODE_CARDS<br />
-------------------------<br />
64<br />
1 record(s) selected.<br />
To check the number of processor cards that serve as I/O node cards, run the<br />
DB2 command shown in Example 2-19.<br />
Example 2-19 Checking the node cards with I/O nodes<br />
# db2 "select count (*)num_of_io_node_cards from BGLPROCESSORCARD where<br />
/<br />
ISIOCARD ='T' and STATUS 'M'"<br />
NUM_OF_IO_NODE_CARDS<br />
--------------------<br />
4<br />
1 record(s) selected.<br />
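The counts gathered in Examples 2-15, 2-18, and 2-19 can be cross-checked against the rules in this section. The following sketch encodes two of them: 16 compute cards per node card, and at most two I/O cards per node card.<br />

```shell
# Sketch: cross-check node-card, I/O-card, and compute-card counts.
check_counts() {  # args: node_cards io_cards compute_cards
  nc=$1; io=$2; cc=$3; rc=0
  [ "$cc" -eq $((nc * 16)) ] || { echo "unexpected compute-card count: $cc"; rc=1; }
  [ "$io" -le $((nc * 2)) ]  || { echo "unexpected I/O-card count: $io"; rc=1; }
  [ "$rc" -eq 0 ] && echo "counts consistent"
  return $rc
}
# Values from this system: 4 node cards, 4 I/O cards, 64 compute cards.
check_counts 4 4 64
```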
2.3 Sanity checks for installed components<br />
If the core components of your Blue Gene/L system are not running properly,<br />
your system will not function correctly. We recommend that you follow this list of<br />
checks to ensure that the system is in a healthy state:<br />
► Check the operating system on the SN<br />
► Check communication services on the SN<br />
► Check that BGWEB is running<br />
► Check that DB2 is working<br />
► Check that BGLMaster and its child daemons are running<br />
► Check the NFS subsystem on the SN<br />
► Check that a block can be allocated using mmcs_console<br />
► Check that a simple job can run (mmcs_console)<br />
► Check the control system server logs<br />
► Check remote shell<br />
► Check remote command execution with secure shell<br />
► Check the network switches<br />
► Check the physical Blue Gene/L racks configuration<br />
Note: One component in the system can prevent another from running<br />
correctly. For example, DB2 might be down because of an operating<br />
system or communications issue, and without DB2, jobs cannot run.<br />
Figure 2-14 illustrates the core Blue Gene/L configuration that we used to<br />
provide examples of the systems checks in this section.<br />
Figure 2-14 Diagram of a core Blue Gene/L configuration: a Service Node running<br />
SLES9 PPC 64-bit, with eth0 (10.0.0.1, netmask 255.255.0.0), eth1 (172.30.1.1,<br />
netmask 255.255.0.0), and eth5 (192.168.00.49, netmask 255.255.255.0) connecting<br />
to the public LAN, service network, and functional network switches, NFS, and<br />
one rack (R00, front midplane R00-M0) with node cards 0-3, a service card, and<br />
master/slave clock cards<br />
2.3.1 Check the operating system on the SN<br />
You need to perform two checks for the operating system on the SN:<br />
1. Check the /etc/passwd and /etc/shadow files for the required Blue Gene/L<br />
users. In a core Blue Gene/L configuration, without FENs, we would only<br />
need the users root, bglsysdb, and bgdb2cli, as shown in Example 2-20.<br />
Example 2-20 Blue Gene/L user checking<br />
# egrep "root|bglsysdb|bgdb2cli" /etc/passwd /etc/shadow<br />
/etc/passwd:root:x:0:0:root:/root:/bin/bash<br />
/etc/passwd:bglsysdb:x:1000:1000::/dbhome/bglsysdb:/bin/bash<br />
/etc/passwd:bgdb2cli:x:1001:1001::/dbhome/bgdb2cli:/bin/bash<br />
/etc/shadow:root:$1$5zzzJBvz$XTd9evpJ8d1cVvDw5c3hV/:13210:0:10000::::<br />
/etc/shadow:bglsysdb:$1$SwI1..4e$iGNeJ3bSSOHXD1Dy5TM250:13222:0:99999:7:::<br />
/etc/shadow:bgdb2cli:$1$Lyzz.trF$npMmXlHv5.XPf.ijiBFGC1:13213:0:99999:7:::<br />
2. Check for any full or nearly full file systems using the command /bin/df. In a<br />
non-customized SN installation, with DB2 and Blue Gene/L code installed, the<br />
file system layout looks similar to the output shown in Example 2-21. Full file<br />
systems can cause problems with many system processes.<br />
Example 2-21 File system checking on the SN<br />
# df -k<br />
Filesystem 1K-blocks Used Available Use% Mounted on<br />
/dev/sdb3 70614928 4752552 65862376 7% /<br />
tmpfs 1898508 8 1898500 1% /dev/shm<br />
/dev/sda4 489452 95872 393580 20% /tmp<br />
/dev/sda1 9766544 4075948 5690596 42% /bgl<br />
/dev/sda2 9766608 712428 9054180 8% /dbhome<br />
If you discover issues during the check of the operating system, you can take the<br />
following corrective actions:<br />
1. If a user does not exist in /etc/passwd and /etc/shadow, then you need to<br />
create the user.<br />
2. If any of the file systems are full or nearly full, you need to clean that<br />
file system (remove unnecessary files) or add disk space to ensure that<br />
the operating system, database, and other Blue Gene/L processes are not<br />
affected.<br />
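The "nearly full" check can be automated by filtering df -k output against a usage threshold. The 90% limit below is an arbitrary example value.<br />

```shell
# Sketch: flag file systems at or above a usage threshold (90% here)
# in `df -k` output read on stdin, so near-full file systems are
# caught before they affect system processes.
check_df_usage() {
  awk -v limit=90 'NR > 1 { sub(/%/, "", $5)
                            if ($5+0 >= limit) print "WARNING:", $6, "is", $5 "% full" }'
}
# Typical use on the SN: df -k | check_df_usage
```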
2.3.2 Check communication services on the SN<br />
The Service and Functional networks are required for the Blue Gene/L system to<br />
function correctly. To check these networks, perform the following<br />
communication checks:<br />
1. Verify that both the carrier and the network are up with the /usr/sbin/ethtool<br />
command. Example 2-22 shows the output for a working interface on the SN.<br />
Note the Speed, Duplex, and Link detected fields.<br />
Example 2-22 Ethernet adapter characteristics<br />
# /usr/sbin/ethtool eth0<br />
Settings for eth0:<br />
Supported ports: [ TP ]<br />
Supported link modes: 10baseT/Half 10baseT/Full<br />
100baseT/Half 100baseT/Full<br />
1000baseT/Full<br />
Supports auto-negotiation: Yes<br />
Advertised link modes: 10baseT/Half 10baseT/Full<br />
100baseT/Half 100baseT/Full<br />
1000baseT/Full<br />
Advertised auto-negotiation: Yes<br />
Speed: 1000Mb/s<br />
Duplex: Full<br />
Port: Twisted Pair<br />
PHYAD: 0<br />
Transceiver: internal<br />
Auto-negotiation: on<br />
Supports Wake-on: umbg<br />
Wake-on: g<br />
Current message level: 0x00000007 (7)<br />
Link detected: yes<br />
2. Check TCP/IP configuration using the /sbin/ip ad command. Example 2-23<br />
shows the loopback, eth0, and eth1 interfaces configured and<br />
up on the SN.<br />
Example 2-23 IP configuration on the SN<br />
# /sbin/ip ad<br />
1: lo: <LOOPBACK,UP> mtu 16436 qdisc noqueue<br />
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00<br />
inet 127.0.0.1/8 brd 127.255.255.255 scope host lo<br />
inet6 ::1/128 scope host<br />
valid_lft forever preferred_lft forever<br />
2: eth0: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 1000<br />
link/ether 00:0d:60:4d:28:ea brd ff:ff:ff:ff:ff:ff<br />
inet 10.0.0.1/16 brd 10.0.255.255 scope global eth0<br />
inet6 fe80::20d:60ff:fe4d:28ea/64 scope link<br />
valid_lft forever preferred_lft forever<br />
3: eth1: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 1000<br />
link/ether 00:11:25:08:30:90 brd ff:ff:ff:ff:ff:ff<br />
inet 172.30.1.1/16 brd 172.30.255.255 scope global eth1<br />
inet6 fe80::211:25ff:fe08:3090/64 scope link<br />
valid_lft forever preferred_lft forever<br />
...etc..<br />
3. Verify that the IP configuration is correct on the network interfaces. Refer to<br />
the network diagram for the system as discussed in 2.2.3, “Network diagram”<br />
on page 59.<br />
4. Use the /bin/ping command to check network connectivity of the SN<br />
interfaces for the functional and service networks from another system.<br />
Note: It is likely there will be another system on the functional network,<br />
such as an FEN. However, for the service network, you might need to<br />
connect a test system to perform this check.<br />
5. Check the lights on the network interfaces on the SN.<br />
If you discover issues during the check of the communication services, you can<br />
take the following corrective actions:<br />
1. If the /usr/sbin/ethtool output is not as expected, then check the settings<br />
on the interfaces and ensure that the ethernet cables are plugged in correctly<br />
at the SN and the switch.<br />
2. The /sbin/ip ad command should show the interfaces as UP. If they are not<br />
in UP state, check the configuration files and activate them using appropriate<br />
commands or scripts (/etc/init.d/network or /sbin/ifconfig).<br />
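The ethtool fields called out in step 1 can be summarized with a small filter, which is useful when checking several interfaces in a row. This sketch assumes the output layout shown in Example 2-22.<br />

```shell
# Sketch: extract the Speed, Duplex, and Link detected fields from
# `ethtool ethX` output read on stdin and print them on one line.
check_ethtool() {
  awk -F': *' '/Speed:/         { speed = $2 }
               /Duplex:/        { duplex = $2 }
               /Link detected:/ { link = $2 }
               END { print speed, duplex, link }'
}
# Typical use on the SN: /usr/sbin/ethtool eth0 | check_ethtool
```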
2.3.3 Check that BGWEB is running<br />
As mentioned at the beginning of 2.2, “Identifying the installed system” on<br />
page 57, BGWEB is a very useful tool for gathering data on many aspects of the<br />
Blue Gene/L system. To check if BGWEB is working properly, follow these steps:<br />
1. From the SN or a remote system, try to connect to BGWEB using the<br />
following URL:<br />
http://<SN hostname>/web/index.php<br />
Note: There might be a firewall between the system where the Web page is<br />
loaded and the SN. A tunnel can be created to forward port TCP:80 on the SN<br />
to a local port (in this case, local port TCP:5919). The URL would then need to<br />
change to something like:<br />
http://localhost:5919/web/index.php<br />
2. Check that the apache server processes are running on the SN using the<br />
/bin/ps command:<br />
ps -ef | grep httpd<br />
3. Check whether apache is configured to start automatically on start with the<br />
/sbin/chkconfig command:<br />
# chkconfig --list apache<br />
apache 0:off 1:off 2:off 3:on 4:off 5:on 6:off<br />
If you discover issues during the check of BGWEB, you can take the following<br />
corrective actions:<br />
1. If apache is not running, then try to start it using the start script<br />
/etc/rc.d/apache on the SN:<br />
/etc/rc.d/apache start<br />
2. If apache has not been configured to start when the SN boots, run (as root)<br />
the /sbin/chkconfig command:<br />
# chkconfig -s apache 35<br />
3. If there are issues with the apache configuration, check the<br />
/etc/httpd/httpd.conf file.<br />
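The process check in step 2 can be reduced to a count. This sketch assumes the standard ps -ef column layout, with the command in the eighth field.<br />

```shell
# Sketch: count httpd processes in `ps -ef` output read on stdin.
# At least one parent and typically several workers should be present.
count_httpd() {
  awk '$8 ~ /httpd/ { n++ } END { print n+0 }'
}
# Typical use on the SN: ps -ef | count_httpd
```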
2.3.4 Check that DB2 is working<br />
The DB2 database needs to be running on the SN because the Blue Gene/L<br />
control system relies on it to operate. To check that DB2 is working properly,<br />
follow these steps:<br />
1. Check that you can connect to the database instance as shown in<br />
Example 2-2 on page 58.<br />
2. Check that the DB2 user exists by connecting to the SN using the<br />
following command:<br />
# ssh bglsysdb@<SN hostname><br />
Although highly unlikely, if secure shell is not configured, use the<br />
/usr/bin/telnet command.<br />
3. Check that the bglsysdb user’s password is the same as the password in the<br />
/bgl/BlueLight/ppcfloor/bglsys/bin/db.properties file. In Example 2-24 the<br />
database_password field in the db.properties file represents the password for<br />
the user bglsysdb.<br />
Example 2-24 Output of the /bgl/BlueLight/ppcfloor/bglsys/bin/db.properties file<br />
# cat db.properties<br />
database_name=bgdb0<br />
database_user=bglsysdb<br />
database_password=bglsysdb<br />
...<br />
If you discover issues during the check of DB2, you can take the following<br />
corrective actions:<br />
1. If DB2 is not running, you need to start it. Run the commands:<br />
# su - bglsysdb<br />
# /dbhome/bglsysdb/sqllib/adm/db2start<br />
2. If required, update the /bgl/BlueLight/ppcfloor/bglsys/bin/db.properties file to<br />
reflect the configuration shown in Example 2-24 on page 87.<br />
3. If DB2 will not start, even though the user exists and the password for<br />
bglsysdb is configured correctly in the db.properties file, go through the steps<br />
in 2.3.1, “Check the operating system on the SN” on page 83 and 2.3.2,<br />
“Check communication services on the SN” on page 84.<br />
4. If DB2 has not been started automatically when the SN was booted, then<br />
check that the database instance has been set to start automatically with the<br />
/dbhome/bglsysdb/sqllib/adm/db2set command:<br />
# /dbhome/bglsysdb/sqllib/adm/db2set -i bglsysdb<br />
DB2COMM=tcpip<br />
DB2AUTOSTART=YES<br />
If the DB2AUTOSTART field has the value YES, then it is set to start<br />
automatically. If this is not set, you can enable it using the<br />
/opt/IBM/db2/V8.1/instance/db2iauto command:<br />
# /opt/IBM/db2/V8.1/instance/db2iauto -on bglsysdb<br />
Also, check that the db2fmcd entry has not been commented out of or deleted<br />
from the /etc/inittab file.<br />
# grep db2fmcd /etc/inittab<br />
fmc:2345:respawn:/opt/IBM/db2/V8.1/bin/db2fmcd #DB2 Fault Monitor<br />
Coordinator<br />
Note: The previous DB2 commands can be run as the root user or the DB2<br />
user bglsysdb.<br />
5. If after all these checks DB2 still does not work correctly, you need to call your<br />
DB2 support.<br />
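The DB2 checks in steps 1 through 4 can be scripted. The following sketch uses helper<br />
names of our own invention (they are not part of the product); each helper reads command<br />
output on stdin, so the logic can be exercised without a live DB2 instance. The paths in the<br />
comments are the ones used in this chapter.<br />

```shell
# Hypothetical helpers for the DB2 checks above.

db2_autostart_enabled() {
  # stdin: output of "db2set -i bglsysdb"; succeeds if DB2AUTOSTART=YES.
  grep -q '^DB2AUTOSTART=YES$'
}

db2fmcd_in_inittab() {
  # stdin: contents of /etc/inittab; succeeds if the db2fmcd entry is
  # present and not commented out.
  grep '^[^#]' | grep -q 'db2fmcd'
}

# Example wiring (not run here):
#   /dbhome/bglsysdb/sqllib/adm/db2set -i bglsysdb | db2_autostart_enabled \
#     || /opt/IBM/db2/V8.1/instance/db2iauto -on bglsysdb
#   db2fmcd_in_inittab < /etc/inittab || echo "db2fmcd missing from inittab"
```

Feeding the helpers from files or pipes keeps the checks non-destructive; the corrective<br />
commands (db2iauto, editing /etc/inittab) still have to be run explicitly.<br />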
2.3.5 Check that BGLMaster and its child daemons are running<br />
BGLMaster and three of its child daemons must be running for the Blue Gene/L<br />
system to operate properly. These daemons are:<br />
► idoproxy<br />
► ciodb<br />
► mmcs_server<br />
To ensure that BGLMaster is running properly, follow these steps:<br />
1. Check the status of BGLMaster on the SN using the following command:<br />
# /bgl/BlueLight/ppcfloor/bglsys/bin/bglmaster status<br />
Example 2-25 shows the expected output from BGLMaster on a system that<br />
is not using hardware (monitor) or performance (perfmon) monitoring<br />
daemons for BGL.<br />
Example 2-25 Checking the BGLMaster status<br />
# /bgl/BlueLight/ppcfloor/bglsys/bin/bglmaster status<br />
idoproxy started [18622]<br />
ciodb started [18623]<br />
mmcs_server started [18624]<br />
monitor stopped<br />
perfmon stopped<br />
2. We advise that you double-check that there is only one process for BGLMaster<br />
and each of its child daemons by running the following command:<br />
# ps -ef | egrep "BGLMaster|idoproxy|mmcs_db_server|ciodb"<br />
If there is more than one instance of a process, this needs to be cleaned up<br />
because BGLMaster is only aware of the process it has most recently started<br />
for each daemon name. It is also unaware of daemon processes if they are<br />
not started by the bglmaster command.<br />
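The two checks above can be combined into a small script. This is a sketch (the function<br />
names are ours); both functions read command output on stdin, so they can be tested<br />
against sample text rather than a live system.<br />

```shell
# Hypothetical helpers for the BGLMaster status and duplicate-process checks.

required_daemons_started() {
  # stdin: "bglmaster status" output; fails on the first required daemon
  # that is not in the "started" state.
  local status_out daemon
  status_out=$(cat)
  for daemon in idoproxy ciodb mmcs_server; do
    printf '%s\n' "$status_out" | grep -Eq "^$daemon +started" || {
      echo "$daemon is not started"
      return 1
    }
  done
}

report_duplicate_daemons() {
  # stdin: "ps -ef" output; warns about any daemon name that matches
  # more than one process line.
  local ps_out daemon n
  ps_out=$(cat)
  for daemon in BGLMaster idoproxy mmcs_db_server ciodb; do
    n=$(printf '%s\n' "$ps_out" | grep -c "$daemon")
    [ "$n" -gt 1 ] && echo "WARNING: $n processes match $daemon"
  done
  return 0
}

# Example wiring (not run here):
#   /bgl/BlueLight/ppcfloor/bglsys/bin/bglmaster status | required_daemons_started
#   ps -ef | report_duplicate_daemons
```

A warning from report_duplicate_daemons is the cue to clean up the extra processes,<br />
because BGLMaster tracks only the process it started most recently for each daemon.<br />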
If you discover issues during the check of BGLMaster, you can take the following<br />
corrective actions:<br />
1. If you need to start a particular child daemon, use the following command:<br />
# /bgl/BlueLight/ppcfloor/bglsys/bin/bglmaster start <daemon_name><br />
Caution: Restarting the BGLMaster terminates all communications between<br />
the SN and the racks. This action terminates all running jobs and booted<br />
partitions on the Blue Gene/L system.<br />
2. If you need to restart BGLMaster, you can run the following command:<br />
# /bgl/BlueLight/ppcfloor/bglsys/bin/bglmaster restart<br />
3. If there are still issues with BGLMaster, we suggest that you run through this<br />
sequence of commands:<br />
# /bgl/BlueLight/ppcfloor/bglsys/bin/bglmaster stop<br />
# ps -ef | egrep "BGLMaster|idoproxy|mmcs_db_server|ciodb"<br />
# /bgl/BlueLight/ppcfloor/bglsys/bin/bglmaster start<br />
# /bgl/BlueLight/ppcfloor/bglsys/bin/bglmaster status<br />
Note: You can edit the file /bgl/BlueLight/ppcfloor/bglsys/bin/bglmaster.init to<br />
determine what is started automatically when you run<br />
/bgl/BlueLight/ppcfloor/bglsys/bin/bglmaster start. For more<br />
information, see Blue Gene/L: System Administration, SG24-7178.<br />
2.3.6 Check the NFS subsystem on the SN<br />
NFS is an integral part of the Blue Gene/L boot process, as described in 1.8,<br />
“Boot process, job submission, and termination” on page 36. It can also be used<br />
for the reading and writing of data by the jobs that are submitted on the system.<br />
(This is discussed in 2.2.7, “File systems (NFS and GPFS)” on page 64 and in<br />
more depth in Chapter 5, “File systems” on page 211.) Thus, NFS needs to be<br />
functioning correctly on the SN, otherwise a block will not be able to boot.<br />
To check that the NFS subsystem is functioning properly, check that the /bgl file<br />
system is exported from the SN over the functional network. The<br />
/usr/sbin/showmount -e command shows what file systems are exported and<br />
also gives a good indication whether NFS is working.<br />
# showmount -e bglsn<br />
Export list for bglsn:<br />
/bgl 172.30.0.0/255.255.0.0<br />
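A minimal scripted version of this check (the function name is ours) parses the showmount<br />
output for a /bgl entry:<br />

```shell
# Hypothetical helper: reads "showmount -e <host>" output on stdin and
# succeeds only if /bgl appears in the export list.
bgl_is_exported() {
  # Skip the "Export list for ..." header line, then look for /bgl followed
  # by whitespace (so /bglscratch or similar does not match by accident).
  tail -n +2 | grep -q '^/bgl[[:space:]]'
}

# Example wiring (not run here):
#   showmount -e bglsn | bgl_is_exported || echo "/bgl is not exported"
```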
If the showmount command returns an error, then depending on the message, you<br />
might be able to identify which part of the NFS subsystem is causing the<br />
problem:<br />
1. showmount -e points to a potential issue with the port mapper service<br />
# showmount -e<br />
mount clntudp_create: RPC: Port mapper failure - RPC: Unable to<br />
receive<br />
Possible fix:<br />
# /etc/init.d/portmap restart<br />
# /etc/init.d/nfsserver restart<br />
2. showmount -e points to a potential issue with rpc.mountd or nfsd<br />
# showmount -e<br />
mount clntudp_create: RPC: Program not registered<br />
Possible fix:<br />
# /etc/init.d/nfsserver restart<br />
Note: Refer to the check lists in Chapter 5, “File systems” on page 211 for<br />
more NFS related checks.<br />
2.3.7 Check that a block can be allocated using mmcs_console<br />
Note: The Blue Gene/L blocks are also referred to as partitions.<br />
A good way to ensure that the core Blue Gene/L system is working correctly is to<br />
check whether a block can be booted. This check ensures that the<br />
communication between the SN and the racks and files used during the boot<br />
process is functioning. (These topics are covered in 1.8, “Boot process, job<br />
submission, and termination” on page 36.)<br />
To check that a block can be allocated, follow these steps:<br />
1. Connect to the mmcs_db_console on the SN:<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./mmcs_db_console<br />
connecting to mmcs_server<br />
connected to mmcs_server<br />
connected to DB2<br />
mmcs$<br />
2. After you are connected to the console, try to allocate a predefined block. The<br />
following is an example of a successful boot of the block called R000_128:<br />
mmcs$ allocate R000_128<br />
OK<br />
mmcs$<br />
It is possible to list the predefined blocks and the currently booted blocks<br />
within the mmcs_db_console using the list bglblock command. The list<br />
command queries the DB2 database alias BGLBLOCK.<br />
3. Monitor the block from the bglsn-bgdb0-idoproxydb-current.log and<br />
bglsn-bgdb0-mmcs_db_server-current.log files, which are located in<br />
/bgl/BlueLight/logs/BGL.<br />
4. Check the block state from the mmcs_db_console with the list bglblock<br />
command, as shown in Example 2-26.<br />
Example 2-26 Checking a block (R000_128)<br />
mmcs$ list bglblock R000_128<br />
OK<br />
==> DBBlock record<br />
_blockid = R000_128<br />
_numpsets = 0<br />
_numbps = 0<br />
_owner =<br />
_istorus = 000<br />
_sizex = 0<br />
_sizey = 0<br />
_sizez = 0<br />
_description = Generated via genSmallBlock<br />
_mode = C<br />
_options =<br />
_status = I<br />
_statuslastmodified = 2006-04-01 15:08:30.896873<br />
_mloaderimg =<br />
/bgl/BlueLight/ppcfloor/bglsys/bin/mmcs-mloader.rts<br />
_blrtsimg =<br />
/bgl/BlueLight/ppcfloor/bglsys/bin/rts_hw.rts<br />
_linuximg =<br />
/bgl/BlueLight/ppcfloor/bglsys/bin/zImage.elf<br />
_ramdiskimg =<br />
/bgl/BlueLight/ppcfloor/bglsys/bin/ramdisk.elf<br />
_debuggerimg = none<br />
_debuggerparmsize = 0<br />
_createdate = 2006-03-06 17:56:20.600708<br />
The _status field shows the current status of the block. (Table 4-3 on<br />
page 162 explains these values.)<br />
You can also gather the block information from the BGWEB interface. At the<br />
BGWEB home page, click Runtime to show the currently booting or initialized<br />
dynamic or predefined blocks. Predefined block information is displayed<br />
whatever the current status. Figure 2-15 shows an example of both dynamic<br />
and predefined blocks.<br />
Figure 2-15 BGWEB Block Information page<br />
An additional check on the core system that you can perform to ensure that a<br />
block can be allocated is to verify that the I/O nodes are running and connected<br />
to the functional network. Take the following steps:<br />
1. Go to the point where the predefined block shows that it has been allocated in<br />
/bgl/BlueLight/logs/BGL/bglsn-bgdb0-mmcs_db_server-current.log.<br />
Example 2-27 shows the text that is generated when block R000_128 is<br />
allocated at the mmcs_db_console.<br />
Example 2-27 Text generated in mmcs_db_server log when a block is initialized<br />
..snip..<br />
Apr 05 17:20:57 (I) [1090516192] test1 allocate R000_128<br />
Apr 05 17:20:57 (I) [1090516192] test1<br />
DBMidplaneController::addBlock(R000_128)<br />
..snip..<br />
Apr 05 17:20:57 (I) [1090516192] test1:R000_128<br />
BlockController::connect()<br />
Apr 05 17:20:57 (I) [1090516192] test1:R000_128 I/O node:<br />
R00-M0-N0-I:J18-U01 log file:<br />
/bgl/BlueLight/logs/BGL/R00-M0-N0-I:J18-U01.log<br />
Apr 05 17:20:57 (I) [1090516192] test1:R000_128 I/O node:<br />
R00-M0-N0-I:J18-U11 log file:<br />
/bgl/BlueLight/logs/BGL/R00-M0-N0-I:J18-U11.log<br />
.snip..<br />
Apr 05 17:20:57 (I) [1090516192] test1:R000_128 I/O node:<br />
R00-M0-N3-I:J18-U01 log file:<br />
/bgl/BlueLight/logs/BGL/R00-M0-N3-I:J18-U01.log<br />
Apr 05 17:20:57 (I) [1090516192] test1:R000_128 I/O node:<br />
R00-M0-N3-I:J18-U11 log file:<br />
/bgl/BlueLight/logs/BGL/R00-M0-N3-I:J18-U11.log<br />
Apr 05 17:20:58 (I) [1090516192] test1:R000_128<br />
BlockController::load_microloader() loading microloader<br />
/bgl/BlueLight/ppcfloor/bglsys/bin/mmcs-mloader.rts<br />
Apr 05 17:20:58 (I) [1090516192] test1:R000_128<br />
BlockController::start_microloaders() starting mailbox polling<br />
Apr 05 17:20:58 (I) [1090516192] test1:R000_128<br />
BlockController::start_microloaders() starting microloader boot image<br />
on 136 nodes<br />
Apr 05 17:20:58 (I) [1092621536] test1:R000_128 MailboxMonitor thread<br />
starting<br />
Apr 05 17:20:58 (I) [1090516192] test1:R000_128<br />
BlockController::start_microloaders() startMicroLoader starting 136<br />
nodes<br />
Apr 05 17:20:59 (I) [1090516192] test1:R000_128<br />
DBBlockController::boot(): making switch settings for block R000_128<br />
Apr 05 17:20:59 (I) [1090516192] test1:R000_128<br />
DBBlockController::boot(): completed switch settings for block R000_128<br />
Apr 05 17:20:59 (I) [1090516192] test1:R000_128 BlockController::load()<br />
loading program /bgl/BlueLight/ppcfloor/bglsys/bin/ramdisk.elf<br />
Apr 05 17:21:03 (I) [1090516192] test1:R000_128 BlockController::load()<br />
loading program /bgl/BlueLight/ppcfloor/bglsys/bin/zImage.elf<br />
Apr 05 17:21:07 (I) [1090516192] test1:R000_128 BlockController::load()<br />
loading program /bgl/BlueLight/ppcfloor/bglsys/bin/rts_hw.rts<br />
Apr 05 17:21:08 (I) [1090516192] test1:R000_128<br />
BlockController::start() starting 136 nodes<br />
Apr 05 17:21:08 (I) [1090516192] test1:R000_128 starting 128 nodes with<br />
entry point 0x00000290<br />
Apr 05 17:21:08 (I) [1090516192] test1:R000_128 starting 8 nodes with<br />
entry point 0x00800000<br />
Apr 05 17:21:08 (I) [1090516192] test1:R000_128<br />
DBBlockController::waitBoot(R000_128)<br />
Apr 05 17:21:36 (I) [1092621536] test1:R000_128 DBBlockController:<br />
contacting I/O node {119} at address 172.30.2.4:7000<br />
Apr 05 17:21:36 (I) [1092621536] test1:R000_128 DBBlockController:<br />
contacted I/O node {119} at address 172.30.2.4:7000<br />
..snip..<br />
Apr 05 17:21:36 (I) [1092621536] test1:R000_128 DBBlockController:<br />
contacted I/O node {34} at address 172.30.2.5:7000<br />
Apr 05 17:21:38 (I) [1090516192] test1:R000_128<br />
DBBlockController::waitBoot(R000_128) block initialization successful<br />
..snip..<br />
2. Using the date pattern from when the block was allocated (Example 2-27 on<br />
page 93), it is possible to create a for loop to identify the I/O nodes that were<br />
used for the block and to ensure that they are responding to a ping over the<br />
network, as shown in Example 2-28.<br />
Example 2-28 Checking the I/O nodes for the booted blocks<br />
# for i in `grep "R000_128" \<br />
/bgl/BlueLight/logs/BGL/bglsn-bgdb0-mmcs_db_server-current.log | grep \<br />
"Apr 05 17:2" |grep contact| awk '{print $14}'| cut -f1 -d':'| uniq \<br />
|sort`; do echo Pinging ionode IP $i; ping -c1 $i >/dev/null 2>&1; \<br />
echo Return Code:$?; done<br />
Pinging ionode IP 172.30.2.1<br />
Return Code:0<br />
Pinging ionode IP 172.30.2.2<br />
Return Code:0<br />
Pinging ionode IP 172.30.2.3<br />
Return Code:0<br />
Pinging ionode IP 172.30.2.4<br />
Return Code:0<br />
Pinging ionode IP 172.30.2.5<br />
Return Code:0<br />
Pinging ionode IP 172.30.2.6<br />
Return Code:0<br />
Pinging ionode IP 172.30.2.7<br />
Return Code:0<br />
Pinging ionode IP 172.30.2.8<br />
Return Code:0<br />
Note: This test can be useful when the booting of a block is hanging. If a block<br />
cannot boot and if there is no indication where the problem might be, try to<br />
boot node card sized blocks that are generated by the gensmallblock<br />
command to isolate the hardware problem.<br />
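The pipeline in Example 2-28 can also be split into a reusable pair of helper functions.<br />
This is a sketch (the function names are ours); the awk field positions assume that each<br />
"contacting/contacted I/O node" message occupies a single line in the real log file, as in<br />
the pipeline above, even though the book listing wraps those lines.<br />

```shell
# Hypothetical helpers built on the same pipeline as Example 2-28.

extract_ionode_ips() {
  # stdin: mmcs_db_server log; $1: block id; $2: timestamp pattern.
  # Prints the unique I/O node IP addresses contacted during block boot.
  grep "$1" | grep "$2" | grep contact | awk '{print $14}' \
    | cut -f1 -d':' | sort -u
}

ping_ionodes() {
  # stdin: one IP address per line; pings each once and reports the result.
  while read -r ip; do
    echo "Pinging ionode IP $ip"
    ping -c1 "$ip" >/dev/null 2>&1
    echo "Return Code:$?"
  done
}

# Example wiring (not run here):
#   extract_ionode_ips R000_128 "Apr 05 17:2" \
#     < /bgl/BlueLight/logs/BGL/bglsn-bgdb0-mmcs_db_server-current.log \
#     | ping_ionodes
```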
2.3.8 Check that a simple job can run (mmcs_console)<br />
If a block can be booted from the mmcs_db_console, then a good test is to see if<br />
you can run a simple job on the block using a program such as hello.rts. You<br />
can find an example of this code in Unfolding the IBM eServer Blue Gene<br />
Solution, SG24-6686.<br />
To check that a simple job can run, follow these steps:<br />
1. When the block has booted successfully, in this case R000_128, select the<br />
block to run the job, using select_block:<br />
mmcs$ select_block R000_128<br />
OK<br />
mmcs$<br />
Note: There is no need to select the block if the previously run command<br />
at the mmcs prompt was allocate.<br />
2. Submit the job using submit_job:<br />
mmcs$ submit_job /bglscratch/test1/hello.rts /bglscratch/test1<br />
OK<br />
jobId=257<br />
Note: It is also possible to select the block and submit the job in a single step<br />
at the mmcs_db_console prompt with the submitjob command.<br />
3. Monitor the job from bglsn-bgdb0-mmcs_db_server-current.log and<br />
bglsn-bgdb0-ciodb-current.log, which are located in /bgl/BlueLight/logs/BGL.<br />
4. Check the status of the job from the mmcs_db_console with the list bgljob<br />
command. In Example 2-29, job 257 has a status of T (Terminated).<br />
Example 2-29 Checking the status of a job<br />
mmcs$ list bgljob 257<br />
OK<br />
==> DBJob record<br />
_jobid = 257<br />
_entrydate = 2006-03-20 14:55:56.497880<br />
_username = root<br />
_blockid = R000_128<br />
_jobname = Mon Mar 20 14:55:53 2006<br />
.520.1096840416<br />
_executable = /bgl/applications/Examples/hello.rts<br />
_outputdir = /bgl/applications/Examples<br />
_status = T<br />
_errtext =<br />
_action = D<br />
_exitstatus = 0<br />
_mode = C<br />
_starttime = 2006-03-20 14:55:56.258832<br />
_nodesused = 128<br />
_strace = -1<br />
_stdininfo = 1024<br />
_stdoutinfo = 1024<br />
_stderrinfo = 1024<br />
The list bgljob command in the mmcs_db_console queries the DB2<br />
alias BGLJOB. In the same way, you can use the list bgljob_history<br />
command to gather data on previously run jobs. You can query the<br />
BGLJOB and BGLJOB_HISTORY aliases using the<br />
/dbhome/bgdb2cli/sqllib/bin/db2 command.<br />
You can also retrieve job information from the BGWEB interface. At the<br />
BGWEB home page, click Runtime, and then click Job Information.<br />
Figure 2-16 and Figure 2-17 present examples of job information views<br />
through the BGWEB interface.<br />
Figure 2-16 BGWEB showing job Information page<br />
Figure 2-17 BGWEB showing detailed Job ID information<br />
An additional check that you can perform is to map the Blue Gene/L<br />
job ID to a block using the BGWEB interface:<br />
1. Click Runtime and then Block Information.<br />
2. Select a link for a particular block. It gives details on the block, including the<br />
current jobs that are running on it, as shown in Figure 2-18.<br />
Figure 2-18 BGWEB showing block details and jobs for block data<br />
2.3.9 Check the control system server logs<br />
There are certain logs that you can check, depending on what information is<br />
required. For information about the control system server logs, see 2.2.6,<br />
“Control system server logs” on page 61. There, we include an explanation about<br />
the information that is available in each log file.<br />
To check the control system server log, you need to do the following:<br />
► Block monitoring<br />
The block can be monitored from bglsn-bgdb0-idoproxydb-current.log and<br />
bglsn-bgdb0-mmcs_db_server-current.log, which are located in<br />
/bgl/BlueLight/logs/BGL. Run the following command on each log:<br />
# tail -F /bgl/BlueLight/logs/BGL/bglsn-bgdb0-idoproxydb-current.log<br />
# tail -F<br />
/bgl/BlueLight/logs/BGL/bglsn-bgdb0-mmcs_db_server-current.log<br />
► Job monitoring<br />
When the job is submitted, you can monitor it by checking the<br />
bglsn-bgdb0-ciodb-current.log. Use the following command:<br />
# tail -F /bgl/BlueLight/logs/BGL/bglsn-bgdb0-ciodb-current.log<br />
2.3.10 Check remote shell<br />
In a complex Blue Gene/L system, to execute commands on remote machines<br />
(FENs), you can implement rsh and rshd. Here, we show what you need to check<br />
to ensure that rsh is configured correctly:<br />
1. Check that the SN and FENs can talk to each other using /usr/bin/rsh in<br />
both directions across the functional network.<br />
– From SN, use the following commands:<br />
Use rsh date to check the connection to the FEN, repeating this for each<br />
FEN.<br />
# rsh bglfen1_fn date<br />
Mon Apr 3 14:41:44 EDT 2006<br />
Use rsh date to check the connection to the SN itself.<br />
# rsh bglsn_fn date<br />
Mon Apr 3 14:41:44 EDT 2006<br />
– From the FENs, use the following commands:<br />
Use rsh date to check the connection to the SN itself.<br />
# rsh bglsn_fn date<br />
Mon Apr 3 14:41:44 EDT 2006<br />
Use rsh date to check the connection to the FEN, repeating this for each<br />
FEN.<br />
# rsh bglfen1_fn date<br />
Mon Apr 3 14:41:44 EDT 2006<br />
2. Check that rsh is enabled with the /sbin/chkconfig command. The following<br />
example is the expected output when rsh is enabled.<br />
# /sbin/chkconfig --list rsh<br />
xinetd based services:<br />
rsh: on<br />
You can double check that rsh is enabled by looking at the /etc/xinetd.d/rsh file<br />
and checking that the disable field is not set to yes. Example 2-30 shows the<br />
contents of the /etc/xinetd.d/rsh file<br />
with rshd enabled.<br />
Example 2-30 Checking rshd stanza in xinetd configuration<br />
# cat /etc/xinetd.d/rsh<br />
# default: off<br />
# description:<br />
# The rshd server is a server for the rcmd(3) routine and,<br />
# consequently, for the rsh(1) program. The server provides<br />
# remote execution facilities with authentication based on<br />
# privileged port numbers from trusted hosts.<br />
#<br />
service shell<br />
{<br />
socket_type = stream<br />
protocol = tcp<br />
flags = NAMEINARGS<br />
wait = no<br />
user = root<br />
group = root<br />
log_on_success += USERID<br />
log_on_failure += USERID<br />
server = /usr/sbin/tcpd<br />
# server_args = /usr/sbin/in.rshd -L<br />
server_args = /usr/sbin/in.rshd -aL<br />
instances = 200<br />
disable = no<br />
}<br />
3. Check the ~/.rhosts file for each user or the /etc/hosts.equiv file. For further<br />
details, refer to “Remote command execution setup” on page 151.<br />
Note: If the rsh checks are done after a job has failed, you need to run them<br />
as the owner (user) of the job that failed.<br />
If there is an issue with rsh, then check the following:<br />
1. Check for correct name resolution, and verify that names resolve identically<br />
from all nodes involved. Name resolution should be local (/etc/hosts).<br />
2. Check that the ~/.rhosts has the correct IP labels (host names) and users or<br />
that /etc/hosts.equiv has the correct IP labels (host names).<br />
3. Enable rsh on the SN and the FENs, if required, with the /sbin/chkconfig<br />
command.<br />
# chkconfig rsh on<br />
# chkconfig --list rsh<br />
xinetd based services:<br />
rsh: on<br />
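The host-by-host date checks in step 1 can be wrapped in a loop. This is a sketch (the<br />
function name and host names are ours); the remote shell command is a parameter, so the<br />
same function works for both rsh and ssh, and so the loop logic can be tested with a<br />
stand-in command instead of a real remote shell.<br />

```shell
# Hypothetical helper: run "<cmd> <host> date" against each host and report
# per-host success or failure; returns nonzero if any host fails.
check_remote_shell() {
  cmd="$1"; shift
  rc=0
  for host in "$@"; do
    if "$cmd" "$host" date >/dev/null 2>&1; then
      echo "$host: OK"
    else
      echo "$host: FAILED"
      rc=1
    fi
  done
  return $rc
}

# Example wiring (not run here):
#   check_remote_shell rsh bglsn_fn bglfen1_fn
```

Remember that if the checks follow a failed job, the loop must be run as the owner of<br />
that job, because rsh authorization is per user.<br />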
2.3.11 Check remote command execution with secure shell<br />
In a complex Blue Gene/L configuration, you can implement secure shell (ssh)<br />
for remote command execution. Although you can also use ssh for remote<br />
command execution between SN and I/O nodes, in this section, we only cover<br />
checks for ssh between SN and FENs.<br />
Note: For details about setting up secure shell, see “Configuring ssh and scp<br />
on SN and I/O nodes” on page 237 and 7.2, “Secure shell” on page 395.<br />
To check remote command execution with ssh, follow these steps:<br />
1. Check that the SN and FENs can talk to each other in both directions through<br />
the functional network using /usr/bin/ssh.<br />
– From the SN:<br />
Use ssh date to check the connection to the FEN, repeating this for each<br />
FEN.<br />
# ssh bglfen1_fn date<br />
Mon Apr 3 14:41:44 EDT 2006<br />
Use ssh date to check the connection to the SN itself.<br />
# ssh bglsn_fn date<br />
Mon Apr 3 14:41:44 EDT 2006<br />
– From the FENs:<br />
Use ssh date to check the connection to the SN itself.<br />
# ssh bglsn_fn date<br />
Mon Apr 3 14:41:44 EDT 2006<br />
Use ssh date to check the connection to the FEN, repeating this for each<br />
FEN.<br />
# ssh bglfen1_fn date<br />
Mon Apr 3 14:41:44 EDT 2006<br />
2. Check that sshd is enabled with the /sbin/chkconfig command. The<br />
following command shows sshd enabled:<br />
# chkconfig --list sshd<br />
sshd 0:off 1:off 2:off 3:on 4:off 5:on<br />
6:off<br />
3. For further checks, refer to 7.2, “Secure shell” on page 395.<br />
Note: If the checks are being done after a job has failed, they need to be run<br />
as the owner (user) of the job that failed.<br />
If there is an issue with ssh, check the following:<br />
1. Ensure that the ~/.ssh/known_hosts and ~/.ssh/authorized_keys have the<br />
correct entries for root and all users able to submit jobs.<br />
2. Enable sshd on the SN and the FENs, if required, with the /sbin/chkconfig<br />
command.<br />
# chkconfig sshd 23<br />
# chkconfig --list sshd<br />
sshd 0:off 1:off 2:on 3:on 4:off 5:off<br />
6:off<br />
For further actions, refer to 7.2, “Secure shell” on page 395 and to ssh man<br />
pages.<br />
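For a scripted version of the ssh checks, add -o BatchMode=yes so that a broken key<br />
setup makes ssh fail immediately instead of hanging at a password prompt. This sketch<br />
(the function name is ours) lets the ssh command be overridden, so the reporting logic<br />
can be tested without a live ssh connection.<br />

```shell
# Hypothetical helper: non-interactive ssh reachability check for one host.
check_ssh_host() {
  # $1: host to check. SSH_CMD may be overridden (for example, in tests);
  # it is left unquoted deliberately so "ssh -o BatchMode=yes" splits into
  # a command plus its options.
  if ${SSH_CMD:-ssh -o BatchMode=yes} "$1" date >/dev/null 2>&1; then
    echo "$1: OK"
  else
    echo "$1: FAILED"
  fi
}

# Example wiring (not run here):
#   for host in bglsn_fn bglfen1_fn; do check_ssh_host "$host"; done
```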
2.3.12 Check the network switches<br />
A good understanding of your network switches and the topology that is used for<br />
the Blue Gene/L system configuration is required to solve any problems related<br />
to the network switches. (See also 2.2.3, “Network diagram” on page 59.) We<br />
recommend that you check the network switches for errors and that you perform<br />
a general health check on them. Refer to the network switches documentation<br />
and get additional help from your site network administrating team.<br />
If there are any issues with the switches, you need to resolve them to ensure<br />
reliable connectivity and network performance.<br />
2.3.13 Check the physical Blue Gene/L racks configuration<br />
It is also useful to perform a physical inspection of the Blue Gene/L racks to help<br />
identify any obvious problems.<br />
1. Check that the status lights on the service cards and the node cards are in the<br />
expected state, as explained in 1.2, “Hardware components of the Blue<br />
Gene/L system” on page 2.<br />
2. Check that all cables are connected to the interfaces and are plugged<br />
into the correct place.<br />
3. Check the clock card cables on the clock card and the service card.<br />
4. Ensure that the master/slave switch is set to the correct setting.<br />
If there are any issues with the physical configuration of the Blue Gene/L racks,<br />
then these issues need to be addressed by qualified service technicians.<br />
Note: Node cards that are connected to a powered switch with a carrier will<br />
not show they have a carrier until they are discovered by the discovery<br />
process. Refer to 1.9, “System discovery” on page 43 for more information.<br />
2.4 <strong>Problem</strong> determination methodology<br />
In this section, we describe a methodology that you can use to locate issues in a<br />
Blue Gene/L system. We developed this methodology by first building a core<br />
Blue Gene/L system, as listed in “Core Blue Gene/L” on page 56, and injecting<br />
errors into that system to simulate faults that might occur. In the process of<br />
investigating these errors, we then developed this practical problem<br />
determination methodology.<br />
We then created a more complex Blue Gene/L system with FENs, LoadLeveler,<br />
MPI, and GPFS, as listed in “Complex Blue Gene/L” on page 56. We then<br />
evolved the methodology further to ensure that these components were also<br />
addressed.<br />
Figure 2-19 illustrates this methodology. In this illustration, the steps of the<br />
methodology are in the circles and the check lists are in the squares.<br />
Figure 2-19 <strong>Problem</strong> determination methodology for Blue Gene/L system<br />
The three steps in our methodology are:<br />
1. Define the problem.<br />
2. Identify the Blue Gene/L system.<br />
3. Identify the problem area.<br />
If you cannot find the problem area in one of the check lists where you think the<br />
problem has occurred, then we recommend that you run through the check lists<br />
for the other relevant components that you have in your system. If the cause of<br />
the problem is not obvious, we recommend that you start with 2.5, “Identifying<br />
core Blue Gene/L system problems” on page 108 for help with identifying the<br />
problem.<br />
This methodology also allows someone with little experience of a Blue Gene/L<br />
system to assess where the problem lies quickly and, perhaps more importantly,<br />
to determine whether there is a problem at all.<br />
Following the methodology, you can find the appropriate checks to confirm in<br />
more detail where the problem lies. If it is not clear where an issue is in a<br />
complex system, then you can break down the problem into smaller chunks and<br />
check that each part of the system is working correctly.<br />
We also recommend that your support or administration staff document the<br />
diagnosis and resolution steps for any problem that you encounter to help with<br />
potential problems in the future.<br />
<strong>Problem</strong>s on the Blue Gene/L system generally fall into the following categories:<br />
► Block initialization<br />
► Runtime issues<br />
► General hardware problems<br />
► Results from diagnostics<br />
► Monitoring tool outputs<br />
You can then check the appropriate components to ensure that they are in a<br />
functional state using individual checklists for each component in the current<br />
Blue Gene/L environment. This methodology is demonstrated in Chapter 6,<br />
“Scenarios” on page 265 through a number of different scenarios.<br />
We recommend that you follow the seven steps in the following list to ensure that<br />
you determine the problem, resolve it, and document it correctly. The most<br />
important steps are to get a clear description of the problem symptoms and to<br />
understand the system that you are dealing with.<br />
We recommend that you:<br />
1. Define the problem.<br />
2. Identify the Blue Gene/L system.<br />
3. Identify the problem area.<br />
The first three points are the three main areas that we cover in this book.<br />
However, a normal problem determination methodology extends also to the<br />
following steps:<br />
4. If possible, check to see if the issue has occurred before.<br />
5. Generate an action plan to resolve the issue.<br />
6. After you have determined the problem, take corrective measures.<br />
7. Document the problem for future administration purposes.<br />
2.4.1 Define the problem<br />
When you have answered the following questions, you should have enough data<br />
to indicate where to start looking for the problem:<br />
► What is the description of the problem?<br />
► Are there logs or output to demonstrate the problem?<br />
► Is the problem reproducible?<br />
► Is this a hardware or a software problem?<br />
2.4.2 Identify the Blue Gene/L system<br />
The Blue Gene/L system might be running within a simple or complex<br />
configuration, and it is important that you know the system. A process to<br />
understand what configuration that you have is described in 2.2, “Identifying the<br />
installed system” on page 57.<br />
2.4.3 Identify the problem area<br />
When you have answered the questions in 2.4.1, “Define the problem” on<br />
page 107, then go through the following high-level checks to pinpoint the<br />
problem area on the Blue Gene/L system:<br />
► Has anything changed recently on the system?<br />
► Where are we seeing the problem?<br />
The questions listed in 2.4.1, “Define the problem” on page 107 should give<br />
you a good indication of where to start looking. As mentioned in 2.3, “Sanity<br />
checks for installed components” on page 82, the functionality of one part of<br />
the Blue Gene/L is dependent on another part functioning correctly.<br />
Therefore, even though the initial starting point in trying to determine a<br />
problem area might seem obvious, it does not mean this is what is causing<br />
the overall problem.<br />
The following list includes the main components that make up a complex Blue<br />
Gene/L environment. Each component has a separate checklist to identify<br />
whether it is working as expected:<br />
– Software<br />
2.5, “Identifying core Blue Gene/L system problems” on page 108<br />
4.4.9, “LoadLeveler checklist” on page 186<br />
“The mpirun checklist” on page 166<br />
5.3.10, “GPFS Checklists” on page 255<br />
5.2.4, “NFS checklists” on page 218<br />
– Hardware<br />
3.2, “Hardware monitor” on page 114<br />
3.4, “Diagnostics” on page 131.<br />
2.3.12, “Check the network switches” on page 103<br />
2.3.13, “Check the physical Blue Gene/L racks configuration” on<br />
page 104<br />
2.5 Identifying core Blue Gene/L system problems<br />
Here is a basic checklist that you can use to reveal core Blue Gene/L<br />
system-related issues:<br />
► Check what has changed on the system.<br />
► Perform some basic checks from the SN:<br />
– Check that DB2 is working<br />
– Check that BGWEB is running<br />
– Check that BGLMaster and its child daemons are running<br />
– Check that a block can be allocated using mmcs_console<br />
– Check that a simple job can run (mmcs_console)<br />
► Check the control system server logs<br />
► Check the RAS Events or use the RAS drill down in the BGWEB for relevant
errors around the time the issue occurred. See Chapter 3, “Problem determination
tools” on page 113 for more information.
► In the BGWEB, check the Runtime link to obtain information about the job<br />
that has been run and the blocks in use.<br />
Examples are shown in Figure 2-16 on page 98 and Figure 2-17 on page 98.<br />
For more information, see Chapter 3, “Problem determination tools” on
page 113.
► Check the Configuration link in the BGWEB for hardware issues.<br />
Examples are shown in 2.2, “Identifying the installed system” on page 57. For
more information, see Chapter 3, “Problem determination tools” on page 113.
► If there is no indication of where the problem might be, we recommend the
following sequence of checks:
– 2.3.1, “Check the operating system on the SN” on page 83<br />
– 2.3.2, “Check communication services on the SN” on page 84<br />
– 2.3.6, “Check the NFS subsystem on the SN” on page 90<br />
– 2.3.12, “Check the network switches” on page 103<br />
– 2.3.13, “Check the physical Blue Gene/L racks configuration” on page 104<br />
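The process-level portion of these checks lends itself to a small triage script. The following is only a sketch: the process names (db2sysc for DB2, httpd for BGWEB, bglmaster for BGLMaster) are assumptions, and the mmcs_console block and job checks still have to be done by hand:

```shell
#!/bin/sh
# Triage sketch for the core Service Node checks listed above.
# Process names (db2sysc, httpd, bglmaster) are assumptions; substitute
# whatever your installation runs for DB2, BGWEB, and BGLMaster.
check_procs() {
    for proc in db2sysc httpd bglmaster; do
        first=$(printf '%.1s' "$proc"); rest=${proc#?}
        # The [f]irst-character bracket keeps grep from matching itself.
        if ps -ef | grep "[$first]$rest" >/dev/null; then
            echo "$proc: running"
        else
            echo "$proc: NOT running"
        fi
    done
}
check_procs
```

Any line reporting NOT running points you at the corresponding checklist section above.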
108 IBM System Blue Gene Solution: <strong>Problem</strong> <strong>Determination</strong> <strong>Guide</strong>
2.6 Mapping IBM LoadLeveler jobs to BGL jobs (CLI)
This section gives an overview of how to identify the jobs that are submitted to
Blue Gene/L through IBM LoadLeveler. The tasks and actions that we present
here are not intended to replace the procedures that are described in the
LoadLeveler documentation. Rather, we provide some tips.
We start by setting the ‘-verbose’ option in the LoadLeveler command file
(.cmd), as shown in Example 2-31.
Example 2-31 LoadLeveler job command file with ‘-verbose’ option<br />
bglsn:/bglscratch/test1 # cat ior-gpfs.cmd<br />
#@ job_type = bluegene<br />
#@ executable = /usr/bin/mpirun<br />
#@ bg_size = 128<br />
##@ bg_partition = R000_128<br />
##@ arguments = -verbose 2 -exe /bglscratch/test1/hello-file-2.rts -args 6
#@ arguments = -verbose 2 -exe /bglscratch/test1/applications/IOR/IOR.rts -args -f /bglscratch/test1/applications/IOR/ior-inputs
##@ output = /bgl/loadl/out/hello.$(schedd_host).$(jobid).$(stepid).out<br />
##@ error = /bgl/loadl/out/hello.$(schedd_host).$(jobid).$(stepid).err<br />
#@ output = /bglscratch/test1/ior-gpfs.out<br />
#@ error = /bglscratch/test1/ior-gpfs.err<br />
#@ environment = COPY_ALL<br />
##@ notification = error<br />
##@ notify_user = loadl<br />
#@ class = small<br />
#@ queue<br />
Then, you can check the job’s stderr(2) file for detailed information, as shown in
Example 2-32.
Example 2-32 Verbose information in job’s stderr(2) file<br />
..snip..
BE_MPI (Debug): Adding job bglfen1.itso.ibm.com.16.0 to the DB...
BRIDGE (Debug): rm_get_jobs() - Called
BRIDGE (Debug): rm_get_jobs() - Completed Successfully
SCHED_BRIDGE (Debug): Partition RMP24Mr104437181 - No BG/L job assigned to this partition
BRIDGE (Debug): rm_add_job() - Called
BRIDGE (Debug): rm_add_job() - Completed Successfully
BE_MPI (Debug): Job bglfen1.itso.ibm.com.16.0 was successfully added to the DB
BE_MPI (Debug): Quering the DB job ID
BE_MPI (Debug): DB job ID is 199
..snip..
Here, we can see the interaction between LoadLeveler and the Blue Gene/L<br />
database:<br />
► LL job: bglfen1.itso.ibm.com.16.0
► Partition ID: RMP24Mr104437181
► BGL job ID: 199
You can map the information in this list to the job information, as reported by the<br />
llq command on the Front-End Node, as shown in Example 2-33.<br />
Example 2-33 LoadLeveler queue information (llq command)<br />
test1@bglfen1:/bglscratch/test1/applications/IOR> llq<br />
Id                       Owner      Submitted   ST PRI Class        Running On
------------------------ ---------- ----------- -- --- ------------ ----------
bglfen1.12.0             test1      3/24 09:45  R  50  small        bglsn
bglfen1.13.0             test1      3/24 09:46  R  50  small        bglfen1
bglfen1.15.0             test1      3/24 09:46  R  50  small        bglfen2
bglfen1.16.0             test1      3/24 09:47  R  50  small        bglfen1
bglfen1.17.0             test1      3/24 09:48  I  50  small
Also, you can identify the job that is running (Example 2-33) in the log file<br />
~/loadl/log/StarterLog, from which we extracted the lines shown in Example 2-34.<br />
Example 2-34 Job log file<br />
..snip..
03/24 09:46:23 TI-0 bglfen1.13.0 Prolog not run, no program was specified.
03/24 09:46:23 TI-0 bglfen1.13.0 run_dir = /home/loadl/execute/bglfen1.itso.ibm.com.13.0
03/24 09:46:23 TI-0 bglfen1.13.0 Sending request for executable to Schedd
03/24 09:46:23 TI-0 bglfen1.13.0 User environment prolog not run, no program was specified.
03/24 09:46:23 TI-0 LoadLeveler: 2539-475 Cannot receive command from client bglfen1.itso.ibm.com, errno =2.
03/24 09:46:23 TI-0 bglfen1.13.0 llcheckpriv program exited, termsig = 0, coredump = 0, retcode = 0
03/24 09:46:23 TI-0 bglfen1.itso.ibm.com.13.0 Sending READY status to Startd
03/24 09:46:23 TI-0 bglfen1.13.0 Main task program started (pid=9064 process count=1).
03/24 09:46:23 TI-0 bglfen1.itso.ibm.com.13.0 Sending RUNNING status to Startd
03/24 09:46:24 TI-0 bglfen1.13.0 Blue Gene partition id RMP24Mr104230153.
03/24 09:46:24 TI-0 bglfen1.13.0 Blue Gene Job Name bglfen1.itso.ibm.com.13.0.
..snip..
Here, we have the following relevant information:
► LoadLeveler ID: bglfen1.13.0
► LoadLeveler long job name: bglfen1.itso.ibm.com.13.0
► Partition name: RMP24Mr104230153
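These identifiers can be pulled out of the StarterLog mechanically. The sed pattern below is only a sketch; here it runs against a sample line copied from Example 2-34, but on a live system you would feed it the output of grep "Blue Gene partition id" ~/loadl/log/StarterLog instead:

```shell
#!/bin/sh
# Extract the partition ID from a StarterLog line. The sample line is
# taken verbatim from the log excerpt in Example 2-34.
line='03/24 09:46:24 TI-0 bglfen1.13.0 Blue Gene partition id RMP24Mr104230153.'
partition=$(printf '%s\n' "$line" |
    sed -n 's/.*Blue Gene partition id \([A-Za-z0-9]*\)\..*/\1/p')
echo "$partition"
```

The same pattern, with "Blue Gene Job Name" in place of "Blue Gene partition id", recovers the long job name.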
To cross-check with the Blue Gene/L database, we can query the TBGLJOB and<br />
TBGLJOB_HISTORY tables. The TBGLJOB table shows the currently running<br />
jobs, and the TBGLJOB_HISTORY table shows the jobs that have finished.<br />
You can get all the job statistics from these tables, including the BGL job ID,<br />
using the DB2 statements shown in Example 2-35.<br />
Example 2-35 Querying the DB2 database for LoadLeveler jobs<br />
bglsn:/bglscratch/test1 # db2 "select jobid,jobname,blockid from
TBGLJOB_HISTORY where jobname = 'bglfen1.itso.ibm.com.15.0'"

JOBID       JOBNAME                                          BLOCKID
----------- ------------------------------------------------ ----------------
        198 bglfen1.itso.ibm.com.15.0                        RMP24Mr104232176

  1 record(s) selected.
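If you map jobs this way often, a small wrapper saves retyping. The helper below is a sketch (the function name bgl_job_query is our own invention): it only builds and prints the db2 command for a given LoadLeveler long job name, against TBGLJOB_HISTORY by default or TBGLJOB for running jobs, so you can review it before executing it on the SN:

```shell
#!/bin/sh
# Sketch: build the DB2 query shown in Example 2-35 for a given
# LoadLeveler long job name. Prints the command rather than running it;
# pipe the output to sh on the Service Node to execute it.
bgl_job_query() {
    jobname=$1
    table=${2:-TBGLJOB_HISTORY}   # pass TBGLJOB to query running jobs
    echo "db2 \"select jobid,jobname,blockid from $table where jobname = '$jobname'\""
}
bgl_job_query bglfen1.itso.ibm.com.15.0
```

For example, bgl_job_query bglfen1.itso.ibm.com.16.0 TBGLJOB prints the corresponding query against the running-jobs table.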
Chapter 3. Problem determination tools
In this chapter we introduce problem determination tools that are available on the<br />
Blue Gene/L system.<br />
We start with an overview of each tool and then provide more detail on each tool<br />
in individual sections. We include information about when to use the tool<br />
according to the problem determination methodology, and how to interpret the<br />
output of the tool.<br />
We discuss the following tools:<br />
► Hardware monitor<br />
► Web interface<br />
► Diagnostics<br />
► MMCS console<br />
In addition to these tools, Blue Gene/L provides operational logs, which are<br />
explained in 2.2.6, “Control system server logs” on page 61.<br />
© Copyright IBM Corp. 2006. All rights reserved. 113
3.1 Introduction<br />
This chapter explains how to monitor and analyze your system by using available<br />
problem determination tools. Among these tools, we focus on the hardware<br />
monitor, Web interface, diagnostic suite, and MMCS (Midplane Management<br />
Control System) console.<br />
Each section contains subsections that illustrate how to start the
tool and how to check the results.
3.2 Hardware monitor<br />
The hardware monitor is a tool that captures information about a
specified piece of hardware in the Blue Gene/L system. After you start the
hardware monitor on a running system, it keeps monitoring and storing the
environmental information.
3.2.1 Collectable information<br />
The hardware monitor gathers information about the following Blue Gene/L rack<br />
hardware:<br />
► Fans<br />
► Bulk power modules<br />
► Service cards<br />
► Node cards<br />
► Link cards<br />
For each of the devices in the previous list, multiple data points are available.
For example, monitoring the fans captures temperature, voltage, speed, and
status flags. See Table 3-1 for the complete list of the collectable data. Note that
the data for the fan modules and bulk power modules is collected by monitoring the
service cards.
The hardware monitor stores all of this information in the DB2 database. It is
accessible through the Environmental and Database sections of the Web
interface, or you can query the database directly using SQL commands.
Table 3-1 The complete list of the collectable data

Hardware component   Data collected
Service card         Temperature, voltage, status flags
Link card            Temperature, voltage, power, status flags
Node card            Temperature, voltage, status flags
Fan modules          Temperature, voltage, speed, status flags
Bulk power modules   Temperature, voltage, status flags

Hardware monitor and RAS events
Among the information stored in the DB2 database, the hardware monitor
generates a reliability, availability, and serviceability (RAS) event when it finds
status information outside the normal range. Those events can be examined by
issuing a query to the DB2 database, or they can be viewed using the Web
browser interface.
Specifying the MONITOR facility on the RAS Event Query page returns the events
generated by the hardware monitor. See 3.3, “Web interface” on page 119 for
details about the Web interface.
The hardware monitor also records error messages in its own log file. When it is
created, the file is stored in the directory /bgl/BlueLight/logs/BGL as
-monitorHW-.log. The hardware monitor writes error
messages when it cannot contact or recognize a piece of hardware (which might
have been removed, added, or replaced).
3.2.2 Starting the tool
Here, we discuss briefly how to start the hardware monitor. For more detailed<br />
information, see Blue Gene/L: System Administration, SG24-7178.<br />
You can start the hardware monitor manually from the command line, or you can<br />
start it automatically with bglmaster.<br />
Using the GUI<br />
If you have a graphical user interface (GUI) on the Service Node (SN), run the<br />
commands shown in Example 3-1 to start the hardware monitor.<br />
Example 3-1 Starting the hardware monitor<br />
cd /bgl/BlueLight/ppcfloor/bglsys/bin<br />
./startmon<br />
This opens a new window, which is shown in Figure 3-1.<br />
Figure 3-1 Initial screen of the hardware monitor<br />
To start monitoring on the hardware:<br />
1. Click Settings → Start Monitoring.<br />
2. In the Start Monitoring window, click the drop-down list box and select ALL<br />
LINK CARDS.<br />
3. Click Update.<br />
4. Repeat this procedure for all available cards.<br />
Using the console<br />
You can also start the hardware monitor without a GUI. Issuing the commands in<br />
Example 3-2 allows you to start monitoring from a console.<br />
Example 3-2 Start monitoring row 0 from a console<br />
$ cd /bgl/BlueLight/ppcfloor/bglsys/bin<br />
$ ./startmon --row 0 --autostart --nodisplay<br />
Example 3-2 launches the hardware monitor for row 0 of the Blue Gene/L
system, monitoring all the cards with the default interval time. The default
interval is 5 minutes for service cards and 30 minutes for the other cards.
Tip: You can use the --row and --autostart options for the GUI startup as
well. These options open the window with all the cards monitored at the
default interval time.
3.2.3 Checking the results<br />
There are two ways of checking the environmental data that is collected by the<br />
hardware monitor. One is by using the Web interface, and the other is by using<br />
the GUI window of the hardware monitor. Both methods provide the same<br />
information, because they are referring to the same DB2 tables.<br />
To check the collected data through the Web interface:<br />
1. Point a Web browser to:<br />
http:///web/index.php<br />
2. Click Environmental.<br />
3. Select the type of information from the top drop-down list box.
4. Specify the date range.
5. Specify a location, if needed.
6. Click Send.
This procedure returns a table that includes the data that was collected by the
hardware monitor in the specified time interval. Figure 3-2 shows an example of
querying the temperature of a service card.
Figure 3-2 Querying the environmental data<br />
Also, as previously mentioned, the hardware monitor generates a RAS event
under particular circumstances, such as when a hardware component cannot be
contacted. Figure 3-3 on page 119 illustrates one of the error messages that is
reported on the RAS event pages.
Note: It is a good idea to check both the environmental and RAS event pages,
because some information might come up in only one of these two pages.
3.3 Web interface<br />
Figure 3-3 RAS events generated by the hardware monitor<br />
The Web interface is one of the major tools that you can use for problem<br />
determination. It provides a number of ways to analyze your Blue Gene/L<br />
system. All the information is stored in the DB2 database that is running on the<br />
SN, and a Web browser interface gives an easy way of selecting the necessary<br />
information.<br />
The Web interface provides the following basic sections:<br />
► Configuration<br />
► Runtime<br />
► Environmental<br />
► RAS events<br />
► Diagnostics test results<br />
► Database browser<br />
3.3.1 Starting the tool<br />
Figure 3-4 shows the top page of the Web interface.<br />
Figure 3-4 Top page of the Web interface<br />
Tip: Whenever the Blue Gene logo is shown at the top, left corner of a Web<br />
page, clicking the logo brings your Web browser back to the top page.<br />
For the Web interface, there is no particular procedure that is required to start or<br />
to stop the tool. However, there are certain checks to make sure that the tool is<br />
working correctly:<br />
1. Make sure that the DB2 database is running because all the information<br />
accessed through the Web interface is stored in the DB2 database.<br />
If required, use the appropriate procedures to start the database, as shown in<br />
2.3.4, “Check that DB2 is working” on page 87.<br />
2. Check whether the Apache Web server, including the PHP module, is
correctly configured and running.
Make sure that the Apache Web server starts when the system restarts, that
name resolution works, and that the Apache directories exist as they are
defined in the configuration file. For details, see 2.3.3, “Check that BGWEB is
running” on page 86.
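With DB2 and Apache checked, a quick probe of the top page confirms the interface end to end. This is only a sketch: bglsn is the sample SN host name used in this book's examples, and the curl probe is shown as a comment because it needs network access to the SN:

```shell
#!/bin/sh
# Sketch: compose the Web interface URL for a given Service Node host,
# following the http://<host>/web/index.php form used in this chapter.
bgweb_url() {
    echo "http://$1/web/index.php"
}

# On a live system, you might then probe the page, for example:
#   curl -s -o /dev/null -w '%{http_code}\n' "$(bgweb_url bglsn)"
# An HTTP status of 200 indicates that Apache and PHP are serving the page.
bgweb_url bglsn
```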
3.3.2 Checking the result<br />
When the Web server and the DB2 are working correctly, the Web interface is<br />
accessible by pointing a Web browser to:<br />
http:///web/index.php<br />
This top page provides links to six basic categories of information with a short<br />
description.<br />
Configuration<br />
The Configuration page provides information about the current status of<br />
hardware. It includes four subcategories:<br />
► Hardware browser<br />
► Link summary<br />
► Service actions<br />
► Problem monitor
Hardware browser<br />
The hardware browser provides a Web page that gives you a simple and<br />
effective way to find which hardware is recognized and which is marked as<br />
missing.<br />
If a piece of hardware is in a missing state, then that particular hardware is not
available as a system resource. Hardware is marked missing, for example, when
a service action is performed (a maintenance operation) or when a piece of
hardware is in an error state.
Throughout the Web pages, each piece of hardware is represented by its name<br />
based on the naming convention (as listed in 1.2.11, “Rack, power module, card,<br />
and fan naming conventions” on page 19). Also, when the hardware browser<br />
finds a system resource in a missing state, it highlights the name of the missing<br />
system with a red box, as shown in Figure 3-5 on page 122.<br />
Note: If you know the name of the particular hardware in which you are<br />
interested, you can enter the name of that hardware in the text box titled Find<br />
Hardware, and then press Enter. This jumps to the detailed information page<br />
of that hardware.<br />
Figure 3-5 Missing hardware in a red box<br />
The hardware browser provides the exact location when you find a system
resource in a missing state and helps you investigate the cause of that state.
You can also use information that you obtain from the hardware browser with<br />
other tools to concentrate on a specific issue. For example, you can use the<br />
location of the hardware and search for RAS events that are related to that<br />
location.<br />
Link summary<br />
The link summary page gives an overall picture of how the Blue Gene/L system
is wired. It shows which midplanes are connected to each other in the X, Y, and
Z directions.
This page helps you to make sure that all the data cables are connected,<br />
detected, and configured correctly.<br />
Service actions<br />
The service actions page shows service actions in progress and the history of
completed service actions. A service action consists of two phases:
PrepareForService and EndServiceAction. Between these two actions, a
hardware service engineer can remove or replace defective or suspect
hardware.
An entry for a service action in progress means that PrepareForService has been
performed and an ID for the service has been assigned. To complete the
service, EndServiceAction must be performed for that ID.
Using service actions allows specified hardware to be turned off and on for
maintenance purposes.
Problem monitor
The problem monitor generates a list of hardware that has been detected with
some sort of problem that needs to be solved. It gives the location of the
hardware and a short description of the problem.
The location is presented as a link (see Figure 3-6) that helps you jump quickly
to the detailed page for the piece of hardware in question.
Link<br />
Figure 3-6 <strong>Problem</strong> monitor showing a midplane is in error state<br />
Runtime<br />
The Runtime page provides the current status of jobs and blocks. It includes four<br />
subcategories:<br />
► Block information<br />
► Job information<br />
► Midplane information<br />
► Utilization<br />
For our discussion, we focus on the block information and the job information.<br />
Block information<br />
The block information page provides a table of available blocks and their current<br />
status (Figure 3-7). The Status and Job Status columns contain the information<br />
that might need attention.<br />
Figure 3-7 List of blocks page and related information<br />
For example, if a block is in Booting status for a long time, there might be a
problem booting that block. Alternatively, if a job is in the Ready to start
status for a long time, this status might indicate that there are problems with
loading the job.
Tip: Clicking a column title returns a new table that is sorted by the selected
column.
For the complete list of the types of status for a block, see Table 4-3 on page 162,<br />
and for the complete list of the types of status for a job see Table 4-4 on<br />
page 162.<br />
Note: A block that is created by LoadLeveler acts differently. For a job<br />
submitted through LoadLeveler, if a predefined block is not specified, one<br />
block is created and allocated dynamically. The dynamically created block<br />
remains in the table until the LoadLeveler job ends. Then, it is released<br />
(freed), and information about this block is lost.<br />
Clicking a specific block ID reveals detailed information for that block. The<br />
Overlapping Blocks table provides useful information, especially if the users of<br />
the Blue Gene/L system are using mpirun to submit their jobs (see Figure 3-8).<br />
It is also helpful to check the Hardware used by this block table, especially after<br />
you have changed a configuration of the system or you have replaced an I/O<br />
card. The table tells whether a particular block is using the correct hardware.<br />
Figure 3-8 Overlapping blocks and hardware in use<br />
Job information<br />
Job information is provided in two main tables. One includes information for the
current jobs, while the other includes the history of submitted jobs. As noted
earlier, the status column is also important for the job information table. If a job is
in Error status, for example, it is a good idea to click the job ID to see the details of
the job. Checking the Error Text can help you to determine the problem.
In addition, the Show RAS events for this job link at the bottom of the page<br />
(Figure 3-9) jumps to the RAS event page that shows RAS events only related to<br />
the job in question.<br />
Figure 3-9 Detailed job information<br />
Environmental<br />
You can use the environmental page to view hardware information that is<br />
collected by the hardware monitor as described in 3.2, “Hardware monitor” on<br />
page 114. Refer to that section for more information.<br />
RAS events<br />
The RAS event page allows you to search all RAS events that are stored in the
DB2 database. Because the Blue Gene/L system generates a vast number of
events, the RAS event page provides an interface that accepts a number of
variables so that you can pick up only the desired information. Specifying a time
range, block, job ID, and location are typical examples (Figure 3-10 on page 127).
This RAS event page is one of the most frequently used tools for problem
determination, because it gives you an idea of when and where the problem
occurred. For example, the Entry Data suggests the cause of the problem.
Figure 3-10 Entry Data in the RAS event query result<br />
Another feature of the RAS event page is the RAS drill down, which is
accessible from the link at the bottom of the page. The drill down shows the
number of fatal or advisory RAS events that occurred in a specified period of
time for each location.
Clicking the link in front of the location name expands the table to show more<br />
information. Compute card, for example, provides a link to a table of RAS events<br />
that are specific to the compute card when you fully expand a row in the drill<br />
down table, as shown in Figure 3-11 on page 128.<br />
Figure 3-11 RAS drill down showing the number of errors and their location<br />
Diagnostics test results<br />
Diagnostic test results show the results of diagnostic tests in various ways. The
first page provides a brief test result for each block. If the Last Result column
shows a color other than green (Figure 3-12), you should look into the details by
clicking the block ID.
Figure 3-12 The top page for Diagnostics<br />
For detailed information about the diagnostic test, including how to run the<br />
diagnostics, see 3.4, “Diagnostics” on page 131.<br />
Details of the diagnostic test results are provided in the form of a table. Each row
in the table represents a single test case with its results. A number in the
result column shows the count of hardware that passed or failed the test. All of
the numbers are links, and clicking one of them brings you to a more detailed
test result that is focused on each piece of hardware (Figure 3-13).
Figure 3-13 Testcase details<br />
Each piece of hardware has its individual log files for every test case that is
performed in a diagnostic test. Those log files give you an idea of what each test
case is looking for. Moreover, if you look into a log file of hardware that is
reported as failed, you might find the reason for the failure marked in red, as
shown in Figure 3-14.
Figure 3-14 Testcase highlighting the cause of failure<br />
The diagnostic test results page for each block ID also provides two links at the<br />
bottom of the page:<br />
► Show all failed tests for <br />
► Full log for run on <br />
Although both links are self-explanatory, checking the red and blue lines from the
second link is useful. This second link provides a summary of the test at the end
and also provides information about the current status of the system. For
example, the diagnostics test can mark a midplane unavailable (or M for missing),
depending on the test result.
Database browser<br />
The database browser page provides the complete list of database tables and
views that are used in the Blue Gene/L system. All table and view names are
presented as links. Clicking one shows you its definition and the actual data
stored in the DB2 database.
Although the database browser page allows you to look into each table or view, it
is not always the best tool for exploring the content of the database. For example,
if you want the total number of currently active compute cards in the entire
system, there is a better way to obtain this information. Because the table and
view names listed in the database browser page are the names used in the
system, looking up those names and using DB2 commands directly can help on
some occasions (see Example 3-3).
Example 3-3 Sample DB2 command with one of the names from database browser<br />
$ db2 "select count(*) from TBGLPROCESSORCARD where status='A'"

1
-----------
         68

  1 record(s) selected.
3.4 Diagnostics<br />
The diagnostics suite performs a number of tests on the Blue Gene/L rack to<br />
determine system health. The test suite consists of two major sections:<br />
► Blue Gene/L Compute and IO nodes (BLC ASICs) for a midplane<br />
► Blue Gene/L Link chips (BLL ASICs) for a whole rack<br />
Because all data related to the Blue Gene/L system is stored in the DB2
database, the results of diagnostics are also stored in the database. These
results are accessible through the Diagnostics section of the Web interface. For
information about how to view these results, refer to “Diagnostics test results” on
page 128.
A system administrator can run diagnostics anytime they are necessary.
However, we recommend that you use diagnostics especially in the following
circumstances:
► When you encounter a problem and cannot isolate whether the problem is
caused by a software or hardware error, run diagnostics.
Because the diagnostic suite checks all of the hardware, it helps you to
isolate and determine what the problem is. In this case, the diagnostic suite
is used for error detection.
► When you replace hardware, run diagnostics immediately after the
replacement.
The main focus of running the tests is not only on the newly inserted
hardware but also on all the hardware in a rack. The diagnostic suite might
find a piece of hardware with a few correctable errors across multiple test
cases. This type of error might indicate a piece of hardware that requires
attention. In this case, the diagnostic suite is used as a precaution.
3.4.1 Test cases<br />
The diagnostics suite includes a number of test cases. Table 3-2 lists the test
cases for BLC, which covers the compute and I/O nodes. Table 3-3 on page 133
lists the test cases for BLL, which covers the link chips. The tables also include a
short description of each test case.
Table 3-2 List of available testcases for BLC<br />
Test case Description<br />
blc_powermodules Queries the status of each power module on a nodecard.<br />
blc_voltages Queries the 1.5V and 2.5V power rail on a nodecard.<br />
blc_temperatures Queries all temperature sensors on a nodecard.<br />
bs_trash_0 Generates random instructions and executes them on PPC
instruction unit 0.
bs_trash_1 Same as bs_trash_0 but using unit 1 instead of unit 0.<br />
dgemm160 Tests the floating point unit on the BLC ASIC.<br />
dgemm160e Extended diagnostics based on dgemm160 for testing the<br />
floating point unit.<br />
dgemm3200 Tests both the floating point unit and the memory subsystem<br />
on the BLC ASIC for a problem not found in the earlier tests.<br />
dgemm3200e Extended diagnostics based on dgemm3200 for testing the<br />
floating point unit and memory subsystem.<br />
dr_bitfail Writes to and reads from all DDR memory locations, and<br />
flags all failures. Does a simple memory test and prints a log<br />
description that attempts to identify the failing component.<br />
Identifies specific failing bits by ASIC and DRAM pin when<br />
possible.<br />
emac_dg Tests the ethernet function in loopback on the BLC ASIC.<br />
Specifically checks BLC ASICs, PHYs, and connection<br />
between the two.<br />
gi_single_chip Tests whether the global interrupt port is accessible and<br />
whether the global interrupt wires can be forced to 0 and 1<br />
using the local global interrupt loopback.<br />
gidcm Tests whether the communication through the global<br />
interrupt barrier network is fully functional. One of the<br />
compute nodes sends out signals in a specific pattern and the<br />
rest of the nodes receive the signals.<br />
132 IBM System Blue Gene Solution: <strong>Problem</strong> <strong>Determination</strong> <strong>Guide</strong>
linpack Runs single node linpack. Aims to examine the hardware<br />
health using a program that an ordinary user would submit.<br />
mem_l2_coherency Exercises all paths from L2 to connected memory and IO<br />
modules for testing the BLC on-chip cache function.<br />
ms_gen_short Runs memory pattern tests; a complete memory test<br />
that checks the BLC ASIC function and the external<br />
DRAM.<br />
power_module_stress Drives the compute cards to use as much power as<br />
possible to see whether the power modules can cope<br />
with the power surge.<br />
ti_abist SRAM ABIST (SRAM Array Built-In Self Test) tests all<br />
on-chip SRAM arrays.<br />
ti_edramabist EDRAM ABIST (EDRAM Array Built-In Self Test) tests all<br />
on-chip Embedded DRAM arrays.<br />
ti_lbist LBIST (Logic Built-In Self Test) tests the on-chip logic for<br />
operation at frequency, using random patterns.<br />
tr_connectivity Checks basic connectivity of the collective network.<br />
tr_loopback Tests the on-chip functionality of the collective unit on the<br />
BLC ASIC.<br />
tr_multinode Thorough check of the collective network.<br />
ts_connectivity Checks basic connectivity of the torus network.<br />
ts_loopback_0 Tests the on-chip functionality of the torus unit on BLC ASIC<br />
0.<br />
ts_loopback_1 Same as ts_loopback_0 but using unit 1 instead of unit 0.<br />
ts_multinode Thorough check of the torus network.<br />
Table 3-3 List of available testcases for BLL<br />
Test case Description<br />
bll_lbist LBIST (Logic Built-In Self Test) tests the on-chip logic<br />
for operation at frequency, using random patterns.<br />
bll_lbist_linkreset Resets or re-initializes the link chips.<br />
bll_lbist_pgood Resets the link chips to a simple state after the<br />
system is turned on.<br />
bll_powermodules Queries the status of each power module on a linkcard.<br />
bll_temperatures Queries all temperature sensors on a linkcard.<br />
bll_voltages Queries all 1.5V and 2.5V power rails on a linkcard.<br />
3.4.2 Starting the tool<br />
There are several ways to run the diagnostic suite on the Blue Gene/L system.<br />
Among those, we recommend that you use the following scripts:<br />
/bgl/BlueLight/ppcfloor/bglsys/bin/submit_rack_diags.ksh<br />
/bgl/BlueLight/ppcfloor/bglsys/bin/submit_midplane_diags.ksh<br />
Both of these scripts require at least one argument, which is a midplane or a rack<br />
identifier (see Example 3-4). If none is specified, the script uses R00 as a default<br />
rack identifier and R000 as a default midplane identifier.<br />
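The defaulting behavior can be sketched with a standard shell idiom. This is an illustration only; the function names below are invented and are not the shipped scripts' actual source:<br />

```shell
# Hypothetical sketch of the identifier defaulting described above.
# pick_rack/pick_midplane are illustrative names, not part of the product.
pick_rack()     { printf '%s\n' "${1:-R00}";  }   # rack scripts default to R00
pick_midplane() { printf '%s\n' "${1:-R000}"; }   # midplane scripts default to R000

pick_rack R01        # prints R01
pick_rack            # prints R00 (the default)
pick_midplane R010   # prints R010
pick_midplane        # prints R000 (the default)
```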
Example 3-4 Specifying an identifier for the scripts<br />
$ /bgl/BlueLight/ppcfloor/bglsys/bin/submit_rack_diags.ksh R01<br />
$ /bgl/BlueLight/ppcfloor/bglsys/bin/submit_midplane_diags.ksh R010<br />
The scripts also accept optional arguments. For a list of acceptable arguments,<br />
issue the command ./rundiag -h from the<br />
/bgl/BlueLight/ppcfloor/bgldiags/ directory, or for detailed instructions (the manual<br />
page), issue ./rundiag -help from the same directory.<br />
You can also run a subset of tests from the diagnostic suite by choosing from a<br />
menu which test to run in interactive mode, as shown in Example 3-5.<br />
Example 3-5 Invoking the diagnostic suite menu<br />
$ cd /bgl/BlueLight/ppcfloor/bgldiags/common<br />
$ ./rundiag -host localhost -block &lt;blockID&gt;<br />
--- Blue Gene/L System Diagnostic Console ---<br />
1 Power, Packaging & Cooling tests<br />
2 Link chip diagnostic<br />
3 Compute & IO node BIST engine tests<br />
4 Compute & IO single node bootstrap tests<br />
5 Compute & IO single node kernel tests<br />
6 Compute & IO node global interrupt tests<br />
7 Compute & IO multi-node tests<br />
8 IO node only tests<br />
9 Compute & IO node BLADE exercisers (e.g. dgemm3200e)<br />
10 Compute & IO node power modules stress test<br />
11 Compute & IO node BLRTS tests<br />
19 Enter remarks or notes for this run (accepts also: ‘remarks’ or<br />
‘r’)<br />
20 Print the current remarks or notes for this run (accepts also:<br />
‘printremarks’ or ‘p’)<br />
exit exit (short: e,q)<br />
Note: Running a diagnostics suite using the provided script creates its own<br />
block to run the tests. The block will include all the available I/O nodes that are<br />
installed in the system. For certain configurations, for example when one I/O<br />
card is installed per node card but only one ethernet cable is connected to the<br />
node card, the diagnostic tests might cause a problem for this block.<br />
Because an I/O card supports two ethernet cables and the diagnostic block<br />
expects to have two ethernet connections, the block might fail to boot with a<br />
no ethernet link error message. Thus, depending on your configuration, some<br />
test cases might be unable to execute.<br />
If you have such a configuration, use the rundiag command with the -block<br />
option instead of the script. Make sure that you specify a bootable (or<br />
valid) block for the option.<br />
Tip: You can run testcases interactively from the menu, or you can add the<br />
-batch option to run test cases in batch mode.<br />
For more detailed information, refer to the diagnostic documentation that comes<br />
with each driver release, which is located in /bgl/BlueLight/ppcfloor/bgldiags/doc.<br />
3.4.3 Checking the results<br />
The results of the diagnostic suite are stored in the DB2 database and in log files.<br />
You can use the Web interface for easy access to the content that is stored in the<br />
database. (See “Diagnostics test results” on page 128 for more information.)<br />
You can also check the result in the log files. The log files are stored in the<br />
/bgl/BlueLight/logs/diags/ directory (on the SN). Each time the diagnostic suite<br />
runs, it creates a directory with the start time, followed by the block ID and the<br />
midplane or rack identifier as the directory name. Inside this directory, the<br />
diagnostics store various log files, scripts, and a directory for each individual test<br />
case.<br />
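The naming convention can be sketched as follows. The timestamp format, block name, and demo path below are assumptions for illustration, not the exact scheme used by the diagnostics:<br />

```shell
# Hypothetical reconstruction of the per-run log directory name: start time,
# followed by the block ID and the midplane or rack identifier. All values
# here are invented; /tmp/diags-demo stands in for /bgl/BlueLight/logs/diags/.
block_id="DIAGS_BLOCK"
rack="R01"
logroot="/tmp/diags-demo"
logdir="$(date +%Y-%m-%d_%H.%M.%S)_${block_id}_${rack}"
mkdir -p "$logroot/$logdir"
ls "$logroot"
```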
While the Web interface provides a log file for each node in each test case, it is<br />
still worth looking at the files in the diags directory. Here is a typical example of<br />
such a case.<br />
We assume that one of the test cases has failed, and a large number of nodes<br />
have reported errors. You might first want to look into the individual error logs<br />
through the diagnostics test results page. Although it depends on the problem<br />
that you encounter, often some nodes return a different error message from<br />
others. In such a case, instead of clicking each link for a log file on the Web<br />
interface, sometimes it is more efficient to look into the files under the logs<br />
directory. Some files might be larger than others, which indicates that a<br />
particular node produced a larger number of error messages.<br />
Because those log files are plain text, if you are looking for a particular message,<br />
you can use tools such as awk, sed, and grep to filter messages.<br />
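For example, given a handful of per-node log files (the message format below is invented for illustration; real diagnostics messages differ), standard tools quickly narrow down the failing nodes:<br />

```shell
# Create sample per-node diagnostics logs (contents invented for illustration).
mkdir -p /tmp/diaglogs
printf 'Apr 06 18:59:41 (E) {119}.0: DDR bit failure\n' >  /tmp/diaglogs/node119.log
printf 'Apr 06 18:59:41 (I) {102}.0: test passed\n'     >  /tmp/diaglogs/node102.log
printf 'Apr 06 18:59:42 (E) {102}.0: parity error\n'    >> /tmp/diaglogs/node102.log
printf 'Apr 06 18:59:41 (I) {17}.0: test passed\n'      >  /tmp/diaglogs/node17.log

# Which files contain error-severity (E) lines?
grep -l '(E)' /tmp/diaglogs/*.log

# Largest (often chattiest, and therefore suspect) logs first:
ls -S /tmp/diaglogs

# Error count per file:
awk '/\(E\)/ { n[FILENAME]++ } END { for (f in n) print f, n[f] }' /tmp/diaglogs/*.log
```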
3.5 MMCS console<br />
3.5.1 Starting the tool<br />
The Midplane Management Control System (MMCS) console is a tool that<br />
provides various commands to control and maintain blocks and jobs<br />
(MMCS_DB). Although this tool is not designed for problem determination, it is<br />
an important and useful tool to obtain the current status of the system.<br />
To launch the MMCS console, issue the following commands on the SN:<br />
$ cd /bgl/BlueLight/ppcfloor/bglsys/bin<br />
$ ./mmcs_db_console<br />
These commands open a shell that has a prompt which starts with mmcs$. If<br />
launching the mmcs_db_console does not provide the mmcs$ prompt and exits with<br />
an error message, check the following:<br />
► mmcs_db_server, idoproxydb, and ciodb must be running on the SN.<br />
► A path to the DB2 client libraries must be included in your PATH environment<br />
variable. Running the following command updates the PATH variable:<br />
$ . ~bgdb2cli/sqllib/db2profile<br />
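The leading dot is significant: the profile must be sourced into the current shell, not run as a child process, for the PATH change to persist. A minimal illustration of the mechanism follows; the profile contents and paths are invented, and the real db2profile does considerably more:<br />

```shell
# Write a toy profile script, then source it with '.' so that its PATH
# change takes effect in the current shell. All paths here are invented.
cat > /tmp/demo_db2profile <<'EOF'
DB2DIR=/tmp/fake-sqllib
PATH="$DB2DIR/bin:$PATH"
export PATH
EOF

. /tmp/demo_db2profile     # leading dot, as in ". ~bgdb2cli/sqllib/db2profile"
echo ":$PATH:" | grep -q ':/tmp/fake-sqllib/bin:' && echo "PATH updated"
```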
3.5.2 Checking the results<br />
Most of the commands return the results to the console, just like an ordinary<br />
shell. Some of the commands that affect the system status might update the DB2<br />
database as well.<br />
To list all the available commands on mmcs_db_console, type help at the mmcs$<br />
prompt. This command returns a list of commands and their syntax. If you know<br />
the command that you are looking for, then help followed by the command name<br />
shows the syntax and a short description of that command.<br />
Among the number of useful commands that are provided by the MMCS console,<br />
we focus on the ones that are particularly helpful for problem determination:<br />
Interacting with nodes<br />
Generally, opening an interactive login shell on the I/O node is not desired<br />
(because of the amount of memory required to open a shell). However, MMCS<br />
console provides a tool called write_con, which is useful if you have to<br />
look into an I/O node. The write_con utility allows you to submit a command to<br />
run on the I/O node. By using the target option, nodes and IDo chips can be<br />
specified.<br />
Example 3-6 shows how to send the hostname command to the system on all I/O<br />
nodes in the block. The same example also demonstrates the usage of the target<br />
option. The {i} option specifies all I/O nodes of the selected block. Each node is<br />
represented by the numbers in curly braces. In the example, the first I/O node<br />
returned its host name, ionode4, and it was recognized as node 119 in the<br />
system.<br />
Example 3-6 Example of using write_con and target option<br />
mmcs$ redirect R000_128 on<br />
OK<br />
mmcs$ {i} write_con hostname<br />
OK<br />
mmcs$ Apr 06 18:59:41 (I) [1079031008] {119}.0: h<br />
Apr 06 18:59:41 (I) [1079031008] {119}.0: ostname<br />
ionode4<br />
$<br />
Apr 06 18:59:41 (I) [1079031008] {102}.0: h<br />
Apr 06 18:59:41 (I) [1079031008] {102}.0: ostname<br />
ionode3<br />
$<br />
Apr 06 18:59:41 (I) [1079031008] {17}.0: h<br />
Apr 06 18:59:41 (I) [1079031008] {17}.0: ostname<br />
ionode2<br />
The use of the target option also allows you to specify one particular I/O node.<br />
For example, to send the hostname command to the I/O node, which is<br />
recognized as #17, use {17} instead of {i}, as shown in Example 3-7.<br />
Example 3-7 Targeting a specific node<br />
mmcs$ redirect R000_128 on<br />
OK<br />
mmcs$ {17} write_con hostname<br />
OK<br />
mmcs$ Apr 06 19:33:34 (I) [1079031008] {17}.0: h<br />
Apr 06 19:33:34 (I) [1079031008] {17}.0: ostname<br />
ionode2<br />
Tip: If you need to know the physical location of the given node number, the<br />
locate command provides a list for the selected block.<br />
The redirect command, also used in Example 3-7, enables the output of the<br />
sent command to be displayed in the MMCS console. The output is also<br />
recorded in a log file for the specified node. The log file is located in the<br />
/bgl/BlueLight/logs/BGL directory.<br />
Booting a block and submitting a job<br />
You can use the MMCS console to check whether a block boots successfully and<br />
a job runs correctly. Example 3-8 illustrates the booting of a block and the<br />
submitting of a simple job, which is a good check when isolating a problem.<br />
Example 3-8 A sequence of submitting a job<br />
$ cd /bgl/BlueLight/ppcfloor/bglsys/bin<br />
$ ./mmcs_db_console<br />
connecting to mmcs_server<br />
connected to mmcs_server<br />
connected to DB2<br />
mmcs$ list_blocks<br />
OK<br />
mmcs$ allocate R000_128<br />
OK<br />
mmcs$ list_blocks<br />
OK<br />
R000_128 root(1) connected<br />
mmcs$ submit_job /bgl/hello/hello.rts /bgl/hello<br />
OK<br />
jobID=285<br />
mmcs$ list_jobs<br />
OK<br />
mmcs$ free R000_128<br />
OK<br />
mmcs$ quit<br />
After issuing the list_jobs command, do not forget to check the output of the<br />
program and confirm that it ran as expected.<br />
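A quick way to confirm the output, assuming the job's stdout was captured to a file (the path and contents below are invented for illustration), is to count the expected greeting lines against the process count:<br />

```shell
# Stand-in for a captured job output file; a real run would produce one
# "Hello world!" line per process in the partition.
printf 'Hello world! from processor 0 out of 2\nHello world! from processor 1 out of 2\n' \
    > /tmp/job285.stdout

# The count of greeting lines should match the number of processes.
grep -c '^Hello world!' /tmp/job285.stdout
```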
Chapter 4. Running jobs<br />
This chapter describes the parallel programming, compilers, and job submission<br />
environment on the Blue Gene/L system. Also, we briefly discuss the Message<br />
Passing Interface (MPI), which is the foundation for large scale scientific and<br />
engineering applications. The topics covered are:<br />
► Parallel programming environment<br />
► Compilers<br />
► Job submission mechanisms (how they plug into the Blue Gene/L<br />
environment)<br />
– Submit job (submit_job command) from Midplane Management Control<br />
System (MMCS)<br />
– mpirun command (a stand-alone program for submitting jobs)<br />
– LoadLeveler, which is the IBM job management for AIX and Linux (batch<br />
queuing system)<br />
4.1 Parallel programming environment<br />
The Message Passing Interface (MPI) is a parallel programming environment<br />
that has been ported and modified to suit the Blue Gene/L system. In Blue<br />
Gene/L terminology, each partition or block is a set of compute nodes and I/O<br />
nodes. The compute node kernel (CNK) is a very lightweight kernel that was<br />
developed specifically for the Blue Gene/L system and supports a very limited<br />
set of system calls (about 35% of the Linux kernel system calls).<br />
All of the I/O calls (such as open, read, write, and so forth) for the compute nodes<br />
are shipped to the I/O nodes, which perform the actual operation. Also, the<br />
compute nodes cannot be contacted directly from the external network; they can<br />
be reached only through the I/O nodes. At any instance, only a single user can<br />
allocate and use the partition, in the sense that no context switching is possible<br />
on the compute nodes.<br />
Each compute node is seen as a single process or thread for any application that<br />
is executing code on the system. Multiple compute nodes form the virtual<br />
communication network when any application executes. This behavior is not<br />
specific to the Blue Gene/L system, because parallel applications form this<br />
virtual set of processes wherever they execute. The MPI_COMM_WORLD<br />
argument that is specified in each MPI library call binds all of these virtual sets of<br />
processes under one group.<br />
There are three basic communication modes in MPI:<br />
► Point-to-point communication (for example: MPI_Send/Recv and so forth)<br />
► Collective communication (for example: MPI_Scatter, MPI_Barrier and so<br />
forth)<br />
► Collective communication and computation (for example: MPI_Reduce and so<br />
forth)<br />
Note: Certain non-blocking communication calls (for example,<br />
MPI_Isend/MPI_Irecv) exist in the MPI library but are out of the scope of our<br />
discussion here. These are simply point-to-point communication calls.<br />
Let us briefly look at each of the modes:<br />
► Point-to-point communication mode describes the communication across<br />
processes, which in our case takes place across compute nodes (one<br />
compute node runs one process at a point in time). In order to achieve faster<br />
communication among the compute nodes, a three dimensional torus (3D<br />
Torus) network was designed on the Blue Gene/L system. Each compute<br />
node is connected to six neighbor compute nodes using the Torus network,<br />
as shown in Figure 1-25 on page 26.<br />
142 IBM System Blue Gene Solution: <strong>Problem</strong> <strong>Determination</strong> <strong>Guide</strong>
► With collective communication mode, the communication for I/O calls and<br />
some MPI calls passes through the collective network (see Figure 1-26<br />
on page 27). However, one of the MPI calls, MPI_Barrier, uses a different<br />
network (see Figure 1-27 on page 29) that is implemented into the Blue<br />
Gene/L system. This is because this call specifically requires every process to<br />
come to a synchronized state before processing can continue. In summary, in<br />
order to achieve high bandwidth and low latency, MPI is designed on the Blue<br />
Gene/L system to take advantage of the underlying network topology.<br />
► Collective communication and computation mode is similar in nature to the<br />
collective communication mode. However, it depends on the implementation<br />
details of MPI, which is out of our scope of discussion here.<br />
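The torus wraparound behind the point-to-point mode can be sketched with modular arithmetic. The 8x8x8 dimensions below are arbitrary, chosen only for illustration:<br />

```shell
# Each node (x,y,z) in an X*Y*Z torus has six neighbors: one step in each
# direction along each dimension, wrapping around at the edges.
X=8; Y=8; Z=8
torus_neighbors() {   # usage: torus_neighbors x y z
  x=$1; y=$2; z=$3
  echo "$(( (x + 1) % X )) $y $z"
  echo "$(( (x - 1 + X) % X )) $y $z"
  echo "$x $(( (y + 1) % Y )) $z"
  echo "$x $(( (y - 1 + Y) % Y )) $z"
  echo "$x $y $(( (z + 1) % Z ))"
  echo "$x $y $(( (z - 1 + Z) % Z ))"
}

torus_neighbors 0 0 7   # wraparound: neighbors include "7 0 7" and "0 0 0"
```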
Example 4-1 shows the sample Hello world! program C code, which we use<br />
throughout different sections in this chapter while doing compilation and job<br />
submission on the system.<br />
Example 4-1 Sample “Hello world!” program in C<br />
#include "mpi.h"<br />
#include &lt;stdio.h&gt;<br />
int main(int argc, char *argv[])<br />
{<br />
int numprocs; /* Number of processors */<br />
int MyRank; /* Processor number */<br />
/* Initialize MPI */<br />
MPI_Init(&argc, &argv);<br />
/* Find this processor number */<br />
MPI_Comm_rank(MPI_COMM_WORLD, &MyRank);<br />
/* Find the number of processors */<br />
MPI_Comm_size(MPI_COMM_WORLD, &numprocs);<br />
printf ("Hello world! from processor %d out of %d\n", MyRank,<br />
numprocs);<br />
/* Shut down MPI */<br />
MPI_Finalize();<br />
printf ("I am the Root process\n");<br />
return 0;<br />
}<br />
The following steps explain the program shown in Example 4-1 for users who do<br />
not have experience in MPI.<br />
1. The MPI environment is initialized using the MPI_Init call, by which all the<br />
processes recognize each other and start simultaneously.<br />
2. MPI_Comm_rank gives the unique identity of each process in the total<br />
MPI_COMM_WORLD (set of processes) and MPI_Comm_size gives the information<br />
about how many processes are in the MPI_COMM_WORLD. MPI_COMM_WORLD is a<br />
dynamic parameter that is initialized and updated during the application<br />
execution. We explain this in 4.3.2, “The mpirun program” on page 150.<br />
3. Each process prints the Hello world! message along with its identification<br />
and the total number of processes for the current execution.<br />
4. The MPI_Finalize call terminates the MPI environment and all processes<br />
except the main process that executes this program (the root process).<br />
Note: The root process (rank zero, also called the master process or thread) for<br />
an MPI application should not be confused with the operating system root user ID.<br />
5. The root process executes the printf() function (a C programming call), and<br />
the program returns.<br />
Note: For details about the MPI programming standard, refer to:<br />
http://www.mcs.anl.gov/mpi<br />
4.2 Compilers<br />
The compiler is one of the major software packages that needs to be discussed<br />
before we move on to the execution environment. Depending on your site and the<br />
size of your Blue Gene/L system, you can install a number of FENs to balance<br />
the load of the user community. (Refer to Unfolding the IBM eServer Blue Gene<br />
Solution, SG24-6686 for an overview of the required software on the Front-End<br />
Nodes (FENs) and Service Node (SN).)<br />
To use the XLC/XLF compilers on the FENs (Linux PPC64 platform), you will<br />
need to get a formal license agreement. Because Blue Gene/L has a different<br />
processor architecture (PowerPC 440) and there is no way to compile a job on<br />
the compute node, applications need to be cross-compiled. The XLC/XLF<br />
compilers do require additional add-on RPMs for the Blue Gene/L system in<br />
order to compile user applications. On the SN, the Blue Gene/L control system<br />
RPMs are installed under the /bgl directory (shared file system across FENs and<br />
I/O nodes). The CNK supports the applications that are compiled either with the<br />
blrts-gnu-gcc/g77 or by the blrts_xlc/xlf compilers.<br />
The cross-compiler environment can be summarized by indicating the following<br />
required components:<br />
► Front-End Node running SuSE SLES 9 on a PPC64 (POWER4 and<br />
POWER5)<br />
► PowerPC-Linux-GNU to generate PowerPC-blrts-GNU<br />
► GNU tool chain for Blue Gene/L<br />
► IBM XL cross compilers for Blue Gene/L<br />
Currently, to build binaries (executables) for Blue Gene/L, the IBM XL compilers<br />
require the following:<br />
► Installation of IBM XLC V7.0/XLF V9.1 compilers for SuSE SLES9<br />
Linux/PPC64<br />
► Installation of the Blue Gene/L add-on that includes Blue Gene/L versions of<br />
the XL run-time libraries, compiler scripts, and configuration files.<br />
– The GNU Blue Gene/L tool chain:<br />
gcc, g++, and g77 v3.2<br />
binutils (as, ld, and so forth) v2.13<br />
GLIBC v2.2.5<br />
– Blue Gene/L support is supplied through patches. You apply the patches<br />
and build the tool chain; IBM supplies scripts to download, patch, and build<br />
everything.<br />
Note: For further reference on the compiler refer to Unfolding the IBM eServer<br />
Blue Gene Solution, SG24-6686.<br />
4.2.1 The blrts tool chain<br />
The blrts tool chain is the GNU tool chain, but it has been adapted so that it<br />
generates code and programs that run in the Blue Gene/L environment. The<br />
following packages (Open source and GPL license) are required for building a<br />
tool chain:<br />
► binutils-2.13<br />
► gcc-3.2<br />
► gdb-5.3<br />
► glibc-2.2.5<br />
► glibc-linuxthreads-2.2.5<br />
Because these RPMs can be downloaded from open source community Web<br />
sites, IBM provides the patches and a script for building the tool chain. The script,<br />
/bgl/BlueLight/ppcfloor/toolchain/buildBlrtsToolChain.sh, applies the<br />
patches and builds the blrts-gnu directory, where the compilers<br />
(powerpc-bgl-blrts-gnu-gcc/g77), debuggers (gdb), and so forth are installed.<br />
The default cross compiler used for generating Blue Gene/L code is<br />
powerpc-bgl-blrts-gnu-gcc.<br />
Note: Refer to the readme file for the tool chain, which is available on the<br />
customer site, when downloading the updated drivers.<br />
Compilation process using blrts-gnu-gcc/g77<br />
You compile and link user applications on the FENs only. In Example 4-2, we<br />
compiled the sample Hello world! program using the blrts-gnu-gcc compiler.<br />
This is a parallel program that requires MPI libraries and header files that are<br />
included while generating the executable code.<br />
Example 4-2 Compiling the “Hello world!” program using blrts-gnu-gcc<br />
test1@bglfen1:~/Examples/codes> ls<br />
hello-world.c<br />
test1@bglfen1:~/Examples/codes><br />
/bgl/BlueLight/ppcfloor/blrts-gnu/bin/powerpc-bgl-blrts-gnu-gcc -o<br />
hello-world.rts hello-world.c -I/bgl/BlueLight/ppcfloor/bglsys/include<br />
-L/bgl/BlueLight/ppcfloor/bglsys/lib -lmpich.rts -lrts.rts<br />
-lmsglayer.rts -ldevices.rts<br />
test1@bglfen1:~/Examples/codes> ls<br />
hello-world.c hello-world.rts<br />
test1@bglfen1:~/Examples/codes><br />
Tip: To simplify the syntax of the compile command line, you can supply the<br />
compiler options, the include files, and libraries from a makefile.<br />
In this example, we compiled the hello-world.c program using the<br />
powerpc-bgl-blrts-gnu-gcc compiler and linked it against the MPI library<br />
(libmpich.rts), the runtime library (librts.rts), and the messaging layer and<br />
devices layer libraries (libmsglayer.rts, libdevices.rts) that are specific to<br />
the Blue Gene/L environment.<br />
Note: The include directory path is required for compiling MPI programs<br />
because it contains the mpi.h file (plus additional header files).<br />
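Following the tip above, one way to package the long command line from Example 4-2 is a small wrapper script (a makefile works equally well). This is a sketch only: the wrapper name is invented, and CC is made overridable so the plumbing can be tried without the cross-compiler installed:<br />

```shell
# Hypothetical wrapper around the cross-compile line in Example 4-2. The
# compiler path and flags are the ones shown in that example.
cat > /tmp/blrts-cc <<'EOF'
#!/bin/sh
: "${CC:=/bgl/BlueLight/ppcfloor/blrts-gnu/bin/powerpc-bgl-blrts-gnu-gcc}"
exec $CC "$@" \
    -I/bgl/BlueLight/ppcfloor/bglsys/include \
    -L/bgl/BlueLight/ppcfloor/bglsys/lib \
    -lmpich.rts -lrts.rts -lmsglayer.rts -ldevices.rts
EOF
chmod +x /tmp/blrts-cc

# The compile line then shrinks to: /tmp/blrts-cc -o hello-world.rts hello-world.c
# (CC=echo substitutes echo for the compiler, just to print the expanded line)
CC=echo /tmp/blrts-cc -o hello-world.rts hello-world.c
```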
Table 4-1 includes a list of libraries that are required to compile and link<br />
applications.<br />
Table 4-1 Libraries and their associated RPMs in the Blue Gene/L driver<br />
Libraries RPMs (under which they are present)<br />
libmpich.rts.a bglmpi-2006.1.2-1<br />
libmsglayer.rts bglmpi-2006.1.2-1<br />
librts.rts.a bglcnk-2006.1.2-1<br />
libdevices.rts.a bglcnk-2006.1.2-1<br />
You can also compile and link parallel applications using the mpicc command, as<br />
shown in Example 4-3. Because MPI is already installed as a separate package<br />
(RPM), installing the Blue Gene/L driver adds the following scripts:<br />
► mpicc (C compiler)<br />
► mpicxx (C++ compiler)<br />
► mpif77 (FORTRAN compiler)<br />
Example 4-3 Compiling the “Hello world!” program using the mpicc command<br />
test1@bglfen1:~/Examples/codes> ls<br />
hello-world.c<br />
test1@bglfen1:~/Examples/codes> mpicc -o hello-world.rts hello-world.c<br />
test1@bglfen1:~/Examples/codes> ls<br />
hello-world.c hello-world.rts<br />
test1@bglfen1:~/Examples/codes> file hello-world.rts<br />
hello-world.rts: ELF 32-bit MSB executable, PowerPC or cisco 4500,<br />
version 1 (embedded), statically linked, not stripped<br />
If the compile process is successful, you should move the executable code into<br />
the /bgl or /bglscratch file systems (depending on which file systems are shared<br />
across the SN, FENs, and I/O nodes) in order to make the binary available for job<br />
execution. For further information about mpicc refer to the<br />
/bgl/BlueLight/ppcfloor/bglsys/bin/mpicc script.<br />
Note: Versions of the previously mentioned RPMs vary depending on the Blue<br />
Gene/L driver release.<br />
4.2.2 The IBM XLC/XLF compilers<br />
This section discusses the add-ons and extra RPMs that are essential for<br />
compiling applications using the IBM XLC/XLF compilers. The basic XLC/XLF<br />
compilers on SLES9 are installed on the FENs as a set of three RPMs for each of<br />
the C and Fortran compilers (with two of the RPMs common across both).<br />
Additional RPMs are required that wrap around the generally available XLC/XLF<br />
compilers on the FEN systems, because these wrappers are tuned especially for<br />
the Blue Gene/L version of the PowerPC 440.<br />
While compiling applications using these compilers, applications can take<br />
advantage of the underlying processor architecture and, in turn, yield good timing<br />
and performance results. Also, depending on the application, you can experiment<br />
with optimization flags for best results.<br />
The XL compilers require the blrts tool chain to function. Some of the blrts<br />
tool chain features are:<br />
► The assembler and linker from the blrts tool chain are used to create<br />
programs that run in the Blue Gene/L environment<br />
► Runtime libraries are built with the blrts tool chain<br />
► Runtime routines from the blrts tool chain (glibc) are used for applications<br />
► Maintain binary compatibility with the gcc compiler that is in the blrts tool<br />
chain (that is, you can link .o files from the XL compilers and the gcc<br />
compilers and they should run)<br />
► Support many of the other tools in the blrts tool chain (that is, gdb, gmon,<br />
and so forth)<br />
There are four XLC/XLF RPMs that you must install on the FENs for compiling<br />
the applications on the system:<br />
► bgl-vacpp-7.0.0-5.ppc64.rpm (C/C++ compiler for Blue Gene/L)<br />
► bgl-xlf-9.1.0-5.ppc64.rpm (Fortran compiler for Blue Gene/L)<br />
► bgl-xlmass.lib-4.3.0-5.ppc64.rpm (MASS mathematical library for Blue<br />
Gene/L)<br />
► bgl-xlsmp.lib-1.5.0-5.ppc64.rpm (dummy SMP library in case the -qsmp<br />
option has been used - for pre-compiled code)<br />
You can download these from the Blue Gene/L customer site.<br />
Tip: The xlmass and xlsmp options are common across C & Fortran compilers.<br />
Because XLC/XLF compilers support the -qsmp option, the dummy SMP<br />
library is provided to avoid any errors in programs that require the use of this<br />
flag. These versions can be updated at a later stage. Refer to the IBM<br />
compiler download site for the latest information about the releases.<br />
Compiling using blrts_xlc/xlf<br />
With the IBM blrts_xlc/xlf compilers, user applications are compiled on the FEN<br />
nodes. Then, the executable files are copied to the shared file system (NFS<br />
mounted /bgl or /bglscratch) that is mounted on the I/O nodes. After the<br />
executable files are available there, the process follows the same compilation<br />
steps as described in<br />
“Compilation process using blrts-gnu-gcc/g77” on page 146 and as illustrated in<br />
Example 4-4. The advantage of using the blrts_xlc/xlf compiler over the GNU<br />
compiler is that the IBM compilers provide numerous additional flags that can<br />
optimize and tune large scale applications on the Blue Gene/L system.<br />
Example 4-4 Compiling the “Hello World!” program using the blrts_xlc compiler<br />
test1@bglfen1:~/Examples/codes> ls<br />
hello-world.c<br />
test1@bglfen1:~/Examples/codes> blrts_xlc -o hello-world.rts<br />
hello-world.c -I/bgl/BlueLight/ppcfloor/bglsys/include<br />
-L/bgl/BlueLight/ppcfloor/bglsys/lib -lmpich.rts -lrts.rts<br />
-lmsglayer.rts -ldevices.rts<br />
test1@bglfen1:~/Examples/codes> ls<br />
hello-world.c hello-world.rts<br />
4.3 Submitting jobs using built-in tools<br />
You can submit a job on a Blue Gene/L system in three ways:<br />
► Using the submit_job command from Midplane Management Control System<br />
(MMCS)<br />
► Using the mpirun command (a stand-alone program for submitting jobs)<br />
► Using LoadLeveler, which is the IBM job management for AIX and Linux<br />
(batch queuing system)<br />
4.3.1 Submitting a job using MMCS<br />
The system administrator can submit jobs on Blue Gene/L by logging on to the<br />
MMCS console. As shown in 2.2.8, “Job submission” on page 66, the main idea<br />
of providing this access for the administrator is to check the status of the<br />
partitions, I/O nodes, and IP addresses, and to know which file systems have<br />
been mounted on the I/O node(s), and so on. For submitting a job, refer to 2.3.8,<br />
“Check that a simple job can run (mmcs_console)” on page 96.<br />
4.3.2 The mpirun program<br />
The mpirun program is a stand-alone program that is used to execute parallel<br />
applications on the system. It is a command that is bundled in the MPI RPM of the<br />
Blue Gene/L driver set, named bglmpi-2006.1.2-1 (for the Blue Gene/L<br />
driver version used in this book). This RPM is also installed along with the control<br />
system on the SN. Because users are not allowed to log on to the SN, mpirun is<br />
designed so that its operation across the FENs and the SN is transparent to the<br />
user.<br />
The mpirun program is divided into two components, namely front-end mpirun<br />
(which executes on the FEN) and back-end mpirun (which executes on the SN).<br />
This distinction is made to ensure secure access to the control system database.<br />
Normally, the DB2 client is installed on the SN only.<br />
Let us consider an example in which we submit a job using mpirun on the<br />
front-end node. This node communicates with the back-end mpirun and queries the<br />
database about the partition state (in our case, a predefined partition). Depending on<br />
the query result, it decides whether to go ahead and boot the partition or, if the<br />
partition has already been allocated, to return a message. See<br />
Figure 4-1.<br />
Figure 4-1 Front-end and back-end mpirun communication (mpirun on each FEN<br />
connects through rsh/ssh to rshd/sshd on the service node, which runs mpirun_be)<br />
150 IBM System Blue Gene Solution: <strong>Problem</strong> <strong>Determination</strong> <strong>Guide</strong>
Remote command execution setup<br />
In this section, we discuss two remote execution environments that we used in<br />
our sample environment, as shown in Figure 4-2.<br />
► Remote shell (rsh/rshd)<br />
► Secure shell (ssh/sshd)<br />
Figure 4-2 Sample environment used for this redbook (the SN bglsn and the FENs<br />
bglfen1 and bglfen2, each running SLES9 PPC 64-bit, connected through the<br />
service network switch and the functional network switch to Blue Gene rack 00,<br />
front half of midplane R00-M0)<br />
Remote shell (rsh/rshd)<br />
The system administrator must enable the rshd service (part of the xinetd<br />
system) on the SN and the FENs so that users can execute remote commands<br />
between the SN and FENs.<br />
The system administrator can set up the remote shell server (rshd) on the SN and<br />
FENs by first checking whether rsh is enabled, using the /sbin/chkconfig<br />
command as follows:<br />
# /sbin/chkconfig --list rsh<br />
xinetd based services:<br />
rsh: on<br />
If the status is off, you can enable the service as follows:<br />
# /sbin/chkconfig --set rsh on ; /etc/init.d/xinetd restart<br />
Note: The xinetd daemon stores the child daemons’ configuration files in the<br />
/etc/xinetd.d directory. Alternatively, you can check whether rshd is enabled in<br />
the rsh configuration file that is located in this directory. If the disable = no line<br />
(shown in bold in the following example) is present, or if the disable line is<br />
missing altogether, then rshd is enabled.<br />
# cat /etc/xinetd.d/rsh<br />
# default: off<br />
# description:<br />
# The rshd server is a server for the rcmd(3) routine and,<br />
# consequently, for the rsh(1) program. The server provides<br />
# remote execution facilities with authentication based on<br />
# privileged port numbers from trusted hosts.<br />
#<br />
service shell<br />
{<br />
socket_type = stream<br />
protocol = tcp<br />
flags = NAMEINARGS<br />
wait = no<br />
user = root<br />
group = root<br />
log_on_success += USERID<br />
log_on_failure += USERID<br />
server = /usr/sbin/tcpd<br />
# server_args = /usr/sbin/in.rshd -L<br />
server_args = /usr/sbin/in.rshd -aL<br />
instances = 200<br />
disable = no<br />
}<br />
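The rule from the note above (the service is enabled unless an explicit disable = yes line is present) can be checked from a script rather than by eye. This is a sketch; the `xinetd_enabled` helper is our own, not part of xinetd:

```shell
# Return success if the given xinetd service file leaves the service
# enabled, that is, if it contains no explicit "disable = yes" line.
xinetd_enabled() {
    ! grep -Eq '^[[:space:]]*disable[[:space:]]*=[[:space:]]*yes' "$1"
}
```

For example: `xinetd_enabled /etc/xinetd.d/rsh && echo "rshd is enabled"`.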
To allow remote commands to execute without being prompted for a password<br />
between SN and FENs, the remote shell server (rshd) must be aware of the<br />
identities that are allowed for these operations. We tested two ways to implement<br />
this type of execution:<br />
► System-wide implementation, using /etc/hosts.equiv (which could pose a<br />
potential security threat)<br />
► User-based implementation, which requires populating the ~/.rhosts file in<br />
each user’s home directory (permission 600)<br />
Example 4-5 shows the contents of the /etc/hosts.equiv file and the ~/.rhosts file<br />
that we created for our environment (one SN, two FENs, service network,<br />
functional network, and public network). Figure 4-2 on page 151 illustrates this<br />
environment.<br />
Example 4-5 The rsh setup using /etc/hosts.equiv<br />
test1@bglfen1:~> cat /etc/hosts.equiv<br />
#<br />
# hosts.equiv This file describes the names of the hosts which are<br />
# to be considered "equivalent", i.e. which are to be<br />
# trusted enough for allowing rsh(1) commands.<br />
#<br />
# hostname<br />
bglfen1<br />
bglfen2<br />
bglfen1_fn<br />
bglfen2_fn<br />
bglsn<br />
bglsn_fn<br />
Example 4-6 shows the ~/.rhosts file for user test1 on the SN and FENs. We<br />
recommend this method because it allows finer-grained access control.<br />
Example 4-6 rsh setup on a per-user basis (user test1)<br />
test1@bglfen1:~>cat ~/.rhosts<br />
bglfen1 test1<br />
bglfen2 test1<br />
bglsn test1<br />
bglfen1_fn test1<br />
bglfen2_fn test1<br />
bglsn_fn test1<br />
Note: A similar configuration is performed on the service node and the Front-End<br />
Nodes (the ~/.rhosts file or the /etc/hosts.equiv file is distributed to the SN and<br />
FENs). Best cluster practice requires that you set up the same user identities<br />
(user ID and name) on all nodes (in our case, test1).<br />
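To keep the distributed per-user files identical on every node, the ~/.rhosts contents can be generated from a single host list. A sketch; the `gen_rhosts` helper name is ours, and the host names below come from our sample environment:

```shell
# Emit one "host user" line per host, suitable for a ~/.rhosts file.
gen_rhosts() {
    user=$1; shift
    for h in "$@"; do
        printf '%s %s\n' "$h" "$user"
    done
}
```

For example: `gen_rhosts test1 bglfen1 bglfen2 bglsn bglfen1_fn bglfen2_fn bglsn_fn > ~/.rhosts && chmod 600 ~/.rhosts` reproduces Example 4-6 with the required 600 permission.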
After remote shell is set up, verify that it is working as expected. You have to<br />
test remote command execution through rsh from each node to every other<br />
node. Check that the SN and FENs can talk to each other using /usr/bin/rsh in<br />
both directions across the functional network:<br />
► From the SN:<br />
– Using the date command, check the connection to each FEN (repeat this<br />
check for every FEN):<br />
test1@bglsn_fn:~> rsh bglfen1_fn date<br />
Mon Apr 3 14:41:44 EDT 2006<br />
– Using the date command, check the SN itself:<br />
test1@bglsn_fn:~> rsh bglsn_fn date<br />
Mon Apr 3 14:41:44 EDT 2006<br />
► From the FENs:<br />
– Using the date command, check the connection to the SN:<br />
test1@bglfen1_fn:~> rsh bglsn_fn date<br />
Mon Apr 3 14:41:44 EDT 2006<br />
– Using the date command, check the connection to the other FENs (repeat<br />
this check for every FEN):<br />
test1@bglfen1_fn:~> rsh bglfen2_fn date<br />
Mon Apr 3 14:41:44 EDT 2006<br />
If there is any problem executing the date command across nodes, check the<br />
rshd server on the SN and FENs and the /etc/hosts.equiv or ~/.rhosts files.<br />
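The pairwise checks above can be automated with a short loop. The following sketch uses the functional-network host names from our sample environment (adjust NODES for your site); the `check_rsh_all` function and the RSH override are our own additions, the latter existing only so the script can be exercised without a cluster:

```shell
# Run "rsh <node> date" against every node in NODES and report each result.
# Returns nonzero if any node fails, so it can gate further setup steps.
NODES=${NODES:-"bglsn_fn bglfen1_fn bglfen2_fn"}
RSH=${RSH:-rsh}

check_rsh_all() {
    failed=0
    for node in $NODES; do
        if $RSH "$node" date >/dev/null 2>&1; then
            echo "$node: ok"
        else
            echo "$node: FAILED"
            failed=1
        fi
    done
    return $failed
}
```

Run `check_rsh_all` once from the SN and once from each FEN to cover both directions.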
Secure shell (ssh/sshd)<br />
An alternative for remote command execution (which is needed by the job<br />
submission process) in environments that require enhanced security is secure<br />
shell (ssh/sshd). You can use secure shell with mpirun: the mpirun command<br />
has an option (-shell) for specifying which remote shell to execute when<br />
submitting a job. We recommend that you set up secure shell on the SN and<br />
FENs in such a way that the designated users can execute remote commands<br />
but cannot log in to the service node.<br />
Note: This behavior (remote command execution allowed but login not<br />
allowed) is desired for security reasons. Remote shell (rsh/rshd) allows this<br />
behavior because another daemon is used for servicing the login requests,<br />
usually either rlogind or telnetd. However, by default, the secure shell server<br />
(sshd) allows both remote command execution and login; you can alter this<br />
behavior to serve your purpose. Refer to 7.2.4, “Using ssh in a Blue<br />
Gene/L environment” on page 406 and the no-pty option in the<br />
authorized_keys file.<br />
Environment setup for mpirun<br />
You can use the mpirun command to submit (parallel) jobs to the Blue Gene/L<br />
system. You must set up the correct environment for mpirun to function properly.<br />
This section presents the tasks that you need to accomplish to set up mpirun<br />
correctly.<br />
Front-End Nodes (FENs)<br />
You can set up the Front-End Node in two ways:<br />
► Export the MMCS_SERVER_IP variable in the ~/.bashrc or ~/.cshrc (depending on<br />
the user shell) using one of the following commands:<br />
– export MMCS_SERVER_IP=SN_IP_Addr (in ~/.bashrc)<br />
– setenv MMCS_SERVER_IP SN_IP_Addr (in ~/.cshrc)<br />
► Using one of the numerous options of the mpirun command. For example, you<br />
can use the -host option when submitting a job instead of setting the<br />
MMCS_SERVER_IP environment variable.<br />
In addition to MMCS_SERVER_IP, there are other variables that you can set in<br />
the user’s shell to control several aspects of job execution (however, most of<br />
them can be overridden with mpirun command line arguments when<br />
submitting jobs from the FENs).<br />
Service Node<br />
There are three basic settings that are required for the user environment in order<br />
to execute the job successfully using mpirun:<br />
► BRIDGE_CONFIG_FILE<br />
The bridge configuration file specifies the images that are required to boot<br />
partitions on the Blue Gene/L system. When submitting jobs, you can use<br />
either the -partition or the -shape command line argument. You can use the<br />
-partition option only if you have a predefined partition stored in the Blue<br />
Gene/L database. If you instead want to specify the shape of the partition (the<br />
configuration of the Blue Gene/L internal networks), use the -shape<br />
parameter (for a dynamically generated partition).<br />
► DB_PROPERTY<br />
This setting points the mpirun back-end at the database on the SN, where it<br />
checks the availability and state of the requested partition. For information<br />
about block states, refer to Table 4-3 on page 162.<br />
► Sourcing the db2profile<br />
The db2profile script sets the DB2 database environment variables<br />
(including, but not limited to, the binary and library paths for the back-end<br />
mpirun). You should set these variables for every user on the system in their<br />
~/.bashrc or ~/.cshrc files, as shown in Example 4-7.<br />
Example 4-7 Contents of bridge config file and db.properties file<br />
test1@bglfen1:~> cat bridge_config_file.txt<br />
BGL_MACHINE_SN <br />
BGL_MLOADER_IMAGE <br />
BGL_BLRTS_IMAGE <br />
BGL_LINUX_IMAGE <br />
BGL_RAMDISK_IMAGE <br />
test1@bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin> cat db.properties<br />
database_name=bgdb0<br />
database_user=bglsysdb<br />
database_password=bglsysdb<br />
# database_password=db24bgls<br />
database_schema_name=bglsysdb<br />
system=BGL<br />
min_pool_connections=1<br />
# Web Console Configuration<br />
mmcs_db_server_ip=127.0.0.1<br />
mmcs_db_server_port=32031<br />
mmcs_max_reply_size=8029<br />
mmcs_max_history_size=2097152<br />
mmcs_redirect_server_ip=default<br />
mmcs_redirect_server_port=32032<br />
The bridge configuration file is located at<br />
/bgl/BlueLight/ppcfloor/bglsys/bin/bridge.config.<br />
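The db.properties file uses simple key=value lines, so when troubleshooting you can pull a single setting out with sed. This is a sketch; the `get_prop` helper is our own, not part of the Blue Gene/L software:

```shell
# Print the value of a key=value entry from a properties-style file
# (first matching line only).
get_prop() {
    sed -n "s/^$2=//p" "$1" | head -n 1
}
```

For example, `get_prop db.properties database_name` prints `bgdb0` for the file shown in Example 4-7.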
These environment variables are required for every user of the Blue Gene/L<br />
system before submitting jobs. Example 4-8 shows a sample ~/.bashrc file<br />
for user test1.<br />
Example 4-8 Simple ~/.bashrc file<br />
test1@bglfen1:~> cat ~/.bashrc<br />
hstnm=`hostname`<br />
if [ "$hstnm" = "bglsn" ]<br />
then<br />
export BRIDGE_CONFIG_FILE=/bglhome/test1/bridge_config_file.txt<br />
export DB_PROPERTY=/bgl/BlueLight/ppcfloor/bglsys/bin/db.properties<br />
source /bgl/BlueLight/ppcfloor/bglsys/bin/db2profile<br />
fi<br />
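On the FENs, the counterpart of Example 4-8 is to set MMCS_SERVER_IP for any host other than the SN. This is a sketch of the FEN branch of the same ~/.bashrc; the address 192.168.100.49 is only a stand-in for your SN address:

```shell
# FEN side of the ~/.bashrc from Example 4-8: any host that is not the SN
# points mpirun at the service node (substitute your own SN address).
hstnm=`hostname`
if [ "$hstnm" != "bglsn" ]
then
    export MMCS_SERVER_IP=192.168.100.49
fi
```

Combined with the bglsn branch in Example 4-8, one shared ~/.bashrc then works unchanged on the SN and every FEN.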
Job setup validation<br />
After you have configured the remote shell and set up the environment, you can<br />
test the setup using the protocol test that is provided with mpirun. Example 4-9<br />
shows the mpirun check running across the FEN and SN. An Exit status: 0<br />
message indicates that the environment is configured properly.<br />
Example 4-9 Mpirun using only_test_protocol argument<br />
test1@bglfen1:/bglscratch/test1> mpirun -only_test_protocol -exe /bglscratch/test1/hello-world.rts -np 128 -verbose 1<br />
FE_MPI (Info) : Initializing MPIRUN<br />
FE_MPI (Info) : Scheduler interface library loaded<br />
FE_MPI (WARN) : ======================================<br />
FE_MPI (WARN) : = Front-End - Only checking protocol =<br />
FE_MPI (WARN) : = No actual usage of the BG/L Bridge =<br />
FE_MPI (WARN) : ======================================<br />
BE_MPI (WARN) : ======================================<br />
BE_MPI (WARN) : = Back-End - Only checking protocol =<br />
BE_MPI (WARN) : = No actual usage of the BG/L Bridge =<br />
BE_MPI (WARN) : ======================================<br />
BRIDGE (Info) : The machine serial number (alias) is BGL<br />
FE_MPI (Info) : Back-End invoked:<br />
FE_MPI (Info) : - Service Node: bglsn<br />
FE_MPI (Info) : - Back-End pid: 6806 (on service node)<br />
FE_MPI (Info) : Preparing partition<br />
FE_MPI (Info) : Adding job<br />
FE_MPI (Info) : Job added with the following id: 123<br />
FE_MPI (Info) : Starting job 123<br />
FE_MPI (Info) : Waiting for job to terminate<br />
FE_MPI (Info) : BG/L job exit status = (0)<br />
FE_MPI (Info) : Job terminated normally<br />
BE_MPI (Info) : Starting cleanup sequence<br />
BE_MPI (Info) : == BE completed ==<br />
select: Interrupted system call<br />
FE_MPI (Info) : == FE completed ==<br />
FE_MPI (Info) : == Exit status: 0 ==<br />
Note: The mpirun command has numerous command line arguments. To find<br />
out about them in detail, refer to the mpirun user guide that was installed<br />
during your Blue Gene/L system setup.<br />
Job tracking using the Web interface<br />
After the job environment has been configured, you can proceed to job<br />
submission. When you have installed and configured the Blue Gene/L drivers,<br />
you can use the Web interface to check your system. You can access the Web<br />
interface by pointing your Web browser to the following URL:<br />
http:///web/index.php<br />
For more details about the Web interface, see 3.3, “Web interface” on page 119.<br />
Note: On the SN, the system administrator should set up the Web interface<br />
(Web server configuration) to allow browsing from the local network. For<br />
more information, refer to 2.3.3, “Check that BGWEB is running” on page 86.<br />
System administrators define the set of partitions (blocks) based on the Blue<br />
Gene/L system configuration at their site. For example, on a four-rack system,<br />
the set of predefined blocks might include 32-node blocks, 128-node blocks,<br />
and two-rack and four-rack blocks, in addition to the normal midplane and rack<br />
configurations. This setup is very much site dependent. When this configuration<br />
is defined, users can browse the Web site to see the availability of the<br />
partitions and their states.<br />
Figure 4-3 shows the Web interface.<br />
Figure 4-3 Web interface<br />
In Figure 4-3, clicking Runtime displays the total number of available predefined<br />
partitions and the dynamically created partitions on the system. Each row<br />
contains information about the partitions’ state, as shown in Figure 4-4.<br />
Figure 4-4 Partitions available on the system<br />
Note: When you submit a job using mpirun with the -shape option or through<br />
IBM LoadLeveler, the partition is created dynamically and the name of the<br />
block allocated starts with RMP.<br />
Table 4-2 briefly describes the columns shown in Figure 4-4.<br />
Table 4-2 Description of the job information Web page<br />
Column Description<br />
Block ID Indicates the name of the partition or block<br />
Owner Indicates who created the block<br />
Description Indicates how the block was created (using LoadLeveler or predefined in the database)<br />
Status Indicates the status of the block<br />
Time status updated Indicates the last time the status of the block was modified<br />
Time created Indicates the time that the block was created<br />
Size Indicates the size of the block (32, 128, 512, 1024, 2048, and so forth)<br />
Job status Indicates the job’s current status (refer to Table 4-4 on page 162)<br />
Before submitting the job, you should check the job and block Web interface to<br />
see which partitions are free for use. Another way to check the availability of the<br />
blocks is to use the DB2 command line interface on the SN (which can be done<br />
by the system administrator only), as shown in Example 4-10.<br />
Example 4-10 The DB2 CLI command to get the block information on the SN<br />
bglsn:~ # db2 'connect to bgdb0 user bglsysdb using bglsysdb'<br />
Database Connection Information<br />
Database server = DB2/LINUXPPC 8.2.3<br />
SQL authorization ID = BGLSYSDB<br />
Local database alias = BGDB0<br />
bglsn:~ # db2 "select substr(blockid,1,16)blockid,STATUS,OWNER from BGLBLOCK"<br />
BLOCKID STATUS OWNER<br />
---------------- ------ ---------------------------<br />
R000_128 I root<br />
R000_J102_32 F<br />
R000_J104_32 F<br />
R000_J106_32 F<br />
R000_J108_32 F<br />
5 record(s) selected.<br />
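The query output can also be filtered from a script, for example to list only the free blocks. This awk sketch (the `free_blocks` name is ours) parses the output format shown in Example 4-10; the header, separator, and summary lines fall through because their second field is never the single letter F:

```shell
# Read "db2 select ..." output on stdin and print the IDs of blocks whose
# STATUS column is F (free).
free_blocks() {
    awk 'NF >= 2 && $2 == "F" { print $1 }'
}
```

For example: `db2 "select ... from BGLBLOCK" | free_blocks`.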
The Web page shown in Figure 4-4 also includes links to Job Information and<br />
Block Information. Table 4-3 describes the block states.<br />
Table 4-3 Block states on the Blue Gene/L system<br />
Block states Description<br />
(E)rror An initialization error has occurred. You must issue an allocate, allocate_block, or free command to reset the block.<br />
(F)ree The block is available to allocate.<br />
(A)llocated The block has been allocated by a user. IDo connections have been established, but the block has not been booted.<br />
(C)onfiguring The block is in the process of being booted. This is an internal state for communicating between LoadLeveler and the MMCS DB.<br />
(B)ooting The block is in the process of being booted but has not yet reached the point that CIOD has been contacted on all I/O nodes.<br />
(I)nitialized The block is ready to run application programs. CIODB has contacted CIOD on all I/O nodes.<br />
(D)e-allocating The block is in the process of being freed. This is an internal state for communicating between LoadLeveler and the MMCS DB.<br />
Until now, we have discussed the partition states. Initially, when mpirun boots the<br />
partition, the job state is not known. However, after the block is booted, the job<br />
status changes from Queued to Start, then to Running, and, finally, to<br />
Terminated. Table 4-4 describes the job status types after the block has<br />
booted.<br />
Table 4-4 Blue Gene/L job states (without IBM LoadLeveler)<br />
Job status Description<br />
(E)rror An initialization error occurred. The _errtext field contains the error message.<br />
(Q)ueued The job has been created in the BGLJOB database table but has not yet started.<br />
(S)tart The job has been released to start but has not yet started running.<br />
(R)unning The job is running.<br />
(T)erminated The job has ended.<br />
(D)ying The job has been killed but has not yet ended.<br />
Note: The job states shown in Table 4-4 differ from those of jobs submitted<br />
through LoadLeveler. Refer to 4.4, “IBM LoadLeveler” on page 167 for more<br />
information.<br />
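The Status columns in the Web interface and the database show only the single-letter codes from Table 4-4. A small helper (ours, purely for convenience in monitoring scripts) expands a job code to its name:

```shell
# Expand a one-letter Blue Gene/L job status code (Table 4-4) to its name.
job_state() {
    case $1 in
        E) echo "Error" ;;
        Q) echo "Queued" ;;
        S) echo "Start" ;;
        R) echo "Running" ;;
        T) echo "Terminated" ;;
        D) echo "Dying" ;;
        *) echo "Unknown" ;;
    esac
}
```

For example, `job_state R` prints `Running`.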
4.3.3 Example of submitting a job using mpirun<br />
Each job or application runs in its own block and takes over the entire block. (A<br />
block cannot be shared, entirely or partially, at any time between two different<br />
jobs.) Therefore, each job can take anywhere from 32 compute nodes (the<br />
smallest block size) up to the entire system.<br />
Note: The details of the job submission process differ depending on whether<br />
you are using a job scheduler such as LoadLeveler, directly invoking mpirun,<br />
or submitting jobs on the SN through mmcs_db_console.<br />
Submitting a job to the Blue Gene/L system using mpirun requires several<br />
command line arguments, as shown in Example 4-11. In this example, we use<br />
rsh, which is the default remote command program.<br />
Example 4-11 Job submission using the mpirun command (by default uses rsh)<br />
test1@bglfen1:/bglscratch/test1> mpirun -partition <partition_name><br />
-np <number_of_tasks> -exe <full_path_to_executable><br />
-cwd <working_directory> -args <program_arguments><br />
Figure 4-5 shows a diagram of the job submission process using mpirun.<br />
Figure 4-5 Job submission process (the user, the job scheduler, the<br />
administrator, the Service Node, the Cluster-Wide File System, and the Blue<br />
Gene/L blocks; the numbered steps are described next)<br />
The job submission process is as follows:<br />
1. The user edits and compiles the job code on one of the FENs.<br />
2. The user moves the job to the Cluster Wide File System (CWFS). The CWFS<br />
can be any supported type of file system, such as the NFS or GPFS that is<br />
available to the FENs, the SN, and the I/O nodes. The job and work files can<br />
reside permanently on the CWFS or can be copied there from the FEN’s local<br />
file system.<br />
3. The job scheduler assigns a block to the user.<br />
4. As an alternative to step 3, the system administrator, working through the SN,<br />
can assign a block to the user.<br />
5. The SN initializes the block, (that is, it boots the Compute Nodes and I/O<br />
nodes).<br />
6. The block’s I/O nodes load the job code and data from the CWFS and begin<br />
its execution.<br />
7. All job I/O runs to and from the CWFS. The job executes on the Compute<br />
Node. Its data travels across the collective network to the I/O node and<br />
across the Functional Network (Gigabit Ethernet) to the CWFS.<br />
Job runtime information (Web interface)<br />
After the job is submitted using mpirun, you can view its status through the<br />
Web interface. Figure 4-6 shows information about the jobs’ states.<br />
Figure 4-6 Information about each running job<br />
Table 4-5 explains the columns shown in Figure 4-6.<br />
Table 4-5 Job information columns<br />
Column Description<br />
Job ID The job currently running on the system<br />
Block ID The partition on which the job is running<br />
User name Displays the user running the job<br />
Job name The mpirun job name (indicates from which FEN the job is submitted)<br />
Mode Under which mode the job is or will be running (an mpirun option)<br />
Status Relates to the discussion in Table 4-4<br />
Status last modified Relates to the change of each state (for example, from Q to S to R to T)<br />
Users and system administrators can check the job status and obtain detailed<br />
information by following the Job ID link. Figure 4-7 shows a sample job (ID 240).<br />
Figure 4-7 Detailed description of a sample job<br />
The following checklist summarizes the required options and settings for<br />
submitting a job on the Blue Gene/L system with mpirun.<br />
The mpirun checklist<br />
Use this mpirun checklist for the required options to submit a job on the Blue<br />
Gene/L system:<br />
► Check the MMCS_SERVER_IP environment variable (echo $MMCS_SERVER_IP) on<br />
the FEN. If this variable is empty or not set to the SN IP address, set it. Refer<br />
to the FEN environment setup as described in “Environment setup for<br />
mpirun” on page 155.<br />
► Check for the bridge configuration and database properties files and<br />
environment variables, respectively, ($BRIDGE_CONFIG_FILE and<br />
$DB_PROPERTY). If these variables are empty or if the files have the wrong<br />
contents, correct them and then set the variables.<br />
► Check whether the db2profile file (which is in the<br />
/bgl/BlueLight/ppcfloor/bglsys/bin directory) is sourced on the SN.<br />
► Set and verify the remote command execution environment as described in<br />
“Remote shell (rsh/rshd)” on page 151 or “Secure shell (ssh/sshd)” on<br />
page 154.<br />
► Test the environment by using mpirun with the -only_test_protocol option to<br />
check the job submission through FEN and SN (refer to Example 4-9).<br />
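The environment-variable items in this checklist can be scripted as a preflight test. The following sketch checks only those items; the variable names come from this chapter, and the `mpirun_preflight` function name is our own:

```shell
# Check the mpirun environment described in the checklist; print one line
# per problem and return nonzero if anything is missing.
mpirun_preflight() {
    bad=0
    [ -n "$MMCS_SERVER_IP" ] \
        || { echo "MMCS_SERVER_IP is not set"; bad=1; }
    [ -n "$BRIDGE_CONFIG_FILE" ] && [ -r "$BRIDGE_CONFIG_FILE" ] \
        || { echo "BRIDGE_CONFIG_FILE is unset or unreadable"; bad=1; }
    [ -n "$DB_PROPERTY" ] && [ -r "$DB_PROPERTY" ] \
        || { echo "DB_PROPERTY is unset or unreadable"; bad=1; }
    return $bad
}
```

Run it on the SN before submitting; on the FEN, only the MMCS_SERVER_IP check applies. The file and protocol tests (-only_test_protocol) from the checklist still need to be run separately.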
4.4 IBM LoadLeveler<br />
IBM LoadLeveler is a software package that provides utilities for job submission<br />
and scheduling (workload management) on various UNIX platforms. Blue<br />
Gene/L is one of the platforms supported by IBM LoadLeveler.<br />
Because of the characteristics of the Blue Gene/L platform, using LoadLeveler<br />
requires, in addition to basic LoadLeveler knowledge, information that is<br />
specific to the Blue Gene/L environment. You can find information about<br />
LoadLeveler in the official IBM manual IBM LoadLeveler Using and<br />
Administering <strong>Guide</strong>, SA22-7881. LoadLeveler software comes separately from<br />
the Blue Gene/L software. Depending on the system type, a different version of<br />
LoadLeveler is installed. For the latest documentation, see:<br />
http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp?topic=<br />
/com.ibm.cluster.loadl.doc/llbooks.html<br />
In this section, we present an overview of how LoadLeveler works in the Blue<br />
Gene/L environment. This information is essential for identifying and analyzing<br />
problems.<br />
4.4.1 LoadLeveler overview<br />
A LoadLeveler cluster is a collection of nodes (stand-alone systems and LPARs)<br />
that you can use to run your computing jobs. LoadLeveler manages the defined<br />
resources and schedules jobs based on resource availability and workload<br />
characteristics. Jobs are run on the nodes. LoadLeveler uses a central instance<br />
for managing the entire cluster. Thus, this can be considered a management<br />
domain type of cluster.<br />
Each node that is part of a LoadLeveler cluster has a file structure that contains<br />
the LoadLeveler code and configuration. Some nodes in the cluster are assigned<br />
special functions that are carried out by daemons. The most important daemons<br />
are the Master, the Negotiator (also known as the Central Manager), Schedd,<br />
and Startd.<br />
Figure 4-8 illustrates the daemons and their relationship on a single-node<br />
LoadLeveler cluster. The job to be submitted goes to the Central Manager, which<br />
dispatches it to the Scheduler. After analyzing the job characteristics<br />
(requirements) and checking for available resources, the Scheduler sends the<br />
job to the Start daemon. The Start daemon starts the Starter process to run the<br />
job.<br />
In the case of a parallel job, the Start daemon initiates multiple Starters. Each<br />
Starter runs a parallel task. However, this is true only for “classic” parallel<br />
systems, which run their jobs on SMP machines (each node running a full copy<br />
of the operating system) clustered by a traditional clustering infrastructure. For<br />
Blue Gene/L, LoadLeveler works in a different way, because it does not interact<br />
with the Compute nodes (which do not provide kernel support for running<br />
multiple processes at the same time) or the I/O nodes.<br />
Figure 4-8 LoadLeveler daemons on a single-node cluster (the Master starts the<br />
Negotiator (Central Manager), Schedd, and Startd; jobs are dispatched from the<br />
Negotiator to Schedd, scheduled to Startd, and run by Starter processes)<br />
To expand a single-node cluster into a multi-node cluster, multiple instances of<br />
the daemons can run on each node. Not all the daemons are required to run on<br />
every node. Depending on which daemons are running on a node, the node has<br />
a specific role in a LoadLeveler cluster.<br />
The most important node is the Central Manager node, which runs the Central<br />
Manager daemon. There is only one Central Manager instance in a LoadLeveler<br />
cluster, even though one (or more) alternate Central Managers can be defined<br />
for failover and recovery purposes. The remainder of the nodes in the cluster<br />
have the option of running the Scheduler daemon, the Start daemon, or both.<br />
Figure 4-9 shows LoadLeveler daemons running on different nodes in a<br />
multi-node cluster.<br />
Figure 4-9 A multi-node LoadLeveler cluster (the Central Manager node runs the<br />
Master and Negotiator; the remaining nodes, running AIX, SLES, or Red Hat, run<br />
the Master plus Schedd, Startd, or both)<br />
Note: The LoadLeveler Central Manager daemon serves as the single point of<br />
control, storage, and management of the cluster and job information. This<br />
daemon must be running for the LoadLeveler cluster to function.<br />
It is possible to have a mixed LoadLeveler cluster, that is, a cluster whose nodes<br />
run different operating systems. The operating systems supported by<br />
LoadLeveler are IBM AIX, SUSE Linux, and Red Hat Linux. For details on the<br />
versions that are supported, check the IBM LoadLeveler readme files that come<br />
with the product.<br />
Note: In a mixed cluster, the binary files are not compatible between different<br />
operating systems. You must configure the cluster so that jobs are scheduled<br />
on the appropriate nodes. See IBM LoadLeveler Using and Administering<br />
<strong>Guide</strong>, SA22-7881 for further information.<br />
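One common way to steer a job step to compatible nodes is a requirements statement in the job command file. The following fragment is a sketch (the values shown are illustrative; use the Arch and OpSys strings that the llstatus command reports for your nodes):<br />

```
#@ requirements = (Arch == "PPC64") && (OpSys == "Linux2")
```

With this line in place, LoadLeveler dispatches the step only to nodes whose reported architecture and operating system match, which prevents a binary built for one platform from being started on another.<br />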
4.4.2 Principles of operation in a Blue Gene/L environment<br />
In a Blue Gene/L environment, LoadLeveler code runs on the SN and several
FENs. These nodes form the LoadLeveler cluster, which is an ordinary
cluster of Linux nodes with the usual LoadLeveler configuration (nothing
specific to Blue Gene/L).<br />
Because LoadLeveler does not run on the Blue Gene/L system or the I/O nodes,<br />
it relies on the bridge API from the SN to access these nodes. (Figure 4-10<br />
illustrates that the Blue Gene/L system is not actually part of the LoadLeveler<br />
cluster.) LoadLeveler calls the bridge API functions through the control server to<br />
set up the Blue Gene/L partitions. After the partitions are created, the job<br />
information is passed to the mpirun front-end and back-end programs, which<br />
have the role of submitting the job to the Blue Gene/L system.<br />
Figure 4-10 The Blue Gene/L nodes are outside of the LoadLeveler cluster<br />
The LoadLeveler Central Manager daemon (CM or negotiator) has been<br />
enhanced with bridge API calls for Blue Gene/L operations. It uses the bridge<br />
API function calls to query the partition information from the Blue Gene/L<br />
database and uses these function calls to carry out operations such as:<br />
► Adding or removing a partition record<br />
► Sending operations to a service controller<br />
► Checking the status of the partitions<br />
The system administrator has to define the Central Manager on the SN when
configuring the LoadLeveler cluster. Thus, the SN must be part of the
LoadLeveler cluster. Apart from the Blue Gene/L file structure and the
database, the other LoadLeveler file structures and locations are set up
similarly to a regular LoadLeveler cluster.<br />
Note: Although you can configure an alternate Central Manager (as in a
regular LoadLeveler cluster), failover to a node other than the SN is not
desirable. As a result of such a failover, the alternate Central Manager no
longer has access to the Blue Gene/L control server and database.<br />
The LoadLeveler Scheduler daemon (schedd) schedules jobs according to<br />
workloads and resources on each node in the cluster. In a Blue Gene/L system,<br />
schedd treats mpirun jobs as common jobs (not Blue Gene/L specific) that run on<br />
the SN or FENs. However, the SN is reserved for Blue Gene/L administrative<br />
workload. Therefore, it is not desirable that schedd runs on the SN. The system<br />
administrator can choose to run schedd on all or some FENs. Thus, schedd is not<br />
shown in Figure 4-10 and Figure 4-11 on page 171.<br />
The scheduler (schedd) does not schedule the job on the Blue Gene/L compute<br />
nodes. It decides which FEN runs the mpirun and passes the job information to<br />
the LoadLeveler Start daemon (startd) on that node.<br />
Figure 4-11 Central Manager accesses Blue Gene/L nodes through the bridge API<br />
Usually, the FENs are where users submit jobs into the LoadLeveler queue. Not<br />
all of the FENs need to run schedd either. However, the LoadL_admin file<br />
specifies at least one node as the public scheduler (schedd). The LoadLeveler<br />
Start daemon (startd) runs on each FEN. The startd daemon receives mpirun<br />
information from schedd. Then, startd starts the LoadLeveler Starter process<br />
(starter), which starts the mpirun job on the local node.<br />
Figure 4-12 shows the output from the LoadLeveler llstatus command with the<br />
LoadLeveler cluster running.<br />
Figure 4-12 The llstatus output on a Blue Gene/L system<br />
4.4.3 How LoadLeveler plugs into Blue Gene/L<br />
The LoadLeveler software works similarly on many platforms. The following<br />
tasks apply specifically to Blue Gene/L:<br />
► Configuring LoadLeveler for Blue Gene/L<br />
► Making the Blue Gene/L libraries available to LoadLeveler<br />
► Setting Blue Gene/L specific environment variables<br />
► Using Blue Gene/L specific keywords in job command file<br />
In the following sections, we discuss these tasks in more detail.<br />
4.4.4 Configuring LoadLeveler for Blue Gene/L<br />
To enable LoadLeveler to recognize a Blue Gene/L system, the LoadL_config file<br />
must contain specific keywords, as shown in Example 4-13 with the<br />
recommended values. Only the LoadLeveler system administrator can change<br />
these keywords in LoadL_config, and they should not be changed while<br />
LoadLeveler is running. See also IBM LoadLeveler Using and Administering<br />
<strong>Guide</strong>, SA22-7881.<br />
Example 4-13 Blue Gene/L specific configuration keywords<br />
BG_ENABLED = true<br />
BG_CACHE_PARTITIONS = true<br />
BG_ALLOW_LL_JOBS_ONLY = false<br />
BG_MIN_PARTITION_SIZE = 32<br />
The keyword BG_ENABLED is essential. Setting it to true tells LoadLeveler that this<br />
is a Blue Gene/L cluster. LoadLeveler then uses the Blue Gene/L bridge API to<br />
talk with the Blue Gene/L control system.<br />
Setting the keyword BG_CACHE_PARTITIONS to true tells LoadLeveler to reuse<br />
existing partitions which have been previously allocated by LoadLeveler.<br />
The keyword BG_ALLOW_LL_JOBS_ONLY is set to false to allow users to run
mpirun programs without using LoadLeveler.<br />
The keyword BG_MIN_PARTITION_SIZE specifies that the smallest number of
compute nodes allowed in a partition is 32.<br />
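As a quick problem-determination aid, the keywords shown in Example 4-13 can be checked mechanically. The following sketch is ours (the function name and the config-file path are not part of LoadLeveler); it greps a LoadL_config file for the Blue Gene/L keywords and flags any that are missing:<br />

```shell
#!/bin/sh
# List the Blue Gene/L keywords found in a LoadL_config file and flag
# any that are missing. Pass the path to the config file as an argument.
check_bg_keywords() {
    config="$1"
    for kw in BG_ENABLED BG_CACHE_PARTITIONS BG_ALLOW_LL_JOBS_ONLY BG_MIN_PARTITION_SIZE; do
        line=$(grep -E "^[[:space:]]*${kw}[[:space:]]*=" "$config" || true)
        if [ -n "$line" ]; then
            # Squeeze runs of blanks so the output lines up
            echo "$line" | tr -s ' \t' ' '
        else
            echo "$kw: MISSING"
        fi
    done
}
# Example (path is an assumption): check_bg_keywords /home/loadl/LoadL_config
```

A missing BG_ENABLED keyword, in particular, means that LoadLeveler never attempts to talk to the Blue Gene/L control system at all.<br />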
4.4.5 Making the Blue Gene/L libraries available to LoadLeveler<br />
The Blue Gene/L bridge API library is provided with the Blue Gene/L code,<br />
together with the DB2 libraries. In order for LoadLeveler to access these libraries,<br />
the system administrator needs to set up appropriate symbolic links.<br />
On the SN or FEN, these libraries are usually referenced in the /usr/lib64 or
/usr/lib directories. However, the actual library binary can reside in another
location. Instead of copying the binaries to /usr/lib64 or /usr/lib, using symbolic
links avoids the situation where two different binaries of the same library exist in
two separate locations.<br />
Note: Be aware that symbolic links might break when there are changes in
software levels. Broken links sometimes are not detected and can cause
problems.<br />
Therefore, the libraries used by LoadLeveler are listed for reference purposes.<br />
You should check the links and the binary checksums when users encounter<br />
errors. A brief description of each library is provided to help identify problems.<br />
Note: The library directory paths and names can be system-specific. You can<br />
use the ldconfig and ldd commands to check library links and dynamic<br />
dependencies.<br />
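The link check described in the note above can be automated for a whole directory of links. The helper function below is ours (not part of the Blue Gene/L or LoadLeveler software); pass it a directory and the library names you expect to find there:<br />

```shell
#!/bin/sh
# Report the state of each expected library link in a directory:
# OK (resolves), BROKEN LINK (dangling symlink), or MISSING (no entry).
check_links() {
    dir="$1"; shift
    for lib in "$@"; do
        path="$dir/$lib"
        if [ -L "$path" ] && [ ! -e "$path" ]; then
            echo "$lib: BROKEN LINK -> $(readlink "$path")"
        elif [ ! -e "$path" ]; then
            echo "$lib: MISSING"
        else
            echo "$lib: OK -> $(readlink -f "$path")"
        fi
    done
}
# Example: check_links /usr/lib64 libbglbridge.so libbgldb.so libdb2.so
```

Because the directory is a parameter, the same check works for the 64-bit links in /usr/lib64 and the 32-bit links in /usr/lib.<br />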
The binaries are located in /bgl/BlueLight/ppcfloor/bglsys/lib64. Their symbolic<br />
links are in /usr/lib64. The following group of libraries belongs to the 64-bit code<br />
version:<br />
libbgldb.so,<br />
libtableapi.so Provide the interface to the database tables on the SN for the bridge API.<br />
libbglmachine.so Provides the interface to the Blue Gene/L hardware.<br />
libbglbridge.so Provides a 64-bit bridge API library that is used to access the state of, and send orders to, the Blue Gene/L system.<br />
libsaymessage.so Provides a 64-bit library that is used by Blue Gene/L software for producing log messages.<br />
Some libraries come in both 64-bit and 32-bit versions (from both DB2 and
LoadLeveler). Although the library names are the same, you should link the
64-bit version into /usr/lib64 and the 32-bit version into /usr/lib.<br />
We use libdb2.so as an example. You can find the file libdb2.so in<br />
/opt/IBM/db2/V8.1/lib64 and /opt/IBM/db2/V8.1/lib. Create proper links to point to<br />
the appropriate version.<br />
The libdb2.so library is a standard 64-bit DB2 client library. It is used by 64-bit<br />
programs to connect to the DB2 database for queries and updates.<br />
The LoadLeveler libraries that are located in /opt/ibmll/LoadL/full/lib are:<br />
► libllapi.so<br />
► libsched_if.so<br />
► libsched_if32.so<br />
► libllpoe.so<br />
The first two libraries are in 64-bit format and need to be linked to /usr/lib64. The<br />
remaining two libraries are in 32-bit format and need to be linked to /usr/lib.<br />
libllapi.so Provides LoadLeveler’s 64-bit API library. It is used by the LoadLeveler daemons and commands, and by external 64-bit programs that need to access the LoadLeveler API.<br />
libsched_if.so,<br />
libsched_if32.so Include the interfaces between mpirun and LoadLeveler. The mpirun program uses these API calls to get job parameters from LoadLeveler (the partition in which to start, and so on).<br />
libllpoe.so Provides the 32-bit version of libllapi.so. Although the binary name is libllpoe.so, point the link to /usr/lib/libllapi.so.<br />
Note: In a conventional LoadLeveler installation (using rpm and the install_ll<br />
script that is provided), libllpoe.so, libsched_if.so, and libsched_if32.so are<br />
copied to the appropriate directories and do not need to be linked.<br />
Example 4-14 shows a script that creates the required symbolic links.<br />
Example 4-14 LoadLeveler script to set up required links<br />
DVR_DIR=/bgl/BlueLight/ppcfloor<br />
cd /usr/lib64<br />
ln -f -s /opt/IBM/db2/V8.1/lib64/libdb2.so.1 libdb2.so.1<br />
ln -f -s libdb2.so.1 libdb2.so<br />
ln -f -s $DVR_DIR/bglsys/lib64/libbgldb.so.1 libbgldb.so.1<br />
ln -f -s libbgldb.so.1 libbgldb.so<br />
ln -f -s $DVR_DIR/bglsys/lib64/libtableapi.so.1 libtableapi.so.1<br />
ln -f -s libtableapi.so.1 libtableapi.so<br />
ln -f -s $DVR_DIR/bglsys/lib64/libbglmachine.so.1 libbglmachine.so.1<br />
ln -f -s libbglmachine.so.1 libbglmachine.so<br />
ln -f -s $DVR_DIR/bglsys/lib64/libbglbridge.so.1 libbglbridge.so.1<br />
ln -f -s libbglbridge.so.1 libbglbridge.so<br />
ln -f -s $DVR_DIR/bglsys/lib64/libsaymessage.so.1 libsaymessage.so.1<br />
ln -f -s libsaymessage.so.1 libsaymessage.so<br />
ln -f -s /opt/ibmll/LoadL/full/lib/libllapi.so libllapi.so.1<br />
ln -f -s libllapi.so.1 libllapi.so<br />
cd /usr/lib<br />
ln -f -s /opt/IBM/db2/V8.1/lib/libdb2.so.1 libdb2.so.1<br />
ln -f -s libdb2.so.1 libdb2.so<br />
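After running the script, a quick sanity check with the ldd command shows whether a dynamically linked binary now resolves all of its libraries: anything that did not resolve appears as "not found". The helper name and the mpirun path below are our own illustration:<br />

```shell
#!/bin/sh
# Print any shared libraries that a binary cannot resolve; if everything
# resolves, say so. Relies on the standard ldd command.
missing_deps() {
    ldd "$1" 2>/dev/null | grep "not found" || echo "all dependencies resolved"
}
# Example (path is an assumption):
# missing_deps /bgl/BlueLight/ppcfloor/bglsys/bin/mpirun
```

Running this against mpirun and the LoadLeveler daemons after a driver or DB2 update is a fast way to catch a link that silently broke.<br />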
4.4.6 Setting Blue Gene/L specific environment variables<br />
You can use the following variables in the Blue Gene/L environment. You set<br />
these variables to point to the appropriate configuration files and SN. The<br />
variables are:<br />
► BRIDGE_CONFIG_FILE<br />
► DB_PROPERTY<br />
► MMCS_SERVER_IP<br />
Because these variables are user specific, you can set them in the ~/.bashrc file
in the user’s home directory. Example 4-15 shows the content of the
~/.bashrc that we used in our test environment.<br />
Example 4-15 Setting up environment variables with ~/.bashrc<br />
test1@bglsn_~/>cat ~/.bashrc<br />
# .bashrc<br />
# User specific aliases and functions<br />
# Source global definitions<br />
if [ -f /etc/bashrc ]; then<br />
. /etc/bashrc<br />
fi<br />
source /bgl/BlueLight/ppcfloor/bglsys/bin/db2profile<br />
. /bgl/BlueLight/ppcfloor/bglsys/bin/db.properties<br />
export BRIDGE_CONFIG_FILE=/bgl/BlueLight/ppcfloor/bglsys/bin/bridge.config<br />
export DB_PROPERTY=/bgl/BlueLight/ppcfloor/bglsys/bin/db.properties<br />
export MMCS_SERVER_IP=bglsn.itso.ibm.com<br />
The bridge.config file contains the Blue Gene/L system serial number and the<br />
names and locations of the images to be loaded onto compute and I/O nodes.<br />
The db.properties file contains DB2 database information and the Blue Gene/L<br />
Web console configuration.<br />
Note: These environment variables are required for the LoadLeveler<br />
commands to work properly. Checking the values of these variables and the<br />
contents of the configuration files is an important task in problem<br />
determination.<br />
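That check can be sketched as a small shell function (the function name is ours, not part of the product). It verifies that each variable is set and, for the two file variables, that the file it points to is readable:<br />

```shell
#!/bin/sh
# Verify the Blue Gene/L environment variables. MMCS_SERVER_IP holds a
# host name, so only its presence is checked; the other two must point
# to readable files.
check_bgl_env() {
    for var in BRIDGE_CONFIG_FILE DB_PROPERTY MMCS_SERVER_IP; do
        eval "val=\$$var"
        if [ -z "$val" ]; then
            echo "$var: NOT SET"
        elif [ "$var" != MMCS_SERVER_IP ] && [ ! -r "$val" ]; then
            echo "$var: $val (file not readable)"
        else
            echo "$var: $val"
        fi
    done
}
```

Running this under the user who reported a problem is often faster than inspecting each variable by hand, because an unset variable and a variable that points to a missing file produce the same LoadLeveler symptoms.<br />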
4.4.7 LoadLeveler and the Blue Gene/L job cycle<br />
LoadLeveler is a complex job scheduling subsystem. A job can go through<br />
multiple stages from the time it is submitted into the queue until it finishes. A job<br />
in the queue is frequently seen in one of these states: Idle, Starting,
Running, Held, Remove Pending, and so on. For detailed information, see IBM
LoadLeveler Using and Administering <strong>Guide</strong>, SA22-7881.<br />
Table 4-6 provides a quick overview of the LoadLeveler job states, which are<br />
referred to in the problem determination process.<br />
Table 4-6 Summary of job states in LoadLeveler<br />
State | Brief description | Remarks<br />
(I)dle | The job is being considered to run on a system, although no system has been selected. | The system here is a FEN, not the Blue Gene/L system or nodes.<br />
User (H)old | The job has been put on user hold. | There are many reasons why a job is put on hold.<br />
(ST)arting | The job is starting: dispatched, received by the target system, and the job environment is being set up. | Job information is being passed to mpirun.<br />
(R)unning | The job is running: dispatched and started on the designated system. | LoadLeveler has completed passing job information to mpirun. See the different Blue Gene/L job states.<br />
Remove Pending (RP) | The job is in the process of being removed. | The mpirun job is being removed according to the Blue Gene/L job status.<br />
A LoadLeveler job in Blue Gene/L goes through different states. During the job<br />
starting process, the job information is passed to the mpirun front end. At this<br />
point, the job is in Running state in the LoadLeveler queue. However, the mpirun<br />
process picks up the job from there and starts the tasks on a Blue Gene/L<br />
partition. Then, the Blue Gene/L job goes through different states in the partition.<br />
Note: The Blue Gene/L job status is different from the LoadLeveler job states<br />
(see Table 4-6 and Table 4-4 on page 162).<br />
The user who submits a job into the LoadLeveler queue usually checks the job
status through LoadLeveler commands. However, LoadLeveler commands
cannot show the Blue Gene/L partition and job status. The system
administrator has access to the Blue Gene/L service console to check the Blue
Gene/L database.<br />
The user and system administrator have two different views of the job. Seeing<br />
where the job is in its life cycle can help determine its status. Also, if a job fails,<br />
the user and system administrator have to trace back to determine where the<br />
failure has occurred.<br />
4.4.8 LoadLeveler job submission process<br />
Throughout the steps in the LoadLeveler job submission process, we use a
simple mpirun job that runs a “Hello world!” program. Example 4-16 lists a sample
job command file that we used in our environment. See “Job command file” on
page 198 for brief descriptions of the keywords that this file uses.<br />
Example 4-16 A sample LoadLeveler job command file named hello.cmd<br />
#@ job_type = bluegene<br />
##@ executable = /usr/bin/mpirun<br />
#@ executable = /bgl/BlueLight/ppcfloor/bglsys/bin/mpirun<br />
#@ bg_size = 128<br />
#@ arguments = -verbose 4 -exe /bgl/hello/hello.rts<br />
#@ output = /bgl/loadl/out/hello.$(schedd_host).$(jobid).$(stepid).out<br />
#@ error = /bgl/loadl/out/hello.$(schedd_host).$(jobid).$(stepid).err<br />
#@ environment =<br />
COPY_ALL:BRIDGE_CONFIG_FILE=/bgl/BlueLight/ppcfloor/bglsys/bin/bridge.c<br />
onfig:DB_PROPERTY=/bgl/BlueLight/ppcfloor/bglsys/bin/db.properties:MMCS<br />
_SERVER_IP=bglsn.itso.ibm.com<br />
#@ notification = error<br />
#@ notify_user = loadl<br />
#@ class = small<br />
#@ queue<br />
Note: In the steps in this process, we assume that the user is logged in to one<br />
of the FENs.<br />
The following steps detail the LoadLeveler job’s life cycle:<br />
Step 1: Submitting the job<br />
A user submits a job from an FEN using the following LoadLeveler command:<br />
llsubmit hello.cmd<br />
The command returns a message that indicates that the job has been submitted.<br />
The job is associated with a LoadLeveler Job ID, and the job information is sent<br />
to the LoadLeveler Central Manager (CM). The CM daemon runs on the SN and<br />
receives job information through IP communication (the IP port on which the CM<br />
is listening). This TCP/IP port is specified in the LoadL_config file.<br />
As shown in Example 4-17, the llq command does not display the Blue Gene/L<br />
job information at this point, because the job has only been queued into<br />
LoadLeveler.<br />
Example 4-17 Output of the llq -b command<br />
loadl@bglfen1:/bgl/loadl> llq -b<br />
Id Owner Submitted LL BG PT Partition Size<br />
________________________ __________ ___________ __ __ __ ________________ _____<br />
bglfen1.47.0 loadl 3/29 12:14 I<br />
1 job step(s) in queue, 1 waiting, 0 pending, 0 running, 0 held, 0 preempted<br />
Figure 4-13 shows the LoadLeveler job submission process on Blue Gene/L.<br />
Figure 4-13 Job submission process<br />
Step 2: Getting Blue Gene/L information<br />
The LoadLeveler CM receives the job information (at this point, the status of the<br />
job is I for idle). Meanwhile, the CM retrieves a snapshot of the Blue Gene/L<br />
system using the bridge API (from the Blue Gene/L database). The information<br />
requested by CM includes:<br />
► A list of nodes<br />
► The locations of those nodes<br />
► The status of those nodes<br />
The bridge API calls are responded to by the control server, which accesses the<br />
DB2 database for this information. The control server daemon also updates the<br />
database with the status of the Blue Gene/L hardware.<br />
Figure 4-14 shows the process of LoadLeveler retrieving information from the<br />
Blue Gene/L database through the bridge API.<br />
Figure 4-14 Getting Blue Gene/L information<br />
Step 3: Deciding on the partition to use<br />
With the information received from the previous step, the CM constructs a 3D<br />
model of the Blue Gene/L system in memory. The CM then uses the 3D model to<br />
determine the partition to use for the job. Part of this decision process is whether<br />
to reuse an existing partition or to create a new partition. Figure 4-15 illustrates<br />
this process.<br />
Note: LoadLeveler has full responsibility for managing the partitions that it
creates for running jobs.<br />
Figure 4-15 Deciding on the partition<br />
Step 4: Updating the Blue Gene/L database<br />
The chosen partition for the job is either an existing one or a new one. If it is a<br />
new partition, it is created dynamically. The CM uses the bridge API to insert the<br />
partition record into the database (see Figure 4-16). At this point in the process,<br />
nothing happens (yet) on the Blue Gene/L system.<br />
Figure 4-16 Updating the Blue Gene/L database<br />
Steps 5 and 6: Initializing the partition<br />
For the dynamically created partition, the CM uses the bridge API to change the<br />
partition state to allocating (A). This change in state triggers the booting of the<br />
partition under the control of the Blue Gene/L daemons (see Figure 4-17). The<br />
state of the Blue Gene/L partition now changes to I for initializing and<br />
LoadLeveler, represented by the root user on the SN, is the owner of this<br />
dynamic partition.<br />
Note: The state of the job in the LoadLeveler queue is idle (I), but the state of
the Blue Gene/L partition is initializing (also I). Although the two states are
both abbreviated to I, their meanings are obviously different. At the end of
this step, the job in the LoadLeveler queue changes to ST (“starting”), a
transition state, for a very short time.<br />
Figure 4-17 Initializing the partition<br />
Step 7: Launching mpirun<br />
LoadLeveler goes ahead and schedules the job on one of the FENs. The Start<br />
daemon (startd) receives the job and initiates a Starter process. The Starter<br />
launches the mpirun front-end process, which communicates with the mpirun<br />
back-end process on the SN (Figure 4-18).<br />
At this point, the job status in the LoadLeveler queue is changed to running (R).<br />
LoadLeveler is basically done passing the job to mpirun, which now has the
responsibility to run the job.<br />
Note: The state of the mpirun job in LoadLeveler queue is running (R).<br />
However, the state of the Blue Gene/L partition is now changing from<br />
initializing (I) to ready (R).<br />
This step can be thought of as the point at which a user invokes the mpirun
command (without using LoadLeveler). However, there is a small additional
check that the mpirun front-end process needs to do: it calls the LoadLeveler
API to check whether LoadLeveler is installed and can run. One of the checks
makes sure that the LoadLeveler configuration allows mpirun to run jobs outside
of LoadLeveler. See the discussion of the BG_ALLOW_LL_JOBS_ONLY configuration
parameter in 4.4.4, “Configuring LoadLeveler for Blue Gene/L” on page 172.<br />
Figure 4-18 LoadLeveler launching mpirun<br />
Steps 8 and 9: Starting the parallel job<br />
The back-end mpirun process uses the bridge API to set the partition to ready (R)<br />
state in the database, which triggers the control daemon to execute the job on<br />
the partition (see Figure 4-19).<br />
Figure 4-19 Starting the parallel job<br />
Note: Again, the job state in the LoadLeveler queue is running (R), while the
partition state that represents the job is ready (also R). Both states are
abbreviated as R but obviously have different meanings.<br />
Step 10: Waiting for the job to complete<br />
The mpirun back-end process uses the bridge API to poll for the job status until<br />
the job is complete (see Figure 4-20). During this time, mpirun monitors job<br />
activities.<br />
Figure 4-20 Waiting for the job to complete<br />
Note: LoadLeveler is not aware of what is going on with the job that is running<br />
in the Blue Gene/L partition. Thus, the job status in LoadLeveler queue<br />
remains in the running (R) state during this period.<br />
Steps 11 and 12: Cleaning the partition<br />
After mpirun receives “Job complete” status, the mpirun processes terminate,<br />
and the LoadLeveler Starter process terminates as well. The startd daemon<br />
receives the job status from Starter process and sends it to schedd, which reports<br />
the job status back to the CM on the SN. The CM uses the bridge API to set the<br />
state of the partition to F for free, which means that it will not be reused. At this<br />
point, the job cycle completes (see Figure 4-21).<br />
Figure 4-21 Cleaning the partition<br />
Note: If LoadLeveler is configured to reuse partitions, the partition is
not freed. Instead, it is marked ready (R) to be reused for the next job (if it fits).<br />
4.4.9 LoadLeveler checklist<br />
You can use the tasks presented in this section to review normal LoadLeveler
status with attention to detail. These checks help you spot abnormal behavior
and investigate hard-to-find problems or pitfalls.<br />
The checklist includes:<br />
► LoadLeveler cluster and node status<br />
► LoadLeveler run queue<br />
► Job command file<br />
► LoadLeveler processes, logs, and persistent storage<br />
► LoadLeveler configuration keywords<br />
► Environment variables, network, and library links<br />
LoadLeveler cluster and node status<br />
The llstatus command displays the status of the LoadLeveler cluster.<br />
Figure 4-22 shows the following important information that is provided by the<br />
llstatus command:<br />
1. Blue Gene is present. This message means that LoadLeveler can talk with<br />
the Blue Gene/L control server.<br />
2. The Central Manager is defined on node bglsn.itso.ibm.com. This node is<br />
the Blue Gene/L service node. This message is a good indication that the CM<br />
is up and running.<br />
3. Scheduler daemons (Schedd) are available. They are ready to schedule jobs<br />
on two of the FENs.<br />
Tip: Schedd dispatches the mpirun jobs.<br />
4. The job starting daemons (Startd) are idle. They are ready to start jobs that<br />
come their way. Startd forks a child process called Starter, which then starts<br />
the mpirun front-end process.<br />
loadl@bglfen1:/usr/lib64> llstatus<br />
Name Schedd InQ Act Startd Run LdAvg Idle Arch OpSys<br />
bglfen1.itso.ibm.com Avail 0 0 Idle 0 0.00 1184 PPC64 Linux2<br />
bglfen2.itso.ibm.com Avail 0 0 Idle 0 0.00 9999 PPC64 Linux2<br />
PPC64/Linux2 2 machines 0 jobs 0 running<br />
Total Machines 2 machines 0 jobs 0 running<br />
The Central Manager is defined on bglsn.itso.ibm.com<br />
The BACKFILL scheduler with Blue Gene support is in use<br />
Blue Gene is present<br />
All machines on the machine_list are present.<br />
Figure 4-22 llstatus command output<br />
When comparing Figure 4-22 to Figure 4-23 on page 188, you can see the
following problems in the LoadLeveler cluster:<br />
1. Blue Gene is absent. This message indicates that LoadLeveler cannot talk
to the control server. As a result, jobs are not able to run.<br />
2. The Central Manager is now serving from node bglfen2.itso.ibm.com, which is
a node other than the service node. This situation is not desirable in a Blue
Gene/L cluster, so it is worthwhile to pay attention to the error messages.<br />
3. One of the schedd daemons is down, which means that LoadLeveler is having
some problems on node bglfen1.itso.ibm.com. However, this could also
be normal if the system administrator previously decided not to run schedd on
this node.<br />
4. One of the start daemons is down (indicated by the Idle status), which also
means that there are some problems with LoadLeveler on node
bglfen2.itso.ibm.com.<br />
5. One node is absent. Although LoadLeveler can function with missing (or
absent) nodes, the individual nodes might have important roles in the cluster.<br />
loadl@bglfen1:/usr/lib64> llstatus<br />
Name Schedd InQ Act Startd Run LdAvg Idle Arch OpSys<br />
bglfen1.itso.ibm.com Down 0 0 Idle 0 0.00 1184 PPC64 Linux2<br />
bglfen2.itso.ibm.com Avail 0 0 Idle 0 0.00 9999 PPC64 Linux2<br />
PPC64/Linux2 2 machines 0 jobs 0 running<br />
Total Machines 2 machines 0 jobs 0 running<br />
The Central Manager is defined on bglsn.itso.ibm.com, but is unusable<br />
Alternate Central Manager is serving from bglfen2.itso.ibm.com<br />
The BACKFILL scheduler with Blue Gene support is in use<br />
Blue Gene is absent<br />
The following machines are absent<br />
bglsn.itso.ibm.com<br />
Figure 4-23 <strong>Problem</strong>s reported from the llstatus command<br />
In the worst-case scenario, the llstatus command does not return any
information, but only error messages similar to those in Example 4-18.<br />
Example 4-18 Error messages reported from llstatus regarding LoadL_negotiator errors<br />
loadl@bglfen1:/usr/lib64> llstatus<br />
03/29 16:52:50 llstatus: 2539-463 Cannot connect to bglsn.itso.ibm.com<br />
"LoadL_negotiator" on port 9614. errno = 111<br />
03/29 16:52:50 llstatus: 2539-463 Cannot connect to bglsn.itso.ibm.com<br />
"LoadL_negotiator" on port 9614. errno = 111<br />
llstatus: 2512-301 An error occurred while receiving data from the<br />
LoadL_negotiator daemon on host bglsn.itso.ibm.com.<br />
The error messages should be interpreted accordingly. For example, the following<br />
error message could be interpreted in two ways:<br />
2539-463 Cannot connect to bglsn.itso.ibm.com "LoadL_negotiator" on port<br />
9614. errno = 111<br />
This message could mean either that the LoadLeveler Negotiator (Central<br />
Manager) daemon is down or that LoadLeveler is not running at all.<br />
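On Linux, errno = 111 is ECONNREFUSED ("Connection refused"): the host answered, but nothing was listening on the negotiator port, which is consistent with both interpretations. When an unfamiliar errno value appears in a LoadLeveler message, it can be translated with the standard C error table; for example (this uses Python's stdlib errno module, nothing LoadLeveler specific):

```shell
# Translate the "errno = 111" from the llstatus output into its symbolic
# name and message, using the platform's standard errno table.
python3 -c 'import errno, os; print(errno.errorcode[111], "-", os.strerror(111))'
# -> ECONNREFUSED - Connection refused
```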
The llstatus command in Blue Gene/L<br />
The llstatus command provides Blue Gene/L specific options (command flags),<br />
such as -b, -B, and -P, which display information that is related to Blue Gene/L.<br />
(See IBM LoadLeveler Using and Administering Guide, SA22-7881, for a complete<br />
reference on the llstatus command.) Using these option flags, you can check<br />
and compare the LoadLeveler Blue Gene/L related information with information<br />
from other sources, such as the Web interfaces.<br />
Issuing the llstatus -b command shows the overall dimension of the Blue<br />
Gene/L system and jobs in the queue. In Example 4-19 and Example 4-20, the<br />
llstatus -b command is issued on two different Blue Gene/L systems.<br />
Example 4-19 The llstatus -b command on a one-midplane Blue Gene/L system<br />
loadl@bglfen1:~> llstatus -b<br />
Name Base Partitions c-nodes InQ Run<br />
BGL 1x1x1 8x8x8 0 0<br />
Example 4-20 The llstatus -b command on a 20-rack Blue Gene/L system<br />
thanhlam@bgwfen1:~> llstatus -b<br />
Name Base Partitions c-nodes InQ Run<br />
BGL 5x4x2 40x32x16 0 0<br />
Issuing llstatus with the -b and -l options (combined) displays more detail of<br />
the Blue Gene/L system structure, including network switches and cabling.<br />
Example 4-21 shows output from llstatus -b -l on a one-midplane Blue<br />
Gene/L system.<br />
Note: Because this is the minimum configuration possible (one midplane), all<br />
Blue Gene/L specific networks (Torus, Barrier, and Collective) are inside the<br />
midplane. Thus, there is no additional cabling (wiring).<br />
Example 4-21 The llstatus -b -l command on a single midplane system<br />
loadl@bglfen1:~> llstatus -b -l | more<br />
Total Blue Gene Base Partitions 1<br />
Total Blue Gene Compute Nodes 512<br />
Machine Size in Base Partitons X=1 Y=1 Z=1<br />
Machine Size in Compute Nodes X=8 Y=8 Z=8<br />
-- list of base partitions --<br />
Z = 0<br />
=====<br />
+------------+<br />
| R000|<br />
0 | |<br />
| |<br />
+------------+<br />
-- list of switches --<br />
Switch ID: X_R000<br />
Switch State: UP<br />
Base Partition: R000<br />
Switch Dimension: X<br />
Switch Connections: NONE<br />
Switch ID: Y_R000<br />
Switch State: UP<br />
Base Partition: R000<br />
Switch Dimension: Y<br />
Switch Connections: NONE<br />
Switch ID: Z_R000<br />
Switch State: UP<br />
Base Partition: R000<br />
Switch Dimension: Z<br />
Switch Connections: NONE<br />
-- list of wires --<br />
Wire Id: R000X_R000<br />
Wire State: UP<br />
FromComponent=R000 FromPort=MINUS_X<br />
ToComponent=X_R000 ToPort=PORT_S0<br />
PartitionState=NONE Partition=NONE<br />
Wire Id: R000Y_R000<br />
Wire State: UP<br />
FromComponent=R000 FromPort=MINUS_Y<br />
ToComponent=Y_R000 ToPort=PORT_S0<br />
PartitionState=NONE Partition=NONE<br />
Wire Id: R000Z_R000<br />
Wire State: UP<br />
FromComponent=R000 FromPort=MINUS_Z<br />
ToComponent=Z_R000 ToPort=PORT_S0<br />
PartitionState=NONE Partition=NONE<br />
Wire Id: X_R000R000<br />
Wire State: UP<br />
FromComponent=X_R000 FromPort=PORT_S1<br />
ToComponent=R000 ToPort=PLUS_X<br />
PartitionState=NONE Partition=NONE<br />
Wire Id: Y_R000R000<br />
Wire State: UP<br />
FromComponent=Y_R000 FromPort=PORT_S1<br />
ToComponent=R000 ToPort=PLUS_Y<br />
PartitionState=NONE Partition=NONE<br />
Wire Id: Z_R000R000<br />
Wire State: UP<br />
FromComponent=Z_R000 FromPort=PORT_S1<br />
ToComponent=R000 ToPort=PLUS_Z<br />
PartitionState=NONE Partition=NONE<br />
Example 4-22 is extracted from the output of the llstatus -b -l command on a<br />
20-rack Blue Gene/L system.<br />
Note: Because the output is very long, only part of it is captured in this example.<br />
Example 4-22 The llstatus -b -l command on a 20-rack Blue Gene/L system<br />
thanhlam@bgwfen1:~> llstatus -b<br />
Total Blue Gene Base Partitions 40<br />
Total Blue Gene Compute Nodes 20480<br />
Machine Size in Base Partitons X=5 Y=4 Z=2<br />
Machine Size in Compute Nodes X=40 Y=32 Z=16<br />
-- list of base partitions --<br />
Z = 1<br />
=====<br />
+----------------------------------------------------------------+<br />
| R011| R110| R311| R411| R210|<br />
3 | READY| READY| READY| READY| READY|<br />
| wR0| wR1| wR21R31| wR411| wR21R31|<br />
|----------------------------------------------------------------|<br />
| R031| R130| R331| R431| R230|<br />
2 | READY| READY| READY| READY| READY|<br />
| wR0| wR1| wR33| wR431| wR23|<br />
|----------------------------------------------------------------|<br />
| R021| R120| R321| R421| R220|<br />
1 | READY| READY| READY| READY| READY|<br />
| wR0| wR1| wR22R32| wR421| wR22R32|<br />
|----------------------------------------------------------------|<br />
| R001| R100| R301| R401| R200|<br />
0 | READY| READY| READY| READY| READY|<br />
| wR0| wR1| wR20R30| wR401| wR20R30|<br />
+----------------------------------------------------------------+<br />
Z = 0<br />
=====<br />
+----------------------------------------------------------------+<br />
| R010| R111| R310| R410| R211|<br />
3 | READY| READY| READY| READY| READY|<br />
| wR0| wR1| wR21R31| wR410| wR21R31|<br />
|----------------------------------------------------------------|<br />
| R030| R131| R330| R430| R231|<br />
2 | READY| READY| READY| READY| READY|<br />
| wR0| wR1| wR33| wR430| wR23|<br />
|----------------------------------------------------------------|<br />
| R020| R121| R320| R420| R221|<br />
1 | READY| READY| READY| READY| READY|<br />
| wR0| wR1| wR22R32| wR420| wR22R32|<br />
|----------------------------------------------------------------|<br />
| R000| R101| R300| R400| R201|<br />
0 | READY| READY| READY| READY| READY|<br />
| wR0| wR1| wR20R30| wR400| wR20R30|<br />
+----------------------------------------------------------------+<br />
-- list of switches --<br />
........ >>>>> Omitted lines
LoadLeveler run queue<br />
LoadLeveler is based on a job queuing principle. To identify a particular job that a<br />
user has submitted, a job identification (job ID) is returned when the job is sent<br />
successfully by the llsubmit command, as shown in Example 4-24.<br />
Example 4-24 A job ID returned by the llsubmit command<br />
loadl@bglfen1:/bgl/loadl> llsubmit hello.cmd<br />
llsubmit: The job "bglfen1.itso.ibm.com.53" has been submitted.<br />
loadl@bglfen1:/bgl/loadl> llq<br />
Id                       Owner      Submitted   ST PRI Class        Running On<br />
------------------------ ---------- ----------- -- --- ------------ ----------<br />
bglfen1.53.0             loadl      4/3  09:32  I  50  small<br />
1 job step(s) in queue, 1 waiting, 0 pending, 0 running, 0 held, 0 preempted<br />
In this example, after the job is submitted successfully, the job number 53 shows<br />
up in the queue. The job ID bglfen1.53.0 is a unique identifier for this job. The<br />
number zero (0) is the job step identifier (jobstepid) in case of a job that contains<br />
more than one step.<br />
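Because the host part of the identifier is itself dotted (an FQDN in the log files, a short host name in the llq output), the fields are best peeled off the right-hand end. The following is a small sketch in plain shell string handling, nothing LoadLeveler specific:

```shell
# Split a LoadLeveler job step ID of the form <host>.<jobnum>.<stepid>.
# The host part may contain dots (an FQDN), so take fields from the right.
id="bglfen1.itso.ibm.com.54.0"
stepid="${id##*.}"      # last dotted field      -> job step identifier
rest="${id%.*}"         # drop the step identifier
jobnum="${rest##*.}"    # next field              -> job number
host="${rest%.*}"       # whatever is left        -> submitting host
echo "host=$host job=$jobnum step=$stepid"
# -> host=bglfen1.itso.ibm.com job=54 step=0
```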
Note: The job ID that is returned by llsubmit has the long host name with the<br />
domain name, but the queue displays only the short host name with the job ID<br />
because of the limited number of characters that can fit on a line of the screen.<br />
However, LoadLeveler log files include the job ID with long host names<br />
(FQDN).<br />
In a queue that has hundreds or perhaps thousands of jobs, you can filter the llq<br />
command output with the job ID.<br />
The job ID is also used in other LoadLeveler commands such as llcancel. The<br />
llcancel command tells LoadLeveler to terminate the job if it is running and to<br />
remove it from the queue. For example:<br />
llcancel bglfen1.53<br />
You can use the llq -l command to get more information about a job.<br />
Another useful flag is -s, used as llq -s jobstep_id. If a job is in idle (I)<br />
state, using the -s flag tells the llq command to analyze and display the<br />
reasons that the job cannot run at the moment (see Example 4-25).<br />
Example 4-25 Reasons for job in idle state<br />
loadl@bglfen1:/bgl/loadl> llsubmit hello.cmd<br />
llsubmit: The job "bglfen1.itso.ibm.com.54" has been submitted.<br />
loadl@bglfen1:/bgl/loadl> llq<br />
Id                       Owner      Submitted   ST PRI Class        Running On<br />
------------------------ ---------- ----------- -- --- ------------ ----------<br />
bglfen1.54.0             loadl      4/3  10:10  I  50  small<br />
1 job step(s) in queue, 1 waiting, 0 pending, 0 running, 0 held, 0 preempted<br />
loadl@bglfen1:/bgl/loadl> llq -s bglfen1.54.0<br />
=============== Job Step bglfen1.itso.ibm.com.54.0 ===============<br />
Job Step Id: bglfen1.itso.ibm.com.54.0<br />
Job Name: bglfen1.itso.ibm.com.54<br />
Step Name: 0<br />
Structure Version: 10<br />
Owner: loadl<br />
Queue Date: Mon 03 Apr 2006 10:10:20 AM EDT<br />
Status: Idle<br />
Reservation ID:<br />
Requested Res. ID:<br />
Scheduling Cluster:<br />
Submitting Cluster:<br />
Sending Cluster:<br />
Requested Cluster:<br />
Schedd History:<br />
Outbound Schedds:<br />
Submitting User:<br />
Execution Factor: 1<br />
Dispatch Time:<br />
Completion Date:<br />
Completion Code:<br />
Favored Job: No<br />
User Priority: 50<br />
user_sysprio: 0<br />
class_sysprio: 0<br />
group_sysprio: 0<br />
System Priority: -157448<br />
q_sysprio: -157448<br />
Previous q_sysprio: 0<br />
Notifications: Error<br />
Virtual Image Size: 472 kb<br />
Large Page: N<br />
Checkpointable: no<br />
Ckpt Start Time:<br />
Good Ckpt Time/Date:<br />
Ckpt Elapse Time: 0 seconds<br />
Fail Ckpt Time/Date:<br />
Ckpt Accum Time: 0 seconds<br />
Checkpoint File:<br />
Ckpt Execute Dir:<br />
Restart From Ckpt: no<br />
Restart Same Nodes: no<br />
Restart: yes<br />
Preemptable: no<br />
Preempt Wait Count: 0<br />
Hold Job Until:<br />
RSet: RSET_NONE<br />
Mcm Affinity Options:<br />
Cmd: /bgl/BlueLight/ppcfloor/bglsys/bin/mpirun<br />
Args: -verbose 2 -exe /bgl/hello/hello.rts<br />
Env:<br />
In: /dev/null<br />
Out: /bgl/loadl/out/hello.bglfen1.54.0.out<br />
Err: /bgl/loadl/out/hello.bglfen1.54.0.err<br />
Initial Working Dir: /bgl/loadl<br />
Dependency:<br />
Resources:<br />
Requirements: (Arch == "PPC64") && (OpSys == "Linux2")<br />
Preferences:<br />
Step Type: Blue Gene<br />
Size Requested: 128<br />
Size Allocated:<br />
Shape Requested:<br />
Shape Allocated:<br />
Wiring Requested: MESH<br />
Wiring Allocated:<br />
Rotate: True<br />
Blue Gene Status:<br />
Blue Gene Job Id:<br />
Partition Requested:<br />
Partition Allocated:<br />
Error Text:<br />
Node Usage: shared<br />
Submitting Host: bglfen1.itso.ibm.com<br />
Schedd Host: bglfen1.itso.ibm.com<br />
Job Queue Key:<br />
Notify User: loadl<br />
Shell: /bin/bash<br />
LoadLeveler Group: No_Group<br />
Class: small<br />
Ckpt Hard Limit: undefined<br />
Ckpt Soft Limit: undefined<br />
Cpu Hard Limit: undefined<br />
Cpu Soft Limit: undefined<br />
Data Hard Limit: undefined<br />
Data Soft Limit: undefined<br />
Core Hard Limit: undefined<br />
Core Soft Limit: undefined<br />
File Hard Limit: undefined<br />
File Soft Limit: undefined<br />
Stack Hard Limit: undefined<br />
Stack Soft Limit: undefined<br />
Rss Hard Limit: undefined<br />
Rss Soft Limit: undefined<br />
Step Cpu Hard Limit: undefined<br />
Step Cpu Soft Limit: undefined<br />
Wall Clk Hard Limit: 00:30:00 (1800 seconds)<br />
Wall Clk Soft Limit: 00:30:00 (1800 seconds)<br />
Comment:<br />
Account:<br />
Unix Group: loadl<br />
NQS Submit Queue:<br />
NQS Query Queues:<br />
Negotiator Messages:<br />
Bulk Transfer: No<br />
Step Adapter Memory: 0 bytes<br />
Adapter Requirement:<br />
Step Cpus: 0<br />
Step Virtual Memory: 0.000 mb<br />
Step Real Memory: 0.000 mb<br />
================= EVALUATIONS FOR JOB STEP bglfen1.itso.ibm.com.54.0 ================<br />
Not enough resources to start now:<br />
Quarter "Q1" of BP "R000" is busy.<br />
Quarter "Q2" of BP "R000" does not have a nodecard with > 0 io_nodes.<br />
Quarter "Q3" of BP "R000" does not have a nodecard with > 0 io_nodes.<br />
Quarter "Q4" of BP "R000" does not have a nodecard with > 0 io_nodes.<br />
Not enough resources for this step as top-dog:<br />
Quarter "Q1" of BP "R000" is busy.<br />
Quarter "Q2" of BP "R000" does not have a nodecard with > 0 io_nodes.<br />
Quarter "Q3" of BP "R000" does not have a nodecard with > 0 io_nodes.<br />
Quarter "Q4" of BP "R000" does not have a nodecard with > 0 io_nodes.<br />
Not enough resources to start now:<br />
Quarter "Q1" of BP "R000" is busy.<br />
Quarter "Q2" of BP "R000" does not have a nodecard with > 0 io_nodes.<br />
Quarter "Q3" of BP "R000" does not have a nodecard with > 0 io_nodes.<br />
Quarter "Q4" of BP "R000" does not have a nodecard with > 0 io_nodes.<br />
Not enough resources for this step as top-dog:<br />
Quarter "Q1" of BP "R000" is busy.<br />
Quarter "Q2" of BP "R000" does not have a nodecard with > 0 io_nodes.<br />
Quarter "Q3" of BP "R000" does not have a nodecard with > 0 io_nodes.<br />
Quarter "Q4" of BP "R000" does not have a nodecard with > 0 io_nodes.<br />
Blue Gene/L specific information for the run queue<br />
The LoadLeveler job ID is also stored in the Blue Gene/L database with the job<br />
record. You can use the DB2 select command to retrieve job information, as<br />
shown in Example 4-26.<br />
Example 4-26 Retrieving jobid from DB2 database<br />
bglsn:~ # db2 "select jobid,blockid,jobname from tbgljob_history where<br />
username='loadl'"<br />
JOBID BLOCKID JOBNAME<br />
----------- ---------------- ------------------------------------------------<br />
184 RMP22Mr154151027 bglsn.itso.ibm.com.6.0<br />
183 RMP22Mr153314019 bglsn.itso.ibm.com.5.0<br />
185 RMP22Mr160740038 bglsn.itso.ibm.com.7.0<br />
202 RMP24Mr135201017 bglfen1.itso.ibm.com.18.0<br />
203 RMP24Mr151425028 bglfen1.itso.ibm.com.19.0<br />
229 R000_128 mpirun.2954.bglfen1<br />
230 R000_128 mpirun.3140.bglfen1<br />
255 RMP03Ap112830043 bglfen1.itso.ibm.com.53.0<br />
8 record(s) selected.<br />
Note: The dynamic conditions of the Blue Gene/L partitions and the<br />
LoadLeveler nodes make it complicated for llq to diagnose all possible<br />
problems in such an environment. However, the reasons displayed can serve<br />
as hints and starting points for investigating live problems on the system.<br />
Note: The JOBID from Blue Gene/L database is the Blue Gene job identifier.<br />
The LoadLeveler job ID is called JOBNAME in the database table.<br />
You can view the same job information from the Blue Gene/L Web interface.<br />
From the Blue Gene/L home page, click Runtime and then Job Information<br />
(see Figure 4-24).<br />
Figure 4-24 Job information from the Web interface<br />
Job command file<br />
To submit a job using LoadLeveler, a job command file is required by the<br />
llsubmit command. We have used a sample job command file (Example 4-16 on<br />
page 178) throughout the discussions in this chapter. We kept the number of<br />
keywords in this file to a minimum. There are many other keywords available.<br />
See IBM LoadLeveler Using and Administering Guide, SA22-7881, for a<br />
complete reference on job command file keywords.<br />
Note: In the job command file, each keyword starts on a new line with two<br />
special characters, the number sign (#) followed by the at sign (@). The<br />
number sign (#) is the same character that starts a comment in shell scripting<br />
(bash). If job_type = serial, LoadLeveler executes the job command file as<br />
though it were a shell script. The @ character tells LoadLeveler that the line<br />
contains a job command keyword to be evaluated; the line is not interpreted<br />
by the shell because the shell considers it a comment.<br />
The following list describes each keyword in the file briefly and provides hints at<br />
what could go wrong with the keyword where possible:<br />
► #@ job_type<br />
There are basically three types of jobs supported by LoadLeveler:<br />
– serial<br />
– parallel<br />
– bluegene<br />
In this book, we only discuss the bluegene job type.<br />
► #@ job_type = bluegene<br />
This line is mandatory in the job command file. Without this keyword,<br />
LoadLeveler does not understand other Blue Gene/L related keywords. In<br />
fact, this keyword tells LoadLeveler to use the bridge API to exchange<br />
information with the Blue Gene/L system. Example 4-27 shows the llsubmit<br />
command returning an error message when the job command file does not<br />
contain the keyword #@ job_type = bluegene.<br />
Example 4-27 Missing job_type in job command file<br />
loadl@bglfen1:/bgl/loadl> llsubmit hello.cmd<br />
llsubmit: 2512-585 The "bg_size" keyword is only valid for "job_type =<br />
BLUEGENE" job steps.<br />
llsubmit: 2512-051 This job has not been submitted to LoadLeveler.<br />
► #@ executable = /usr/bin/mpirun<br />
This directive tells LoadLeveler that the executable for a Blue Gene/L job is<br />
the mpirun command, which is invoked by LoadLeveler to launch an MPI job<br />
on the Blue Gene/L nodes. If this keyword is missing or points to a different<br />
executable, LoadLeveler cannot find the mpirun program (and fails to submit<br />
the job).<br />
► #@ arguments<br />
This keyword contains the arguments that are passed to mpirun. They must<br />
be typed here exactly as they are entered on the mpirun command line. For<br />
example, to run the “Hello world!” program with mpirun on the command line,<br />
we used the following syntax:<br />
mpirun -exe /bglscratch/test1/hello-file.rts -args 10 -verbose 1<br />
In the LoadLeveler job command file, this line is translated to:<br />
#@ executable = /usr/bin/mpirun<br />
#@ arguments = -exe /bglscratch/test1/hello-file.rts -args 10<br />
-verbose 1<br />
Note: LoadLeveler does not validate the syntax of the command string that is<br />
passed to mpirun. If there is a problem with an argument or value, mpirun<br />
returns a message in the job’s stderr(2).<br />
► #@ environment<br />
This keyword passes any environment variables that the job needs to set<br />
when it is running. The reserved word COPY_ALL copies all the user’s shell<br />
environment variables for the job (as displayed by the commands set or env).<br />
► #@ output and #@ error<br />
These two keywords contain the directory (directories) where the job’s<br />
stdout(1) and stderr(2) are sent. If these two keywords are not specified in<br />
the job command file, LoadLeveler redirects the job’s stderr(2) and<br />
stdout(1) to /dev/null.<br />
Note: These directories have to be writable by the user ID that runs the job. If<br />
the directories do not exist or are not accessible, the job is rejected by<br />
LoadLeveler.<br />
► #@ notification<br />
This keyword consists of a reserved word that indicates that LoadLeveler<br />
should notify the user ID specified in the #@ notify_user keyword.<br />
► #@ notify_user<br />
This keyword contains the user ID that is going to receive LoadLeveler’s<br />
notification in case the notification condition is set (#@ notification).<br />
► #@ class<br />
Depending on how the LoadLeveler cluster is configured, the job can choose<br />
to run under a LoadLeveler class.<br />
Note: If the class specified by the job command file is not defined in the<br />
LoadLeveler configuration, the job is going to remain idle (I) in the queue. See<br />
“LoadLeveler configuration keywords” on page 205 for information about how<br />
to identify class definitions in the LoadLeveler configuration.<br />
► #@ queue<br />
This keyword is usually the last keyword in a job command file. It is not set to<br />
any value. The command llsubmit returns an error if the #@ queue line is<br />
missing from the job command file (as shown in Example 4-28).<br />
Example 4-28 The llsubmit error when missing #@queue keyword<br />
loadl@bglsn:/bgl/loadl> llsubmit hello.cmd<br />
llsubmit: 2512-058 The command file "hello.cmd" does not contain any<br />
"queue" keywords.<br />
llsubmit: 2512-051 This job has not been submitted to LoadLeveler.<br />
Blue Gene/L specific keywords<br />
The following list provides the keywords that are specific to Blue Gene/L:<br />
► #@ bg_size<br />
This keyword contains an integer that specifies the number of compute nodes<br />
that are required for this job to run. This is equivalent to the argument -np<br />
for mpirun.<br />
► #@ bg_shape<br />
This keyword specifies the requested shape of the partition. It is equivalent to<br />
the argument flag -shape for mpirun. (The wiring, MESH or TORUS, is<br />
requested separately; compare the Shape Requested and Wiring Requested<br />
fields in Example 4-25.)<br />
► #@ bg_partition<br />
This keyword can be set to a partition name. The partition has to be<br />
predefined. It is equivalent to the argument flag -partition for mpirun.<br />
Note: One of these keywords should be used in the job command file instead<br />
of passing the mpirun argument flags -np, -shape, or -partition in the<br />
#@ arguments keyword. When mpirun receives the job information from<br />
LoadLeveler, the partition is either created with the specified number of<br />
compute nodes and shape, or the specified predefined partition is chosen.<br />
When debugging problems for jobs with a complex command file, start with a<br />
simple file as described in this section. Make sure that the job can run with this<br />
file. Use the “Hello world!” program if necessary. Then, you can add keywords<br />
gradually into the same job command file until the problem is observed.<br />
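Putting these keywords together, a minimal Blue Gene job command file for the "Hello world!" job can be sketched as follows. This is a hedged example: the paths, class name, and size are taken from the examples in this chapter and must be adapted to your site, and the $(jobid) substitution in the output file names is only one common way to keep per-job output apart. The final grep checks confirm that the two keywords whose absence produces the llsubmit errors in Example 4-27 and Example 4-28 are present.

```shell
# Write a minimal Blue Gene/L job command file (illustrative values from
# this chapter; adjust paths and class name for your site).
cat > /tmp/hello.cmd <<'EOF'
#@ job_type   = bluegene
#@ executable = /usr/bin/mpirun
#@ arguments  = -exe /bgl/hello/hello.rts -verbose 1
#@ output     = /bgl/loadl/out/hello.$(jobid).out
#@ error      = /bgl/loadl/out/hello.$(jobid).err
#@ class      = small
#@ bg_size    = 128
#@ queue
EOF
# Sanity checks before submitting: the keywords whose absence causes the
# llsubmit errors shown in Example 4-27 and Example 4-28.
grep -q '^#@ job_type' /tmp/hello.cmd && grep -q '^#@ queue' /tmp/hello.cmd \
  && echo "hello.cmd looks submittable"
# -> hello.cmd looks submittable
```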
LoadLeveler processes, logs, and persistent storage<br />
As discussed in 4.4.1, “LoadLeveler overview” on page 167, the cluster is<br />
managed and run by different LoadLeveler daemons. Figure 4-8 on page 168<br />
shows all the LoadLeveler daemons running in the background on a node.<br />
However, LoadLeveler is designed in such a way that not all daemons should run<br />
on every node in the cluster. It is normal to have only a subset of daemons<br />
running on every node.<br />
In order to know for sure which LoadLeveler daemons are running, you can use<br />
the ps -ef command (filtered for LoadLeveler processes), as shown in<br />
Example 4-29. In our case, the Negotiator daemon (or CM) and the Master<br />
daemon are running on the SN.<br />
Example 4-29 LoadLeveler daemons running on Blue Gene/L SN<br />
loadl@bglsn:/bgl/loadl> ps -ef | grep LoadL<br />
loadl  27425     1  0 Apr01 ?      00:00:00 /opt/ibmll/LoadL/full/bin/LoadL_master<br />
loadl  27436 27425  0 Apr01 ?      00:03:17 LoadL_negotiator -f -c /tmp -C /tmp<br />
loadl  14892 11456  0 15:43 pts/34 00:00:00 grep LoadL<br />
Running the same command on an FEN shows the Master daemon, the<br />
Scheduler daemon, the Start daemon, and the Starter daemon running, as<br />
shown in Example 4-30.<br />
Example 4-30 LoadLeveler daemons running on Blue Gene/L FEN<br />
loadl@bglfen1:~> ps -ef | grep LoadL<br />
loadl  18931     1  0 Apr01 ?     00:00:00 /opt/ibmll/LoadL/full/bin/LoadL_master<br />
loadl  18940 18931  0 Apr01 ?     00:00:00 LoadL_schedd -f -c /tmp -C /tmp<br />
loadl  18941 18931  0 Apr01 ?     00:04:48 LoadL_startd -f -C /tmp -c /tmp<br />
loadl    820 18941  0 08:51 ?     00:00:00 LoadL_starter -p 131 -c /tmp -C /tmp<br />
loadl   1950  1891  0 13:45 pts/6 00:00:00 grep LoadL<br />
The two previous examples match the LoadLeveler cluster configuration, as<br />
shown by the llstatus command in Example 4-31.<br />
Example 4-31 Matching daemons with llstatus command<br />
loadl@bglfen1:~> llstatus<br />
Name                 Schedd InQ Act Startd Run LdAvg Idle Arch  OpSys<br />
bglfen1.itso.ibm.com Avail  0   0   Idle   0   0.00  395  PPC64 Linux2<br />
bglfen2.itso.ibm.com Avail  0   0   Idle   0   0.03  9999 PPC64 Linux2<br />
PPC64/Linux2 2 machines 0 jobs 0 running<br />
Total Machines 2 machines 0 jobs 0 running<br />
The Central Manager is defined on bglsn.itso.ibm.com<br />
The BACKFILL scheduler with Blue Gene support is in use<br />
Blue Gene is present<br />
All machines on the machine_list are present.<br />
This configuration is defined in the LoadLeveler configuration files, which are<br />
described in “LoadLeveler configuration keywords” on page 205.<br />
The most important daemon is the Negotiator. If the Negotiator cannot start,<br />
LoadLeveler commands such as llstatus and llq will not work. To determine<br />
which node is supposed to be the CM, check the LoadL_admin file and look for<br />
the following line:<br />
central_manager = true<br />
Example 4-32 shows the lines extracted from the file LoadL_admin, in which CM<br />
is defined on node bglsn.itso.ibm.com.<br />
Example 4-32 CM defined in LoadL_admin<br />
bglsn.itso.ibm.com: type = machine<br />
schedd_host = true<br />
central_manager = true<br />
Note: In a common LoadLeveler configuration, there could be other nodes<br />
defined with central_manager = alt (these are alternate CMs). One of these<br />
nodes takes over the role of the CM if the designated CM fails. However, an<br />
alternate CM is not supported in a Blue Gene/L environment.<br />
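A grep with context locates the CM stanza quickly. The following sketch builds a two-stanza stand-in LoadL_admin so that it is self-contained; on a live system, run the grep against the real LoadL_admin file (in our configuration it resides in the shared /bgl/loadlcfg directory).

```shell
# Build a tiny stand-in LoadL_admin and locate the central manager stanza.
# (On a live system, point grep at the real LoadL_admin instead.)
cat > /tmp/LoadL_admin <<'EOF'
bglsn.itso.ibm.com:   type = machine
                      schedd_host = true
                      central_manager = true
bglfen1.itso.ibm.com: type = machine
                      schedd_host = true
EOF
grep -B2 'central_manager = true' /tmp/LoadL_admin
```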
You can configure the LoadLeveler daemons to be respawned in case of<br />
abnormal termination. Therefore, to investigate problems that happened in the<br />
past, check each daemon’s log file. The log file names and locations are<br />
defined in the LoadL_config file, as shown in Example 4-33.<br />
Example 4-33 Log file names and locations<br />
loadl@bglfen1:/bgl/loadlcfg> grep -i _log LoadL_config | grep -v MAX<br />
KBDD_LOG = $(LOG)/KbdLog<br />
STARTD_LOG = $(LOG)/StartLog<br />
SCHEDD_LOG = $(LOG)/SchedLog<br />
NEGOTIATOR_LOG = $(LOG)/NegotiatorLog<br />
GSMONITOR_LOG = $(LOG)/GSmonitorLog<br />
STARTER_LOG = $(LOG)/StarterLog<br />
MASTER_LOG = $(LOG)/MasterLog<br />
Also in LoadL_config, the $(LOG) variable can be defined as:<br />
LOG = $(tilde)/log<br />
where $(tilde) is the home directory of the LoadLeveler administrator or the<br />
user ID that starts LoadLeveler.<br />
If the log file names and locations are not defined in LoadL_config, they are set<br />
to the location that is specified when the command llinit is issued. The syntax<br />
of the llinit command is:<br />
llinit -local /home/loadl<br />
where the option flag -local specifies the location where the following three<br />
directories are created:<br />
► log<br />
► execute<br />
► spool<br />
In addition to the log directory, the llinit command also creates two other<br />
directories: execute and spool. These directories serve as persistent storage for<br />
job data and history information. Therefore, if LoadLeveler is stopped with jobs in<br />
the queue, job data and information are saved for the next time LoadLeveler is<br />
started.<br />
Depending on the state of a job at the time LoadLeveler stops, it can be<br />
restarted, resumed, or started when LoadLeveler starts next time.<br />
Note: For complete information regarding LoadLeveler processes and logs,<br />
see IBM LoadLeveler Using and Administering Guide, SA22-7881.<br />
LoadLeveler configuration keywords<br />
The LoadL_config file includes the global LoadLeveler configuration keywords.<br />
To determine the location of this file, check the contents of the /etc/LoadL.cfg file,<br />
which contains the basic LoadLeveler configuration. Example 4-34 shows the<br />
contents of this file.<br />
Example 4-34 The contents of the /etc/LoadL.cfg file<br />
loadl@bglsn:~/log.save> cat /etc/LoadL.cfg<br />
LoadLUserid = loadl<br />
LoadL<strong>Group</strong>id = loadl<br />
LoadLConfig = /bgl/loadlcfg/LoadL_config<br />
This file resides on a common file system so that you can access it from any<br />
node in the cluster (that NFS mounts the /bgl file system). The most important<br />
Blue Gene/L keywords in the LoadL_config file are described in 4.4.4,<br />
“Configuring LoadLeveler for Blue Gene/L” on page 172. Some of the<br />
configuration keywords that define the log files and locations are also discussed<br />
in “LoadLeveler processes, logs, and persistent storage” on page 202.<br />
In addition to the global configuration file, LoadLeveler also uses a local<br />
configuration file that resides on a local file system on each node. This is<br />
specified in the global configuration file with the keyword LOCAL_CONFIG as<br />
follows:<br />
LOCAL_CONFIG = $(tilde)/LoadL_config.local<br />
This local configuration file is needed in case the system administrator wants<br />
different nodes to have different configurations for the following:<br />
► LoadLeveler daemons running<br />
► Job classes<br />
► Number of Starters<br />
To specify these parameters, the following keywords are used:<br />
► SCHEDD_RUNS_HERE<br />
If this keyword is set to FALSE, LoadLeveler does not start the Scheduler<br />
daemon on this node. If the majority of nodes in the cluster are defined to run<br />
the Scheduler, this keyword can be set to TRUE in the global configuration file.<br />
Then, in the local configuration file of each node that does not need to run the<br />
Scheduler, set the value to FALSE.<br />
Note: The setup value in the local configuration file overrides the one in the<br />
global configuration file.<br />
► STARTD_RUNS_HERE<br />
This keyword specifies whether LoadLeveler should start the Start daemon<br />
on the local node.<br />
Note: It is usually not desirable to run Scheduler and Start daemon on the<br />
Blue Gene/L SN.<br />
► CLASS<br />
To control the types of jobs to run on particular nodes, you can specify the<br />
CLASS keyword either in the global or the local configuration file. For<br />
example:<br />
CLASS = small(8) medium(5) large(2)<br />
Unless a default class is defined in the LoadL_admin file, a job has to specify<br />
the keyword #@ class in its job command file to be able to run. The keyword<br />
is set to one of the class names. See Example 4-16 on page 178 for the use of<br />
the keyword #@ class.<br />
► MAX_STARTERS<br />
This configuration keyword sets the maximum number of jobs that can run on<br />
the local node. Set this number according to the capability of the node.<br />
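As a sketch of how such per-node overrides look, the following writes a hypothetical LoadL_config.local for the SN that disables the two daemons discussed above. The keyword names are the ones described in this section; the commented FEN values are illustrative only.

```shell
# Hypothetical LoadL_config.local for the Blue Gene/L SN: per the note
# above, neither the Scheduler nor the Start daemon should run there.
cat > /tmp/LoadL_config.local <<'EOF'
SCHEDD_RUNS_HERE = FALSE
STARTD_RUNS_HERE = FALSE
EOF
# The local file on a front end node might instead widen the class list
# and allow two simultaneous jobs (illustrative values only):
#   CLASS        = small(8) medium(5)
#   MAX_STARTERS = 2
grep RUNS_HERE /tmp/LoadL_config.local
```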
Blue Gene/L specific configuration keywords<br />
As discussed in 4.4.4, “Configuring LoadLeveler for Blue Gene/L” on page 172,<br />
there are special configuration keywords that enable Blue Gene/L functionality in LoadLeveler. We also recommend that the Scheduler and Start daemons not run on the Service Node. See 4.4.2, “Principles of operation in a Blue Gene/L environment” on page 170.
Environment variables, network, and library links<br />
This section explains the variables that are critical to the Blue Gene/L LoadLeveler software environment (running jobs).
Environment variables<br />
In 4.4.6, “Setting Blue Gene/L specific environment variables” on page 175, we<br />
discuss the environment variables that you need to set up for a user ID to start<br />
LoadLeveler. A simple way to check these variables in a UNIX shell is to issue<br />
the echo command, as shown in Example 4-35.<br />
206 IBM System Blue Gene Solution: Problem Determination Guide
Example 4-35 Checking environment variables<br />
loadl@bglfen1:/bgl/loadlcfg> echo $BRIDGE_CONFIG_FILE<br />
/bgl/BlueLight/ppcfloor/bglsys/bin/bridge.config<br />
loadl@bglfen1:/bgl/loadlcfg> echo $DB_PROPERTY<br />
/bgl/BlueLight/ppcfloor/bglsys/bin/db.properties<br />
loadl@bglfen1:/bgl/loadlcfg> echo $MMCS_SERVER_IP<br />
bglsn.itso.ibm.com<br />
If these variables are not set up correctly for the LoadLeveler administrator user ID, LoadLeveler does not start. Other users need to have these variables set up once, at the beginning, to be able to submit jobs. However, note the following:
► The location of the configuration files to which these variables point can be<br />
changed for individual users.<br />
► The contents of the configuration files for individual users can also be<br />
changed for various needs.<br />
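For example, the three variables could be set in the LoadLeveler administrator's shell profile. This is only a sketch: the paths and host name are the values shown in Example 4-35 and must be adjusted to your installation.

```shell
# Blue Gene/L control system variables (values from Example 4-35;
# adjust to your installation):
export BRIDGE_CONFIG_FILE=/bgl/BlueLight/ppcfloor/bglsys/bin/bridge.config
export DB_PROPERTY=/bgl/BlueLight/ppcfloor/bglsys/bin/db.properties
export MMCS_SERVER_IP=bglsn.itso.ibm.com
```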
Another common variable is the PATH ($PATH). To access the LoadLeveler commands, users should include in $PATH the directory where the LoadLeveler binaries and libraries are installed. For example:
/opt/ibmll/LoadL/full/bin/<br />
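A defensive way to add that directory to $PATH in a shell profile is sketched below; the guard avoids appending the directory a second time when the profile is sourced more than once.

```shell
# Append the LoadLeveler binary directory (example path from above)
# to PATH, but only if it is not already present:
LL_BIN=/opt/ibmll/LoadL/full/bin
case ":$PATH:" in
    *":$LL_BIN:"*) ;;              # already in PATH, nothing to do
    *) PATH="$PATH:$LL_BIN" ;;     # append once
esac
export PATH
```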
Network and communications<br />
Because LoadLeveler is a cluster-managed subsystem, network communication between nodes is vital to its operation. The daemons use sockets to communicate with each other. Basic TCP/IP and sockets knowledge is helpful in problem determination.
The global LoadL_config file defines the port numbers that the daemons use, as<br />
shown in Example 4-36.<br />
Example 4-36 Network ports defined for LoadLeveler daemons<br />
# Specify port numbers<br />
MASTER_STREAM_PORT = 9616<br />
NEGOTIATOR_STREAM_PORT = 9614<br />
SCHEDD_STREAM_PORT = 9605<br />
STARTD_STREAM_PORT = 9611<br />
COLLECTOR_DGRAM_PORT = 9613<br />
STARTD_DGRAM_PORT = 9615<br />
MASTER_DGRAM_PORT = 9617<br />
For example, knowing how sockets work helps when a socket is closed abruptly, because the system might need to wait a certain time for all the traffic to quiesce before the port can be reused. This situation can occur when you stop and restart LoadLeveler very quickly, without waiting a short while (about 1 minute). Example 4-37 shows the NegotiatorLog, which includes messages indicating that the daemon cannot listen on port 9614 and has to retry.
Example 4-37 Negotiator daemon waiting on port 9614<br />
03/19 16:54:30 TI-1 *************************************************<br />
03/19 16:54:30 TI-1 *** LOADL_NEGOTIATOR STARTING UP ***<br />
03/19 16:54:30 TI-1 *************************************************<br />
03/19 16:54:30 TI-1<br />
03/19 16:54:30 TI-1 LoadLeveler: LoadL_negotiator started, pid = 14176<br />
03/19 16:54:30 TI-5 LoadLeveler: 2539-479 Cannot listen on port 9614 for service<br />
LoadL_negotiator.<br />
03/19 16:54:30 TI-5 LoadLeveler: Batch service may already be running on this<br />
machine.<br />
03/19 16:54:30 TI-5 LoadLeveler: Delaying 1 seconds and retrying ...<br />
03/19 16:54:30 TI-5 LoadLeveler: 2539-479 Cannot listen on port 9614 for service<br />
LoadL_negotiator.<br />
03/19 16:54:30 TI-5 LoadLeveler: Batch service may already be running on this<br />
machine.<br />
03/19 16:54:30 TI-5 LoadLeveler: Delaying 2 seconds and retrying ...<br />
03/19 16:54:30 TI-5 LoadLeveler: 2539-479 Cannot listen on port 9614 for service<br />
LoadL_negotiator.<br />
One way to alleviate this problem is to set the TCP recycle attribute with the<br />
following command:<br />
/sbin/sysctl -w net.ipv4.tcp_tw_recycle=1
The netstat command is also helpful for understanding the status of the sockets or ports. For example:
netstat -an | grep 9614<br />
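The two commands can be combined into a small wait loop to run before restarting LoadLeveler. This is only a sketch: the port number is the NEGOTIATOR_STREAM_PORT from Example 4-36, and the netstat output format (and therefore the grep pattern) varies between platforms.

```shell
# Wait up to about one minute for port 9614 to be released before
# restarting LoadLeveler:
PORT=9614
port_in_use() {
    netstat -an 2>/dev/null | grep -q "[:.]${PORT} "
}
tries=0
while port_in_use && [ "$tries" -lt 12 ]; do
    sleep 5                      # give lingering sockets time to quiesce
    tries=$((tries + 1))
done
```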
Library links<br />
Libraries are other vital resources that LoadLeveler needs to run. As shown in some scenarios, bad or broken links can cause problems with running LoadLeveler and submitting jobs. This verification can be useful as a last resort when everything else fails to find the problem, for the following reasons:
► Links can be removed or broken<br />
► Libraries can be updated<br />
► A link can be pointing to a different library for various users<br />
Blue Gene/L specific links<br />
Library links are used extensively with LoadLeveler setup as described in 4.4.5,<br />
“Making the Blue Gene/L libraries available to LoadLeveler” on page 173. When<br />
in doubt, check and validate all the library links. You can use a script similar to<br />
that shown in Example 4-14 on page 175.<br />
4.4.10 Updating LoadLeveler in a Blue Gene/L environment<br />
In a Blue Gene/L system, LoadLeveler is not part of the Blue Gene/L code<br />
distribution. You can update LoadLeveler code separately on the SN and FENs.<br />
Consider the following recommendations:<br />
► Check the code levels on all the nodes in the cluster to make sure that there<br />
are no version or level mismatches.<br />
► If in doubt, check the libraries and their symbolic links.<br />
► Note that some installation scripts copy the library binaries rather than<br />
creating a symbolic link. In this case, the checksum command can help<br />
validate the binary files.<br />
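The idea behind the checksum comparison can be demonstrated with two throwaway files: identical contents produce identical checksums, so a copied library binary can be compared with the original even when no symbolic link exists. The file names below are placeholders, not real LoadLeveler libraries.

```shell
# Create an "original" and a "copy" and compare their checksums:
echo "library contents" > /tmp/libdemo_orig.so
cp /tmp/libdemo_orig.so /tmp/libdemo_copy.so
orig_sum=$(cksum < /tmp/libdemo_orig.so)
copy_sum=$(cksum < /tmp/libdemo_copy.so)
if [ "$orig_sum" = "$copy_sum" ]; then
    echo "checksums match"      # identical files: the copy is intact
fi
rm -f /tmp/libdemo_orig.so /tmp/libdemo_copy.so
```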
Chapter 5. File systems<br />
This chapter describes the steps that are required to fix problems with the file systems (persistent storage) that are currently supported
on Blue Gene/L. Currently both NFS and GPFS file systems are supported for<br />
use with Blue Gene/L, and in this chapter we discuss problem determination for<br />
both these file system types.<br />
For each of the file systems supported, we begin with a general introduction, and<br />
then we describe how the file system plugs in to Blue Gene/L. We discuss both<br />
the concepts and the steps that are required to configure the file systems. For<br />
each file system type, we then present a problem determination methodology<br />
that recommends a specific sequence of checks, including a checklist of steps<br />
that show how to make each of the checks, along with an explanation and<br />
suggested commands.<br />
© Copyright IBM Corp. 2006. All rights reserved. 211
5.1 NFS and GPFS<br />
In a basic configuration, Blue Gene/L requires an NFS file system, regardless of<br />
whether a GPFS file system is also used. Native access to a GPFS file system<br />
from a Blue Gene/L I/O node requires an available NFS file system so that the<br />
GPFS code on the I/O node can read the centrally held GPFS configuration files<br />
during node startup.
NFS is the most convenient to set up because most operating systems provide<br />
the facilities to both create and mount the NFS file system. GPFS provides a<br />
more scalable solution for those configurations where high performance and<br />
large file system storage is needed (higher requirements than NFS can provide).<br />
You can configure GPFS with multiple Storage Server nodes that work together to provide cumulative performance (unlike NFS, where all the storage that belongs to an NFS file system must be physically attached to a single node). GPFS currently supports a file system size of up to 200 TB. This size limit is not the actual architectural limit (which in GPFS 2.3 is 2^99 bytes); rather, it is the configuration that we tested.
Both NFS and GPFS file systems must be mounted on the I/O node automatically during the node boot process. This mounting is essential because each time a job is run, a new block might have to be allocated, and all the I/O nodes belonging to this block will therefore be booted. For a job to run immediately, the file system must already be available.
Understanding the I/O node boot sequence is key to understanding problem<br />
determination for Blue Gene/L file systems.<br />
5.1.1 I/O node boot sequence<br />
The IBM proprietary Compute Node Kernel (CNK) that runs on the compute node is a single-user, single-process runtime environment that has no paging mechanism. The compute node can communicate with the outside world
only through the I/O node, and any executable program that runs on the compute<br />
node must be loaded from the I/O node through the Blue Gene/L internal<br />
collective network.<br />
Depending on the configuration, a Blue Gene/L system includes a number of I/O<br />
nodes. The I/O node runs the Mini-Control Program (MCP), which is a cut down<br />
Linux OS that runs a 32-bit PPC 2.4 uniprocessor kernel. The Compute node I/O<br />
daemon (ciod) that is loaded during MCP initialization and runs on the I/O node<br />
is responsible for handling the I/O calls made by the Compute nodes. The MCP,<br />
unlike the Compute Node Kernel, supports TCP/IP communication programs,<br />
such as NFS, ping, and other I/O related system functions that help with problem<br />
determination.<br />
Steps in the I/O node boot sequence<br />
Note: The variables that we use in this section are set during the I/O node<br />
system init (rc.sysinit) when it runs the /etc/rc.dist script that is built into the<br />
RAM disk image. Here are the contents of this file on our system:<br />
export BGL_DISTDIR="/bgl/BlueLight/V1R2M1_020_2006-060110/ppc/dist"
export BGL_SITEDISTDIR="/bgl/dist"<br />
export BGL_OSDIR="/bgl/OS/4.1"<br />
The steps of the I/O node boot sequence that we discuss in detail further in this<br />
section are:<br />
1. MCP kernel and ramdisk are loaded over the service network.<br />
2. The MCP launches the /sbin/init process from the ramdisk image, and init reads /etc/inittab. The system init rule in inittab is coded to run /etc/rc.d/rc.sysinit.
3. The rc.sysinit is invoked from within the MCP ramdisk image. (You can find a<br />
copy of this file in the /bgl/BlueLight/ppcfloor/dist/etc/rc.d directory.) This<br />
script attempts to do the following:<br />
a. NFS mounts the /bgl directory from the Service Node (SN), or the directory defined by the BGL_EXPORTDIR variable if that variable is set.
b. Runs the /etc/rc.dist script from the ramdisk image.<br />
4. The rc.sysinit2 script is invoked next from the NFS-mounted directory (/bgl/BlueLight/ppcfloor/dist/etc/rc.d) and does the following:
a. Replaces empty /lib with symbolic link to $BGL_OSDIR/lib.<br />
b. Replaces empty /usr with symbolic link to $BGL_OSDIR/usr.<br />
c. Replaces empty /etc/rc.d/rc3.d with symbolic link under $BGL_DISTDIR.<br />
d. Loads the collective/tree network device drivers.<br />
e. Runs $BGL_DISTDIR/etc/rc.d/rc3.d/S* start scripts.<br />
f. Runs $BGL_SITEDISTDIR/etc/rc.d/rc3.d/S* start scripts.<br />
The scripts that are run by default by these start scripts are found in the<br />
/bgl/BlueLight/ppcfloor/dist/etc/rc.d/init.d directory and are listed below<br />
along with the jobs that they perform:<br />
nfs Starts the portmap daemon.<br />
xntpd Starts the network time protocol daemon.<br />
sshd (optional) Starts the secure shell daemon if required. This<br />
occurs if either GPFS_STARTUP=1 is set in<br />
/etc/sysconfig/gpfs on the I/O node OR the file
/etc/sysconfig/ssh is found.<br />
You can find this file on the SN as<br />
/bgl/BlueLight/ppcfloor/dist/etc/sysconfig/ssh<br />
gpfs Starts and mounts GPFS file systems if<br />
GPFS_STARTUP=1 is set in /etc/sysconfig/gpfs.<br />
syslog Starts syslog services.<br />
ciod Starts ciod.<br />
ibmcmp Starts the XL Compiler Environment for the I/O node.
g. Runs the $BGL_SITEDISTDIR/etc/rc.local script.
5. As each I/O node completes its MCP boot process, it looks for additional<br />
scripts to run. These additional scripts can be found in two separate<br />
directories documented in the following paragraphs.<br />
5.1.2 Additional scripts in I/O node boot sequence<br />
You can save scripts that you want invoked during the I/O node boot sequence in<br />
either of the following two directories:<br />
► /bgl/BlueLight/ppcfloor/dist/etc/rc.d/rc3.d - (installation dist directory)<br />
Warning: The /bgl/BlueLight/ppcfloor/ppc/dist/etc/rc.d/rc3.d directory is<br />
part of the installed software. Its contents are lost when you install a new<br />
driver or release.<br />
► /bgl/dist/etc/rc.d/rc3.d - (site dist directory)<br />
To be considered, the script’s file name must begin with the uppercase letter S<br />
(for start) or K (for kill), followed by two decimal digits (for example, S10mynfs,<br />
K10mynfs, and so forth) and a relevant name for the service.<br />
Note: The general rule is that the scripts starting with S are run at system init, and the scripts starting with K are run when the system is shut down.
At system initialization, the scripts starting with S are sorted by the two digits that follow, which specify the relative order in which the I/O node runs them. In a similar way, the kill scripts that start with the letter K are used when the I/O node is shut down as the associated block is freed.
The scripts in both directories are sorted into a single list and then run one at a<br />
time in that order.<br />
Warning: If a start script in the site dist directory has the same name as a<br />
start script in the installation dist directory, only the script in the installation dist<br />
directory is run.<br />
Let us assume that you have a script named /bgl/dist/etc/rc.d/rc3.d/S10mynfs<br />
that mounts additional file systems that contain your data. Because the script<br />
name begins with S10, it runs before S50ciod, which starts the ciod, and after<br />
S05nfs, which starts the port mapper. This sequence is correct, because your file<br />
systems are mounted before jobs can be started and after NFS is already<br />
running.<br />
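The ordering rule can be illustrated with sort, which is in effect how the merged list of start scripts is ordered; the three names are the ones used in the example above.

```shell
# Sorting the example script names shows the order in which the
# I/O node runs them: S05nfs, then S10mynfs, then S50ciod.
printf '%s\n' S50ciod S05nfs S10mynfs | sort
```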
5.2 NFS
Blue Gene/L requires an NFS file system regardless of whether a GPFS file system is also required. The default NFS file system, the /bgl directory, is exported from the SN.
5.2.1 How NFS plugs into a Blue Gene/L system<br />
It is important that the /bgl directory on the SN is NFS exported because the I/O nodes must be able to mount this file system when a block is booted. While applications can be run from the /bgl directory, it is recommended that the /bgl
file system is preserved for the installed Blue Gene/L code and that another file<br />
system is used for user applications. Figure 5-1 shows how NFS plugs in to a<br />
Blue Gene/L system.<br />
Note: Even though the system can run without an additional NFS server (the SN provides the basic NFS services and file systems), we strongly recommend that you configure additional NFS servers, both to meet application performance and storage requirements and to avoid overloading the SN.
1. Create NFS Server with storage and create and export the NFS file system (/bglscratch).
2. Export the NFS file system from the Server to the Blue Gene functional network.
3. Check the NFS file system can be mounted over the functional network to the Service Node.
(The figure shows the NFS server, an IBM System p5 with attached storage at 172.30.1.33/16, connected with the Service Node to the functional Ethernet, 172.30.0.0/16.)
Figure 5-1 Adding an NFS file system to the Blue Gene/L
5.2.2 Adding an NFS file system to the Blue Gene/L system<br />
This section provides an example of how to make a file system available through<br />
NFS to the Blue Gene/L system. The file system can be served by any file server<br />
that complies with the NFS V3 protocol. The file system is made available through
the Functional Network and must be mounted by the I/O nodes, SN, and the<br />
FENs that are used to compile and execute the jobs.<br />
In our environment, we used an IBM System p 630 running SUSE SLES 9<br />
connected to an IBM DS4500 storage. Because it is outside the scope of this<br />
book, we do not present the basic operating system, storage (file systems), and<br />
networking configuration steps. Instead, we emphasize the steps to make the file<br />
system available to the Blue Gene/L system.<br />
Important: The File Server name in this section refers to either the SN or<br />
another system that is used to host the NFS file system (NFS-FS) that is used<br />
to run user jobs.
Here are the steps that are required to make an NFS file system available for<br />
running jobs on Blue Gene/L:<br />
1. Create the file system (NFS-FS) on the File Server system (172.30.1.33,<br />
p630n03_fn) and mount it on the File Server system.<br />
mount /bglscratch<br />
The File Server system could be the SN, one FEN, or another system.<br />
2. Export the NFS-FS from the File Server.<br />
Set USE_KERNEL_NFSD_NUMBER="64" in /etc/sysconfig/nfs.<br />
Add the following line to /etc/exports and then activate it:<br />
/bglscratch 172.30.0.0/255.255.0.0(rw,no_root_squash,async)
exportfs -a<br />
Now check the export on the FS by issuing the command:<br />
showmount -e<br />
Check that the NFS server is started.<br />
/etc/init.d/nfsserver status<br />
This should return:<br />
Checking for kernel based NFS server: running<br />
3. Check that this NFS-FS can be mounted and accessed on the SN.<br />
On the SN, issue the following command (172.30.1.33 is the File Server IP address on the functional network):
mount 172.30.1.33:/bglscratch /mnt<br />
Check that you can access the file system on the SN:
cd /mnt; touch foo<br />
4. Update the site customization script (sitefs) to enable the NFS-FS to be<br />
mounted when the I/O nodes boot. Then check that a job can access files on<br />
the NFS-FS when run. See sitefs entries in “Step 3 - Check that the NFS-FS<br />
is mounted when the block boots” on page 221.<br />
5.2.3 NFS problem determination methodology<br />
The methodology that we present here is intended to help with a wide variety of<br />
problems. The first sections cover the basics and the later sections cover the<br />
more unlikely and esoteric problem areas. If you think you already know in which<br />
area the problem lies, then we encourage you to go straight to that section.<br />
However, if you are unsure where the problem lies, we suggest that you use the<br />
methodology in the order presented here, because this approach often uncovers<br />
the simplest problems quickly and easily before you spend a long time looking for a solution to a presumed problem rather than the real one.
Check that the NFS-FS can be mounted on the SN<br />
After each step mentioned here, check whether you can mount the NFS-FS:<br />
► Check that the NFS-FS is exported from the Server, as described in “Step 1 -<br />
Check that NFS-FS is exported from the File Server” on page 218.<br />
► Check that you can ping the FS Server over the functional network.<br />
► If you still cannot mount the file system, then check error messages (screen,<br />
console, system log) in /var/log/messages.<br />
Check that the NFS-FS can be mounted on the I/O nodes<br />
Also check whether you can mount the NFS-FS on the I/O nodes:<br />
► First boot a block that uses the I/O node that has the problem.<br />
► Check if the NFS-FS can be mounted on the I/O node, as described in “Step 2<br />
- Check if the NFS-FS can be mounted on the I/O node” on page 220.<br />
► Check that you can ping the File Server’s IP address from the I/O node.<br />
► Check that the NFS-FS is mounted when the block boots, as described in<br />
“Step 3 - Check that the NFS-FS is mounted when the block boots” on<br />
page 221<br />
Check network tuning parameters<br />
Network tuning parameters are unlikely to prevent NFS from mounting. However,<br />
if you are experiencing performance or intermittent connection problems, this<br />
check might help solve the problem. See “Step 4 - Check the network tuning
parameters on the SN” on page 223.<br />
5.2.4 NFS checklists
In addition to the problem determination methodology, the following detailed checklists (steps) can aid in NFS problem determination.
Step 1 - Check that NFS-FS is exported from the File Server<br />
The best way to check if a file system is exported is to use the showmount<br />
command. Example 5-1 issues the showmount command on the SN to check that<br />
the /bgl directory is exported from the SN over the functional network.<br />
Example 5-1 Checking NFS exports<br />
bglsn:/tmp # showmount -e<br />
Export list for bglsn:<br />
/bgl 172.30.0.0/255.255.0.0<br />
Example 5-2 uses the showmount command from the SN to check an additional<br />
NFS server (that holds the user application code and data) to see what NFS file<br />
systems can be mounted on the SN (and on I/O nodes).<br />
Example 5-2 Checking exports for additional servers<br />
bglsn:/tmp # showmount -e p630n03_fn<br />
Export list for p630n03_fn:<br />
/nfs_mnt (everyone)<br />
/bglscratch (everyone)<br />
/bglhome (everyone)<br />
If the showmount command returns the following error then the rpc.mountd or nfsd<br />
services are not running:<br />
'mount clntudp_create: RPC: Program not registered'.<br />
To fix this issue, run the following command:<br />
/etc/init.d/nfsserver restart<br />
Another error returned by the showmount command might be the following<br />
message, which means that the portmap service is not running:<br />
'mount clntudp_create: RPC: Port mapper failure - RPC: Unable to<br />
receive'<br />
To fix this issue run the following:<br />
/etc/init.d/portmap restart<br />
/etc/init.d/nfsserver restart<br />
You can use the following command to check the server:<br />
/etc/init.d/nfsserver status<br />
Checking for kernel based NFS server: running<br />
To check the port mapper service, you can use the following command:<br />
bglsn:/tmp # /etc/init.d/portmap status<br />
Checking for RPC portmap daemon: running<br />
Step 2 - Check if the NFS-FS can be mounted on the I/O node<br />
Use the mmcs_db_console to check which file systems are mounted on a<br />
particular I/O node (in our case, ionode4) using the mount command and then<br />
use the same technique to check connectivity to the SN (172.30.1.1) with ping.
Example 5-3 shows the commands we used in a mmcs_db_console session and<br />
the write_con command.
Example 5-3 Using mmcs_db_console to mount NFS file system on I/O node<br />
mmcs$ allocate R000_J108_32<br />
OK<br />
mmcs$ redirect R000_J108_32 on<br />
OK<br />
mmcs$ {i} write_con hostname<br />
OK<br />
mmcs$ Mar 29 13:42:35 (I) [1079301344] {17}.0: h<br />
Mar 29 13:42:35 (I) [1079301344] {0}.0: h<br />
Mar 29 13:42:35 (I) [1079301344] {0}.0: ostname<br />
ionode3<br />
$<br />
Mar 29 13:42:35 (I) [1079301344] {17}.0: ostname<br />
ionode4<br />
$<br />
mmcs$ {17} write_con hostname<br />
OK<br />
mmcs$ Mar 29 13:48:46 (I) [1079301344] {17}.0: h<br />
Mar 29 13:48:46 (I) [1079301344] {17}.0: ostname<br />
ionode4<br />
mmcs$ {17} write_con mount<br />
OK<br />
mmcs$ Mar 29 13:43:36 (I) [1079301344] {17}.0: m<br />
Mar 29 13:43:36 (I) [1079301344] {17}.0: ount<br />
rootfs on / type rootfs (rw)<br />
/dev/root on / type ext2 (rw)<br />
none on /proc type proc (rw)<br />
172.30.1.1:/bgl on /bgl type nfs<br />
(rw,v3,rsize=8192,wsize=8192,hard,udp,nolock,addr=172.30.1.1)<br />
172.30.1.33:/bglscratch on /bglscratch type nfs<br />
(rw,v3,rsize=32768,wsize=32768,hard,tcp,lock,addr=172.30.1.33)<br />
/dev/bubu_gpfs1 on /bubu type gpfs (rw)<br />
$<br />
mmcs$ {17} write_con ping -c 2 172.30.1.1<br />
OK<br />
mmcs$ Mar 29 13:44:07 (I) [1079301344] {17}.0: p<br />
Mar 29 13:44:07 (I) [1079301344] {17}.0: ing -c 2 172.30.1.1<br />
PING 172.30.1.1 (172.30.1.1) 56(84) bytes of data.<br />
64 bytes from 172.30.1.1: icmp_seq=1 ttl=64 time=0.126 ms<br />
Mar 29 13:44:08 (I) [1079301344] {17}.0: 64 bytes from 172.30.1.1:<br />
icmp_seq=2 ttl=64 time=0.098 ms<br />
Mar 29 13:44:08 (I) [1079301344] {17}.0:<br />
--- 172.30.1.1 ping statistics ---<br />
2 packets transmitted, 2 received, 0% packet loss, time 999ms<br />
rtt min/avg/max/mdev = 0.098/0.112/0.126/0.014 ms<br />
$<br />
From Example 5-3, you can see how to check for mounted file systems on an I/O<br />
node that is booted and also how to ping the SN from that node to check basic<br />
functional network connectivity. This technique (using the mmcs_db_console and<br />
the write_con commands) can also be used to mount the NFS-FS if it is NOT<br />
automatically mounted.<br />
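For example, if /bglscratch were missing from the mount output, it could be mounted manually from the console. The following line is only a sketch: it reuses the server address and mount options shown in Example 5-3, and you should verify both against your own configuration before using them.

```
mmcs$ {17} write_con mount -o rw,rsize=32768,wsize=32768,hard,tcp 172.30.1.33:/bglscratch /bglscratch
```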
Step 3 - Check that the NFS-FS is mounted when the block boots<br />
You start by checking that the sitefs file has the correct entries (see<br />
Example 5-4). Next, check that the correct links are in place to invoke the sitefs<br />
file when I/O nodes are booted.<br />
The complete sitefs file that we used is shown in Appendix B, “The sitefs file” on<br />
page 423. Example 5-4 shows the relevant lines from our sitefs file.
Example 5-4 Sample sitefs file with /bglscratch file system<br />
bglsn:/bgl/dist/etc/rc.d/init.d # ls<br />
. .. sitefs<br />
bglsn:/bgl/dist/etc/rc.d/init.d # cat sitefs<br />
#!/bin/sh<br />
#<br />
# Sample sitefs script.<br />
#<br />
# It mounts a filesystem on /scratch, mounts /home for user files<br />
# (applications), creates a symlink for /tmp to point into some directory
# in /scratch using the IP address of the I/O node as part of the directory
# name to make it unique to this I/O node, and sets up environment
# variables for ciod.<br />
#<br />
. /proc/personality.sh<br />
. /etc/rc.status<br />
#-------------------------------------------------------------------<br />
# Function: mountSiteFs()<br />
#<br />
# Mount a site file system<br />
# Attempt the mount up to 5 times.<br />
# If all attempts fail, send a fatal RAS event so the block fails<br />
# to boot.<br />
#<br />
# Parameter 1: File server IP address<br />
# Parameter 2: Exported directory name<br />
# Parameter 3: Directory to be mounted over<br />
# Parameter 4: Mount options<br />
#-------------------------------------------------------------------
mountSiteFs()
{
#..............>>>>>>>> Omitted lines <<<<<<<<..............#
#-------------------------------------------------------------------<br />
# Script: sitefs()<br />
#<br />
# Perform site-specific functions during startup and shutdown<br />
#<br />
# Parameter 1: "start" - perform startup functions<br />
# "stop" - perform shutdown functions<br />
#-------------------------------------------------------------------<br />
# Set to ip address of site fileserver<br />
SITEFS=172.30.1.33<br />
# First reset status of this service<br />
rc_reset<br />
# Handle startup (start) and shutdown (stop)<br />
case "$1" in<br />
start)<br />
echo Mounting site filesystems<br />
# Mount a scratch file system...<br />
mountSiteFs $SITEFS /bglscratch /bglscratch tcp,rsize=32768,wsize=32768,async
##..............>>>>>>>> Omitted lines <<<<<<<<..............##

Step 4 - Check the network tuning parameters on the SN
Example 5-6 shows the network tuning parameters (set in /etc/sysctl.conf) that we recommend on the SN.
5.3 GPFS<br />
Example 5-6 Recommended network tuning parameters<br />
# set UDP receive buffer default (and max, below) so that we don't drop packets
net.core.rmem_default = 1024000<br />
net.core.rmem_max = 8388608<br />
net.core.wmem_max = 8388608<br />
net.core.netdev_max_backlog = 3072<br />
# ARP cache area size to avoid Neighbour table overflow messages<br />
# defaults are 128, 512, 1024. For 64 racks should be 512, 2048, and 4096.
net.ipv4.neigh.default.gc_thresh1 = 256<br />
net.ipv4.neigh.default.gc_thresh2 = 1024<br />
net.ipv4.neigh.default.gc_thresh3 = 2048<br />
# NFS tuning parameters<br />
net.ipv4.tcp_timestamps = 1<br />
net.ipv4.tcp_window_scaling = 1<br />
net.ipv4.tcp_sack = 1<br />
net.ipv4.tcp_rmem = 4096 87380 4194304<br />
net.ipv4.tcp_wmem = 4096 65536 4194304<br />
net.ipv4.ipfrag_low_thresh = 393216<br />
net.ipv4.ipfrag_high_thresh = 524288<br />
Important: If you have changed any of the network parameters, then you<br />
must run /etc/rc.d/boot.sysctl start or sysctl -p /etc/sysctl.conf for the<br />
settings to take effect immediately.<br />
To view the current settings for these parameters use:<br />
sysctl -A | grep net.<br />
GPFS stands for General Parallel File System. GPFS is a scalable, high I/O performance file system that is intended primarily for clusters of computers
where a large number of processors are required to access the same copy of the<br />
data (one of the basic requirements for parallel computing environments).<br />
GPFS is based on a client-server model, with the server part responsible for<br />
managing the storage, and the client part providing application access. The GPFS client software is highly efficient in handling data, so the CPU time required to read and write data is typically much less than for other file systems
(like NFS).<br />
Unlike NFS, where managing the storage associated with a file system is the<br />
responsibility of a single server (OS image), in GPFS the storage can be<br />
distributed among multiple servers, eliminating the single server bottleneck.<br />
Using GPFS it is easy to create file systems that store the data on many<br />
separate disks connected to many separate servers. In addition to performance,<br />
storage, and server load balancing, GPFS also provides excellent scalability,<br />
reliability, and high availability by providing continuous operation while adding or
removing nodes, disks, and file systems.<br />
Blue Gene/L is a highly scalable processing engine that is designed for highly<br />
parallel applications. It is likely that many parallel applications that are designed<br />
to run efficiently on Blue Gene/L will also benefit from the increased and scalable<br />
I/O performance that GPFS file systems can provide. By combining the<br />
scalability and performance of the Blue Gene/L processing platform with the<br />
scalability and I/O performance of a GPFS file system it is possible to provide a<br />
highly optimized computing environment to run parallel applications.<br />
This section provides a general overview of GPFS.<br />
5.3.1 When to use GPFS<br />
Because GPFS is a client-server application that requires additional knowledge<br />
and system administration skills, consider carefully whether a GPFS<br />
implementation is right for your environment. The benefits of the product are<br />
significant, but weigh them against the specific requirements of your site.<br />
The following considerations can help you make the correct decision:<br />
► If you need a file system that can provide high performance that a single<br />
server cannot deliver, then GPFS would be a good solution.<br />
► If you need a reliable file system for a cluster that is unaffected by a failure of<br />
a storage server or disk, then GPFS can be configured to provide such a<br />
system.<br />
► If you want to allow parallel applications running on a cluster of machines to<br />
access a single file at the same time with tight control over data integrity<br />
(multiple application instances accessing same data file concurrently), then<br />
GPFS has the appropriate architectural features and also the proven track<br />
record in providing these functions.<br />
Chapter 5. File systems 225
However, if you have the following requirements, then GPFS is not mandatory:<br />
► File system performance offered by one NFS server system is adequate.<br />
► The files you are using are smaller than 2 GB.<br />
► You have no requirement to run parallel applications that write to the same<br />
file.<br />
► You do not intend to scale up in performance or storage capacity.<br />
5.3.2 Features and concepts of GPFS<br />
Some of the features of GPFS include:<br />
► File consistency<br />
GPFS uses a sophisticated token management system to provide data<br />
consistency while allowing multiple independent paths to the same file by the<br />
same name from anywhere in the cluster.<br />
► High recoverability and increased data availability<br />
Using GPFS replication, it is possible to configure GPFS to have two copies<br />
of the data on separate groups of disks (failure groups). Should a single disk<br />
or group of disks fail, access to the data is not lost.<br />
GPFS is a journaling file system that creates separate journal files for each<br />
node. These logs record the allocation and modification of metadata, aiding in<br />
fast recovery and the restoration of data consistency in the event of node<br />
failure.<br />
► High I/O performance<br />
GPFS can provide high I/O performance and achieves this partly by striping<br />
the files across all the disks in the file system. Managing its own striping<br />
affords GPFS the control it needs to achieve fault tolerance and to balance<br />
load across adapters, storage controllers, and disks. Large files in GPFS are<br />
divided into equal sized blocks, and consecutive blocks are placed on<br />
different disks in a round-robin fashion.<br />
To exploit disk parallelism when reading a large file from a single-threaded<br />
application, whenever it can recognize a pattern, GPFS prefetches data into<br />
its buffer pool (pagepool), issuing I/O requests in parallel to as many disks as<br />
necessary to achieve the bandwidth of which the switching fabric is capable.<br />
GPFS recognizes sequential, reverse sequential, and various forms of strided<br />
access patterns.<br />
For parallel applications GPFS provides enhanced performance by allowing<br />
multiple processes or applications on all nodes in the cluster simultaneous<br />
access to the same file using standard file system calls. GPFS also allows<br />
concurrent reads and writes from multiple nodes. This is a key concept in<br />
parallel processing. Also useful for parallel applications is GPFS’s support of<br />
byte-range locks on file writes so that multiple clients can write to different<br />
byte-ranges within the same file at the same time.<br />
► Very large file and file system sizes<br />
The supported limits for both GPFS file system size and file size are<br />
95 TB for Linux and 100 TB for AIX. These supported limits are<br />
confined to those configurations that have been tested. The architectural limit<br />
for GPFS, however, is 2 PB.<br />
This is substantially more than most available file systems and can be a key<br />
advantage as data volumes and file sizes continue to increase.<br />
► Cross cluster file system access<br />
GPFS allows users shared access to files in either the cluster where the file<br />
system was created, or other (remote) GPFS clusters. Each cluster in the<br />
network is managed as a separate cluster, while allowing shared file system<br />
access. When multiple clusters are configured to access the same GPFS file<br />
system, Open Secure Sockets Layer (OpenSSL) is used to authenticate<br />
cross-cluster network connections.<br />
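The byte-range write sharing described above can be pictured with a small local-file-system simulation. This is a sketch only: it uses ordinary dd writes to disjoint offsets of one file, which is the access pattern that GPFS coordinates cluster-wide with its byte-range locks. No GPFS commands are involved, and the file size and offsets are arbitrary.<br />

```shell
# Two concurrent writers update disjoint byte ranges of the same file.
# On GPFS, writers on different nodes can do this safely because each
# holds a byte-range lock covering only its own region.
f=$(mktemp)
dd if=/dev/zero of="$f" bs=1024 count=2 2>/dev/null    # 2 KB file
printf 'AAAA' | dd of="$f" bs=1 seek=0    conv=notrunc 2>/dev/null &
printf 'BBBB' | dd of="$f" bs=1 seek=1024 conv=notrunc 2>/dev/null &
wait
# Read back both ranges to confirm neither write clobbered the other
dd if="$f" bs=1 skip=0    count=4 2>/dev/null; echo
dd if="$f" bs=1 skip=1024 count=4 2>/dev/null; echo
rm -f "$f"
```

On a single machine the kernel serializes these writes; the point of GPFS is that the same guarantee holds when the two writers are on different nodes of the cluster.<br />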
5.3.3 GPFS requirements for Blue Gene/L<br />
Due to the internal structure of the Blue Gene/L system, adding GPFS support<br />
has been a challenging task. In this section we describe some of the major<br />
challenges and the solutions designed to overcome them.<br />
Tip: The GPFS implementation for Blue Gene/L exploits one of the main<br />
features of GPFS: cross-cluster mounting of GPFS file systems.<br />
The configuration consists of two GPFS clusters: a storage “server”<br />
cluster (a common GPFS cluster with storage nodes) and a “client”<br />
cluster, consisting of the Blue Gene/L I/O nodes and the SN.<br />
Challenges for GPFS on Blue Gene/L<br />
Challenges for GPFS on Blue Gene/L systems include:<br />
► bgIO cluster - SN is the only quorum node<br />
► bgIO has no local storage, another cluster is required for storage (gpfsNSD)<br />
► Blue Gene/L I/O nodes are diskless<br />
GPFS code usually runs on AIX and Linux, on stand-alone machines that have<br />
dedicated OS disk(s) from which the system boots. Due to its tight integration<br />
with the OS, GPFS has been designed to store its code, configuration and log<br />
files on the local disk(s). On Blue Gene/L all the I/O traffic is done through the I/O<br />
nodes and these have no boot/OS disks. GPFS, however, needs a place to store<br />
the files that it uses. These include:<br />
► GPFS code and utilities<br />
► GPFS configuration files (one individual set per cluster node)<br />
► Console log files (one set per node)<br />
► Syslog files<br />
► Traces (if debug is needed)<br />
In addition, due to the GPFS structure (clustering layer, storage abstraction layer,<br />
and file system device driver), each node can assume various roles (file system<br />
manager, configuration manager, and so on). Because the availability of the I/O<br />
nodes is dynamic (blocks are allocated and released per job), if one of these<br />
nodes were to assume a management role inside the cluster, it would cause<br />
severe performance problems. Performance would be affected by two factors:<br />
► Cluster reconfiguration requires a takeover of the GPFS management roles,<br />
which can suspend I/O during the operation.<br />
► The additional load induced by the GPFS management roles creates an<br />
imbalance between I/O nodes that are allocated to the same job.<br />
Solutions for GPFS on Blue Gene/L<br />
This section describes the approach that IBM development (GPFS and<br />
Blue Gene/L) takes to address the challenges for GPFS on Blue Gene/L.<br />
► <strong>Problem</strong>: Access to GPFS files for each I/O node<br />
Solution: Because the SN is the only node in the bgIO cluster that has disks,<br />
the files needed by GPFS must be stored on those disks. The /bgl file system on<br />
the SN is used for this purpose. The Blue Gene/L I/O nodes access the GPFS<br />
files in this file system by means of NFS mounts.<br />
► <strong>Problem</strong>: I/O node bootup sequence must include GPFS handling<br />
Solution: The ‘/bgl/BlueLight/ppcfloor/dist/etc/rc.d/init.d/gpfs’ script has been<br />
provided, and is installed when installing the Blue Gene/L RPMs.<br />
When called during I/O node boot up, the $BGL_DISTDIR/etc/rc.d/init.d/gpfs<br />
script is responsible for creating the necessary symbolic links that allow the<br />
GPFS client code to find all the files that it normally uses.<br />
Example 5-7 presents the directories and links that are created in the gpfs<br />
script if GPFS_STARTUP = 1 is found in the /etc/sysconfig/gpfs file.<br />
Example 5-7 Excerpt from ‘gpfs’ script<br />
# Directories ....<br />
/bgl/gpfsvar//var/mmfs/gen<br />
/bgl/gpfsvar//var/mmfs/etc<br />
/bgl/gpfsvar//var/mmfs/tmp<br />
/bgl/gpfsvar//var/adm/ras<br />
# Links......<br />
ln -s /bgl/gpfsvar//var/mmfs /var/mmfs<br />
ln -s /bgl/gpfsvar//var/adm/ras /var/adm/ras<br />
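As a sketch of what the boot script does, the following simulation performs the same directory and link setup against a temporary scratch tree instead of the real / and /bgl trees. The node name ionode1 is a hypothetical stand-in for the per-node component that the real script inserts into the /bgl/gpfsvar/... paths; the exact layout is an assumption for illustration.<br />

```shell
# Simulate the gpfs boot script's per-node directory and link setup
# in a scratch tree. $node stands in for the I/O node's own name.
root=$(mktemp -d)
node=ionode1
vardir="$root/bgl/gpfsvar/$node"
# Per-node GPFS working directories on the NFS-shared /bgl tree
mkdir -p "$vardir/var/mmfs/gen" "$vardir/var/mmfs/etc" \
         "$vardir/var/mmfs/tmp" "$vardir/var/adm/ras"
mkdir -p "$root/var/adm"
# Links so GPFS finds its usual /var/mmfs and /var/adm/ras paths
ln -s "$vardir/var/mmfs"    "$root/var/mmfs"
ln -s "$vardir/var/adm/ras" "$root/var/adm/ras"
ls -ld "$root/var/mmfs" "$root/var/adm/ras"
rm -rf "$root"
```

The effect on a real I/O node is that writes to /var/mmfs and /var/adm/ras land in a node-specific directory on the NFS-mounted /bgl file system, which is how diskless nodes get persistent GPFS state.<br />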
► <strong>Problem</strong>: Preventing I/O nodes from assuming GPFS management roles<br />
(cluster manager, file system/stripe group manager)<br />
Solution: The Blue Gene/L I/O cluster, referred to hereafter as bgIO, does<br />
not own any file system; rather, it cross-mounts one from another GPFS cluster<br />
(referred to hereafter as gpfsNSD).<br />
This solution also clearly separates the administration of the GPFS file<br />
system storage from the administration of the bgIO cluster (SN plus I/O<br />
nodes). This has the advantage that both the Blue Gene/L system and the<br />
GPFS storage cluster (gpfsNSD) can be scaled independently, if required.<br />
► <strong>Problem</strong>: The GPFS cluster for the Blue Gene/L system (bgIO), is unusual in<br />
as much as it contains only one quorum node, the SN. If this single quorum<br />
node goes down then the cross mounted GPFS file system, referred to<br />
hereafter as the GPFS-FS, will become unmounted.<br />
Solution: This is consistent with the same dependency that the Blue Gene/L<br />
system has on the SN. Thus, if the SN is turned off (or becomes unavailable<br />
for any reason), the Blue Gene/L system cannot operate anyway. Therefore,<br />
it is acceptable that the GPFS-FS will also be unavailable.<br />
5.3.4 GPFS supported levels<br />
It is important that the GPFS packages that are installed for the Blue Gene/L I/O<br />
nodes match the level of the code that is installed for the Blue Gene/L driver<br />
itself. The levels must match because, for the Blue Gene/L I/O nodes, we do not<br />
have to build the portability layer; it is already provided by the Blue Gene/L<br />
GPFS RPMs.<br />
Important: The GPFS code that is installed for the Blue Gene/L I/O nodes is<br />
different from that installed for the SN. The SN runs Linux on IBM System p<br />
64-bit hardware. This hardware, therefore, uses GPFS for SUSE Linux (SLES<br />
9) on IBM System p code. The I/O nodes use a PowerPC 440 CPU, which is<br />
32-bit. So, this hardware uses a special version of GPFS code that is specific to<br />
this environment.<br />
This Blue Gene/L I/O node specific GPFS code can only be downloaded from the<br />
following IBM Web site (which is password protected):<br />
https://www14.software.ibm.com/webapp/iwm/web/preLogin.do?source=BGL-BLUEGENE<br />
Access to this IBM Web site is granted to organizations that have purchased the<br />
GPFS for Blue Gene/L code.<br />
When the Blue Gene/L driver code level is updated, the GPFS code for the I/O<br />
nodes must be reinstalled at the correct level. The code must be at the correct<br />
level because when the I/O nodes boot, they depend on files that exist under the<br />
following directory:<br />
/bgl/BlueLight/ppcfloor/dist/etc/rc.d<br />
When the Blue Gene/L driver is updated, the /bgl/BlueLight/ppcfloor symbolic<br />
link points to another directory that does not have the GPFS files that are<br />
required during I/O nodes boot process. You have to re-install the GPFS for MCP<br />
(I/O nodes) code into the new Blue Gene/L driver directory.<br />
The GPFS Portability Layer for the SN must be built on the SN and Front-End<br />
Nodes (FENs) when the GPFS for SLES RPMs have been installed. The GPFS<br />
Portability Layer for the Blue Gene/L I/O nodes is shipped pre-built, so there is no<br />
need to build the GPFS Portability Layer for the I/O nodes after installing the<br />
GPFS for Blue Gene/L RPMs.<br />
5.3.5 How GPFS plugs in<br />
In this section we describe the steps needed to add a GPFS file system to the<br />
Blue Gene/L system. We present only the GPFS commands that enable an<br />
existing GPFS file system to be cross-mounted on to the Blue Gene/L system.<br />
This section assumes that you follow the GPFS installation and administration<br />
procedures in the GPFS product manuals that are listed at the end of this<br />
chapter in 5.3.11, “References” on page 264, to build the GPFS storage<br />
cluster (gpfsNSD).<br />
Figure 5-2 presents the three high-level steps that are needed to make a GPFS<br />
file system available on Blue Gene/L. The essential concept that makes this<br />
possible is the ability of GPFS 2.3 to allow one GPFS cluster to mount a GPFS<br />
file system that belongs to a remote GPFS cluster.<br />
While it is possible to add the NSD (Network Shared Disk) storage servers<br />
directly to the bgIO cluster and provide a locally owned GPFS file system, this is<br />
not recommended, because the dynamic nature of the Blue Gene/L system might<br />
cause performance problems for the GPFS cluster (see “Challenges for GPFS on<br />
Blue Gene/L” on page 227).<br />
Figure 5-2 Plugging in GPFS in steps<br />
You can find the latest detailed installation instructions in the GPFS for<br />
Blue Gene/L “How to” document, which is available at:<br />
https://www14.software.ibm.com/webapp/iwm/web/preLogin.do?source=BGL-BLUEGENE<br />
The three major steps that are required to enable a GPFS file system on a Blue<br />
Gene/L system are:<br />
1. Create the GPFS file system on a remote cluster (gpfsNSD).<br />
2. Create a GPFS cluster on Blue Gene/L (bgIO). This step creates the bgIO<br />
cluster with just the SN.<br />
3. Cross mount the GPFS file system from gpfsNSD on to bgIO. This step<br />
includes adding the I/O nodes for Blue Gene/L to the bgIO cluster.<br />
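As a rough outline only, the cross mount in step 3 is built from the GPFS 2.3 remote-mount commands (mmauth, mmremotecluster, and mmremotefs). The cluster names, contact nodes, key file path, and mount point below are placeholders taken from this book's test environment, not a sequence you can run verbatim; the authoritative procedure is in the GPFS “How to” document and product manuals.<br />

```shell
# Outline only (placeholder names; not runnable outside a GPFS cluster).
# On gpfsNSD: generate the cluster's authentication key
mmauth genkey

# On bgIO: register the remote cluster and its file system
mmremotecluster add gpfsNSD -n p630n01_fn,p630n02_fn,p630n03_fn \
    -k /var/mmfs/ssl/gpfsNSD_public.key
mmremotefs add gpfs1 -f gpfs1 -C gpfsNSD -T /gpfs1

# Mount the cross-mounted file system on all bgIO nodes
mmmount gpfs1 -a
```

The key exchange is what makes the OpenSSL-authenticated cross-cluster connections described earlier possible; each side must know the other cluster's public key before a remote mount is allowed.<br />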
5.3.6 Creating the GPFS file system on a remote cluster (gpfsNSD)<br />
Figure 5-3 shows the GPFS cluster that we built for our test environment. This<br />
GPFS cluster uses three nodes running AIX 5L V5.3 and has the GPFS file<br />
system (mounted as /gpfs1) using four LUNs that reside on 4+P RAID5 arrays<br />
from a DS4500 storage controller.<br />
Figure 5-3 GPFS storage cluster<br />
You can create the GPFS file system on a cluster using either AIX or Linux<br />
nodes. We do not discuss the process of creating this cluster in detail here.<br />
However, the remote cluster must conform to the following rules:<br />
► The SN is not included in this cluster.<br />
► All nodes in this storage cluster must be able to access the SN and all Blue<br />
Gene/L I/O nodes across the functional network.<br />
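For orientation, creating a storage cluster like gpfsNSD generally follows the standard GPFS 2.3 command sequence sketched below. The node file, disk descriptor file, and remote shell choice are placeholders and site-specific; consult the GPFS Concepts, Planning, and Installation Guide for the real procedure.<br />

```shell
# Outline only (placeholder files; not runnable outside a GPFS cluster).
# /tmp/nodefile lists the cluster nodes (p630n01_fn, p630n02_fn, ...)
mmcrcluster -n /tmp/nodefile -p p630n01_fn -s p630n02_fn \
    -r /usr/bin/ssh -R /usr/bin/scp

# /tmp/diskfile describes the DS4500 LUNs as NSDs
mmcrnsd -F /tmp/diskfile

# Start GPFS on all nodes, create and mount the file system
mmstartup -a
mmcrfs /gpfs1 gpfs1 -F /tmp/diskfile
mmmount gpfs1 -a
```

Note that, per the rules above, the SN does not appear in the node file; it belongs only to the bgIO cluster.<br />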
5.3.7 Creating a GPFS cluster on Blue Gene/L (bgIO)<br />
The GPFS code that is installed on the SN is different from that installed for the<br />
Blue Gene/L I/O nodes. In this section we create the GPFS cluster on the Blue<br />
Gene/L with only one node that is the SN. Figure 5-4 presents the diagram of our<br />
cluster before we installed GPFS and configured the bgIO cluster.<br />
Figure 5-4 Blue Gene/L system before bgIO cluster is created<br />
The I/O nodes can only be added to this GPFS cluster when the block that they<br />
serve has been initialized and after the cross-mount of the GPFS file system.<br />
Here are the high-level steps that are required to create a GPFS cluster that uses<br />
only the Blue Gene/L SN:<br />
1. Install the GPFS code for the SN, as described in “Installing the GPFS code for<br />
SN” on page 234.<br />
2. Install the GPFS code for the Blue Gene/L I/O nodes, as described in “Installing<br />
the GPFS code for Blue Gene/L I/O nodes” on page 235.<br />
3. Configure ssh and scp on all Blue Gene/L nodes, as described in “Configuring<br />
ssh and scp on SN and I/O nodes” on page 237.<br />
4. Create the bgIO cluster, as described in “Creating the bgIO cluster” on<br />
page 244.<br />
……..<br />
Installing the GPFS code for SN<br />
Figure 5-5 illustrates the steps to install the GPFS code for SN.<br />
Figure 5-5 GPFS code install on the Blue Gene/L system<br />
To install the code, follow these steps:<br />
1. Create a new directory for the GPFS code for the SN and populate it with the<br />
correct GPFS RPMs.<br />
You can do this by copying the self-extracting product image,<br />
gpfs_install-2.3.0-0_sles9_ppc64, from the GPFS for Linux on POWER<br />
CD-ROM to the new directory and running it. Example 5-8 shows the<br />
commands that we used.<br />
Example 5-8 Installing GPFS code on SN<br />
root@bglsn_fn~/> mkdir -p /tmp/gpfslpp_for_servicenode/updates<br />
root@bglsn_fn~/> cd /tmp/gpfslpp_for_servicenode<br />
root@bglsn_fn~/> cp /cdrom/*/gpfs_install-2.3.0-0_sles9_ppc64 .<br />
root@bglsn_fn~/> ./gpfs_install-2.3.0-0_sles9_ppc64 --silent<br />
After you accept the license agreement, the GPFS product installation images<br />
reside in the extraction target directory (in our case,<br />
/tmp/gpfslpp_for_servicenode). See Example 5-9.<br />
Example 5-9 GPFS RPMs for SN<br />
root@bglsn_fn~/> cd /tmp/gpfslpp_for_servicenode<br />
root@bglsn_fn~/> ls gpfs.*<br />
gpfs.base-2.3.0-11.sles9.ppc64.rpm<br />
gpfs.docs-2.3.0-11.noarch.rpm<br />
gpfs.gpl-2.3.0-11.noarch.rpm<br />
gpfs.msg.en_US-2.3.0-11.noarch.rpm<br />
2. Install the GPFS code for the SN:<br />
cd /tmp/gpfslpp_for_servicenode<br />
rpm -ivh gpfs*.rpm<br />
3. Install any updates that are available. To do this, copy any update RPMs to the<br />
/tmp/gpfslpp_for_servicenode/updates directory and then issue the following<br />
commands:<br />
cd /tmp/gpfslpp_for_servicenode/updates<br />
rpm -Uvh gpfs*.rpm<br />
4. Create the GPFS portability layer binaries. Follow the instructions in the<br />
/usr/lpp/mmfs/src/README (on the SN). The files mmfslinux, lxtrace,<br />
tracedev, and dumpconv are installed in /usr/lpp/mmfs/bin after you have<br />
completed the instructions.<br />
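For reference, the README for GPFS 2.3 on Linux typically describes a build sequence along the lines sketched below. This is a hedged outline from memory of that era's procedure, not a substitute for the README on your SN, which is authoritative for your kernel level and driver version.<br />

```shell
# Outline of the portability layer build (GPFS 2.3 era; verify every
# step against /usr/lpp/mmfs/src/README on your own system)
cd /usr/lpp/mmfs/src/config
cp site.mcr.proto site.mcr        # then edit site.mcr for your kernel
cd /usr/lpp/mmfs/src
export SHARKCLONEROOT=/usr/lpp/mmfs/src
make World
make InstallImages                # installs mmfslinux, lxtrace, ...
```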
Installing the GPFS code for Blue Gene/L I/O nodes<br />
To install the GPFS code for Blue Gene/L I/O nodes, follow these steps:<br />
1. Download the GPFS for MCP RPMs on to the SN.<br />
Create a new directory for the GPFS code for the I/O nodes and copy the<br />
correct RPMs. You can copy these RPMs from the secure Blue Gene/L<br />
software portal:<br />
mkdir -p /tmp/gpfslpp_for_ionodes/updates<br />
Attention: Make sure you do not mix the RPMs for PPC64 with the ones for<br />
I/O nodes (PPC440, 32-bit)!<br />
You should see a list similar to the following:<br />
cd /tmp/gpfslpp_for_ionodes; ls gpfs.*<br />
gpfs.base-2.3.0-11.ppc.rpm<br />
gpfs.docs-2.3.0-11.noarch.rpm<br />
gpfs.gplbin-2.3.0-11.ppc.rpm<br />
gpfs.msg.en_US-2.3.0-11.noarch.rpm<br />
gpfs.gpl-2.3.0-11.noarch.rpm<br />
2. Install the GPFS code for the I/O nodes:<br />
cd /tmp/gpfslpp_for_ionodes<br />
rpm --root /bgl/BlueLight/driver/ppc/bglsys/bin/bglOS --nodeps -ivh gpfs*.rpm<br />
Note: It is important to note the following rpm command argument:<br />
--root<br />
This argument forces the specified path to be used as the root<br />
directory for the RPM installation.<br />
3. Install any updates that are available to the code. To do this, copy any update<br />
RPMs to the /tmp/gpfslpp_for_ionodes/updates directory and issue the<br />
following commands:<br />
cd /tmp/gpfslpp_for_ionodes/updates<br />
rpm -Uvh gpfs*.rpm<br />
Configuring ssh and scp on SN and I/O nodes<br />
Important: In this section, we use the following naming conventions:<br />
► $BGL_SITEDISTDIR: normally points to the /bgl/dist directory.<br />
► $BGL_DISTDIR: normally points to the /bgl/Bluelight/ppcfloor/dist directory.<br />
► $BGL_SNIP is the SN’s IP address on the functional network.<br />
► $SN_HOSTNAME is the SN’s host name on the functional network.<br />
► $IONODE_IPS is a wildcarded IP address representing all I/O nodes.<br />
For example, if the I/O nodes have IP addresses 172.30.100.1 through<br />
172.30.100.128, and 172.30.101.1 through 172.30.101.128, a reasonable<br />
value for $IONODE_IPS would be 172.30.10?.*<br />
► $IONODE_HOSTNAMES is a wildcard for the host name representing all I/O<br />
nodes.<br />
For example, if the I/O nodes have host names ionode1, ionode2, and so<br />
forth, a reasonable value for $IONODE_HOSTNAMES would be ionode*.<br />
In these examples, we have chosen to use the RSA2 type for ssh keys. You<br />
can choose other key types (RSA1, DSA); however, it is strongly<br />
recommended that you use the same key type for all ssh keys.<br />
GPFS must be able to execute commands and copy configuration files between<br />
all nodes in the cluster without being prompted for a password. For GPFS on<br />
Linux, the default remote command execution and copy programs are ssh and<br />
scp. This is why we have to prepare the nodes (SN and I/O; see Figure 5-6).<br />
Figure 5-6 Configuring ssh on SN and I/O nodes (for GPFS)<br />
To configure ssh and scp on SN and I/O nodes, follow these steps:<br />
1. Ensure that the host name that is associated with the SN is unique. Check<br />
both the /etc/hosts file and DNS.<br />
Note: To avoid network and name resolution problems, we strongly<br />
recommend that you maintain consistent name resolution using local<br />
files (/etc/hosts). Even though you can use DNS, it is not useful to add<br />
DNS entries for the I/O nodes, because they should not be directly<br />
accessible to users.<br />
2. In the /etc/hosts file, add an entry for each I/O node, and check for duplicate<br />
IP addresses or IP labels (names).<br />
3. Copy this newly updated /etc/hosts file to the correct directory in the Blue<br />
Gene/L tree (to make it available to the I/O nodes):<br />
cp /etc/hosts $BGL_SITEDISTDIR/etc/hosts<br />
chmod 644 $BGL_SITEDISTDIR/etc/hosts<br />
4. Create and verify the directories for the root user on the SN and the I/O<br />
nodes, as shown in Example 5-10.<br />
Example 5-10 Directories for ssh client files<br />
root@bglsn_fn~/> chmod 755 $BGL_SITEDISTDIR<br />
root@bglsn_fn~/> mkdir $BGL_SITEDISTDIR/root<br />
root@bglsn_fn~/> chmod 700 $BGL_SITEDISTDIR/root<br />
root@bglsn_fn~/> mkdir $BGL_SITEDISTDIR/root/.ssh<br />
root@bglsn_fn~/> chmod 700 $BGL_SITEDISTDIR/root/.ssh<br />
5. Check the ssh key pair for the root user on the I/O nodes. Check for both the<br />
private key file (/bgl/dist/root/.ssh/id_rsa) and the public key file<br />
(/bgl/dist/root/.ssh/id_rsa.pub). If the keys have not been created, use the<br />
following command:<br />
ssh-keygen -t rsa -b 1024 -f $BGL_SITEDISTDIR/root/.ssh/id_rsa -N ''<br />
6. Check the ssh key pair for the root user on the SN. Check for both the<br />
private key file (~/.ssh/id_rsa) and the public key file (~/.ssh/id_rsa.pub).<br />
If these have not been created, use the following command:<br />
ssh-keygen -t rsa -b 1024 -f /root/.ssh/id_rsa -N ''<br />
7. Check the ssh key pair for the ssh daemon on the I/O nodes. Check for both<br />
the private key file (/bgl/dist/etc/ssh/ssh_host_rsa_key) and the public key file<br />
(/bgl/dist/etc/ssh/ssh_host_rsa_key.pub). If these have not been created, use<br />
the following command:<br />
ssh-keygen -t rsa -b 1024 -f $BGL_SITEDISTDIR/etc/ssh/ssh_host_rsa_key -N ''<br />
8. Check the ssh key pair for the ssh daemon on the SN. Check for both the<br />
private key file (/etc/ssh/ssh_host_rsa_key) and the public key file<br />
(/etc/ssh/ssh_host_rsa_key.pub). If these have not been created (most<br />
unlikely!), use the following command:<br />
ssh-keygen -t rsa -b 1024 -f /etc/ssh/ssh_host_rsa_key -N ''<br />
9. Create the authorized_keys file for all nodes in the bgIO cluster. Copy the root<br />
user’s public key file from the SN to a temporary file (/tmp/authorized_keys).<br />
Then, append this file with root user’s public key file from the I/O nodes:<br />
cat /root/.ssh/id_rsa.pub >>/tmp/authorized_keys<br />
cat $BGL_SITEDISTDIR/root/.ssh/id_rsa.pub >>/tmp/authorized_keys<br />
After you create the /tmp/authorized_keys file, distribute it. Check whether<br />
either the SN or the I/O nodes already have an authorized_keys file. If one<br />
already exists, append the /tmp/authorized_keys file to the end of the<br />
existing one:<br />
cat /tmp/authorized_keys >> /root/.ssh/authorized_keys<br />
cat /tmp/authorized_keys >> $BGL_SITEDISTDIR/root/.ssh/authorized_keys<br />
10.Create the known_hosts file for all nodes in the bgIO cluster. Create a<br />
temporary known_hosts file for both the SN and the I/O nodes. Then combine<br />
these two files to create the /tmp/known_hosts_gpfs file, as shown in<br />
Example 5-11.<br />
Example 5-11 Creating the known_hosts file<br />
root@bglsn_fn~/> echo "$BGL_SNIP,$SN_HOSTNAME $(cat /etc/ssh/ssh_host_rsa_key.pub)" >> \<br />
/tmp/known_hosts_sn<br />
root@bglsn_fn~/> echo "$IONODE_IPS,$IONODE_HOSTNAMES $(cat \<br />
/bgl/dist/etc/ssh/ssh_host_rsa_key.pub)" >> /tmp/known_hosts_io<br />
root@bglsn_fn~/> cp /tmp/known_hosts_sn /tmp/known_hosts_gpfs<br />
root@bglsn_fn~/> cat /tmp/known_hosts_io >> /tmp/known_hosts_gpfs<br />
Note: The variables $BGL_SNIP, $SN_HOSTNAME, $IONODE_IPS, and<br />
$IONODE_HOSTNAMES are explained in the Important box on page 237.<br />
Example 5-12 shows one entry in the file that uses the wildcard character (*).<br />
This character saves having to add entries for every I/O node individually.<br />
Example 5-12 The known_hosts file entry that uses wild card chars<br />
172.30.2.*,ionode* ssh-rsa<br />
AAAAB3NzaC1yc2EAAAABIwAAAIEA27GK+WllP58rmK//LGhE4NKBHDdb30x4Kvrkb3ibbRs<br />
41eHuLE3/KIV0IQkwi36F4hg5gRBC2vbBINaIJvwiybovpoL2gfpFTeRworWvVI3goBAJh/<br />
/hIeT+J9sm+Iogxe2iQ6Q6TfsdPss4dkq3nvGM/HmUULsohgT3u494vVc= root@bglsn<br />
After creating the /tmp/known_hosts_gpfs file, distribute it (see<br />
Example 5-13). Check whether either the SN or the I/O nodes already have a<br />
known_hosts file. If one already exists, then append the<br />
/tmp/known_hosts_gpfs file to the end of the existing one.<br />
Example 5-13 Distributing known_hosts file<br />
root@bglsn_fn~/> cat /tmp/known_hosts_gpfs >> /root/.ssh/known_hosts<br />
root@bglsn_fn~/> touch $BGL_SITEDISTDIR/root/.ssh/known_hosts<br />
root@bglsn_fn~/> cat /tmp/known_hosts_gpfs >> \<br />
$BGL_SITEDISTDIR/root/.ssh/known_hosts<br />
Attention: If the authorized_keys and known_hosts files already exist and you<br />
append to them, check for duplicate entries. If there are duplicates, only the<br />
first occurrence is considered, which can cause authentication problems!<br />
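One way to check a combined authorized_keys or known_hosts file for duplicates is to look for repeated first fields. The following is a sketch with invented sample entries (hostA, hostB, KEY1, and so on are placeholders, not values from a real configuration):<br />

```shell
# Report any host (first field) that appears more than once in a
# known_hosts-style file; duplicates after the first are ignored by
# ssh, so every line printed here deserves a closer look.
f=$(mktemp)
printf '%s\n' 'hostA ssh-rsa KEY1' \
              'hostB ssh-rsa KEY2' \
              'hostA ssh-rsa KEY3' > "$f"
awk '{count[$1]++} END {for (h in count) if (count[h] > 1) print h}' "$f"
rm -f "$f"
```

Run against the sample data above, the check reports hostA; run it against /root/.ssh/known_hosts and $BGL_SITEDISTDIR/root/.ssh/known_hosts after the appends in the previous steps.<br />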
11.Test unprompted command execution.<br />
Verify that the ssh files are configured properly by using ssh between all the<br />
bgIO cluster nodes without being prompted for password or host key<br />
acceptance. This test requires the sshd daemon to be running on the I/O<br />
nodes to be tested. The simplest way to achieve this is to ensure that you<br />
have a sitefs file at $BGL_SITEDISTDIR/etc/rc.d/init.d/sitefs and<br />
that this sitefs file includes the following line:<br />
echo "GPFS_STARTUP=1" >> /etc/sysconfig/gpfs<br />
If you do not have a sitefs file, then you can create one using the example<br />
found at the end of the following file:<br />
$BGL_SITEDISTDIR/docs/ionode.README<br />
For your convenience, this file is also included in Appendix C, “The<br />
ionode.README file” on page 431.<br />
The sitefs script that we used for this book is shown in Appendix B, “The sitefs<br />
file” on page 423. The lines that are important to check for in your sitefs script<br />
are shown in bold in Example 5-14.<br />
Example 5-14 The sitefs file with GPFS enabled<br />
# Uncomment the first line to start GPFS.<br />
# Optionally uncomment the other lines to change the defaults for<br />
# GPFS.<br />
echo "GPFS_STARTUP=1" >> /etc/sysconfig/gpfs<br />
# echo "GPFS_VAR_DIR=/bgl/gpfsvar" >> /etc/sysconfig/gpfs<br />
When the "GPFS_STARTUP=1" line is included in the sitefs script, and the sitefs<br />
script is linked into the startup script files, the sshd daemon is started on the<br />
I/O nodes at boot by the S16sshd startup script. Check that the following<br />
symbolic links are in place so that your sitefs file is called during the I/O node<br />
initialization:<br />
ln -s $BGL_SITEDISTDIR/etc/rc.d/rc3.d/sitefs \<br />
$BGL_SITEDISTDIR/etc/rc.d/rc3.d/S10sitefs<br />
ln -s $BGL_SITEDISTDIR/etc/rc.d/rc3.d/sitefs \<br />
$BGL_SITEDISTDIR/etc/rc.d/rc3.d/K90sitefs<br />
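To double-check the links, a quick sketch such as the following (our own helper, not an IBM-supplied script) compares each link target against the expected ../init.d/sitefs:

```shell
# Verify that the S10/K90 sitefs links in an rc3.d directory point
# back at ../init.d/sitefs; report any that are missing or wrong.
check_sitefs_links() {
    rcdir=$1
    rc=0
    for link in S10sitefs K90sitefs; do
        target=$(readlink "$rcdir/$link" 2>/dev/null)
        if [ "$target" = "../init.d/sitefs" ]; then
            echo "ok: $link -> $target"
        else
            echo "missing or wrong: $link"
            rc=1
        fi
    done
    return $rc
}

# check_sitefs_links "$BGL_SITEDISTDIR/etc/rc.d/rc3.d"
```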
Now you can boot a block and check that you can connect using ssh to one<br />
I/O node and then from this I/O node back to the SN. First, boot a block and<br />
establish the IP address of the I/O nodes by using the {i} write_con<br />
hostname command from the mmcs_db_console (see Example 5-15).<br />
Example 5-15 Using mmcs_db_console to boot a block and check for I/O nodes<br />
mmcs$ list_blocks<br />
OK<br />
R000_128 root(1) connected<br />
mmcs$ select_block R000_128<br />
OK<br />
mmcs$ redirect R000_128 on<br />
OK<br />
mmcs$ {i} write_con hostname<br />
OK<br />
mmcs$ Mar 22 14:47:25 (I) [1079031008] {119}.0: h<br />
Mar 22 14:47:25 (I) [1079031008] {102}.0: h<br />
Mar 22 14:47:25 (I) [1079031008] {17}.0: h<br />
Mar 22 14:47:25 (I) [1079031008] {17}.0: ostname<br />
172.30.2.2<br />
................. >> ...............<br />
Mar 22 14:47:25 (I) [1079031008] {0}.0: ostname<br />
172.30.2.1<br />
$<br />
Mar 22 14:47:25 (I) [1079031008] {34}.0: ostname<br />
172.30.2.5<br />
$<br />
Example 5-15 shows that we have eight I/O nodes with IP addresses from<br />
172.30.2.1 - 172.30.2.8. Now, check whether you can connect using ssh to<br />
242 IBM System Blue Gene Solution: Problem Determination Guide
one of these nodes from the SN and then back again using both the IP<br />
address and the node name as listed in the /etc/hosts file:<br />
bglsn:/tmp # ssh root@172.30.2.1<br />
BusyBox v1.00-rc2 (2006.01.09-19:48+0000) Built-in shell (ash)<br />
Enter 'help' for a list of built-in commands.<br />
This shows that you are connected through ssh to the I/O node with the IP
address 172.30.2.1. Now, ssh back to the SN using the IP address:
$ ssh root@172.30.1.1<br />
Last login: Wed Mar 22 14:45:13 2006 from 192.168.100.60<br />
bglsn:~ #<br />
bglsn:~ # exit<br />
logout<br />
Connection to bglsn_fn.itso.ibm.com closed.<br />
Next, verify that you can also ssh between an I/O node and the SN using
the alias names rather than the IP addresses, as shown in Example 5-16.
Example 5-16 Verifying ssh connection using IP labels<br />
$ hostname<br />
ionode1<br />
$ ssh root@bglsn_fn.itso.ibm.com<br />
Last login: Wed Mar 22 15:31:09 2006 from ionode1<br />
bglsn:~ # ssh root@ionode1<br />
BusyBox v1.00-rc2 (2006.01.09-19:48+0000) Built-in shell (ash)<br />
Enter 'help' for a list of built-in commands.<br />
$ hostname<br />
ionode1<br />
$ exit<br />
Connection to ionode1 closed.<br />
bglsn:~ # exit<br />
logout<br />
Connection to bglsn_fn.itso.ibm.com closed.<br />
$ exit<br />
Connection to ionode1 closed.<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin #<br />
This concludes our ssh tests, and we have now confirmed that the ssh setup is<br />
correct.<br />
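The manual checks above can also be scripted. The sketch below is our helper (the node names in the usage line are examples from our /etc/hosts); it relies on ssh -o BatchMode=yes, which makes ssh fail instead of prompting, so any node that would still ask for a password or host-key confirmation is reported:

```shell
# Report, for each node given, whether unprompted ssh as root works.
# Returns nonzero if any node still prompts or is unreachable.
check_unprompted_ssh() {
    rc=0
    for host in "$@"; do
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "root@$host" true 2>/dev/null; then
            echo "OK   $host"
        else
            echo "FAIL $host"
            rc=1
        fi
    done
    return $rc
}

# check_unprompted_ssh bglsn_fn ionode1 ionode2 ionode3 ionode4
```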
Creating the bgIO cluster<br />
After you have installed the GPFS packages and have configured remote<br />
command execution, you can create the GPFS cluster (bgIO). Figure 5-7<br />
illustrates this step.<br />
Figure 5-7 Creating the GPFS cluster named bgIO on the Blue Gene/L system<br />
To create the bgIO cluster, follow these steps:<br />
1. Create a GPFS node file called service.node that contains only the SN entry.<br />
Initially we have a single node in the bgIO cluster (Example 5-17).<br />
Example 5-17 GPFS node definition file for bgIO cluster<br />
bglsn:/tmp # echo "$SN_HOSTNAME:quorum" >> /tmp/service.node
bglsn:/tmp # cat service.node<br />
bglsn_fn:quorum<br />
bglsn:/tmp #<br />
2. Use the /tmp/service.node file to create the bgIO cluster. Here is the<br />
command to be issued from the SN:<br />
/usr/lpp/mmfs/bin/mmcrcluster -n service.node -p bglsn_fn -C bgIO \
-A -r /usr/bin/ssh -R /usr/bin/scp
Set the pagepool to 128M and any other GPFS configuration parameters that
you might want to change at this time. Changing the pagepool from the
default of 64 MB to 128 MB improves performance. Values larger than 128 MB
can result in GPFS not being able to load (an I/O node has only 2 GB of RAM).
# mmchconfig pagepool=128M<br />
# mmchconfig dataStructureDump=/var/mmfs/tmp<br />
3. Check the bgIO cluster to verify the parameters as shown in Example 5-18.<br />
Example 5-18 The bgIO cluster configuration<br />
bglsn:/tmp # /usr/lpp/mmfs/bin/mmlscluster<br />
GPFS cluster information<br />
========================<br />
GPFS cluster name: bgIO.itso.ibm.com<br />
GPFS cluster id: 12402351528774401789<br />
GPFS UID domain: bgIO.itso.ibm.com<br />
Remote shell command: /usr/bin/ssh<br />
Remote file copy command: /usr/bin/scp<br />
GPFS cluster configuration servers:<br />
-----------------------------------<br />
Primary server: bglsn_fn.itso.ibm.com<br />
Secondary server: (none)<br />
Node number Node name IP address Full node name Remarks<br />
-----------------------------------------------------------------------------------<br />
1 bglsn_fn 172.30.1.1 bglsn_fn.itso.ibm.com quorum node<br />
4. Start the bgIO GPFS cluster using the mmstartup -a command, and check the
/var/adm/ras/mmfs.log.latest file to ensure that GPFS has started (see
Example 5-19); look for the "mmfsd ready" message.
Example 5-19 The mmfs.log.latest file showing that GPFS is started and ready
bglsn:/mnt/chriss/gpfs/BGL # /usr/lpp/mmfs/bin/mmstartup -a<br />
Mon Mar 20 14:14:30 EST 2006: mmstartup: Starting GPFS ...<br />
bglsn:/mnt/chriss/gpfs/BGL # cat /var/adm/ras/mmfs.log.latest<br />
Mon Mar 20 14:14:31 EST 2006 runmmfs starting<br />
Removing old /var/adm/ras/mmfs.log.* files:<br />
/bin/mv: cannot stat `/var/adm/ras/mmfs.log.previous': No such file or<br />
directory<br />
Unloading modules from /usr/lpp/mmfs/bin<br />
Loading modules from /usr/lpp/mmfs/bin<br />
Module Size Used by<br />
mmfslinux 268384 1 mmfs<br />
tracedev 35552 2 mmfs,mmfslinux<br />
Removing old /var/mmfs/tmp files:<br />
Mon Mar 20 14:14:33 2006: mmfsd initializing. {Version: 2.3.0.10<br />
Built: Jan 16 2006 13:07:54} ...<br />
Mon Mar 20 14:14:34 EST 2006 /var/mmfs/etc/gpfsready invoked<br />
Mon Mar 20 14:14:34 2006: mmfsd ready<br />
Mon Mar 20 14:14:34 EST 2006: mmcommon mmfsup invoked<br />
Mon Mar 20 14:14:34 EST 2006: /var/mmfs/etc/mmfsup.scr invoked<br />
bglsn:/mnt/chriss/gpfs/BGL #<br />
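Rather than reading the log by eye, the "mmfsd ready" check can be scripted. The following sketch is ours (the function name and default timeout are assumptions, not part of GPFS); it polls the log until the message appears:

```shell
# Poll a GPFS log file until "mmfsd ready" appears, or give up
# after the given number of seconds (default 60).
wait_mmfsd_ready() {
    log=${1:-/var/adm/ras/mmfs.log.latest}
    timeout=${2:-60}
    while [ "$timeout" -gt 0 ]; do
        if grep -q "mmfsd ready" "$log" 2>/dev/null; then
            return 0
        fi
        sleep 1
        timeout=$((timeout - 1))
    done
    echo "mmfsd not ready after waiting on $log" >&2
    return 1
}

# wait_mmfsd_ready /var/adm/ras/mmfs.log.latest 120
```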
5.3.8 Cross mounting the GPFS file system on to Blue Gene/L cluster<br />
Most of this section deals with authentication and the exchange of
OpenSSL-based certificates (keys).
The following steps are necessary to cross mount the GPFS file system on the<br />
Blue Gene/L:<br />
1. Configure GPFS authentication on gpfsNSD and bgIO clusters.<br />
2. Mount the GPFS file system on the SN.<br />
3. Add all the I/O nodes to the bgIO cluster.<br />
4. Boot a block and check for automatic mount of the GPFS file system.<br />
Configuring authentication on the gpfsNSD and bgIO clusters<br />
Figure 5-8 illustrates our environment before the bgIO and gpfsNSD clusters are
authenticated with each other.
Figure 5-8 Configure GPFS authentication on both clusters<br />
To configure authentication on the gpfsNSD and bgIO clusters, follow these steps:
1. Generate the SSL keys on the gpfsNSD and bgIO clusters.
On both clusters, first ensure that GPFS is stopped and that the
OpenSSL packages are installed.
On one node in the gpfsNSD cluster and on the SN, run the mmauth genkey
command, as shown in Example 5-20. This command generates a
public/private key pair that is saved in the /var/mmfs/ssl directory.
Example 5-20 Generating GPFS cluster ssl keys<br />
###### On one node in gpfsNSD cluster (p630n01):<br />
[p630n01][/]> /usr/lpp/mmfs/bin/mmauth genkey
Verifying GPFS is stopped on all nodes ...<br />
Generating RSA private key, 512 bit long modulus<br />
...............++++++++++++<br />
...............++++++++++++<br />
e is 65537 (0x10001)<br />
id_rsa1 100% 497<br />
0.5KB/s 00:00<br />
id_rsa1 100% 497<br />
0.5KB/s 00:00<br />
mmauth: Command successfully completed<br />
####### and on the Service Node:<br />
bglsn:/root # /usr/lpp/mmfs/bin/mmauth genkey<br />
Verifying GPFS is stopped on all nodes ...<br />
Generating RSA private key, 512 bit long modulus<br />
...............++++++++++++<br />
...............++++++++++++<br />
e is 65537 (0x10001)<br />
id_rsa1 100% 497<br />
0.5KB/s 00:00<br />
id_rsa1 100% 497<br />
0.5KB/s 00:00<br />
mmauth: Command successfully completed<br />
2. Set cipherList=AUTHONLY on both clusters.
On both the gpfsNSD and bgIO clusters, ensure that GPFS is stopped. On one
node in each cluster, run the mmchconfig cipherList=AUTHONLY command, as
shown in Example 5-21. This command sets GPFS to authenticate and
check authorization for network connections, which is required for cross
cluster communications.
Example 5-21 Telling clusters to authenticate cross-cluster connections<br />
bglsn:/root # /usr/lpp/mmfs/bin/mmchconfig cipherList=AUTHONLY<br />
Verifying GPFS is stopped on all nodes ...<br />
mmchconfig: Command successfully completed<br />
mmchconfig: 6027-1371 Propagating the changes to all affected nodes.<br />
This is an asynchronous process.<br />
3. Exchange the SSL public keys between the clusters.
Copy the bgIO cluster public key to one of the nodes in the gpfsNSD cluster,
and the gpfsNSD cluster public key to the SN.
Then, add the keys to the authorization list on each cluster. Use the mmauth
add command on both clusters, as shown in Example 5-22.
Example 5-22 Authenticating bgIO and gpfsNSD clusters<br />
# On one node in gpfsNSD cluster:
[p630n01][/]> scp bglsn_fn:/var/mmfs/ssl/id_rsa.pub ~/id_rsa.pub.bgIO
[p630n01][/]> /usr/lpp/mmfs/bin/mmauth add bgIO -k ~/id_rsa.pub.bgIO
# and, on the Service Node:
bglsn:/ # scp p630n01_fn:/var/mmfs/ssl/id_rsa.pub ~/id_rsa.pub.gpfsNSD
bglsn:/ # /usr/lpp/mmfs/bin/mmauth add gpfsNSD -k ~/id_rsa.pub.gpfsNSD \
-n p630n01_fn,p630n02_fn
Note: As of the current GPFS version (V2.3), you need to specify the NSD<br />
nodes when authorizing the remote cluster, in our case p630n01_fn and<br />
p630n02_fn.<br />
4. Allow bgIO access to the GPFS-FS exported by gpfsNSD.
Start the GPFS daemons on the gpfsNSD cluster, and then use the mmauth grant
command to allow bgIO access to the GPFS file system. In the case
shown, the device name of the GPFS file system to which bgIO is granted
access is gpfs1. The bgIO cluster name used with the mmauth grant command
must be the actual name of the cluster as shown by the mmlscluster
command on the SN (in our case, bgIO).
[p630n01][/]> /usr/lpp/mmfs/bin/mmstartup -a
[p630n01][/]> /usr/lpp/mmfs/bin/mmauth grant bgIO -f gpfs1
Note: The cluster name is bgIO.itso.ibm.com but the short name is<br />
allowed.<br />
5. Add the remote file system to the bgIO cluster.
On the bgIO cluster, ensure that GPFS is shut down. Then, run the
mmremotefs add command as the root user. This command tells the bgIO
cluster about the remote file system that it can mount. The cluster name
used after the -C parameter must be the actual name of the gpfsNSD
cluster as shown by the mmlscluster command run on the gpfsNSD cluster.
The local device name for the remote GPFS-FS is bubu_gpfs1, and /bubu is
the local mount point for the file system.
bglsn:/root # /usr/lpp/mmfs/bin/mmremotefs add bubu_gpfs1 -f \
gpfs1 -C gpfsNSD -T /bubu
Mounting the GPFS file system on the SN<br />
Figure 5-9 illustrates the cross-mounted file system that we provided for our Blue<br />
Gene/L system.<br />
Figure 5-9 Cross-mount /gpfs1 from the gpfsNSD cluster on the SN<br />
To mount the GPFS file system on the SN, follow these steps:<br />
1. On the SN start GPFS and ensure that the remote file system can be<br />
mounted, as shown in Example 5-23.<br />
Example 5-23 Mounting the remote file system on bgIO<br />
bglsn:/root # /usr/lpp/mmfs/bin/mmstartup -a<br />
Mon Mar 20 15:00:11 EST 2006: mmstartup: Starting GPFS ...<br />
bglsn:/root # mount /bubu<br />
bglsn:/root # df<br />
Filesystem 1K-blocks Used Available Use% Mounted on<br />
/dev/sdb3 70614928 4632540 65982388 7% /<br />
tmpfs 1898508 8 1898500 1% /dev/shm<br />
/dev/sda4 489452 50972 438480 11% /tmp<br />
/dev/sda1 9766544 1997804 7768740 21% /bgl<br />
/dev/sda2 9766608 698992 9067616 8% /dbhome<br />
p630n03:/bglscratch 36700160 5952 36694208 1% /bglscratch<br />
p630n03_fn:/nfs_mnt 104857600 11300064 93557536 11% /mnt<br />
/dev/bubu_gpfs1 1138597888 2918400 1135679488 1% /bubu<br />
2. Enable the remote file system from the gpfsNSD cluster (/gpfs1) to mount<br />
automatically over the local mount point (/bubu) when GPFS is started on the<br />
bgIO cluster. Use the mmremotefs update command, as shown in<br />
Example 5-24.<br />
Example 5-24 Changing remote file system to automount at bgIO cluster startup<br />
bglsn:/mnt/chriss/gpfs/BGL # /usr/lpp/mmfs/bin/mmremotefs update bubu_gpfs1 -A yes<br />
bglsn:/mnt/chriss/gpfs/BGL # /usr/lpp/mmfs/bin/mmshutdown -a<br />
......<br />
bglsn:/mnt/chriss/gpfs/BGL # /usr/lpp/mmfs/bin/mmstartup -a<br />
Mon Mar 20 15:08:12 EST 2006: mmstartup: Starting GPFS ...<br />
bglsn:/mnt/chriss/gpfs/BGL # df<br />
Filesystem 1K-blocks Used Available Use% Mounted on<br />
/dev/sdb3 70614928 4635404 65979524 7% /<br />
tmpfs 1898508 8 1898500 1% /dev/shm<br />
/dev/sda4 489452 50972 438480 11% /tmp<br />
/dev/sda1 9766544 2005192 7761352 21% /bgl<br />
/dev/sda2 9766608 698992 9067616 8% /dbhome<br />
p630n03:/bglscratch 36700160 5952 36694208 1% /bglscratch<br />
p630n03_fn:/nfs_mnt 104857600 11300064 93557536 11% /mnt<br />
/dev/bubu_gpfs1 1138597888 2918400 1135679488 1% /bubu<br />
bglsn:/mnt/chriss/gpfs/BGL #<br />
Adding all the I/O nodes to the bgIO cluster<br />
Figure 5-10 illustrates our complete GPFS on Blue Gene/L configuration. Adding<br />
all I/O nodes to the bgIO cluster is the last step before you can actually start<br />
using the GPFS file system for running user jobs.<br />
Figure 5-10 Adding I/O nodes to the bgIO GPFS cluster<br />
To add all the I/O nodes to the bgIO cluster, follow these steps:<br />
1. Create a node definition file (ionodes) that contains a list of the I/O nodes to<br />
be added (their IP label), as in Example 5-25.<br />
Note: You can choose not to add all I/O nodes to the bgIO cluster.<br />
However, this means that certain I/O nodes will not be able to access the<br />
GPFS file systems. This is acceptable if you use manual block allocation or<br />
if the job submission system (automated scheduler) can be made aware of<br />
this configuration.<br />
Example 5-25 Node definition file for I/O nodes<br />
bglsn:/mnt/chriss/gpfs/BGL # cat /tmp/ionodes<br />
ionode1<br />
ionode2<br />
ionode3<br />
ionode4<br />
ionode5<br />
ionode6<br />
ionode7<br />
ionode8<br />
2. To add the nodes to the bgIO cluster, first ensure that a block is booted
that contains all the I/O nodes that you want to use with GPFS. Then,
add the nodes to the bgIO cluster using the mmaddnode command, as shown in
Example 5-26.
Example 5-26 Adding the I/O nodes to the bgIO cluster<br />
bglsn:/mnt/chriss/gpfs/BGL # /usr/lpp/mmfs/bin/mmaddnode -n \
/tmp/ionodes
Mon Mar 20 15:13:34 EST 2006: mmaddnode: Processing node ionode1<br />
Mon Mar 20 15:13:35 EST 2006: mmaddnode: Processing node ionode2<br />
Mon Mar 20 15:13:36 EST 2006: mmaddnode: Processing node ionode3<br />
Mon Mar 20 15:13:37 EST 2006: mmaddnode: Processing node ionode4<br />
Mon Mar 20 15:13:38 EST 2006: mmaddnode: Processing node ionode5<br />
Mon Mar 20 15:13:39 EST 2006: mmaddnode: Processing node ionode6<br />
Mon Mar 20 15:13:40 EST 2006: mmaddnode: Processing node ionode7<br />
Mon Mar 20 15:13:41 EST 2006: mmaddnode: Processing node ionode8<br />
mmaddnode: Command successfully completed<br />
mmaddnode: Propagating the changes to all affected nodes.<br />
This is an asynchronous process.<br />
Attention: If some of the nodes are not available (not booted, bad network
connection, ssh not configured, and so forth), you need to correct the situation
and retry the mmaddnode command with the respective nodes.
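One way to reduce such partial failures is to filter the node file down to reachable nodes first. The following sketch is our own helper (it assumes ping is available on the SN; the file names in the usage lines match Example 5-25):

```shell
# Print only the nodes from a node file that answer a single ping;
# warn on stderr about nodes that are unreachable.
reachable_nodes() {
    while read -r node; do
        [ -n "$node" ] || continue
        if ping -c1 -W1 "$node" > /dev/null 2>&1; then
            echo "$node"
        else
            echo "skipping unreachable node: $node" >&2
        fi
    done < "$1"
}

# reachable_nodes /tmp/ionodes > /tmp/ionodes.up
# /usr/lpp/mmfs/bin/mmaddnode -n /tmp/ionodes.up
```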
Booting a block and checking automatic mount of the GPFS-FS<br />
You now have to start GPFS on the newly added nodes. The best way to do this
is to de-allocate the block and then allocate it again. In this way, you can check
that the file system (/bubu) is mounted on the I/O nodes automatically.
Example 5-27 shows the check performed for a small block that has just two I/O
nodes using mmcs_db_console.
Example 5-27 Using mmcs_db_console to check the GPFS file system on I/O nodes<br />
mmcs$ allocate R000_J106_32<br />
OK<br />
mmcs$ select_block R000_J106_32<br />
OK<br />
mmcs$ redirect R000_J106_32 on<br />
OK<br />
mmcs$ {i} write_con df | grep bubu<br />
OK<br />
mmcs$ Apr 04 16:09:20 (I) [1083225312] {17}.0: d<br />
Apr 04 16:09:20 (I) [1083225312] {17}.0: f | grep bubu<br />
/dev/bubu_gpfs1 1138597888 3204096 1135393792 1% /bubu<br />
$<br />
Apr 04 16:09:20 (I) [1083225312] {0}.0: d<br />
Apr 04 16:09:20 (I) [1083225312] {0}.0: f | grep bubu<br />
/dev/bubu_gpfs1 1138597888 3204096 1135393792 1% /bubu<br />
$<br />
Finally, Figure 5-11 shows the bgIO cluster with the I/O nodes active and having<br />
the GPFS file system mounted. The block is ready now for running jobs.<br />
Figure 5-11 GPFS file system cross mounted to bgIO and I/O nodes added<br />
5.3.9 GPFS problem determination methodology<br />
The methodology presented in this section is intended to help with a wide variety
of problems. It is set out in sections that deal first with the SN and then with the
gpfsNSD cluster. Undertake the checks in the order that we present them; if a
check passes, proceed to the next check until you find the problem. If you think
you already know in which area the problem lies, then by all means go straight
to that section. If, however, you are unsure of exactly where the problem lies,
use the methodology in the order presented, because this often uncovers the
simplest problems quickly and easily, before you spend a long time looking for a
solution to an assumed problem rather than the one you actually have.
Checking that the GPFS-FS can be mounted on the SN<br />
Here is the methodology. All the commands in this section should be run on the<br />
SN as root.<br />
► Check that the GPFS is started on the SN, as described in “Checking that the<br />
GPFS is started” on page 255<br />
► Check the GPFS log files for problems, as described in “Checking the GPFS<br />
log files for problems” on page 256<br />
► Check that the GPFS-FS can mount on the SN, as described in “Checking<br />
that the GPFS-FS can mount on the SN” on page 257<br />
► Check that the GPFS-FS can mount on the I/O nodes, as described in<br />
“Checking that the GPFS-FS can mount on the I/O nodes” on page 257<br />
► Check that the GPFS-FS is configured on the SN, as described in “Checking<br />
that the file system is configured on the SN” on page 259<br />
► Check that the GPFS-FS is authorized to mount on the SN, as described in<br />
“Checking that the SN is authorized to mount the GPFS-FS” on page 261<br />
Checking that the GPFS-FS can be mounted on the gpfsNSD<br />
cluster<br />
Here is the methodology. You should run all of the commands in this section on<br />
one of the gpfsNSD cluster nodes as root.<br />
► Check GPFS is started on all nodes of gpfsNSD cluster, as described in<br />
“Checking that the GPFS is started” on page 255<br />
► Check GPFS-FS is configured on gpfsNSD cluster, as described in “Checking
that GPFS-FS is configured on gpfsNSD cluster” on page 261<br />
► Check GPFS-FS can mount on the gpfsNSD cluster, as described in<br />
“Checking that GPFS-FS can mount on the gpfsNSD cluster” on page 262<br />
► Check GPFS-FS disks are available on gpfsNSD, as described in “Checking<br />
that the GPFS-FS disks are available on gpfsNSD” on page 263<br />
► Check the bgIO cluster is authorized to mount the GPFS-FS, as described in<br />
“Checking that bgIO cluster is authorized to mount the GPFS-FS” on<br />
page 263<br />
5.3.10 GPFS checklists<br />
This section includes checklists for GPFS.
Checking that the GPFS is started<br />
Use the mmgetstate command to check if GPFS is started on either the SN or<br />
any node in the gpfsNSD cluster. Example 5-28 shows three (all) active nodes in<br />
our gpfsNSD cluster.<br />
Example 5-28 Checking GPFS node status<br />
[p630n01][/]> mmgetstate -a<br />
Node number Node name GPFS state<br />
-----------------------------------------<br />
1 p630n01_fn active<br />
2 p630n02_fn active<br />
3 p630n03_fn active<br />
The same command was run on the SN and shows that GPFS is started on the<br />
SN (see Example 5-29).<br />
Example 5-29 The mmgetstate command on the SN<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # mmgetstate -a<br />
Node number Node name GPFS state<br />
-----------------------------------------<br />
1 bglsn_fn active<br />
If this command shows that GPFS is not active on all nodes, you need to start<br />
GPFS on all nodes using the mmstartup -a command on any node in the cluster.<br />
If only some of the nodes are inactive, start them individually as follows (in this<br />
case we started GPFS on the SN only):<br />
bglsn:/tmp # mmstartup<br />
Wed Mar 29 17:40:27 EST 2006: mmstartup: Starting GPFS ...<br />
Checking the GPFS log files for problems<br />
The GPFS log files for nodes with local storage are kept in the /var/adm/ras
directory. The latest log is named mmfs.log.latest and shows the history of
messages since the last time that GPFS was started on that node.
Example 5-30 shows the latest GPFS log file on the SN. The mmfsd ready
message means that GPFS is functioning properly on this node. The log also
shows that the file system /dev/bubu_gpfs1 has been mounted from the remote
cluster known as gpfsNSD.
Example 5-30 Latest GPFS log on SN<br />
bglsn:/var/adm/ras # cat mmfs.log.latest<br />
Wed Mar 29 17:40:27 EST 2006 runmmfs starting<br />
Removing old /var/adm/ras/mmfs.log.* files:<br />
Unloading modules from /usr/lpp/mmfs/bin<br />
Loading modules from /usr/lpp/mmfs/bin<br />
Module Size Used by<br />
mmfslinux 268384 1 mmfs<br />
tracedev 35552 2 mmfs,mmfslinux<br />
Removing old /var/mmfs/tmp files:<br />
Wed Mar 29 17:40:30 2006: mmfsd initializing. {Version: 2.3.0.10<br />
Built: Jan 16 2006 13:07:54} ...<br />
Wed Mar 29 17:40:30 2006: OpenSSL library loaded<br />
Wed Mar 29 17:40:30 EST 2006 /var/mmfs/etc/gpfsready invoked<br />
Wed Mar 29 17:40:30 2006: mmfsd ready<br />
Wed Mar 29 17:40:30 EST 2006: mmcommon mmfsup invoked<br />
Wed Mar 29 17:40:30 EST 2006: /var/mmfs/etc/mmfsup.scr invoked<br />
Wed Mar 29 17:40:30 EST 2006: mounting /dev/bubu_gpfs1<br />
Wed Mar 29 17:40:31 2006: Waiting to join remote cluster p630n01_fn<br />
Wed Mar 29 17:40:31 2006: Connecting to 172.30.1.31 p630n01_fn<br />
Wed Mar 29 17:40:31 2006: Connected to 172.30.1.31 p630n01_fn<br />
Wed Mar 29 17:40:31 2006: Joined remote cluster gpfsNSD<br />
Wed Mar 29 17:40:31 2006: Command: mount bubu_gpfs1<br />
Wed Mar 29 17:40:31 2006: Connecting to 172.30.1.32 p630n02_fn<br />
Wed Mar 29 17:40:31 2006: Connected to 172.30.1.32 p630n02_fn<br />
Wed Mar 29 17:40:32 2006: Command: err 0: mount p630n01_fn:gpfs1<br />
Wed Mar 29 17:40:32 EST 2006: finished mounting /dev/bubu_gpfs1<br />
If you are experiencing problems with GPFS starting or not mounting the file
system, then this is a good place to look.
GPFS log files for the I/O nodes are usually found under the following directory
on the SN (or as specified in the $BGL_DISTDIR/etc/rc.d/init.d/gpfs script):
/bgl/gpfsvar//var/adm/ras
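With many I/O nodes it helps to scan all of the per-node logs at once. The following sketch is ours and assumes a hypothetical layout of one subdirectory per I/O node under /bgl/gpfsvar; adjust the base path to match your GPFS_VAR_DIR setting:

```shell
# Scan the per-node GPFS logs kept on the SN and report which nodes
# reached "mmfsd ready" (assumed layout: one subdirectory per I/O
# node under the base directory; adjust to your configuration).
scan_ionode_gpfs_logs() {
    base=${1:-/bgl/gpfsvar}
    for log in "$base"/*/var/adm/ras/mmfs.log.latest; do
        [ -f "$log" ] || continue
        if grep -q "mmfsd ready" "$log"; then
            echo "ready: $log"
        else
            echo "NOT ready: $log"
        fi
    done
}
```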
Checking that the GPFS-FS can mount on the SN<br />
If the remote GPFS file system is not mounted, you need to identify it first. To find<br />
the remote file system to be mounted, use the mmremotefs command, and then<br />
try to mount it, as shown in Example 5-31.<br />
Example 5-31 Checking remote file systems<br />
bglsn:/tmp # mmremotefs show all<br />
Local Name  Remote Name  Cluster name  Mount Point  Mount Options  Automount
bubu_gpfs1  gpfs1        p630n01_fn    /bubu        rw             yes
bglsn:/tmp # mount /bubu<br />
mount: /dev/bubu_gpfs1 already mounted or /bubu busy<br />
mount: according to mtab, /dev/bubu_gpfs1 is already mounted on /bubu<br />
In our example the /bubu file system was already mounted.<br />
Checking that the GPFS-FS can mount on the I/O nodes<br />
Note: Only attempt this test if the GPFS-FS can be mounted on the SN (see<br />
previous check).<br />
First boot a block using the mmcs_db_console and check that GPFS has started<br />
on the nodes (Example 5-32).<br />
Example 5-32 Checking GPFS on I/O nodes<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./mmcs_db_console<br />
connecting to mmcs_server<br />
connected to mmcs_server<br />
connected to DB2<br />
mmcs$ list_blocks<br />
OK<br />
mmcs$ allocate R000_128<br />
OK<br />
mmcs$ quit<br />
OK<br />
mmcs_db_console is terminating, please wait...<br />
mmcs_db_console: closing database connection<br />
mmcs_db_console: closed database connection<br />
mmcs_db_console: closing console port<br />
mmcs_db_console: closed console port<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # mmgetstate -a<br />
Node number Node name GPFS state<br />
-----------------------------------------<br />
1 bglsn_fn active<br />
2 ionode1 active<br />
3 ionode2 active<br />
4 ionode3 active<br />
5 ionode4 active<br />
6 ionode5 active<br />
7 ionode6 active<br />
8 ionode7 active<br />
9 ionode8 active<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin #<br />
Then, connect using ssh to an I/O node and try to mount the GPFS file system on<br />
the I/O node, as shown in Example 5-33.<br />
Example 5-33 Checking GPFS-FS can mount on I/O nodes<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ssh root@ionode4<br />
BusyBox v1.00-rc2 (2006.01.09-19:48+0000) Built-in shell (ash)<br />
Enter 'help' for a list of built-in commands.<br />
$ df<br />
Filesystem 1K-blocks Used Available Use% Mounted on<br />
rootfs 7931 1644 5878 22% /<br />
/dev/root 7931 1644 5878 22% /<br />
172.30.1.1:/bgl 9766544 2450872 7315672 26% /bgl<br />
172.30.1.33:/bglscratch<br />
36700160 71776 36628384 1% /bglscratch<br />
$ /usr/lpp/mmfs/bin//mmgetstate<br />
Node number Node name GPFS state<br />
-----------------------------------------<br />
5 ionode4 active<br />
$ mount /bubu<br />
$ df<br />
Filesystem 1K-blocks Used Available Use% Mounted on<br />
rootfs 7931 1644 5878 22% /<br />
/dev/root 7931 1644 5878 22% /<br />
172.30.1.1:/bgl 9766544 2450872 7315672 26% /bgl<br />
172.30.1.33:/bglscratch 36700160 71776 36628384 1% /bglscratch<br />
/dev/bubu_gpfs1 1138597888 3204096 1135393792 1% /bubu<br />
If you are unable to connect through ssh to an I/O node, or if the mmgetstate -a
command shows that no I/O nodes are active, you need to investigate your sitefs
script (see also Appendix B, “The sitefs file” on page 423) and ensure that the
following two important steps have been executed:<br />
► Added the following line to the /bgl/dist/etc/rc.d/init.d/sitefs file:<br />
echo "GPFS_STARTUP=1" >> /etc/sysconfig/gpfs<br />
► Added the following symbolic links to ensure that sitefs is called at bootup:<br />
bglsn:/bgl/dist/etc/rc.d/rc3.d # ls -als<br />
total 0<br />
0 drwxr-xr-x 2 root root 112 Mar 27 15:16 .<br />
0 drwxr-xr-x 4 root root 96 Mar 27 14:52 ..<br />
0 lrwxrwxrwx 1 root root 16 Mar 27 14:52 K90sitefs -><br />
../init.d/sitefs<br />
0 lrwxrwxrwx 1 root root 16 Mar 27 14:52 S10sitefs -><br />
../init.d/sitefs<br />
bglsn:/bgl/dist/etc/rc.d/rc3.d #<br />
Checking that the file system is configured on the SN<br />
To check that GPFS is configured, we ran both the mmlsconfig and the
mmlscluster commands. Example 5-34 shows the following important
information from the mmlsconfig command:<br />
► Cluster name - bgIO.itso.ibm.com<br />
► pagepool - we have set it to 128M (the maximum recommended value)<br />
► cipherList set to AUTHONLY<br />
Example 5-34 The mmlsconfig output<br />
bglsn:/tmp # mmlsconfig<br />
Configuration data for cluster bgIO.itso.ibm.com:<br />
-------------------------------------------------
clusterName bgIO.itso.ibm.com
clusterId 12402351528774401789<br />
clusterType lc<br />
multinode yes<br />
autoload yes<br />
useDiskLease yes<br />
maxFeatureLevelAllowed 813<br />
cipherList AUTHONLY<br />
pagepool 128M<br />
File systems in cluster bgIO.itso.ibm.com:<br />
------------------------------------------<br />
(none)<br />
Example 5-35 shows the output of the mmlscluster command, which displays
the following relevant information:
► Cluster name - on SN this is bgIO.itso.ibm.com<br />
► Remote shell command - on SN must be /usr/bin/ssh.<br />
► Remote copy command - on SN must be /usr/bin/scp<br />
► SN is the only quorum node.<br />
Example 5-35 The mmlscluster output<br />
bglsn:/tmp # mmlscluster<br />
GPFS cluster information<br />
========================<br />
GPFS cluster name: bgIO.itso.ibm.com<br />
GPFS cluster id: 12402351528774401789<br />
GPFS UID domain: bgIO.itso.ibm.com<br />
Remote shell command: /usr/bin/ssh<br />
Remote file copy command: /usr/bin/scp<br />
GPFS cluster configuration servers:<br />
-----------------------------------<br />
Primary server: bglsn_fn.itso.ibm.com<br />
Secondary server: (none)<br />
Node number  Node name  IP address  Full node name         Remarks
-------------------------------------------------------------------
          1  bglsn_fn   172.30.1.1  bglsn_fn.itso.ibm.com  quorum node
2 ionode1 172.30.2.1 ionode1<br />
3 ionode2 172.30.2.2 ionode2<br />
4 ionode3 172.30.2.3 ionode3<br />
5 ionode4 172.30.2.4 ionode4<br />
6 ionode5 172.30.2.5 ionode5<br />
7 ionode6 172.30.2.6 ionode6<br />
8 ionode7 172.30.2.7 ionode7<br />
9 ionode8 172.30.2.8 ionode8<br />
Example 5-34 and Example 5-35 also reveal that GPFS is configured correctly. If<br />
you have problems with GPFS cluster configuration, refer to the installation<br />
instructions and to GPFS manuals (see 5.3.11, “References” on page 264).<br />
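The expected values listed above can be verified with a small script. The following is a minimal sketch of our own (not a GPFS tool): it parses a saved copy of the mmlsconfig output for the cipherList setting, so it runs anywhere; on a live SN you would pipe the output of mmlsconfig itself.

```shell
# Check cipherList in saved mmlsconfig output (sample rows from Example 5-34).
mmlsconfig_out='clusterName bgIO.itso.ibm.com
clusterType lc
cipherList AUTHONLY
pagepool 128M'

# Pull the cipherList value out of the key/value lines.
cipher=$(printf '%s\n' "$mmlsconfig_out" | awk '$1 == "cipherList" { print $2 }')
if [ "$cipher" = "AUTHONLY" ]; then
    echo "cipherList OK"
else
    echo "cipherList misconfigured: ${cipher:-not set}"
fi
```

The same pattern works for any of the other expected key/value pairs (pagepool, clusterType, and so on).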
260 IBM System Blue Gene Solution: <strong>Problem</strong> <strong>Determination</strong> <strong>Guide</strong><br />
Checking that the SN is authorized to mount the GPFS-FS<br />
To check that the SN is authorized to mount the GPFS file system, use the<br />
mmremotefs command, as shown in Example 5-36.<br />
Example 5-36 Checking authorized access for the remote file system<br />
bglsn:/tmp # mmremotefs show all<br />
Local Name  Remote Name  Cluster name  Mount Point  Mount Options  Automount<br />
bubu_gpfs1  gpfs1        gpfsNSD       /bubu        rw             yes<br />
This output shows that the SN is authorized to mount the remote gpfs1 file<br />
system (local name bubu_gpfs1) at /bubu from the cluster named gpfsNSD for both<br />
read and write operations. If this is not correct, use the mmremotefs command to<br />
fix the definition (see also 5.3.8, “Cross mounting the<br />
GPFS file system on to Blue Gene/L cluster” on page 246).<br />
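This check can also be scripted. The sketch below is our own helper: it parses the mmremotefs data row from Example 5-36, embedded as a string so the fragment is self-contained; on a live SN you would feed it the output of mmremotefs show all directly.

```shell
# Data row from `mmremotefs show all`: local name, remote name, cluster,
# mount point, mount options, automount.
remotefs_out='bubu_gpfs1 gpfs1 gpfsNSD /bubu rw yes'

# Extract the mount options column for the file system we care about.
opts=$(printf '%s\n' "$remotefs_out" | awk '$1 == "bubu_gpfs1" { print $5 }')
if [ "$opts" = "rw" ]; then
    echo "SN has read-write access to bubu_gpfs1"
else
    echo "unexpected mount options: ${opts:-none}"
fi
```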
Checking that GPFS-FS is configured on gpfsNSD cluster<br />
To check that GPFS is configured on a cluster, we run both the mmlsconfig<br />
command and the mmlscluster command. Example 5-37 shows the following<br />
important information from the mmlsconfig command:<br />
► Cluster name - gpfsNSD<br />
► pagepool - if not shown, it means it is set to default (64 MB)<br />
► cipherList set to AUTHONLY<br />
► local file system device name is /dev/gpfs1<br />
Example 5-37 mmlsconfig output on the gpfsNSD cluster<br />
[p630n01][/]> mmlsconfig<br />
Configuration data for cluster gpfsNSD:<br />
-----------------------------------------<br />
clusterName gpfsNSD<br />
clusterId 12402351657622744194<br />
clusterType lc<br />
multinode yes<br />
autoload no<br />
useDiskLease yes<br />
maxFeatureLevelAllowed 813<br />
cipherList AUTHONLY<br />
[p630n01_fn]<br />
File systems in cluster gpfsNSD:<br />
-----------------------------------<br />
/dev/gpfs1<br />
We also ran the mmlscluster command. Example 5-38 shows the following<br />
relevant information from the mmlscluster command:<br />
► Cluster name - gpfsNSD<br />
► Remote shell command - /usr/bin/ssh<br />
► Remote copy command - /usr/bin/scp<br />
► Configured nodes - in our cluster all three nodes participate in quorum<br />
decisions<br />
Example 5-38 The mmlscluster output on the gpfsNSD cluster<br />
[p630n01][/]> mmlscluster<br />
GPFS cluster information<br />
========================<br />
GPFS cluster name: gpfsNSD<br />
GPFS cluster id: 12402351657622744194<br />
GPFS UID domain: gpfsNSD<br />
Remote shell command: /usr/bin/ssh<br />
Remote file copy command: /usr/bin/scp<br />
GPFS cluster configuration servers:<br />
-----------------------------------<br />
Primary server: p630n01_fn<br />
Secondary server: p630n02_fn<br />
Node number Node name IP address Full node name Remarks<br />
-----------------------------------------------------------------------------------<br />
1 p630n01_fn 172.30.1.31 p630n01_fn quorum node<br />
2 p630n02_fn 172.30.1.32 p630n02_fn quorum node<br />
3 p630n03_fn 172.30.1.33 p630n03_fn quorum node<br />
As Example 5-38 shows, GPFS is configured correctly. If<br />
you have problems with the GPFS cluster configuration, refer to the installation<br />
and problem determination instructions found in the GPFS manuals (see 5.3.11,<br />
“References” on page 264).<br />
Checking that GPFS-FS can mount on the gpfsNSD cluster<br />
To mount the GPFS file system on the gpfsNSD cluster, check the locally<br />
configured file system device name from “Checking that GPFS-FS is configured<br />
on gpfsNSD cluster” on page 261, then use the mount command:<br />
[p630n01][/]> mount /dev/gpfs1<br />
GPFS: 6027-514 Cannot mount /dev/gpfs1 on /gpfs1: Already mounted.<br />
As you can see, in this case the file system was already mounted.<br />
Checking that the GPFS-FS disks are available on gpfsNSD<br />
Use the mmlsdisk command to check that the disks belonging to the GPFS-FS<br />
are available and sane. This is only required if the file system cannot be locally<br />
mounted on the gpfsNSD cluster (Example 5-39).<br />
Example 5-39 Checking disk availability<br />
[p630n01][/]> mmlsdisk /dev/gpfs1<br />
disk         driver   sector failure holds    holds<br />
name         type     size   group   metadata data  status        availability<br />
------------ -------- ------ ------- -------- ----- ------------- ------------<br />
GPFS1_n01_b nsd 512 1 yes yes ready up<br />
GPFS2_n01_a nsd 512 1 yes yes ready up<br />
GPFS3_n02_b nsd 512 2 yes yes ready up<br />
GPFS4_n02_a nsd 512 2 yes yes ready up<br />
As you can see, in this case all the disks allocated for the GPFS-FS are ready<br />
and available. If some of these disks were unavailable (“down”) this would<br />
prevent the GPFS-FS from mounting. To fix this problem, first check that all disks<br />
are properly connected to the servers and available to the operating system, then<br />
use the mmchdisk command to recover the disks to the ready state.<br />
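The recovery step can be sketched as a dry run. The helper below is ours, not part of GPFS: it scans mmlsdisk-style rows (with one disk deliberately marked down for illustration) and prints the mmchdisk invocation that would bring each down disk back; nothing is actually changed.

```shell
# Sample mmlsdisk rows: name, type, sector size, failure group,
# metadata, data, status, availability. One disk is marked down here.
mmlsdisk_out='GPFS1_n01_b nsd 512 1 yes yes ready up
GPFS3_n02_b nsd 512 2 yes yes ready down'

# Collect disks whose availability column reads "down".
down_disks=$(printf '%s\n' "$mmlsdisk_out" | awk '$8 == "down" { print $1 }')
for d in $down_disks; do
    # Print (do not run) the recovery command for each down disk.
    echo "mmchdisk /dev/gpfs1 start -d \"$d\""
done
```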
Checking that the bgIO cluster is authorized to mount the<br />
GPFS-FS<br />
Use the mmauth show all command to check whether the bgIO cluster is authorized to<br />
mount the GPFS-FS, as shown in Example 5-40.<br />
Example 5-40 Checking file system is “exported” on storage cluster (gpfsNSD)<br />
[p630n01][/]> mmauth show all<br />
Cluster name: bgIO.itso.ibm.com<br />
Cipher list: AUTHONLY<br />
SHA digest: 22813e0fa7f4aa76982cd33cf705ee8c085b21a0<br />
File system access: gpfs1 (rw, root allowed)<br />
Cluster name: gpfsNSD (this cluster)<br />
Cipher list: AUTHONLY<br />
SHA digest: d02d0d706c8f7f14ce6366e1d6fe8a1a217ae1c5<br />
File system access: (all rw)<br />
As you can see, in this case the bgIO cluster named bgIO.itso.ibm.com is<br />
authorized to mount the locally created GPFS-FS device called gpfs1.<br />
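This final check can be scripted as well. The sketch below (again our own helper, with the relevant mmauth show all lines saved as a string so it runs anywhere) confirms that the remote cluster is granted read-write access to gpfs1.

```shell
# Relevant lines from `mmauth show all` on the storage cluster.
mmauth_out='Cluster name: bgIO.itso.ibm.com
Cipher list: AUTHONLY
File system access: gpfs1 (rw, root allowed)'

# Extract the access grant and verify that it covers gpfs1 read-write.
access=$(printf '%s\n' "$mmauth_out" | awk -F': ' '/File system access/ { print $2 }')
case "$access" in
    gpfs1*rw*) echo "bgIO is authorized to mount gpfs1 read-write" ;;
    *)         echo "authorization missing or read-only: ${access:-none}" ;;
esac
```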
5.3.11 References<br />
For more information about GPFS commands and concepts, refer to the<br />
GPFS V2.3 documentation, which is also available from the Cluster Information<br />
Center:<br />
► Concepts, Planning, and Installation <strong>Guide</strong>, GA22-7968-02<br />
http://publib.boulder.ibm.com/infocenter/clresctr/topic/com.ibm.clus<br />
ter.gpfs.doc/gpfs23/bl1ins10/bl1ins10.html<br />
► Administration and Programming Reference, SA22-7967-02<br />
http://publib.boulder.ibm.com/infocenter/clresctr/topic/com.ibm.clus<br />
ter.gpfs.doc/gpfs23/bl1adm10/bl1adm10.html<br />
► <strong>Problem</strong> <strong>Determination</strong> <strong>Guide</strong>, GA22-7969-02<br />
http://publib.boulder.ibm.com/infocenter/clresctr/topic/com.ibm.clus<br />
ter.gpfs.doc/gpfs23/bl1pdg10/bl1pdg10.html<br />
► Documentation updates<br />
http://publib.boulder.ibm.com/infocenter/clresctr/topic/com.ibm.clus<br />
ter.gpfs.doc/gpfs23_doc_updates/docerrata.html<br />
Chapter 6. Scenarios<br />
This chapter contains a variety of problem determination scenarios that we<br />
captured while running on the Blue Gene/L system that we used in the<br />
development of this redbook. We constructed each scenario based on the<br />
problem determination methodology that we discuss throughout the book.<br />
We approach each problem with similar patterns:<br />
► A description of the problem<br />
► Detailed checking on how related problems can be revealed<br />
► How to resolve the problem or to transfer the problem to other scenarios<br />
► Lessons learned<br />
© Copyright IBM Corp. 2006. All rights reserved. 265
6.1 Introduction<br />
In some scenarios, we intentionally inject an error that we assume could<br />
happen in real life and cause problems. Creating these scenarios helps<br />
hone the problem determination procedures, and if we can create an error in<br />
the lab, there is a chance that a similar problem can occur in the field. The<br />
following list contains the error injection scenarios that are presented in this chapter.<br />
Due to design and usability considerations, we have divided a Blue Gene/L<br />
environment into core system and functional components, and the scenarios are<br />
grouped into categories of problems in:<br />
► Blue Gene/L core system<br />
– Hardware (cards, power supplies, cables and so forth)<br />
– Software (DB2 and processes)<br />
– System configuration (remote command execution: ssh, rsh, NFS)<br />
► File system (NFS and GPFS)<br />
► Job submission (mpirun and LoadLeveler)<br />
Depending on past experience and the actual situation, there is usually more<br />
than one way to approach a problem. Our intention is that our problem<br />
determination methodology leads to one of these categories. At the beginning of<br />
a category, scanning through the table of scenarios can help you spot a similar<br />
(if not identical) problem.<br />
In describing the problem at hand, we try not to give away its cause. However,<br />
a hunch about what has just happened or gone wrong can serve as a starting<br />
point. This starting point is chosen from the multiple hypotheses listed in the<br />
problem description section.<br />
The starting point leads to a checklist. The checklists that we discuss in this<br />
chapter are specific to the category. This time, with a problem at hand, detailed<br />
and specific checks are performed at every step of identifying the problem,<br />
based on the checklist.<br />
At the end of each scenario, the problem might be resolved or just identified.<br />
Otherwise, some pointer is provided to aid in transferring the problem<br />
determination to another scenario. Because all of this process is carried out on a<br />
new system that has just been set up, we might run into pitfalls and unexpected<br />
findings. These findings are included in the section on what we have learned.<br />
When a new problem is discovered, we use the same methodology. A new<br />
scenario is created and added into a category. Odd problems are gathered under<br />
the miscellaneous scenarios.<br />
6.2 Blue Gene/L core system scenarios<br />
This section looks into problem scenarios that will affect the core Blue Gene/L<br />
components. The core system consists of:<br />
► Blue Gene/L racks<br />
► Functional Network<br />
► Service Network<br />
► Service Node<br />
► Blue Gene/L Database (running on Service Node)<br />
► Blue Gene/L system processes<br />
Here is a list of scenarios we tested for the core Blue Gene/L system:<br />
1. Hardware error: Compute card error<br />
2. Functional network: Defective cable<br />
3. Service network: Defective cable<br />
4. Service Node functional network interface down<br />
5. SN service network interface down<br />
6. The /bgl file system full on the SN (no GPFS)<br />
7. The / file system full on the SN<br />
8. The /tmp file system is full on the SN<br />
9. The ciodb daemon is not running on the SN<br />
10.The idoproxy daemon not running on the SN<br />
11.The mmcs_server is not running on the SN<br />
12.DB2 not started on the SN<br />
13.The bglsysdb user OS password changed (Linux)<br />
14.Uncontrolled rack power off<br />
In each scenario, we follow the same process. First, we verify that the system is<br />
currently operational by allocating a block in mmcs_db_console, running the<br />
hello.rts application with submit_job, and deallocating the block with free_block.<br />
After we have proved that the system works, we inject the error that we want to<br />
test and try the job submission again. This method should reproduce the<br />
problem, which we then investigate.<br />
Each of the scenarios is split into the following sections.<br />
► Error injection<br />
► <strong>Problem</strong> determination<br />
► Lessons learned<br />
6.2.1 Hardware error: Compute card error<br />
In this scenario, we replaced a compute card in a node card with a compute card<br />
that has a defective chip.<br />
Error injection<br />
We powered off the rack and replaced a compute card with the faulty one.<br />
The discovery process successfully detected the compute card and populated<br />
the DB2 database.<br />
<strong>Problem</strong> determination<br />
Because all resources are available, we try to allocate a block that includes<br />
the faulty compute card. This fails with the message shown in Example 6-1.<br />
Example 6-1 A node fails to boot<br />
mmcs$ allocate R000_J104_32<br />
FAIL<br />
Microloader Assertion<br />
Apr 01 16:55:06 (E) [1088451808] test1:R000_J104_32 RAS event:<br />
KERNEL FATAL: Microloader Assertion<br />
Apr 01 16:55:06 (E) [1088451808] test1:R000_J104_32 RAS event:<br />
KERNEL FATAL: VALIDATE_LOAD_IMAGE_CRC_IN_DRAM<br />
Because the log file mentions a RAS event, we then look<br />
at the RAS event Web page (shown in Figure 6-1).<br />
Figure 6-1 RAS event indicates a location of the faulty card<br />
From the RAS event, we find the location of the failing compute card:<br />
R00-M0-N1-CJ16-U01.<br />
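Location codes like this one can be split mechanically into their components. The interpretation of the fields below (rack, midplane, node card, compute card connector, chip) follows this example and should be treated as illustrative rather than an official naming reference.

```shell
# Split a RAS location code on the "-" separators.
loc='R00-M0-N1-CJ16-U01'
oldIFS=$IFS
IFS=-
set -- $loc           # word-splits into R00 M0 N1 CJ16 U01
IFS=$oldIFS

rack=$1; midplane=$2; nodecard=$3; connector=$4; chip=$5
echo "rack=$rack midplane=$midplane nodecard=$nodecard connector=$connector chip=$chip"
```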
Lessons learned<br />
The error occurred while manipulating a block and was recorded in the<br />
mmcs server log file. The RAS event shows detailed information that was not<br />
included in the log file.<br />
6.2.2 Functional network: Defective cable<br />
In this scenario, we simulate a cable failure on the functional network. We do this<br />
by removing one external Ethernet connection from a Node Card.<br />
Error injection<br />
We physically pulled one of the cables out from the front of R00-M0-N2.<br />
<strong>Problem</strong> determination<br />
We try to allocate the block. It fails with:<br />
mmcs$<br />
no ethernet link<br />
Looking in the bglsn-bgdb0-mmcs_db_server-current.log (latest<br />
mmcs_db_server.log) we see:<br />
Mar 09 13:57:53 (E) [1086973152] root:R000_128 RAS event: KERNEL<br />
FATAL: no ethernet link<br />
We can see a RAS event has been raised, so we look for further evidence in the<br />
RAS database. It shows the same message with a location code:<br />
R00-M0-N2-I:J18-U01. We then look at the physical hardware and find the fault.<br />
Lessons learned<br />
The functional network is essential to the operation of Blue Gene/L. The system<br />
will not operate with even one link disabled.<br />
6.2.3 Service network: Defective cable<br />
In this scenario, we simulate a cable failure on the service network. We do this by<br />
removing a connection to one of the external ports on the Service Card.<br />
Error injection<br />
We physically pulled the GBit cable out from the front of the Service Card.<br />
<strong>Problem</strong> determination<br />
We try to allocate the block. It fails with the message shown in Example 6-2.<br />
Example 6-2 mmcs message: service card link failure<br />
mmcs$<br />
idoproxy communication failure: BGLERR_IDO_PKT_TIMEOUT connection lost<br />
to node/link/service card [R00-M0-N0]<br />
ido://FF:F2:9F:16:F0:95:00:0D:60:E9:0F:6A/CTRL<br />
Looking in the latest mmcs_db_server.log we find the message shown in<br />
Example 6-3.<br />
Example 6-3 Message from mmcs_db_server log<br />
Mar 09 14:43:32 (I) [1084843232] root:R000_128 allocate: FAIL;connect:<br />
idoproxy communication failure: BGLERR_IDO_PKT_TIMEOUT connection lost<br />
to node/link/service card [R00-M0-N0]<br />
ido://FF:F2:9F:16:F0:95:00:0D:60:E9:0F:6A/CTRL<br />
We conclude that we cannot talk to the Service Card over the service network<br />
and fix the problem by replacing the cable.<br />
We then went on to see what happens if we replace the service network GBit link<br />
and pull the IDo link on the Service Card to the service network. We found that<br />
the block booted and the application ran normally.<br />
We then moved on to using Discovery. Nothing happened when we unplugged<br />
the Service Network port marked ‘IDo’ while discovery was running. We did<br />
the same for the GBit port, and the hardware was marked M for missing in the<br />
database.<br />
Lessons learned<br />
We learned the following lessons:<br />
► The Service network is needed to boot a block. However, booting a block<br />
does not use the IDo port on the Service Card.<br />
► A simple ping to the Service Card does not work as a method to see whether<br />
it is alive (because this link does not use IP/ICMP communication).<br />
► The idoproxy uses the GBit port on the front of the Service Card, as does the<br />
Discovery process.<br />
► We observed that the System Controller uses the IDo link to initialize the<br />
Service Card, after it initializes everything to use the GBit connection.<br />
6.2.4 Service Node functional network interface down<br />
In this scenario we simulate the functional network interface being disabled or<br />
removed from the Service Node (SN).<br />
Error injection<br />
We have used the following command on the SN to disable the functional<br />
network interface:<br />
bglsn:/bgl/BlueLight/logs/BGL # ifconfig eth3 down<br />
<strong>Problem</strong> determination<br />
When we try to boot the block, we get the following message:<br />
mmcs$<br />
Error:unable to mount filesystem.<br />
Looking in the latest mmcs_db_server.log we see:<br />
Mar 09 17:37:43 (E) [1086973152] root:R000_128 RAS event: KERNEL<br />
FATAL: Error: unable to mount filesystem<br />
This message is reported for every I/O node. We know that the I/O nodes use<br />
NFS from the SN. The SN checklist directs us to look at the network configuration,<br />
which shows us that eth3 is disabled.<br />
Lessons learned<br />
The NFS file system is mounted on the I/O nodes over the functional network<br />
and is required to run jobs; therefore, the functional network is required to run an<br />
application.<br />
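Checking an interface's state, as the SN checklist directs, can be done with a one-liner. The loopback device is used in this sketch so that it runs anywhere; on the SN you would substitute eth3 (functional network) or eth0 (service network).

```shell
# Report whether an interface carries the UP flag in `ip link` output.
iface=lo
if ip link show "$iface" 2>/dev/null | grep -q 'UP'; then
    state=up
else
    state=down
fi
echo "$iface is $state"
```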
6.2.5 SN service network interface down<br />
In this scenario we simulate the service network interface being disabled or<br />
removed from the SN.<br />
Error injection<br />
We have used the following command on the SN to disable the service network<br />
interface:<br />
bglsn:/bgl/BlueLight/logs/BGL # ifconfig eth0 down<br />
<strong>Problem</strong> determination<br />
When we booted the block, the message shown in Example 6-4 appeared in the<br />
mmcs_db_console.<br />
Example 6-4 The mmcs message: no network connection to service card<br />
mmcs$<br />
idoproxy communication failure: BGLERR_IDO_PKT_TIMEOUT connection lost<br />
to node/link/service card [R00-M0-N0]<br />
ido://FF:F2:9F:16:F0:95:00:0D:60:E9:0F:6A/CTRL<br />
Looking in the latest mmcs_db_server.log, we find the message shown in<br />
Example 6-5.<br />
Example 6-5 Failed connection message recorded mmcs_db_server log<br />
Mar 09 17:04:02 (I) [1084843232] root:R000_128 allocate: FAIL;connect:<br />
idoproxy communication failure: BGLERR_IDO_PKT_TIMEOUT connection lost<br />
to node/link/service card [R00-M0-N0]<br />
ido://FF:F2:9F:16:F0:95:00:0D:60:E9:0F:6A/CTRL<br />
Using the knowledge gained from previous scenarios, we deduce that this is a Service<br />
Network fault. Using the methodology, we check the network interfaces on the SN,<br />
as shown in Example 6-6.<br />
Example 6-6 Checking interface status<br />
bglsn:/bgl/BlueLight/logs/BGL # ip ad<br />
1: lo: mtu 16436 qdisc noqueue<br />
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00<br />
inet 127.0.0.1/8 brd 127.255.255.255 scope host lo<br />
inet6 ::1/128 scope host<br />
valid_lft forever preferred_lft forever<br />
2: eth0: mtu 1500 qdisc pfifo_fast qlen 1000<br />
link/ether 00:0d:60:4d:28:ea brd ff:ff:ff:ff:ff:ff<br />
inet 10.0.0.1/16 brd 10.0.255.255 scope global eth0<br />
In this case, eth0 is not up on the SN. We bring it back up using the ifup<br />
command, and check again, as in Example 6-7.<br />
Example 6-7 Bringing up the Service Network interface (eth0)<br />
bglsn:/bgl/BlueLight/logs/BGL # ifup eth0<br />
bglsn:/bgl/BlueLight/logs/BGL # ip ad<br />
1: lo: mtu 16436 qdisc noqueue<br />
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00<br />
inet 127.0.0.1/8 brd 127.255.255.255 scope host lo<br />
inet6 ::1/128 scope host<br />
valid_lft forever preferred_lft forever<br />
2: eth0: mtu 1500 qdisc pfifo_fast qlen 1000<br />
link/ether 00:0d:60:4d:28:ea brd ff:ff:ff:ff:ff:ff<br />
inet 10.0.0.1/16 brd 10.0.255.255 scope global eth0<br />
inet6 fe80::20d:60ff:fe4d:28ea/64 scope link<br />
valid_lft forever preferred_lft forever<br />
After bringing up the interface we can boot the block.<br />
Lessons learned<br />
The service network is required to boot a block because the idoproxy sends the<br />
microloader through the Service Network (IDo bridge).<br />
6.2.6 The /bgl file system full on the SN (no GPFS)<br />
This is a scenario to see what happens when /bgl is 100% full on the SN.<br />
Note: In this scenario we do not use GPFS.<br />
Error injection<br />
Use dd to create a huge file in /bgl on the SN that fills the file system up to 100%.<br />
A similar error would appear if you run out of inodes on the same /bgl file system.<br />
<strong>Problem</strong> determination<br />
We submitted the job but did not receive its output. We got the<br />
message shown in Example 6-8.<br />
Example 6-8 Job error due to /bgl file system full<br />
mmcs$ list_jobs<br />
OK<br />
JOBID STATUS USERNAME BLOCKID  EXECUTABLE<br />
28    E      root     R000_128 /bgl/hello/hello.rts<br />
The job is in an error state. We look at the output for the job:<br />
could not open /bgl/hello/R000_128-28.stdout: No space left on<br />
device<br />
Looking in the system logs we see the ciodb records shown in Example 6-9.<br />
Example 6-9 Extract from ciodb records revealing file system full<br />
Mar 13 14:35:35 (I) [1074048864] Starting Job 28<br />
Mar 13 14:35:35 (I) [1079563488] New thread 1079563488, for jobid 28<br />
Mar 13 14:35:35 (I) [1079563488] Jobid is 28, homedir is /bgl/hello<br />
Mar 13 14:35:35 (E) [1079563488] 0x4058bea8<br />
Mar 13 14:35:35 (E) [1079563488] could not open<br />
/bgl/hello/R000_128-28.stdout: No space left on device<br />
Mar 13 14:35:35 (E) [1079563488] Job 28 set to START_ERROR, exit<br />
status= 255, errtext= could not open /bgl/hello/R000_128-28.stdout: No<br />
space left on device<br />
Mar 13 14:35:35 (I) [1079563488] cleanup job polling thread 1079563488<br />
Correlating the messages, and using df /bgl on the SN, we determine that the<br />
/bgl file system (which is used for this job’s output) is full.<br />
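A df-based check like this is easy to script. The sketch below checks /tmp so that it runs anywhere; on the SN you would point it at /bgl, the file system that holds the job output.

```shell
# Report how full a file system is, in the spirit of `df /bgl` on the SN.
fs=/tmp
# -P forces single-line POSIX output; column 5 is the Use% figure.
pct=$(df -P "$fs" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }')
if [ "$pct" -ge 100 ]; then
    echo "$fs is full: job output (stdout/stderr) cannot be written"
else
    echo "$fs is ${pct}% used"
fi
```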
Lessons learned<br />
We learned the following lessons:<br />
► We need space for jobs to write their output (stdout(1), stderr(2)).<br />
► ciodb records any problems while writing the output file. In Example 6-9, (I)<br />
marks an informational message and (E) marks an error. ciodb talks to the ciod<br />
daemons on the I/O nodes (which perform the file I/O operations), and the ciod<br />
daemons report back that they are not able to write to the NFS file system (/bgl<br />
exported from the SN).<br />
6.2.7 The / file system full on the SN<br />
This is a scenario that tests what happens when / is 100% full on the SN.<br />
Error injection<br />
Use the dd command to create a huge file in / on the SN that fills the file system<br />
(100% reported by the df / command run on the SN).<br />
<strong>Problem</strong> determination<br />
We submitted the job, and it ran OK. There were no error messages related to the job.<br />
Lessons learned<br />
We learned the following lessons:<br />
► The root (“/”) file system is not written to when running a job.<br />
► Having / full does not affect running jobs on Blue Gene/L. However, some OS<br />
functionality (Linux) and the database (DB2) will eventually have problems if<br />
they cannot write to ‘/’, and this in turn will have an effect on Blue Gene/L<br />
system processes.<br />
6.2.8 The /tmp file system is full on the SN<br />
This is a scenario to see what happens when /tmp is 100% full on the SN.<br />
Error injection<br />
Use dd to create a huge file in /tmp on the SN that fills the file system up to 100%.<br />
<strong>Problem</strong> determination<br />
Allocating a block fails with the following message:<br />
mmcs$ allocate R000_128<br />
DBBlockController::allocateBlock failed: invalid XML<br />
Looking in the latest mmcs_db_server.log we see the output shown in<br />
Example 6-10.<br />
Example 6-10 Message when allocating a block fails due to /tmp full<br />
Mar 13 12:12:29 (I) [1084843232] root allocate R000_128<br />
Mar 13 12:12:29 (I) [1084843232] root<br />
DBMidplaneController::addBlock(R000_128)<br />
Mar 13 12:12:32 (I) [1084843232] root:R000_128 allocate:<br />
FAIL;DBBlockController::allocateBlock failed: invalid XML<br />
Mar 13 12:12:32 (I) [1084843232] root:R000_128<br />
DBMidplaneController::removeBlock(R000_128)<br />
Mar 13 12:12:32 (I) [1084843232] root DBBlockController::disconnect()<br />
setBlockState(R000_128, FREE) successful<br />
Using the methodology, we are directed to check whether any file systems are<br />
full. We find /tmp 100% full. Freeing up space allows the block to allocate and the<br />
job to run. However, other processes might also be affected (no pty can be<br />
created when /tmp is full), so a reboot in maintenance mode might be required<br />
for the SN.<br />
Lessons learned<br />
The /tmp directory is used when a block is manipulated. An image of the block in<br />
XML format is written to /tmp. You must have space in /tmp for Blue Gene/L to<br />
work.<br />
6.2.9 The ciodb daemon is not running on the SN<br />
In this scenario, we check what happens when the ciodb daemon is not running<br />
and whether this affects starting a job.<br />
Error injection<br />
Even though this scenario might seem unlikely, we considered that it could<br />
actually happen (the ciodb daemon not running) due to a bad library (OS upgrade,<br />
Blue Gene/L driver update, and so on).<br />
First, we tried to kill the ciodb daemon (kill -9 ciodb_pid), but the bglmaster<br />
daemon respawned the process. We then attached a debugger to the<br />
process and stopped its execution, as shown in Example 6-11.<br />
Example 6-11 Attaching the debugger to the ciodb process<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster start<br />
Starting BGLMaster<br />
Logging to<br />
/bgl/BlueLight/logs/BGL/bglsn-bglmaster-2006-0314-12:29:03.log<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster status<br />
idoproxy started [30625]<br />
ciodb started [30626]<br />
mmcs_server started [30627]<br />
monitor stopped<br />
perfmon stopped<br />
bglsn:/bgl/BlueLight/logs/BGL # gdb --pid=30626<br />
<strong>Problem</strong> determination<br />
We then submit a job and monitor it for a while. We see that it just sits in the<br />
starting (S) state (JOBID #38 in Example 6-12).<br />
Example 6-12 Checking the job status<br />
mmcs$ list_blocks<br />
OK<br />
R000_128 root(1) connected<br />
mmcs$ list_jobs<br />
OK<br />
JOBID STATUS USERNAME BLOCKID  EXECUTABLE<br />
38    S      root     R000_128 /bgl/hello/hello.rts<br />
We further investigate the RAS events, but we cannot find anything relevant.<br />
We looked at the Runtime information in the Web browser and found the status<br />
of job #38 listed as “Ready to start” and the block in the initialized<br />
(I) state.<br />
We further checked the Configuration information in the Web browser and found<br />
no disabled hardware (everything OK).<br />
Looking in the system logs for ciodb, we see the messages shown in<br />
Example 6-13.<br />
Example 6-13 The ciodb log messages<br />
03/14/06-12:29:03 ./startciodb STARTED<br />
03/14/06-12:29:03 RUN CIODB ./ciodb --useDatabase BGL --dbproperties<br />
db.properties --nortschecking<br />
03/14/06-12:29:03 logging to<br />
/bgl/BlueLight/logs/BGL/bglsn-ciodb-2006-0314-12:29:03.log<br />
Mar 14 12:29:03 (I) [1074052960] ciodb[30626]: started: $Name:<br />
V1R2M1_020_2006 $<br />
Mar 14 12:29:03 (E) [1074052960] No running job records in the database<br />
We look further for the job with JOBID #38, but there is no record for this JOBID.<br />
Instead, we see the messages for the job with JOBID #37, which looks OK, as<br />
shown in Example 6-14.<br />
Example 6-14 Records from ciodb showing correct execution of Job #37<br />
Mar 14 12:27:02 (I) [1074052960] Starting Job 37<br />
Mar 14 12:27:02 (I) [1079567584] New thread 1079567584, for jobid 37<br />
Mar 14 12:27:02 (I) [1079567584] Jobid is 37, homedir is /bgl/hello/<br />
contacting control node 0 at 172.30.2.1:7000...ok<br />
contacting control node 1 at 172.30.2.2:7000...ok<br />
contacting control node 2 at 172.30.2.5:7000...ok<br />
contacting control node 3 at 172.30.2.6:7000...ok<br />
contacting control node 4 at 172.30.2.7:7000...ok<br />
contacting control node 5 at 172.30.2.8:7000...ok<br />
contacting control node 6 at 172.30.2.3:7000...ok<br />
contacting control node 7 at 172.30.2.4:7000...Mar 14 12:27:04 (I)<br />
[1079567584] Job loaded: 37<br />
Mar 14 12:27:04 (I) [1079567584] About to launch /bgl/hello/hello.rts<br />
Mar 14 12:27:04 (I) [1079567584] Job 37 set to RUNNING<br />
Mar 14 12:27:04 (E) [1079567584] Job 37 set to TERMINATED, exit status=<br />
0, errtext=<br />
Mar 14 12:27:04 (I) [1079567584] cleanup job polling thread 1079567584<br />
We conclude that ciodb is not handling our job submission. At<br />
this point we suspect ciodb, so we try to stop and restart the bglmaster (see<br />
Example 6-15).<br />
Example 6-15 Stopping and checking the bglmaster<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster stop<br />
Stopping BGLMaster<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster status<br />
idoproxy started [30625]<br />
ciodb started [30626]<br />
mmcs_server started [30627]<br />
monitor stopped<br />
perfmon stopped<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster status<br />
Timed out on socket connection to BGLMaster daemon at 127.0.0.1:32035<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # netstat -a | grep 32035<br />
tcp   0   0  localhost:32035  *:*              LISTEN<br />
tcp   1   0  localhost:32035  localhost:42999  CLOSE_WAIT<br />
tcp   0   0  localhost:42999  localhost:32035  FIN_WAIT2<br />
At first, we can see that the bglmaster does not stop. The bglmaster status<br />
command returns a socket timeout message for port localhost:32035. We then<br />
check with netstat -a | grep 32035 and see a connection in FIN_WAIT2 status.<br />
The connection on this port must close before bglmaster can complete the stop<br />
request. We use the ps -ef command to see the status of the system processes<br />
bglmaster, ciodb, idoproxy, and mmcs_db_server, as shown<br />
in Example 6-16.<br />
Example 6-16 Checking the bglmaster process<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ps -ef | grep -i bglmaster|grep -v grep<br />
root 30619 1 0 12:29 ? 00:00:00 ./BGLMaster --consoleip 127.0.0.1<br />
--consoleport 32035 --configfile bglmaster.init --autorestart y --db2profile<br />
/dbhome/bgdb2cli/sqllib/ db2profile --dbproperties db.properties<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ps -ef | grep -i idoproxy|grep -v grep<br />
root 21551 26793 0 14:59 pts/5 00:00:00 grep -i idoproxy<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ps -ef | grep -i ciodb |grep -v grep<br />
root 30626 1 0 12:29 ? 00:00:00 [ciodb] &lt;defunct&gt;
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ps -ef | grep -i mmcs_server |grep -v grep<br />
root 21601 26793 0 15:00 pts/5 00:00:00 grep -i mmcs_server<br />
We can see that idoproxy and mmcs_db_server are not running. However,<br />
because bglmaster would not stop and we also see a ciodb process,<br />
we first kill the bglmaster, then the ciodb process:<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # kill -9 30619; kill -9<br />
30626<br />
Now that we have removed the stale processes, we can try to restart them, as
shown in Example 6-17.
Example 6-17 Restarting the bglmaster after cleanup<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster start<br />
Starting BGLMaster<br />
Logging to<br />
/bgl/BlueLight/logs/BGL/bglsn-bglmaster-2006-0314-15:01:17.log<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster status<br />
idoproxy started [21809]<br />
ciodb started [21810]<br />
mmcs_server started [21811]<br />
monitor stopped<br />
perfmon stopped<br />
When the bglmaster started cleanly, our job (#38), which was stuck in (S)tarting
status, ran correctly.
Chapter 6. Scenarios 279
Lessons learned<br />
We learned the following lessons:<br />
► The ciodb daemon controls the submission of jobs to Blue Gene/L. If it is
not running, jobs do not start.
► If your jobs are not running, it is a good idea to check your system’s Blue
Gene/L daemons using the ps -ef command, and look for any defunct
Blue Gene/L system processes.
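That check can be scripted as a convenience. This is a sketch only: the daemon names are the ones used in this chapter and may differ from what your ps output shows (for example, idoproxy appears as idoproxydb on some systems).

```shell
# Print any Blue Gene/L system daemons that are defunct (zombie) in the
# process table; prints nothing when everything is clean.
scan_defunct() {
    for d in BGLMaster idoproxy ciodb mmcs_server; do
        ps -eo pid=,stat=,comm= 2>/dev/null |
            awk -v d="$d" '$3 == d && $2 ~ /^Z/ {print d " (pid " $1 ") is defunct"}'
    done
}
scan_defunct
```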
6.2.10 The idoproxy daemon not running on the SN<br />
In this scenario we stop the idoproxy daemon to check the impact on starting
and running a job.
Error injection<br />
Again, we use the same debugger technique as we did for ciodb: attach the gdb<br />
debugger to the idoproxy process, as in Example 6-18.<br />
Example 6-18 Attaching gdb to the idoproxy process<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster status<br />
idoproxy started [12455]<br />
ciodb started [14394]<br />
mmcs_server started [14395]<br />
monitor stopped<br />
perfmon stopped<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ps -ef | grep -i ido<br />
root 12455 14324 0 10:42 ? 00:00:00 ./idoproxydb<br />
-enableflush -loguserinfo db.properties BlueGene1<br />
bglsn:/bgl/BlueLight/logs/BGL # gdb --pid=12455<br />
Problem determination
When we try to allocate the block we get the following error in the<br />
mmcs_db_console:<br />
connect: idoproxy communication failure: socket recv timeout<br />
The following message was also found in the mmcs server log:<br />
Mar 14 10:48:33 (I) [1084843232] root:R000_128 allocate:<br />
FAIL;connect: idoproxy communication failure: socket recv timeout<br />
No other message was found in any of the remaining system logs. Moreover,<br />
bglmaster shows everything is running, as shown in Example 6-19.<br />
Example 6-19 Bglmaster status with idoproxy stopped<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster status<br />
idoproxy started [12455]<br />
ciodb started [14394]<br />
mmcs_server started [14395]<br />
monitor stopped<br />
perfmon stopped<br />
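Note that bglmaster status only reports the pid it spawned; as this scenario shows, that process can be stopped (here, held by the gdb debugger) or defunct while still listed as started. A small cross-check against the process table (our own helper, not part of the product) makes the real state visible:

```shell
# Classify a pid by its process state: gone, defunct (Z), stopped or
# traced (T/t, e.g. held by a debugger), or alive.
pid_alive() {
    stat=$(ps -o stat= -p "$1" 2>/dev/null | tr -d ' ')
    case "$stat" in
        "")     echo "pid $1: gone" ;;
        Z*)     echo "pid $1: defunct" ;;
        T*|t*)  echo "pid $1: stopped or traced" ;;
        *)      echo "pid $1: alive" ;;
    esac
}
pid_alive 12455   # the idoproxy pid reported by bglmaster status
```

On our system this reports the gdb-held idoproxy pid as stopped or traced even though bglmaster still reports it as started.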
Because we suspect an issue with idoproxy (which is controlled by the<br />
bglmaster), we then try to restart the bglmaster:<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster restart<br />
Stopping BGLMaster<br />
....<br />
The restart command hangs while stopping the bglmaster, waiting for the socket
on port 32035 (owned by idoproxy) to close the connection so it can finish the
actual stopping process. Because this does not happen, we have to open another
terminal to kill the “hanging” process, then restart the bglmaster, as in
Example 6-20.
Example 6-20 Recovering from a “hanging” idoproxy daemon<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ps -ef | grep -i bglmaster| grep -v grep<br />
root 14324 1 0 Mar09 ? 00:00:00 ./BGLMaster --consoleip 127.0.0.1<br />
--consoleport 32035 --configfile bglmaster.init --autorestart y --db2profile<br />
/dbhome/bgdb2cli/sqllib/ db2profile --dbproperties db.properties<br />
## >><br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster stop<br />
Stopping BGLMaster<br />
## ><br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster status<br />
Timed out on socket connection to BGLMaster daemon at 127.0.0.1:32035<br />
## ><br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ps -ef | grep -i idoproxy|grep -v grep<br />
root 12455 1 0 12:29 ? 00:00:00 [idoproxy] &lt;defunct&gt;
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # kill -9 12455
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ps -ef | grep -i ciodb |grep -v grep<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ps -ef | grep -i mmcs_server |grep -v grep<br />
root 14395 14395 0 15:00 pts/5 00:00:00 grep -i mmcs_server<br />
## ><br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster stop<br />
Stopping BGLMaster<br />
## ><br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster start<br />
Starting BGLMaster<br />
Logging to /bgl/BlueLight/logs/BGL/bglsn-bglmaster-2006-0406-17:07:53.log<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster status<br />
idoproxy started [8260]<br />
ciodb started [8261]<br />
mmcs_server started [8262]<br />
monitor stopped<br />
perfmon stopped<br />
Finally, we can now start mmcs_db_console and run a job.<br />
Lessons learned<br />
We learned the following lessons:<br />
► mmcs_db_console requires a connection to idoproxy to work.<br />
► If you are having problems with Blue Gene/L system processes (controlled by<br />
bglmaster), ensure that all old instances of the system processes are<br />
properly cleaned up before restarting bglmaster.<br />
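The cleanup-then-restart sequence from this and the previous scenario can be collected into one helper. This is a sketch under our assumptions: the install path is the default from our test system, and the process names match the ones shown by ps in these scenarios; adjust both to your environment.

```shell
# Kill any stale Blue Gene/L daemons, then restart bglmaster.
clean_restart() {
    bglbin=${1:-/bgl/BlueLight/ppcfloor/bglsys/bin}
    for d in BGLMaster idoproxy ciodb mmcs_server; do
        pkill -9 -x "$d" 2>/dev/null && echo "killed stale $d" || :
    done
    if [ -x "$bglbin/bglmaster" ]; then
        "$bglbin/bglmaster" start
    else
        echo "bglmaster not found in $bglbin"
    fi
}
clean_restart
```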
6.2.11 The mmcs_server is not running on the SN<br />
In this scenario, we stop mmcs_db_server to check the impact on starting or<br />
running a job.<br />
Error injection<br />
We use the same technique as we used for ciodb and idoproxy (attach the<br />
debugger, then stop the process), as shown in Example 6-21.<br />
Example 6-21 Attaching the debugger to the mmcs_server process<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster status<br />
idoproxy started [21809]<br />
ciodb started [26636]<br />
mmcs_server started [28222]<br />
monitor stopped<br />
perfmon stopped<br />
bglsn:/bgl/BlueLight/logs/BGL # gdb --pid=28222<br />
Problem determination
We try to connect to the mmcs_server using the mmcs_db_console:<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./mmcs_db_console<br />
connecting to mmcs_server<br />
The mmcs_db_console hangs and never returns a prompt (which means we<br />
cannot submit a job), thus we connect to the SN using another terminal, and start<br />
the problem determination.<br />
First, we check the RAS events log, but we cannot find anything relevant.
Next, we look at the “Runtime” and “Configuration” information in the Blue
Gene/L Web browser, but we cannot find anything related to this issue.
Next, we check the status of the Blue Gene/L system processes:<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster status<br />
Timed out on socket connection to BGLMaster daemon at<br />
127.0.0.1:32035<br />
We follow the same procedure as in 6.2.9, “The ciodb daemon is not running on<br />
the SN” on page 276 (ciodb) and 6.2.10, “The idoproxy daemon not running on<br />
the SN” on page 280 (idoproxy). We have to kill all the bglmaster spawned<br />
processes, then we restart the bglmaster.<br />
We can now start mmcs_db_console. We are able to submit and run a job.<br />
Lessons learned<br />
We learned the following lessons:<br />
► The mmcs_server daemon is essential to the running of Blue Gene/L.<br />
► mmcs_db_console is an interface to this process.<br />
► If you are having problems with system processes, ensure that all old system<br />
processes are properly cleaned up before restarting bglmaster.<br />
6.2.12 DB2 not started on the SN<br />
In this scenario we reboot the SN on a system where the DB2 Blue Gene/L
instance is not set to start automatically. This could happen if, during DB2
installation on the SN, automatic restart of the DB2 instances was not selected.
Error injection<br />
We turned off automatic DB2 start at system startup using the command shown<br />
in Example 6-22.<br />
Example 6-22 Turning off automatic DB2 start<br />
bglsn:~ # su -l bglsysdb<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin #<br />
/opt/IBM/db2/V8.1/instance/db2iauto -off bglsysdb<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # db2set -i bglsysdb<br />
DB2COMM=tcpip<br />
Note: If the instance was set to autostart, we would see:<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin #<br />
/opt/IBM/db2/V8.1/instance/db2iauto -on bglsysdb<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # db2set -i bglsysdb<br />
DB2COMM=tcpip<br />
DB2AUTOSTART=YES<br />
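The difference between the two db2set outputs can be checked mechanically. A minimal sketch, reading the db2set output on stdin (DB2AUTOSTART is the registry variable shown above):

```shell
# Read `db2set -i <instance>` output on stdin and report whether the
# instance is set to start automatically at system startup.
autostart_enabled() {
    if grep -q '^DB2AUTOSTART=YES'; then
        echo "autostart on"
    else
        echo "autostart off"
    fi
}
printf 'DB2COMM=tcpip\n' | autostart_enabled   # -> autostart off
```

On a real SN you would pipe `db2set -i bglsysdb` into the function instead of the printf sample.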
We then rebooted the SN.<br />
Problem determination
After the SN rebooted, we try to start the Blue Gene/L system processes, as<br />
shown in Example 6-23.<br />
Example 6-23 Starting bglmaster daemon when DB2 is stopped<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster start<br />
Starting BGLMaster<br />
Logging to /bgl/BlueLight/logs/BGL/bglsn-bglmaster-2006-0314-18:35:29.log<br />
./BGLMaster: error while loading shared libraries: libdb2.so.1: cannot open shared<br />
object file: No such file or directory<br />
bglmaster start command failed: ./BGLMaster --consoleip 127.0.0.1 --consoleport 32035<br />
--configfile bglmaster.init --autorestart y --db2profile ~bgdb2cli/sqllib/db2profile<br />
--dbproperties db.properties 2>&1<br />
>/bgl/BlueLight/logs/BGL/bglsn-bglmaster-2006-0314-18:35:29.log<br />
## ><br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # . /discovery/db.src
SQL30081N A communication error has been detected. Communication protocol<br />
being used: "TCP/IP". Communication API being used: "SOCKETS". Location<br />
where the error was detected: "127.0.0.1". Communication function detecting<br />
the error: "connect". Protocol specific error code(s): "111", "*", "*".<br />
SQLSTATE=08001<br />
The db.src file contains a DB2 connect statement (db2 connect to bgdb0 user
bglsysdb using bglsysdb), which produces the SQL30081N error message shown
above; thus we can conclude that DB2 was not started.
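The libdb2.so.1 load failure in Example 6-23 is worth a separate check: BGLMaster can only start if the DB2 client library is resolvable, which normally requires sourcing the db2profile first. A hedged sketch (the profile path is the one from our test system):

```shell
# Check whether libdb2.so.1 can be resolved before starting the daemons.
db2lib_check() {
    if ldconfig -p 2>/dev/null | grep -q 'libdb2\.so\.1'; then
        echo "libdb2.so.1 found in the linker cache"
    elif echo "${LD_LIBRARY_PATH:-}" | grep -q 'sqllib'; then
        echo "libdb2.so.1 expected via LD_LIBRARY_PATH"
    else
        echo "libdb2.so.1 not resolvable: source ~bgdb2cli/sqllib/db2profile first"
    fi
}
db2lib_check
```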
Next, we tried to see what the RAS drill down shows, and we found the message<br />
shown in Example 6-24 at the top of the Web page.<br />
Example 6-24 Message returned to the Web page when DB2 not running<br />
Warning: odbc_connect(): SQL error: [IBM][CLI Driver] SQL30081N A<br />
communication error has been detected. Communication protocol being<br />
used: "TCP/IP". Communication API being used: "SOCKETS". Location where<br />
the error was detected: "127.0.0.1". Communication function detecting<br />
the error: "connect". Protocol specific error code(s): "111", "*", "*".<br />
SQLSTATE=08001 , SQL state 08001 in SQLConnect in<br />
/bgl/BlueLight/V1R2M1_020_2006-060110/ppc/web/ras.php on line 110<br />
Again, the same SQL message (SQL30081N) indicates that the Web interface<br />
cannot communicate with DB2 either.<br />
We then executed the command shown in Example 6-25 to confirm that DB2 is<br />
not running.<br />
Example 6-25 Checking the DB2 processes<br />
bglsn:~ # ps -ef | grep db2<br />
root 7711 1 0 18:24 ? 00:00:00 /opt/IBM/db2/V8.1/bin/db2fmcd<br />
root 8593 1 0 18:30 pts/0 00:00:00 /dbhome/bgdb2cli/sqllib/bin/db2bp<br />
8468A0 5 A<br />
bglsysdb 10119 1 0 18:32 pts/0 00:00:00 /dbhome/bglsysdb/sqllib/bin/db2bp<br />
9735A1000 5 A<br />
root 10725 1 0 18:35 pts/4 00:00:00 /dbhome/bgdb2cli/sqllib/bin/db2bp<br />
10277A0 5 A<br />
root 12387 12339 0 18:46 pts/7 00:00:00 grep db2<br />
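Note that the db2bp entries in Example 6-25 are only CLP back-end processes, and db2fmcd is the fault monitor; what is missing is the instance engine itself. On our system the engine shows up as db2sysc (a DB2 V8 convention, stated here as an assumption), so a quick test is:

```shell
# Succeed when the DB2 engine process (db2sysc) is in the process
# table; the name is a DB2 V8 convention and may differ on your level.
db2_engine_running() {
    ps -eo comm= 2>/dev/null | grep -qx 'db2sysc'
}
if db2_engine_running; then
    echo "DB2 engine is running"
else
    echo "DB2 engine is NOT running: run db2start as the instance owner"
fi
```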
Now we start DB2 on the SN; after DB2 starts, we also start the bglmaster
daemon, as shown in Example 6-26.
Example 6-26 Starting DB2 and bglmaster<br />
bglsysdb@bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin> db2start<br />
03/14/2006 19:36:50 0 0 SQL1063N DB2START processing was<br />
successful.<br />
SQL1063N DB2START processing was successful.<br />
bglsysdb@bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin> # ./bglmaster start<br />
bglsysdb@bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin> # ./bglmaster status<br />
idoproxy started [21262]<br />
ciodb started [21263]<br />
mmcs_server started [21264]<br />
monitor stopped<br />
perfmon stopped<br />
We submit a job, which starts and runs to completion. Finally, we make sure
DB2 starts automatically the next time the system is booted, as shown in
Example 6-27.
Example 6-27 Turning on automatic DB2 start<br />
bglsn:~ # su -l bglsysdb<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin #<br />
/opt/IBM/db2/V8.1/instance/db2iauto -on bglsysdb<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # db2set -i bglsysdb<br />
DB2COMM=tcpip<br />
DB2AUTOSTART=YES<br />
Lessons learned<br />
We learned the following lessons:<br />
► DB2 is the core element of Blue Gene/L. Nothing works if it is not started.<br />
► DB2 should be configured to automatically start on reboot.<br />
6.2.13 The bglsysdb user OS password changed (Linux)<br />
Because the DB2 user (bglsysdb) password authentication is set to “unix” during
the installation of the SN, we decided to change the UNIX password and test the
effect.
Error injection<br />
We changed the OS password for the bglsysdb user:<br />
bglsn:/bgl/ # passwd bglsysdb<br />
Problem determination
We then try to allocate a block so we can run our job, and receive the
following mmcs_console error:
mmcs$ allocate R000_128<br />
FAIL<br />
lost connection to mmcs_server<br />
use mmcs_server_connect to reconnect<br />
When we check the system logs we find the following messages in the mmcs and<br />
ciodb logs (see Example 6-28).<br />
Example 6-28 Error message in system logs when bglsysdb password changed<br />
--CLI ERROR-------------cliRC<br />
= -1<br />
line = 167<br />
file = DBConnection.cc<br />
SQLSTATE = 08001<br />
Native Error Code = -30082<br />
[IBM][CLI Driver] SQL30082N Attempt to establish connection failed<br />
with security reason "24" ("USERNAME AND/OR PASSWORD INVALID").<br />
SQLSTATE=08001<br />
-------------------------<br />
Values used for connection: database is {bgdb0}, user is {bglsysdb},<br />
schema is {bglsysdb}<br />
Unable to connect, aborting...<br />
From this we can see that we cannot connect to the system as bglsysdb; the
message is "USERNAME AND/OR PASSWORD INVALID". However, as superuser
(root) we can switch to bglsysdb (su - bglsysdb) and verify that the user context
is fine. Therefore it must be the changed password, so we need to update the
db.properties file with the new password.
After changing the db.properties with the new password we need to restart the<br />
bglmaster, as shown in Example 6-29.<br />
Example 6-29 Changing the password in the db.properties file
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # vi db.properties<br />
... >; save and exit...<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster restart<br />
Stopping BGLMaster<br />
Starting BGLMaster<br />
Logging to<br />
/bgl/BlueLight/logs/BGL/bglsn-bglmaster-2006-0315-14:56:25.log<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster status<br />
idoproxy started [14561]<br />
ciodb started [14562]<br />
mmcs_server started [14563]<br />
monitor stopped<br />
perfmon stopped<br />
We can now use the mmcs_db_console to allocate a block, as shown in<br />
Example 6-30.<br />
Example 6-30 Allocating a block<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./mmcs_db_console<br />
connecting to mmcs_server<br />
connected to mmcs_server<br />
connected to DB2<br />
mmcs$ allocate R000_128<br />
OK<br />
Finally, we run a job.<br />
Lessons learned<br />
DB2 user password control is tied to unix authentication. The password that the<br />
Blue Gene/L system processes use to talk to DB2 is contained in the<br />
db.properties file. If you change the DB2 user password you MUST update this<br />
file.<br />
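A careful way to apply such a change is to script it with a backup. This is a sketch only: the property key name "password" is hypothetical, so inspect your db.properties for the actual key before adapting it.

```shell
# Replace a password property in a properties file, keeping a .bak copy.
# The key name "password" is hypothetical; check db.properties first.
update_password() {
    file=$1 newpw=$2
    cp "$file" "$file.bak"                         # keep a backup
    sed -i "s/^password=.*/password=$newpw/" "$file"
}
```

For example, `update_password db.properties <newpw>` followed by `./bglmaster restart` would apply the change from this scenario.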
6.2.14 Uncontrolled rack power off<br />
This time we decided to power off the rack without doing any of the
PrepareForService preparation beforehand.
Error injection<br />
We switched the rack circuit breaker off, waited 20 seconds, and powered back
on. This could happen in real life when there are power line fluctuations.
Problem determination
We try to run a job, but block allocation fails twice in a row: first with a
communication failure, and the second time with an initchip error invalid
JtagID, as shown in Example 6-31.
Example 6-31 Trying to allocate a block after a rack power glitch<br />
mmcs$ allocate R000_128<br />
FAIL<br />
connect: idoproxy communication failure: BGLERR_IDO_PKT_TIMEOUT<br />
connection lost to node/link/service card [R00-M0-N0]<br />
ido://FF:F2:9F:16:F0:95:00:0D:60:E9:0F:6A/CTRL<br />
mmcs$ allocate R000_128<br />
FAIL<br />
connect: initchip error invalid JtagID<br />
As we can see, idoproxy cannot talk to the Service Card.
We next check the system logs; the mmcs_server log contains the messages
shown in Example 6-32.
Example 6-32 Message in mmcs_server log<br />
Mar 15 17:04:29 (I) [1086321888] root allocate R000_128<br />
Mar 15 17:04:29 (I) [1086321888] root<br />
DBMidplaneController::addBlock(R000_128)<br />
Mar 15 17:04:29 (I) [1086321888] root:R000_128<br />
BlockController::connect()<br />
Mar 15 17:04:33 (I) [1086321888] root:R000_128<br />
BlockController::disconnect() releasing node and ido connections<br />
Mar 15 17:04:33 (I) [1086321888] root:R000_128 allocate: FAIL;connect:<br />
idoproxy communication failure: BGLERR_IDO_PKT_TIMEOUT connection lost<br />
to node/link/service card [R00-M0-N0]<br />
ido://FF:F2:9F:16:F0:95:00:0D:60:E9:0F:6A/CTRL<br />
Mar 15 17:04:33 (I) [1086321888] root:R000_128<br />
DBMidplaneController::removeBlock(R000_128)<br />
The idoproxy log shows the messages in Example 6-33.
Example 6-33 The idoproxy messages<br />
Mar 15 17:04:29 (I) [1102423264]<br />
root:ido://FF:F2:9F:16:F0:95:00:0D:60:E9:0F:6A/CTRL OPEN(s)<br />
Mar 15 17:04:33 (E) [1084568800] Send Timeout... IPAddr=10.0.0.18<br />
IPMask=255.255.0.0 LicensePlate=ff:f2:9f:16:f0:95:00:0d:60:e9:0f:6a<br />
Mar 15 17:04:33 (E) [1084568800] packet failure -1... IPAddr=10.0.0.18<br />
IPMask=255.255.0.0 LicensePlate=ff:f2:9f:16:f0:95:00:0d:60:e9:0f:6a<br />
Mar 15 17:04:33 (I) [1102423264]<br />
root:ido://FF:F2:9F:16:F0:95:00:0D:60:E9:0F:6A/CTRL CLOSE<br />
► The RAS errors show the same failure for all Node Cards.
► The system processes seem to be running fine, apart from the logged
messages (Example 6-32 and Example 6-33).
► Checking DB2 reveals it is up and running.
Because we have communication errors, we then check out the Service Network<br />
and the Functional Network (eth0 and eth3 in our case), using the ifconfig and<br />
ethtool commands, as shown in Example 6-34.<br />
Example 6-34 Checking network interfaces<br />
bglsn:/ # ifconfig<br />
eth0 Link encap:Ethernet HWaddr 00:0D:60:4D:28:EA<br />
inet addr:10.0.0.1 Bcast:10.0.255.255 Mask:255.255.0.0<br />
inet6 addr: fe80::20d:60ff:fe4d:28ea/64 Scope:Link<br />
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1<br />
RX packets:35228867 errors:0 dropped:0 overruns:0 frame:0<br />
TX packets:35220594 errors:0 dropped:0 overruns:0 carrier:0<br />
collisions:0 txqueuelen:1000<br />
RX bytes:4595599148 (4382.7 Mb) TX bytes:6839888628 (6523.0<br />
Mb)<br />
Base address:0xe800 Memory:f8120000-f8140000<br />
..... >....<br />
eth3 Link encap:Ethernet HWaddr 00:11:25:08:30:90<br />
inet addr:172.30.1.1 Bcast:172.30.255.255 Mask:255.255.0.0<br />
inet6 addr: fe80::211:25ff:fe08:3090/64 Scope:Link<br />
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1<br />
RX packets:14054383 errors:0 dropped:0 overruns:0 frame:0<br />
TX packets:24871645 errors:0 dropped:0 overruns:0 carrier:0<br />
collisions:0 txqueuelen:1000<br />
RX bytes:7954413066 (7585.9 Mb) TX bytes:20968793643<br />
(19997.3 Mb)<br />
Base address:0xec00 Memory:c0080000-c00a0000<br />
..... >....<br />
bglsn:/ # ethtool eth0<br />
Settings for eth0:<br />
Supported ports: [ TP ]<br />
Supported link modes: 10baseT/Half 10baseT/Full<br />
100baseT/Half 100baseT/Full<br />
1000baseT/Full<br />
Supports auto-negotiation: Yes<br />
Advertised link modes: 10baseT/Half 10baseT/Full<br />
100baseT/Half 100baseT/Full<br />
1000baseT/Full<br />
Advertised auto-negotiation: Yes<br />
Speed: 1000Mb/s<br />
Duplex: Full<br />
Port: Twisted Pair<br />
PHYAD: 0<br />
Transceiver: internal<br />
Auto-negotiation: on<br />
Supports Wake-on: umbg<br />
Wake-on: g<br />
Current message level: 0x00000007 (7)<br />
Link detected: yes<br />
bglsn:/ # ethtool eth3<br />
Settings for eth3:<br />
Supported ports: [ TP ]<br />
Supported link modes: 10baseT/Half 10baseT/Full<br />
100baseT/Half 100baseT/Full<br />
1000baseT/Full<br />
Supports auto-negotiation: Yes<br />
Advertised link modes: 10baseT/Half 10baseT/Full<br />
100baseT/Half 100baseT/Full<br />
1000baseT/Full<br />
Advertised auto-negotiation: Yes<br />
Speed: 1000Mb/s<br />
Duplex: Full<br />
Port: Twisted Pair<br />
PHYAD: 0<br />
Transceiver: internal<br />
Auto-negotiation: on<br />
Supports Wake-on: umbg<br />
Wake-on: g<br />
Current message level: 0x00000007 (7)<br />
Link detected: yes<br />
► From Example 6-34 we can also see that the IP configuration is correct on the
network interfaces.
► Next, we use the ping command over the Functional and Service Networks,
and check the link lights of the RJ45 jacks on the functional and service
interfaces. These verifications do not reveal any problem.
► Next, we check the lights of the Service and Node Cards. The Node Card
lights are out and the Service Card lights are cycling, which means the cards
are uninitialized. We need to run the discovery process to re-discover the
system.
► For this, we do a PrepareForService operation to get the system into a
good state, as shown in Example 6-35:
Example 6-35 Running PrepareForService on the system<br />
bglsn:/discovery # ./PrepareForService R00<br />
Logging to<br />
/bgl/BlueLight/logs/BGL/PrepareForService-2006-03-16-10:47:01.log<br />
Mar 16 10:47:02.912 EST: PrepareForService started<br />
Mar 16 10:47:03.125 EST: Preparing 1 Midplanes in rack R00<br />
Mar 16 10:47:03.702 EST: @ killMidplaneBlocks - kill_midplane_jobs R000<br />
failed (FAIL;command?)<br />
Mar 16 10:47:06.213 EST: Freed any blocks using R000<br />
Mar 16 10:47:06.222 EST: @ killMidplaneBlocks - Retried 1 time(s)<br />
before we were able to 'free any blocks using this midplane' -<br />
Midplane(R000)!<br />
Mar 16 10:47:06.222 EST:<br />
Mar 16 10:47:11.403 EST: @ buildServiceCardObj - Exception occurred<br />
while building an iDo for ServiceCard(mLctn(R00-M0-S),<br />
mCardSernum(2033394a3033373900000000594c31304b3530363130304a),<br />
mLp(FF:F2:9F:15:1E:4D:00:0D:60:EA:E1:B2), mIp(10.0.0.12), mType(2))<br />
Mar 16 10:47:11.403 EST: @ buildServiceCardObj - Exception was<br />
(java.io.IOException: Could not contact iDo with<br />
LP=FF:F2:9F:15:1E:4D:00:0D:60:EA:E1:B2 and IP=/10.0.0.12 because<br />
java.lang.RuntimeException: Communication error: (DirectIDo for<br />
Uninitialized DirectIDo for<br />
FF:F2:9F:15:1E:4D:00:0D:60:EA:E1:B2@/10.0.0.12:0 is in state =<br />
COMMUNICATION_ERROR, sequenceNumberIsOk = false, ExpectedSequenceNumber<br />
= 0, Reply Sequence Number = -1, timedOut = true, retries = 5, timeout<br />
= 1000, Expected Op Command = 5, Actual Op Reply = -1, Expected Sync<br />
Command = 10, Actual Sync Reply = -1))<br />
..... > .....<br />
Mar 16 10:48:11.470 EST: @ buildServiceCardObj - Exception occurred<br />
while building an iDo for ServiceCard(mLctn(R00-M0-S),<br />
mCardSernum(2033394a3033373900000000594c31304b3530363130304a),<br />
mLp(FF:F2:9F:15:1E:4D:00:0D:60:EA:E1:B2), mIp(10.0.0.12), mType(2))<br />
Mar 16 10:48:11.471 EST: @ buildServiceCardObj - Exception was<br />
(java.io.IOException: Could not contact iDo with<br />
LP=FF:F2:9F:15:1E:4D:00:0D:60:EA:E1:B2 and IP=/10.0.0.12 because<br />
java.lang.RuntimeException: Communication error: (DirectIDo for<br />
Uninitialized DirectIDo for<br />
FF:F2:9F:15:1E:4D:00:0D:60:EA:E1:B2@/10.0.0.12:0 is in state =<br />
COMMUNICATION_ERROR, sequenceNumberIsOk = false, ExpectedSequenceNumber<br />
= 0, Reply Sequence Number = -1, timedOut = true, retries = 5, timeout<br />
= 1000, Expected Op Command = 5, Actual Op Reply = -1, Expected Sync<br />
Command = 10, Actual Sync Reply = -1))<br />
..... > .....<br />
► From the error messages in Example 6-35 we can see that the system
cannot talk to the IDo chips; we conclude that the Service Card is NOT
initialized.
► We now run discovery on the system:<br />
cd /discovery<br />
./SystemController start<br />
./Discovery0 start<br />
./PostDiscovery start<br />
► The lights of the Service Card come back to normal state, but the Node Cards<br />
do not.<br />
► We try to run PrepareForService again, but it fails because of the previously
failed attempt. We need to close the previous service action manually:
db2 "update bglserviceaction set status = 'C' where id = 2"<br />
► We then ran PrepareForService again on the rack.<br />
Note: Unfortunately this failed, as our test system does not have the hardware<br />
expected in a typical Blue Gene/L rack (1/2 rack at least).<br />
To overcome this, we added the -FORCE option and the ServiceAction<br />
completed as expected.<br />
As EndServiceAction did not complete successfully, we decided to manually<br />
mark all the IDo chips as missing (M) in the database, so the discovery process<br />
would pick them up:<br />
db2 "update bglidochip set status = 'M' where ipaddress like<br />
'10.0%'"<br />
Discovery found all existing hardware. After it found all the hardware, we stopped
the discovery process and restarted the Blue Gene/L system processes. We
could then allocate a block and submit a job.
Lessons learned<br />
We learned the following lessons:<br />
► If a rack is power cycled without preparation, the rack goes into an
uninitialized state. Without the discovery processes running (especially
SystemController), the Service Card does not get initialized and the system
cannot talk to the IDo chips through the switch on the Service Card. You
should always use PrepareForService and EndServiceAction and do a
controlled power down.
► If you have an unplanned rack power outage, leave the rack off and start the
discovery process, which marks all hardware as missing. When this is
complete, power up and let discovery find and initialize the hardware.
Finally, stop discovery and start the system processes to bring the system
back into production.
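The recovery steps from this scenario can be collected into one hedged helper. Paths, the service action id (2), and the IP prefix are specific to our test system; substitute your own values, verify each db2 statement against your tables, and run it only after the rack is powered back on.

```shell
# Sketch of the post-outage recovery sequence used in this scenario.
recover_rack() {
    cd /discovery || { echo "discovery tools not found"; return 1; }
    # close the stale service action (id from our system)
    db2 "update bglserviceaction set status = 'C' where id = 2"
    # mark the IDo chips missing so discovery re-initializes them
    db2 "update bglidochip set status = 'M' where ipaddress like '10.0%'"
    ./SystemController start
    ./Discovery0 start
    ./PostDiscovery start
}
```

Call `recover_rack` manually once you have confirmed the service action id and IP range for your installation.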
6.3 File system scenarios<br />
This section addresses problem determination for issues related to both NFS<br />
and GPFS.<br />
Here is the list of file system scenarios that we ran. In each of these scenarios,<br />
we injected a problem manually. We show the steps taken to determine the<br />
problem and the resolution.<br />
1. Port mapper daemon not running on the SN.<br />
2. NFS daemon not running on the SN.<br />
3. GPFS pagepool (wrongly) set to 512 MB on bgIO cluster nodes.<br />
4. Secure shell (ssh) is broken (interactive authentication required).<br />
5. The /bgl file system becomes full.<br />
6. Install of new Blue Gene/L driver code.<br />
7. Duplicate IP in /etc/hosts.<br />
8. Missing IO node in /etc/hosts.<br />
9. Duplicate entries (additional aliases) for the SN in /etc/hosts.<br />
In each of the scenarios, we first run the system to prove that it is working. To
do this we use LoadLeveler to run the IOR application, which writes data to the
GPFS file system from all nodes. After this has run successfully, we inject the
problem and rerun the same job. Each of the scenarios is split into the
following sections:
► Error injection<br />
► <strong>Problem</strong> determination<br />
► Lessons learned<br />
6.3.1 Port mapper daemon not running on the SN<br />
In this scenario we are not using GPFS or LoadLeveler, just the core system
loading the application from the /bgl file system. We cause the problem by
killing the portmap daemon, which is required by the NFS server.
Error injection
The error injected here was to kill the portmap process running on the SN.<br />
Problem determination
Here we ran the “Hello world!” program from an mmcs_db_console session. Here is
the error that we received:
allocate R000_128 == failed with :<br />
mmcs_console : Error: unable to mount filesystem<br />
Example 6-36 presents mmcs_db_server.log when portmapd is not running.<br />
Example 6-36 Messages in mmcs_db_server.log (portmapd not running)<br />
Mar 13 16:02:02 (I) [1084843232] root:R000_128 DBBlockController::waitBoot(R000_128)<br />
Mar 13 16:02:13 (E) [1086973152] root:R000_128 RAS event: KERNEL FATAL: Error:<br />
unable to mount filesystem<br />
Mar 13 16:02:16 (E) [1086973152] root:R000_128 RAS event: KERNEL FATAL: Error:<br />
unable to mount filesystem<br />
Mar 13 16:02:16 (E) [1086973152] root:R000_128 RAS event: KERNEL FATAL: Error:<br />
unable to mount filesystem<br />
Mar 13 16:02:16 (E) [1086973152] root:R000_128 RAS event: KERNEL FATAL: Error:<br />
unable to mount filesystem<br />
Mar 13 16:02:17 (I) [1084843232] root:R000_128 allocate: FAIL;Error: unable to mount<br />
filesystem<br />
Mar 13 16:02:17 (I) [1084843232] root:R000_128<br />
DBMidplaneController::removeBlock(R000_128)<br />
Mar 13 16:02:17 (I) [1084843232] root BlockController::quiesceMailbox() waiting for<br />
ras events and I/O node shutdown<br />
Mar 13 16:02:17 (I) [1097856224] mmcs DatabaseCommandThread started: block<br />
R000_128, user root, action 3<br />
Mar 13 16:02:17 (I) [1097856224] mmcs setusername root<br />
Mar 13 16:02:17 (I) [1097856224] root db_free R000_128<br />
Mar 13 16:02:17 (I) [1097856224] root DBMidplaneController::addBlock(R000_128)<br />
Mar 13 16:02:17 (I) [1097856224] root:R000_128 DBBlockController::freeBlock()<br />
setBlockState(R000_128, TERMINATING) successful<br />
As we can see, the file system cannot be mounted. Following the problem
determination methodology, we go straight to the NFS checks:
► Check the export list on the SN:<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # showmount -e<br />
mount clntudp_create: RPC: Port mapper failure - RPC: Unable to<br />
receive<br />
The RPC: Port mapper failure - RPC: Unable to receive error indicates a<br />
port mapper daemon issue. To fix this problem, we ran the following command:<br />
/etc/init.d/portmap restart && /etc/init.d/nfsserver restart<br />
Lessons learned<br />
We learned the following lessons:<br />
► If the port mapper daemon dies on the SN, we get the following error in
either the mmcs_db_console or the mmcs_db_server error log:
Error: unable to mount filesystem<br />
► To diagnose the problem, we can run the showmount -e command on the SN.<br />
6.3.2 NFS daemon not running on the SN<br />
In this scenario we use only the core system (SN, racks, networks) loading the<br />
application from the /bgl file system exported by the SN.<br />
Error injection
The error injected here was to kill the nfsd process running on the SN.<br />
296 IBM System Blue Gene Solution: <strong>Problem</strong> <strong>Determination</strong> <strong>Guide</strong>
<strong>Problem</strong> determination<br />
We ran the “Hello world!” program from a mmcs_db_console session and received
the following error:
allocate R000_128 == failed with :<br />
mmcs_console : Error: unable to mount filesystem<br />
Looking at the mmcs_db_server log, we see the error:
KERNEL FATAL: Error: unable to mount filesystem<br />
Using the problem determination methodology, we go straight to the NFS checks:
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # showmount -e<br />
mount clntudp_create: RPC: Program not registered<br />
This error indicates a problem with the NFS server. To fix this, we ran the
following command:
/etc/init.d/nfsserver restart<br />
Lessons learned<br />
We learned the following lessons:<br />
► If the nfsd daemon dies on the SN, we get the following error in either the<br />
mmcs_db_console or the mmcs_db_server error log when allocating a block:<br />
Error: unable to mount filesystem
► To diagnose the problem, we can run the showmount -e command on the SN.
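The two showmount -e failures seen in 6.3.1 and 6.3.2 can be told apart mechanically. The following shell sketch (the helper name is ours, not part of the Blue Gene/L tooling) maps the error text to the init script that most likely needs a restart:

```shell
# classify_nfs_error: illustrative helper (not a Blue Gene/L command) that
# maps the showmount -e error text seen in this chapter to the service that
# most likely needs to be restarted with /etc/init.d/<name> restart.
classify_nfs_error() {
    case "$1" in
        *"Port mapper failure"*)    echo portmap ;;   # 6.3.1: portmap died
        *"Program not registered"*) echo nfsserver ;; # 6.3.2: nfsd died
        *)                          echo unknown ;;
    esac
}
```

Feeding it the error from 6.3.1 prints portmap; the error in this section prints nfsserver.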
6.3.3 GPFS pagepool (wrongly) set to 512MB on bgIO cluster nodes<br />
In this scenario we change the GPFS pagepool for the I/O nodes on the bgIO
cluster. The pagepool is pinned kernel memory used for file and metadata
caching by the GPFS daemon. Because of the limited memory (RAM) on the I/O
nodes, a large pagepool prevents the applications from running (or even
prevents the GPFS daemon from starting on the I/O nodes).
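As a rough pre-change sanity check, the constraint can be sketched as follows (the helper and the 256 MB reserve figure are our assumptions, not IBM-documented values):

```shell
# pagepool_fits: our own sketch, not a GPFS command. Succeeds only if the
# requested pagepool plus an assumed reserve for the kernel, ciod, and the
# GPFS daemon itself fits within the node's RAM.
pagepool_fits() {
    pagepool_mb=$1
    ram_mb=$2
    reserved_mb=${3:-256}   # assumed headroom; tune for your configuration
    [ $((pagepool_mb + reserved_mb)) -le "$ram_mb" ]
}
```

Assuming 512 MB of RAM per I/O node, pagepool_fits 512 512 fails, matching the failure injected here, while a much smaller pagepool passes.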
Error injection<br />
Example 6-37 shows the error we injected (no blocks were allocated at this time).<br />
Example 6-37 Changing the GPFS pagepool<br />
bglsn:/bgl/BlueLight/logs/BGL # mmchconfig pagepool=512M<br />
mmchconfig: Command successfully completed<br />
mmchconfig: Propagating the changes to all affected nodes.<br />
This is an asynchronous process.<br />
<strong>Problem</strong> determination<br />
In this scenario we use IBM LoadLeveler to submit a job (for details see 4.4, “IBM<br />
LoadLeveler” on page 167). We leave LoadLeveler to automatically allocate a<br />
block. We observe that the job did not produce any I/O files after five minutes<br />
(usually, this job runs in under one minute). As a starting point we check the<br />
mmcs_server log:<br />
bglsn:/bgl/BlueLight/logs/BGL # view<br />
bglsn-mmcs_db_server-2006-0316-15:42:04.log<br />
Initially we did not find any relevant message, so we decided to log on to an
I/O node and check from there (see Example 6-38).
Example 6-38 Checking GPFS on one I/O node<br />
bglsn:/bgl/BlueLight/logs/BGL # ssh root@ionode5<br />
BusyBox v1.00-rc2 (2006.01.09-19:48+0000) Built-in shell (ash)<br />
Enter 'help' for a list of built-in commands.<br />
$ df<br />
Filesystem 1K-blocks Used Available Use% Mounted on<br />
rootfs 7931 1644 5878 22% /<br />
/dev/root 7931 1644 5878 22% /<br />
172.30.1.1:/bgl 9766544 2449008 7317536 26% /bgl<br />
172.30.1.33:/bglscratch<br />
36700160 64448 36635712 1% /bglscratch<br />
$ /usr/lpp/mmfs/bin/mmgetstate<br />
Node number Node name GPFS state<br />
-----------------------------------------<br />
6 ionode5 down<br />
Note: We could also use the mmgetstate -a command on the SN, which would
return the status of all nodes in the bgIO GPFS cluster. However, we chose
to go directly to one of the nodes allocated for the LoadLeveler job.
We then look into the GPFS log (mmfs.log.latest) on the respective I/O node,
as shown in Example 6-39.
Example 6-39 GPFS log on the I/O node<br />
$ cat /var/mmfs/gen/mmfslog<br />
/bin/cat: /proc/kallsyms: No such file or directory<br />
Tue Mar 28 15:56:47 2006: mmfsd initializing. {Version: 2.3.0.10<br />
Built: Jan 16 2006 13:08:25} ...<br />
Tue Mar 28 15:56:48 2006: Not enough memory to allocate internal data<br />
structure.<br />
Tue Mar 28 15:56:48 2006: The mmfs daemon is shutting down abnormally.<br />
Tue Mar 28 15:56:48 2006: mmfsd is shutting down.<br />
Tue Mar 28 15:56:48 2006: Reason for shutdown: LOGSHUTDOWN called<br />
Tue Mar 28 15:56:49 EST 2006 runmmfs starting<br />
Removing old /var/adm/ras/mmfs.log.* files:<br />
Tue Mar 28 15:56:49 EST 2006 runmmfs: respawn 9 waiting 336 seconds<br />
before restarting mmfsd<br />
From the GPFS log we can see that the mmfs daemon is respawning and we also<br />
see the following message, which indicates where the problem might be:<br />
Not enough memory to allocate internal data structure.<br />
Moreover, 10 minutes after the job submission, the application (job) log file
also contains the messages shown in Example 6-40.
Example 6-40 Application messages<br />
test1@bglfen1:/bglscratch/test1> view ior-gpfs.out<br />
IOR-2.8.3: MPI Coordinated Test of Parallel I/O<br />
Run began: Tue Mar 28 15:51:26 2006<br />
Command line used: /bglscratch/test1/applications/IOR/IOR.rts -f<br />
/bglscratch/test1/applications/IOR/ior-inputs<br />
Machine: Blue Gene L ionode3<br />
Summary:<br />
api = MPIIO (version=2, subversion=0)<br />
test filename = /bubu/Examples/IOR/IOR-output-MPIIO-1<br />
access = single-shared-file<br />
clients = 128 (16 per node)<br />
repetitions = 4<br />
xfersize = 32768 bytes<br />
blocksize = 1 MiB<br />
aggregate filesize = 128 MiB<br />
delaying 1 seconds . . .<br />
** error **<br />
** error **<br />
** error **<br />
** error **<br />
** error **<br />
** error **<br />
** error **<br />
** error **<br />
** error **<br />
** error **<br />
** error **<br />
** error **<br />
** error **<br />
** error **<br />
ERROR in aiori-MPIIO.c (line 130): cannot open file.<br />
** error **<br />
** error **<br />
ERROR in aiori-MPIIO.c (line 130): cannot open file.<br />
ERROR in aiori-MPIIO.c (line 130): cannot open file.<br />
ERROR in aiori-MPIIO.c (line 130): cannot open file.<br />
ERROR in aiori-MPIIO.c (line 130): cannot open file.<br />
ERROR in aiori-MPIIO.c (line 130): cannot open file.<br />
ERROR in aiori-MPIIO.c (line 130): cannot open file.<br />
** error **<br />
ERROR in aiori-MPIIO.c (line 130): cannot open file.<br />
ERROR in aiori-MPIIO.c (line 130): cannot open file.<br />
MPI File does not exist, error stack:<br />
MPI File does not exist, error stack:<br />
MPI File does not exist, error stack:<br />
MPI File does not exist, error stack:<br />
ADIOI_BGL_OPEN(54): File /bubu/Examples/IOR/IOR-output-MPIIO-1 does not<br />
exist<br />
ADIOI_BGL_OPEN(54): File /bubu/Examples/IOR/IOR-output-MPIIO-1 does not<br />
exist<br />
ADIOI_BGL_OPEN(54): File /bubu/Examples/IOR/IOR-output-MPIIO-1 does not<br />
exist<br />
** exiting **<br />
** exiting **<br />
** exiting **<br />
** exiting **<br />
From the application output file, we can see that the application is unable to
find the output file in GPFS, which is also a good indication that there is a
problem with GPFS.
Note: The GPFS file system (/bubu) is available on the SN, so the
application output file can be found inside the file system. Only the I/O
nodes cannot read from or write to the GPFS file system.
We can also check one of the I/O node logs in the /bgl/BlueLight/logs/BGL<br />
directory on the SN, as in Example 6-41.<br />
Example 6-41 I/O node log messages<br />
Mar 28 15:41:15 (I) [1088451808] root:RMP28Mr154042300 {0}.0: Starting<br />
GPFS<br />
Mar 28 15:41:15 (I) [1088451808] root:RMP28Mr154042300 {0}.0:<br />
Disabling protocol version 1. Could not load host key<br />
Mar 28 15:41:15 (I) [1088451808] root:RMP28Mr154042300 {0}.0:<br />
Mar 28 15:51:19 (I) [1088451808] root:RMP28Mr154042300 {0}.0:<br />
R00-M0-N0-I:J18-U01 /var/etc/rc.d/rc3.d/S40gpfs: GPFS did not come up<br />
on I/O nod<br />
Mar 28 15:51:19 (I) [1088451808] root:RMP28Mr154042300 {0}.0:<br />
/var/etc/rc.d/rc3.d/S40gpfs: GPFS did not come up on I/O node ionode1 :<br />
172.30.2.1<br />
Mar 28 15:51:19 (I) [1088451808] root:RMP28Mr154042300 {0}.0:<br />
[ciod:initialized]<br />
Mar 28 15:51:20 (I) [1088451808] root:RMP28Mr154042300 {0}.0: e<br />
ionode1 : 172.30.2.1<br />
Starting syslog services<br />
Starting ciod<br />
Starting XL Compiler Environment for I/O node<br />
ciod: version "Jan 10 2006 16:25:12"<br />
ciod: running in virtual node mode with 32 processors<br />
Mar 28 15:51:20 (I) [1088451808] root:RMP28Mr154042300 {0}.0:<br />
BusyBox v1.00-rc2 (2006.01.09-19:48+0000) Built-in shell (ash)<br />
Enter 'help' for a list of built-in commands.<br />
/bin/sh: can't access tty; job control turned off<br />
$<br />
Mar 28 15:51:24 (I) [1088451808] root:RMP28Mr154042300 {0}.0:<br />
Switching to coprocessor mode<br />
Mar 28 15:51:24 (I) [1088451808] root:RMP28Mr154042300 {0}.0:<br />
Mar 28 15:52:03 (I) [1088451808] root:RMP28Mr154042300 {0}.0: Mar 28<br />
15:52:02<br />
Mar 28 15:52:03 (I) [1088451808] root:RMP28Mr154042300 {0}.0: ionode1<br />
mmfs: mmfsd: Error=MMFS_ABNORMAL_SHUTDOWN, ID=0x1FB9260D, Tag=0<br />
Mar 28 15:52:02 ionode1 mmfs: Shutting down abnormally due to error in<br />
/project/sprelcs3b/build/rcs3bs010a/src/avs/fs/mmfs/ts/bufmgr/linux/buf<br />
mgr-plat.C line 170 retCode -1, reasonCode 0<br />
Mar 28 15:56:48 (I) [1088451808] root:RMP28Mr154042300 {0}.0: Mar 28<br />
15:56:47<br />
Mar 28 15:56:48 (I) [1088451808] root:RMP28Mr154042300 {0}.0: ionode1<br />
mmfs: mmfsd: Error=MMFS_ABNORMAL_SHUTDOWN, ID=0x1FB9260D, Tag=0<br />
Mar 28 15:56:47 ionode1 mmfs: Shutting down abnormally due to error in<br />
/project/sprelcs3b/build/rcs3bs010a/src/avs/fs/mmfs/ts/bufmgr/linux/buf<br />
mgr-plat.C line 170 retCode -1, reasonCode 0<br />
Mar 28 16:02:30 (I) [1088451808] root:RMP28Mr154042300 {0}.0: Mar 28<br />
16:02:30<br />
Mar 28 16:02:30 (I) [1088451808] root:RMP28Mr154042300 {0}.0: ionode1<br />
mmfs: mmfsd: Error=MMFS_ABNORMAL_SHUTDOWN, ID=0x1FB9260D, Tag=0<br />
Mar 28 16:02:30 ionode1 mmfs: Shutting down abnormally due to error in<br />
/project/sprelcs3b/build/rcs3bs010a/src/avs/fs/mmfs/ts/bufmgr/linux/buf<br />
mgr-plat.C line 170 retCode -1, reasonCode 0<br />
bglsn:/bgl/BlueLight/logs/BGL #<br />
From Example 6-41, we can see that GPFS was started at 15:41:15, and it is not
until 15:51:19 that we get the following message from S40gpfs:
: GPFS did not come up on I/O node ionode1 :<br />
Looking in the file /bgl/BlueLight/ppcfloor/dist/etc/rc.d/init.d/gpfs, we can
see why, as shown in Example 6-42.
Example 6-42 Excerpt from GPFS startup script for I/O nodes
# This file will be created by mmfsup.scr to signal that GPFS startup is
# complete
upfile=/tmp/mmfsup.done
# Create mmfsup script that will run when GPFS is ready
cat
then ras_advisory "$0: GPFS did not come up on I/O node $HOSTID"
exit 1
fi
done
rm -f $upfile
echo "$0: GPFS is ready on I/O node $HOSTID_LOC"
;;
stop)
# Set defaults for GPFS configuration variables
GPFS_STARTUP=0
# Obtain overrides from config file
GPFSFILE=/etc/sysconfig/gpfs
[ -r $GPFSFILE ] && . $GPFSFILE
The polling loop explains the timeout: 300 x 2 seconds = 600 seconds, hence
the 10-minute wait if GPFS fails to come up.
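The polling pattern can be sketched generically as follows (our reconstruction of the pattern, not the shipped S40gpfs code; the function name and parameters are illustrative):

```shell
# wait_for_file: generic sketch of the S40gpfs polling pattern (not the
# shipped script). Poll for the flag file up to max_tries times, sleeping
# sleep_secs between attempts; the defaults reproduce the 300 x 2 s = 600 s
# timeout described above.
wait_for_file() {
    upfile=$1
    max_tries=${2:-300}
    sleep_secs=${3:-2}
    cnt=0
    while [ ! -f "$upfile" ]; do
        cnt=$((cnt + 1))
        if [ "$cnt" -gt "$max_tries" ]; then
            return 1    # timed out: GPFS never signaled readiness
        fi
        sleep "$sleep_secs"
    done
    return 0
}
```

With the defaults, wait_for_file /tmp/mmfsup.done gives up after about 10 minutes if GPFS never creates the flag file.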
Lessons learned<br />
Always make sure that you follow the Blue Gene/L documentation when changing
any GPFS parameters. Because the I/O nodes have a very particular
configuration, you need to be extra careful with GPFS.
6.3.4 Secure shell (ssh) is broken<br />
For this scenario, we remove the following files so that it should be
impossible for the root user to communicate between GPFS nodes in the bgIO
cluster without interactive authentication (host key acceptance or password
prompting):
► /bgl/dist/root/.ssh/known_hosts<br />
► /bgl/dist/root/.ssh/authorized_keys<br />
► /root/.ssh/known_hosts<br />
► /root/.ssh/authorized_keys<br />
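Before injecting the error, a minimal presence check for these files can be sketched as follows (the helper is ours, not a Blue Gene/L or GPFS command):

```shell
# check_ssh_files: illustrative helper (our own sketch) that reports whether
# the known_hosts and authorized_keys files under a given home directory
# exist and are non-empty. Returns non-zero if any file is missing.
check_ssh_files() {
    home_dir=$1
    status=0
    for f in known_hosts authorized_keys; do
        if [ ! -s "$home_dir/.ssh/$f" ]; then
            echo "missing or empty: $home_dir/.ssh/$f"
            status=1
        fi
    done
    return $status
}
```

Running check_ssh_files /root and check_ssh_files /bgl/dist/root reports exactly which of the four files above are missing.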
Error injection<br />
Example 6-43 shows how we injected the error.
Example 6-43 Removing ssh authentication files<br />
bglsn:/bgl/BlueLight/ppcfloor/dist/etc/rc.d/init.d # cd /root/.ssh<br />
bglsn:~/.ssh # ls -lrt<br />
total 38<br />
-rw-r--r-- 1 root root 220 Mar 19 17:12 id_rsa.pub<br />
-rw------- 1 root root 887 Mar 19 17:12 id_rsa<br />
-rw-r--r-- 1 root root 2976 Mar 19 19:12 known_hosts.b4gpfs<br />
-rw-r--r-- 1 root root 440 Mar 28 13:40 authorized_keys<br />
-rw-r--r-- 1 root root 4140 Mar 28 17:31 known_hosts<br />
drwx------ 2 root root 280 Mar 28 17:31 .<br />
drwxr-xr-x 27 root root 1768 Mar 30 14:54 ..<br />
bglsn:~/.ssh # mv known_hosts known_hosts.orig<br />
bglsn:~/.ssh # mv authorized_keys authorized_keys.orig<br />
bglsn:~/.ssh # cd /bgl/dist/root/.ssh/<br />
bglsn:/bgl/dist/root/.ssh # ls -lrt<br />
total 32<br />
drwx------ 3 root root 72 Mar 17 15:57 ..<br />
-rw-r--r-- 1 root root 220 Mar 19 17:05 id_rsa.pub<br />
-rw------- 1 root root 887 Mar 19 17:05 id_rsa<br />
-rw-r--r-- 1 root root 4091 Mar 19 19:14 known_hosts_gpfs<br />
-rw-r--r-- 1 root root 440 Mar 28 13:55 authorized_keys<br />
-rw-r--r-- 1 root root 940 Mar 28 17:33 known_hosts<br />
drwx------ 2 root root 272 Mar 28 17:33 .<br />
bglsn:/bgl/dist/root/.ssh # mv known_hosts known_hosts.orig<br />
bglsn:/bgl/dist/root/.ssh # mv authorized_keys authorized_keys.orig<br />
bglsn:/bgl/dist/root/.ssh #<br />
<strong>Problem</strong> determination<br />
First, check the status of LoadLeveler, as shown in Example 6-44.
Example 6-44 Checking the LoadLeveler status before submitting the job<br />
test1@bglfen1:/bglscratch/test1> llq<br />
llq: There is currently no job status to report.<br />
test1@bglfen1:/bglscratch/test1> llstatus<br />
Name Schedd InQ Act Startd Run LdAvg Idle Arch OpSys<br />
bglfen1.itso.ibm.com Avail 0 0 Idle 0 0.02 8 PPC64 Linux2<br />
bglfen2.itso.ibm.com Avail 0 0 Idle 0 0.00 9999 PPC64 Linux2<br />
PPC64/Linux2 2 machines 0 jobs 0 running<br />
Total Machines 2 machines 0 jobs 0 running<br />
The Central Manager is defined on bglsn.itso.ibm.com<br />
The BACKFILL scheduler with Blue Gene support is in use<br />
Blue Gene is present<br />
All machines on the machine_list are present.<br />
Next, submit the LoadLeveler job named ior-gpfs.cmd (see Example 6-45).<br />
Example 6-45 Submitting the LoadLeveler job<br />
test1@bglfen1:/bglscratch/test1> set -o vi<br />
test1@bglfen1:/bglscratch/test1> ls<br />
applications hello-file-2.rts hello.rts ior-gpfs.out hello128.cmd
hello-file.rts ior-gpfs.cmd ior-gpfs.out.ciod-hung-scenario hello.cmd
hello-gpfs.err ior-gpfs.err hello-file-1.rts hello-gpfs.out
ior-gpfs.err.ciod-hung-scenario
test1@bglfen1:/bglscratch/test1> llsubmit ior-gpfs.cmd<br />
llsubmit: The job "bglfen1.itso.ibm.com.37" has been submitted.<br />
test1@bglfen1:/bglscratch/test1> llq<br />
Id                       Owner      Submitted   ST PRI Class        Running On
------------------------ ---------- ----------- -- --- ------------ ----------
bglfen1.37.0             test1      3/28 10:50  I  50  small
1 job step(s) in queue, 1 waiting, 0 pending, 0 running, 0 held, 0<br />
Example 6-46 shows that the LoadLeveler job is actually running.
Example 6-46 LoadLeveler queue shows job is “Running”<br />
test1@bglfen1:/bglscratch/test1> llq<br />
Id                       Owner      Submitted   ST PRI Class        Running On
------------------------ ---------- ----------- -- --- ------------ ----------
bglfen1.37.0             test1      3/28 10:50  R  50  small        bglfen1
Now, check the IOR job output file (see Example 6-47).
Example 6-47 IOR job output file<br />
test1@bglfen1:/bglscratch/test1> tail -f ior-gpfs.out<br />
IOR-2.8.3: MPI Coordinated Test of Parallel I/O<br />
Run began: Tue Mar 28 12:11:48 2006<br />
Command line used: /bglscratch/test1/applications/IOR/IOR.rts -f<br />
/bglscratch/test1/applications/IOR/ior-inputs<br />
Machine: Blue Gene L ionode3<br />
Summary:<br />
api = MPIIO (version=2, subversion=0)<br />
test filename = /bubu/Examples/IOR/IOR-output-MPIIO-1<br />
access = single-shared-file<br />
clients = 128 (16 per node)<br />
repetitions = 10<br />
xfersize = 32768 bytes<br />
blocksize = 1 MiB<br />
aggregate filesize = 128 MiB<br />
delaying 1 seconds . . .<br />
access bw(MiB/s) block(KiB) xfer(KiB) open(s) wr/rd(s) close(s) iter<br />
------ --------- ---------- --------- -------- -------- -------- ----
write  7.27      1024.00    32.00     0.104517 17.00    1.59     0
delaying 1 seconds . . .<br />
read 75.77 1024.00 32.00 0.004636 1.68 0.002146 0<br />
delaying 1 seconds . . .<br />
write 7.15 1024.00 32.00 0.027025 17.35 1.51 1<br />
delaying 1 seconds . . .<br />
read 76.39 1024.00 32.00 0.004542 1.67 0.002301 1<br />
delaying 1 seconds . . .<br />
bglsn:/bubu/Examples/IOR # ls -lrt<br />
total 245888<br />
-rw-r--r-- 1 test1 itso 33554432 Mar 21 17:14 IOR-output<br />
drwxrwxrwx 3 root root 32768 Mar 22 10:00 ..<br />
-rw-r--r-- 1 test1 itso 134217728 Mar 23 10:37 IOR-output-MPIIO<br />
drwxr-xr-x 2 test1 itso 32768 Mar 28 12:14 .<br />
-rw-r--r-- 1 test1 itso 133922816 Mar 28 12:14 IOR-output-MPIIO-1<br />
test1@bglfen1:/bglscratch/test1> llq<br />
llq: There is currently no job status to report.<br />
test1@bglfen1:/bglscratch/test1><br />
Apparently, the job ran correctly. Assuming that we are not application<br />
specialists, we need to consult the application owner to verify that the output is<br />
what was expected.<br />
We get confirmation that the application's output is fine; however, we decide
to perform additional checks. First, we do some basic checks on the SN, as
shown in Example 6-48.
Example 6-48 Additional node checks<br />
##### GPFS checks:<br />
bglsn:/bubu/Examples/IOR # df<br />
Filesystem 1K-blocks Used Available Use% Mounted on<br />
/dev/sdb3 70614928 4743632 65871296 7% /<br />
tmpfs 1898508 8 1898500 1% /dev/shm<br />
/dev/sda4 489452 95272 394180 20% /tmp<br />
/dev/sda1 9766544 2448544 7318000 26% /bgl<br />
/dev/sda2 9766608 699636 9066972 8% /dbhome<br />
p630n03:/bglscratch 36700160 64384 36635776 1% /bglscratch<br />
p630n03_fn:/nfs_mnt 104857600 11339904 93517696 11% /mnt<br />
/dev/bubu_gpfs1 1138597888 3269632 1135328256 1% /bubu<br />
p630n03:/bglhome 2621440 98464 2522976 4% /bglhome<br />
bglsn:/bubu/Examples/IOR # touch /bubu/foo<br />
bglsn:/bubu/Examples/IOR # mmgetstate<br />
Node number Node name GPFS state<br />
-----------------------------------------<br />
1 bglsn_fn active<br />
bglsn:/bubu/Examples/IOR # mmgetstate -a<br />
The authenticity of host 'ionode6 (172.30.2.6)' can't be established.<br />
RSA key fingerprint is 0a:0b:9b:76:dc:f1:a0:55:69:9f:74:31:e1:e9:54:0b.<br />
Are you sure you want to continue connecting (yes/no)? The authenticity<br />
of host 'ionode7 (172.30.2.7)' can't be established.<br />
RSA key fingerprint is 0a:0b:9b:76:dc:f1:a0:55:69:9f:74:31:e1:e9:54:0b.<br />
Are you sure you want to continue connecting (yes/no)? The authenticity<br />
of host 'ionode5 (172.30.2.5)' can't be established.<br />
RSA key fingerprint is 0a:0b:9b:76:dc:f1:a0:55:69:9f:74:31:e1:e9:54:0b.<br />
Are you sure you want to continue connecting (yes/no)? The authenticity<br />
of host 'ionode3 (172.30.2.3)' can't be established.<br />
RSA key fingerprint is 0a:0b:9b:76:dc:f1:a0:55:69:9f:74:31:e1:e9:54:0b.<br />
Are you sure you want to continue connecting (yes/no)? The authenticity<br />
of host 'ionode1 (172.30.2.1)' can't be established.<br />
RSA key fingerprint is 0a:0b:9b:76:dc:f1:a0:55:69:9f:74:31:e1:e9:54:0b.<br />
Are you sure you want to continue connecting (yes/no)? The authenticity<br />
of host 'ionode4 (172.30.2.4)' can't be established.<br />
RSA key fingerprint is 0a:0b:9b:76:dc:f1:a0:55:69:9f:74:31:e1:e9:54:0b.<br />
Are you sure you want to continue connecting (yes/no)? The authenticity<br />
of host 'ionode2 (172.30.2.2)' can't be established.<br />
RSA key fingerprint is 0a:0b:9b:76:dc:f1:a0:55:69:9f:74:31:e1:e9:54:0b.<br />
Are you sure you want to continue connecting (yes/no)? The authenticity<br />
of host 'ionode8 (172.30.2.8)' can't be established.<br />
RSA key fingerprint is 0a:0b:9b:76:dc:f1:a0:55:69:9f:74:31:e1:e9:54:0b.<br />
Are you sure you want to continue connecting (yes/no)? The authenticity<br />
of host 'bglsn_fn.itso.ibm.com (172.30.1.1)' can't be established.<br />
RSA key fingerprint is 52:81:f5:75:be:d6:8e:cf:65:a4:b5:23:46:1a:c6:94.<br />
Are you sure you want to continue connecting (yes/no)? no<br />
^C<br />
mmgetstate: Interrupt received.==============================<br />
Because we should not be prompted to accept the host identity, we realize that
there is a problem. We double-check using the mmdsh and mmchconfig commands,
as shown in Example 6-49.
Example 6-49 Checking GPFS remote command execution<br />
bglsn:/bubu/Examples/IOR # export WCOLL=/tmp/ionodes<br />
bglsn:/bubu/Examples/IOR # mmdsh date<br />
mmdsh: ionode1 rsh process had return code 1.<br />
mmdsh: ionode2 rsh process had return code 1.<br />
mmdsh: ionode3 rsh process had return code 1.<br />
mmdsh: ionode4 rsh process had return code 1.<br />
mmdsh: ionode5 rsh process had return code 1.<br />
ionode1: ionode1: Connection refused<br />
ionode2: ionode2: Connection refused<br />
ionode3: ionode3: Connection refused<br />
ionode4: ionode4: Connection refused<br />
ionode5: ionode5: Connection refused<br />
ionode6: ionode6: Connection refused<br />
ionode7: ionode7: Connection refused<br />
ionode8: ionode8: Connection refused<br />
bglsn:/bubu/Examples/IOR # mmchconfig pagepool=128M -i<br />
mmchconfig: Command successfully completed<br />
The authenticity of host 'bglsn_fn.itso.ibm.com (172.30.1.1)' can't be<br />
established.<br />
RSA key fingerprint is 52:81:f5:75:be:d6:8e:cf:65:a4:b5:23:46:1a:c6:94.<br />
Are you sure you want to continue connecting (yes/no)? The authenticity<br />
of host 'ionode2 (172.30.2.2)' can't be established.<br />
RSA key fingerprint is 0a:0b:9b:76:dc:f1:a0:55:69:9f:74:31:e1:e9:54:0b.<br />
Are you sure you want to continue connecting (yes/no)? The authenticity<br />
of host 'ionode5 (172.30.2.5)' can't be established.<br />
RSA key fingerprint is 0a:0b:9b:76:dc:f1:a0:55:69:9f:74:31:e1:e9:54:0b.<br />
Are you sure you want to continue connecting (yes/no)? The authenticity<br />
of host 'ionode7 (172.30.2.7)' can't be established.<br />
RSA key fingerprint is 0a:0b:9b:76:dc:f1:a0:55:69:9f:74:31:e1:e9:54:0b.<br />
Are you sure you want to continue connecting (yes/no)? The authenticity<br />
of host 'ionode8 (172.30.2.8)' can't be established.<br />
RSA key fingerprint is 0a:0b:9b:76:dc:f1:a0:55:69:9f:74:31:e1:e9:54:0b.<br />
Are you sure you want to continue connecting (yes/no)? The authenticity<br />
of host 'ionode4 (172.30.2.4)' can't be established.<br />
RSA key fingerprint is 0a:0b:9b:76:dc:f1:a0:55:69:9f:74:31:e1:e9:54:0b.<br />
Are you sure you want to continue connecting (yes/no)? The authenticity<br />
of host 'ionode6 (172.30.2.6)' can't be established.<br />
RSA key fingerprint is 0a:0b:9b:76:dc:f1:a0:55:69:9f:74:31:e1:e9:54:0b.<br />
Are you sure you want to continue connecting (yes/no)? The authenticity<br />
of host 'ionode1 (172.30.2.1)' can't be established.<br />
RSA key fingerprint is 0a:0b:9b:76:dc:f1:a0:55:69:9f:74:31:e1:e9:54:0b.<br />
Are you sure you want to continue connecting (yes/no)? The authenticity<br />
of host 'ionode3 (172.30.2.3)' can't be established.<br />
RSA key fingerprint is 0a:0b:9b:76:dc:f1:a0:55:69:9f:74:31:e1:e9:54:0b.<br />
Are you sure you want to continue connecting (yes/no)? no<br />
no<br />
no<br />
no<br />
Please type 'yes' or 'no': no<br />
Please type 'yes' or 'no': no<br />
Please type 'yes' or 'no': no<br />
no<br />
no<br />
Please type 'yes' or 'no': 'NO'<br />
Please type 'yes' or 'no': no<br />
^C<br />
mmchconfig: Interrupt received: changes not propagated.<br />
From Example 6-49, we can conclude that we have GPFS authentication problems.
However, this does not affect the running job, because authentication is only
required when GPFS executes commands to start or stop daemons and to modify
the GPFS cluster configuration.
Now that we know that the GPFS configuration commands have a problem, we check
that the block is booted and try to connect interactively to an I/O node
using ssh, as in Example 6-50.
Example 6-50 Allocating a block with GPFS and ssh authentication broken
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./mmcs_db_console<br />
connecting to mmcs_server<br />
connected to mmcs_server<br />
connected to DB2<br />
mmcs$ list_blocks<br />
OK<br />
RMP28Mr121102270 root(0) connected<br />
mmcs$ redirect RMP28Mr121102270 on<br />
OK<br />
mmcs$ {i} write_con hostname<br />
FAIL<br />
block not selected<br />
mmcs$ select_block RMP28Mr121102270<br />
OK<br />
$<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ssh root@ionode5<br />
The authenticity of host 'ionode5 (172.30.2.5)' can't be established.<br />
RSA key fingerprint is 0a:0b:9b:76:dc:f1:a0:55:69:9f:74:31:e1:e9:54:0b.<br />
Are you sure you want to continue connecting (yes/no)? yes<br />
Warning: Permanently added 'ionode5,172.30.2.5' (RSA) to the list of<br />
known hosts.<br />
root@ionode5's password:<br />
Permission denied, please try again.<br />
root@ionode5's password:<br />
Permission denied, please try again.<br />
root@ionode5's password:<br />
Permission denied (publickey,password,keyboard-interactive).<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin #<br />
Because the root user on the I/O nodes does not have a password, ssh does not
allow us to log in, so we need to check the ssh authentication.
Before we leave this scenario, let us check whether GPFS can be stopped and
started with ssh broken, as shown in Example 6-51.
Example 6-51 Checking GPFS can be stopped/started<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./mmcs_db_console<br />
connecting to mmcs_server<br />
connected to mmcs_server<br />
connected to DB2<br />
mmcs$ list_blocks<br />
OK<br />
RMP28Mr121102270 root(0) connected<br />
mmcs$ free RMP28Mr121102270<br />
OK<br />
mmcs$<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # mmshutdown<br />
Tue Mar 28 13:33:58 EST 2006: mmshutdown: Starting force unmount of GPFS file systems
Tue Mar 28 13:34:13 EST 2006: mmshutdown: Shutting down GPFS daemons
Tue Mar 28 13:34:43 EST 2006: mmshutdown: Finished
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # mmgetstate<br />
Node number Node name GPFS state<br />
-----------------------------------------<br />
1 bglsn_fn down<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # mmstartup<br />
Tue Mar 28 13:35:46 EST 2006: mmstartup: Starting GPFS ...<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # df<br />
Filesystem 1K-blocks Used Available Use% Mounted on<br />
/dev/sdb3 70614928 4732208 65882720 7% /<br />
tmpfs 1898508 8 1898500 1% /dev/shm<br />
/dev/sda4 489452 95276 394176 20% /tmp<br />
/dev/sda1 9766544 2448564 7317980 26% /bgl<br />
/dev/sda2 9766608 699636 9066972 8% /dbhome<br />
p630n03:/bglscratch 36700160 64416 36635744 1% /bglscratch<br />
p630n03_fn:/nfs_mnt 104857600 11339904 93517696 11% /mnt<br />
p630n03:/bglhome 2621440 98464 2522976 4% /bglhome<br />
/dev/bubu_gpfs1 1138597888 3269632 1135328256 1% /bubu<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin #<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # mmgetstate<br />
Node number Node name GPFS state<br />
-----------------------------------------<br />
1 bglsn_fn active<br />
Even though it might look strange, GPFS started correctly and mounted the
remote file system. This is consistent with GPFS behavior (see the following
note).
Lessons learned<br />
From this scenario, we learn that ssh authentication within the bgIO cluster
is only required for changes to the GPFS configuration; it does not affect
data traffic or the stopping and starting of GPFS on a local node.
Note: Because GPFS is started on each node individually when the node is<br />
booted, the fact that ssh authentication is broken does not affect the GPFS<br />
cluster as long as the GPFS configuration file for each node (mmsdrfs) does<br />
not change.<br />
6.3.5 The /bgl file system is full (Blue Gene/L uses GPFS)
This scenario analyzes what happens when the /bgl file system fills up to
100%. We ran this scenario again to see how it affects a complex Blue Gene/L
system.
Note: In this scenario, our Blue Gene/L system USES GPFS.<br />
Error injection<br />
We used the dd command to create a file called /bgl/largefile that filled the /bgl<br />
file system (100%).<br />
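A quick way to spot this condition on the SN can be sketched as follows (the helper is ours, not part of the system software); it scans POSIX df -P output for file systems at or above a usage threshold:

```shell
# full_filesystems: our own sketch for spotting the injected condition.
# Prints every mounted file system whose usage percentage is at or above the
# given threshold, using df -P so the column layout is predictable.
full_filesystems() {
    threshold=${1:-95}
    df -P | awk -v t="$threshold" '
        NR > 1 { sub(/%/, "", $5); if ($5 + 0 >= t) print $6, $5 "%" }'
}
```

After the injection, full_filesystems 100 would list /bgl.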
<strong>Problem</strong> determination<br />
We started by running the LoadLeveler job (ior-gpfs.cmd). While LoadLeveler
shows the job as running, the block is still in the process of booting.
Looking at the mmcs_db_server log reveals the GPFS messages shown in
Example 6-52.
Example 6-52 GPFS cannot be started<br />
Mar 27 17:34:33 (I) [1096078560] {0}.0: Mon Mar 27 17:34:32 EST 2006<br />
mmautoload: GPFS is waiting for cluster data reposi<br />
Mar 27 17:34:33 (I) [1096078560] {0}.0: tory<br />
Mar 27 17:37:39 (I) [1096078560] {17}.0: tory<br />
mmautoload: The GPFS environment cannot be initialized.<br />
mmautoload: Correct the problem and use mmstartup to start GPFS.<br />
mmautoload: The GPFS environment cannot be initialized.<br />
Mar 27 17:37:39 (I) [1096078560] {0}.0: Mon Mar 27 17:37:38 EST 2006<br />
mmautoload: GPFS is waiting for cluster data reposi<br />
Mar 27 17:37:39 (I) [1096078560] {0}.0: tory<br />
mmautoload: The GPFS environment cannot be initialized.<br />
mmautoload: Correct the problem and use mmstartup to start GPFS.<br />
mmautoload: The GPFS environment cannot be initialized.<br />
We now check the block status as shown by the mmcs_console (see<br />
Example 6-53).<br />
Example 6-53 Block status<br />
mmcs_console:<br />
list bglblock <br />
==> DBBlock record<br />
_blockid = RMP27Mr173143250<br />
_numpsets = 0<br />
_numbps = 0<br />
_owner = root<br />
_istorus = 000<br />
_sizex = 0<br />
_sizey = 0<br />
_sizez = 0<br />
_description = LoadLeveler Partition<br />
_mode = C<br />
_options =<br />
_status = B<br />
_statuslastmodified = 2006-03-27 17:31:54.131411<br />
_mloaderimg =<br />
/bgl/BlueLight/ppcfloor/bglsys/bin/mmcs-mloader.rts<br />
_blrtsimg =<br />
/bgl/BlueLight/ppcfloor/bglsys/bin/rts_hw.rts<br />
_linuximg =<br />
/bgl/BlueLight/ppcfloor/bglsys/bin/zImage.elf<br />
_ramdiskimg =<br />
/bgl/BlueLight/ppcfloor/bglsys/bin/ramdisk.elf<br />
_debuggerimg = none<br />
_debuggerparmsize = 0<br />
_createdate = 2006-03-27 17:31:43.673653<br />
Chapter 6. Scenarios 313
We can see that the partition is still in B (booting) state. Next, we check the<br />
LoadLeveler cluster and queue (job) status as shown in Example 6-54.<br />
Example 6-54 LoadLeveler cluster and job status<br />
test1@bglfen1:/bglscratch/test1/applications/IOR> llstatus<br />
Name Schedd InQ Act Startd Run LdAvg Idle Arch OpSys<br />
bglfen1.itso.ibm.com Avail 1 1 Run 1 0.00 30 PPC64 Linux2<br />
bglfen2.itso.ibm.com Avail 0 0 Idle 0 0.00 9999 PPC64 Linux2<br />
PPC64/Linux2 2 machines 1 jobs 1 running<br />
Total Machines 2 machines 1 jobs 1 running<br />
The Central Manager is defined on bglsn.itso.ibm.com<br />
The BACKFILL scheduler with Blue Gene support is in use<br />
Blue Gene is present<br />
All machines on the machine_list are present.<br />
test1@bglfen1:/bglscratch/test1/applications/IOR> llq<br />
Id                       Owner      Submitted   ST PRI Class        Running On
------------------------ ---------- ----------- -- --- ------------ -----------
bglfen1.35.0             test1      3/27 16:15  R  50  small        bglfen1
1 job step(s) in queue, 0 waiting, 0 pending, 1 running, 0 held, 0 preempted<br />
Next, we test the file system status on the SN (see Example 6-55).<br />
Example 6-55 SN file system status<br />
test1@bglfen1:/bglscratch/test1/applications/IOR> df<br />
Filesystem 1K-blocks Used Available Use% Mounted on<br />
/dev/sdc3 70614928 15459236 55155692 22% /<br />
tmpfs 3955528 8 3955520 1% /dev/shm<br />
p630n03:/nfs_mnt 104857600 11339904 93517696 11% /nfs_mnt<br />
p630n03:/bglscratch 36700160 64384 36635776 1% /bglscratch<br />
bglsn_fn:/bgl 9766560 9766560 0 100% /bgl<br />
p630n03:/bglhome 2621440 98464 2522976 4% /bglhome<br />
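The df output above already contains the clue. A small filter like the following (a sketch, assuming the POSIX `df -P` column layout; the function name is ours) flags any file system at or above a usage threshold:

```shell
# Print every mount whose Use% meets the threshold (default 100).
check_full() {    # usage: df -P | check_full [threshold-percent]
  awk -v t="${1:-100}" \
    'NR > 1 { sub(/%/, "", $5); if ($5 + 0 >= t) print $6 " is " $5 "% full" }'
}

df -P | check_full 100   # on the SN this would have flagged /bgl
```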
As we can see in Example 6-54, the job with the ID bglfen1.35.0 appears to be<br />
in Running state, so we look for more details (Example 6-56).<br />
Example 6-56 Details on LoadLeveler job bglfen1.35.0<br />
test1@bglfen1:/bglscratch/test1/applications/IOR> llq -s bglfen1.35.0<br />
=============== Job Step bglfen1.itso.ibm.com.35.0 ===============<br />
Job Step Id: bglfen1.itso.ibm.com.35.0<br />
Job Name: bglfen1.itso.ibm.com.35<br />
Step Name: 0<br />
Structure Version: 10<br />
Owner: test1<br />
Queue Date: Mon 27 Mar 2006 04:15:20 PM EST<br />
Status: Running<br />
Reservation ID:<br />
Requested Res. ID:<br />
Scheduling Cluster:<br />
Submitting Cluster:<br />
Sending Cluster:<br />
Requested Cluster:<br />
Schedd History:<br />
Outbound Schedds:<br />
Submitting User:<br />
Execution Factor: 1<br />
Dispatch Time: Mon 27 Mar 2006 05:31:44 PM EST<br />
Completion Date:<br />
Completion Code:<br />
Favored Job: No<br />
User Priority: 50<br />
user_sysprio: 0<br />
class_sysprio: 0<br />
group_sysprio: 0<br />
System Priority: -275512<br />
q_sysprio: -275512<br />
Previous q_sysprio: 0<br />
Notifications: Complete<br />
Virtual Image Size: 472 kb<br />
Large Page: N<br />
Checkpointable: no<br />
Ckpt Start Time:<br />
Good Ckpt Time/Date:<br />
Ckpt Elapse Time: 0 seconds<br />
Fail Ckpt Time/Date:<br />
Ckpt Accum Time: 0 seconds<br />
Checkpoint File:<br />
Ckpt Execute Dir:<br />
Restart From Ckpt: no<br />
Restart Same Nodes: no<br />
Restart: yes<br />
Preemptable: no<br />
Preempt Wait Count: 0<br />
Hold Job Until:<br />
RSet: RSET_NONE<br />
Mcm Affinity Options:<br />
Cmd: /usr/bin/mpirun<br />
Args: -exe /bglscratch/test1/applications/IOR/IOR.rts<br />
-args "-f /bglscratch/test1/applications/IOR/ior-inputs"<br />
Env:<br />
In: /dev/null<br />
Out: /bglscratch/test1/ior-gpfs.out<br />
Err: /bglscratch/test1/ior-gpfs.err<br />
Initial Working Dir: /bglscratch/test1/applications/IOR<br />
Dependency:<br />
Resources:<br />
Requirements: (Arch == "PPC64") && (OpSys == "Linux2")<br />
Preferences:<br />
Step Type: Blue Gene<br />
Size Requested: 128<br />
Size Allocated: 128<br />
Shape Requested:<br />
Shape Allocated:<br />
Wiring Requested: MESH<br />
Wiring Allocated: MESH<br />
Rotate: True<br />
Blue Gene Status:<br />
Blue Gene Job Id:<br />
Partition Requested:<br />
Partition Allocated: RMP27Mr173143250<br />
Error Text:<br />
Node Usage: shared<br />
Submitting Host: bglfen1.itso.ibm.com<br />
Schedd Host: bglfen1.itso.ibm.com<br />
Job Queue Key:<br />
Notify User: test1@bglfen1.itso.ibm.com<br />
Shell: /bin/bash<br />
LoadLeveler <strong>Group</strong>: No_<strong>Group</strong><br />
Class: small<br />
Ckpt Hard Limit: undefined<br />
Ckpt Soft Limit: undefined<br />
Cpu Hard Limit: undefined<br />
Cpu Soft Limit: undefined<br />
Data Hard Limit: undefined<br />
Data Soft Limit: undefined<br />
Core Hard Limit: undefined<br />
Core Soft Limit: undefined<br />
File Hard Limit: undefined<br />
File Soft Limit: undefined<br />
Stack Hard Limit: undefined<br />
Stack Soft Limit: undefined<br />
Rss Hard Limit: undefined<br />
Rss Soft Limit: undefined<br />
Step Cpu Hard Limit: undefined<br />
Step Cpu Soft Limit: undefined<br />
Wall Clk Hard Limit: 00:30:00 (1800 seconds)<br />
Wall Clk Soft Limit: 00:30:00 (1800 seconds)<br />
Comment:<br />
Account:<br />
Unix <strong>Group</strong>: itso<br />
NQS Submit Queue:<br />
NQS Query Queues:<br />
Negotiator Messages:<br />
Bulk Transfer: No<br />
Step Adapter Memory: 0 bytes<br />
Adapter Requirement:<br />
Step Cpus: 0<br />
Step Virtual Memory: 0.000 mb<br />
Step Real Memory: 0.000 mb<br />
==================== EVALUATIONS FOR JOB STEP bglfen1.itso.ibm.com.35.0<br />
====================<br />
The status of job step is : Running<br />
Since job step status is not Idle, Not Queued, or Deferred, no attempt<br />
has been made to determine why this job step has not been started.<br />
Important: Because job step status is not Idle, Queued, or Deferred, no<br />
attempt has been made to determine why this job step has not been started.<br />
However, as the job stays in Running state for quite some time, we decide to<br />
investigate further. We know from the LoadLeveler job command file that a<br />
GPFS file system is used for storing the files related to running this job. Thus, we<br />
decide to investigate GPFS on the I/O nodes.<br />
We cannot find either the I/O node boot messages or the GPFS log messages (both
are written to the /bgl file system, which is NFS exported from the SN), and we
realize that the /bgl file system is full (100%). We identify and remove
/bgl/largefile, and the block continues to boot. After the block booted, we logged
in to one of the I/O nodes (using ssh) and saw the messages shown in Example 6-57.
Example 6-57 Checking the GPFS file system<br />
$ df /bubu<br />
Mar 27 17:47:39 (I) [1096078560] {119}.0: [ciod:initialized]<br />
Mar 27 17:47:39 (I) [1096078560] {102}.0: [ciod:initialized]<br />
Mar 27 17:47:39 (I) [1096078560] {17}.0: R00-M0-N0-I:J18-U11<br />
/var/etc/rc.d/rc3.d/S40gpfs: GPFS did not come up on I/O nod<br />
Mar 27 17:47:39 (I) [1096078560] {0}.0: R00-M0-N0-I:J18-U01<br />
/var/etc/rc.d/rc3.d/S40gpfs: GPFS did not come up on I/O nod<br />
Mar 27 17:47:39 (I) [1096078560] {0}.0: [ciod:initialized]<br />
Mar 27 17:47:39 (I) [1096078560] {0}.0: e ionode1 : 172.30.2.1<br />
Starting syslog services<br />
Starting ciod<br />
Starting XL Compiler Environment for I/O node<br />
ciod: version "Jan 10 2006 16:25:12"<br />
ciod: running in virtual node mode with 32 processors<br />
BusyBox v1.00-rc2 (2006.01.09-19:48+0000) Built-in shell (ash)<br />
Enter 'help' for a list of built-in commands.<br />
/bin/sh: can't access tty; job control turned off<br />
$ df /bubu<br />
df: `/bubu': No such file or directory<br />
$ mount<br />
rootfs on / type rootfs (rw)<br />
/dev/root on / type ext2 (rw)<br />
none on /proc type proc (rw)<br />
172.30.1.1:/bgl on /bgl type nfs (rw,v3,rsize=8192,wsize=8192,<br />
Mar 27 17:47:39 (I) [1096078560] {85}.0: [ciod:initialized]<br />
Mar 27 17:47:39 (I) [1096078560] {68}.0: rootfs on / type rootfs (rw)<br />
318 IBM System Blue Gene Solution: <strong>Problem</strong> <strong>Determination</strong> <strong>Guide</strong>
Mar 27 17:47:39 (I) [1096078560] {51}.0: [ciod:initialized]<br />
Mar 27 17:47:39 (I) [1096078560] {51}.0: df:<br />
Mar 27 17:47:39 (I) [1096078560] {34}.0: [ciod:initialized]<br />
Mar 27 17:47:39 (I) [1096078560] {17}.0: [ciod:initialized]<br />
Mar 27 17:47:39 (I) [1096078560] {0}.0: /var/etc/rc.d/rc3.d/S40gpfs:<br />
GPFS did not come up on I/O node ionode1 : 172.30.2.1<br />
Mar 27 17:47:39 (I) [1096078560] {0}.0:<br />
hard,udp,nolock,addr=172.30.1.1)<br />
172.30.1.33:/bglscratch on /bglscratch type nfs<br />
(rw,v3,rsize=32768,wsize=32768,hard,tcp,lock,addr=172.30.1.33)<br />
$ mount<br />
rootfs on / type rootfs (rw)<br />
/dev/root on / type ext2 (rw)<br />
none on /proc type proc (rw)<br />
172.30.1.1:/bgl on /bgl type nfs<br />
(rw,v3,rsize=8192,wsize=8192,hard,udp,nolock,addr=172.30.1.1)<br />
172.30.1.33:/bglscratch on /bglscratch type nfs<br />
(rw,v3,rsize=32768,wsize=32768,hard,tcp,lock,addr=172.30.1.33)<br />
$<br />
Mar 27 17:47:39 (I) [1096078560] {85}.0: df:<br />
Mar 27 17:47:39 (I) [1096078560] {85}.0: `/bubu': No such file or<br />
directory<br />
Because the GPFS file system (/bubu) cannot be found, we check the 10-minute
GPFS start timeout and see that it has expired. So, we have to reboot the
block for GPFS to start and allow the job to run.
Lessons learned<br />
We learned the following lessons:<br />
► If the /bgl file system becomes full, GPFS cannot start, and we see the
following message in the mmcs_db_server log:
The GPFS environment cannot be initialized.
► After the 10-minute timeout period has expired, and after fixing the problem
(making some space available in /bgl), we need to reboot the block.
Note: Even though we can manually correct the problem and bring up GPFS,<br />
it is better to reboot the block to allow GPFS to cleanly start and mount the file<br />
system.<br />
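The lessons above can be distilled into a quick log check. The function is generic; the example path in the comment is an assumption (log names vary by installation), and the function name is ours:

```shell
# Count occurrences of the mmautoload failure message in a log file.
scan_gpfs_init_failures() {   # usage: scan_gpfs_init_failures <logfile>
  grep -c 'The GPFS environment cannot be initialized' "$1"
}

# Example (the log path is an assumption; adjust to your installation):
# scan_gpfs_init_failures /bgl/BlueLight/logs/BGL/bglsn-mmcs_db_server.log
```

A nonzero count right after booting a block is the cue to check /bgl for free space before waiting out the 10-minute GPFS start timeout.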
6.3.6 Installing new Blue Gene/L driver code (driver update)<br />
This scenario documents the process of upgrading the Blue Gene/L code. It is<br />
designed to show what happens if the GPFS code is not also updated at the<br />
same time.<br />
Error injection<br />
The error injected is the update of the Blue Gene/L code itself. This section is
rather long because we present all the checks that we went through before<br />
updating the code and the actions that are required after the update RPMs are<br />
installed.<br />
First we check the code levels:<br />
1. Check the correct code levels and the process that we intend to use.<br />
Example 6-58 shows the Blue Gene/L driver levels.<br />
Example 6-58 Current Blue Gene/L code levels<br />
bglsn:~ # rpm -qa | grep bgl<br />
libglade-0.17-230.1<br />
bglmpi-2006.1.2-1<br />
bglbaremetal-2006.1.2-1<br />
bglos-4.1-0<br />
bglcmcs-2006.1.2-1<br />
libglade2-2.0.1-501.3<br />
bglblrtstool-2006.1.2-1<br />
bglcnk-2006.1.2-1<br />
bgldiag-2006.1.2-1<br />
bglmcp-2006.1.2-1<br />
Example 6-59 shows the code levels for the GPFS installed for the SN.<br />
Example 6-59 GPFS for Blue Gene/L<br />
bglsn:~ # rpm -qa | grep gpfs<br />
gpfs.docs-2.3.0-10<br />
gpfs.gpl-2.3.0-10<br />
gpfs.msg.en_US-2.3.0-10<br />
gpfs.base-2.3.0-10<br />
Example 6-60 lists the installed code levels for the GPFS for the I/O node or<br />
nodes.<br />
Note: You do not need to compile the GPL portability layer for the MCP. The<br />
gpfs.gplbin package contains the compiled modules for the portability code.<br />
Example 6-60 GPFS for PPC440 (I/O nodes processor) levels<br />
bglsn:/bgl/BlueLight # rpm --root \
/bgl/BlueLight/V1R2M1_020_2006-060110/ppc/bglsys/bin/bglOS -qa | grep gpfs
gpfs.base-2.3.0-10<br />
gpfs.gpl-2.3.0-10<br />
gpfs.msg.en_US-2.3.0-10<br />
gpfs.docs-2.3.0-10<br />
gpfs.gplbin-2.3.0-11<br />
2. Check for the currently installed version of the Blue Gene/L driver code, as<br />
shown in Example 6-61.<br />
Example 6-61 Current Blue Gene/L driver level<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./mmcs_db_console<br />
connecting to mmcs_server<br />
connected to mmcs_server<br />
connected to DB2<br />
mmcs$ version<br />
OK<br />
mmcs_db_server $Name: V1R2M1_020_2006 $ Jan 10 2006 16:23:15<br />
mmcs$<br />
3. Ensure that we have all the required update packages downloaded onto the
SN. Again, we need two sets of GPFS packages: one for the SN (PPC64, 64-bit)
and one for the I/O nodes’ MCP (PPC440, 32-bit), as shown in Example 6-62.
Example 6-62 GPFS PTF 11 update packages for the SN (PPC64) and I/O nodes (MCP)
bglsn:/mnt/LPP/GPFS # cd PPC64_PTF11<br />
bglsn:/mnt/LPP/GPFS/PPC64_PTF11 # ls -lrt<br />
total 4882<br />
-rw-r--r-- 1 root root 4161526 Mar 31 19:21<br />
gpfs.base-2.3.0-11.ppc64.rpm<br />
-rw-r--r-- 1 root root 113285 Mar 31 19:21<br />
gpfs.docs-2.3.0-11.noarch.rpm<br />
-rw-r--r-- 1 root root 331349 Mar 31 19:21<br />
gpfs.gpl-2.3.0-11.noarch.rpm<br />
-rw-r--r-- 1 root root 58882 Mar 31 19:21<br />
gpfs.msg.en_US-2.3.0-11.noarch.rpm<br />
bglsn:/mnt/LPP/GPFS/PPC64_PTF11 # cd ../MCP_PTF11<br />
bglsn:/mnt/LPP/GPFS/MCP_PTF11 # ls -lrt<br />
total 5196<br />
-rw-r--r-- 1 root root 4161526 Mar 31 19:21 gpfs.base-2.3.0-11.ppc.rpm<br />
-rw-r--r-- 1 root root 113285 Mar 31 19:21<br />
gpfs.docs-2.3.0-11.noarch.rpm<br />
-rw-r--r-- 1 root root 331349 Mar 31 19:21<br />
gpfs.gpl-2.3.0-11.noarch.rpm<br />
-rw-r--r-- 1 root root 644241 Mar 31 19:21<br />
gpfs.gplbin-2.3.0-11.ppc.rpm<br />
-rw-r--r-- 1 root root 58882 Mar 31 19:21<br />
gpfs.msg.en_US-2.3.0-11.noarch.rpm<br />
We now review the install process from the readme file that we downloaded from<br />
the Blue Gene/L Web site along with the new driver. Here we present a list of the<br />
actions that are relevant for our configuration:<br />
1. Download RPMs into appropriate directories.<br />
Note: You can use a naming convention of your choice for the download<br />
directories. You just need to make sure you install the right code into the<br />
right directories.<br />
2. Install RPMs and rebuild BLRTS tool chain.<br />
3. If you have any customized scripts that mount your file systems stored in<br />
/bgl/BlueLight/ppcfloor/dist/etc/rc.d/rc3.d directory, you have to re-create<br />
them with each driver upgrade.<br />
4. Stop the control system jobs running on the SN.<br />
5. Update the following symbolic link:<br />
rm /bgl/BlueLight/ppcfloor<br />
ln -s /bgl/BlueLight/V1R2M1_020_2006-060110/ppc<br />
/bgl/BlueLight/ppcfloor<br />
6. Determine the home directory of user bgdb2cli in the file<br />
/bgl/BlueLight/ppcfloor/bglsys/bin/db2profile:<br />
echo ~bgdb2cli<br />
Change "INSTHOME=/u/bgdb2cli" to "INSTHOME=X", where X = the result<br />
from the echo ~bgdb2cli command.<br />
In the file /bgl/BlueLight/ppcfloor/bglsys/bin/db2cshrc change "setenv<br />
INSTHOME /u/bgdb2cli" to "setenv INSTHOME X", where X = the result from the<br />
echo ~bgdb2cli command.<br />
Check the new settings:<br />
bglsn:~ # echo ~bgdb2cli<br />
/dbhome/bgdb2cli<br />
bglsn:~ # grep INSTHOME=<br />
/bgl/BlueLight/ppcfloor/bglsys/bin/db2profile<br />
INSTHOME=/dbhome/bgdb2cli<br />
7. Rebind the new jar file.<br />
8. Update discovery directory.<br />
Everyone should run these two commands (where <driver> is the driver
directory under /bgl/BlueLight):
cp /bgl/BlueLight/<driver>/ppc/bglsys/discovery/runPopIpPool /discovery
cp /bgl/BlueLight/<driver>/ppc/bglsys/discovery/ServiceNetwork.cfg /discovery
9. Update the database schema. If you are upgrading from V1R1M1, run:<br />
/bgl/BlueLight/ppcfloor/bglsys/schema/updateSchema.pl<br />
--dbproperties XX --driver V1R2M1<br />
where XX = your db.properties file (for example,<br />
/bgl/BlueLight/ppcfloor/bglsys/bin/db.properties).<br />
10.The Blue Gene/L upgrade is now complete. When ready, start the control<br />
system to resume any jobs that you stopped in Step 4.<br />
Having reviewed these steps, we first verify that the system can run a job
using LoadLeveler before starting the update:
test1@bglfen1:/bglscratch/test1> llsubmit ior-gpfs.cmd
The output file shows that the job ran OK.
Step 1<br />
Now we install the new Blue Gene/L RPMs as shown in Example 6-63.<br />
Example 6-63 Installing new Blue Gene/L driver RPMs<br />
bglsn:/mnt/BGL/BGL_V1R2M2 # rpm -ivh bgl*.rpm<br />
Preparing... ########################################### [100%]<br />
1:bglos ########################################### [ 11%]<br />
2:bglblrtstool ########################################### [ 22%]<br />
======================================================================================
=== RPM has installed but automatic building of the blrts toolchain not successful ===
=== Unable to attempt blrts toolchain build, /bgl/downloads not found ===
=== Follow the manual instructions for building the toolchain ===
======================================================================================
error: %post(bglblrtstool-2006.1.2-2) scriptlet failed, exit status 1<br />
3:bgliontool ########################################### [ 33%]<br />
===================================================================================<br />
=== RPM has installed but automatic building of the IO toolchain not successful ===<br />
=== Unable to attempt blrts toolchain build, /bgl/downloads not found ===<br />
=== Follow the manual instructions to build the IO node toolchain ===<br />
===================================================================================<br />
error: %post(bgliontool-2006.1.2-2) scriptlet failed, exit status 1<br />
4:bglmpi ########################################### [ 44%]<br />
5:bglbaremetal ########################################### [ 56%]<br />
6:bglcmcs ########################################### [ 67%]<br />
7:bglcnk ########################################### [ 78%]<br />
8:bgldiag ########################################### [ 89%]<br />
9:bglmcp ########################################### [100%]<br />
bglsn:/mnt/BGL/BGL_V1R2M2 #<br />
The output from the rpm install command shows that we now need to rebuild the<br />
BLRTS toolchain.<br />
Step 2<br />
To rebuild the BLRTS toolchain, we first have to edit the
retrieveToolChains.sh script located in the
/bgl/BlueLight/V1R2M2_120_2006-060328/ppc/toolchain/ directory. The original
script uses curl -O commands, which would not work on our system due to our
firewall configuration, which only allows Web traffic (HTTP).
We got around this by copying the script to the /bgl/downloads directory and
substituting wget commands for the curl -O commands. Example 6-64 shows
the modified script.
Example 6-64 Modified retrieveToolChains.sh script<br />
bglsn:/bgl/downloads # cat retrieveToolChains.sh<br />
#! /bin/bash<br />
###################################################################<br />
# Product(s): */<br />
# 5733-BG1 */<br />
# */<br />
# (C) Copyright IBM Corp. 2004, 2004 */<br />
# All rights reserved. */<br />
# US Government Users Restricted Rights - */<br />
# Use, duplication or disclosure restricted */<br />
# by GSA ADP Schedule Contract with IBM Corp. */<br />
# */<br />
# Licensed Materials-Property of IBM */<br />
# */<br />
##################################################################<br />
# This script is to help facilitate the retrieval of the GNU<br />
# components necessary to build toolchains<br />
#<br />
# The script utilizes curl to go to the ftp.gnu.org site and<br />
# ftp the appropriate tarballs to your system. These<br />
# files will be installed in the CWD.<br />
#<br />
# To utilize the script, you will need to insure curl is<br />
# installed on your sytem and your system has ftp access to<br />
# outside sites.<br />
#<br />
# Once you run this script to download the appropriate<br />
# packages, you can install the BlueGene/L patches and<br />
# commence on building the Toolchain.<br />
##############################################################<br />
# Grab all of the Toolchain components necessary to build both<br />
# the blrts and the linux toolchains for BlueGene/L<br />
wget ftp://ftp.gnu.org/gnu/binutils/binutils-2.13.tar.gz<br />
wget ftp://ftp.gnu.org/gnu/gcc/gcc-3.2/gcc-3.2.tar.gz<br />
wget ftp://ftp.gnu.org/gnu/gdb/gdb-5.3.tar.gz<br />
wget ftp://ftp.gnu.org/gnu/glibc/glibc-2.2.5.tar.gz<br />
wget ftp://ftp.gnu.org/gnu/glibc/glibc-linuxthreads-2.2.5.tar.gz<br />
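The manual edit can also be scripted. This sketch rewrites each `curl -O <url>` line to `wget <url>` in a copy of the script; the single sample line below stands in for retrieveToolChains.sh:

```shell
# Rewrite each `curl -O <url>` line to `wget <url>` in a copy of the
# script. The sample line stands in for retrieveToolChains.sh.
SCRIPT=$(mktemp)
printf 'curl -O ftp://ftp.gnu.org/gnu/gdb/gdb-5.3.tar.gz\n' > "$SCRIPT"
sed -i 's|^curl -O |wget |' "$SCRIPT"
cat "$SCRIPT"   # wget ftp://ftp.gnu.org/gnu/gdb/gdb-5.3.tar.gz
```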
We used the updated script to retrieve the toolchain, as shown in Example 6-65.<br />
Example 6-65 Retrieving the BLRTS tool chain<br />
bglsn:/bgl/downloads # ./retrieveToolChains.sh<br />
--12:02:45-- ftp://ftp.gnu.org/gnu/binutils/binutils-2.13.tar.gz<br />
=> `binutils-2.13.tar.gz'<br />
Resolving ftp.gnu.org... 199.232.41.7<br />
Connecting to ftp.gnu.org[199.232.41.7]:21... connected.<br />
Logging in as anonymous ... Logged in!<br />
==> SYST ... done. ==> PWD ... done.<br />
==> TYPE I ... done. ==> CWD /gnu/binutils ... done.<br />
==> PASV ... done. ==> RETR binutils-2.13.tar.gz ... done.<br />
Length: 12,790,277 (unauthoritative)<br />
100%[==================================================================<br />
====>] 12,790,277 1.67M/s ETA 00:00<br />
12:02:52 (1.67 MB/s) - `binutils-2.13.tar.gz' saved [12790277]<br />
--12:02:52-- ftp://ftp.gnu.org/gnu/gcc/gcc-3.2/gcc-3.2.tar.gz<br />
=> `gcc-3.2.tar.gz'<br />
Resolving ftp.gnu.org... 199.232.41.7<br />
Connecting to ftp.gnu.org[199.232.41.7]:21... connected.<br />
Logging in as anonymous ... Logged in!<br />
==> SYST ... done. ==> PWD ... done.<br />
==> TYPE I ... done. ==> CWD /gnu/gcc/gcc-3.2 ... done.<br />
==> PASV ... done. ==> RETR gcc-3.2.tar.gz ... done.<br />
Length: 26,963,731 (unauthoritative)<br />
100%[==================================================================<br />
====>] 26,963,731 1.73M/s ETA 00:00<br />
12:03:08 (1.69 MB/s) - `gcc-3.2.tar.gz' saved [26963731]<br />
--12:03:08-- ftp://ftp.gnu.org/gnu/gdb/gdb-5.3.tar.gz<br />
=> `gdb-5.3.tar.gz'<br />
Resolving ftp.gnu.org... 199.232.41.7<br />
Connecting to ftp.gnu.org[199.232.41.7]:21... connected.<br />
Logging in as anonymous ... Logged in!<br />
==> SYST ... done. ==> PWD ... done.<br />
==> TYPE I ... done. ==> CWD /gnu/gdb ... done.<br />
==> PASV ... done. ==> RETR gdb-5.3.tar.gz ... done.<br />
Length: 14,707,600 (unauthoritative)<br />
100%[==================================================================<br />
====>] 14,707,600 1.72M/s ETA 00:00<br />
12:03:16 (1.67 MB/s) - `gdb-5.3.tar.gz' saved [14707600]<br />
--12:03:16-- ftp://ftp.gnu.org/gnu/glibc/glibc-2.2.5.tar.gz<br />
=> `glibc-2.2.5.tar.gz'<br />
Resolving ftp.gnu.org... 199.232.41.7<br />
Connecting to ftp.gnu.org[199.232.41.7]:21... connected.<br />
Logging in as anonymous ... Logged in!<br />
==> SYST ... done. ==> PWD ... done.<br />
==> TYPE I ... done. ==> CWD /gnu/glibc ... done.<br />
==> PASV ... done. ==> RETR glibc-2.2.5.tar.gz ... done.<br />
Length: 16,657,505 (unauthoritative)<br />
100%[==================================================================<br />
====>] 16,657,505 1.74M/s ETA 00:00<br />
12:03:26 (1.68 MB/s) - `glibc-2.2.5.tar.gz' saved [16657505]<br />
--12:03:26--<br />
ftp://ftp.gnu.org/gnu/glibc/glibc-linuxthreads-2.2.5.tar.gz<br />
=> `glibc-linuxthreads-2.2.5.tar.gz'<br />
Resolving ftp.gnu.org... 199.232.41.7<br />
Connecting to ftp.gnu.org[199.232.41.7]:21... connected.<br />
Logging in as anonymous ... Logged in!<br />
==> SYST ... done. ==> PWD ... done.<br />
==> TYPE I ... done. ==> CWD /gnu/glibc ... done.<br />
==> PASV ... done. ==> RETR glibc-linuxthreads-2.2.5.tar.gz ...<br />
done.<br />
Length: 226,543 (unauthoritative)<br />
100%[==================================================================<br />
====>] 226,543 1.15M/s<br />
12:03:26 (1.15 MB/s) - `glibc-linuxthreads-2.2.5.tar.gz' saved [226543]<br />
Example 6-66 shows the files that were downloaded into the /bgl/downloads<br />
directory.<br />
Example 6-66 Downloaded files for the tool chain<br />
bglsn:/bgl/downloads # ls -lrt<br />
total 69751<br />
-rwxr-xr-x 1 root root 1913 Apr 1 12:02 retrieveToolChains.sh<br />
-rw-r--r-- 1 root root 12790277 Apr 1 12:02 binutils-2.13.tar.gz<br />
-rw-r--r-- 1 root root 26963731 Apr 1 12:03 gcc-3.2.tar.gz<br />
-rw-r--r-- 1 root root 14707600 Apr 1 12:03 gdb-5.3.tar.gz<br />
-rw-r--r-- 1 root root 16657505 Apr 1 12:03 glibc-2.2.5.tar.gz<br />
-rw-r--r-- 1 root root 226543 Apr 1 12:03<br />
glibc-linuxthreads-2.2.5.tar.gz<br />
We can now proceed to rebuild the BLRTS toolchain. This process can take
some time (it took us about half an hour), so Example 6-67 shows only the
last part of the output. Because we could not tell from the output alone
whether the build was successful, we tested the return code to verify that
the tool chain had in fact been built correctly.
Example 6-67 BLRTS tool chain built<br />
bglsn:/bgl/downloads #<br />
/bgl/BlueLight/V1R2M2_120_2006-060328/ppc/toolchain/buildBlrtsToolChain<br />
.sh<br />
..... > .....<br />
make[4]: Entering directory<br />
`/bgl/downloads/gnu/build-powerpc-bgl-blrts-gnu/gdb-5.3-build/gdb/gdbse<br />
rver'<br />
n=`echo gdbserver | sed 's,x,x,'`; \<br />
if [ x$n = x ]; then n=gdbserver; else true; fi; \<br />
/bin/sh /bgl/downloads/gnu/gdb-5.3/install-sh -c gdbserver<br />
/bgl/BlueLight/V1R2M2_120_2006-060328/ppc/blrts-gnu/bin/$n; \<br />
/bin/sh /bgl/downloads/gnu/gdb-5.3/install-sh -c -m 644<br />
/bgl/downloads/gnu/gdb-5.3/gdb/gdbserver/gdbserver.1<br />
/bgl/BlueLight/V1R2M2_120_2006-060328/ppc/blrts-gnu/man/man1/$n.1<br />
make[4]: Leaving directory<br />
`/bgl/downloads/gnu/build-powerpc-bgl-blrts-gnu/gdb-5.3-build/gdb/gdbse<br />
rver'<br />
make[3]: Leaving directory<br />
`/bgl/downloads/gnu/build-powerpc-bgl-blrts-gnu/gdb-5.3-build/gdb'<br />
make[2]: Leaving directory<br />
`/bgl/downloads/gnu/build-powerpc-bgl-blrts-gnu/gdb-5.3-build/gdb'<br />
make[1]: Leaving directory<br />
`/bgl/downloads/gnu/build-powerpc-bgl-blrts-gnu/gdb-5.3-build'<br />
bglsn:/bgl/downloads # echo $?<br />
0<br />
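The return-code check works only when $? is read immediately after the build command; a sketch of a more explicit form (the stand-in command replaces buildBlrtsToolChain.sh so the sketch runs anywhere):

```shell
# Read $? immediately after the command; any later command overwrites it.
sh -c 'exit 0'   # stand-in for .../ppc/toolchain/buildBlrtsToolChain.sh
rc=$?
if [ "$rc" -eq 0 ]; then
  echo "toolchain build succeeded"
else
  echo "toolchain build failed with status $rc"
fi
```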
Step 3<br />
We saved our customized scripts (containing our local NFS and GPFS
configuration) so that we can re-create them after the upgrade.
Step 4<br />
Because the tool chain build was successful, we stop the control system<br />
(Example 6-68).<br />
Example 6-68 Stopping the bglmaster<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster status<br />
idoproxy started [11220]<br />
ciodb started [11221]<br />
mmcs_server started [11222]<br />
monitor stopped<br />
perfmon stopped<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster stop<br />
Stopping BGLMaster<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster status<br />
idoproxy started [11220]<br />
ciodb started [11221]<br />
mmcs_server started [11222]<br />
monitor stopped<br />
perfmon stopped<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster status<br />
BGLMaster is not started<br />
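In Example 6-68 the daemons are still listed immediately after `bglmaster stop`; polling until the status command reports "not started" avoids proceeding with the upgrade too early. A sketch (the function name is ours; the bglmaster command is from our system):

```shell
# Poll a status command until it reports "not started" or a timeout
# expires. Returns 0 once the daemons are down, 1 on timeout.
wait_for_stop() {   # usage: wait_for_stop "<status command>" <timeout-seconds>
  end=$(( $(date +%s) + $2 ))
  while [ "$(date +%s)" -le "$end" ]; do
    if $1 | grep -q 'not started'; then
      return 0
    fi
    sleep 1
  done
  return 1
}

# e.g. wait_for_stop "./bglmaster status" 60
```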
We also check the LoadLeveler status and then stop LoadLeveler, as shown in Example 6-69.
Example 6-69 Stopping LoadLeveler<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # llstatus<br />
Name Schedd InQ Act Startd Run LdAvg Idle Arch OpSys<br />
bglfen1.itso.ibm.com Avail 0 0 Idle 0 0.00 4318 PPC64 Linux2<br />
bglfen2.itso.ibm.com Avail 0 0 Idle 0 0.00 9999 PPC64 Linux2<br />
PPC64/Linux2 2 machines 0 jobs 0 running<br />
Total Machines 2 machines 0 jobs 0 running<br />
The Central Manager is defined on bglsn.itso.ibm.com<br />
The BACKFILL scheduler with Blue Gene support is in use<br />
Blue Gene is present<br />
All machines on the machine_list are present.<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # llctl -g stop<br />
llctl: Sent stop command to host bglsn.itso.ibm.com<br />
llctl: Sent stop command to host bglfen1.itso.ibm.com<br />
llctl: Sent stop command to host bglfen2.itso.ibm.com<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # llstatus<br />
04/01 12:49:39 llstatus: 2539-463 Cannot connect to bglsn.itso.ibm.com<br />
"LoadL_negotiator" on port 9614. errno = 111<br />
04/01 12:49:39 llstatus: 2539-463 Cannot connect to bglsn.itso.ibm.com<br />
"LoadL_negotiator" on port 9614. errno = 111<br />
llstatus: 2512-301 An error occurred while receiving data from the LoadL_negotiator<br />
daemon on host bglsn.itso.ibm.com.<br />
Step 5<br />
We now update the ppcfloor symbolic link:<br />
bglsn:/ # rm /bgl/BlueLight/ppcfloor<br />
bglsn:/ # ln -s /bgl/BlueLight/V1R2M2_120_2006-060328/ppc<br />
/bgl/BlueLight/ppcfloor<br />
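The link swap can be rehearsed safely before touching the live system. In this sketch, BASE stands in for /bgl/BlueLight (a scratch directory here); the driver directory name is the one from our upgrade:

```shell
# Rehearsal in a scratch directory (BASE stands in for /bgl/BlueLight;
# the driver directory name is the one from our upgrade).
BASE=$(mktemp -d)
NEW_DRIVER=V1R2M2_120_2006-060328
mkdir -p "$BASE/$NEW_DRIVER/ppc"                 # stand-in for the new driver tree

rm -f "$BASE/ppcfloor"                           # remove the old link
ln -s "$BASE/$NEW_DRIVER/ppc" "$BASE/ppcfloor"   # repoint ppcfloor
readlink "$BASE/ppcfloor"                        # prints the new target
```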
Step 6<br />
Next, we check and update db2profile and db2cshrc files, as shown in<br />
Example 6-70.<br />
Example 6-70 Updating the DB2 environment files<br />
bglsn:/ # vi /bgl/BlueLight/ppcfloor/bglsys/bin/db2profile<br />
# Also in this file ..<br />
bglsn:/ # vi /bgl/BlueLight/ppcfloor/bglsys/bin/db2cshrc<br />
# Check the original password in the old driver db.properties<br />
database_name=bgdb0<br />
database_user=bglsysdb<br />
database_password=bglsysdb<br />
# database_password=db24bgls<br />
database_schema_name=bglsysdb<br />
system=BGL<br />
min_pool_connections=1<br />
# Web Console Configuration<br />
mmcs_db_server_ip=127.0.0.1<br />
mmcs_db_server_port=32031<br />
mmcs_max_reply_size=8029<br />
mmcs_max_history_size=2097152<br />
mmcs_redirect_server_ip=default<br />
mmcs_redirect_server_port=32032<br />
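The edits above were done with vi; the INSTHOME change can also be scripted. This sketch works against a scratch copy of the file; on the SN, PROFILE would be /bgl/BlueLight/ppcfloor/bglsys/bin/db2profile and HOME_DIR the result of `echo ~bgdb2cli`:

```shell
# Apply the INSTHOME change to a scratch copy of db2profile. On the SN,
# PROFILE is /bgl/BlueLight/ppcfloor/bglsys/bin/db2profile and HOME_DIR
# is the result of `echo ~bgdb2cli`.
PROFILE=$(mktemp)
printf 'INSTHOME=/u/bgdb2cli\nexport INSTHOME\n' > "$PROFILE"
HOME_DIR=/dbhome/bgdb2cli
sed -i "s|^INSTHOME=.*|INSTHOME=${HOME_DIR}|" "$PROFILE"
grep '^INSTHOME=' "$PROFILE"   # INSTHOME=/dbhome/bgdb2cli
```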
Now we update the DB2 user password in the db.properties file that is located in<br />
the /bgl/BlueLight/ppcfloor/bglsys/bin/ directory (use vi) and use this updated<br />
password to rebuild the jar file.<br />
Step 7<br />
We rebind the new jar file.<br />
bglsn:/ # java -cp /bgl/BlueLight/ppcfloor/bglsys/bin/ido.jar<br />
com.ibm.db2.jcc.DB2Binder -url jdbc:db2://127.0.0.1:50001/bgdb0<br />
-user bglsysdb -password bglsysdb -size 200<br />
Step 8<br />
We update the /discovery directory, copying back the configuration files
saved with the previous driver, as shown in Example 6-71.
Example 6-71 Updating the /discovery directory<br />
bglsn:/ # cp /bgl/BlueLight/V1R2M1_020_2006-060110/ppc/bglsys/discovery/runPopIpPool<br />
/discovery<br />
bglsn:/ # cp<br />
/bgl/BlueLight/V1R2M1_020_2006-060110/ppc/bglsys/discovery/ServiceNetwork.cfg<br />
/discovery<br />
Step 9<br />
We also update the database schema, as shown in Example 6-72.<br />
Example 6-72 Updating the Blue Gene/L database schema<br />
bglsn:/ # /bgl/BlueLight/ppcfloor/bglsys/schema/updateSchema.pl<br />
--dbproperties /bgl/BlueLight/ppcfloor/bglsys/bin/db.properties<br />
--driver V1R2M1<br />
BlueGene/L database schema update utility, version V1R2M1.<br />
database bgdb0<br />
schema bglsysdb<br />
username bglsysdb<br />
previous driver 020-6<br />
target driver 080-6<br />
verbose level: 0<br />
Chapter 6. Scenarios 331
Log will be written to<br />
/bgl/BlueLight/logs/BGL/updateSchema-2006-04-01-13:18:29.log.<br />
Connected to database bgdb0 as user bglsysdb<br />
Updating to driver 080-6 schema.<br />
Finished updating database schema to driver 080-6.<br />
We have now completed the upgrade process (which constitutes the error<br />
injection for this scenario).<br />
Problem determination
To determine the problem, we followed these steps:<br />
1. We start the control system as shown in Example 6-73.<br />
Example 6-73 Starting the bglmaster process<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster start<br />
Starting BGLMaster<br />
Logging to<br />
/bgl/BlueLight/logs/BGL/bglsn-bglmaster-2006-0401-13:25:44.log<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster status<br />
idoproxy started [26694]<br />
ciodb started [26695]<br />
mmcs_server started [26696]<br />
monitor stopped<br />
perfmon stopped<br />
2. We start LoadLeveler (see Example 6-74).<br />
Example 6-74 Starting LoadLeveler<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # llctl -g start<br />
llctl: Attempting to start LoadLeveler on host bglsn.itso.ibm.com<br />
LoadL_master 3.3.2.0 rmer0611a 2006/03/15 SLES 9.0 130<br />
CentralManager = bglsn.itso.ibm.com<br />
llctl: Attempting to start LoadLeveler on host bglfen1.itso.ibm.com<br />
LoadL_master 3.3.2.0 rmer0611a 2006/03/15 SLES 9.0 130<br />
CentralManager = bglsn.itso.ibm.com<br />
llctl: Attempting to start LoadLeveler on host bglfen2.itso.ibm.com<br />
LoadL_master 3.3.2.0 rmer0611a 2006/03/15 SLES 9.0 130<br />
CentralManager = bglsn.itso.ibm.com<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # llstatus<br />
Name Schedd InQ Act Startd Run LdAvg Idle Arch OpSys<br />
bglfen1.itso.ibm.com Avail 0 0 Idle 0 0.00 6608 PPC64 Linux2<br />
glfen2.itso.ibm.com Avail 0 0 Idle 0 0.00 9999 PPC64 Linux2<br />
PPC64/Linux2 2 machines 0 jobs 0 running<br />
Total Machines 2 machines 0 jobs 0 running<br />
The Central Manager is defined on bglsn.itso.ibm.com<br />
The BACKFILL scheduler with Blue Gene support is in use<br />
Blue Gene is present<br />
All machines on the machine_list are present.<br />
3. We check that the upgrade was successful (Example 6-75).<br />
Example 6-75 Checking the driver version<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./mmcs_db_console<br />
connecting to mmcs_server<br />
connected to mmcs_server<br />
connected to DB2<br />
mmcs$ version<br />
OK<br />
mmcs_db_server $Name: V1R2M2_120_2006 $ Mar 28 2006 17:54:10<br />
mmcs$<br />
4. We now try to boot a block:
OK<br />
mmcs_db_server $Name: V1R2M2_120_2006 $ Mar 28 2006 17:54:10<br />
mmcs$ allocate R000_128<br />
The allocate command apparently hung for 10 minutes because the sitefs file
still contains the entry that starts GPFS, and the
~ppcfloor/dist/etc/rc.d/init.d/gpfs script imposes a 10-minute timeout when
it cannot start GPFS.
Example 6-76 shows the output from an I/O node log before the 10-minute
GPFS load timeout expired.
Example 6-76 I/O node log messages showing GPFS cannot start<br />
Apr 01 13:29:00 (I) [1088451808] root:R000_128 {51}.0: Starting GPFS<br />
Apr 01 13:29:00 (I) [1088451808] root:R000_128 {51}.0:<br />
/var/etc/rc.d/rc3.d/S40gpfs: 2: /usr/lpp/mmfs/bin/mmautoload: not found<br />
Apr 01 13:29:01 (I) [1088451808] root:R000_128 {51}.0: Disabling<br />
protocol version 1. Could not load host key<br />
Apr 01 13:29:01 (I) [1088451808] root:R000_128 {51}.0:<br />
bglsn:/bgl/BlueLight/logs/BGL #<br />
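The 10-minute delay is characteristic of a bounded wait loop in the GPFS startup script: poll a condition, sleep, and give up at a deadline. The following is a minimal sketch of that pattern, not the actual S40gpfs code; wait_for_condition and its arguments are hypothetical.

```shell
# Sketch of a bounded startup wait loop: poll a condition, sleep between
# attempts, and give up after a deadline -- the pattern behind the
# 10-minute (600 second) GPFS startup timeout.
wait_for_condition() {
    check_cmd="$1"      # command that succeeds once the service is up
    timeout="$2"        # total seconds to wait (600 for a 10-minute limit)
    interval="${3:-5}"  # seconds between polls
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        if $check_cmd; then
            return 0    # condition met: the service came up in time
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    return 1            # deadline reached: report the failure and move on
}
```

When the check never succeeds, the loop consumes the whole timeout before boot continues, which matches the behavior observed in this scenario.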
However, we were able to ssh to the I/O node and run the df command (see<br />
Example 6-77).<br />
Example 6-77 Running the df command on an I/O node<br />
$ df<br />
Filesystem 1K-blocks Used Available Use% Mounted on<br />
rootfs 7931 1640 5882 22% /<br />
/dev/root 7931 1640 5882 22% /<br />
172.30.1.1:/bgl 9766544 4008736 5757808 42% /bgl<br />
172.30.1.33:/bglscratch 36700160 2411424 34288736 7% /bglscratch<br />
This check proves that the sshd daemon was started. Using the ps -ef
command on that I/O node during the 10-minute GPFS timeout, we confirmed
that the S40gpfs process was still running (with a sleep thread active).
After the 10 minutes, we see that the block is booted, but the /bubu file
system (GPFS) is not available (see Example 6-78).
Example 6-78 GPFS startup timeout has expired<br />
Apr 01 13:29:00 (I) [1088451808] root:R000_128 {17}.0: Starting SSH<br />
Apr 01 13:29:00 (I) [1088451808] root:R000_128 {17}.0: Starting GPFS<br />
Apr 01 13:29:00 (I) [1088451808] root:R000_128 {17}.0:<br />
/var/etc/rc.d/rc3.d/S40gpfs: 2: /usr/lpp/mmfs/bin/mmautoload: not found<br />
Apr 01 13:29:01 (I) [1088451808] root:R000_128 {17}.0: Disabling<br />
protocol version 1. Could not load host key<br />
Apr 01 13:29:01 (I) [1088451808] root:R000_128 {17}.0:<br />
Apr 01 13:39:01 (I) [1088451808] root:R000_128 {17}.0:<br />
R00-M0-N0-I:J18-U11 /var/etc/rc.d/rc3.d/S40gpfs: GPFS did not come up<br />
on I/O nod<br />
Apr 01 13:39:02 (I) [1088451808] root:R000_128 {17}.0:<br />
[ciod:initialized]<br />
Apr 01 13:39:02 (I) [1088451808] root:R000_128 {17}.0: e ionode2 :<br />
172.30.2.2<br />
Starting syslog services<br />
Starting ciod<br />
Starting XL Compiler Environment for I/O node<br />
BusyBox v1.00-rc2 (2006.03.24-21:52+0000) Built-in shell (ash)<br />
Enter 'help' for a list of built-in commands.<br />
/bin/sh: can't access tty; job control turned off<br />
$ ciod: version "Mar 28 2006 17:56:13"<br />
ciod: running in virtual node mode with 32 processors<br />
Apr 01 13:39:02 (I) [1088451808] root:R000_128 {17}.0:<br />
/var/etc/rc.d/rc3.d/S40gpfs: GPFS did not come up on I/O node ionode2 :<br />
172.30.2.2<br />
5. We switch the ppcfloor link back to the old driver to see whether we can
immediately recover the situation and begin running jobs again. First, we
stop the control system before making the change, as shown in Example 6-79.
Example 6-79 Stopping bglmaster to prepare reverting to previous driver<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster stop<br />
Stopping BGLMaster<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster status<br />
BGLMaster is not started<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin #<br />
6. We switch the symbolic link, as shown in Example 6-80.<br />
Example 6-80 Reverting to the previous driver version (ppcfloor link)<br />
bglsn:/ # #ln -s /bgl/BlueLight/V1R2M2_120_2006-060328/ppc /bgl/BlueLight/ppcfloor<br />
bglsn:/ # ln -s /bgl/BlueLight/V1R2M1_020_2006-060110/ppc /bgl/BlueLight/ppcfloor<br />
bglsn:/ # ls -als /bgl/BlueLight/ppcfloor<br />
0 lrwxrwxrwx 1 root root 41 Apr 1 12:53 /bgl/BlueLight/ppcfloor -><br />
/bgl/BlueLight/V1R2M2_120_2006-060328/ppc<br />
bglsn:/ # ln -fs /bgl/BlueLight/V1R2M1_020_2006-060110/ppc /bgl/BlueLight/ppcfloor<br />
bglsn:/ # ls -als /bgl/BlueLight/ppcfloor<br />
0 lrwxrwxrwx 1 root root 41 Apr 1 12:53 /bgl/BlueLight/ppcfloor -><br />
/bgl/BlueLight/V1R2M2_120_2006-060328/ppc<br />
bglsn:/ # rm /bgl/BlueLight/ppcfloor<br />
bglsn:/ # ln -s /bgl/BlueLight/V1R2M1_020_2006-060110/ppc /bgl/BlueLight/ppcfloor<br />
bglsn:/ # ls -als /bgl/BlueLight/ppcfloor<br />
0 lrwxrwxrwx 1 root root 41 Apr 1 14:31 /bgl/BlueLight/ppcfloor -><br />
/bgl/BlueLight/V1R2M1_020_2006-060110/ppc<br />
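The failed attempts in Example 6-80 are worth understanding: when the destination of ln is an existing symbolic link that points to a directory, ln (with or without -f) dereferences it and creates the new link inside that directory, leaving the original link untouched. Removing the link first, as the example finally does, avoids this; with GNU ln, the -n (no-dereference) option is another way. A small self-contained sketch, using throwaway paths rather than the real driver directories:

```shell
# Reproduce the symlink behavior seen in Example 6-80 with scratch paths.
workdir=$(mktemp -d)
mkdir "$workdir/old_driver" "$workdir/new_driver"
ln -s "$workdir/old_driver" "$workdir/floor"

# ln -fs dereferences floor (a link to a directory) and creates the new
# link *inside* old_driver; floor itself does not change:
ln -fs "$workdir/new_driver" "$workdir/floor"
readlink "$workdir/floor"        # still points to old_driver
ls "$workdir/old_driver"         # a stray new_driver link appeared here

# -n treats the link itself as the destination, so it is replaced:
ln -sfn "$workdir/new_driver" "$workdir/floor"
readlink "$workdir/floor"        # now points to new_driver

rm -rf "$workdir"
```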
7. We then restart the control system (Example 6-81).<br />
Example 6-81 Restarting bglmaster after reverting to previous driver<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster statut<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster status<br />
idoproxy started [28068]<br />
ciodb started [28069]<br />
mmcs_server started [28070]<br />
monitor stopped<br />
perfmon stopped<br />
8. We try to run a job. We start by allocating a block as in Example 6-82.<br />
Example 6-82 Allocating a block (mixed bgl driver versions)<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./mmcs_db_console<br />
connecting to mmcs_server<br />
connected to mmcs_server<br />
connected to DB2<br />
mmcs$ list_blocks<br />
OK<br />
mmcs$ allocate R000_128<br />
OK<br />
mmcs$ quit<br />
OK<br />
mmcs_db_console is terminating, please wait...<br />
mmcs_db_console: closing database connection<br />
mmcs_db_console: closed database connection<br />
mmcs_db_console: closing console port<br />
mmcs_db_console: closed console port<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ssh root@ionode1<br />
BusyBox v1.00-rc2 (2006.01.09-19:48+0000) Built-in shell (ash)<br />
Enter 'help' for a list of built-in commands.<br />
$ df<br />
Filesystem 1K-blocks Used Available Use% Mounted on<br />
rootfs 7931 1644 5878 22% /<br />
/dev/root 7931 1644 5878 22% /<br />
172.30.1.1:/bgl 9766544 4008760 5757784 42% /bgl<br />
172.30.1.33:/bglscratch 36700160 2411424 34288736 7% /bglscratch<br />
/dev/bubu_gpfs1 1138597888 3304448 1135293440 1% /bubu<br />
Here we can see that the GPFS file system is working again at the old Blue
Gene/L driver level. Now we try to run the job, as shown in Example 6-83.
Example 6-83 Running a job (mixed driver versions)<br />
test1@bglfen1:/bglscratch/test1> llsubmit ior-gpfs.cmd<br />
llsubmit: The job "bglfen1.itso.ibm.com.50" has been submitted.<br />
test1@bglfen1:/bglscratch/test1> llq<br />
Id Owner Submitted ST PRI Class Running On<br />
------------------------ ---------- ----------- -- --- ------------ -----------
bglfen1.50.0             test1      4/1 12:56   R  50  small        bglfen1
1 job step(s) in queue, 0 waiting, 0 pending, 1 running, 0 held, 0 preempted<br />
test1@bglfen1:/bglscratch/test1> ls -lrt<br />
total 1598796<br />
-rwxr-xr-x 1 test1 itso 7553497 2006-03-21 11:33 hello-file-1.rts<br />
-rwxr--r-- 1 test1 itso 1826286 2006-03-21 15:53 hello.rts<br />
-rwxr-xr-x 1 test1 itso 7553579 2006-03-22 11:05 hello-file-2.rts<br />
drwxr-xr-x 4 test1 itso 4096 2006-03-23 10:16 applications<br />
-rw-r--r-- 1 test1 itso 638 2006-03-23 11:08 hello.cmd<br />
-rw-r--r-- 1 test1 itso 639 2006-03-23 11:16 hello128.cmd<br />
-rw-r--r-- 1 test1 itso 635 2006-03-24 16:48 ior-gpfs.cmd<br />
-rw-r--r-- 1 test1 itso 2643 2006-03-24 16:53 ior-gpfs.out.ciod-hung-scenario<br />
-rw-r--r-- 1 test1 itso 9435 2006-03-24 17:20 ior-gpfs.err.ciod-hung-scenario<br />
-rw-r--r-- 1 test1 itso 637 2006-03-28 17:41 cds_ior-gpfs.cmd<br />
-rwxr-xr-x 1 test1 itso 7546841 2006-03-29 11:47 hello-file.rts<br />
-rw-r--r-- 1 test1 itso 3755 2006-03-29 11:49 core.0<br />
-rw-r--r-- 1 root root 0 2006-03-29 19:36 R000_128-233.stdout<br />
-rw-r--r-- 1 root root 0 2006-03-29 19:36 R000_128-233.stderr<br />
-rw-r--r-- 1 root root 0 2006-03-29 19:38 R000_128-235.stdout<br />
-rw-r--r-- 1 root root 0 2006-03-29 19:38 R000_128-235.stderr<br />
-rw-r--r-- 1 test1 itso 1536 2006-03-29 19:44 hello-gpfs.out<br />
-rw-r--r-- 1 test1 itso 10415 2006-03-29 19:44 hello-gpfs.err<br />
-rwxr-xr-x 1 test1 itso 7539442 2006-03-30 10:02 hello-world.rts<br />
-rw-r--r-- 1 test1 itso 1605021456 2006-03-30 19:05 hello.out<br />
-rw-r--r-- 1 test1 itso 9339 2006-03-30 19:06 hello.err<br />
-rw-r--r-- 1 test1 itso 0 2006-04-01 15:48 ior-gpfs.out<br />
-rw-r--r-- 1 test1 itso 4284 2006-04-01 15:48 ior-gpfs.err<br />
test1@bglfen1:/bglscratch/test1> tail -f ior-gpfs.out<br />
IOR-2.8.3: MPI Coordinated Test of Parallel I/O<br />
Run began: Sat Apr 1 14:41:26 2006<br />
Command line used: /bglscratch/test1/applications/IOR/IOR.rts -f<br />
/bglscratch/test1/applications/IOR/ior-inputs<br />
Machine: Blue Gene L ionode3<br />
Summary:<br />
api = MPIIO (version=2, subversion=0)<br />
test filename = /bubu/Examples/IOR/IOR-output-MPIIO-0<br />
access = single-shared-file<br />
clients = 128 (16 per node)<br />
repetitions = 5<br />
xfersize = 32768 bytes<br />
blocksize = 1 MiB<br />
aggregate filesize = 128 MiB<br />
delaying 1 seconds . . .<br />
access bw(MiB/s) block(KiB) xfer(KiB) open(s) wr/rd(s) close(s) iter<br />
------ --------- ---------- --------- -------- -------- -------- ----
write  6.96      1024.00    32.00     0.108306 17.77    1.71     0
delaying 1 seconds . . .<br />
read 76.59 1024.00 32.00 0.004671 1.67 0.001274 0<br />
delaying 1 seconds . . .<br />
write 6.61 1024.00 32.00 0.026399 19.02 1.35 1<br />
delaying 1 seconds . . .<br />
read 77.22 1024.00 32.00 0.004509 1.65 0.001283 1<br />
delaying 1 seconds . . .<br />
write 6.65 1024.00 32.00 0.024646 18.50 1.85 2<br />
delaying 1 seconds . . .<br />
read 76.83 1024.00 32.00 0.004666 1.66 0.001285 2<br />
delaying 1 seconds . . .<br />
This test proves that the job did indeed run as expected. Now, we switch the<br />
ppcfloor link back to the new driver, as shown in Example 6-84.<br />
Example 6-84 Switching driver versions<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster stop<br />
Stopping BGLMaster<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # cd /<br />
bglsn:/ # ls -als /bgl/BlueLight/ppcfloor<br />
0 lrwxrwxrwx 1 root root 41 Apr 1 14:31 /bgl/BlueLight/ppcfloor -><br />
/bgl/BlueLight/V1R2M1_020_2006-060110/ppc<br />
bglsn:/ # rm /bgl/BlueLight/ppcfloor<br />
bglsn:/ # ln -s /bgl/BlueLight/V1R2M2_120_2006-060328/ppc<br />
/bgl/BlueLight/ppcfloor<br />
bglsn:/ # ls -als /bgl/BlueLight/ppcfloor<br />
0 lrwxrwxrwx 1 root root 41 Apr 1 14:46 /bgl/BlueLight/ppcfloor -><br />
/bgl/BlueLight/V1R2M2_120_2006-060328/ppc<br />
bglsn:/ # ./bglmaster start<br />
-bash: ./bglmaster: No such file or directory<br />
bglsn:/ # cd -<br />
/bgl/BlueLight/ppcfloor/bglsys/bin<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster start<br />
Starting BGLMaster<br />
Logging to<br />
/bgl/BlueLight/logs/BGL/bglsn-bglmaster-2006-0401-14:47:00.log<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster status<br />
idoproxy started [28946]<br />
ciodb started [28947]<br />
mmcs_server started [28948]<br />
monitor stopped<br />
perfmon stopped<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin #<br />
9. Having switched back to the latest Blue Gene/L driver, we now install
the compatible GPFS code for the I/O nodes, to see whether the I/O nodes
can then access the GPFS file system (see Example 6-85).
Example 6-85 Installing or updating GPFS for I/O nodes (for new driver)<br />
bglsn:/mnt/LPP/GPFS/MCP_PTF11 # rpm --root \
/bgl/BlueLight/V1R2M2_120_2006-060328/ppc/bglsys/bin/bglOS --nodeps -ivh \
gpfs*.rpm
Preparing... ########################################### [100%]<br />
1:gpfs.msg.en_US ########################################### [ 20%]<br />
2:gpfs.base ########################################### [ 40%]<br />
3:gpfs.docs ########################################### [ 60%]<br />
4:gpfs.gpl ########################################### [ 80%]<br />
5:gpfs.gplbin ########################################### [100%]<br />
10.We check that the installed or upgraded GPFS for the I/O nodes works (see<br />
Example 6-86).<br />
Example 6-86 Checking the new GPFS for I/O node installation<br />
bglsn:/ # rpm --root<br />
/bgl/BlueLight/V1R2M2_120_2006-060328/ppc/bglsys/bin/bglOS -qa | grep<br />
gpfs<br />
gpfs.base-2.3.0-11<br />
gpfs.gpl-2.3.0-11<br />
gpfs.msg.en_US-2.3.0-11<br />
gpfs.docs-2.3.0-11<br />
gpfs.gplbin-2.3.0-11<br />
11.We test that we can boot a block and that the GPFS file system becomes<br />
available, as shown in Example 6-87.<br />
Example 6-87 Booting a block and checking GPFS availability<br />
mmcs$ allocate R000_128<br />
OK<br />
mmcs$ quit<br />
OK<br />
mmcs_db_console is terminating, please wait...<br />
mmcs_db_console: closing database connection<br />
mmcs_db_console: closed database connection<br />
mmcs_db_console: closing console port<br />
mmcs_db_console: closed console port<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ssh root@ionode5<br />
BusyBox v1.00-rc2 (2006.03.24-21:52+0000) Built-in shell (ash)<br />
Enter 'help' for a list of built-in commands.<br />
$ df<br />
Filesystem 1K-blocks Used Available Use% Mounted on<br />
rootfs 7931 1644 5878 22% /<br />
/dev/root 7931 1644 5878 22% /<br />
172.30.1.1:/bgl 9766544 4025264 5741280 42% /bgl<br />
172.30.1.33:/bglscratch 36700160 2411424 34288736 7% /bglscratch<br />
/dev/bubu_gpfs1 1138597888 3304448 1135293440 1% /bubu<br />
We can see that the GPFS file system (/bubu) is available again on the I/O
nodes simply by installing the new code.
12.We try to run a LoadLeveler job, but first we free up the block, as shown in<br />
Example 6-88.<br />
Example 6-88 Unallocating the block for preparing the LoadLeveler run<br />
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./mmcs_db_console<br />
connecting to mmcs_server<br />
connected to mmcs_server<br />
connected to DB2<br />
mmcs$ list_blocks<br />
OK<br />
R000_128 root(0) connected<br />
mmcs$ free R000_128<br />
OK<br />
test1@bglfen1:/bglscratch/test1> llq<br />
llq: There is currently no job status to report.<br />
test1@bglfen1:/bglscratch/test1> llstatus<br />
Name Schedd InQ Act Startd Run LdAvg Idle Arch OpSys<br />
bglfen1.itso.ibm.com Avail 0 0 Idle 0 0.00 1397 PPC64 Linux2<br />
bglfen2.itso.ibm.com Avail 0 0 Idle 0 0.00 9999 PPC64 Linux2<br />
PPC64/Linux2 2 machines 0 jobs 0 running<br />
Total Machines 2 machines 0 jobs 0 running<br />
The Central Manager is defined on bglsn.itso.ibm.com<br />
The BACKFILL scheduler with Blue Gene support is in use<br />
Blue Gene is present<br />
All machines on the machine_list are present.<br />
test1@bglfen1:/bglscratch/test1> llsubmit ior-gpfs.cmd<br />
llsubmit: The job "bglfen1.itso.ibm.com.51" has been submitted.<br />
test1@bglfen1:/bglscratch/test1> llq<br />
Id Owner Submitted ST PRI Class Running On<br />
------------------------ ---------- ----------- -- --- ------------ -----------
bglfen1.51.0             test1      4/1 13:26   R  50  small        bglfen2
1 job step(s) in queue, 0 waiting, 0 pending, 1 running, 0 held, 0 preempted<br />
test1@bglfen1:/bglscratch/test1> tail -f ior-gpfs.out<br />
IOR-2.8.3: MPI Coordinated Test of Parallel I/O<br />
Run began: Sat Apr 1 15:11:14 2006<br />
Command line used: /bglscratch/test1/applications/IOR/IOR.rts -f<br />
/bglscratch/test1/applications/IOR/ior-inputs<br />
Machine: Blue Gene L ionode3<br />
Summary:<br />
api = MPIIO (version=2, subversion=0)<br />
test filename = /bubu/Examples/IOR/IOR-output-MPIIO-0<br />
access = single-shared-file<br />
clients = 128 (16 per node)<br />
repetitions = 5<br />
xfersize = 32768 bytes<br />
blocksize = 1 MiB<br />
aggregate filesize = 128 MiB<br />
delaying 1 seconds . . .<br />
access bw(MiB/s) block(KiB) xfer(KiB) open(s) wr/rd(s) close(s) iter<br />
------ --------- ---------- --------- -------- -------- -------- ----
write  7.46      1024.00    32.00     0.100233 16.60    1.51     0
delaying 1 seconds . . .<br />
read 77.16 1024.00 32.00 0.004662 1.65 0.002156 0<br />
delaying 1 seconds . . .<br />
write 6.61 1024.00 32.00 0.034111 18.72 1.87 1<br />
delaying 1 seconds . . .<br />
read 76.49 1024.00 32.00 0.004613 1.67 0.002134 1<br />
delaying 1 seconds . . .<br />
write 6.25 1024.00 32.00 0.024525 20.08 1.67 2<br />
delaying 1 seconds . . .<br />
read 77.15 1024.00 32.00 0.004607 1.65 0.001303 2<br />
delaying 1 seconds . . .<br />
write 6.88 1024.00 32.00 0.023640 18.38 1.27 3<br />
delaying 1 seconds . . .<br />
read 76.10 1024.00 32.00 0.004649 1.68 0.001303 3<br />
delaying 1 seconds . . .<br />
This series of steps proves that a LoadLeveler-initiated job also ran
successfully.
Lessons learned<br />
We learned the following lessons:<br />
► When upgrading the Blue Gene/L driver version we must also re-install the<br />
GPFS code for the I/O nodes.<br />
► If we upgrade the Blue Gene/L driver version and forget to upgrade the GPFS<br />
code, or do not have it available for install, then Blue Gene/L operation can be<br />
restored by simply switching the ppcfloor symbolic link back to the original<br />
version.<br />
► Again, we found that if GPFS cannot start, booting a block is delayed by
the 10-minute GPFS startup timeout on the I/O nodes.
6.3.7 Duplicate IP addresses in /etc/hosts<br />
This scenario deals with errors in the /etc/hosts file on the SN (duplicate IP<br />
addresses).<br />
Error injection<br />
Here, the error is injected with no blocks allocated. We changed the
/etc/hosts file on the SN by altering the IP address listed for ionode2 so
that it duplicates the address of ionode1, as shown in Example 6-89.
Example 6-89 Duplicate IP address in /etc/hosts<br />
# BGL I/O nodes<br />
172.30.2.1 ionode1<br />
172.30.2.1 ionode2<br />
172.30.2.3 ionode3<br />
172.30.2.4 ionode4<br />
172.30.2.5 ionode5<br />
172.30.2.6 ionode6<br />
172.30.2.7 ionode7<br />
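A duplicate like this is easy to miss by eye in a long hosts file. An automated check can flag it; the following function is an illustrative sketch, not a Blue Gene/L tool:

```shell
# Print any IP address that appears on more than one non-comment line of
# a hosts file; prints nothing when the file is clean.
find_duplicate_ips() {
    awk '$1 !~ /^#/ && NF { print $1 }' "$1" | sort | uniq -d
}

# Run against a hosts file like the one above, this prints 172.30.2.1.
```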
Problem determination
When we ran the job this time using the mpirun command, it ran without any<br />
problems, as shown in Example 6-90.<br />
Example 6-90 Job running with mpirun (duplicate I/O node IP address)<br />
test1@bglfen1:/bglscratch/test1> mpirun -partition R000_128 -np 128 -exe<br />
/bglscratch/test1/applications/IOR/IOR.rts -args "-f<br />
/bglscratch/test1/applications/IOR/ior-inputs" -cwd<br />
/bglscratch/test1/applications/IOR<br />
+ mpirun -partition R000_128 -np 128 -exe /bglscratch/test1/applications/IOR/IOR.rts<br />
-args '-f /bglscratch/test1/applications/IOR/ior-inputs' -cwd<br />
/bglscratch/test1/applications/IOR<br />
IOR-2.8.3: MPI Coordinated Test of Parallel I/O<br />
Run began: Wed Apr 5 15:23:49 2006<br />
Command line used: /bglscratch/test1/applications/IOR/IOR.rts -f<br />
/bglscratch/test1/applications/IOR/ior-inputs<br />
Machine: Blue Gene L ionode3<br />
Summary:<br />
api = MPIIO (version=2, subversion=0)<br />
test filename = /bubu/Examples/IOR/IOR-output-MPIIO-0<br />
access = single-shared-file<br />
clients = 128 (16 per node)<br />
repetitions = 5<br />
xfersize = 32768 bytes<br />
blocksize = 1 MiB<br />
aggregate filesize = 128 MiB<br />
delaying 1 seconds . . .<br />
access bw(MiB/s) block(KiB) xfer(KiB) open(s) wr/rd(s) close(s) iter<br />
------ --------- ---------- --------- -------- -------- -------- ----
write  6.97      1024.00    32.00     0.122974 17.73    1.55     0
delaying 1 seconds . . .<br />
read 76.59 1024.00 32.00 0.004633 1.66 0.002440 0<br />
delaying 1 seconds . . .<br />
write 7.46 1024.00 32.00 0.026446 16.61 1.59 1<br />
delaying 1 seconds . . .<br />
read 76.72 1024.00 32.00 0.004517 1.66 0.001283 1<br />
delaying 1 seconds . . .<br />
write 6.14 1024.00 32.00 0.023408 20.57 1.39 2<br />
delaying 1 seconds . . .<br />
read 75.71 1024.00 32.00 0.004603 1.69 0.001307 2<br />
delaying 1 seconds . . .<br />
write 5.99 1024.00 32.00 0.025519 21.04 1.97 3<br />
delaying 1 seconds . . .<br />
read 75.01 1024.00 32.00 0.004542 1.70 0.001311 3<br />
delaying 1 seconds . . .<br />
write 6.90 1024.00 32.00 0.025010 18.18 1.71 4<br />
delaying 1 seconds . . .<br />
read 75.54 1024.00 32.00 0.004614 1.69 0.001288 4<br />
Max Write: 7.46 MiB/sec (7.83 MB/sec)<br />
Max Read: 76.72 MiB/sec (80.45 MB/sec)<br />
Run finished: Wed Apr 5 15:25:45 2006<br />
test1@bglfen1:/bglscratch/test1><br />
We now log in to an I/O node through ssh and check its /etc/hosts file:
$ cat /etc/hosts<br />
172.30.2.1 ionode1<br />
172.30.2.2 ionode2<br />
172.30.2.3 ionode3<br />
As we can see, the /etc/hosts file on the I/O nodes was not affected by the<br />
injected error.<br />
We now investigate possible effects on GPFS (see Example 6-91).<br />
Example 6-91 Investigating GPFS<br />
bglsn:/etc # mmgetstate -a<br />
Node number Node name GPFS state<br />
-----------------------------------------<br />
1 bglsn_fn active<br />
2 ionode1 active<br />
2 ionode1 active<br />
4 ionode3 active<br />
5 ionode4 active<br />
6 ionode5 active<br />
7 ionode6 active<br />
8 ionode7 active<br />
9 ionode8 active<br />
bglsn:/etc #<br />
On the SN, the output of the mmgetstate -a command reflects the duplicate
entry: ionode1 is listed twice and ionode2 does not appear, although all
listed nodes report an active state.
Lessons learned<br />
Duplicate IP addresses in the /etc/hosts file on the SN have little
immediate effect on running jobs. Even though we can expect GPFS to be
affected, the problem is revealed only when a GPFS cluster change occurs.
6.3.8 Missing I/O node in /etc/hosts<br />
This scenario removes an I/O node entry from the /etc/hosts file on the SN.
Error injection<br />
Example 6-92 shows the error injected with no blocks allocated.<br />
Example 6-92 Missing I/O node in /etc/hosts on the SN<br />
bglsn:/ # cat /etc/hosts<br />
.... > ....<br />
172.30.2.1 ionode1<br />
#172.30.2.2 ionode2<br />
172.30.2.3 ionode3<br />
172.30.2.4 ionode4<br />
172.30.2.5 ionode5<br />
172.30.2.6 ionode6<br />
.... > ....<br />
Problem determination
When we ran the job using the mpirun command, there were no problems, so we
further investigated the state of GPFS on the SN (see Example 6-93).
Example 6-93 The mmgetstate command reveals a missing node<br />
bglsn:/tmp # mmgetstate -a<br />
Node number Node name GPFS state<br />
-----------------------------------------<br />
1 bglsn_fn active<br />
2 ionode1 active<br />
4 ionode3 active<br />
5 ionode4 active<br />
6 ionode5 active<br />
7 ionode6 active<br />
8 ionode7 active<br />
9 ionode8 active<br />
mmgetstate: The following nodes could not be reached: ionode2<br />
Here, we see that ionode2 could not be reached, so we check the /etc/hosts
file entries to get the IP address of ionode2:
bglsn:/tmp # grep ionode2 /etc/hosts<br />
#172.30.2.2 ionode2<br />
We then try to log in to ionode2 using ssh (by its IP address) to check the
GPFS status, as shown in Example 6-94.
Example 6-94 GPFS status<br />
bglsn:/tmp # ssh root@172.30.2.2<br />
BusyBox v1.00-rc2 (2006.01.09-19:48+0000) Built-in shell (ash)<br />
Enter 'help' for a list of built-in commands.<br />
$ df<br />
Filesystem 1K-blocks Used Available Use% Mounted on<br />
rootfs 7931 1644 5878 22% /<br />
/dev/root 7931 1644 5878 22% /<br />
172.30.1.1:/bgl 9766544 4076160 5690384 42% /bgl<br />
172.30.1.33:/bglscratch 36700160 2411424 34288736 7% /bglscratch<br />
/dev/bubu_gpfs1 1138597888 3353600 1135244288 1% /bubu<br />
$ hostname<br />
ionode2<br />
Lessons learned<br />
We learned the following lessons:<br />
► The mmgetstate command relies on system name resolution (in this case,
the /etc/hosts file) to query the state of the remote nodes.
► Missing entries in the /etc/hosts file do not necessarily mean that GPFS
is unable to load on the I/O nodes. Because the GPFS configuration has not
been altered, the outcome actually depends on which node initiated the
GPFS communication.
Note: We emphasize again the importance of correct and identical name<br />
resolution for all nodes in all inter-connected GPFS clusters.<br />
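One way to act on this note is a preflight check that every expected node name has an uncommented entry in /etc/hosts before GPFS operations. The function below is an illustrative sketch; the node list and file path are supplied by the caller, not discovered automatically:

```shell
# Report which of the given node names lack an uncommented entry in the
# given hosts file. Returns nonzero if any are missing.
check_hosts_entries() {
    hostsfile="$1"; shift
    missing=""
    for node in "$@"; do
        # match the name as a whole word on a non-comment line
        grep -Eq "^[^#].*[[:space:]]$node([[:space:]]|\$)" "$hostsfile" \
            || missing="$missing $node"
    done
    if [ -n "$missing" ]; then
        echo "missing from $hostsfile:$missing"
        return 1
    fi
    return 0
}
```

Run against the SN copy of /etc/hosts with the full I/O node list, a check like this would have flagged the commented-out ionode2 entry before mmgetstate did.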
6.3.9 Adding an extra alias for the SN in /etc/hosts<br />
This scenario adds an extra alias for the SN in the /etc/hosts file on the SN.
Error injection<br />
Here the error is injected with no blocks allocated.<br />
We changed the /etc/hosts file by adding an SN alias:<br />
# Service Node interfaces<br />
10.0.0.1 bglsn_sn.itso.ibm.com bglsn_sn<br />
#172.30.1.1 bglsn_fn.itso.ibm.com bglsn_fn<br />
172.30.1.1 fluff bglsn_fn.itso.ibm.com bglsn_fn<br />
Problem determination
When we ran the job this time using the mpirun command, it ran without any
problems. So, while the job was running, we investigated the state of GPFS
from the SN (see Example 6-95).
Example 6-95 Checking GPFS node status<br />
bglsn:/tmp # mmgetstate -a<br />
Node number Node name GPFS state<br />
-----------------------------------------<br />
1 bglsn_fn active<br />
2 ionode1 active<br />
3 ionode2 active<br />
4 ionode3 active<br />
5 ionode4 active<br />
6 ionode5 active<br />
7 ionode6 active<br />
8 ionode7 active<br />
9 ionode8 active<br />
We ran the job, and it was successful. We then stopped GPFS, freed the block,<br />
and restarted GPFS, as shown in Example 6-96.<br />
Example 6-96 Restarting the bgIO GPFS cluster<br />
bglsn:/tmp # mmstartup -a<br />
Wed Apr 5 15:49:25 EDT 2006: mmstartup: Starting GPFS ...<br />
mmdsh: ionode7 rsh process had return code 1.<br />
mmdsh: ionode4 rsh process had return code 1.<br />
mmdsh: ionode1 rsh process had return code 1.<br />
mmdsh: ionode8 rsh process had return code 1.<br />
mmdsh: ionode6 rsh process had return code 1.<br />
mmdsh: ionode3 rsh process had return code 1.<br />
mmdsh: ionode5 rsh process had return code 1.<br />
mmdsh: ionode2 rsh process had return code 1.<br />
ionode7: ssh: connect to host ionode7 port 22: Connection refused<br />
ionode4: ssh: connect to host ionode4 port 22: Connection refused<br />
ionode1: ssh: connect to host ionode1 port 22: Connection refused<br />
ionode8: ssh: connect to host ionode8 port 22: Connection refused<br />
ionode3: ssh: connect to host ionode3 port 22: Connection refused<br />
ionode6: ssh: connect to host ionode6 port 22: Connection refused<br />
ionode5: ssh: connect to host ionode5 port 22: Connection refused<br />
ionode2: ssh: connect to host ionode2 port 22: Connection refused<br />
However, the errors from the mmstartup command were expected, because no
block was booted at the time:
bglsn:/tmp # mmgetstate -a<br />
Node number Node name GPFS state<br />
-----------------------------------------<br />
1 bglsn_fn active<br />
bglsn:/tmp #<br />
Lessons learned<br />
We learn that the mmgetstate command relies on entries in the /etc/hosts
file to query the state of the remote nodes. However, adding alias entries
to the /etc/hosts file does not prevent GPFS from loading on the I/O nodes
(name resolution still works correctly).
6.4 Job submission scenarios<br />
This section presents scenarios related to mpirun and IBM LoadLeveler. We
try to address the common problems observed by users of the system; in the
real world, there might be other problems that we do not address in this
chapter. Our basic goal is to introduce a common problem determination
methodology for the job running environment on Blue Gene/L.
Depending on the job submission method used (mpirun or LoadLeveler), we have<br />
divided this section into two subsections: one for running jobs using mpirun,<br />
and one for LoadLeveler. In each subsection we first make sure that we can<br />
submit a job that executes successfully. We then inject errors that result in a<br />
job failure, and perform problem determination following the methodology<br />
described in Chapter 2, “<strong>Problem</strong> determination methodology”<br />
on page 55.<br />
6.4.1 The mpirun command: scenarios description<br />
In this section, we discuss problems commonly encountered while submitting jobs<br />
to a Blue Gene/L system. Job submission has several mandatory prerequisites: the<br />
Blue Gene/L database (DB2) and the Control System processes must be up and<br />
running (at the higher end), appropriate partitions must be available (number of<br />
nodes, partition shape), and the file systems must be accessible (at the lower<br />
end).<br />
In our scenarios we do not address application errors (wrong libraries,<br />
execution arguments, bad programming, and so forth); rather, we focus on the job<br />
submission action itself and how the job interacts with the system. We tested<br />
the following scenarios, which concentrate on what we consider the basic errors<br />
observed while submitting jobs on the system:<br />
1. Environment variables not set before submitting jobs<br />
– $MMCS_SERVER_IP variable on FEN<br />
– $BRIDGE_CONFIG_FILE variable on SN<br />
– $DB_PROPERTY variable on SN<br />
– Database user profile not sourced (db2profile)<br />
2. Remote command execution not set correctly (rsh)<br />
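The environment variable scenarios in the list above can be pre-checked with a few lines of shell. This is only a sketch: the variable names come from the list, and which of them must be set depends on whether you are on the FEN (MMCS_SERVER_IP) or the SN (BRIDGE_CONFIG_FILE, DB_PROPERTY):

```shell
# Warn about any of the scenario's environment variables that are unset.
for var in MMCS_SERVER_IP BRIDGE_CONFIG_FILE DB_PROPERTY; do
  if [ -z "$(printenv "$var")" ]; then
    echo "WARNING: $var is not set"
  else
    echo "$var is set to $(printenv "$var")"
  fi
done
```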
Each scenario consists of the following topics:<br />
► Error injection<br />
► <strong>Problem</strong> determination<br />
► Lessons learned<br />
Chapter 6. Scenarios 349
6.4.2 The mpirun command: environment variables not set<br />
As previously mentioned, in this scenario we deal with unset environment<br />
variables.<br />
Scenario 1<br />
Environment variable MMCS_SERVER_IP is not set on the Front-End Node (FEN).<br />
Error injection<br />
We omit the line with MMCS_SERVER_IP in the user’s profile (~/.bashrc or ~/.cshrc):<br />
test1@bglfen1:~> env |grep MMCS_SERVER_IP<br />
test1@bglfen1:~><br />
As we can see, the environment variable is missing (not set).<br />
<strong>Problem</strong> determination<br />
By default, the mpirun command writes standard output (1) and standard error (2)<br />
to the console from which the job was submitted, unless specifically redirected<br />
in the application. In our scenario, the job uses the console, as shown in<br />
Example 6-97.<br />
Example 6-97 Job’s stderr(2) on the console (missing $MMCS_SERVER_IP)<br />
test1@bglfen1:/bglscratch/test1> mpirun -partition R000_J108_32 -exe<br />
/bglscratch/test1/hello.rts -cwd /bglscratch/test1<br />
FE_MPI (ERROR): BG/L Service Node not set:<br />
FE_MPI (ERROR): Please set 'MMCS_SERVER_IP'<br />
env. variable or<br />
FE_MPI (ERROR): specify the Service Node<br />
with '-host' argument.<br />
Usage:<br />
or<br />
mpirun [options]<br />
mpirun [options] binary [arg1 arg2 ...]<br />
Try "mpirun -h" for more details.<br />
We can see from the command output the reason for the failure: we omitted both<br />
the $MMCS_SERVER_IP variable and the -host argument. By referring to “The<br />
mpirun checklist” on page 166 and following the checks described there, we find<br />
the missing environment variable (MMCS_SERVER_IP).<br />
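The fix is either of the two options offered by the error message. As a sketch, using bglsn (the Service Node hostname from this setup) as an assumed value:

```shell
# Persistent fix: set the variable in the user's profile (~/.bashrc):
export MMCS_SERVER_IP=bglsn

# One-off alternative, per the error message: pass the Service Node
# explicitly on the command line instead, for example:
#   mpirun -host bglsn -partition R000_J108_32 -exe /bglscratch/test1/hello.rts
echo "MMCS_SERVER_IP=$MMCS_SERVER_IP"
```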
Scenario 2<br />
Environment variable BRIDGE_CONFIG_FILE is not set on the SN.<br />
Error injection<br />
We omit the line with BRIDGE_CONFIG_FILE in the user’s profile (~/.bashrc or ~/.cshrc).<br />
Example 6-98 shows the ~/.bashrc file for user test1.<br />
Note: Normally, users (other than the system administrator, root) are not<br />
allowed to log in to the SN.<br />
Example 6-98 Environment variable BRIDGE_CONFIG_FILE not set on SN<br />
test1@bglfen1:~> cat ~/.bashrc<br />
if [ $hstnm = "bglsn" ]<br />
then<br />
#export BRIDGE_CONFIG_FILE=/bglhome/test1/bridge_config_file.txt<br />
export DB_PROPERTY=/bgl/BlueLight/ppcfloor/bglsys/bin/db.properties<br />
source /bgl/BlueLight/ppcfloor/bglsys/bin/db2profile<br />
fi<br />
<strong>Problem</strong> determination<br />
By default, the mpirun command writes standard output (1) and standard error (2)<br />
to the console from which the job was submitted, unless specifically redirected<br />
in the application. In our scenario, the job uses the console, as shown in<br />
Example 6-99.<br />
Example 6-99 Jobs stderr(2) for missing BRIDGE_CONFIG_FILE on SN<br />
test1@bglfen1:/bglscratch/test1> mpirun -partition R000_128 -np 128<br />
-exe /bglscratch/test1/hello-world.rts -verbose 1<br />
FE_MPI (Info) : Initializing MPIRUN<br />
FE_MPI (Info) : Scheduler interface library<br />
loaded<br />
BE_MPI (ERROR): The environment parameter<br />
"BRIDGE_CONFIG_FILE" is not set, set it to point to the configuration<br />
file<br />
BE_MPI (Info) : Starting cleanup sequence<br />
BE_MPI (Info) : BG/L Job alredy terminated /<br />
hasn't been added<br />
BE_MPI (Info) : Partition wasn't allocated by<br />
the mpirun - No need to remove<br />
FE_MPI (Info) : Back-End invoked:<br />
FE_MPI (Info) : - Service Node: bglsn<br />
FE_MPI (Info) : - Back-End pid: 6979 (on<br />
service node)<br />
BE_MPI (Info) : == BE completed ==<br />
FE_MPI (ERROR): Failure list:<br />
FE_MPI (ERROR): - 1. Failed to get machine<br />
serial number (bridge configuration file not found?) (failure #15)<br />
FE_MPI (Info) : == FE completed ==<br />
FE_MPI (Info) : == Exit status: 1 ==<br />
We can see from the command output the reason for the failure. For clarity, we<br />
used the -verbose 1 option. By referring to “The mpirun checklist” on page 166<br />
and following the checks described there, we find the missing environment<br />
variable (BRIDGE_CONFIG_FILE not set).<br />
Scenario 3<br />
Environment variable DB_PROPERTY is not set on the SN.<br />
Error injection<br />
We omit the line with DB_PROPERTY in the user’s profile (~/.bashrc or ~/.cshrc).<br />
Example 6-100 shows the ~/.bashrc file for user test1.<br />
Example 6-100 Environment variable DB_PROPERTY not set on SN<br />
test1@bglsn:~> cat ~/.bashrc<br />
if [ $hstnm = "bglsn" ]<br />
then<br />
export BRIDGE_CONFIG_FILE=/bglhome/test1/bridge_config_file.txt<br />
#export DB_PROPERTY=/bgl/BlueLight/ppcfloor/bglsys/bin/db.properties<br />
source /bgl/BlueLight/ppcfloor/bglsys/bin/db2profile<br />
fi<br />
<strong>Problem</strong> determination<br />
By default, the mpirun command writes standard output (1) and standard error (2)<br />
to the console from which the job was submitted, unless specifically redirected<br />
in the application. In our scenario, the job uses the console, as shown in<br />
Example 6-101.<br />
Example 6-101 Job’s stderr(2) for missing DB_PROPERTY variable on SN<br />
test1@bglfen1:/bglscratch/test1> mpirun -partition R000_J108_32 -exe<br />
/bglscratch/test1/hello.rts -cwd /bglscratch/test1 -verbose 1<br />
FE_MPI (Info) : Initializing MPIRUN<br />
FE_MPI (Info) : Scheduler interface library<br />
loaded<br />
BE_MPI (ERROR): db.properties file not found<br />
BE_MPI (ERROR): If it is not in the CWD,<br />
please set DB_PROPERTY env. var. to point to the file's location.<br />
BE_MPI (Info) : Starting cleanup sequence<br />
BE_MPI (Info) : BG/L Job alredy terminated /<br />
hasn't been added<br />
BE_MPI (Info) : Partition wasn't allocated by<br />
the mpirun - No need to remove<br />
FE_MPI (Info) : Back-End invoked:<br />
FE_MPI (Info) : - Service Node: bglsn<br />
FE_MPI (Info) : - Back-End pid: 13613 (on<br />
service node)<br />
BE_MPI (Info) : == BE completed ==<br />
FE_MPI (ERROR): Failure list:<br />
FE_MPI (ERROR): - 1. Failed to locate<br />
db.properties file (failure #14)<br />
FE_MPI (Info) : == FE completed ==<br />
FE_MPI (Info) : == Exit status: 1 ==<br />
We can see from the command output the reason for the failure. For clarity, we<br />
used the -verbose 1 option. By referring to “The mpirun checklist” on page 166<br />
and following the checks described there, we find the missing environment<br />
variable (DB_PROPERTY not set).<br />
Scenario 4<br />
Database user profile not sourced (db2profile) on the SN.<br />
Error injection<br />
We omit sourcing the db2profile in the user’s profile (~/.bashrc or ~/.cshrc).<br />
Example 6-102 shows the ~/.bashrc file for user test1.<br />
Example 6-102 db2profile not sourced on the SN<br />
test1@bglsn:~> cat ~/.bashrc<br />
if [ $hstnm = "bglsn" ]<br />
then<br />
export BRIDGE_CONFIG_FILE=/bglhome/test1/bridge_config_file.txt<br />
export DB_PROPERTY=/bgl/BlueLight/ppcfloor/bglsys/bin/db.properties<br />
# source /bgl/BlueLight/ppcfloor/bglsys/bin/db2profile<br />
fi<br />
<strong>Problem</strong> determination<br />
By default, the mpirun command writes standard output (1) and standard error (2)<br />
to the console from which the job was submitted, unless specifically redirected<br />
in the application. In our scenario, the job uses the console, as shown in<br />
Example 6-103.<br />
Example 6-103 db2profile not set on the SN<br />
test1@bglfen1:/bglscratch/test1/applications/IOR> mpirun -partition<br />
R000_128 -np 128 -exe /bglscratch/test1/applications/IOR/IOR.rts -args<br />
"-f /bglscratch/test1/applications/IOR/ior-inputs" -cwd<br />
/bglscratch/test1/applications/IOR<br />
-CLI INVALID HANDLE----cliRC<br />
= -2<br />
line = 148<br />
file = DBConnection.cc<br />
-CLI INVALID HANDLE----cliRC<br />
= -2<br />
line = 167<br />
file = DBConnection.cc<br />
Values used for connection: database is {bgdb0}, user is {bglsysdb},<br />
schema is {bglsysdb}<br />
-CLI INVALID HANDLE----cliRC<br />
= -2<br />
line = 167<br />
file = DBConnection.cc<br />
Values used for connection: database is {bgdb0}, user is {bglsysdb},<br />
schema is {bglsysdb}<br />
-CLI INVALID HANDLE----cliRC<br />
= -2<br />
line = 167<br />
file = DBConnection.cc<br />
Values used for connection: database is {bgdb0}, user is {bglsysdb},<br />
schema is {bglsysdb}<br />
-CLI INVALID HANDLE----cliRC<br />
= -2<br />
line = 167<br />
file = DBConnection.cc<br />
Values used for connection: database is {bgdb0}, user is {bglsysdb},<br />
schema is {bglsysdb}<br />
-CLI INVALID HANDLE----cliRC<br />
= -2<br />
line = 167<br />
file = DBConnection.cc<br />
Values used for connection: database is {bgdb0}, user is {bglsysdb},<br />
schema is {bglsysdb}<br />
Unable to connect, aborting...<br />
FE_MPI (ERROR): blk_receive_incoming_message()<br />
- !<br />
FE_MPI (ERROR): blk_receive_incoming_message()<br />
- ! Receiver thread exited<br />
FE_MPI (ERROR): blk_receive_incoming_message()<br />
- ! Error code = 11<br />
FE_MPI (ERROR): blk_receive_incoming_message()<br />
- !<br />
FE_MPI (ERROR): blk_receive_incoming_message()<br />
- Switching to cleanup sequence...<br />
FE_MPI (ERROR): Failure list:<br />
FE_MPI (ERROR): - 1. One or more threads<br />
died (failure #57)<br />
We can see from the command output the reason for the failure. By referring to<br />
“The mpirun checklist” on page 166 and following the checks described there, we<br />
find that the db2profile was not sourced on the SN.<br />
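A more defensive variant of the profile fragment shown in Example 6-102 avoids this silent failure mode by warning when db2profile cannot be sourced. This is a sketch: the helper name is ours, and the path is the one used in the examples:

```shell
# Source a profile only if it is readable; warn loudly otherwise instead
# of failing silently at job submission time.
source_if_present() {
  if [ -r "$1" ]; then
    . "$1"
  else
    echo "WARNING: $1 not found or not readable" >&2
    return 1
  fi
}

source_if_present /bgl/BlueLight/ppcfloor/bglsys/bin/db2profile \
  || echo "DB2 environment not loaded; DB-dependent jobs will fail"
```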
6.4.3 The mpirun command: incorrect remote command<br />
execution (rsh) setup<br />
In this scenario, we simulate failed rsh authentication between a FEN and the<br />
SN. This prevents the mpirun front-end (running on the FEN) from talking to the<br />
mpirun back-end (running on the SN).<br />
Error injection<br />
The error has been injected by omitting (in this case, commented with “#”) the<br />
line corresponding to user test1@bglfen1 in ~/.rhosts file of the user test1@bglsn<br />
(see Example 6-104).<br />
Example 6-104 User’s .rhosts file<br />
test1@bglfen1:~> rsh bglsn ls<br />
Permission denied.<br />
test1@bglsn:~> cat ~/.rhosts<br />
# bglfen1 test1<br />
bglfen2 test1<br />
bglsn test1<br />
<strong>Problem</strong> determination<br />
We submit a job and check the console (by default, the terminal from where the<br />
job was submitted). We see the stderr(2) messages shown in Example 6-105.<br />
Example 6-105 remote command execution fails (FEN to SN)<br />
test1@bglfen1:~> mpirun -np 32 -partition R000_J106_32 -exe<br />
/bglscratch/test1/hello-world.rts -cwd /bglscratch/test1 -verbose 1<br />
FE_MPI (Info) : Initializing MPIRUN<br />
FE_MPI (Info) : Scheduler interface library<br />
loaded<br />
Permission denied.<br />
FE_MPI (ERROR): waitForBackendConnections() -<br />
child process died unexpectedly<br />
FE_MPI (ERROR): Failed to get control and data<br />
connections from service node<br />
FE_MPI (ERROR): Failure list:<br />
FE_MPI (ERROR): - 1. Failed to execute<br />
Back-End mpirun on service node (failure #13)<br />
FE_MPI (Info) : == FE completed ==<br />
FE_MPI (Info) : == Exit status: 1 ==<br />
The error shown in Example 6-105 is observed on the console when a job is<br />
submitted using mpirun. The error messages (shown in bold) are Permission<br />
denied and Failed to execute Back-End mpirun on the Service Node.<br />
Because we got the error from the mpirun command, and the stderr(2) output gives<br />
little information about the reason for the failure (Permission denied), we<br />
check according to “The mpirun checklist” on page 166. Going through the steps<br />
one by one, we see that the environment variables are set correctly on the FEN<br />
and SN. We drill down and use the -only_test_protocol flag for mpirun, which<br />
tests the remote shell environment across the FENs and the SN, as shown in<br />
Example 6-106.<br />
Example 6-106 The mpirun argument of only_test_protocol<br />
test1@bglfen1:~> mpirun -np 32 -partition R000_J106_32 -exe<br />
/bglscratch/test1/hello-world.rts -cwd /bglscratch/test1 -verbose 1<br />
-only_test_protocol<br />
FE_MPI (Info) : Initializing MPIRUN<br />
FE_MPI (Info) : Scheduler interface library<br />
loaded<br />
FE_MPI (WARN) :<br />
======================================<br />
FE_MPI (WARN) : = Front-End - Only checking<br />
protocol =<br />
FE_MPI (WARN) : = No actual usage of the BG/L<br />
Bridge =<br />
FE_MPI (WARN) :<br />
======================================<br />
Permission denied.<br />
FE_MPI (ERROR): waitForBackendConnections() -<br />
child process died unexpectedly<br />
FE_MPI (ERROR): Failed to get control and data<br />
connections from service node<br />
FE_MPI (ERROR): Failure list:<br />
FE_MPI (ERROR): - 1. Failed to execute<br />
Back-End mpirun on service node (failure #13)<br />
FE_MPI (Info) : == FE completed ==<br />
FE_MPI (Info) : == Exit status: 1 ==<br />
In Example 6-106, the highlighted messages indicate the same set of errors that<br />
we observed during our initial job run. This points to a problem with remote<br />
command execution, so we proceed to the checklist for the user’s remote shell<br />
environment (rsh in this case). Because mpirun uses rsh by default, we execute<br />
a simple date command from the FEN to the SN:<br />
test1@bglfen1:~> rsh bglsn date<br />
Permission denied.<br />
This clearly states Permission denied, so we check the ~/.rhosts file of the<br />
test1 user and detect the missing line for test1@bglfen1. We correct the file<br />
and resubmit the job successfully.<br />
Lessons learned<br />
We learned the following lessons:<br />
► Environment variables on the FEN and SN must be set appropriately before<br />
submitting jobs on the Blue Gene/L system.<br />
► The user’s shell environment needs to be set up appropriately in order to<br />
successfully submit a job on the system.<br />
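The remote execution check can be scripted across all hosts involved. The following sketch uses hostnames from this scenario (bglsn, bglfen2); any FAILED line means the corresponding ~/.rhosts entry (or its ssh equivalent) needs attention before mpirun will work:

```shell
# Test passwordless remote execution to each host; a fast, side-effect-free
# command (true) stands in for the date command used above.
for host in bglsn bglfen2; do
  if rsh "$host" true 2>/dev/null; then
    echo "$host: remote execution OK"
  else
    echo "$host: remote execution FAILED"
  fi
done
```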
6.4.4 LoadLeveler: scenarios description<br />
The scenarios in this section deal with problems that you can encounter when<br />
running IBM LoadLeveler in a Blue Gene/L environment. We assume that you<br />
have basic knowledge of IBM LoadLeveler. See the IBM LoadLeveler Using and<br />
Administering <strong>Guide</strong>, SA22-7881, for reference.<br />
To better understand LoadLeveler problems on Blue Gene/L, a basic knowledge<br />
of Blue Gene/L system administration is also required. The following sections<br />
present the scenarios that we developed and tested for this book.<br />
6.4.5 LoadLeveler: job failed<br />
In this section, we analyze some possible reasons for a LoadLeveler job to fail.<br />
This scenario does not contain manual error injection, because we encountered<br />
these issues during our testing.<br />
<strong>Problem</strong> description<br />
There are many reasons a job can fail. In this scenario, we do not look into<br />
application-specific failures. Based on the discussion in 4.4.3, “How<br />
LoadLeveler plugs into Blue Gene/L” on page 172, an MPI job submitted to the<br />
LoadLeveler queue might fail if one of the following happens:<br />
► LoadLeveler cannot talk to mpirun processes<br />
► mpirun processes cannot talk to LoadLeveler<br />
► Some files or file systems are not accessible<br />
► <strong>Problem</strong>s with Blue Gene/L partitions<br />
Detailed checking<br />
If one or more of the problems hypothesized above occurs, there might be error<br />
messages recorded in the job stderr(2) file. Therefore, the first place to<br />
start is the job output.<br />
To determine where the job saves those files, the job command file is one place<br />
to look. The file used in this scenario is listed in Example 4-16 on page 178.<br />
It contains the following two lines:<br />
#@ output = /bgl/loadl/out/hello.$(schedd_host).$(jobid).$(stepid).out<br />
#@ error = /bgl/loadl/out/hello.$(schedd_host).$(jobid).$(stepid).err<br />
LoadLeveler replaces variables such as $(schedd_host), $(jobid), and $(stepid)<br />
with their values; combined, the three variables make up the job identifier.<br />
The two file names are:<br />
/bgl/loadl/out/hello.bglfen2.2.0.out<br />
/bgl/loadl/out/hello.bglfen2.2.0.err<br />
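The substitution itself is plain string composition. The following sketch reproduces it with the values from this scenario (an illustration only; LoadLeveler performs this internally):

```shell
# Values LoadLeveler would substitute for this job step:
schedd_host=bglfen2
jobid=2
stepid=0
echo "/bgl/loadl/out/hello.${schedd_host}.${jobid}.${stepid}.out"
echo "/bgl/loadl/out/hello.${schedd_host}.${jobid}.${stepid}.err"
# → /bgl/loadl/out/hello.bglfen2.2.0.out
# → /bgl/loadl/out/hello.bglfen2.2.0.err
```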
Another way to determine where the output files are stored is to use the llq -l<br />
command, which displays a long list of job attributes, as shown in<br />
Example 6-107. The output of this command is relevant only if the job is still<br />
in the Running (R) state.<br />
Example 6-107 Finding job stdout and stderr(2) from llq -l command<br />
loadl@bglfen1:/bgl/loadlcfg> llq<br />
Id                       Owner      Submitted   ST PRI Class        Running On<br />
------------------------ ---------- ----------- -- --- ------------ ----------<br />
bglfen2.2.0              test1      3/30 14:17  R  50  small        bglfen1<br />
1 job step(s) in queue, 0 waiting, 0 pending, 1 running, 0 held, 0<br />
preempted<br />
loadl@bglfen1:/bgl/loadlcfg> llq -l bglfen2.2.0<br />
...<br />
Out: /bglscratch/test1/ior-gpfs.out<br />
Err: /bglscratch/test1/ior-gpfs.err<br />
...<br />
Investigating the stderr(2) and stdout(1) files reveals that the stdout(1) file<br />
is empty, which means the job either did not start or did not advance far in<br />
execution before it failed. However, there is information in the stderr(2)<br />
file. The contents of the file are listed in Example 6-108.<br />
Example 6-108 Job stderr(2) contents<br />
FE_MPI (Info) : Initializing MPIRUN<br />
FE_MPI (Debug): Started, pid=2983,<br />
exe=/bgl/BlueLight/V1R2M1_020_2006-060110/ppc/bglsys/bin/mpirun,<br />
cwd=/bgl/loadlcfg<br />
FE_MPI (Debug): Collecting arguments ...<br />
FE_MPI (Debug): Checking for arguments in the<br />
environment<br />
FE_MPI (Debug): Parsing command line arguments<br />
FE_MPI (Debug): Checking usage...<br />
FE_MPI (Info) : Scheduler interface library<br />
loaded<br />
FE_MPI (Debug): Collecting arguments from<br />
external source (scheduler)<br />
FE_MPI (Debug): Calling external source to<br />
fill parameters...<br />
FE_MPI (Debug): Setting failure #12<br />
FE_MPI (Debug): Closing messaging queues<br />
FE_MPI (Debug): FE threads are down with the<br />
following return codes:<br />
FE_MPI (Debug): FE Sender : N/A (not<br />
initialized)<br />
FE_MPI (Debug): FE Receiver : N/A (not<br />
initialized)<br />
FE_MPI (Debug): FE Input : N/A (not<br />
initialized)<br />
FE_MPI (Debug): FE Output : N/A (not<br />
initialized)<br />
FE_MPI (Debug): Child shell process never<br />
started.<br />
FE_MPI (ERROR): Failure list:<br />
FE_MPI (ERROR): - 1. Front-End<br />
initialization failed (failure #12)<br />
FE_MPI (Info) : == FE completed ==<br />
FE_MPI (Info) : == Exit status: 1 ==<br />
In Example 6-108 the two ERROR messages are:<br />
FE_MPI (ERROR): Failure list:<br />
FE_MPI (ERROR): - 1. Front-End<br />
initialization failed (failure #12)<br />
These messages are from mpirun (FE_MPI). They indicate a Front-End<br />
initialization failure (failure #12).<br />
From the following line in the job command file, we can see that the job was<br />
run with the mpirun option flag -verbose 2:<br />
#@ arguments = -verbose 2 -exe /bgl/hello/hello.rts<br />
The next thing to try is to increase the verbosity to a higher level, such as<br />
-verbose 4, to get more detail on the ERROR messages. In our case, the<br />
-verbose 4 option produced the messages shown in Example 6-108.<br />
From here we get another clue: the option flag value set on the #@ arguments<br />
line in the job command file does not seem to be passed to mpirun, which<br />
suggests that mpirun cannot talk to LoadLeveler.<br />
Drilling down the LoadLeveler checklist, we scan through the following items:<br />
► LoadLeveler run queue shows normal operation.<br />
► Simple job submission: This is a simple "Hello world!" job that fails.<br />
Therefore, the simple job submission check is not relevant.<br />
► Job command file: The file used for this job has been checked and altered as<br />
previously mentioned (-verbose 4) with little impact.<br />
► LoadLeveler processes/daemons seem to be working as expected (llstatus<br />
and llq commands).<br />
► LoadLeveler logs: Because this is a job failure, the log files to check are<br />
StartLog and StarterLog. The StartLog file contains the messages shown in<br />
Example 6-109.<br />
Example 6-109 StartLog file contents<br />
03/23 10:23:24 TI-57 JOB_STATUS: Job bglsn.itso.ibm.com.21.0 Status =<br />
READY<br />
03/23 10:23:24 TI-57 QUEUED_STATUS: Step bglsn.itso.ibm.com.21.0<br />
queueing status to schedd at bglsn.itso.ibm.com<br />
03/23 10:23:24 TI-59 JOB_STATUS: Job bglsn.itso.ibm.com.21.0 Status =<br />
RUNNING<br />
03/23 10:23:24 TI-59 QUEUED_STATUS: Step bglsn.itso.ibm.com.21.0<br />
queueing status to schedd at bglsn.itso.ibm.com<br />
03/23 10:23:24 TI-61 Notification of user tasks termination received<br />
from Starter for job step bglsn.itso.ibm.com.21.0<br />
03/23 10:23:24 TI-62 JOB_STATUS: Job bglsn.itso.ibm.com.21.0 Status =<br />
COMPLETED<br />
03/23 10:23:24 TI-62 QUEUED_STATUS: Step bglsn.itso.ibm.com.21.0<br />
queueing status to schedd at bglsn.itso.ibm.com<br />
03/23 10:23:24 TI-62 Cleanup_dir: Job<br />
/root/execute/bglsn.itso.ibm.com.21.0 Removing directory.<br />
The log messages indicate that the job completed; there is no failure<br />
information. Looking into the StarterLog file, however, reveals some error<br />
messages, even though it also indicates that the job completed, as shown in<br />
Example 6-110.<br />
Example 6-110 StarterLog file contents<br />
03/23 10:23:24 TI-0 bglsn.21.0 llcheckpriv program exited, termsig = 0,<br />
coredump = 0, retcode = 0<br />
03/23 10:23:24 TI-0 bglsn.itso.ibm.com.21.0 Sending READY status to<br />
Startd<br />
03/23 10:23:24 TI-0 bglsn.21.0 Main task program started (pid=11862<br />
process count=1).<br />
03/23 10:23:24 TI-0 bglsn.itso.ibm.com.21.0 Sending RUNNING status to<br />
Startd<br />
03/23 10:23:24 TI-0 LoadLeveler: 2539-453 Illegal protocol (121),<br />
received from another process on this machine - bglsn.itso.ibm.com.<br />
This daemon "LoadL_starter" is running protocol version (130).<br />
03/23 10:23:24 TI-0 bglsn.21.0 Task exited, termsig = 0, coredump = 0,<br />
retcode = 1<br />
03/23 10:23:24 TI-0 bglsn.21.0 User environment epilog not run, no<br />
program was specified.<br />
03/23 10:23:24 TI-0 bglsn.21.0 Epilog not run, no program was<br />
specified.<br />
03/23 10:23:24 TI-0 bglsn.itso.ibm.com.21.0 Sending COMPLETED status to<br />
Startd<br />
03/23 10:23:24 TI-0 bglsn.21.0 ********** STARTER exiting<br />
***************<br />
The ERROR message Illegal protocol (121)... indicates that some libraries are<br />
mismatched. Because all other checks have been performed, the last step is to<br />
make sure that the environment variables and library links are set up correctly<br />
on the SN. Example 6-111 shows how we checked the variables.<br />
Example 6-111 Checking Blue Gene/L environment variables<br />
loadl@bglsn:~> echo $BRIDGE_CONFIG_FILE<br />
/bgl/BlueLight/ppcfloor/bglsys/bin/bridge.config<br />
loadl@bglsn:~> echo $DB_PROPERTY<br />
/bgl/BlueLight/ppcfloor/bglsys/bin/db.properties<br />
loadl@bglsn:~> echo $MMCS_SERVER_IP<br />
bglsn.itso.ibm.com<br />
Checking the library links is more subtle, because this problem can occur if a<br />
link was removed (easy to detect) or a library binary was replaced (not<br />
obvious, requiring further investigation). The following error message from the<br />
LoadLeveler StarterLog does not explicitly point to a specific library:<br />
03/23 10:23:24 TI-0 LoadLeveler: 2539-453 Illegal protocol (121),<br />
received from another process on this machine - bglsn.itso.ibm.com.<br />
This daemon "LoadL_starter" is running protocol version (130).<br />
This message means that you need to check all LoadLeveler-related libraries on<br />
both the SN and the FENs. The script shown in Example 4-14 on page 175 can be<br />
used for this check.<br />
In our scenario, we replaced two libraries (binaries) on all nodes, and the<br />
problem was resolved.<br />
Lessons learned<br />
We learned the following lessons:<br />
► A simple "Hello world!" job can serve as a LoadLeveler checking tool. If this<br />
job runs successfully, it is a good indication that mpirun and basic job<br />
submission are OK. If it fails, investigating the job stderr(2) and stdout(1)<br />
files is a good place to start. See 4.3, “Submitting jobs using built-in<br />
tools” on page 149.<br />
► When the error messages from the job are not explicit, there are usually<br />
other ways to get more debugging information. Because the job goes through<br />
different components, such as the LoadLeveler daemons and the mpirun<br />
processes, their respective log files should be investigated to find traces<br />
(and errors) of the job. The various logs reflect the state of the job at<br />
different points in its life cycle, so understanding the job life cycle at<br />
various levels is key to finding errors in the job submission process. See<br />
4.3.3, “Example of submitting a job using mpirun” on page 163.<br />
► Checking the libraries used by LoadLeveler is an easily overlooked step when<br />
diagnosing a job failure. However, when such errors occur, we found that<br />
mismatched libraries are often the cause.<br />
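The log inspection described in this scenario can be partially automated. The following sketch greps for the message codes seen above; the log file locations are assumptions, so check the log keywords in your LoadL_config for the actual paths:

```shell
# Scan a LoadLeveler log for the error signatures seen in this scenario
# (message codes such as 2539-453, "Illegal protocol", generic ERROR lines).
scan_ll_log() {
  if [ -r "$1" ]; then
    grep -n -E '2539-[0-9]+|Illegal protocol|ERROR' "$1" || echo "$1: no matches"
  else
    echo "$1: not readable, skipping"
  fi
}

scan_ll_log /bgl/loadl/log/StartLog
scan_ll_log /bgl/loadl/log/StarterLog
```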
6.4.6 LoadLeveler: job in hold state<br />
In this section, we investigate some possible causes of a job being held in the<br />
queue for a long time. We do not perform any error injection, because we<br />
encountered these issues while working with LoadLeveler.<br />
<strong>Problem</strong> description<br />
A job submitted to LoadLeveler runs in batch mode. Depending on the resources<br />
available, the job might not be able to run immediately after submission, so it<br />
is important to check the status of the job in the LoadLeveler queue. In this<br />
scenario, the job is seen in the Hold state. There are several hypotheses,<br />
including the following:<br />
► A dynamic problem occurs on the node that the Scheduler is not aware of at<br />
the time it dispatches the job.<br />
► When the Starter starts the job, it encounters a problem and the job cannot<br />
be started.<br />
► The job has been held by its owner or an administrator for some reason.<br />
Detailed checking<br />
The job’s owner might notice that the job has not run for a while after it was<br />
submitted, and that it has produced no output. The LoadLeveler administrator<br />
could also spot the job in Hold state in the queue. See Example 6-112.<br />
Example 6-112 Job in Hold state<br />
test1@bglfen1:/bgl/loadl> llq<br />
Id                       Owner      Submitted   ST PRI Class        Running On<br />
------------------------ ---------- ----------- -- --- ------------ ----------<br />
bglfen1.23.0             test1      3/26 14:06  H  50  small<br />
1 job step(s) in queue, 0 waiting, 0 pending, 0 running, 1 held, 0<br />
preempted<br />
A quick check on LoadLeveler shows the cluster working normally. See<br />
Example 6-113.<br />
Example 6-113 Normal LoadLeveler cluster status<br />
test1@bglfen1:/bgl/loadl> llstatus<br />
Name                 Schedd InQ Act Startd Run LdAvg Idle Arch  OpSys<br />
bglfen1.itso.ibm.com Avail  1   0   Idle   0   0.02  109  PPC64 Linux2<br />
bglfen2.itso.ibm.com Avail  0   0   Idle   0   0.07  9999 PPC64 Linux2<br />
PPC64/Linux2 2 machines 1 jobs 0 running<br />
Total Machines 2 machines 1 jobs 0 running<br />
The Central Manager is defined on bglsn.itso.ibm.com<br />
The BACKFILL scheduler with Blue Gene support is in use<br />
Blue Gene is present<br />
All machines on the machine_list are present.<br />
When we verify the run queue (llq command), we can see that the queue is OK. If<br />
the job is in Hold state, it might be because a user or administrator placed it<br />
on hold. Example 6-114 shows the queue status.<br />
Example 6-114 Job in Hold state by user<br />
loadl@bglfen1:/bgl/loadl> llq -l bglfen1.3.0 | more<br />
=============== Job Step bglfen1.itso.ibm.com.3.0 ===============<br />
Job Step Id: bglfen1.itso.ibm.com.3.0<br />
Job Name: bglfen1.itso.ibm.com.3<br />
Step Name: 0<br />
Structure Version: 10<br />
Owner: loadl<br />
Queue Date: Thu 06 Apr 2006 10:01:18 AM EDT<br />
Status: User Hold<br />
Reservation ID:<br />
Requested Res. ID:<br />
Scheduling Cluster:<br />
....<br />
Next, we check in the LoadL_config file for the following keywords<br />
(Example 6-115):<br />
► MAX_JOB_REJECT<br />
► ACTION_ON_MAX_REJECT<br />
Example 6-115 MAX_JOB_REJECT and ACTION_ON_MAX_REJECT<br />
# The MAX_JOB_REJECT value determines how many times a job can be<br />
# rejected before it is canceled or put on hold. The default value<br />
# is 0, which indicates a rejected job will immediately be canceled<br />
# or placed on hold. MAX_JOB_REJECT may be set to unlimited rejects<br />
# by specifying a value of -1.<br />
#<br />
Chapter 6. Scenarios 365
MAX_JOB_REJECT = 0<br />
#<br />
# When ACTION_ON_MAX_REJECT is HOLD, jobs will be put on user hold<br />
# when the number of rejects reaches the MAX_JOB_REJECT value. When<br />
# ACTION_ON_MAX_REJECT is CANCEL, jobs will be canceled when the<br />
# number of rejects reaches the MAX_JOB_REJECT value. The default<br />
# value is HOLD.<br />
#<br />
ACTION_ON_MAX_REJECT = HOLD<br />
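These reject-handling keywords can be checked quickly from the command line. A minimal sketch, assuming a typical LoadL_config location (the path is site-specific and is only an assumption here):<br />

```shell
# Path is an assumption; point LOADL_CONFIG at your site's configuration file.
LOADL_CONFIG=${LOADL_CONFIG:-/home/loadl/LoadL_config}

# Print the active (uncommented) reject-handling settings, if any.
grep -E '^[[:space:]]*(MAX_JOB_REJECT|ACTION_ON_MAX_REJECT)[[:space:]]*=' \
    "$LOADL_CONFIG" 2>/dev/null \
  || echo "neither keyword is set in $LOADL_CONFIG (defaults apply)"
```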
If the job is already in Hold state, the llq -s command cannot analyze<br />
its status. To verify this, we released the job using the llhold -r command<br />
(putting it back to Idle state), then quickly checked with llq -s bglfen1.23.0,<br />
as shown in Example 6-116.<br />
Example 6-116 Releasing held job to check for reasons<br />
test1@bglfen1:/bgl/loadl> llq<br />
Id                       Owner      Submitted   ST PRI Class        Running On<br />
------------------------ ---------- ----------- -- --- ------------ ----------<br />
bglfen1.23.0             test1      3/26 14:06  I  50  small        (alloc)<br />
1 job step(s) in queue, 1 waiting, 0 pending, 0 running, 0 held, 0 preempted<br />
test1@bglfen1:/bgl/loadl>llq -s bglfen1.23.0<br />
...<br />
...<br />
==================== EVALUATIONS FOR JOB STEP bglfen1.itso.ibm.com.23.0<br />
====================<br />
Step waiting on following partitions to become FREE:<br />
RMP26Mr151607127<br />
Step waiting on following partitions to become FREE:<br />
RMP26Mr151607127<br />
test1@bglfen1:/bgl/loadl> llq<br />
Id                       Owner      Submitted   ST PRI Class        Running On<br />
------------------------ ---------- ----------- -- --- ------------ ----------<br />
bglfen1.23.0             test1      3/26 14:06  H  50  small<br />
1 job step(s) in queue, 0 waiting, 0 pending, 0 running, 1 held, 0 preempted<br />
From Example 6-116, we can see that the Scheduler allocates the node resource<br />
to the job. The status shown by the llq -s command is normal.<br />
The next check is to look into the LoadLeveler StartLog and StarterLog files to<br />
see if there is any information about jobid bglfen1.23.0, as shown in<br />
Example 6-117.<br />
Example 6-117 Job REJECTED according to StartLog<br />
03/26 14:28:59 TI-1667 JOB_START: Received start order from<br />
bglfen1.itso.ibm.com.<br />
03/26 14:28:59 TI-1667 inside StartdStep constructor for 269709664<br />
03/26 14:28:59 TI-1667 shutdown_active_count increasing from 0 to 1<br />
03/26 14:28:59 TI-1667 JOB_START: Step bglfen1.itso.ibm.com.23.0:<br />
Starting. Starter process id = 19333<br />
03/26 14:28:59 TI-4 Starter Table:<br />
StarterPid = 19333, ClientMachine = bglfen1.itso.ibm.com, user = test1,<br />
State = 2048, Flags = 8196, sid = bglfen1.itso.ibm.com.23.0, StateTimer<br />
= 1143401339<br />
03/26 14:28:59 TI-1671 Notification of user tasks termination received<br />
from Starter for job step bglfen1.itso.ibm.com.23.0<br />
03/26 14:28:59 TI-1672 JOB_STATUS: Job bglfen1.itso.ibm.com.23.0 Status<br />
= REJECTED<br />
03/26 14:28:59 TI-1672 QUEUED_STATUS: Step bglfen1.itso.ibm.com.23.0<br />
queueing status to schedd at bglfen1.itso.ibm.com<br />
03/26 14:28:59 TI-1672 Cleanup_dir: Job<br />
/home/loadl/execute/bglfen1.itso.ibm.com.23.0 Removing directory.<br />
03/26 14:28:59 TI-1672 ruid(0) euid(5001)<br />
03/26 14:28:59 TI-1672 shutdown_active_count decreasing from 1 to 0<br />
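Searching the daemon logs for a job ID can be scripted. A sketch, assuming a common log directory (both the directory and the exact log file names vary with your LoadL_config settings):<br />

```shell
# Job ID and log directory are assumptions; adjust for your site.
JOBID=bglfen1.itso.ibm.com.23.0
LOGDIR=${LOGDIR:-/home/loadl/log}

# List every StartLog/StarterLog file that mentions the job ID.
grep -l "$JOBID" "$LOGDIR"/StartLog* "$LOGDIR"/StarterLog* 2>/dev/null \
  || echo "no reference to $JOBID found under $LOGDIR"
```

Run the same check on each node of the cluster, because the job might have been started anywhere.<br />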
From the StartLog, we can see that the job is rejected but there is no further<br />
information provided. The next log to check is StarterLog, shown in<br />
Example 6-118.<br />
Example 6-118 Error found in StarterLog: permission denied<br />
03/26 14:07:09 TI-0 ********** STARTER starting up<br />
***********<br />
03/26 14:07:09 TI-0 LoadLeveler: LoadL_starter started, pid = 19333<br />
03/26 14:07:09 TI-0 Sending starter pid 19333.<br />
03/26 14:28:59 TI-0 bglfen1.23.0 Prolog not run, no program was<br />
specified.<br />
03/26 14:28:59 TI-0 bglfen1.23.0 run_dir =<br />
/home/loadl/execute/bglfen1.itso.ibm.com.23.0<br />
03/26 14:28:59 TI-0 bglfen1.23.0 Sending request for executable to<br />
Schedd<br />
03/26 14:28:59 TI-0 bglfen1.23.0 User environment<br />
prolog not run, no program was specified.<br />
03/26 14:28:59 TI-0 LoadLeveler: 2539-475 Cannot receive command from<br />
client bglfen1.itso.ibm.com, errno =2.<br />
03/26 14:28:59 TI-0 bglfen1.23.0 llcheckpriv program exited, termsig =<br />
0, coredump = 0, retcode = -2<br />
03/26 14:28:59 TI-0 bglfen1.23.0 LoadL_starter: Cannot open stdout<br />
file. /bgl/loadl/out/hello.bglfen1.23.0.out: Permission denied (13)<br />
03/26 14:28:59 TI-0 bglfen1.23.0 User environment epilog not run, no<br />
program was specified.<br />
03/26 14:28:59 TI-0 bglfen1.23.0 cleanupStdErr: cannot stat<br />
/bgl/loadl/out/hello.bglfen1.23.0.err. rc=-1 errno=2 [No such file or<br />
directory]<br />
03/26 14:28:59 TI-0 bglfen1.23.0 Epilog not run, no program was<br />
specified.<br />
03/26 14:28:59 TI-0 bglfen1.itso.ibm.com.23.0 Sending REJECTED status<br />
to Startd<br />
03/26 14:28:59 TI-0 bglfen1.23.0 ********** STARTER exiting<br />
***************<br />
In the StarterLog, we can see that a “Permission denied” error is associated<br />
with the stdout(1) file: /bgl/loadl/out/hello.bglfen1.23.0.out.<br />
Note: There might not be enough information to pinpoint the node on which the<br />
job is started. You might have to check the StartLog and StarterLog files on<br />
multiple nodes in the LoadLeveler cluster.<br />
In our case, it turned out that the directory into which user ID test1’s job was<br />
trying to write its stdout(1) and stderr(2) belonged to user ID loadl. We<br />
corrected the permissions, which solved the problem.<br />
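A permission problem like this can be confirmed before resubmitting the job. A sketch, run as the job owner; the directory is the one named by the job's output and error keywords:<br />

```shell
OUTDIR=/bgl/loadl/out    # directory from the job command file

# Show ownership and mode, then test writability for the current user.
ls -ld "$OUTDIR" 2>/dev/null || echo "$OUTDIR not found"
if [ -w "$OUTDIR" ]; then
  echo "$OUTDIR is writable by $(id -un)"
else
  echo "$OUTDIR is NOT writable by $(id -un)"
fi
```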
Lessons learned<br />
We learned the following lessons:<br />
► A job can be put into Hold state for many reasons. The job ID is essential for<br />
checking its status.<br />
► The job needs to be released in order for llq -s to assess the<br />
reasons (and get additional clues).<br />
► LoadLeveler NegotiatorLog might not have any information about this job. In<br />
this case, the job was rejected by a Starter on one of the (LoadLeveler)<br />
nodes.<br />
► The most difficult part is to look for the job ID in StartLog or StarterLog on<br />
every node.<br />
6.4.7 LoadLeveler: job disappears<br />
In this section, we investigate some possible causes of a job disappearing from<br />
the queue.<br />
Error injection<br />
We pass a wrong argument to the mpirun command in the LoadLeveler job<br />
command file hello.cmd. The wrong argument is -verbose 5 (not supported by<br />
mpirun). This causes the job to fail quickly and disappear from the<br />
LoadLeveler queue.<br />
<strong>Problem</strong> description<br />
After the job is submitted, if it is a small job and there are no other jobs in the<br />
queue, it might run quickly and disappear from the queue. Sometimes, the job<br />
owner does not get a chance to see the job in Running state in the LoadLeveler<br />
queue. We could identify the following reasons:<br />
► The job might have run quickly (finished successfully) and nothing is wrong.<br />
► The job runs and fails quickly.<br />
► The job can be canceled by a user or an administrator.<br />
Detailed checking<br />
We start by checking the job stderr(2) and stdout(1) files, following the job<br />
command checklist in 4.4.9, “LoadLeveler checklist” on page 186. The job<br />
command file that we used for this job is hello.cmd. It includes the following lines:<br />
#@ output =<br />
/bgl/loadl/out/hello.$(schedd_host).$(jobid).$(stepid).out<br />
#@ error =<br />
/bgl/loadl/out/hello.$(schedd_host).$(jobid).$(stepid).err<br />
The two files (stdout(1) and stderr(2)) are located in /bgl/loadl/out/ directory, as<br />
shown in Example 6-119.<br />
Example 6-119 Checking job output files<br />
test1@bglfen1:/bgl/loadl/out> ls -ltr<br />
-rw-r--r-- 1 loadl loadl 0 Mar 26 16:15 hello.bglfen1.24.0.out<br />
-rw-r--r-- 1 loadl loadl 193 Mar 26 16:15 hello.bglfen1.24.0.err<br />
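A quick scripted check of the two files saves a manual inspection. A sketch using the file names shown above:<br />

```shell
OUT=/bgl/loadl/out/hello.bglfen1.24.0.out
ERR=${OUT%.out}.err

# If the job produced no stdout, the stderr file usually explains why.
if [ ! -s "$OUT" ]; then
  echo "stdout is empty or missing; stderr follows:"
  cat "$ERR" 2>/dev/null || echo "(no stderr file either)"
fi
```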
Notice that the stdout(1) file is empty, thus we check the stderr(2) contents (listed<br />
in Example 6-120).<br />
Example 6-120 Job error in stderr(2) file<br />
FE_MPI (ERROR): Incorrect verbose option: '5'<br />
Usage:<br />
   mpirun [options]<br />
or<br />
   mpirun [options] binary [arg1 arg2 ...]<br />
Try "mpirun -h" for more details.<br />
The error message comes from the mpirun front end and indicates a usage<br />
error when mpirun is invoked. This leads back to the job command file<br />
“hello.cmd”. The keyword to check for is #@ arguments:<br />
#@ arguments = -verbose 5 -exe /bglscratch/test1/hello.rts<br />
That concludes this scenario, because the -verbose option is set to 5, which is<br />
an invalid value.<br />
If the job is canceled by another user or by the system (LoadLeveler)<br />
administrator, the stderr(2) and stdout(1) files show different types of<br />
error messages. In that case, the next step is to check the StartLog and<br />
StarterLog files on the node that runs the job.<br />
Lessons learned<br />
We learned the following lessons:<br />
► When a job runs and fails quickly, you might not see it in the queue.<br />
► It is essential to set the output and error keywords in the job command file.<br />
► Check the job stdout(1) and stderr(2) for errors first.<br />
6.4.8 LoadLeveler: Blue Gene/L is absent<br />
In this section, we analyze what happens if LoadLeveler cannot talk to the<br />
Blue Gene/L database. Because LoadLeveler uses the bridge API to interrogate<br />
and submit requests to the Blue Gene/L Control system, the libraries needed to<br />
perform these tasks are critical for LoadLeveler.<br />
Error injection<br />
We intentionally remove the symbolic link to a Blue Gene/L provided library that<br />
LoadLeveler uses.<br />
<strong>Problem</strong> description<br />
If the llstatus command displays the message Blue Gene is absent,<br />
LoadLeveler is not able to communicate with the Blue Gene/L control server<br />
through the bridge API. On the Blue Gene/L side, there are several scenarios<br />
that might render the bridge API unavailable. Those are described in the core<br />
system scenarios, in 6.2, “Blue Gene/L core system scenarios” on page 267. In<br />
this scenario, we focus on the libraries that LoadLeveler uses to communicate<br />
with the bridge API. Although this rarely happens, the following can cause<br />
problems with libraries:<br />
► A library binary file can be removed or corrupted.<br />
► A symbolic link pointing to the library can be broken.<br />
► A LoadLeveler or Blue Gene/L upgrade can be performed on some of the<br />
nodes, but not on all.<br />
► An upgrade of libraries might also alter the symbolic links to point to the<br />
wrong binaries.<br />
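The library problems listed above can be detected with a short script. A sketch for the bridge library used later in this scenario; the path is the one LoadLeveler reports, and should be adjusted as needed:<br />

```shell
LIB=/usr/lib64/libbglbridge.so

if [ -L "$LIB" ] && [ ! -e "$LIB" ]; then
  # The symlink exists but its target is gone.
  echo "broken symlink: $LIB -> $(readlink "$LIB")"
elif [ -e "$LIB" ]; then
  # ldd flags any transitive dependency that cannot be resolved.
  ldd "$LIB" 2>/dev/null | grep 'not found' \
    || echo "all dependencies of $LIB resolved"
else
  echo "$LIB does not exist"
fi
```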
Detailed checking<br />
Before injecting the error, we perform some basic checks to make sure<br />
LoadLeveler is operational and that a job can be submitted and run successfully<br />
(see Example 6-121).<br />
Example 6-121 LoadLeveler basic checks<br />
loadl@bglfen1:~> llstatus<br />
Name                 Schedd InQ Act Startd Run LdAvg Idle Arch  OpSys<br />
bglfen1.itso.ibm.com Avail  0   0   Idle   0   0.00  27   PPC64 Linux2<br />
bglfen2.itso.ibm.com Avail  0   0   Idle   0   0.00  9999 PPC64 Linux2<br />
PPC64/Linux2         2 machines 0 jobs 0 running<br />
Total Machines 2 machines 0 jobs 0 running<br />
The Central Manager is defined on bglsn.itso.ibm.com<br />
The BACKFILL scheduler with Blue Gene support is in use<br />
Blue Gene is present<br />
All machines on the machine_list are present.<br />
loadl@bglfen1:~> llctl -g stop<br />
llctl: Sent stop command to host bglsn.itso.ibm.com<br />
llctl: Sent stop command to host bglfen1.itso.ibm.com<br />
llctl: Sent stop command to host bglfen2.itso.ibm.com<br />
loadl@bglfen1:~><br />
loadl@bglfen1:~> llctl -g start<br />
llctl: Attempting to start LoadLeveler on host bglsn.itso.ibm.com<br />
LoadL_master 3.3.2.0 rmer0611a 2006/03/15 SLES 9.0 130<br />
CentralManager = bglsn.itso.ibm.com<br />
llctl: Attempting to start LoadLeveler on host bglfen1.itso.ibm.com<br />
LoadL_master 3.3.2.0 rmer0611a 2006/03/15 SLES 9.0 130<br />
CentralManager = bglsn.itso.ibm.com<br />
llctl: Attempting to start LoadLeveler on host bglfen2.itso.ibm.com<br />
LoadL_master 3.3.2.0 rmer0611a 2006/03/15 SLES 9.0 130<br />
CentralManager = bglsn.itso.ibm.com<br />
loadl@bglfen1:~><br />
The link to a library binary is then removed. LoadLeveler is started and no errors<br />
are returned. See Example 6-122.<br />
Example 6-122 Starting LoadLeveler<br />
loadl@bglfen1:~> llctl -g start<br />
llctl: Attempting to start LoadLeveler on host bglsn.itso.ibm.com<br />
LoadL_master 3.3.2.0 rmer0611a 2006/03/15 SLES 9.0 130<br />
CentralManager = bglsn.itso.ibm.com<br />
llctl: Attempting to start LoadLeveler on host bglfen1.itso.ibm.com<br />
LoadL_master 3.3.2.0 rmer0611a 2006/03/15 SLES 9.0 130<br />
CentralManager = bglsn.itso.ibm.com<br />
llctl: Attempting to start LoadLeveler on host bglfen2.itso.ibm.com<br />
LoadL_master 3.3.2.0 rmer0611a 2006/03/15 SLES 9.0 130<br />
CentralManager = bglsn.itso.ibm.com<br />
A user or the administrator then checks the status of LoadLeveler. The llstatus<br />
command shows Blue Gene is absent, as shown in Example 6-123.<br />
Example 6-123 llstatus message “Blue Gene is absent”<br />
loadl@bglfen1:~> llstatus<br />
Name                 Schedd InQ Act Startd Run LdAvg Idle Arch  OpSys<br />
bglfen1.itso.ibm.com Avail  0   0   Idle   0   0.07  2    PPC64 Linux2<br />
bglfen2.itso.ibm.com Avail  0   0   Idle   0   0.04  9999 PPC64 Linux2<br />
PPC64/Linux2         2 machines 0 jobs 0 running<br />
Total Machines 2 machines 0 jobs 0 running<br />
The Central Manager is defined on bglsn.itso.ibm.com<br />
The BACKFILL scheduler with Blue Gene support is in use<br />
Blue Gene is absent<br />
All machines on the machine_list are present.<br />
Any jobs submitted at this time will stay in Idle state, as shown in<br />
Example 6-124.<br />
Example 6-124 Job in Idle state<br />
loadl@bglfen1:/bgl/loadl> llq<br />
Id                       Owner      Submitted   ST PRI Class        Running On<br />
------------------------ ---------- ----------- -- --- ------------ ----------<br />
bglfen1.19.0             loadl      3/24 13:43  I  50  small<br />
1 job step(s) in queue, 1 waiting, 0 pending, 0 running, 0 held, 0 preempted<br />
We first check the LoadLeveler NegotiatorLog, which shows an error opening a<br />
library, as in Example 6-125.<br />
Example 6-125 The loadBridgeLibrary error in NegotiatorLog<br />
03/28 18:04:48 TI-1 Machine Number index is 0, adapter list size is 0<br />
03/28 18:04:48 TI-1 Machine Number index is 1, adapter list size is 0<br />
03/28 18:04:48 TI-1 Machine Number index is 2, adapter list size is 0<br />
03/28 18:04:48 TI-1 BG: int BgManager::loadBridgeLibrary() - start<br />
03/28 18:04:48 TI-1 int BgManager::loadBridgeLibrary(): Failed to open<br />
library, /usr/lib64/libbglbridge.so, errno=25 (libtableapi.so.1: cannot<br />
open shared object file: No such file or directory)<br />
03/28 18:04:48 TI-1 int BgManager::initializeBg(BgMachine*): Failed to<br />
load Bridge API library<br />
03/28 18:04:48 TI-1 *************************************************<br />
03/28 18:04:48 TI-1 *** LOADL_NEGOTIATOR STARTING UP ***<br />
03/28 18:04:48 TI-1 *************************************************<br />
03/28 18:04:48 TI-1<br />
03/28 18:04:48 TI-1 LoadLeveler: LoadL_negotiator started, pid = 19594<br />
03/28 18:04:48 TI-1 void<br />
LlMachine::queueStreamMaster(OutboundTransAction*): Set destination to<br />
master. Transaction route flag is now central manager sending<br />
transaction CMnotifyCmd to Master<br />
03/28 18:04:48 TI-6 LoadLeveler: Listening on port 9614 service<br />
LoadL_negotiator<br />
The message points to the library /usr/lib64/libbglbridge.so, which could not be<br />
loaded because the symbolic link to the shared object file libtableapi.so.1 (the<br />
link we removed earlier) was missing.<br />
Note: Should this error occur in real life, we would check both the link and the<br />
binary file.<br />
In this scenario, after the link is restored, LoadLeveler detects this, and the next<br />
llstatus command shows the message Blue Gene is present. Jobs run<br />
normally.<br />
Lessons learned<br />
We learned the following lessons:<br />
► Because LoadLeveler creates multiple links to a binary shared library, we had<br />
to remove all of the links to create the problem.<br />
► Without access to the Blue Gene/L bridge API, the jobs in the LoadLeveler<br />
queue stay in Idle state; they cannot run.<br />
► It is essential to check whether Blue Gene/L is present from LoadLeveler’s<br />
perspective.<br />
6.4.9 LoadLeveler: LoadLeveler cannot start<br />
In this section we investigate possible causes for LoadLeveler daemons not<br />
starting. These daemons include the Central Manager (Master) and the<br />
Negotiator.<br />
<strong>Problem</strong> description<br />
When LoadLeveler cannot be started or its commands return errors, it usually<br />
means that either the LoadLeveler Master or the Negotiator daemon cannot start.<br />
This scenario focuses on Negotiator problems, which can include the following:<br />
► LoadLeveler configuration files cannot be accessed<br />
► The Master daemon cannot start the Negotiator daemon on the Central<br />
Manager node<br />
► The Negotiator daemon does not start<br />
► LoadLeveler cannot access the necessary libraries<br />
A similar procedure can be followed to check problems with other daemons as<br />
well. However, the libraries, error messages, and log files will be different.<br />
Detailed checking<br />
The common signs for the Negotiator problems are that the LoadLeveler<br />
commands return error messages similar to the ones shown in Example 6-126.<br />
Example 6-126 LoadLeveler Negotiator error messages<br />
loadl@bglsn:~/cmd> llq<br />
03/21 09:35:52 llq: 2539-463 Cannot connect to bglsn.itso.ibm.com<br />
"LoadL_negotiator" on port 9614. errno = 111<br />
03/21 09:35:52 llq: 2539-463 Cannot connect to bglsn.itso.ibm.com<br />
"LoadL_negotiator" on port 9614. errno = 111<br />
llq: 2512-301 An error occurred while receiving data from the<br />
LoadL_negotiator daemon on host bglsn.itso.ibm.com.<br />
The error messages shown in Example 6-126 are the starting point for this<br />
scenario. We have observed that the LoadLeveler commands do not display<br />
normal information or status. Therefore, the checks in “LoadLeveler cluster and<br />
node status” on page 186 cannot be performed.<br />
This is not a problem related to jobs, thus the check related to “LoadLeveler run<br />
queue” on page 193 and “Job command file” on page 198 can be skipped for<br />
now.<br />
We have to determine which node is the Central Manager node, and which one<br />
has the LoadL_negotiator daemon running. As we have seen, when Blue Gene/L<br />
is involved, this is always the SN. In this case, we can perform the checks in<br />
“LoadLeveler configuration keywords” on page 205 which pinpoint the Central<br />
Manager node.<br />
When on the SN, we focus on the checks in “LoadLeveler processes, logs, and<br />
persistent storage” on page 202. The ps command shows if the Negotiator<br />
daemon is running. Example 6-127 shows that the LoadL_negotiator process is<br />
running. Even so, the process could start and crash repeatedly. Comparing the<br />
Negotiator’s process ID between two runs of the ps command can reveal this: if<br />
the process ID differs, the daemon has been restarted.<br />
Example 6-127 Looking for the LoadL_negotiator process<br />
bglsn:~ # ps -ef | grep LoadL<br />
loadl 20189 1 0 15:40 ? 00:00:00<br />
/opt/ibmll/LoadL/full/bin/LoadL_master<br />
loadl 20199 20189 0 15:40 ? 00:00:05 LoadL_negotiator -f -c<br />
/tmp -C /tmp<br />
root 24894 24792 0 18:02 pts/22 00:00:00 grep LoadL<br />
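The PID comparison can be scripted instead of running ps twice by hand. A sketch; pgrep is assumed to be available on the node:<br />

```shell
# Sample the Negotiator PID twice, a couple of seconds apart.
pid1=$(pgrep -f LoadL_negotiator || true)
sleep 2
pid2=$(pgrep -f LoadL_negotiator || true)

if [ -z "$pid1" ]; then
  echo "Negotiator is not running"
elif [ "$pid1" = "$pid2" ]; then
  echo "Negotiator PID $pid1 is stable"
else
  echo "Negotiator restarted: PID $pid1 -> $pid2"
fi
```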
If the Negotiator process starts and crashes, a LoadLeveler command displays<br />
the same error messages as in Example 6-126. If this happens quickly, the<br />
Negotiator might not have enough time to write log information, in which case<br />
the MasterLog should be checked first.<br />
As highlighted in Example 6-128, the Master daemon has recorded some errors<br />
while starting LoadL_negotiator. As we can see, the Negotiator died shortly<br />
after it started.<br />
Example 6-128 Negotiator error messages in MasterLog<br />
04/05 15:23:17 TI-1 CentralManager = bglsn.itso.ibm.com<br />
04/05 15:23:17 TI-1 Inode monitoring will not be performed on<br />
/home/loadl/log because it is a Reiser Filesystem which does not limit<br />
the number of inodes.<br />
04/05 15:23:17 TI-1 LoadLeveler: LoadL_master started, pid = 17689<br />
04/05 15:23:17 TI-11 LoadLeveler: 2539-463 Cannot connect to<br />
bglsn.itso.ibm.com "LoadL_negotiator" on port 9614. errno = 111<br />
04/05 15:23:17 TI-10 LoadL_negotiator started, pid = 17710<br />
04/05 15:24:21 TI-10 "LoadL_negotiator" on "bglsn.itso.ibm.com" died<br />
due to signal 11, attempting to restart<br />
04/05 15:24:21 TI-10 LoadL_negotiator started, pid = 18330<br />
04/05 15:24:27 TI-21 LoadLeveler: 2539-463 Cannot connect to<br />
bglsn.itso.ibm.com "LoadL_negotiator" on port 9614. errno = 111<br />
04/05 15:30:28 TI-10 "LoadL_negotiator" on "bglsn.itso.ibm.com" died<br />
due to signal 11, attempting to restart<br />
04/05 15:30:28 TI-10 LoadL_negotiator started, pid = 18506<br />
04/05 15:30:28 TI-28 Got SHUTDOWN command from "loadl" with uid = 7001<br />
on machine "bglsn.itso.ibm.com"<br />
04/05 15:30:28 TI-28 Master shutting down now.<br />
04/05 15:30:28 TI-30 LoadLeveler: 2539-463 Cannot connect to<br />
bglsn.itso.ibm.com "LoadL_negotiator" on port 9614. errno = 111<br />
Another error message in the MasterLog also indicates a problem with<br />
connecting to TCP port 9614. Investigating the NegotiatorLog reveals the same<br />
error messages (see Example 6-129).<br />
Example 6-129 Port 9614 error messages in NegotiatorLog<br />
04/05 15:30:28 TI-1 *************************************************<br />
04/05 15:30:28 TI-1 *** LOADL_NEGOTIATOR STARTING UP ***<br />
04/05 15:30:28 TI-1 *************************************************<br />
04/05 15:30:28 TI-1<br />
04/05 15:30:28 TI-1 LoadLeveler: LoadL_negotiator started, pid = 18506<br />
04/05 15:30:28 TI-6 LoadLeveler: 2539-479 Cannot listen on port 9614<br />
for service LoadL_negotiator.<br />
04/05 15:30:28 TI-1 void<br />
LlMachine::queueStreamMaster(OutboundTransAction*): Set destination to<br />
master. Transaction route flag is now central manager sending<br />
transaction CMnotifyCmd to Master<br />
04/05 15:30:28 TI-6 LoadLeveler: Batch service may already be running<br />
on this machine.<br />
04/05 15:30:28 TI-6 LoadLeveler: Delaying 1 seconds and retrying ...<br />
04/05 15:30:28 TI-7 LoadLeveler: Listening on port 9612 service<br />
LoadL_negotiator_collector<br />
04/05 15:30:28 TI-6 LoadLeveler: 2539-479 Cannot listen on port 9614<br />
for service LoadL_negotiator.<br />
04/05 15:30:28 TI-6 LoadLeveler: Batch service may already be running<br />
on this machine.<br />
04/05 15:30:28 TI-6 LoadLeveler: Delaying 2 seconds and retrying ...<br />
04/05 15:30:28 TI-8 LoadLeveler: Listening on path<br />
/tmp/negotiator_unix_stream_socket<br />
04/05 15:30:28 TI-10 Dispatching.<br />
Because something is clearly wrong with TCP port 9614, we use the netstat<br />
command to check its status, as shown in Example 6-130. The command output<br />
shows that TCP port 9614 is already in LISTEN state, which is OK for now.<br />
However, at the time the Negotiator had the problem, this port could have been<br />
in a different state, such as FIN_WAIT; the port might not have been available<br />
because the Negotiator starts and terminates rapidly.<br />
Example 6-130 Checking the state of a port or socket<br />
bglsn:/tmp # netstat -an | grep 9614<br />
tcp 0 0 0.0.0.0:9614 0.0.0.0:*<br />
LISTEN<br />
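Because the port state can change between the crash and a manual check, it helps to sample it a few times in a row. A sketch that prints the local address and connection state for port 9614, assuming the netstat -an output format shown above:<br />

```shell
# Sample the state of port 9614 three times, one second apart.
for i in 1 2 3; do
  netstat -an 2>/dev/null | awk '$4 ~ /[:.]9614$/ {print $4, $6}'
  sleep 1
done
```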
Additional error messages pointing to problems with creating a socket are<br />
revealed in SchedLog and StartLog, as shown in Example 6-131.<br />
Example 6-131 Socket error messages in StartLog and SchedLog<br />
Startlog:<br />
04/05 13:00:18 TI-10 LoadLeveler: 2539-484 Cannot start unix socket on<br />
path /tmp/startd_unix_dgram_socket. errno = 98<br />
04/05 13:00:18 TI-4 Starter Table:<br />
04/05 13:00:18 TI-16 LoadLeveler: 2539-463 Cannot connect to<br />
bglsn.itso.ibm.com "LoadL_negotiator_collector" on port 9612. errno =<br />
111<br />
SchedLog:<br />
04/05 13:22:31 TI-99 LoadLeveler: 2539-475 Cannot receive command from<br />
client bglsn.itso.ibm.com, errno =25.<br />
LoadLeveler: Failed to route task_resource_req_list (43008) in virtual<br />
int Task::encode(LlStream&)<br />
LoadLeveler: Failed to route node_tasks (34006) in virtual int<br />
Node::encode(LlStream&)<br />
LoadLeveler: Failed to route step_nodes (40033) in virtual int<br />
Step::encode(LlStream&)<br />
LoadLeveler: Failed to route StepList Steps (41002) in virtual int<br />
StepList::encode(LlStream&)<br />
LoadLeveler: Failed to route job_steps (22009) in virtual int<br />
Job::encode(LlStream&)<br />
04/05 13:22:31 TI-97 LoadLeveler: 2539-475 Cannot receive command from<br />
client bglsn.itso.ibm.com, errno =25.<br />
LoadLeveler: Failed to route job_environment_vectors (22008) in virtual<br />
int Job::encode(LlStream&)<br />
As it turns out, the key to this problem is that in a Linux environment, certain<br />
daemons open their temporary communication sockets in the /tmp directory,<br />
and most LoadLeveler daemons do.<br />
In normal situations, a daemon opens the socket in /tmp then removes it on clean<br />
exit. However, if some environment or network problems occur and the daemon<br />
exits abnormally, the temporary socket file might be left behind.<br />
When the daemon is started the next time, it might not be able to overwrite the<br />
same file, especially when the daemon is started under a different user ID. In<br />
LoadLeveler’s case, this can happen if the loadl user starts LoadLeveler after it<br />
was previously started (and did not exit cleanly) by the root user.<br />
As a result, the file left over by the root user cannot be overwritten by another<br />
user (loadl). These files need to be cleaned up manually from the /tmp directory<br />
on every node before LoadLeveler can be started again.<br />
Note: The socket files created by different daemons have different names, as<br />
shown in Example 6-132.<br />
Example 6-132 Daemons’ socket files in /tmp<br />
bglsn:/tmp # ls -l *socket<br />
srwxrwxrwx 1 loadl loadl 0 Apr 5 15:40 negotiator_unix_stream_socket<br />
bglfen1:/tmp # ls -l *socket<br />
srwxrwxrwx 1 loadl loadl 0 Apr 5 13:33 startd_unix_dgram_socket<br />
srwxrwxrwx 1 loadl loadl 0 Apr 5 13:33 startd_unix_stream_socket<br />
To detect left-over socket files in /tmp on every node, first stop LoadLeveler on<br />
all nodes. Use the ps command to make sure that no LoadLeveler daemons are<br />
still running. Then, check the /tmp directory on every node for socket files with<br />
names similar to the ones in Example 6-132.<br />
In our case, this was the issue. We solved it by stopping LoadLeveler on all<br />
nodes, removing the leftover socket files, and restarting LoadLeveler.<br />
If manual removal of the socket files does not resolve the problem, the next step<br />
is to look into the core files generated by the Negotiator on abnormal exit. We<br />
can identify the core files from the following error messages:<br />
04/05 15:24:21 TI-10 "LoadL_negotiator" on "bglsn.itso.ibm.com" died<br />
due to signal 11, attempting to restart<br />
These errors can be found in the MasterLog shown in Example 6-128 on<br />
page 377.<br />
On a Blue Gene/L system, it is usual to set up processes to redirect their core<br />
files into the common directory /bgl/cores (/bgl is NFS mounted on all nodes:<br />
SN, FENs, and I/O nodes).<br />
However, LoadLeveler on Linux has one requirement to be able to generate core<br />
files: LoadLeveler has to be started as the root user ID.<br />
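Core file generation also depends on the process's core-size resource limit, which is inherited from the starting shell. A quick sanity check before starting LoadLeveler:<br />

```shell
# A limit of 0 means core dumps are disabled for processes started here.
cur=$(ulimit -c)
echo "core file size limit: $cur"
if [ "$cur" = "0" ]; then
  echo "raise it with 'ulimit -c unlimited' before running llctl -g start"
fi
```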
If LoadLeveler is not set up to run as the root user, this needs to be changed first.<br />
See Appendix A, “Installing and setting up LoadLeveler for Blue Gene/L” on<br />
page 409 for a procedure to set up the loadl user ID, then issue the command<br />
llctl -g start as root. If the Negotiator daemon crashes with signal 11 again,<br />
this time a core file is going to be generated in /bgl/cores/.<br />
Then, the core file can be investigated using a debugger (gdb). Depending on<br />
the information revealed by the stack trace of the memory dump, different<br />
procedures can be followed. If the stack trace does not contain enough<br />
debugging information, a debug-enabled version of the LoadL_negotiator binary<br />
has to be used to re-create the problem and generate the core file again.<br />
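Extracting the stack trace can be done non-interactively. A sketch; the core file name is illustrative and must be replaced with the actual file produced in /bgl/cores/, and the binary path assumes the installation directory shown in Example 6-127:<br />

```shell
# Hypothetical core file name; substitute the real one from /bgl/cores/.
BIN=/opt/ibmll/LoadL/full/bin/LoadL_negotiator
CORE=/bgl/cores/core.18506

if command -v gdb >/dev/null 2>&1 && [ -r "$CORE" ]; then
  gdb -batch -ex "bt" "$BIN" "$CORE"    # print the backtrace and exit
else
  echo "gdb or core file not available"
fi
```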
Additional checking can continue with the libraries validation and their symbolic<br />
links. See “Environment variables, network, and library links” on page 206.<br />
Lessons learned<br />
We learned the following lessons:<br />
► If a daemon process exits abnormally, a new process might be spawned<br />
automatically. This is controlled in the LoadL_config file with the<br />
RESTARTS_PER_HOUR keyword.<br />
► A temporary socket file is generated in the /tmp directory by the LoadLeveler<br />
daemons. If a daemon exits abnormally, the socket file might be left behind.<br />
► A Negotiator process that exits with signal 11 should generate a core file for<br />
debugging purposes. On Linux systems, this core file is generated in the /tmp<br />
directory by default. However, on a Blue Gene/L system, this is usually<br />
configured by the system administrator to go to /bgl/cores/.<br />
Chapter 7. Additional topics<br />
This chapter presents two additional topics of interest for a Blue Gene/L<br />
environment:<br />
► Cluster Systems Management<br />
► Secure shell<br />
Although a basic Blue Gene/L system can function without either Cluster<br />
Systems Management or secure shell, these products are needed for the<br />
centralized management and integration of the Blue Gene/L system in your<br />
computing environment.<br />
© Copyright IBM Corp. 2006. All rights reserved. 383
7.1 Cluster Systems Management<br />
This section provides a high-level introduction to Cluster Systems Management<br />
(CSM) and its use with Blue Gene/L. We do not provide the detailed information<br />
that you need to plan, install, and run CSM. For this, refer to the CSM product<br />
documentation. CSM support for Blue Gene/L was released as part of CSM 1.5<br />
in November 2005. The information in this section pertains to this level of CSM.<br />
However, be sure to check the latest CSM documentation for updates,<br />
available at:<br />
http://www14.software.ibm.com/webapp/set2/sas/f/csm/home.html<br />
7.1.1 Overview of CSM<br />
The Blue Gene Service Node software records configuration, RAS, and<br />
environmental information in the Blue Gene DB2 database. It also provides a<br />
Web interface and several CLI tools for working with the information that is<br />
stored. However, it is up to the Blue Gene administrators and users to watch or<br />
check the database, to determine when a problem occurs, and then to take<br />
appropriate action.<br />
CSM is an IBM licensed software product that is used to manage clusters<br />
consisting of AIX and Linux systems, from one to thousands. CSM has many<br />
capabilities, but in this section we focus on just one: its rich event monitoring and<br />
automated response capability. With CSM, you can specify what constitutes an<br />
event, and what should happen, automatically, if and when that event occurs.<br />
By installing CSM on your Blue Gene SN, you can automate the task of watching<br />
the database and taking corrective actions. For example, you could use CSM to<br />
watch the status of all the midplanes. If a midplane is marked in error or is<br />
marked as missing, the CSM software can detect this event and take whatever<br />
action that you have specified automatically. Perhaps you want to be paged or to<br />
receive an urgent e-mail when the event occurs. Or perhaps a special script<br />
should be run. Or perhaps all of these responses should happen simultaneously.<br />
CSM allows you to set up whatever monitoring and automated responses you<br />
need.<br />
If you are thinking “This monitoring and automated response stuff sounds<br />
interesting, but I don’t get it. Why drag a cluster management product like CSM<br />
into the picture?” Well, one good reason is precisely the monitoring and<br />
automated response capabilities of CSM! These capabilities are very powerful<br />
and customizable. You can ignore everything else about CSM if you choose.<br />
Over time, as you become more familiar with the other capabilities of CSM and<br />
begin to view your Blue Gene/L systems, such as SN, Front-End Nodes (FENs),<br />
and File Servers, as a set of systems that you would like to manage from a single<br />
point of control, you might find more and more reasons to use other features of<br />
CSM.<br />
Moreover, if you have a raft of other IBM systems around (running Linux and<br />
AIX), these could be centrally managed along with your Blue Gene system from<br />
the same management server. For now, however, we concentrate on a one-node<br />
CSM cluster, your Blue Gene SN.<br />
To use CSM with your Blue Gene in the simplest manner possible, begin with a<br />
fully installed, fully operational Blue Gene/L system, including SN, FENs, and file<br />
servers. Next, obtain CSM through your IBM sales representative, or just grab<br />
the free, full-featured 60 day try-and-buy version from:<br />
http://www14.software.ibm.com/webapp/set2/sas/f/csm/home.html<br />
Follow the instructions in the manual CSM for AIX 5L and Linux V1.5 Planning<br />
and Installation <strong>Guide</strong>, SA23-1344-01 to install and configure the CSM<br />
management server software on your SN. Then follow the instructions in the<br />
same book for adding the optional CSM support for Blue Gene. When you’ve<br />
done this, your Blue Gene SN will double as a CSM management server and as<br />
the lone managed node in the CSM cluster.<br />
Note: The CSM software (server and client) is installed on the SN. If you want<br />
to configure your FENs and File Servers as managed nodes too, you can do so<br />
by installing the CSM client software, but that is optional. However, no CSM<br />
software is installed on the Blue Gene I/O or Compute Nodes.<br />
In the sections that follow, we discuss the monitoring and automated response<br />
capabilities of CSM and how to use them. You can find more information in the<br />
following CSM and Reliable Scalable Cluster Technology (RSCT) product<br />
publications (all available at the link previously mentioned):<br />
► CSM for AIX 5L and Linux V1.5 Administration <strong>Guide</strong>, SA23-1343-01<br />
► CSM for AIX 5L and Linux V1.5 Command and Technical Reference,<br />
SA23-1345-01<br />
► RSCT Administration <strong>Guide</strong>, SA22-7889-10<br />
► RSCT for Linux Technical Reference, SA22-7893-10<br />
7.1.2 Monitoring the Blue Gene/L database with CSM<br />
When CSM is installed and configured on your SN (including the optional Blue<br />
Gene support), you can monitor the Blue Gene database for events of interest<br />
using a few simple commands.<br />
Assuming we have a condition (BGNodeErr), first you need to associate the<br />
condition with a response (BroadcastEventsAnyTime) by running the following<br />
command:<br />
# mkcondresp BGNodeErr BroadcastEventsAnyTime<br />
Then, you have to start monitoring the condition by running the command:<br />
# startcondresp BGNodeErr<br />
► A condition is a persistent CSM monitoring construct that identifies what to<br />
monitor, and what to monitor for. Basically, BGNodeErr is concerned with the<br />
Blue Gene database table TBGLNode, and in particular, is concerned with<br />
TBGLNode row updates that set the status column to E (Error) or M (Missing).<br />
► A response is a persistent CSM monitoring construct that identifies an action<br />
to take. In this example, BroadcastEventsAnyTime is a response that puts up<br />
a wall message for each event passed to it.<br />
► An event is a dynamic CSM monitoring construct generated by CSM when a<br />
monitored condition’s event expression evaluates true (that is, when the thing<br />
being monitored for occurs).<br />
By running startcondresp BGNodeErr you are effectively telling CSM to monitor<br />
the Blue Gene database table TBGLNode for row updates that set the status<br />
column to E or M. This means that CSM is expected to generate an event<br />
whenever either type of row update occurs. Because you ran mkcondresp<br />
BGNodeErr BroadcastEventsAnyTime before that, CSM also knows that it should<br />
pass all such events to the BroadcastEventsAnyTime response, which, by<br />
design, puts up a wall message for each event passed to it.<br />
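To see which responses are linked to a condition, or to pause and clean up monitoring, you can use the companion RSCT commands. The session below is a sketch (output omitted):

```
# lscondresp BGNodeErr
(lists the responses linked to BGNodeErr and whether each is active)
# stopcondresp BGNodeErr
(stops monitoring the condition; the condition/response links are kept)
# rmcondresp BGNodeErr BroadcastEventsAnyTime
(removes the condition/response link entirely)
```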
BGNodeErr is a predefined condition. CSM provides many predefined<br />
conditions. To get a list, simply run lscondition. To learn what a particular<br />
condition is for, run lscondition condition_name, as shown in Example 7-1.<br />
Example 7-1 Displaying condition information<br />
# lscondition BGNodeErr<br />
Displaying condition information:<br />
condition 1:<br />
Name = "BGNodeErr"<br />
Node = "c96m5sn02"<br />
MonitorStatus = "Not monitored"<br />
ResourceClass = "IBM.Sensor"<br />
EventExpression = "SD.Uint32>0"<br />
EventDescription = "An event will be generated when the node<br />
status is \"Error\" or \"Missing\" for an I/O or Compute Node in the<br />
Blue Gene system."<br />
RearmExpression = ""<br />
RearmDescription = ""<br />
SelectionString = "Name==\"BGNodeErr\""<br />
Severity = "c"<br />
NodeNames = {}<br />
MgtScope = "m"<br />
BroadcastEventsAnyTime is a predefined response. CSM provides many<br />
predefined responses. To get a list, simply run lsresponse. To learn what a<br />
particular response does, run lsresponse response_name, as shown in<br />
Example 7-2.<br />
Example 7-2 Displaying response information<br />
# lsresponse BroadcastEventsAnyTime<br />
Displaying response information:<br />
ResponseName = "BroadcastEventsAnyTime"<br />
Node = "c96m5sn02"<br />
Action = "wallEvent"<br />
DaysOfWeek = 1-7<br />
TimeOfDay = 0000-2400<br />
ActionScript = "/usr/sbin/rsct/bin/wallevent"<br />
ReturnCode = 0<br />
CheckReturnCode = "n"<br />
EventType = "b"<br />
StandardOut = "n"<br />
EnvironmentVars = ""<br />
UndefRes = "n"<br />
7.1.3 Customizing the monitoring capabilities of CSM<br />
The monitoring capabilities and the predefined conditions and responses that<br />
come with CSM are powerful and useful. However, these capabilities are also<br />
customizable. Before talking about the commands that you can use to<br />
customize the monitoring capabilities of CSM, we need to describe in more detail<br />
the whole CSM Blue Gene database monitoring story.<br />
To simplify the earlier monitoring discussion, we purposely neglected to mention<br />
a few things. If you examine the output of lscondition BGNodeErr shown in<br />
Example 7-1, you will notice that there is no mention of the TBGLNode table, or<br />
our interest in row updates that set the status column to E or M.<br />
So where is this encoded? And how does the monitoring of the Blue Gene<br />
database really work? Let us consider the diagram in Figure 7-1.<br />
Figure 7-1 Blue Gene CSM monitoring diagram (layers, bottom to top: the Blue Gene<br />
database (DB2) and the Service Node software that writes to it; a database trigger;<br />
a database stored procedure; a CSM sensor; a CSM condition; a CSM response; an<br />
arrow on the right indicates the upward flow of event data)<br />
At the base of the diagram are the Blue Gene database and the SN software that<br />
writes to the database. Everything above that comes with CSM or is created by<br />
CSM when you run various commands. At the top are a couple of CSM<br />
monitoring constructs that we talked about already — a response and a<br />
condition.<br />
Below these is a CSM monitoring construct called a sensor, which we will explain<br />
later. And below the sensor are two DB2 constructs - a stored procedure and a<br />
trigger. The sensor, condition, and response used are up to you. You can use<br />
predefined ones, ones that you define, or a mix of the two types. We discuss how<br />
to define your own later. The trigger and stored procedure are created<br />
automatically for you by CSM when you start monitoring with the startcondresp<br />
command mentioned in the previous section.<br />
The large up-pointing arrow on the right simply indicates the overall flow; in<br />
layman’s terms, the trigger watches the database for the thing of interest to<br />
happen. If and when that thing happens, the trigger gathers pertinent data and<br />
passes it to the stored procedure. The stored procedure is just a middleman that<br />
passes the data to the sensor.<br />
When the condition becomes aware of new data in the sensor, the condition<br />
evaluates its EventExpression, which is based on sensor data. If the<br />
EventExpression evaluates true, an event is generated and passed to the<br />
response. The response does its thing — puts up a wall message, or sends an<br />
e-mail, or runs a script — whatever it is defined to do.<br />
Whoa! That is pretty complicated! Can’t you just insert your own database trigger<br />
and stored procedure into the database and initiate a desired action directly?<br />
The answer is "Yes," but you would need strong DB2 database administrator and<br />
programmer skills and the willingness to invest the time and effort to develop and<br />
to test the necessary DB2 constructs and code.<br />
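For the sake of illustration only, such a hand-rolled trigger might look like the following sketch. This is not the SQL that CSM generates; the table and column names follow the examples in this chapter, and NOTIFY_ADMIN is a hypothetical stored procedure that you would have to write and register yourself.

```sql
-- Illustrative DB2 trigger: react directly to node status changes.
-- NOTIFY_ADMIN is a hypothetical stored procedure, not part of CSM.
CREATE TRIGGER MY_NODE_ERR
  AFTER UPDATE OF STATUS ON TBGLNODE
  REFERENCING NEW AS n
  FOR EACH ROW MODE DB2SQL
  WHEN (n.STATUS = 'E' OR n.STATUS = 'M')
  BEGIN ATOMIC
    CALL NOTIFY_ADMIN(n.LOCATION, n.STATUS);
  END
```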
Before introducing the CSM commands used to create custom monitoring<br />
constructs, let’s take a closer look at the constructs called sensors. As the<br />
diagram above shows, a condition is paired with a sensor. Look again at the<br />
output of lscondition BGNodeErr shown in Example 7-1. Two attributes specify<br />
with which sensor the condition BGNodeErr is paired:<br />
ResourceClass = "IBM.Sensor"<br />
SelectionString = "Name==\"BGNodeErr\""<br />
That is, condition BGNodeErr is paired with a sensor of the same name. You can<br />
obtain sensor details by running the lssensor sensor_name command, as shown<br />
in Example 7-3.<br />
Example 7-3 Displaying sensor information<br />
# lssensor BGNodeErr<br />
Name = BGNodeErr<br />
ActivePeerDomain =<br />
Command = /opt/csm/csmbin/bgmanage_trigger -t TBGLNODE -C STATUS<br />
-o u -x "n.STATUS = 'E' OR n.STATUS = 'M'" -p LOCATION,STATUS<br />
BGNodeErr<br />
ConfigChanged = 0<br />
ControlFlags = 5<br />
Description = This sensor is updated when the node status is "Error"<br />
or "Missing" in the Blue Gene system. Use "SD.Uint32>0" as the event<br />
expression in all corresponding conditions.<br />
ErrorExitValue = 1<br />
ExitValue = 0<br />
Float32 = 0<br />
Float64 = 0<br />
Int32 = 0<br />
Int64 = 0<br />
NodeNameList = {c96m5sn02}<br />
RefreshInterval = 0<br />
SD = [,0,0,0,0,0,0]<br />
SavedData =<br />
String =<br />
Uint32 = 0<br />
Uint64 = 0<br />
UserName = bglsysdb<br />
Here, you see that sensor BGNodeErr’s Command attribute identifies TBGLNode<br />
as the Blue Gene table of interest, and n.STATUS = 'E' OR n.STATUS = 'M' as<br />
the column values to watch for. When you start monitoring, the sensor’s<br />
Command (bgmanage_trigger), is called with the arguments shown. It is<br />
bgmanage_trigger’s job to create the DB2 Trigger and Stored Procedure<br />
required.<br />
The purpose of this exposé is to point out that there are three main CSM<br />
monitoring constructs that are involved in monitoring the Blue Gene/L database:<br />
a sensor, a condition, and a response. There are many predefined ones that you<br />
can use, but if these do not meet your needs entirely, you can define your own.<br />
7.1.4 Defining your own CSM monitoring constructs<br />
In general, you define custom sensors with the CSM mksensor command.<br />
However, for Blue Gene/L sensors, you should use the CSM bgmksensor<br />
command instead because it understands the Blue Gene/L database and<br />
provides all the flags and options that are needed for creating a Blue Gene/L<br />
database sensor. You define custom conditions and responses using the<br />
standard CSM mkcondition and mkresponse commands.<br />
For example, suppose that you want to monitor the TBGLFanEnvironment table<br />
for high fan temperatures. CSM provides no predefined sensor for this type of<br />
monitoring. However, it is easy to define your own. On the SN, you can run this<br />
command:<br />
# bgmksensor -t TBGLFanEnvironment -o i -x "n.temperature>35"<br />
-p location,temperature BGFanTempHi<br />
This command translates to:<br />
Create a Blue Gene sensor named BGFanTempHi. Whenever a row is<br />
inserted in the TBGLFanEnvironment table with a temperature above 35<br />
degrees Celsius, the sensor caches the fan location and temperature and<br />
notifies all conditions that care. (The values to cache are specified with the -p<br />
flag, and are the values passed to the response in the generated event<br />
notification).<br />
You then create a condition to monitor this new sensor. For example, run a<br />
command similar to the following on your CSM management server:<br />
# mkcondition -d 'Generate an event when the temperature of a Blue<br />
Gene fan module rises above 35 degrees Celsius.' -r IBM.Sensor -s<br />
'Name=="BGFanTempHi"' -m m -e "SD.Uint32>0" BGFanTempHi<br />
This command translates to:<br />
Create a condition named BGFanTempHi to monitor the sensor with the<br />
same name. (The -m and -e flags must simply be set as shown for all<br />
conditions created to monitor Blue Gene sensors).<br />
To start monitoring, run the following commands on your CSM management<br />
server:<br />
# mkcondresp BGFanTempHi MsgEventsToRootAnyTime "E-mail root<br />
off-shift" LogCSMEventsAnyTime<br />
# startcondresp BGFanTempHi<br />
After you have done all this, CSM creates the necessary DB2 trigger and stored<br />
procedure for you and turns monitoring on for condition BGFanTempHi<br />
automatically.<br />
To verify that all is working properly, check the following items:<br />
► As root, run the following command:<br />
lsaudrec -l<br />
Near or at the bottom of the output, you should see the information shown in<br />
Example 7-4.<br />
Example 7-4 Checking monitoring conditions<br />
# lsaudrec -l<br />
.....>>> Omitted lines <<<.....<br />
With monitoring on, each time a row is added to the TBGLFanEnvironment<br />
table with a temperature above 35 degrees Celsius, an event will be<br />
generated. Due to the mkcondresp command (see “Monitoring the Blue<br />
Gene/L database with CSM” on page 386), the event will kick off three<br />
responses on the management server:<br />
– A message to root announcing the event (Example 7-7).<br />
Example 7-7 Message to root user when event BGFanTempHi happens<br />
Message from root@c96m5sn02 on at 15:01 ...<br />
Critical Event occurred:<br />
Condition: BGFanTempHi<br />
Node: c96m5sn02.ppd.pok.ibm.com<br />
Resource: BGFanTempHi<br />
Resource Class: Sensor<br />
Resource Attribute: SD<br />
Attribute Type: CT_SD_PTR<br />
Attribute Value: [location=J302 temperature=3.6E1,0,1,0,0,0,0]<br />
Time: Friday 07/21/06 15:00:59<br />
– An e-mail to root user (on CSM management server) with event details if<br />
the event occurs off-shift.<br />
– Logging of the event in the /var/log/csm/systemEvents file.<br />
The e-mail and log entry are shown in Example 7-8.<br />
Example 7-8 The e-mail and log entry for BGFanTempHi event<br />
Friday 07/21/06 15:00:59<br />
Condition Name: BGFanTempHi<br />
Severity: Critical<br />
Event Type: Event<br />
Expression: SD.Uint32>0<br />
Resource Name: BGFanTempHi<br />
Resource Class: IBM.Sensor<br />
Data Type: CT_SD_PTR<br />
Data Value: ["location=J302 temperature=3.6E1",0,1,0,0,0,0]<br />
Node Name: c96m5sn02.ppd.pok.ibm.com<br />
Node NameList: {c96m5sn02.ppd.pok.ibm.com}<br />
Resource Type: 0<br />
7.1.5 Miscellaneous related information<br />
If you are interested in examining the DB2 Trigger and Stored Procedure that<br />
CSM creates for you, run bgmksensor with the -v flag (bgmksensor formulates the<br />
SQL statements used to create the Trigger and Stored Procedure before they’re<br />
actually needed so that it can test them), or query the Blue Gene database<br />
directly (after you have started monitoring with the startcondresp command).<br />
The database trigger name is derived from the sensor name by appending _CSM<br />
suffix (for example, for a sensor named BGFanTempHi, there will be a Trigger<br />
named BGFanTempHi_CSM when that sensor is actively used in monitoring).<br />
The database stored procedure story is actually more complicated than we have<br />
described. There are two stored procedures: one named COMMON_BGP and<br />
the other COMMON_BGP_ext. The trigger calls COMMON_BGP, which in turn<br />
calls COMMON_BGP_ext. COMMON_BGP exists to catch any SQL exceptions<br />
that occur. COMMON_BGP_ext calls a utility named refresh_sensor in a shared<br />
library named bgrefresh_sensor.so. refresh_sensor writes the data into the<br />
sensor.<br />
There is yet another DB2 construct that we have not mentioned because it<br />
plays a minor role: a DB2 sequence. Its name is also derived from the sensor<br />
name by appending _CSM (for example, for a sensor named BGFanTempHi, there<br />
will be a sequence named BGFanTempHi_CSM). The trigger uses the sequence to<br />
obtain a new number for each event it forwards to COMMON_BGP.<br />
After you have started monitoring, you can log on to the SN as bglsysdb and run<br />
the commands shown in Example 7-9.<br />
Example 7-9 Examining the constructs created by CSM in the Blue Gene database<br />
bglsysdb@bglsn~> db2 connect to bgdb0<br />
bglsysdb@bglsn~> db2 "select text from syscat.triggers where trigname =<br />
'BGFANTEMPHI_CSM'"<br />
bglsysdb@bglsn~> db2 "select procname,text from syscat.procedures where procname like<br />
'COMMON_BGP%'"<br />
bglsysdb@bglsn~> db2 "select * from syscat.sequences where seqname = '<br />
BGFANTEMPHI_CSM'"<br />
7.1.6 Conclusion<br />
CSM brings powerful and customizable monitoring and automated response<br />
capabilities to the Blue Gene/L environment. By exploiting them you can<br />
minimize or eliminate much of the manual problem determination work often<br />
facing the Blue Gene/L system administrator.<br />
Furthermore, because these capabilities are extensions to existing CSM<br />
capabilities, you can easily monitor and set up automated responses for<br />
non-Blue Gene/L database problems as well. For example, get paged when the<br />
/var file system on your SN fills up; send an urgent e-mail to the appropriate<br />
person when the number of users on a FEN crosses some threshold; or run a<br />
script when the network adapter on a File Server is being overwhelmed. Moving<br />
beyond monitoring, CSM offers a wealth of other capabilities that could help you<br />
manage your Blue Gene/L systems, such as distributed command execution,<br />
configuration file management, software maintenance, and so forth.<br />
7.2 Secure shell<br />
This section begins with a short introduction to cryptographic techniques in a<br />
computing environment, continues with a secure shell overview, and ends with<br />
an example of how to use secure shell in a clustering environment.<br />
7.2.1 Basic cryptography<br />
One of the biggest problems in a networking environment is to design and<br />
implement a security mechanism that allows the whole computing environment<br />
to function properly, without interruptions, and also to ensure reliable data<br />
manipulation (data you can trust). Depending on the information (data) travelling<br />
across networks, you need to make sure that:<br />
► The data gets from sender to receiver unaltered: data integrity.<br />
► The data cannot be interpreted (understood) by anyone eavesdropping on the<br />
communication channel: data privacy.<br />
► The data that arrives at the receiver side comes from whom the receiver<br />
thinks it is coming from: data authenticity.<br />
For all these reasons and more, a series of cryptographic techniques has been<br />
developed and implemented. The cryptographic techniques are based on<br />
mathematical algorithms translated (programmed) into computer language.<br />
Thus, security has become an important component of a highly available and<br />
reliable computing environment.<br />
Cryptographic techniques are employed for data integrity, data privacy, and<br />
data authenticity in various forms and complexities, depending on the data<br />
security required:<br />
► Authentication (verifying identities)<br />
The parties communicating need to know and verify (with a reasonable level<br />
of trust) each other’s identity.<br />
► Authorization (access control)<br />
After the parties’ identities have been established, authorization determines<br />
what actions they are allowed to perform.<br />
► Data signing<br />
In addition to establishing the identities at the beginning of the communication<br />
session, it is also necessary to avoid the “man in the middle” attack.<br />
► Data encryption<br />
If we also want to prevent someone else (a third party) from understanding<br />
the data, we need to encrypt it in such a way that it can only be decrypted at<br />
the destination.<br />
► Accountability<br />
For multiple reasons we need to be able to trace back any system activity.<br />
It is also important to establish the effective level of security, in order to ensure<br />
an acceptable performance level. Too much security can be counterproductive,<br />
because the systems can spend more time enforcing security than doing<br />
useful computing.<br />
The encryption-based mechanisms used in securing network communication<br />
are:<br />
► Symmetric key<br />
► Public/Private key pair (also known as PKI - Public Key Infrastructure)<br />
► Hash functions<br />
► Combinations of these techniques<br />
► Hardware cryptography<br />
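As a quick illustration of how hash functions support data integrity, any change to a message, however small, produces a completely different digest. This sketch uses the openssl command line, which is assumed to be available on the system:

```shell
# Compute a SHA-256 digest of a message, then of a slightly altered message.
# Any difference in the input yields a completely different digest.
printf 'Message from Nina to Dan' | openssl dgst -sha256 -r
printf 'Message from Nina to Dan!' | openssl dgst -sha256 -r
```

Comparing the two digests on the receiving side reveals whether the data was altered in transit.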
Symmetric key cryptography<br />
Symmetric key cryptography (Figure 7-2) uses a single (secret) key known by<br />
both parties in communication. It is relatively fast, but it has a drawback: the<br />
shared secret (key) must somehow be distributed to the communicating<br />
parties. If this shared secret is distributed over the (insecure) network, additional<br />
precautions must be taken.<br />
Figure 7-2 Symmetric key encryption (Nina and Dan hold an identical key; the message<br />
from Nina to Dan crosses the unsecured network encrypted with that shared key and is<br />
decrypted with the same key on arrival)<br />
Algorithms used for symmetric key implementations include:<br />
► Data Encryption Standard (DES) algorithm<br />
The most commonly-used bulk cipher or block cipher algorithm, which was<br />
developed by IBM.<br />
► Commercial Data Masking Facility (CDMF)<br />
A method to shrink a 56-bit DES key to a 40-bit key suitable for export, which<br />
was also designed by IBM.<br />
► Triple DES<br />
The same as DES, but information is encrypted three times in a row using<br />
different keys each time.<br />
► RC2/RC4 algorithms<br />
RC2 is a block cipher (similar to DES); RC4 is a stream cipher (a 40-bit key is<br />
possible). Both were developed by Ron Rivest (RSA Data Security) and permit<br />
variable-length keys.<br />
► International Data Encryption Algorithm (IDEA)<br />
Has a 128-bit key and is not a government-imposed standard. Pretty Good<br />
Privacy (PGP) uses IDEA and is freely available.<br />
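The shared-secret idea can be sketched with the openssl command line. AES is used here as a modern stand-in for the DES-family ciphers above; the file names and passphrase are illustrative, and the -pbkdf2 option assumes a reasonably recent OpenSSL:

```shell
# Both parties must know the same secret; here it is a passphrase from
# which openssl derives the actual encryption key.
SECRET='Dan-and-Ninas-shared-secret'
WORK=$(mktemp -d)
printf 'Message from Nina to Dan' > "$WORK/message.txt"
# Nina encrypts with the shared secret...
openssl enc -aes-256-cbc -pbkdf2 -k "$SECRET" \
    -in "$WORK/message.txt" -out "$WORK/message.enc"
# ...and Dan decrypts with the identical secret.
openssl enc -d -aes-256-cbc -pbkdf2 -k "$SECRET" \
    -in "$WORK/message.enc" -out "$WORK/message.dec"
cmp "$WORK/message.txt" "$WORK/message.dec" && echo 'round trip OK'
```

Note that anyone who obtains the passphrase can decrypt the traffic, which is exactly the key-distribution drawback described above.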
Public Key Infrastructure<br />
Public Key Infrastructure (PKI) is based on a pair of asymmetric keys for each<br />
party in communication:<br />
► A private key, which never leaves the host where it has been generated.<br />
► A public key, which is sent over the network to the corresponding party for<br />
various purposes, such as digital signature, authentication, and so forth.<br />
In PKI, information encrypted with one key (either public or private) can only<br />
be decrypted with its pair. There is no need to securely share a secret between<br />
the sender and receiver. However, this mechanism is much less efficient than<br />
symmetric key encryption, so it is not suitable for bulk data encryption.<br />
The only widely used general-purpose public key mechanism is the Rivest,<br />
Shamir, and Adleman (RSA) algorithm, which relies on the factorization of large<br />
numbers and is the property of RSA Data Security, Inc.<br />
Figure 7-3 presents a simplified diagram of how the PKI is used. In this case,<br />
Nina is sending an encrypted message (data) to Dan. For this, Nina uses Dan’s<br />
public key, thus ensuring that the data can only be decrypted by Dan. We do not<br />
show here other mechanisms, like how to make sure Nina has received Dan’s<br />
public key, or how Dan can be sure the message comes from Nina, and that it<br />
has not been altered while travelling through the communication channel.<br />
Figure 7-3 PKI - Nina sending a message to Dan<br />
Secure shell is based on PKI and uses several of these techniques to make sure<br />
data that gets from Nina to Dan can be trusted.<br />
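The exchange in Figure 7-3 can be sketched with the openssl command line. The file names are illustrative, and real deployments add padding choices, signatures, and key-distribution steps that are not shown here:

```shell
WORK=$(mktemp -d)
# Dan generates his key pair; the private key never leaves his host.
openssl genpkey -algorithm RSA -pkeyopt rsa_keygen_bits:2048 \
    -out "$WORK/dan_private.pem" 2>/dev/null
openssl pkey -in "$WORK/dan_private.pem" -pubout -out "$WORK/dan_public.pem"
# Nina encrypts with Dan's public key...
printf 'Message from Nina to Dan' > "$WORK/msg.txt"
openssl pkeyutl -encrypt -pubin -inkey "$WORK/dan_public.pem" \
    -in "$WORK/msg.txt" -out "$WORK/msg.enc"
# ...and only Dan's private key can decrypt it.
openssl pkeyutl -decrypt -inkey "$WORK/dan_private.pem" \
    -in "$WORK/msg.enc" -out "$WORK/msg.dec"
cmp "$WORK/msg.txt" "$WORK/msg.dec" && echo 'decrypted OK'
```

Because only the public key travels over the network, there is no shared secret to protect in transit.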
7.2.2 Secure shell basics<br />
Secure shell is a client/server tool that makes secure system administration<br />
operations possible across complex networks. Secure shell<br />
employs a number of cryptographic techniques and provides a series of facilities<br />
(functions, protocols, and so forth) for system administrators to use in their<br />
job.<br />
Besides basic remote login, it also provides functions such as remote command<br />
execution, remote file copy, and more sophisticated techniques, such as<br />
tunnelling (an encrypted communication channel) for various other applications.<br />
In this section we provide basic information about secure shell server and client.<br />
Secure shell server<br />
The secure shell server runs as a service (daemon) on the host system and<br />
provides the infrastructure for allowing incoming clients to connect and perform<br />
various administrative actions.<br />
Figure 7-4 shows the major dependencies of the secure shell server (sshd).<br />
Usually, this is started as a service at system boot time using /etc/rc.d/init.d/sshd<br />
script (depending on the init runtime mode).<br />
Figure 7-4 Secure shell server dependencies (the secure shell daemon (sshd), listening<br />
on TCP port 22, depends on: server keys (rsa1, rsa2, dsa); server configuration files;<br />
SSL libraries; server binaries and commands; user authentication configuration; and<br />
system authentication files and libraries)<br />
These dependencies are:<br />
► Secure socket layer (SSL) libraries, commands and headers.<br />
These provide the cryptographic functions and commands for various<br />
operations (key generation, encryption/decrypting).<br />
► Server configuration binaries and scripts<br />
This is the actual server code (executables, libraries, and scripts).<br />
► Server configuration files<br />
These are used by the server for creating its runtime environment and are<br />
usually located in /etc/ssh directory.<br />
► System authentication files and libraries<br />
Secure shell server uses these files to pass authentication requests to the<br />
system (user/password).<br />
► Server keys<br />
As a daemon (service), the ssh server runs under the root user ID and<br />
represents an entity with its own identity; thus, it has its own keys<br />
(actually, three pairs of keys: rsa1, rsa2, and dsa), also known as ssh<br />
host keys.<br />
► User authentication files are actually located in each user’s ~/.ssh directory,<br />
and they represent the identities (users) known to the server to be<br />
authenticated using their public keys (the ssh server “knows” a client’s identity<br />
based on the client’s public key).<br />
► TCP port 22<br />
This is the default port used by the ssh server to listen for incoming requests.<br />
It is configurable in /etc/ssh/sshd_config.<br />
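A minimal sketch of the corresponding entries in /etc/ssh/sshd_config follows. The values shown are common defaults for OpenSSH of this era, not a recommendation for your site:

```
# /etc/ssh/sshd_config (fragment)
Port 22
HostKey /etc/ssh/ssh_host_rsa_key
HostKey /etc/ssh/ssh_host_dsa_key
PubkeyAuthentication yes
PasswordAuthentication yes
```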
Secure shell client<br />
Secure shell client provides (among others) a set of tools used for executing<br />
administrative tasks on remote systems. On a Linux (and AIX) system these are:<br />
► /usr/bin/scp<br />
Secure remote copy program; replaces /usr/bin/rcp<br />
► /usr/bin/sftp<br />
Secure file transfer program; replaces /usr/bin/ftp; however, it has fewer<br />
features than classic ftp<br />
► /usr/bin/slogin<br />
Secure remote login program; replaces /usr/bin/rlogin<br />
► /usr/bin/ssh<br />
Secure shell program; replaces /usr/bin/rsh<br />
Chapter 7. Additional topics 401
Secure shell client programs also depend on SSL libraries to perform data<br />
encryption/decryption. The client software has a general configuration file,<br />
/etc/ssh/ssh_config, and a directory for each user (~/.ssh) for storing user pair(s)<br />
of keys and authentication files:<br />
► known_hosts
This file stores the public keys of the servers the user has connected to. Besides the keys, you can also specify certain options for connections to remote servers.
► authorized_keys<br />
Even though this file is stored in the user's ~/.ssh directory, it is actually used by the local ssh daemon, and contains the public keys of the remote identities (user@host) allowed to execute commands on the local operating system. It can also contain various options, such as command redirection, login parameters, and so forth.
Note: The ~/.ssh/authorized_keys (or authorized_keys2) is actually used<br />
by the local ssh server (sshd).<br />
► User’s ssh client configuration file (optional) if you want to specify additional<br />
configuration parameters (besides the ones in /etc/ssh/ssh_config)<br />
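For example, a user can keep per-host settings in this optional ~/.ssh/config file; the host name and option values in the sketch below are illustrative only:

```
# ~/.ssh/config (fragment; example values)
Host node2
    User root                    # log in as root by default on this host
    IdentityFile ~/.ssh/id_rsa   # private key to offer for this host
    StrictHostKeyChecking ask    # prompt before accepting unknown host keys
```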
7.2.3 Sample configuration in a cluster environment<br />
In a cluster environment, because the nodes (systems) have to work together, some type of remote command execution must be set up between nodes. Generally, such a program must allow known identities (users and services) to access remote services across the network using unprompted authentication (establishing the remote identity without an interactive prompt for user name or password).
Figure 7-5 presents the files involved in remote command execution using secure shell with un-prompted authentication.

Figure 7-5 The ssh files involved in un-prompted authentication. The figure shows node1 (the ssh client), where user root@node1 runs the /usr/bin/ssh command and holds the files ~/.ssh/id_rsa, ~/.ssh/id_rsa.pub, and ~/.ssh/known_hosts; and node2 (the ssh server), where sshd listens on TCP port 22 with the server key files /etc/ssh/ssh_host_rsa_key and /etc/ssh/ssh_host_rsa_key.pub, and where user root@node2 holds ~/.ssh/authorized_keys.
In the diagram, we assume that user root@node1 wants to execute a command<br />
(date) on node2, also as user root (root@node2).<br />
root@node1 # ssh root@node2 date
We assume that no previous configuration was done and that no initial trust has been established. Thus, the following happens:
► User root@node1 sends its identity to the server, specifying he wants to<br />
connect as root@node2.<br />
► The server (sshd) node2 sends its public key<br />
(/etc/ssh/ssh_host_rsa_key.pub) to the terminal that was used to initiate the<br />
connection, asking the user (root@node1) if he wants to accept the key (you<br />
have to explicitly type in “yes”).<br />
► When accepted, the key is stored in ~/.ssh/known_hosts file for root@node1<br />
user, and a session key is generated. The session key will be used to encrypt<br />
the information further transmitted during this session (until connection<br />
closes).<br />
Note: At this point in time, the ~/.ssh directory might not even exist. When<br />
you accept the server’s key, the directory is created along with the<br />
known_hosts file.<br />
► The server looks for root@node2 user's ~/.ssh/authorized_keys file.
► If this file exists, it checks for root@node1 public key (which has not yet been<br />
created on node1).<br />
► Because it cannot authenticate the user, it passes the control to the system<br />
authentication which prompts for root@node2 password.<br />
► When typed in correctly, the date command is executed and its result is<br />
returned to root@node1 user’s terminal.<br />
However, if configuration and initial trust have been performed, the date<br />
command is executed without prompting the user for a password.<br />
To achieve this you have to do the following:<br />
► On node1, as root, generate the ssh client keys, type rsa, no passphrase:<br />
root@node1 #/usr/bin/ssh-keygen -t rsa -f ~/.ssh/id_rsa -N ''<br />
► On node1, as root user, grab node2 ssh server’s key, and store it in local root<br />
user’s known_hosts file:<br />
root@node1 #/usr/bin/ssh-keyscan -t rsa node2 >> ~/.ssh/known_hosts<br />
► On node1, send root user’s public key previously generated to root@node2:<br />
root@node1 #/usr/bin/scp ~/.ssh/id_rsa.pub<br />
root@node2:~/.ssh/node1_rsa.pub<br />
► On node2, add the public key received to root user’s authorized_keys file:<br />
root@node2 #cat ~/.ssh/node1_rsa.pub >> ~/.ssh/authorized_keys<br />
Now, you can execute remote commands as root@node2, from root@node1,<br />
without a password.<br />
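The four steps above can be collected into a small helper script. This is only a sketch: the emit_trust_setup function is hypothetical (it is not part of any product), and it simply prints the commands for a given client/server pair so that they can be reviewed before being run on the client. Note that the last step runs the cat on the server through ssh, which, like the scp step, still prompts for a password until the trust is in place.

```shell
#!/bin/sh
# Hypothetical helper: print the one-way trust setup commands for CLIENT -> SERVER.
# Review the output, then run it on the client, for example:
#   emit_trust_setup node1 node2 | sh
emit_trust_setup() {
    client="$1"
    server="$2"
    cat <<EOF
/usr/bin/ssh-keygen -t rsa -f ~/.ssh/id_rsa -N ''
/usr/bin/ssh-keyscan -t rsa $server >> ~/.ssh/known_hosts
/usr/bin/scp ~/.ssh/id_rsa.pub root@$server:~/.ssh/${client}_rsa.pub
/usr/bin/ssh root@$server "cat ~/.ssh/${client}_rsa.pub >> ~/.ssh/authorized_keys"
EOF
}

emit_trust_setup node1 node2
```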
Attention: For simplicity, our example refers to a one-way configuration; that is, the root user using the ssh client on node1 executes a command (without interactive authentication) as root on node2.
If you want user root@node2 to be able to run a remote command (without interactive authentication) as root on node1, you need to configure symmetric files on both nodes.
Objective
Because many of today's clusters are connected through multiple networks, we would like our cluster to be as secure as possible and, at the same time, to allow access to remote resources in the cluster to be as seamless as possible. For this purpose, secure shell offers a reasonable solution with a good compromise in the security/administrative overhead ratio; thus, it is the de facto standard tool for remote administration tasks. However, some basic security knowledge and a good understanding of the secure shell client-server implementation are required to achieve good results.
Our objective is to set up a cluster for remote command execution for root user<br />
on all nodes without interactive authentication, using a single set of keys for all<br />
ssh servers (daemons) running on every node in the cluster.<br />
For our exercise we have used the sample cluster shown in Figure 7-6.<br />
p630n01 p630n02 p630n03
172.16.1.31 172.16.1.32 172.16.1.33
Figure 7-6 Sample cluster configuration
The three nodes run AIX 5.3 TL4, openssl 0.9.7-2g, and openssh 4.1. As it is<br />
outside the scope of this material, we do not explain how to install the software or<br />
configure the basic OS.<br />
Checking and setting up the configuration<br />
We started from scratch with ssh (no previous connection or authentication configuration). We performed the following steps:
► On node p630n01 we checked the server configuration for the location of the<br />
server keys, then generated three pairs of keys for the secure shell server<br />
(rsa1, rsa2, and dsa):<br />
root@p630n01_#/usr/bin/ssh-keygen -t rsa1 -f /etc/ssh/ssh_host_key<br />
-N ''<br />
root@p630n01_#/usr/bin/ssh-keygen -t rsa -f<br />
/etc/ssh/ssh_host_rsa_key -N ''<br />
root@p630n01_#/usr/bin/ssh-keygen -t dsa -f<br />
/etc/ssh/ssh_host_dsa_key -N ''<br />
► We restarted the ssh daemon (sshd) on p630n01 so that it picks up the new keys.
► We propagated all three pairs of keys in /etc/ssh directory to p630n02 and<br />
p630n03 (using scp).<br />
► We restarted the ssh daemons on p630n02 and p630n03.<br />
Note: If you work remotely from a single node, it is a good idea to restart the entire node if you cannot restart just the ssh daemons.
► To avoid any conflicts, we wiped out the contents of the root user’s<br />
~/.ssh/known_hosts file on p630n01:<br />
root@p630n01_# > ~/.ssh/known_hosts<br />
► We grabbed the ssh server public keys from all nodes into a new known_hosts file. For this, we used a file (my_nodes) containing the host names (IP labels) of nodes p630n01,2,3 (text file, one host name per line):
root@p630n01 # /usr/bin/ssh-keyscan -t rsa -f my_nodes > ~/.ssh/known_hosts
Note: The format of the known_hosts file uses one line per known host.
Using the ssh-keyscan command assumes that all nodes (and sshd on them) are up and running. However, if not all nodes are up, or if you have tens of nodes (or more) in your cluster and you plan to use a single pair of keys for all machines in your cluster, you can use wildcard characters to specify the host part of the key in the known_hosts file, using one line (and one key) to cover all hosts in a specified IP range. For details, check the ssh man pages.
► We generated the root user pair of keys. For this, we have chosen rsa2<br />
(which is, in fact, the rsa type) algorithm:<br />
root@p630n01_#/usr/bin/ssh-keygen -t rsa -f ~/.ssh/id_rsa -N ''<br />
► We copied the public key into a fresh authorized_keys file (if you already have<br />
such a file you need to check it for duplicates and correct entries):<br />
root@p630n01_# cat ~/.ssh/id_rsa.pub > ~/.ssh/authorized_keys<br />
► We propagated the entire contents of the ~/.ssh directory to all nodes in the cluster (p630n01,2,3).
► We tested the authentication.<br />
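The final test can be made systematic. The sketch below defines a hypothetical check_nodes helper (not part of any product) that tries a trivial remote command on every host listed in a file such as my_nodes; the BatchMode=yes option makes ssh fail instead of prompting, so any remaining password prompt shows up as FAILED.

```shell
#!/bin/sh
# Hypothetical helper: verify unprompted root authentication to each node
# named (one host per line) in the given file.
check_nodes() {
    nodes_file="$1"
    while read -r node; do
        # BatchMode=yes: fail instead of asking for a password
        if ssh -o BatchMode=yes "root@$node" true 2>/dev/null; then
            echo "$node: OK"
        else
            echo "$node: FAILED"
        fi
    done < "$nodes_file"
}

# Usage: check_nodes my_nodes
```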
7.2.4 Using ssh in a Blue Gene/L environment<br />
In a Blue Gene/L environment, secure shell is used to allow remote command execution between the SN and the I/O nodes (mainly, but not only, for GPFS). It can also be used for mpirun (between the FENs and the SN).
As a particularity, in Linux, when using remote command execution between the FENs and the SN, the rsh/rshd pair does not allow you to log in interactively to the SN, because rlogind (the remote login daemon) and telnetd are not usually running on the SN (the default configuration in SUSE SLES9). However, if you plan to use ssh as the remote command execution program between the FENs and the SN, you need a special configuration for the authorized_keys file of the users allowed to execute remote commands on the SN.
As previously mentioned, the format of the authorized_keys file allows you to<br />
customize the behavior of the remote command execution with ssh (see the ssh<br />
man pages).<br />
Tip: Using the no-pty option in front of the public key in the authorized_keys file allows you to execute remote commands without being prompted for a password. However, you will not get a pseudo-terminal (pty) if you try to open a shell (ssh without any command).
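For example, an entry in ~/.ssh/authorized_keys on the SN might look like the following sketch; the no-pty and from= options are standard authorized_keys options, but the address range, key material, and comment shown here are placeholders, not a real key:

```
no-pty,from="172.16.1.*" ssh-rsa AAAAB3Nza...<key material>... root@bglfen1
```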
For detailed information about Open secure shell, credits, and the latest news, see:
http://www.openssh.org<br />
Appendix A. Installing and setting up<br />
LoadLeveler for Blue Gene/L<br />
This appendix describes the steps for installing and setting up LoadLeveler for the Blue Gene/L system built during the writing of this redbook. It shows one way of setting up and configuring a LoadLeveler cluster. Special attention is directed to Blue Gene/L specific items and procedures such as:
► The additional 32-bit library RPM
► The specific option flag when running the installation script
► Blue Gene/L configuration keywords
► Blue Gene/L environment variables
At the end of the setup process, the message Blue Gene is present is a key indication of a successful installation and configuration of LoadLeveler.
© Copyright IBM Corp. 2006. All rights reserved. 409
Installing LoadLeveler on SN and FENs
This section explains how to install LoadLeveler on the SN and FENs.
Obtaining the rpms
The RPMs can come from the provided CD-ROM. The upgrade rpms can also be downloaded from the IBM service support Web site:
http://www14.software.ibm.com/webapp/set2/sas/f/loadleveler/download/unix.html
There are five required rpms for LoadLeveler. Example A-1 lists the rpms for the IBM System p platform running the SLES9 operating system.
Example: A-1 LoadLeveler RPMs<br />
LoadL-full-SLES9-PPC64-3.3.1.0-3.ppc64.rpm<br />
LoadL-full-license-SLES9-PPC64-3.3.1.0-3.ppc64.rpm<br />
LoadL-so-SLES9-PPC64-3.3.1.0-3.ppc64.rpm<br />
LoadL-so-license-SLES9-PPC64-3.3.1.0-3.ppc64.rpm<br />
LoadL-full-lib-SLES9-PPC-3.3.1.0-3.ppc.rpm<br />
Note: LoadL-full-lib-SLES9-PPC-3.3.1.0-3.ppc.rpm is required for Blue<br />
Gene/L only.<br />
On the download Web site, click View to see information about the upgrade<br />
(Figure A-1).<br />
Figure A-1 Downloading LoadLeveler rpms from the Web<br />
In addition to the five RPMs, the Java rpm has to be present in the same<br />
directory. Although the system has a later version of Java installed, the following<br />
rpm is required:<br />
IBMJava2-JRE-ppc64-1.4.2-0.0.ppc64.rpm<br />
Installing the rpms<br />
Note: Without this Java rpm, the install_ll script does not run. See IBM TWS<br />
LoadLeveler Installation <strong>Guide</strong> (GI10-0763-02) for further information.<br />
First, the following command installs the LoadLeveler license in the directory /opt/ibmll/LoadL/:
rpm -ivh LoadL-full-license-SLES9-PPC64-3.3.1.0-3.ppc64.rpm
Then, locate the install_ll script in /opt/ibmll/LoadL/sbin/ and invoke it as follows:
./install_ll -y -b -d <rpm_directory>
where <rpm_directory> is the directory containing the LoadLeveler rpms.
Note: The option flag -b is important to tell install_ll to install the<br />
LoadL-full-lib-SLES9-PPC-3.3.1.0-3.ppc.rpm. Without this flag specified,<br />
install_ll installs only the rpms for a regular SLES9 node.<br />
The same installation procedure has to be repeated on all nodes. If available, dsh can be used to perform the installation on remote nodes.
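If dsh is available, the repetition can be scripted. The sketch below is hypothetical: the helper name, the node names, and the rpm directory are assumptions, and it presumes the rpm directory is visible at the same path on every node (for example, NFS-mounted).

```shell
#!/bin/sh
# Hypothetical helper: run the license install and install_ll on a set of nodes.
install_ll_on_nodes() {
    nodes="$1"      # comma-separated node list for dsh -w
    rpm_dir="$2"    # directory holding the LoadLeveler rpms on every node
    dsh -w "$nodes" "cd $rpm_dir && rpm -ivh LoadL-full-license-SLES9-PPC64-3.3.1.0-3.ppc64.rpm"
    dsh -w "$nodes" "/opt/ibmll/LoadL/sbin/install_ll -y -b -d $rpm_dir"
}

# Usage: install_ll_on_nodes bglfen1,bglfen2 /bgl/loadl_rpms
```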
Setting up the LoadLeveler cluster<br />
This section explains how to set up the LoadLeveler cluster.<br />
LoadLeveler user and group IDs<br />
LoadLeveler requires one user ID to be the administrator. Also, it is<br />
recommended that a group ID is also created for the purposes of running<br />
LoadLeveler. If a group ID does not already exist, create a loadl group with the<br />
following command:<br />
groupadd -g 7000 loadl<br />
Note: The group ID 7000 is arbitrary. However, it has to be the same on all<br />
nodes in the cluster.<br />
If a user ID does not already exist, create a loadl user ID with the following<br />
command:<br />
useradd -d /home/loadl -g loadl -u 7001 -p loadl loadl<br />
Appendix A. Installing and setting up LoadLeveler for Blue Gene/L 411
Create the home directory for user loadl. This is a local file system:<br />
mkdir /home/loadl<br />
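Because the numeric IDs must match on every node, the same commands can be replayed cluster-wide. The following sketch is hypothetical (the helper name and node names are assumptions) and simply wraps the groupadd, useradd, and mkdir steps above in dsh calls:

```shell
#!/bin/sh
# Hypothetical helper: create the loadl group, user, and home directory,
# with identical IDs, on a list of nodes reachable through dsh.
create_loadl_ids() {
    nodes="$1"    # comma-separated node list for dsh -w
    dsh -w "$nodes" "groupadd -g 7000 loadl"
    dsh -w "$nodes" "useradd -d /home/loadl -g loadl -u 7001 -p loadl loadl"
    dsh -w "$nodes" "mkdir -p /home/loadl && chown loadl:loadl /home/loadl"
}

# Usage: create_loadl_ids bglsn,bglfen1,bglfen2
```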
LoadLeveler configuration<br />
Create a common (NFS mounted) directory to contain LoadLeveler configuration<br />
files:<br />
mkdir /bgl/loadlcfg<br />
Copy the two provided sample files, LoadL_admin and LoadL_config, from /opt/ibmll/LoadL/full/samples/ to /bgl/loadlcfg and make the appropriate changes.
Create the file /etc/LoadL.cfg with content similar to Example A-2.
Example: A-2 Content of /etc/LoadL.cfg<br />
LoadLUserid = loadl<br />
LoadLGroupid = loadl
LoadLConfig = /bgl/loadlcfg/LoadL_config<br />
Create the LoadLeveler directories, such as log, spool, and execute, under /home/loadl by issuing the following command as user loadl:
/opt/ibmll/LoadL/full/bin/llinit -local /home/loadl
Note: The llinit command also creates a link to the LoadLeveler bin directory from /home/loadl so that user loadl can invoke the LoadLeveler commands, such as llctl, llstatus, llq, and so forth, without requiring the directory to be added to the $PATH variable.
Again, the llinit command has to be run on all nodes.<br />
Enable rsh on all nodes. Then, start LoadLeveler with the following command:<br />
llctl -g start<br />
Enabling Blue Gene/L capabilities in LoadLeveler<br />
Up to this point, we have a regular LoadLeveler cluster on these Linux nodes. LoadLeveler does not know anything about Blue Gene/L yet.
To tell LoadLeveler about the Blue Gene/L system, add these Blue Gene/L specific keywords to the global configuration file /bgl/loadlcfg/LoadL_config:
BG_ENABLED = true
BG_ALLOW_LL_JOBS_ONLY = false
BG_MIN_PARTITION_SIZE = 32
BG_CACHE_PARTITIONS = true
Now, stop and start LoadLeveler, and it recognizes Blue Gene/L. However, it may say "Blue Gene is absent" because LoadLeveler cannot find the relevant Blue Gene/L libraries yet. See Figure 4-12 on page 172 for the message Blue Gene is present displayed by the llstatus command.
To create the appropriate symbolic links for the libraries that LoadLeveler needs,<br />
run the following script as root user:<br />
/home/loadl/bglinks<br />
Presumably, the system administrator has to create this file and run it once on every node. The contents of the script are described in this redbook. See 4.4.5, "Making the Blue Gene/L libraries available to LoadLeveler" on page 173.
Setting Blue Gene/L specific environment variables<br />
Set up the following environment variables in the .bashrc file for user loadl:
export BRIDGE_CONFIG_FILE=/bgl/BlueLight/ppcfloor/bglsys/bin/bridge.config
export DB_PROPERTY=/bgl/BlueLight/ppcfloor/bglsys/bin/db.properties
export MMCS_SERVER_IP=bglsn.itso.ibm.com
Note: The actual directory paths vary on different systems.
Now stop and restart LoadLeveler. It should say Blue Gene is present. See Figure 4-12 on page 172 and 4.4.9, "LoadLeveler checklist" on page 186 for detailed descriptions of LoadLeveler status checking.
Example A-3 presents a sample LoadL_admin file we used in our environment.<br />
Example: A-3 Sample LoadL_admin file<br />
# LoadL_admin file: Remove comments and edit this file to suit your installation.
# This file consists of machine, class, user, group and adapter stanzas.
# Each stanza has defaults, as specified in a "defaults:" section.
# Default stanzas are used to set specifications for fields which are
# not specified.
# Class, user, group, and adapter stanzas are optional. When no adapter
# stanzas are specified, LoadLeveler determines adapters dynamically. Refer to
# Using and Administering LoadLeveler for detailed information about
# keywords and associated values. Also see LoadL_admin.1 in the
# ~loadl/samples directory for sample stanzas.
#############################################################################
# DEFAULTS FOR MACHINE, CLASS, USER, AND GROUP STANZAS:
# Remove initial # (comment), and edit to suit.
#
default: type = machine
default: type = class              # default class stanza
         wall_clock_limit = 30:00  # default wall clock limit
default: type = user               # default user stanza
         default_class = No_Class  # default class = No_Class (not optional)
         default_group = No_Group  # default group = No_Group (not optional)
         default_interactive_class = inter_class
default: type = group              # default group stanza
#        priority = 0              # default GroupSysprio
#        maxjobs = -1              # default maximum jobs group is allowed
#                                  # to run simultaneously (no limit)
#        maxqueued = -1            # default maximum jobs group is allowed
#                                  # on system queue (no limit);
#                                  # does not limit jobs submitted
#############################################################################
# MACHINE STANZAS:
# These are the machine stanzas; the first machine is defined as
# the central manager. mach1:, mach2:, etc. are machine name labels -
# revise these placeholder labels with the names of the machines in the
# pool, and specify any schedd_host and submit_only keywords and values
# (true or false), if required.
#############################################################################
bglsn.itso.ibm.com: type = machine
                    schedd_host = true
                    central_manager = true
bglfen1.itso.ibm.com: type = machine
                      schedd_host = true
bglfen2.itso.ibm.com: type = machine
                      schedd_host = true
Example A-4 presents a sample LoadL_config file we used in our environment.<br />
Example: A-4 Sample LoadL_config file<br />
#<br />
# Machine Description<br />
#<br />
ARCH = PPC64<br />
#<br />
# Blue Gene Specific Settings<br />
#<br />
BG_ENABLED = true<br />
BG_ALLOW_LL_JOBS_ONLY = false<br />
#BG_ALLOW_LL_JOBS_ONLY = true<br />
BG_MIN_PARTITION_SIZE = 32<br />
BG_CACHE_PARTITIONS = true<br />
#<br />
# Specify LoadLeveler Administrators here:<br />
#<br />
LOADL_ADMIN = loadl root<br />
#<br />
# Default to starting LoadLeveler daemons when requested<br />
#<br />
START_DAEMONS = TRUE<br />
#<br />
# Machine authentication<br />
#<br />
# If TRUE, only connections from machines in the ADMIN_LIST are accepted.
# If FALSE, connections from any machine are accepted. Default if not
# specified is FALSE.
#
MACHINE_AUTHENTICATE = FALSE
#<br />
# Specify which daemons run on each node<br />
#<br />
SCHEDD_RUNS_HERE = True<br />
STARTD_RUNS_HERE = True<br />
# Specify pathnames
#
RELEASEDIR = /opt/ibmll/LoadL/full
LOCAL_CONFIG = $(tilde)/LoadL_config.local<br />
ADMIN_FILE = /bgl/loadlcfg/LoadL_admin<br />
LOG = $(tilde)/log<br />
SPOOL = $(tilde)/spool<br />
EXECUTE = $(tilde)/execute<br />
HISTORY = $(SPOOL)/history<br />
RESERVATION_HISTORY = $(SPOOL)/reservation_history<br />
BIN = $(RELEASEDIR)/bin<br />
LIB = $(RELEASEDIR)/lib<br />
#<br />
# Specify port numbers<br />
#<br />
MASTER_STREAM_PORT = 9616<br />
NEGOTIATOR_STREAM_PORT = 9614<br />
SCHEDD_STREAM_PORT = 9605<br />
STARTD_STREAM_PORT = 9611<br />
COLLECTOR_DGRAM_PORT = 9613<br />
STARTD_DGRAM_PORT = 9615<br />
MASTER_DGRAM_PORT = 9617<br />
#<br />
# Specify a scheduler type: LL_DEFAULT, API, BACKFILL, GANG<br />
# API specifies that internal LoadLeveler scheduling algorithms be<br />
# turned off and LL_DEFAULT specifies that the original internal<br />
# LoadLeveler scheduling algorithm be used.<br />
#<br />
SCHEDULER_TYPE = BACKFILL<br />
#<br />
# Specify accounting controls<br />
# To turn reservation data recording on, add the flag A_RES to ACCT<br />
#<br />
ACCT = A_OFF A_RES<br />
ACCT_VALIDATION = $(BIN)/llacctval<br />
GLOBAL_HISTORY = $(SPOOL)<br />
#<br />
# Specify checkpointing intervals<br />
#<br />
MIN_CKPT_INTERVAL = 900<br />
MAX_CKPT_INTERVAL = 7200<br />
# perform cleanup of checkpoint files once a day
# 24 hrs x 60 min/hr x 60 sec/min = 86400 sec/day
CKPT_CLEANUP_INTERVAL = 86400
# sample source for the ckpt file cleanup program is shipped with LoadLeveler
# and is found in: /usr/lpp/LoadL/full/samples/llckpt/rmckptfiles.c
#<br />
# compile the source and indicate the location of the executable<br />
# as shown in the following example<br />
CKPT_CLEANUP_PROGRAM = /u/mylladmin/bin/rmckptfiles<br />
# LoadL_KeyboardD Macros<br />
#<br />
KBDD = $(BIN)/LoadL_kbdd<br />
KBDD_LOG = $(LOG)/KbdLog<br />
MAX_KBDD_LOG = 64000<br />
KBDD_DEBUG =<br />
#<br />
# Specify whether to start the keyboard daemon<br />
#<br />
#if HAS_X<br />
X_RUNS_HERE = True<br />
#else<br />
X_RUNS_HERE = False<br />
#endif<br />
#<br />
# LoadL_StartD Macros<br />
#<br />
STARTD = $(BIN)/LoadL_startd<br />
STARTD_LOG = $(LOG)/StartLog<br />
MAX_STARTD_LOG = 64000<br />
STARTD_DEBUG =<br />
POLLING_FREQUENCY = 5<br />
POLLS_PER_UPDATE = 24<br />
JOB_LIMIT_POLICY = 120<br />
JOB_ACCT_Q_POLICY = 300<br />
PROCESS_TRACKING = FALSE<br />
PROCESS_TRACKING_EXTENSION = $(BIN)<br />
#ifdef KbdDeviceName<br />
KBD_DEVICE = KbdDeviceName<br />
#endif<br />
#ifdef MouseDeviceName
MOUSE_DEVICE = MouseDeviceName
#endif
#<br />
# LoadL_SchedD Macros<br />
#<br />
SCHEDD = $(BIN)/LoadL_schedd<br />
SCHEDD_LOG = $(LOG)/SchedLog<br />
MAX_SCHEDD_LOG = 64000<br />
SCHEDD_DEBUG =<br />
SCHEDD_INTERVAL = 120<br />
CLIENT_TIMEOUT = 30<br />
#<br />
# Negotiator Macros<br />
#<br />
NEGOTIATOR = $(BIN)/LoadL_negotiator<br />
NEGOTIATOR_DEBUG = D_NEGOTIATE D_FULLDEBUG<br />
NEGOTIATOR_LOG = $(LOG)/NegotiatorLog<br />
MAX_NEGOTIATOR_LOG = 64000<br />
NEGOTIATOR_INTERVAL = 60<br />
MACHINE_UPDATE_INTERVAL = 300<br />
NEGOTIATOR_PARALLEL_DEFER = 300<br />
NEGOTIATOR_PARALLEL_HOLD = 300<br />
NEGOTIATOR_REDRIVE_PENDING = 90<br />
NEGOTIATOR_RESCAN_QUEUE = 90<br />
NEGOTIATOR_REMOVE_COMPLETED = 0<br />
NEGOTIATOR_CYCLE_DELAY = 0<br />
NEGOTIATOR_CYCLE_TIME_LIMIT = 0<br />
#<br />
# Sets the interval between recalculation of the SYSPRIO values<br />
# for all the jobs in the queue<br />
#<br />
NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL = 0<br />
#<br />
# GSmonitor Macros<br />
#<br />
GSMONITOR = $(BIN)/LoadL_GSmonitor<br />
GSMONITOR_DEBUG =<br />
GSMONITOR_LOG = $(LOG)/GSmonitorLog<br />
MAX_GSMONITOR_LOG = 64000<br />
#<br />
# Starter Macros
#
STARTER = $(BIN)/LoadL_starter
STARTER_DEBUG =<br />
STARTER_LOG = $(LOG)/StarterLog<br />
MAX_STARTER_LOG = 64000<br />
#<br />
# LoadL_Master Macros<br />
#<br />
MASTER = $(BIN)/LoadL_master<br />
MASTER_LOG = $(LOG)/MasterLog<br />
MASTER_DEBUG =<br />
MAX_MASTER_LOG = 64000<br />
RESTARTS_PER_HOUR = 12<br />
PUBLISH_OBITUARIES = TRUE<br />
OBITUARY_LOG_LENGTH = 25<br />
#<br />
# Specify whether log files are truncated when opened<br />
#<br />
TRUNC_MASTER_LOG_ON_OPEN = False<br />
TRUNC_STARTD_LOG_ON_OPEN = False<br />
TRUNC_SCHEDD_LOG_ON_OPEN = False<br />
TRUNC_KBDD_LOG_ON_OPEN = False<br />
TRUNC_STARTER_LOG_ON_OPEN = False<br />
TRUNC_NEGOTIATOR_LOG_ON_OPEN = False<br />
TRUNC_GSMONITOR_LOG_ON_OPEN = False<br />
#<br />
# Machine control expressions and macros<br />
#<br />
OpSys : "$(OPSYS)"<br />
Arch : "$(ARCH)"<br />
Machine : "$(HOST).$(DOMAIN)"<br />
#
# Expressions used to control starting and stopping of foreign jobs
#
MINUTE = 60<br />
HOUR = (60 * $(MINUTE))<br />
StateTimer = (CurrentTime - EnteredCurrentState)<br />
BackgroundLoad = 0.7<br />
HighLoad = 1.5<br />
StartIdleTime = 15 * $(MINUTE)<br />
ContinueIdleTime = 5 * $(MINUTE)
MaxSuspendTime = 10 * $(MINUTE)
MaxVacateTime = 10 * $(MINUTE)
KeyboardBusy = KeyboardIdle < $(POLLING_FREQUENCY)
CPU_Idle = LoadAvg <= $(BackgroundLoad)
CPU_Busy = LoadAvg >= $(HighLoad)
#<br />
# See Using and Administering LoadLeveler for an explanation of these<br />
# control expressions<br />
#<br />
# START : $(CPU_Idle) && KeyboardIdle > $(StartIdleTime)<br />
# SUSPEND : $(CPU_Busy) || $(KeyboardBusy)<br />
# CONTINUE : $(CPU_Idle) && KeyboardIdle > $(ContinueIdleTime)<br />
# VACATE : $(StateTimer) > $(MaxSuspendTime)<br />
# KILL : $(StateTimer) > $(MaxVacateTime)<br />
START : T<br />
SUSPEND : F<br />
CONTINUE : T<br />
VACATE : F<br />
KILL : F<br />
#
# The following (default) expression for SYSPRIO creates a FIFO job queue.
#
SYSPRIO: 0 - (QDate)
#MACHPRIO: 0 - (1000 * (LoadAvg / (Cpus * Speed)))<br />
#<br />
# The following (default) expression for MACHPRIO orders<br />
# machines by load average.<br />
#<br />
MACHPRIO: 0 - (LoadAvg)<br />
#
# The MAX_JOB_REJECT value determines how many times a job can be
# rejected before it is canceled or put on hold. The default value
# is 0, which indicates a rejected job will immediately be canceled
# or placed on hold. MAX_JOB_REJECT may be set to unlimited rejects
# by specifying a value of -1.
#
MAX_JOB_REJECT = 0
#
# When ACTION_ON_MAX_REJECT is HOLD, jobs will be put on user hold
# when the number of rejects reaches the MAX_JOB_REJECT value. When
# ACTION_ON_MAX_REJECT is CANCEL, jobs will be canceled when the
# number of rejects reaches the MAX_JOB_REJECT value. The default
# value is HOLD.
#
ACTION_ON_MAX_REJECT = HOLD
# Filesystem Monitor Interval and Thresholds
# Monitoring interval is in minutes and should be set according to how
# fast the filesystem grows
FS_INTERVAL = 30
# File system space thresholds are specified in bytes. Scaling factors
# such as K, M and G are allowed.
FS_NOTIFY = 750KB,1MB
FS_SUSPEND = 500KB,750KB
FS_TERMINATE = 100MB,100MB
# File system inode thresholds are specified in number of inodes. Scaling
# factors such as K, M and G are allowed.
INODE_NOTIFY = 1K,1.1K
INODE_SUSPEND = 500,750
INODE_TERMINATE = 50,50
Appendix B. The sitefs file<br />
This appendix includes the sitefs file that we used. We created it from the<br />
example sitefs file that is shown in the file:<br />
/bgl/BlueLight/ppcfloor/docs/ionode.README<br />
We added the lines that we needed and saved the file in the following directory<br />
on the Service Node so that it survived any upgrades to the Blue Gene/L driver:<br />
/bgl/dist/etc/rc.d/init.d<br />
The /bgl/dist/etc/rc.d/init.d/sitefs file<br />
The following is the listing of the sitefs file that we used.
#!/bin/sh
# -------------------------------------------------------------
# Product(s):
# 5733-BG1
#
# (C)Copyright IBM Corp. 2004, 2005
# All rights reserved.
# US Government Users Restricted Rights -
# Use, duplication or disclosure restricted
# by GSA ADP Schedule Contract with IBM Corp.
#
# Licensed Materials-Property of IBM
# -------------------------------------------------------------
#<br />
# -------------------------------------------------------------<br />
# NOTE: The PATH environment variable is set to the following<br />
# upon entry to this script:<br />
# /bin.rd:/sbin.rd:/usr/bin:/bin:/usr/sbin:/sbin<br />
# The /bin.rd and /sbin.rd directories contain many of<br />
# the busybox commands and Blue Gene specific programs.<br />
# The /bin, /sbin, and /usr directories are symbolic<br />
# links to the NFS-mounted MCP.<br />
#--------------------------------------------------------------<br />
#<br />
#<br />
# /etc/init.d/syslog<br />
# Default-Stop:<br />
# Description: Start the system logging daemons<br />
### END INIT INFO<br />
# Source config file, if it exists.<br />
test -f /etc/sysconfig/syslog || exit 6<br />
. /etc/sysconfig/syslog<br />
BINDIR=/sbin.rd<br />
case "$SYSLOG_DAEMON" in<br />
syslog-ng)<br />
syslog=syslog-ng<br />
config=/etc/syslog-ng/syslog-ng.conf<br />
params="$SYSLOG_NG_PARAMS"<br />
;;<br />
*)<br />
syslog=syslogd<br />
config=/etc/syslog.conf<br />
params="$SYSLOGD_PARAMS"<br />
# Add additional sockets to SYSLOGD_PARAMS<br />
# Extract the names of the variables beginning with SYSLOGD_ADDITIONAL_SOCKET<br />
SYSLOGD_ADDITIONAL_SOCKET_LIST=`grep SYSLOGD_ADDITIONAL_SOCKET /etc/sysconfig/syslog | sed s/=.*//`<br />
for variable in ${SYSLOGD_ADDITIONAL_SOCKET_LIST}; do<br />
eval value=\$$variable<br />
test -n "${value}" && test -d ${value%/*} && \<br />
params="$params -a $value"<br />
done<br />
;;<br />
esac<br />
syslog_pid="/var/run/${syslog}.pid"<br />
# check config and programs<br />
test -x ${BINDIR}/$syslog || exit 5<br />
test -x ${BINDIR}/klogd || exit 5<br />
# If there is no config file in the ramdisk, create a simple one<br />
# that logs important messages to /dev/console<br />
if [ ! -e ${config} ]; then<br />
# Note: "*.warn" produces numerous boot messages that may affect boot performance.<br />
# Therefore, warnings (and higher log levels) are not logged, by default.<br />
echo "authpriv.none;*.emerg;*.alert;*.crit;*.err /dev/console" >> ${config}<br />
fi<br />
#<br />
# Do not translate symbol addresses for 2.6 kernel<br />
#<br />
case `uname -r` in<br />
0.*|1.*|2.[0-4].*)<br />
#!/bin/sh<br />
# -------------------------------------------------------------<br />
# Product(s):<br />
# 5733-BG1<br />
#<br />
# (C)Copyright IBM Corp. 2004, 2005<br />
# All rights reserved.<br />
# US Government Users Restricted Rights -<br />
# Use, duplication or disclosure restricted<br />
# by GSA ADP Schedule Contract with IBM Corp.<br />
#<br />
# Licensed Materials-Property of IBM<br />
# -------------------------------------------------------------<br />
chmod +x /var/mmfs/etc/mmfsup.scr<br />
# Start GPFS and wait for it to come up<br />
rm -f $upfile<br />
/usr/lpp/mmfs/bin/mmautoload<br />
retries=300<br />
until test -e $upfile<br />
do sleep 2<br />
let retries=$retries-1<br />
if [ $retries -eq 0 ]<br />
then ras_advisory "$0: GPFS did not come up on I/O node $HOSTID"<br />
exit 1<br />
fi<br />
done<br />
#!/bin/sh<br />
#<br />
# Sample sitefs script.<br />
#<br />
# It mounts a filesystem on /scratch, mounts /home for user files<br />
# (applications), creates a symlink for /tmp to point into some directory<br />
# in /scratch using the IP address of the I/O node as part of the directory<br />
# name to make it unique to this I/O node, and sets up environment<br />
# variables for ciod.<br />
#<br />
. /proc/personality.sh<br />
. /etc/rc.status<br />
#-------------------------------------------------------------------<br />
# Function: mountSiteFs()<br />
#<br />
# Mount a site file system<br />
# Attempt the mount up to 5 times.<br />
# If all attempts fail, send a fatal RAS event so the block fails<br />
# to boot.<br />
#<br />
# Parameter 1: File server IP address<br />
# Parameter 2: Exported directory name<br />
# Parameter 3: Directory to be mounted over<br />
# Parameter 4: Mount options<br />
#-------------------------------------------------------------------<br />
mountSiteFs()<br />
{<br />
# Make the directory to be mounted over<br />
[ -d $3 ] || mkdir $3<br />
# Set to attempt the mount 5 times<br />
ATTEMPT_LIMIT=5<br />
ATTEMPT=1<br />
# Attempt the mount up to 5 times.<br />
# Echo a message to the I/O node log for each failed attempt.<br />
until test $ATTEMPT -gt $ATTEMPT_LIMIT || mount $1:$2 $3 -o $4; do<br />
echo "Attempt number $ATTEMPT to mount $1:$2 on $3 failed"<br />
sleep $ATTEMPT<br />
let ATTEMPT=$ATTEMPT+1<br />
done<br />
# If all attempts failed, send a fatal RAS event so the block fails to<br />
# boot. If the mount worked, echo a message to the I/O node log.<br />
if test $ATTEMPT -gt $ATTEMPT_LIMIT; then<br />
echo "Mounting $1:$2 on $3 failed" > /proc/rasevent/kernel_fatal<br />
exit<br />
else<br />
echo "Attempt number $ATTEMPT to mount $1:$2 on $3 worked"<br />
fi<br />
}<br />
#-------------------------------------------------------------------<br />
# Script: sitefs()<br />
#<br />
# Perform site-specific functions during startup and shutdown<br />
#<br />
# Parameter 1: "start" - perform startup functions<br />
# "stop" - perform shutdown functions<br />
#-------------------------------------------------------------------<br />
# Set to ip address of site fileserver<br />
SITEFS=172.30.1.33<br />
# First reset status of this service<br />
rc_reset<br />
# Handle startup (start) and shutdown (stop)<br />
case "$1" in<br />
start)<br />
echo Mounting site filesystems<br />
# Mount a scratch file system...<br />
mountSiteFs $SITEFS /bglscratch /bglscratch tcp,rsize=32768,wsize=32768,async<br />
# Mount a home file system...<br />
# mountSiteFs $SITEFS /home /home tcp,rsize=8192,wsize=8192<br />
# Arrange something for tmp...<br />
# - make a unique directory in the scratch file system.<br />
# - rename the original /tmp to /tmp.original.<br />
# - point /tmp to the unique directory.<br />
# tmpdir=/scratch/ionodetmp/$BGL_IP<br />
# [ -d $tmpdir ] || mkdir -p $tmpdir<br />
# mv /tmp /tmp.original<br />
# ln -s $tmpdir /tmp<br />
# Setup environment variables for ciod<br />
# echo "export CIOD_RDWR_BUFFER_SIZE=262144" >> /etc/sysconfig/ciod<br />
# echo "export DEBUG_SOCKET_STARTUP=ALL" >> /etc/sysconfig/ciod<br />
# Uncomment the following line to not start the NTP daemon<br />
# rm /etc/sysconfig/xntp<br />
# Uncomment the following line to not start the syslogd and klogd daemons<br />
# rm /etc/sysconfig/syslog<br />
# Uncomment the first line to start GPFS.<br />
# Optionally uncomment the other lines to change the defaults for GPFS.<br />
echo "GPFS_STARTUP=1" >> /etc/sysconfig/gpfs<br />
# echo "GPFS_VAR_DIR=/bgl/gpfsvar" >> /etc/sysconfig/gpfs<br />
# echo "GPFS_CONFIG_SERVER=\"$BGL_SNIP\"" >> /etc/sysconfig/gpfs<br />
rc_status -v<br />
;;<br />
stop)<br />
echo Unmounting site filesystems<br />
# Put /tmp back<br />
# rm /tmp<br />
# mv /tmp.original /tmp<br />
# Unmount the scratch and home file systems<br />
umount -f /bglscratch<br />
# umount -f /home<br />
;;<br />
restart)<br />
;;<br />
status)<br />
;;<br />
esac<br />
rc_exit<br />
#------End of script---------<br />
Appendix C. The ionode.README file<br />
C<br />
This file might well change between releases. You can find the current version of<br />
the ionode.README file that is compatible with the version of the code currently<br />
in use on your Blue Gene/L system at:<br />
/bgl/BlueLight/ppcfloor/docs/ionode.README<br />
This appendix includes the ionode.README file. This file contains useful<br />
information about the ionode bootup sequence. It is installed when the<br />
bglmcp*.rpm is installed. The version shown here is compatible with release<br />
V1R2M1.<br />
The /bgl/BlueLight/ppcfloor/docs/ionode.README file<br />
Copyright Notice<br />
All Rights Reserved Legend<br />
US Government Users Restricted Rights Notice<br />
I/O Node Startup and Shutdown Scripts<br />
=====================================<br />
This file contains a summary of the startup and shutdown scripts executed<br />
on an I/O node.<br />
"Dist" directories<br />
------------------<br />
There are three "distribution" directories involved during startup and<br />
shutdown.<br />
$BGL_DISTDIR<br />
The "dist" subdir located at the top of the Blue Gene driver install tree.<br />
This is referred to as the "system dist" or just "dist". This is normally<br />
located at "/bgl/BlueLight//ppc/dist".<br />
$BGL_SITEDISTDIR<br />
The "/bgl/dist" subdir. This is referred to as the "site dist", as it is<br />
intended for site-specific customization of the I/O node startup and<br />
shutdown.<br />
$BGL_OSDIR<br />
The Mini-Control Program (MCP) subdir. This contains many of the<br />
executables needed by programs that run in the I/O node. The I/O node<br />
/bin, /sbin, /lib, and /usr directories are links to directories within<br />
this subdir. This is normally located at "/bgl/OS/x.y", where "x.y" is the<br />
version and modification level of the MCP.<br />
All of these are treated as the root of a filesystem even though they are<br />
NOT explicitly exported and mounted.<br />
Contents of the ramdisk<br />
-----------------------<br />
The ramdisk contains a subset of files located in the system dist<br />
directories. These files remain in the system dist tree for the<br />
convenience of an administrator to read.<br />
/bgl NFS mount point of /bgl for bootstrap<br />
/bin/* shell commands implemented by busybox<br />
/sbin/* shell commands implemented by busybox<br />
/dev/* special device files<br />
/lib empty on startup (busybox is static linked)<br />
/proc mounted proc filesystem<br />
/proc/personality binary personality config (read by ciod)<br />
/proc/personality.sh shell personality config<br />
/tmp temp in ramdisk (very small)<br />
/etc/fstab minimal fstab<br />
/etc/group minimal group (note: ciod does not need group names defined)<br />
/etc/passwd minimal passwd (note: ciod does not need names defined)<br />
/etc/inittab inittab for starting startup and shutdown scripts and console shell<br />
/etc/protocols minimal protocols<br />
/etc/rpc minimal rpc<br />
/etc/services minimal services<br />
/etc/sysconfig/xntp minimal NTP options file<br />
/etc/ntp.conf minimal NTP configuration file<br />
/etc/sysconfig/syslog minimal syslog options file<br />
/etc/syslog.conf minimal syslog configuration file<br />
/etc/rc.dist defines $BGL_DISTDIR, $BGL_SITEDISTDIR, and $BGL_OSDIR<br />
/etc/rc.shutdown first stage shutdown script<br />
/etc/rc.d/rc.sysinit first stage sysinit script<br />
/etc/rc.d/rc.ras verbose, ras_advisory and ras_fatal functions<br />
/etc/rc.d/rc.network bring up network and default route<br />
/etc/rc.d/rc3.d dir of start (S*) and shutdown (K*) scripts<br />
The last few "rc" scripts are of interest in this document.<br />
Startup flow between rc scripts<br />
-------------------------------<br />
The I/O node startup begins with /sbin/init which reads /etc/inittab. The<br />
sysinit rule in inittab is coded to run /etc/rc.d/rc.sysinit. From here<br />
the flow is as follows. Note that BGL_SITEDISTDIR is normally /bgl/dist.<br />
/etc/rc.d/rc.sysinit<br />
mounts /proc<br />
includes /proc/personality.sh (e.g. $BGL_IP, $BGL_FS, etc)<br />
includes /etc/rc.d/rc.ras<br />
run /etc/rc.d/rc.network to bring up network and default route<br />
test for ethernet link status<br />
mount /bgl<br />
includes /etc/rc.dist to define $BGL_DISTDIR, $BGL_SITEDISTDIR, and $BGL_OSDIR<br />
run second stage startup from $BGL_DISTDIR/etc/rc.d/rc.sysinit2<br />
$BGL_DISTDIR/etc/rc.d/rc.sysinit2<br />
NOTE: this second stage startup exists only in NFS<br />
replace empty /lib with symlink to $BGL_OSDIR/lib<br />
replace empty /usr with symlink to $BGL_OSDIR/usr<br />
replace empty /etc/rc.d/rc3.d with symlink under $BGL_DISTDIR<br />
load the tree device driver<br />
run $BGL_DISTDIR/etc/rc.d/rc3.d/S* start scripts and<br />
$BGL_SITEDISTDIR/etc/rc.d/rc3.d/S* start scripts<br />
run $BGL_SITEDISTDIR/etc/rc.local<br />
Note that the start scripts run by rc.sysinit2 are selected from both the<br />
installation dist directory as well as the site dist directory and run in<br />
numeric order. Therefore the site dist directory can contain scripts that<br />
start before Blue Gene system software such as ciod. If a start script in<br />
the site dist directory has the same name as a start script in the<br />
installation dist directory, only the script in the installation dist<br />
directory is run. Start scripts having the same number are run<br />
alphabetically.<br />
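The selection and ordering rules above can be sketched in a few lines of shell. This is an illustration under our own assumptions, not the actual rc.sysinit2 code; the temporary directories stand in for the two dist trees, and the script names (S05nfs, S10sitefs, S50ciod) are taken from this README as examples:<br />

```shell
#!/bin/sh
# Illustrative sketch only: merge start scripts from the installation dist
# and the site dist, run them in numeric (sorted) order, and let the
# installation dist copy win when both define the same name.
distdir=$(mktemp -d)    # stands in for $BGL_DISTDIR/etc/rc.d/rc3.d
sitedir=$(mktemp -d)    # stands in for $BGL_SITEDISTDIR/etc/rc.d/rc3.d
touch "$distdir/S05nfs" "$distdir/S50ciod" "$sitedir/S10sitefs" "$sitedir/S50ciod"

order=""
# Union of script names from both directories, sorted once.
for name in $( (ls "$distdir"; ls "$sitedir") | sort -u ); do
    # On a name clash, only the installation dist copy is selected.
    if [ -e "$distdir/$name" ]; then src=$distdir; else src=$sitedir; fi
    if [ "$name" = "S50ciod" ]; then ciod_src=$src; fi
    echo "running $src/$name start"
    order="$order $name"
done
order=${order# }
echo "$order"
```

Because the ordering is plain lexical sorting of the S-number prefix, a site script such as S10sitefs runs before S50ciod, exactly as the mount requirements later in this README need.<br />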
Shutdown flow between rc scripts<br />
--------------------------------<br />
When a block is freed, the shutdown rule in /etc/inittab is coded to run<br />
/etc/rc.shutdown. From here, the flow is as follows:<br />
/etc/rc.shutdown<br />
run $BGL_SITEDISTDIR/etc/rc.local.shutdown<br />
run $BGL_DISTDIR/etc/rc.d/rc3.d/K* shutdown scripts and<br />
$BGL_SITEDISTDIR/etc/rc.d/rc3.d/K* shutdown scripts<br />
in numeric order<br />
unmount /bgl and any remaining mounted file systems<br />
Shell variables during startup and shutdown<br />
-------------------------------------------<br />
The special file /proc/personality.sh is a shell script that sets shell<br />
variables to node specific values. These are used by the rc.* scripts<br />
and may also be useful within site-written start and shutdown scripts.<br />
These are not exported variables so each script will need to source this<br />
file (or export the variables to other scripts). See the example script<br />
below.<br />
The file /etc/rc.dist contains three additional variable definitions not<br />
found in /proc/personality.sh and may also be useful. These are marked<br />
with an (*) below.<br />
BGL_MAC Ethernet mac address of this I/O node<br />
BGL_IP IP address of this I/O node<br />
BGL_NETMASK Netmask to be used for this I/O node<br />
BGL_BROADCAST Broadcast address for this I/O node<br />
BGL_GATEWAY Gateway address for the default route for this I/O node<br />
BGL_MTU MTU for this I/O node<br />
BGL_FS Fileserver IP address for /bgl<br />
BGL_EXPORTDIR Fileserver export directory for /bgl<br />
BGL_LOCATION Location string for this I/O node<br />
BGL_PSETNUM PSet number for this I/O node (0..$BGL_NUMPSETS-1)<br />
BGL_NUMPSETS Total number of PSets in this block (i.e. # of I/O nodes)<br />
BGL_NODESINPSET Number of compute nodes served by this I/O node<br />
BGL_{X,Y,Z}SIZE Size of block in nodes<br />
BGL_VIRTUALNM 1 if the block is running in virtual node mode<br />
BGL_BLOCKID Name of block<br />
BGL_VERBOSE Defined if the block is created with the "io_node_verbose" option<br />
BGL_SNIP Functional network IP address of the service node<br />
BGL_MEMSIZE Bytes of RAM for this I/O node<br />
BGL_VERSION Blue Gene personality version<br />
BGL_DISTDIR(*) Path to top of dist directory in NFS<br />
BGL_SITEDISTDIR(*) Path to top of site dist directory in NFS<br />
(/bgl/dist)<br />
BGL_OSDIR(*) Path to I/O node MCP subdir (/bgl/OS/x.y)<br />
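As a small illustration of sourcing these variables from a site-written script: because they are not exported, each script must source the personality file itself. The personality file and the values below are mocked up for this example, not read from a real I/O node:<br />

```shell
#!/bin/sh
# Mock personality file; on a real I/O node this is /proc/personality.sh.
pfile=$(mktemp)
cat > "$pfile" <<'EOF'
BGL_IP=172.30.2.5
BGL_PSETNUM=3
BGL_NUMPSETS=8
EOF
# Source the file to pick up the node-specific values.
. "$pfile"
# Build a per-I/O-node path the same way the sample sitefs script does,
# using the node's IP address to keep the directory unique.
tmpdir=/scratch/ionodetmp/$BGL_IP
echo "pset $BGL_PSETNUM of $BGL_NUMPSETS uses $tmpdir"
```
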
Environment Variables used by ciod<br />
----------------------------------<br />
The following environment variables are used by ciod:<br />
CIOD_RDWR_BUFFER_SIZE=value<br />
This value specifies the size, in bytes, of each buffer used by ciod to<br />
issue read and write system calls. One buffer is allocated for each<br />
compute node CPU associated with this I/O node. The default size, if not<br />
specified, is 87600 bytes. A larger size may improve performance. If you<br />
are using a GPFS file server, it is recommended that this buffer size match<br />
the GPFS block size. Experiment with different sizes, such as 262144<br />
(256K) or 524288 (512K), to get the best performance.<br />
DEBUG_SOCKET_STARTUP={ ALL, }<br />
When specified, this variable causes ciod to start up sockets for debugging<br />
all blocks, or the specified block.<br />
These variables can be specified in the ciod start script,<br />
$BGL_DISTDIR/etc/rc.d/rc3.d/S50ciod. It is recommended that you create<br />
a /etc/sysconfig/ciod file in the ramdisk that specifies these variables.<br />
You can create this file in your $BGL_SITEDISTDIR/etc/rc.d/rc3.d/Sxxxxxx<br />
script that runs before ciod is started. Refer to "Typical Site<br />
Customization" in this README for an example.<br />
Support for Network Time Protocol (NTP)<br />
---------------------------------------<br />
During I/O node startup, the $BGL_DISTDIR/etc/rc.d/rc3.d/S15xntpd script<br />
runs to set the time and date, and to start the NTP daemon (NTPD). There<br />
are three things you can do to configure this support:<br />
1. Set up your timezone file. The I/O node looks for the timezone file<br />
/etc/localtime. This is a symbolic link to<br />
$BGL_SITEDISTDIR/etc/localtime. Thus, to set up your timezone file,<br />
copy your service node's /etc/localtime file to<br />
$BGL_SITEDISTDIR/etc/localtime. If the timezone file is not found, the<br />
time is assumed to be UTC.<br />
2. Set up options used by the S15xntpd script. These options are located<br />
in file /etc/sysconfig/xntp. The default file contains two options:<br />
XNTPD_INITIAL_NTPDATE="AUTO-2"<br />
Specifies which NTP servers will be queried for the initial time and<br />
date.<br />
"AUTO" Query all of the NTP servers listed in the NTPD configuration file<br />
"AUTO-n" Query the first "n" NTP servers listed in the NTPD configuration file<br />
"" Don't perform the initial query at all<br />
"address1 address2 ..." Query the specified NTP servers<br />
The NTPD configuration file is /etc/ntp.conf.<br />
The default is "AUTO-2".<br />
XNTPD_OPTIONS=""<br />
Parameters for the /sbin/ntpd NTP daemon. The following parameters<br />
are supported:<br />
/sbin/ntpd [ -abdgmnqx ] [ -c config_file ] [ -e e_delay ]<br />
[ -f freq_file ] [ -k key_file ] [ -l log_file ]<br />
[ -p pid_file ] [ -r broad_delay ] [ -s statdir ]<br />
[ -t trust_key ] [ -v sys_var ] [ -V default_sysvar ]<br />
[ -P fixed_process_priority ]<br />
The default is no parameters.<br />
Normally, you should not have to change these options. If you wish to<br />
change them, you need to do this in your site customization script<br />
(S10sitefs - refer to "Typical Site Customization" in this README) by<br />
adding lines similar to the following:<br />
rm /etc/sysconfig/xntp<br />
echo "XNTPD_INITIAL_NTPDATE=\"AUTO-2\"" >> /etc/sysconfig/xntp<br />
echo "XNTPD_OPTIONS=\"-c /etc/ntp.conf\"" >> /etc/sysconfig/xntp<br />
3. Set up configuration information for the NTP daemon. This is stored in<br />
the /etc/ntp.conf file. The default file is created in the ramdisk by<br />
the S15xntpd script and contains the following:<br />
restrict default nomodify<br />
server $BGL_SNIP<br />
The default NTP server is the service node. Normally, you should not<br />
need to change these options. If you wish to supply your own file, you<br />
need to create it in your site customization script (S10sitefs -<br />
refer to "Typical Site Customization" in this README) by adding lines<br />
similar to the following:<br />
echo "restrict default nomodify" >> /etc/ntp.conf<br />
echo "server $BGL_SNIP" >> /etc/ntp.conf<br />
For more information on NTP, refer to http://www.ntp.org/<br />
If you do not want NTP started on the I/O nodes, you can remove the<br />
/etc/sysconfig/xntp file from the ramdisk in your site customization script<br />
(S10sitefs - refer to "Typical Site Customization" in this README) by<br />
adding the following line to that script:<br />
rm /etc/sysconfig/xntp<br />
Support for Syslog<br />
------------------<br />
During I/O node startup, the $BGL_DISTDIR/etc/rc.d/rc3.d/S45syslog script<br />
runs to start up the syslogd and klogd daemons. There are three things you<br />
can do to configure this support:<br />
1. Set up options used by the S45syslog script. These options are located<br />
in file /etc/sysconfig/syslog. The default file contains the following<br />
options:<br />
KERNEL_LOGLEVEL=1<br />
Specifies the default logging level for kernel messages (0-7). Kernel<br />
messages having a priority level equal to or more severe than (less<br />
than) this value are logged by syslog. The logging levels are as<br />
follows:<br />
KERN_EMERG 0 System is unusable<br />
KERN_ALERT 1 Action must be taken immediately<br />
KERN_CRIT 2 Critical conditions<br />
KERN_ERR 3 Error conditions<br />
KERN_WARNING 4 Warning conditions<br />
KERN_NOTICE 5 Normal but significant condition<br />
KERN_INFO 6 Informational<br />
KERN_DEBUG 7 Debug-level messages<br />
The default is "1".<br />
SYSLOGD_PARAMS=""<br />
Parameters for the /sbin/syslogd daemon. The following parameters are<br />
supported:<br />
/sbin/syslog [ -a socket ] [ -d ] [ -h ]<br />
[ -p socket ] [ -f config file ] [ -n ]<br />
[ -l hostlist ] [ -m interval ] [ -r ]<br />
[ -t ] [ -s domainlist ] [ -v ]<br />
The default is no parameters. However, the S45syslog script always<br />
specifies "-n" because it is required when starting syslogd from the<br />
init process.<br />
KLOGD_PARAMS="-2"<br />
Parameters for the /sbin/klogd daemon. The following parameters are<br />
supported:<br />
/sbin/klogd [ -c n ] [ -d ] [ -f fname ] [ -iI ] [ -n ] [ -o ]<br />
[ -k fname ] [ -v ] [ -x ] [ -2 ] [ -s ] [ -p ]<br />
The default is "-2" and "-c KERNEL_LOGLEVEL".<br />
SYSLOG_DAEMON="syslogd"<br />
The name of the syslog daemon. The default is "syslogd".<br />
Normally, you should not have to change these options. If you wish to<br />
change them, you need to do this in your site customization script<br />
(S10sitefs - refer to "Typical Site Customization" in this README) by<br />
adding lines similar to the following:<br />
rm /etc/sysconfig/syslog<br />
echo "KERNEL_LOGLEVEL=1" >> /etc/sysconfig/syslog<br />
echo "SYSLOGD_PARAMS=\"\"" >> /etc/sysconfig/syslog<br />
echo "KLOGD_PARAMS=\"-2\"" >> /etc/sysconfig/syslog<br />
echo "SYSLOG_DAEMON=\"syslogd\"" >> /etc/sysconfig/syslog<br />
2. Set up configuration information for the syslog daemon. This is stored<br />
in the /etc/syslog.conf file. The default file is created in the<br />
ramdisk by the S45syslog script and contains the following:<br />
authpriv.none;*.emerg;*.alert;*.crit;*.err; /dev/console<br />
This default file logs messages having priority level 0, 1, 2, or 3 to<br />
the I/O node console (the I/O node log file). If you wish to supply<br />
your own file, you need to create it in your site customization script<br />
(S10sitefs - refer to "Typical Site Customization" in this README) by<br />
adding lines similar to the following:<br />
echo "authpriv.none;*.emerg;*.alert;*.crit;*.err; /dev/console" \<br />
>> /etc/syslog.conf<br />
3. If you want to send I/O node messages to a remote syslog server (using<br />
@server-name in the /etc/syslog.conf file), consider the following:<br />
a. When starting syslogd on the remote server, specify "-r -m 0" to<br />
enable it.<br />
b. If the remote server is logging to files, be aware that the files<br />
can get large. Consider using a utility to manage these files, such<br />
as logrotate.<br />
For more information, refer to the man pages for syslog, syslogd, klogd,<br />
and logrotate.<br />
If you do not want syslog started on the I/O nodes, you can remove the<br />
/etc/sysconfig/syslog file from the ramdisk in your site customization<br />
script (S10sitefs - refer to "Typical Site Customization" in this README)<br />
by adding the following line to that script:<br />
rm /etc/sysconfig/syslog<br />
Support for GPFS<br />
----------------<br />
During I/O node startup, the GPFS client may optionally be started.<br />
Assuming the appropriate GPFS setup has been done, the following describes<br />
the steps for configuring the I/O node scripts to start GPFS and explains<br />
the flow of these scripts.<br />
In your site customization script (S10sitefs - refer to "Typical Site<br />
Customization" in this README), add the following line:<br />
echo "GPFS_STARTUP=1" >> /etc/sysconfig/gpfs<br />
This creates the /etc/sysconfig/gpfs file in the I/O node ramdisk and<br />
specifies that you want the GPFS client to be started.<br />
You may specify the following additional options in the same manner (the<br />
default values are shown here, so you only need to specify these if you<br />
want different values):<br />
echo "GPFS_VAR_DIR=/bgl/gpfsvar" >> /etc/sysconfig/gpfs<br />
echo "GPFS_CONFIG_SERVER=\"$BGL_SNIP\"" >> /etc/sysconfig/gpfs<br />
GPFS_VAR_DIR specifies the pathname to the "var" directory to be used by<br />
the GPFS client. A per-I/O node directory is created under GPFS_VAR_DIR<br />
for storing log files and configuration information. Note that if the<br />
GPFS_VAR_DIR resides in an NFS exported file system, that export should<br />
specify "no_root_squash" so that the I/O node root user can write to the<br />
GPFS_VAR_DIR directory.<br />
GPFS_CONFIG_SERVER specifies the host name or IP address of the primary<br />
GPFS cluster configuration server node. The GPFS configuration file<br />
(/var/mmfs/gen/mmsdrfs) for an I/O node is retrieved from this node, if<br />
necessary.<br />
When you specify to start the GPFS client, as described above, the<br />
following occurs during I/O node startup:<br />
1) The SSH daemon is started by the $BGL_DISTDIR/etc/rc.d/rc3.d/S16sshd<br />
script. SSH is used by GPFS to communicate among the I/O nodes and the<br />
service node. This script will also use the $BGL_SITEDISTDIR/etc/hosts<br />
file to set the hostname of this I/O node.<br />
2) The GPFS client is started by the $BGL_DISTDIR/etc/rc.d/rc3.d/S40gpfs<br />
script.<br />
Typical Site Customization<br />
--------------------------<br />
A Blue Gene site will normally have a script that customizes the startup<br />
and shutdown. This script could perform the following functions:<br />
1. Mount (during startup) and unmount (during shutdown) high performance<br />
file servers for application use. To ensure the file servers are<br />
available to applications,<br />
- the mount must occur after portmap is started by the S05nfs script and<br />
before ciod is started by the S50ciod script.<br />
- if using S45syslog to start syslog that logs to files in a mounted file<br />
system, the mount must occur before syslog is started.<br />
- the unmount must occur after ciod is ended by the K50ciod script and<br />
before portmap is ended by the K95nfs script.<br />
Recommendations for NFS mounts:<br />
a. The mount should be retried several times to accommodate a busy file<br />
server.<br />
b. Mount options:<br />
tcp - This provides automatic flow control when it detects that packets<br />
are being dropped due to network congestion or server overload.<br />
rsize,wsize - These specify read and write NFS buffer sizes. 8192 is<br />
the minimum recommended size, and 32768 is the maximum<br />
recommended size. If your server is not built with a<br />
kernel compiled with a 32768 size, it will negotiate down<br />
to what it can support. In general, the larger the size,<br />
the better the performance. However, depending on the<br />
capacity of your network and server, 32768 may be too<br />
large and cause excessive slowdowns or hangs during times<br />
of heavy I/O. This is something that each site needs to<br />
tune.<br />
async - Specifying this option may improve write performance, although<br />
there is greater risk for losing data if the file server crashes<br />
and is unable to get the data written to disk.<br />
2. Point /tmp to a larger file system. The default /tmp is in a small<br />
ramdisk. Your applications may require a larger /tmp.<br />
3. Set ciod environment variables. Refer to "Environment Variables used<br />
by ciod" in this README for details.<br />
4. Set NTP parameters. Refer to "Support for Network Time Protocol (NTP)"<br />
in this README for details.<br />
5. Set syslog parameters. Refer to "Support for Syslog" in this README<br />
for details.<br />
6. Set GPFS parameters. Refer to "Support for GPFS" in this README for<br />
details.<br />
Your site customization script should be in the<br />
$BGL_SITEDISTDIR/etc/rc.d/rc3.d directory. For example, call it "sitefs".<br />
Then, in order to properly place it in the startup and shutdown sequence,<br />
symbolic links should be created to this script as follows:<br />
ln -s $BGL_SITEDISTDIR/etc/rc.d/rc3.d/sitefs \<br />
$BGL_SITEDISTDIR/etc/rc.d/rc3.d/S10sitefs<br />
ln -s $BGL_SITEDISTDIR/etc/rc.d/rc3.d/sitefs \<br />
$BGL_SITEDISTDIR/etc/rc.d/rc3.d/K90sitefs<br />
The S10sitefs and K90sitefs links are sequenced such that they meet the<br />
mount requirements described above.<br />
During startup, S10sitefs will be called with the "start" parameter.<br />
During shutdown, K90sitefs will be called with the "stop" parameter.<br />
The following is an example of such a site customization script:<br />
#!/bin/sh<br />
#<br />
# Sample sitefs script.<br />
#<br />
# It mounts a filesystem on /scratch, mounts /home for user files<br />
# (applications), creates a symlink for /tmp to point into some directory<br />
# in /scratch using the IP address of the I/O node as part of the directory<br />
# name to make it unique to this I/O node, and sets up environment<br />
# variables for ciod.<br />
#<br />
. /proc/personality.sh<br />
. /etc/rc.status<br />
#-------------------------------------------------------------------<br />
# Function: mountSiteFs()<br />
#<br />
# Mount a site file system<br />
# Attempt the mount up to 5 times.<br />
# If all attempts fail, send a fatal RAS event so the block fails<br />
# to boot.<br />
#<br />
# Parameter 1: File server IP address<br />
# Parameter 2: Exported directory name<br />
# Parameter 3: Directory to be mounted over<br />
# Parameter 4: Mount options<br />
#-------------------------------------------------------------------<br />
mountSiteFs()<br />
{<br />
# Make the directory to be mounted over<br />
[ -d $3 ] || mkdir $3<br />
# Set to attempt the mount 5 times<br />
ATTEMPT_LIMIT=5<br />
ATTEMPT=1<br />
# Attempt the mount up to 5 times.<br />
# Echo a message to the I/O node log for each failed attempt.<br />
until test $ATTEMPT -gt $ATTEMPT_LIMIT || mount $1:$2 $3 -o $4; do<br />
echo "Attempt number $ATTEMPT to mount $1:$2 on $3 failed"<br />
sleep $ATTEMPT<br />
let ATTEMPT=$ATTEMPT+1<br />
done<br />
# If all attempts failed, send a fatal RAS event so the block fails<br />
to<br />
# boot. If the mount worked, echo a message to the I/O node log.<br />
if test $ATTEMPT -gt $ATTEMPT_LIMIT; then<br />
echo "Mounting $1:$2 on $3 failed" > /proc/rasevent/kernel_fatal<br />
exit<br />
else<br />
echo "Attempt number $ATTEMPT to mount $1:$2 on $3 worked"<br />
fi<br />
}<br />
#-------------------------------------------------------------------
# Script: sitefs
#
# Perform site-specific functions during startup and shutdown.
#
# Parameter 1: "start" - perform startup functions
#              "stop"  - perform shutdown functions
#-------------------------------------------------------------------
# Set to the IP address of the site file server
SITEFS=172.32.1.1

# First reset status of this service
rc_reset

# Handle startup (start) and shutdown (stop)
case "$1" in
    start)
        echo Mounting site filesystems

        # Mount a scratch file system...
        mountSiteFs $SITEFS /scratch /scratch \
            tcp,rsize=32768,wsize=32768,async

        # Mount a home file system...
        mountSiteFs $SITEFS /home /home tcp,rsize=8192,wsize=8192

        # Arrange something for tmp...
        # - make a unique directory in the scratch file system.
        # - rename the original /tmp to /tmp.original.
        # - point /tmp to the unique directory.
        tmpdir=/scratch/ionodetmp/$BGL_IP
        [ -d $tmpdir ] || mkdir -p $tmpdir
        mv /tmp /tmp.original
        ln -s $tmpdir /tmp

        # Set up environment variables for ciod
        echo "export CIOD_RDWR_BUFFER_SIZE=262144" >> /etc/sysconfig/ciod
        echo "export DEBUG_SOCKET_STARTUP=ALL" >> /etc/sysconfig/ciod

        # Uncomment the following line to not start the NTP daemon
        # rm /etc/sysconfig/xntp

        # Uncomment the following line to not start the syslogd and
        # klogd daemons
        # rm /etc/sysconfig/syslog

        # Uncomment the first line to start GPFS.
        # Optionally uncomment the other lines to change the defaults
        # for GPFS.
        # echo "GPFS_STARTUP=1" >> /etc/sysconfig/gpfs
        # echo "GPFS_VAR_DIR=/bgl/gpfsvar" >> /etc/sysconfig/gpfs
        # echo "GPFS_CONFIG_SERVER=\"$BGL_SNIP\"" >> /etc/sysconfig/gpfs

        rc_status -v
        ;;
    stop)
        echo Unmounting site filesystems

        # Put /tmp back
        rm /tmp
        mv /tmp.original /tmp

        # Unmount the scratch and home file systems
        umount -f /scratch
        umount -f /home
        ;;
    restart)
        ;;
    status)
        ;;
esac
rc_exit
#------End of script---------
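The retry arithmetic in mountSiteFs can be checked in isolation by stubbing out mount and sleep with shell functions. The stubs below are purely illustrative; nothing is really mounted and no time is spent waiting:

```shell
#!/bin/sh
# Stand-alone check of the until-loop: with every mount failing, the
# loop runs exactly ATTEMPT_LIMIT times and leaves ATTEMPT one past
# the limit, which is the condition that triggers the fatal RAS event.
mount() { return 1; }   # stub: every mount attempt fails
sleep() { :; }          # stub: skip the back-off delay

ATTEMPT_LIMIT=5
ATTEMPT=1
until test $ATTEMPT -gt $ATTEMPT_LIMIT || mount server:/scratch /scratch -o tcp; do
    echo "Attempt number $ATTEMPT to mount server:/scratch on /scratch failed"
    sleep $ATTEMPT
    ATTEMPT=$((ATTEMPT + 1))
done
```

The `sleep $ATTEMPT` back-off means real attempts are spaced further and further apart, which gives a slow file server a little more time to respond before the block is failed.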
A more advanced script could select a file server based on the I/O node
location string (for example, R01-M0-NE-I:J18-U01). Note that
BGL_LOCATION comes from sourcing /proc/personality.sh.

case "$BGL_LOCATION" in
    R00-*) SITEFS=172.32.1.1;;   # rack 0 NFS server
    R01-*) SITEFS=172.32.1.2;;   # rack 1 NFS server
    R02-*) SITEFS=172.32.1.3;;   # rack 2 NFS server
    R03-*) SITEFS=172.32.1.4;;   # rack 3 NFS server
    R04-*) SITEFS=172.32.1.5;;   # rack 4 NFS server
    R05-*) SITEFS=172.32.1.6;;   # rack 5 NFS server
esac

The script could also use $BGL_IP (the IP address of the I/O node) to
compute the file server(s), or $BGL_BLOCKID so that different blocks
can mount special file servers.
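For example, a script might derive the server from the third octet of $BGL_IP. The addressing scheme below (rack N's I/O nodes in 172.30.N.x, served by 172.32.1.N+1) is an assumption made up for this sketch, not a documented convention:

```shell
#!/bin/sh
# Hypothetical mapping from an I/O node's IP address to its NFS server.
# Both the node addressing and the server numbering are invented here.
BGL_IP=${BGL_IP:-172.30.2.17}          # normally set by /proc/personality.sh
rack=$(echo "$BGL_IP" | cut -d. -f3)   # assume the third octet encodes the rack
SITEFS=172.32.1.$((rack + 1))
echo "I/O node $BGL_IP uses file server $SITEFS"
```

Computing the server arithmetically scales better than a per-rack case statement when the machine has many racks, at the cost of requiring a strictly regular addressing plan.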
Support for Subnets
-------------------

Each midplane or group of midplanes can belong to a different subnet.
To take advantage of this, two things need to be done:

1. Specify IP addresses for specific I/O nodes. This is done in the
   BglIpPool database table.

   The I/O nodes in the midplane(s) belonging to a particular subnet
   must have IP addresses belonging to that subnet. When the BglIpPool
   table is populated, the I/O node locations must be specified along
   with the IP addresses, as in the following example:

   db2 "INSERT INTO BglIpPool
   (location,machineSerialNumber,ipAddress)
   VALUES('R01-M0-N4-I:J18-U01','BGL', '172.30.100.244') "

   If the BglIpPool table already exists, and subnet support is being
   added, the following steps must be done:

   a. The BglIpPool table must be deleted and recreated with the
      correct location-ipAddress pairings, as described above.

   b. The IpAddress field within the BglNode table must be set to
      NULL:

      db2 "update bglnode set IpAddress = NULL"

   c. PostDiscovery must be re-run.

2. Specify subnet information for each midplane in the
   BglMidplaneSubnet table, as in the following example:

   db2 "INSERT INTO BglMidplaneSubnet \
   (posInMachine,ipAddress,broadcast,nfsIpAddress) \
   VALUES('R000','172.27.138.254','172.27.138.255','172.27.96.117') "
   FIELD          DESCRIPTION                                IONODE SCRIPT
                                                             VARIABLE
   posInMachine   The midplane specification
   ipAddress      The IP address of the gateway for the      BGL_GATEWAY
                  subnet
   broadcast      The broadcast address for the subnet       BGL_BROADCAST
   nfsIpAddress   The IP address of the /bgl file server     BGL_FS
                  for this midplane

The ipAddress and broadcast fields are related to subnet support. The
nfsIpAddress field enables multiple file servers to host /bgl, for
improved performance.
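Typing one INSERT per I/O node is tedious on a large machine, so a small helper can print the statements for a list of locations instead. The location names, serial number, and subnet below are example data only; this is not a supplied tool:

```shell
#!/bin/sh
# Print db2 INSERT statements for a list of I/O node locations,
# assigning consecutive host numbers within the given subnet.
# Usage: gen_ippool_inserts SERIAL SUBNET FIRST_HOST LOCATION...
gen_ippool_inserts() {
    serial=$1 subnet=$2 host=$3
    shift 3
    for loc in "$@"; do
        echo "INSERT INTO BglIpPool (location,machineSerialNumber,ipAddress)" \
             "VALUES('$loc','$serial','$subnet.$host')"
        host=$((host + 1))
    done
}

# Example run for two (illustrative) node-card locations:
gen_ippool_inserts BGL 172.30.100 244 \
    R01-M0-N4-I:J18-U01 R01-M0-N4-I:J18-U11
```

The generated statements can be reviewed and then piped to db2, which keeps the location-to-address pairings easy to audit before they reach the table.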
Extracting the ramdisk from ramdisk.elf
---------------------------------------

For those who want to examine the contents of the I/O node's ramdisk,
here are instructions for extracting a ramdisk.img.gz from the
ramdisk.elf that is shipped with the Blue Gene driver, and mounting
the ramdisk.

The ramdisk.img.gz is stored in the .data section of a standard ELF
image. It is stored as raw data with an 8-byte header that is needed
at runtime, which needs to be stripped off with the dd command.

cd <working directory>
objcopy --only-section .data --output-target binary \
    /bgl/BlueLight/<driver>/ppc/bglsys/bin/ramdisk.elf image.tmp
dd if=image.tmp of=ramdisk.img.gz bs=8 skip=1
rm -f image.tmp

The ramdisk image has now been extracted from ramdisk.elf. Uncompress
it:

gunzip ramdisk.img.gz

If gunzip works, skip this step and go to the "loopback mount" step.
If gunzip fails with "unexpected end of file", truncate 1 byte from
ramdisk.img.gz using the following commands, where xxxxxx is the size
of ramdisk.img.gz minus 1 (as calculated from "ls -l ramdisk.img.gz"
output):

dd if=ramdisk.img.gz of=ramdisk.img2.gz count=1 bs=xxxxxx
mv ramdisk.img2.gz ramdisk.img.gz
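The header-stripping dd step can be tried out on synthetic data without a real ramdisk.elf. The file names mirror the steps above, but the 8-byte header and the payload are fake:

```shell
#!/bin/sh
# Demonstrate "dd bs=8 skip=1" on fake data: 8 throwaway header bytes
# followed by a gzip stream. Skipping the first 8-byte block leaves
# only the gzip stream, so gunzip recovers the payload.
cd "$(mktemp -d)"
printf 'HDRHDR00' > image.tmp                   # fake 8-byte header
printf 'hello ramdisk\n' | gzip >> image.tmp    # fake payload
dd if=image.tmp of=ramdisk.img.gz bs=8 skip=1 2>/dev/null
gunzip -f ramdisk.img.gz                        # produces ramdisk.img
```

Note that dd copies a partial final block by default, so bs=8 is safe even though the gzip stream is rarely a multiple of 8 bytes long.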
Loopback mount ramdisk.img on directory "r":

mkdir r
mount -o loop ramdisk.img r

Run commands to examine the ramdisk under directory "r". For example:

find r -ls
Abbreviations and acronyms

AIX      Advanced Interactive Executive
API      Application Programming Interface
ARP      Address Resolution Protocol
ASIC     Application Specific Integrated Circuit
BGL      Blue Gene/L
BIST     Built-In Self Test
BLRTS    Blue Gene/L Run Time System
BPE      Bulk Power Enclosure
BPM      Bulk Power Module
CDMF     Commercial Data Masking Facility
CLI      Command Line Interface
CN       Compute Node
CNK      Compute Node Kernel
CPU      Central Processing Unit
CSM      Cluster Systems Management
CWD      Current Working Directory
DES      Data Encryption Standard
DNS      Domain Name System
EOF      End Of File
FEN      Front-End Node
FIFO     First-In-First-Out
FQDN     Fully Qualified Domain Name
GPFS     General Parallel File System
GPL      GNU General Public License
GSA      Global Storage Architecture
GUI      Graphical User Interface
HPC      High Performance Computing
HTML     Hypertext Markup Language
IBM      International Business Machines Corporation
ICMP     Internet Control Message Protocol
IDEA     International Data Encryption Algorithm
IEEE     Institute of Electrical and Electronics Engineers
IO       Input/Output
IP       Internet Protocol
ITSO     International Technical Support Organization
JTAG     Joint Test Action Group
LED      Light Emitting Diode
LPAR     Logical Partition
LUN      Logical Unit Number
MCP      Mini-Control Program
MMCS     Midplane Management Control System
MPI      Message Passing Interface
MSB      Most Significant Bit
MTU      Maximum Transmission Unit
NFS      Network File System
NSD      Network Shared Disk
NTP      Network Time Protocol
NUMA     Non-Uniform Memory Access
PGP      Pretty Good Privacy
PKI      Public Key Infrastructure
PPC      PowerPC
RAM      Random Access Memory
RAS      Reliability, Availability, Serviceability
RPC      Remote Procedure Call
RPM      Red Hat Package Manager
RSA      Rivest, Shamir, and Adleman
RSCT     Reliable Scalable Cluster Technology
RSH      Remote Shell
SHA      Secure Hash Algorithm
SLES     SUSE Linux Enterprise Server
SMP      Symmetric Multi-Processing
SN       Service Node
SP       System Parallel
SQL      Structured Query Language
SRAM     Static Random Access Memory
SSH      Secure Shell
SSL      Secure Sockets Layer
TCP      Transmission Control Protocol
TCP/IP   Transmission Control Protocol/Internet Protocol
TWS      Tivoli Workload Scheduler
UDP      User Datagram Protocol
UID      User ID
URL      Uniform Resource Locator
VLSI     Very Large Scale Integration
XML      Extensible Markup Language
Related publications

IBM Redbooks

The publications that we list in this section are considered
particularly suitable for a more detailed discussion of the topics
that we cover in this redbook. For information about ordering these
publications, see "How to get IBM Redbooks" on page 454. Note that
some of the documents referenced here might be available in softcopy
only.

► Unfolding the IBM eServer Blue Gene Solution, SG24-6686
► Blue Gene/L: System Administration, SG24-7178

Other publications

These publications are also relevant as further information sources:

► IBM General Parallel File System Concepts, Planning, and
  Installation Guide, GA22-7968-02
► IBM General Parallel File System Administration and Programming
  Reference, SA22-7967-02
► IBM General Parallel File System Problem Determination Guide,
  GA22-7969-02
► IBM LoadLeveler Using and Administering Guide, SA22-7881
► CSM for AIX 5L and Linux V1.5 Planning and Installation Guide,
  SA23-1344-01
► CSM for AIX 5L and Linux V1.5 Administration Guide, SA23-1343-01
► CSM for AIX 5L and Linux V1.5 Command and Technical Reference,
  SA23-1345-01
► RSCT Administration Guide, SA22-7889-10
► RSCT for Linux Technical Reference, SA22-7893-10
Online resources

These Web sites and URLs are also relevant as further information
sources:

► GPFS Concepts, Planning, and Installation Guide, GA22-7968-02
  http://publib.boulder.ibm.com/infocenter/clresctr/topic/com.ibm.cluster.gpfs.doc/gpfs23/bl1ins10/bl1ins10.html
► GPFS Administration and Programming Reference, SA22-7967-02
  http://publib.boulder.ibm.com/infocenter/clresctr/topic/com.ibm.cluster.gpfs.doc/gpfs23/bl1adm10/bl1adm10.html
► GPFS Problem Determination Guide, GA22-7969-02
  http://publib.boulder.ibm.com/infocenter/clresctr/topic/com.ibm.cluster.gpfs.doc/gpfs23/bl1pdg10/bl1pdg10.html
► GPFS Documentation updates
  http://publib.boulder.ibm.com/infocenter/clresctr/topic/com.ibm.cluster.gpfs.doc/gpfs23_doc_updates/docerrata.html
► Cluster Systems Management documentation and updates
  http://www14.software.ibm.com/webapp/set2/sas/f/csm/home.html

How to get IBM Redbooks

You can search for, view, or download Redbooks, Redpapers, Hints and
Tips, draft publications and Additional materials, as well as order
hardcopy Redbooks or CD-ROMs, at this Web site:

ibm.com/redbooks

Help from IBM

IBM Support and downloads
ibm.com/support

IBM Global Services
ibm.com/services
Index<br />
Symbols<br />
I/O node 218<br />
~/.rhosts 153–154<br />
A<br />
accountability 396<br />
allocate_block 162<br />
alternate Central Manager 171, 203<br />
Apache 120<br />
authentication 396, 398<br />
authorization 396<br />
authorized_keys 303, 402<br />
B<br />
back-end mpirun 150, 184<br />
Barrier 189<br />
Barrier and Global interrupt 21<br />
BG_ALLOW_LL_JOBS_ONLY 173, 183<br />
BG_CACHE_PARTITIONS 173<br />
BG_ENABLED 173<br />
BG_MIN_PARTITION_SIZE 173<br />
bgIO cluster 244, 251, 263<br />
BGL_DISTDIR 213<br />
BGL_EXPORTDIR 213<br />
BGL_OSDIR 213<br />
BGL_SITEDISTDIR 213<br />
BGLBASEPARTITION 67<br />
BGLBLOCK 36, 92<br />
BGLEVENTLOG 36<br />
BGLJOB 41<br />
bglmaster 31, 43, 62, 89, 115, 281, 283<br />
BGLMIDPLANE 68<br />
BGLNODECARDCOUNT 77<br />
BGLPROCESSORCARDCOUNT 77<br />
BGLSERVICEACTION 49<br />
BGLSERVICECARDCOUNT 70<br />
bglsysdb 287<br />
bgmksensor 390, 394<br />
BGWEB 57–59, 68, 71, 77, 80, 86, 108<br />
blc_powermodules 132<br />
blc_temperatures 132<br />
blc_voltages 132<br />
bll_lbist 133<br />
bll_lbist_linkreset 133<br />
bll_lbist_pgood 133<br />
bll_powermodules 134<br />
bll_temperatures 134<br />
bll_voltages 134<br />
block 93, 158, 162–163, 218<br />
Block Information 162<br />
Block information 123–124<br />
Block initialization 106<br />
Block monitoring 100<br />
blrts 7, 145<br />
blrts tool chain 145, 148, 322<br />
Blue Gene Light Runtime System 7<br />
Blue Gene/L 15, 199<br />
Blue Gene/L driver 339, 342<br />
Blue Gene/L libraries 173<br />
Blue Gene/L processes 56<br />
Blue Gene/L, core 56, 82–83<br />
Boot process 36<br />
booting a block 138<br />
BPE 15<br />
BPM 15, 451<br />
bridge API 170, 179–180<br />
bridge.config 176<br />
BRIDGE_CONFIG_FILE 155, 167, 175, 349, 351<br />
bs_trash_0 132<br />
bs_trash_1 132<br />
Bulk power enclosure 15<br />
Bulk power module 15<br />
Bulk power modules 114<br />
BusyBox 33<br />
C<br />
CableDiscovery 44, 46<br />
CDMF 397<br />
Central Manager 168–170, 178, 203, 375–376<br />
chkconfig 152<br />
ciod 34, 39, 162, 212, 274<br />
ciodb 31, 34, 41, 89, 136, 274, 280<br />
cipherList 259<br />
clock card 15, 68<br />
clock signal 11<br />
cluster 168<br />
Cluster Systems Management. See CSM.<br />
Cluster Wide File System 164<br />
CNK 7, 35, 38, 145, 212<br />
coaxial port 15<br />
collective communication 142–143, 189<br />
collective communication and computation 142<br />
collective network 21, 26, 32, 41<br />
compilers 141, 144<br />
compilers, IBM 145<br />
compute card 2, 7–8, 80, 127, 268<br />
Compute node kernel 7, 34, 212<br />
COMPUTENODES 81<br />
condition 386, 388<br />
control system server logs 99<br />
co-processor 7<br />
cryptographic techniques 395<br />
CSM 384<br />
CSM client 385<br />
CSM cluster 385<br />
D<br />
data encryption 396<br />
data signing 396<br />
database 29, 60<br />
database browser 119, 130<br />
database view 70<br />
db.properties 176, 287<br />
DB_PROPERTY 156, 167, 175, 349, 352<br />
DB2 29, 58, 87<br />
DB2 commands 161<br />
DB2 database 69, 119, 384<br />
DB2 statements 111<br />
DB2 stored procedure 391<br />
DB2 trigger 391<br />
db2cshrc 330<br />
db2iauto 88<br />
db2profile 156, 167, 330, 349<br />
db2set 88<br />
DES 397<br />
dgemm160 132<br />
dgemm160e 132<br />
dgemm3200 132<br />
dgemm3200e 132<br />
diagnostic data 31<br />
diagnostic tests 31<br />
diagnostics 106, 113, 119, 128, 131–132, 135<br />
digital signature 398<br />
discovery 12, 30, 43, 46, 50, 67<br />
discovery process 104<br />
dr_bitfail 132<br />
driver update 320<br />
dumpconv 235<br />
E<br />
emac_dg 132<br />
EndServiceAction 46, 49, 53, 122, 294<br />
environment data 30<br />
environmental information 114, 118, 126<br />
ethtool 84, 290<br />
event 386<br />
F<br />
fan card 13<br />
fan unit 14<br />
fans 114<br />
FEN (Front-End Node) 24, 32, 61, 103, 144<br />
File Server 217<br />
file systems 32, 64<br />
free_block 267<br />
front-end mpirun 150, 183<br />
Functional Ethernet 21<br />
functional network 24, 29, 56, 60, 267, 269<br />
G<br />
gcc compiler 148<br />
General Parallel File System. See GPFS.<br />
gensmallblock 95<br />
gi_single_chip 132<br />
gidcm 132<br />
global barrier and interrupt network 27–28<br />
GPFS 24, 56, 65, 211–212, 226, 230, 234, 244,<br />
266, 294, 303, 329, 346<br />
GPFS access patterns 226<br />
GPFS cross-cluster authentication 246–247<br />
GPFS Portability Layer 230, 321<br />
GPFS_STARTUP 229<br />
H<br />
hardware browser 121<br />
hardware monitor 113–114<br />
hardware problems 106<br />
Hash function 396<br />
hosts.equiv 153–154
I<br />
I/O card 77, 135<br />
I/O node 7–8, 24, 27, 33, 39, 62, 77, 95, 145, 149,<br />
168, 212, 237–238, 318<br />
I/O node boot sequence 212–213<br />
I²C 21–22, 24<br />
IBM compilers 145<br />
IBM XL compilers 145<br />
ibmcmp 214<br />
IDEA 398<br />
IDo 9, 22, 28, 36<br />
IDo chip 28, 30–31, 43<br />
IDo link 270<br />
idoproxy 36, 43, 89, 280–281, 283, 289<br />
idoproxydb 31, 136<br />
ifconfig 290<br />
Initializing the partition 182<br />
installation dist 214<br />
intake plenum 13<br />
Inter-Integrated Circuit 24<br />
IRecv 142<br />
J<br />
job command file 317<br />
job cycle 176<br />
job ID 193, 197–198<br />
job information 123, 125, 162<br />
job monitoring 100<br />
job runtime information 165<br />
job setup 157<br />
job state 363<br />
job states 163<br />
job status 96, 161–162<br />
job submission 36, 66, 141, 149, 158, 178, 349<br />
job tracking 158<br />
job_type 199<br />
JOBNAME 198<br />
jobstepid 193<br />
journaling file system 226<br />
JTAG 21–22<br />
K<br />
known_hosts 303, 402<br />
L<br />
ldconfig 173<br />
ldd 173<br />
link card 17, 22, 114<br />
link card chips 71<br />
linpack 133<br />
linuximg 33<br />
list_jobs 139<br />
llcancel 193<br />
llhold 366<br />
llinit 204<br />
llq 178, 193, 197, 203, 365<br />
llstatus 61, 172, 186, 188–189, 192, 203, 371<br />
llsubmit 36, 66, 178, 193, 198–199<br />
LoadL_admin 171, 203, 206<br />
LoadL_config 172, 178, 204, 207<br />
LoadLeveler 56, 66, 109, 124, 141, 162, 167–168,<br />
170, 205, 295, 358<br />
daemons 202<br />
job classes 205<br />
job command file 198<br />
job state 177<br />
job submission 178–179<br />
queue 182<br />
run queue 193<br />
LOCAL_CONFIG 205<br />
location 47<br />
location codes 3<br />
LocationString 46<br />
lscondition 386<br />
lsresponse 387<br />
lssensor 389<br />
lxtrace 235<br />
M<br />
massive parallel system 15<br />
Master 375<br />
MCP 7, 33, 35, 38, 212–213<br />
mem_l2_coherency 133<br />
Message Passing Interface 141–142<br />
methodology 2, 56, 106<br />
microloader 35, 38<br />
midplane 6, 8, 17, 22, 68–70, 75, 158, 189<br />
midplane information 123<br />
Midplane Management Control System 30, 149<br />
mini-control program 212<br />
mkcondition 390<br />
mkcondresp 386<br />
mkresponse 390<br />
mksensor 390<br />
mmauth 247–248, 261<br />
mmchconfig 247, 308<br />
mmchdisk 263<br />
MMCS 30, 141, 162<br />
MMCS console 113, 136<br />
mmcs_console 66, 96<br />
mmcs_db_console 91, 96, 136, 163, 220, 242, 252,<br />
267, 282–283, 295<br />
mmcs_db_server 31, 36, 39, 136, 282<br />
mmcs_server 89, 283<br />
MMCS_SERVER_IP 155, 166, 175, 349<br />
mmdsh 308<br />
mmfslinux 235<br />
mmgetstate 255, 298, 347–348<br />
mmlscluster 259<br />
mmremotefs 249–250, 257<br />
mmstartup 245<br />
monitoring tools 106<br />
MPI 56, 141, 147<br />
MPI library 142, 146<br />
MPI_Barrier 142–143<br />
MPI_Comm_rank 144<br />
MPI_Comm_size 144<br />
MPI_COMM_WORLD 142, 144<br />
MPI_Finalize 144<br />
MPI_Init 144<br />
MPI_IRecv 142<br />
MPI_Isend 142<br />
MPI_Recv 142<br />
MPI_Reduce 142<br />
MPI_Scatter 142<br />
MPI_Send 142<br />
mpicc 147<br />
mpicxx 147<br />
mpif77 147<br />
mpirun 36, 41, 66, 141, 149–150, 155, 158, 160,<br />
162–165, 170–171, 177, 183, 187, 200–201, 266,<br />
343, 347, 350<br />
only_test_protocol 356<br />
-verbose 370<br />
mpirun checklist 166<br />
ms_gen_short 133<br />
N<br />
naming convention 19<br />
Negotiator 168, 170, 189, 375<br />
NegotiatorLog 208<br />
netstat 208<br />
network switches 103<br />
network tuning parameters 223<br />
NFS 65, 90, 211–212, 215, 225, 266, 294, 296, 329<br />
NFS servers 215<br />
nfsd 219<br />
node card 8, 30, 48, 76, 104, 114<br />
node definition file 251<br />
node startup 212<br />
no-pty 155<br />
NSD 231<br />
NUMNODECARDS 77<br />
NUMSERVICECARDS 70<br />
O<br />
operational logs 113<br />
P<br />
pagepool 226, 244, 259, 261, 294, 297<br />
parallel 199<br />
parallel programming environment 141–142<br />
partition 180, 358<br />
partitions 158, 180, 197<br />
point-to-point 142<br />
point-to-point communication 142<br />
port mapper 294<br />
portmap 219<br />
PostDiscovery 43, 46, 54<br />
postdiscovery 50<br />
power_module_stress 133<br />
PowerPC 144<br />
PowerPC 440 7, 148<br />
PPC64 144<br />
ppcfloor 335<br />
predefined partitions 160<br />
PrepareForService 46–47, 50, 122, 289, 292, 294<br />
printf() 144<br />
problem definition 56<br />
problem determination methodology 55, 104, 254<br />
problem monitor 123<br />
processor card 80, 82<br />
Public Key Infrastructure (PKI) 396, 398<br />
Q<br />
quorum 244<br />
R<br />
rack 3, 104, 131, 158, 267
racks 66<br />
ramdisk 38<br />
RAS 30<br />
RAS data 30<br />
RAS event 36, 38, 108, 115, 118–119, 126–127,<br />
268<br />
Redbooks Web site 454<br />
Contact us xv<br />
Reliable Scalable Cluster Technology 385<br />
remote command execution 266, 399<br />
remote execution environment 151<br />
remote file copy 399<br />
remote shell 100, 151<br />
response 386, 388<br />
rpc.mountd 219<br />
RPM 147<br />
RSCT 385<br />
rsh 100<br />
rshd 153<br />
rundiag 135<br />
runtime 123<br />
S<br />
schedd 171, 185, 187<br />
scheduler 168<br />
scp 401<br />
secure shell 102, 151, 154, 383, 399<br />
secure socket layer 400<br />
sensor 388<br />
serial 199<br />
server logs 61<br />
service action 46, 121–122<br />
service card 11, 69, 114<br />
Service Network 21, 29, 267, 269–270, 273<br />
Service Node 22, 29, 32, 34, 44, 56, 59, 83, 103,<br />
135, 144, 155, 232, 237–238, 267, 273<br />
Service Node database 58<br />
serviceactionid 53<br />
sftp 401<br />
showmount 91, 218–219, 296<br />
site configuration 158<br />
site dist 214<br />
sitefs 217, 221, 259<br />
slogin 401<br />
SRAM 38<br />
ssh 401<br />
sshd 103, 155, 214, 334<br />
SSL 400<br />
Start daemon 168<br />
startcondresp 386, 389, 394<br />
startd 171–172, 182<br />
starter 172<br />
Starter process 168<br />
StarterLog 368<br />
status LED 12<br />
stderr 25<br />
stdin 25<br />
stdout 25<br />
submit_job 36, 66, 96, 141, 149, 267<br />
Symmetric key 396<br />
sysctl.conf 223<br />
system processes 267<br />
systemcontroller 43, 46, 50<br />
T<br />
TBGLIPPOOL 60<br />
TBGLJOB 111<br />
TBGLJOB_HISTORY 111<br />
TBLSERVICECARD 69<br />
ti_abist 133<br />
ti_edramabist 133<br />
ti_lbist 133<br />
Torus 17, 21, 26, 189<br />
tr_connectivity 133<br />
tr_loopback 133<br />
tr_multinode 133<br />
tracedev 235<br />
Triple DES 397<br />
ts_connectivity 133<br />
ts_loopback_0 133<br />
ts_loopback_1 133<br />
ts_multinode 133<br />
V<br />
virtual node 7<br />
W<br />
Web interface 113<br />
write_con 137, 221, 242<br />
X<br />
xinetd 152<br />
XL compilers 148<br />
XLC/XLF 148<br />
XLC/XLF compilers 148<br />
xlmass 149<br />
xlsmp 149<br />
Back cover

IBM System Blue Gene Solution: Problem Determination Guide

SG24-7211-00        ISBN 0738496766
This IBM Redbook is intended as a problem determination guide for
system administrators in a High Performance Computing environment. It
can help you find a solution to issues that you encounter on your IBM
eServer Blue Gene system.

This redbook includes a problem determination methodology along with
the problem determination tools that are available with the basic IBM
eServer Blue Gene Solution. It also discusses additional software
components that are required for integrating your Blue Gene system in
a complex computing environment.

This redbook also describes a GPFS installation procedure that we used
in our test environment, and several scenarios that describe possible
issues and their resolution, which we developed following the proposed
problem determination methodology.

Finally, this redbook includes a short introduction about how to
integrate your Blue Gene system in a High Performance Computing
environment managed by IBM Cluster Systems Management, as well as an
introduction to how you can use secure shell in such an environment.
INTERNATIONAL TECHNICAL SUPPORT ORGANIZATION

BUILDING TECHNICAL INFORMATION BASED ON PRACTICAL EXPERIENCE

IBM Redbooks are developed by the IBM International Technical Support
Organization. Experts from IBM, Customers and Partners from around the
world create timely technical information based on realistic
scenarios. Specific recommendations are provided to help you implement
IT solutions more effectively in your environment.

For more information:
ibm.com/redbooks