- Page 1: IBM System Blue Gene Solution Probl
- Page 4 and 5: Note: Before using this information
- Page 6 and 7: 1.7.5 Microloader . . . . . . . . .
- Page 8 and 9: 4.4.2 Principles of operation in a
- Page 10 and 11: Installing LoadLeveler on SN and FE
- Page 12 and 13: Trademarks The following terms are
- Page 14 and 15: He holds a Master's Degree in Elect
- Page 16 and 17: Frank Ingram Randal Massot Mike Nel
- Page 18 and 19: xvi IBM System Blue Gene Solution:
- Page 20 and 21: 1.1 Blue Gene/L system overview In
- Page 22 and 23: Figure 1-2 Card positions in the fr
- Page 24 and 25: 1.2.2 Midplane As the name suggests
- Page 26 and 27: 1.2.5 Node card As an application r
- Page 28 and 29: Figure 1-8 Node card center LED pan
- Page 30 and 31: Status LEDs are located on the righ
- Page 32 and 33: There are 20 fan units (clusters) p
- Page 34 and 35: The BPMs also have status LEDs, as
- Page 38 and 39: Fan assemblies: Rxx-Mx-Ax Fan assem
- Page 40 and 41: 1.3.1 Service network The service n
- Page 42 and 43: I²C The Inter-Integrated Circuit (
- Page 44 and 45: 1.3.3 Three dimensional torus (3D t
- Page 46 and 47: programmed as a global logical “O
- Page 48 and 49: ► Configuration data: Includes sy
- Page 50 and 51: 1.5 Front-End Node A Front-End Node
- Page 52 and 53: BusyBox is open source software. Yo
- Page 54 and 55: 1.8 Boot process, job submission, a
- Page 56 and 57: After the chips are initialized, th
- Page 58 and 59: 56 ALL CIOD Complete Front-End Node
- Page 60 and 61: Frontend Node Job Job Scheduler mpi
- Page 62 and 63: 1.9.5 CableDiscovery CableDiscovery
- Page 64 and 65: 1.10.1 Discovery logs SystemControl
- Page 66 and 67: Mar 31 11:25:43.750 EST: Service Ac
- Page 68 and 69: glsn:/bgl/BlueLight/ppcfloor/bglsys
- Page 70 and 71: EndServiceAction logs its invocatio
- Page 72 and 73: Repeat the discovery command for ea
- Page 74 and 75: 2.1 Introduction Whenever you have
- Page 76 and 77: The Configuration section displays
- Page 78 and 79: 1 record(s) selected. Then check th
- Page 80 and 81: The default location of control sys
- Page 82 and 83: Table 2-6 Control system log for pe
- Page 84 and 85: 2.2.8 Job submission 2.2.9 Racks No
- Page 86 and 87: 2.2.10 Midplanes 2.2.11 Clock cards
- Page 88 and 89: 2.2.13 Link cards ► Using a DB2 s
- Page 90 and 91: Figure 2-6 BGWEB table showing the
- Page 92 and 93: 2.2.15 Link summary You can view th
- Page 94 and 95: 2.2.16 Node cards Example 2-14 Disp
- Page 96 and 97: Figure 2-10 Top of the page for the
- Page 98 and 99: 2.2.18 Compute or processor cards C
- Page 100 and 101: To check the number of processor ca
- Page 102 and 103: etc/shadow:bglsysdb:$1$SwI1..4e$iGN
- Page 104 and 105: 4. Use the /bin/ping command to che
- Page 106 and 107: If you discover issues during the c
- Page 108 and 109: 2. If you need to restart BGLMaster
- Page 110 and 111: 2. After you are connected to the c
- Page 112 and 113: Apr 05 17:20:57 (I) [1090516192] te
- Page 114 and 115: 2.3.8 Check that a simple job can r
- Page 116 and 117: Figure 2-16 BGWEB showing job Infor
- Page 118 and 119: To check the control system server
- Page 120 and 121: If there is an issue with rsh, then
- Page 122 and 123: 2.3.13 Check the physical Blue Gene
- Page 124 and 125: This methodology also allows someon
- Page 126 and 127: - Hardware 3.2, “Hardware monitor
- Page 128 and 129: BRIDGE (Debug): rm_add_job() - Comp
- Page 130 and 131: 112 IBM System Blue Gene Solution:
- Page 132 and 133: 3.1 Introduction This chapter expla
- Page 134 and 135: Using the GUI If you have a graphic
- Page 136 and 137: Figure 3-2 Querying the environment
- Page 138 and 139: 3.3.1 Starting the tool Figure 3-4
- Page 140 and 141: Figure 3-5 Missing hardware in a re
- Page 142 and 143: Block information The block informa
- Page 144 and 145: In addition, the Show RAS events fo
- Page 146 and 147: Figure 3-11 RAS drill down showing
- Page 148 and 149: Figure 3-14 Testcase highlighting t
- Page 150 and 151: 3.4.1 Test cases The diagnostics su
- Page 152 and 153: 3.4.2 Starting the tool Test case D
- Page 154 and 155: While the Web interface provides a
- Page 156 and 157: Apr 06 18:59:41 (I) [1079031008] {1
- Page 158 and 159: 140 IBM System Blue Gene Solution:
- Page 160 and 161: 4.1 Parallel programming environmen
- Page 162 and 163: 4.2 Compilers The following steps e
- Page 164 and 165: Because these RPMs can be downloade
- Page 166 and 167: 4.2.2 The IBM XLC/XLF compilers Thi
- Page 168 and 169: of providing this access for the ad
- Page 170 and 171: The system administrator can set up
- Page 172 and 173: Note: A similar configuration is pe
- Page 174 and 175: ► DB_PROPERTY This setting is def
- Page 176 and 177: BRIDGE (Info) : The machine serial
- Page 178 and 179: In Figure 4-3, clicking Runtime dis
- Page 180 and 181: The Web page shown in Figure 4-4 al
- Page 182 and 183: Figure 4-5 shows a diagram of the j
- Page 184 and 185: Users and system administrators can
- Page 186 and 187: A node part of a LoadLeveler cluste
- Page 188 and 189: 4.4.2 Principles of operation in a
- Page 190 and 191: Start daemon (startd) runs on each
- Page 192 and 193: The binaries are located in /bgl/Bl
- Page 194 and 195: Because these variable are user spe
- Page 196 and 197: 4.4.8 LoadLeveler job submission pr
- Page 198 and 199: Figure 4-14 shows the process of Lo
- Page 200 and 201: Steps 5 and 6: Initializing the par
- Page 202 and 203: Steps 8 and 9: Starting the paralle
- Page 204 and 205: Service Node Central Manager Figure
- Page 206 and 207: 4. One of the start daemons is down
- Page 208 and 209: Z = 0 ===== +------------+ | R000|
- Page 210 and 211: Z = 0 ===== +----------------------
- Page 212 and 213: You can use the llq -l command to
- Page 214 and 215: Submitting Host: bglfen1.itso.ibm.c
- Page 216 and 217: Note: The JOBID from Blue Gene/L da
- Page 218 and 219: ► #@ arguments This keyword conta
- Page 220 and 221: LoadLeveler processes, logs, and pe
- Page 222 and 223: You can configure the LoadLeveler d
- Page 224 and 225: ► STARTD_RUNS_HERE This keyword s
- Page 226 and 227: For example, knowing how sockets wo
- Page 228 and 229: 210 IBM System Blue Gene Solution:
- Page 230 and 231: 5.1 NFS and GPFS In a basic configu
- Page 232 and 233: sshd (optional) Starts the secure s
- Page 234 and 235: 1. Create NFS Server with storage a
- Page 236 and 237: 5.2.4 NFS checklists the simplest p
- Page 238 and 239: Step 2 - Check if the NFS-FS can be
- Page 240 and 241: # It mounts a filesystem on /scratc
- Page 242 and 243: 5.3 GPFS Example 5-6 Recommended ne
- Page 244 and 245: However, if you have the following
- Page 246 and 247: nodes and these have no boot/OS dis
- Page 248 and 249: Important: The GPFS code that is in
- Page 250 and 251: 5.3.6 Creating the GPFS file system
- Page 252 and 253: Installing the GPFS code for SN Fig
- Page 254 and 255: You should see a list similar to th
- Page 256 and 257: 172.30.1.1 bglsn_fn OpenPower Servi
- Page 258 and 259: Having created the /tmp/authorized_
- Page 260 and 261: When the "GPFS_STARTUP=1" line is i
- Page 262 and 263: Creating the bgIO cluster After you
- Page 264 and 265: Mon Mar 20 14:14:34 2006: mmfsd rea
- Page 266 and 267: Example 5-21 Telling clusters to au
- Page 268 and 269: To mount the GPFS file system on th
- Page 270 and 271: Example 5-25 Node definition file f
- Page 272 and 273: 5.3.9 GPFS problem determination me
- Page 274 and 275: Checking the GPFS log files for pro
- Page 276 and 277: Node number Node name GPFS state --
- Page 278 and 279: Example 5-35 shows the output of th
- Page 280 and 281: We also ran the mmlscluster command
- Page 282 and 283: 5.3.11 References For more informat
- Page 284 and 285: 6.1 Introduction In some scenarios,
- Page 286 and 287: 6.2.1 Hardware error: Compute card
- Page 288 and 289: Example 6-2 mmcs message: service c
- Page 290 and 291: Problem determination When we boote
- Page 292 and 293: Looking in the system logs we see t
- Page 294 and 295: Lessons learned The /tmp directory
- Page 296 and 297: Mar 14 12:27:04 (I) [1079567584] Ab
- Page 298 and 299: Lessons learned We learned the foll
- Page 300 and 301: glsn:/bgl/BlueLight/ppcfloor/bglsys
- Page 302 and 303: 6.2.12 DB2 not started on the SN In
- Page 304 and 305: Now we start DB2 on the SN; after D
- Page 306 and 307: After changing the db.properties wi
- Page 308 and 309: Mar 15 17:04:33 (I) [1086321888] ro
- Page 310 and 311: Current message level: 0x00000007 (
- Page 312 and 313: As EndServiceAction did not complet
- Page 314 and 315: Mar 13 16:02:17 (I) [1084843232] ro
- Page 316 and 317: Problem determination In this scena
- Page 318 and 319: ** error ** ** error ** ** error **
- Page 320 and 321: Mar 28 15:56:48 (I) [1088451808] ro
- Page 322 and 323: -rw-r--r-- 1 root root 220 Mar 19 1
- Page 324 and 325: Summary: api = MPIIO (version=2, su
- Page 326 and 327: Are you sure you want to continue c
- Page 328 and 329: Now that we know that the GPFS conf
- Page 330 and 331: Lessons learned From this scenario,
- Page 332 and 333: We can see that the partition is st
- Page 334 and 335: Ckpt Execute Dir: Restart From Ckpt
- Page 336 and 337: We realize that we cannot find eith
- Page 338 and 339: Note: Even though we can manually c
- Page 340 and 341: -rw-r--r-- 1 root root 58882 Mar 31
- Page 342 and 343: Step 1 Now we install the new Blue
- Page 344 and 345: We used the updated script to retri
- Page 346 and 347: Example 6-66 shows the files that w
- Page 348 and 349: The BACKFILL scheduler with Blue Ge
- Page 350 and 351: Log will be written to /bgl/BlueLig
- Page 352 and 353: However, we were able to ssh to the
- Page 354 and 355: 8. We try to run a job. We start by
- Page 356 and 357: Summary: api = MPIIO (version=2, su
- Page 358 and 359: 11.We test that we can boot a block
- Page 360 and 361: delaying 1 seconds . . . write 6.61
- Page 362 and 363: clients = 128 (16 per node) repetit
- Page 364 and 365: Problem determination When we ran t
- Page 366 and 367: 7 ionode6 active 8 ionode7 active 9
- Page 368 and 369: 6.4.2 The mpirun command: environme
- Page 370 and 371: BE_MPI (Info) : == BE completed ==
- Page 372 and 373: Problem determination By default, t
- Page 374 and 375: # bglfen1 test1 bglfen2 test1 bglsn
- Page 376 and 377: 6.4.4 LoadLeveler: scenarios descri
- Page 378 and 379: FE_MPI (Info) : Scheduler interface
- Page 380 and 381: The log messages indicate the job c
- Page 382 and 383: Problem description A job submitted
- Page 384 and 385: MAX_JOB_REJECT = 0 # # When ACTION_
- Page 386 and 387: From the StartLog, we can see that
- Page 388 and 389: Detailed checking We start by check
- Page 390 and 391: Example 6-121 LoadLeveler basic che
- Page 392 and 393: We first check the LoadLeveler Nego
- Page 394 and 395: The error messages shown in Example
- Page 396 and 397: 04/05 15:30:28 TI-6 LoadLeveler: Ba
- Page 398 and 399: To detect the left-over socket file
- Page 400 and 401: 382 IBM System Blue Gene Solution:
- Page 402 and 403: 7.1 Cluster Systems Management This
- Page 404 and 405: 7.1.2 Monitoring the Blue Gene/L da
- Page 406 and 407: To simplify the earlier monitoring
- Page 408 and 409: Description = This sensor is update
- Page 410 and 411: To verify that all is working prope
- Page 412 and 413: 7.1.5 Miscellaneous related informa
- Page 414 and 415: The cryptographic technique are emp
- Page 416 and 417: ► RC2/RC4 algorithms RC2: block c
- Page 418 and 419: Secure shell server The secure shel
- Page 420 and 421: Secure shell client programs also d
- Page 422 and 423: ► If this file exists, it checks
- Page 424 and 425: ► We restarted the ssh daemons on
- Page 426 and 427: 408 IBM System Blue Gene Solution:
- Page 428 and 429: Installing LoadLeveler on SN and FE
- Page 430 and 431: Create the home directory for user
- Page 432 and 433: Example A-3 presents a sample LoadL
- Page 434 and 435: Example A-4 presents a sample LoadL
- Page 436 and 437: CKPT_CLEANUP_INTERVAL = 86400 # sam
- Page 438 and 439: # STARTER = $(BIN)/LoadL_starter ST
- Page 440 and 441: # MAX_JOB_REJECT = 0 # # When ACTIO
- Page 442 and 443: The /bgl/dist/etc/rc.d/init.d/sitef
- Page 444 and 445: # ---------------------------------
- Page 446 and 447: # Parameter 1: "start" - perform st
- Page 448 and 449: 430 IBM System Blue Gene Solution:
- Page 450 and 451: gl/BlueLight/ppcfloor/docs/ionode.R
- Page 452 and 453: ------------------------------- The
- Page 454 and 455: BGL_BLOCKID Name of block BGL_VERBO
- Page 456 and 457: to by Normally, you should not have
- Page 458 and 459: ] to by 440 IBM System Blue Gene So
- Page 460 and 461: GPFS_VAR_DIR directory. GPFS_CONFIG
- Page 462 and 463: $BGL_SITEDISTDIR/etc/rc.d/rc3.d dir
- Page 464 and 465: # Handle startup (start) and shutdo
- Page 466 and 467: Support for Subnets ---------------
- Page 468 and 469: mv ramdisk.img2.gz ramdisk.img.gz L
- Page 470 and 471: RSH Remote Shell SHA Secure Hash Al
- Page 472 and 473: Online resources These Web sites an
- Page 474 and 475: clock signal 11 cluster 168 Cluster
- Page 476 and 477: mmauth 247-248, 261 mmchconfig 247,
- Page 478 and 479: XL compilers 148 XLC/XLF 148 XLC/XL
- Page 482: IBM System Blue Gene Solution Probl