Administering Platform LSF - SAS


Checkpointing Jobs

Checkpointing a job involves capturing the state of an executing job and the data necessary to restart it, so that the work done to reach the current stage is not wasted. The job state information is saved in a checkpoint file. There are many reasons why you would want to checkpoint a job.

Fault tolerance

To provide job fault tolerance, checkpoints are taken at regular intervals (periodically) during the job’s execution. If the job is killed or migrated, or if the job fails for a reason other than host failure, it can be restarted from its last checkpoint instead of repeating the work already done.

Migration

Checkpointing enables a migrating job to make progress rather than restarting from the beginning. Jobs can be migrated when a host fails or when a host becomes unavailable due to load.

Load balancing

Checkpointing a job and restarting it (migrating) on another host provides load balancing by moving load (jobs) from a heavily loaded host to a lightly loaded host.

In this section

◆ “Approaches to Checkpointing”
◆ “Checkpointing a Job”
Chapter 23: Job Checkpoint, Restart, and Migration

Approaches to Checkpointing

LSF provides support for most checkpoint and restart implementations through uniform interfaces, echkpnt and erestart. All interaction between LSF and the checkpoint implementations is handled by these commands. See the echkpnt(8) and erestart(8) man pages for more information.

Checkpoint and restart implementations are categorized based on the facility that performs the checkpoint and the amount of knowledge an executable has of the checkpoint. Commonly, checkpoint and restart implementations are grouped as kernel-level, user-level, and application-level.

Kernel-level checkpointing

Kernel-level checkpointing is provided by the operating system and can be applied to arbitrary jobs running on the system. This approach is transparent to the application: no source code changes are required, and you do not need to re-link your application with checkpoint libraries.

To support kernel-level checkpoint and restart, LSF provides echkpnt and erestart executables that invoke OS-specific system calls.

Kernel-level checkpointing is currently supported on:

◆ Cray UNICOS
◆ IRIX 6.4 and later
◆ NEC SX-4 and SX-5

See the chkpnt(1) man page on Cray systems and the cpr(1) man page on IRIX systems for the limitations of their checkpoint implementations.

User-level checkpointing

For systems that do not support kernel-level checkpointing, LSF provides a method called user-level checkpointing. To implement user-level checkpointing, you must have access to your application's object files (.o files), and they must be re-linked with a set of libraries provided by LSF in LSF_LIBDIR. This approach is transparent to your application: its code does not have to be changed, and the application does not know that a checkpoint and restart has occurred.

Application-level checkpointing

The application-level approach applies to applications that are specially written to accommodate checkpoint and restart. The application writer must also provide an echkpnt and an erestart to interface with LSF. For more details, see the echkpnt(8) and erestart(8) man pages. The application checkpoints itself either periodically or in response to signals sent by other processes. When restarted, the application itself must look for the checkpoint files and restore its state.
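
The application-level behavior described above (checkpoint periodically or on a signal, then look for the checkpoint file at startup) can be illustrated with a minimal Python sketch. It is an illustration only and does not use or implement the LSF echkpnt and erestart interfaces. The checkpoint file format, the use of SIGUSR1 as the trigger, the period of 100 steps, and all function names are assumptions made for this example.

```python
import json
import os
import signal
import tempfile

# Set by the signal handler; checked by the main loop (hypothetical design).
checkpoint_requested = False


def request_checkpoint(signum, frame):
    """Signal handler: only records the request; the loop does the actual save."""
    global checkpoint_requested
    checkpoint_requested = True


def save_checkpoint(state, path):
    """Write the state to a temporary file, then rename it over the old checkpoint."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # a crash mid-write leaves the old checkpoint intact


def load_checkpoint(path):
    """Return the saved state if a checkpoint file exists, else a fresh state."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_step": 0, "total": 0}


def run(total_steps, path, period=100, stop_after=None):
    """Sum the integers 0..total_steps-1, checkpointing as the work proceeds.

    stop_after simulates the job being killed after that many steps in this run.
    """
    global checkpoint_requested
    state = load_checkpoint(path)  # on restart, resume from the last checkpoint
    steps_this_run = 0
    while state["next_step"] < total_steps:
        if stop_after is not None and steps_this_run >= stop_after:
            break  # simulated kill: work since the last checkpoint is lost
        state["total"] += state["next_step"]
        state["next_step"] += 1
        steps_this_run += 1
        # Checkpoint periodically, or when another process has signalled us.
        if checkpoint_requested or state["next_step"] % period == 0:
            save_checkpoint(state, path)
            checkpoint_requested = False
    return state


if __name__ == "__main__":
    if hasattr(signal, "SIGUSR1"):  # SIGUSR1 is not available on every platform
        signal.signal(signal.SIGUSR1, request_checkpoint)
    with tempfile.TemporaryDirectory() as workdir:
        chkpnt = os.path.join(workdir, "job.chkpnt")  # hypothetical file name
        run(1000, chkpnt, period=100, stop_after=250)  # first run dies at step 250
        print(run(1000, chkpnt, period=100))  # restart resumes from step 200
```

In this sketch the first run is stopped at step 250, so the restarted run repeats only the 50 steps done since the checkpoint written at step 200. Writing to a temporary file and renaming it is one way to avoid leaving a half-written checkpoint if the job is killed during a save.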