HIGH-PERFORMANCE COMPUTING

maintaining job priority in accordance with the site policy that the administrator has established for the amount and timing of resources used to execute jobs. Based on that information, the scheduler decides which job will execute on which compute node and when.

Understanding job scheduling in clusters

When a job is submitted to a resource manager, the job waits in a queue until it is scheduled and executed. The time spent in the queue, or wait time, depends on several factors including job priority, load on the system, and availability of requested resources. Turnaround time represents the elapsed time between when the job is submitted and when the job is completed; turnaround time includes the wait time as well as the job’s actual execution time. Response time represents how fast a user receives a response from the system after the job is submitted.

Resource utilization during the lifetime of the job represents the actual useful work that has been performed. System throughput is defined as the number of jobs completed per unit of time. Mean response time is an important performance metric for users, who expect minimal response time. In contrast, system administrators are concerned with overall resource utilization because they want to maximize system throughput and return on investment (ROI), especially in high-throughput computing clusters.

In a typical production environment, many different jobs are submitted to clusters. These jobs can be characterized by factors such as the number of processors requested (also known as job size, or job width), estimated runtime, priority level, parallel or distributed execution, and specific I/O requirements. During execution, large jobs can occupy significant portions of a cluster’s processing and memory resources.

System administrators can create several types of queues, each with a different priority level and quality of service (QoS).
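The metrics defined above (wait time, turnaround time, and throughput) follow directly from per-job timestamps. The sketch below is illustrative only; the record layout and time values are assumptions, not part of any particular resource manager.

```python
# Sketch: computing the scheduling metrics described above from
# per-job timestamps. Times are in arbitrary units (here, minutes).

def job_metrics(submit, start, finish):
    """Return (wait time, turnaround time) for one job."""
    wait = start - submit          # time spent waiting in the queue
    turnaround = finish - submit   # wait time plus actual execution time
    return wait, turnaround

def throughput(finish_times, window):
    """Jobs completed per unit of time over an observation window."""
    return len(finish_times) / window

# Hypothetical workload: (submit, start, finish) for three jobs
jobs = [(0, 5, 20), (2, 20, 30), (4, 30, 33)]
waits = [job_metrics(*j)[0] for j in jobs]
turnarounds = [job_metrics(*j)[1] for j in jobs]
mean_wait = sum(waits) / len(waits)   # (5 + 18 + 26) / 3
mean_turnaround = sum(turnarounds) / len(turnarounds)
```

Mean response time, the user-facing metric, is minimized by starting jobs sooner; throughput, the administrator-facing metric, is maximized by keeping processors busy, which is the tension the article returns to below.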
To make intelligent schedules, however, schedulers need information regarding job size, priority, expected execution time (indicated by the user), resource access permission (established by the administrator), and resource availability (automatically obtained by the scheduler).

In high-performance computing clusters, the scheduling of parallel jobs requires special attention because parallel jobs comprise several subtasks. Each subtask is assigned to a unique compute node during execution, and nodes constantly communicate among themselves.

[Figure 1. Typical resource management system: users 1 through n submit jobs to the resource manager’s job queue; an external or internal job scheduler sends scheduling information to the resource manager, which assigns jobs to compute nodes 1 through n.]

The manner in which the subtasks are assigned to processors is called mapping. Because mapping affects execution time, the scheduler must map subtasks carefully. The scheduler needs to ensure that nodes scheduled to execute parallel jobs are connected by fast interconnects to minimize the associated communication overhead. For parallel jobs, the job efficiency also affects resource utilization. To achieve high resource utilization for parallel jobs, both job efficiency and advanced scheduling are required. Efficient job processing depends on effective application design.

Under heavy load conditions, the capability to provide a fair portion of the cluster’s resources to each user is important. This capability can be provided by using the fair-share strategy, in which the scheduler collects historical data from previously executed jobs and uses the historical data to dynamically adjust the priority of the jobs in the queue.
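The fair-share idea just described can be sketched as follows. The priority formula, decay weight, and usage figures are illustrative assumptions; real schedulers use more elaborate (often decaying, multi-level) usage accounting.

```python
# Sketch of fair-share priority adjustment: each user's share of
# historical resource usage lowers the priority of that user's queued
# jobs, so heavy users do not crowd out light users. The linear
# penalty and the weight value are assumptions for illustration.

def fair_share_priority(base_priority, user_usage, total_usage, weight=10.0):
    """Scale a job's priority down as its owner's share of past usage grows."""
    if total_usage == 0:
        return base_priority
    share = user_usage / total_usage        # user's fraction of historical CPU time
    return base_priority - weight * share   # heavier users sink in the queue

# Historical CPU-hours per user, collected from previously executed jobs
usage = {"alice": 900.0, "bob": 100.0}
total = sum(usage.values())

# Two queued jobs with equal base priority: the lighter user's job
# ends up first when the queue is ordered by adjusted priority.
queue = sorted(
    ["alice", "bob"],
    key=lambda user: fair_share_priority(50.0, usage[user], total),
    reverse=True,
)
```

Because the usage history changes as jobs complete, recomputing these priorities periodically is what makes the adjustment dynamic.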
The capability to dynamically make priority changes helps ensure that resources are fairly distributed among users.

Most job schedulers have several parameters that can be adjusted to control job queues and scheduling algorithms, thus providing different response times and utilization percentages. Usually, high system utilization also means high average response time for jobs; as system utilization climbs, the average response time tends to increase sharply beyond a certain threshold. This threshold depends on the job-processing algorithms and job profiles. In most cases, improving resource utilization and decreasing job turnaround time are conflicting considerations. The challenge for IT organizations is to maximize resource utilization while maintaining acceptable average response times for users.

Figure 2 summarizes the desirable features of job schedulers. These features can serve as guidelines for system administrators as they select job schedulers.

Using job scheduling algorithms

The parallel and distributed computing community has put substantial research effort into developing and understanding job scheduling algorithms. Today, several of these algorithms have been implemented in both commercial and open source job schedulers. Scheduling algorithms can be broadly divided into two classes: time-sharing and space-sharing. Time-sharing algorithms divide time on a processor into several discrete intervals, or slots. These slots are then assigned to unique jobs. Hence, several jobs at any given time can share the same compute resource. Conversely, space-sharing algorithms give the requested resources to a single job until the job completes execution. Most cluster schedulers operate in space-sharing mode.

Common, simple space-sharing algorithms are first come, first served (FCFS); first in, first out (FIFO); round robin (RR); shortest job first (SJF); and longest job first (LJF).
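The space-sharing model can be sketched with the simplest of these policies, FCFS: each job receives its requested processors in arrival order and holds them until it finishes. The job tuples and cluster size below are illustrative assumptions, and arrival times are simplified to queue order.

```python
# Sketch of FCFS space-sharing scheduling on a fixed-size cluster.
# Jobs are (name, processors requested, runtime); they start in strict
# queue order, each waiting until enough processors are free.

def fcfs_schedule(jobs, total_procs):
    """Return {job name: start time} for an FCFS space-sharing schedule."""
    running = []            # (finish time, processors held) for executing jobs
    free = total_procs
    clock = 0
    starts = {}
    for name, procs, runtime in jobs:
        while free < procs:             # wait for earlier jobs to release processors
            running.sort()
            finish, released = running.pop(0)
            clock = max(clock, finish)
            free += released
        starts[name] = clock
        running.append((clock + runtime, procs))
        free -= procs
    return starts

# Hypothetical eight-processor cluster: the wide job C cannot start
# until both A and B have finished, even though B is short.
jobs = [("A", 4, 10), ("B", 4, 2), ("C", 8, 1)]
print(fcfs_schedule(jobs, 8))   # {'A': 0, 'B': 0, 'C': 10}
```

Even this tiny example shows the FCFS weakness the article goes on to discuss: a wide job at the head of the queue leaves processors idle that later, smaller jobs could have used.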
As the names suggest, FCFS and FIFO execute jobs in the order in which they enter the queue. This is a very simple strategy to implement, and it works acceptably well with a low job load. RR assigns jobs to nodes as they arrive in the queue in a cyclical, round-robin manner. SJF periodically sorts the incoming jobs and executes the shortest job first, allowing short jobs to get a good turnaround time. However, this strategy may cause delays for the execution of long (large) jobs. In contrast, LJF commits resources to the longest jobs first. The LJF approach tends to maximize system utilization at the cost of turnaround time.

Reprinted from Dell Power Solutions, February 2005. Copyright © 2005 Dell Inc. All rights reserved.

Basic scheduling algorithms such as these can be enhanced by combining them with the use of advance reservation and backfill techniques. Advance reservation uses execution time predictions provided by the users to reserve resources (such as CPUs and memory) and to generate a schedule. The backfill technique improves space-sharing scheduling. Given a schedule with advance-reserved, high-priority jobs and a list of low-priority jobs, a backfill algorithm tries to fit the small jobs into scheduling gaps. This allocation does not alter the sequence of jobs previously scheduled, but improves system utilization by running low-priority jobs in between high-priority jobs. To use backfill, the scheduler requires a runtime estimate of the small jobs, which is supplied by the user when jobs are submitted.

Figure 2. Features of job schedulers:

- Broad scope: The nature of jobs submitted to a cluster can vary, so the scheduler must support batch, parallel, sequential, distributed, interactive, and noninteractive jobs with similar efficiency.
- Support for algorithms: The scheduler should support numerous job-processing algorithms, including FCFS, FIFO, SJF, LJF, advance reservation, and backfill. In addition, the scheduler should be able to switch between algorithms and apply different algorithms at different times, or apply different algorithms to different queues, or both.
- Capability to integrate with standard resource managers: The scheduler should be able to interface with the resource manager in use, including common resource managers such as Platform LSF, Sun Grid Engine, and OpenPBS (the original, open source version of Portable Batch System).
- Sensitivity to compute node and interconnect architecture: The scheduler should match the appropriate compute node architecture to the job profile; for example, by using compute nodes that have more than one processor to provide optimal performance for applications that can use the second processor effectively.
- Scalability: The scheduler should be capable of scaling to thousands of nodes and processing thousands of jobs simultaneously.
- Fair-share capability: The scheduler should distribute resources fairly under heavy load conditions and at different times.
- Efficiency: The overhead associated with scheduling should be minimal and within acceptable limits. Advanced scheduling algorithms can take time to run. To be efficient, the scheduling algorithm itself must spend less time running than the expected saving in application execution time from improved scheduling.
- Dynamic capability: The scheduler should be able to add or remove compute resources to a job on the fly, assuming that the job can adjust to and utilize the extra compute capacity.
- Support for preemption: Preemption can occur at various levels; for example, jobs may be suspended while running. Checkpointing (the capability to stop a running job, save the intermediate results, and restart the job later) can help ensure that results are not lost for very long jobs.

[Figure 3. Job scheduling algorithms: (a) jobs waiting in the queue; (b) the queue after each priority group is sorted according to execution time, with high-priority and low-priority jobs sorted separately; (c) longest job first schedule; (d) longest job first and backfill schedule; (e) shortest job first schedule; (f) shortest job first and backfill schedule. Each schedule panel plots compute nodes against time.]

Figure 3 illustrates the use of the basic algorithms and the enhancements discussed in this article.
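The admission test at the heart of backfill can be sketched as follows. This is a simplified, aggressive-backfill-style condition under stated assumptions (a single upcoming reservation, a flat processor count); the function name and parameters are hypothetical, not taken from any production scheduler.

```python
# Sketch of the backfill condition described above: a low-priority job
# may jump ahead only if it fits on idle processors and does not delay
# the advance-reserved high-priority job. Runtime estimates come from
# the user at submission time, as the article notes.

def can_backfill(job_procs, job_runtime, now,
                 free_procs, reservation_start, reservation_free):
    """True if the job can start now without delaying the reservation.

    reservation_free: processors still unclaimed once the reserved
    high-priority job begins at reservation_start.
    """
    if job_procs > free_procs:
        return False                      # not enough idle processors now
    if now + job_runtime <= reservation_start:
        return True                       # finishes before the reservation begins
    return job_procs <= reservation_free  # or never touches the reserved processors

# Hypothetical eight-processor cluster: 4 processors idle until a
# reservation at t=10 claims all 8.
assert can_backfill(4, 5, 0, 4, 10, 0) is True    # short job fits in the gap
assert can_backfill(4, 20, 0, 4, 10, 0) is False  # would delay the reservation
```

This is exactly the effect visible in the backfill panels of Figure 3: small low-priority jobs slide into gaps left by the reserved jobs, raising utilization without reordering the high-priority schedule.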
Figure 3a shows a queue with 11 jobs waiting; the queue has both high-priority and low-priority jobs. Figure 3b shows these jobs sorted according to their estimated execution time.

The example in Figure 3 assumes an eight-processor cluster and considers only two parameters: the number of processors and the estimated execution time. This figure shows the effects of generating schedules using the LJF and SJF algorithms with and without backfill techniques. Sections c through f of Figure 3 indicate that backfill can improve schedules generated by LJF and SJF by increasing utilization, decreasing response time, or both. To generate the schedules shown, the low- and high-priority jobs are sorted separately.

Examining a commercial resource manager and an external job scheduler

This section introduces scheduling features of a commercial resource manager, Load Sharing Facility (LSF) from Platform Computing, and an open source job scheduler, Maui.