Chapter 2: Theory review - Part one
Nichols' butler [NIC87] takes the simple index concept to its extreme. The CPU run-queue length metric is used; however, a node is considered busy when even a single task is running. In this sense the load index is reduced to a binary switch.

In Remote Unix [LIT87] a processing node is considered available to host remote tasks if it has no local tasks executing. This is a simple deviation from the CPU run-queue length metric.

Monash [LS95] uses the CPU run-queue length as its load index. The load level at a specific time is predicted based on the load statistics for the same time during the past week.

In Sprite, a node is considered idle if there has been an average of less than 1 runnable process over the last minute and no user interaction in the last 30 seconds [DOU91].

In Condor [SKS92], workstations are considered idle if there has been no user activity over a short period of time and there are no background tasks waiting to execute.

In DAWGS [CM92], workstations are considered idle if no users are logged in at the specific workstation and no computationally intensive processes are running.

Another popular CPU-load index is the CPU utilisation index. Schemes that base load sharing decisions on this index alone include V [SKS92] and LYDIA [UH97]. The CPU utilisation index is used in combination with memory availability in Stealth [WKL93], and in combination with disk-space availability in LOADIST [SSK95]. Related to the CPU utilisation metric, [SS95] proposes a 'free-time' load index, which is a prediction of the amount of CPU idle-time in a future time-window. This is used as an indication of nodes' abilities to satisfy the deadline of an arriving task.

In addition to the implementations discussed above, the simple index approach is also advocated in a number of other works, which include:

• [BAN93] considers multiple-resource nodes. It is found that the benefits of load sharing diminish as node complexity increases. The weakness in this approach is the requirement that all resources be idle before a task transfer can occur.


• [DAN95] finds the CPU run-queue length to be the best load metric. The findings are, however, based on simulations in which all tasks are identical.

• [LD97] uses the CPU run-queue length metric in simulations in which all tasks are identical.

Transition

There is a considerable body of research that supports the argument that simple load indexes are insufficient because they do not contain enough information on which to base load sharing decisions. [GC93] and [KL88] state that the CPU queue alone is insufficient as a load index. [MW91] identifies that the traditional approach of using only a CPU load metric is appropriate only for scheduling compute-intense tasks. [LO86] and [ML87] stress the importance of workload analysis in determining load indices.

The authors of Gammon [BW89] recognise the limits of the single-metric approach and the likelihood that additional metrics to represent the load on other resources would be beneficial.

[FZ87] identifies that each resource at a processing node can be differently loaded and shows that significant improvement over the Unix load index (CPU run-queue length averaged over 1 minute) can be obtained if more representative load metrics are used. The approach taken is to make a linear combination of a number of resource queues. This, however, still results in a single metric, and the component factors cannot be used separately by a load sharing policy.

History [SVE90] uses knowledge of the previous execution times of tasks when making task transfer decisions. The scheme incorporates a load sharing filter that generates runtime statistics for each task that is executed. Tasks that are expected to be short-lived are always executed locally. The aim is to avoid inefficient task transfers, as the benefits of executing a short-lived process remotely are often outweighed by the costs of migration. Other than execution-time, no other task characteristics are considered. The load index used is the CPU run-queue length.

Complex indexes

There is a trend that will be referred to as the rich-information approach. This school of thought favours the use of load indexes which consist of a number of metrics. Each metric


will usually be concerned with a single resource. Typically the CPU run-queue length metric is used with other metrics to make up the load index. In addition, the characteristics of tasks may be taken into account to some extent. [CHO90] states that the performance of load sharing improves as the amount of information used is increased, up to some optimal level.

It should be noted that some 'load' metrics actually refer to the availability of a resource rather than the load on the resource. Such metrics are more suitable for use with resources such as memory, because the amount of memory available is more significant in determining the behaviour of a process than is the amount of memory in use.

The load index used by Freedman's PMS scheme [NUT94] is the CPU run-queue length together with a condition that no interactive tasks are present, similar to the indexes used in Sprite and DAWGS. The scheme uses information concerning specific applications, but this information must be supplied by the user [TS93].

The load index used in MOSIX [BL85] comprises the CPU run-queue length and the amount of free memory. Task characteristics considered are the elapsed runtime, and whether a task is using shared memory. A residency rule ensures that processes execute at the current location for a certain minimal amount of time prior to migration, thus reducing the likelihood of wasteful transfers involving short-lived processes [BGW93]. Tasks using shared memory do not migrate.

San Luis [AEF97] uses a load index based on the performance-weighted CPU run-queue length, the amount of free memory, the disk traffic level and the network traffic level. The memory requirements and the desired response-time of tasks are taken into account in scheduling decisions.

Utopia [ZZW93] uses a load index which incorporates: CPU run-queue length, available memory, disk transfer rate, the amount of swap disk-space available and the number of concurrent users.

LSF [LSF97] uses the same metrics as Utopia, with three additions: CPU utilisation, paging rate, and the amount of idle time at processing nodes.
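The rich-information approach can be illustrated with a short sketch. The Python fragment below gathers a small set of per-resource metrics of the kind used by MOSIX, San Luis and Utopia; the particular metrics chosen, and the use of the psutil library, are illustrative assumptions rather than details of any of the schemes described above.

    # A minimal sketch of a composite ("rich-information") load index.
    # Metric choice is an illustrative assumption, not taken from any
    # of the surveyed schemes.
    import os
    import psutil  # third-party; assumed available as a source of kernel statistics

    def composite_load_index():
        run_queue = os.getloadavg()[0]                 # 1-minute run-queue length
        cpu_util  = psutil.cpu_percent(interval=0.1)   # percentage CPU utilisation
        free_mem  = psutil.virtual_memory().available  # bytes of free memory
        free_swap = psutil.swap_memory().free          # bytes of free swap space
        # The component metrics are kept separate so that a load sharing policy
        # can inspect each resource individually, rather than collapsing them
        # into a single scalar (the weakness noted for [FZ87] above).
        return {
            "run_queue": run_queue,
            "cpu_util": cpu_util,
            "free_mem": free_mem,
            "free_swap": free_swap,
        }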


R-shell [ZF87] uses a load index consisting of the CPU, paging system and I/O queues, summed and averaged over 4 seconds. Task characteristics are considered to the extent that if local resources are directly accessed the task is tagged 'immobile' and prevented from migrating.

2.3.2.2 Dissemination techniques

Load information must be disseminated between nodes so that task transfer decisions can be made. There are three approaches that can be taken to achieve this.

1. Demand driven. Each node collects information when it needs it to make a load sharing decision. The main advantage of this technique is that load information is exchanged only when it is required. The main disadvantage is that delay is introduced into the scheduling decision, as the polling node must wait for at least one reply.

2. Periodic. Information is disseminated or collected at regular intervals. This is simple to implement. However, it is important to determine the most appropriate dissemination period, as overheads due to periodic communication increase system load and reduce scalability. In [TL89] it is suggested that the highest frequency that should be used is once every 10 seconds. This value is chosen as a compromise because state updates occur infrequently at idle nodes and short-term fluctuations at busy nodes need to be smoothed out by using time-averaged load values.

3. State-change driven. Nodes issue information about their load-state only when it changes by a certain amount. Examples include the State-Change Broadcast algorithm (STB) modelled in [LM82], REM [SCT88] and V [SKS92]. Determining the threshold value is problematic: the policy must be sensitive to significant changes but not to minor fluctuations. State-change policies generally have lower communication rates than periodic policies. However, if the state at a particular node does not change for a long period of time, the information held about that node will become stale. Aged load-state information is unreliable since there is no way of telling whether the node has crashed or has simply not sent a message because its state is steady. A newly joining node will not receive information concerning steady-state nodes, even if those nodes are suitable transfer partners. One way to improve the basic state-change policy is to introduce additional dissemination messages which are sent if the load-state does not change for a long period of time. [KK92] suggests a combination of event-driven and slow periodic update dissemination.
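A short sketch of the third approach, combined with the slow periodic refresh suggested by [KK92], is given below. The threshold value, the refresh interval and the read_load_index and broadcast helpers are illustrative assumptions rather than parameters taken from any of the cited schemes.

    # A minimal sketch of state-change-driven dissemination with a slow
    # periodic refresh to guard against stale steady-state information.
    import time

    CHANGE_THRESHOLD = 0.5    # minimum change in the load index worth reporting
    REFRESH_INTERVAL = 60.0   # seconds; assumed "slow" periodic update period

    def disseminate(read_load_index, broadcast, poll_interval=1.0):
        last_sent_value = None
        last_sent_time = 0.0
        while True:
            load = read_load_index()
            now = time.time()
            changed = (last_sent_value is None or
                       abs(load - last_sent_value) >= CHANGE_THRESHOLD)
            stale = (now - last_sent_time) >= REFRESH_INTERVAL
            if changed or stale:
                broadcast(load)          # send the load-state to the other nodes
                last_sent_value = load
                last_sent_time = now
            time.sleep(poll_interval)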


[ELZ86A,FIN90,DAN95]. If no suitable partner has been found when this limit is reached, the task is executed locally.

The probability of a successful poll (the hit ratio) is dependent on the load level in the system; no number of polls guarantees a hit. In [FIN90] a poll limit of 7 is found to outperform a poll limit of 3. However, in [MTS90] it is found that there is little or no benefit achieved by increasing the poll limit beyond 3 or 4. [ELZ86A] finds that small probe limits, such as 3, are appropriate as they return most of the benefits of larger values, at lower cost.

Polling works best when there is a high probability of finding a transfer partner with few polls. In sender-initiated schemes this occurs at low system-wide load levels, whereas receiver-initiated approaches work best at high system-wide load levels. (A sketch of such a poll-limited search is given at the end of this subsection.)

2. Broadcast polling. A request for information is sent to all nodes simultaneously. This technique can be used in both centralised and distributed demand-driven policies. A distributed example, "poll when idle" (PID), is found in [LM82].

3. Multicast polling. Nodes are grouped into clusters. A request for information is sent to all nodes in a given cluster, simultaneously. This technique can be used in both centralised policies, e.g. Process Server, and distributed demand-driven policies, e.g. NEST [AE87].

4. Broadcast. Load-state information is sent to all nodes simultaneously. This technique is used in periodic policies, e.g. Zhou's DISTED algorithm [ZHO88], and state-change policies, e.g. ASLB [HLK94], V [DOK91] and [KOW97]. [LM82] introduced a variant, "broadcast when idle" (BID).

Broadcast dissemination techniques are particularly efficient in schemes where all nodes participate in load sharing. Broadcast is less efficient where only a subset of nodes is involved, as the other nodes are interrupted unnecessarily. A disadvantage of broadcast dissemination is that instability can occur, caused by several busy nodes simultaneously transferring load to a node that advertises itself as low-loaded. This can lead to swamping (flooding) of the node and can cause it to initiate transfers of its own.


5. Multicast. Nodes are grouped into clusters. Load-state information is sent to all nodes within a specific cluster simultaneously. This technique is suitable for use in periodic and state-change distributed policies.

Schemes that use centralised dissemination include Utopia, LOADIST and Stealth. Condor [LLM88] uses a co-ordinator that polls each workstation every two minutes. Local schedulers find exchange partners by contacting the co-ordinator. Sprite [DO91], one of the few receiver-initiated schemes, uses a co-ordinator; workstations that become underloaded inform the co-ordinator. In Process Server [HAG86], servers are organised in clusters and each cluster has a controller. Clients contact the controller when a server is needed. Zhou's GLOBAL algorithm [ZHO88] uses a co-ordinator that collects load-state values from all nodes periodically. The information is assembled into a load vector and broadcast to all nodes. Nichols' butler [NIC87] implements a global registry containing details of the nodes that are currently free.

De-centralised dissemination is used in a number of schemes:

[KOW97] uses state-change broadcasts to send out load information from an overloaded node. The sender then waits for replies from all nodes that are suitable bidders, i.e. capable of accepting some of the load. However, it is difficult to know how long to wait for replies since it is not known how many nodes are suitable bidders at any time. For simplicity in the simulation, it is assumed that all nodes are bidders and the sender thus waits for all replies. Such an approach introduces delay into the scheduling of tasks, and this delay is related to the number of nodes present, thus limiting the extensibility of the approach. V [DOK91] uses a decentralised state-change-driven information policy in which each node broadcasts its state after a significant change.

In ASLB [HLK94] and DAWGS [CM92] idle nodes use broadcast messages to advertise their availability to accept tasks. In Zhou's well-cited DISTED algorithm, each node periodically broadcasts its load-state vector to all nodes in the system. NEST [AE87] uses both server advertisement, in which idle nodes broadcast their availability, and multicast polling, in which client nodes request the availability of idle nodes. One further distributed approach is found in LYDIA [UH97], in which the shared file system is used: each node keeps its load-state vector in local storage that is mapped into the global file space.
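The poll-limited, demand-driven search referred to earlier in this subsection can be sketched as follows. The poll limit of 3 follows the small probe limits recommended in [ELZ86A]; the acceptance threshold and the query_load helper are illustrative assumptions.

    # A minimal sketch of sender-initiated, demand-driven polling with a poll limit.
    import random

    POLL_LIMIT = 3
    ACCEPT_THRESHOLD = 1.0   # remote run-queue length below which a node accepts work

    def find_transfer_partner(candidate_nodes, query_load):
        """Poll randomly chosen nodes until one is lightly loaded or the poll
        limit is reached. Returns a node identifier, or None (run locally)."""
        for node in random.sample(candidate_nodes,
                                  min(POLL_LIMIT, len(candidate_nodes))):
            if query_load(node) < ACCEPT_THRESHOLD:
                return node
        return None   # no suitable partner found: execute the task locally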


2.3.3 Selection policy

The role of this policy is to select tasks for transfer. In sender-initiated schemes busy nodes choose tasks to transfer to another node, whereas in receiver-initiated schemes lightly loaded nodes inform potential senders of the types of task they are willing to accept. In addition, the policy determines how much load, or how many tasks, to transfer. If the load transfer mechanism is non-preemptive, the selection policy is restricted to choosing a task that has not yet started to execute. This usually only applies to newly arriving tasks. In this case the decision is limited to whether the task should be transferred or not. The two factors involved in this decision are: 1) the suitability of the task for transfer (for example, PMS and Freedman-Sharp do not select interactive tasks), and 2) the load level of the local node relative to other nodes in the system. Where preemptive transfers are used, the selection policy can select any task at any time.

Certain task-types may not be suitable for migration, such as those that perform local services or use local resources that are themselves immobile. Centralised services are usually placed at well-known locations and are therefore poor candidates for re-location. The fact that some tasks cannot be re-located, and that some are more suitable candidates for re-location than others, implies that the selection policy should be able to distinguish between different types of process in the system, and especially between operating-system components and non-operating-system components.

Several works have shown that most of the benefits of load sharing can be obtained by moving a small proportion of the load [ZF87,KKM93]. However, the load that is moved must be carefully selected. Schemes that follow this approach include History, which filters out short-lived tasks. Results from the use of History [SVE90] have shown that system-wide performance is better when more tasks are kept local. Where History is concerned, the selection is based only on the previous execution-time of the task, to prevent short-lived tasks from migrating. There is scope to extend the filtering approach by using additional information.

Selection policies differ in the extent to which they are automated and the amount of information taken into account:

The San Luis [AEF97] and V [DOK91] schemes support initial placement. Only newly arriving tasks are considered.
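The execution-time filtering idea used by History can be sketched as follows; the cutoff value and the per-command statistics store are illustrative assumptions rather than details of the actual History implementation.

    # A minimal sketch of an execution-time based selection filter in the
    # spirit of History [SVE90]. The cutoff and statistics store are assumptions.
    SHORT_LIVED_CUTOFF = 1.0   # seconds; below this, transfer is assumed wasteful

    past_runtimes = {}          # command name -> list of observed execution times

    def record_runtime(command, elapsed_seconds):
        past_runtimes.setdefault(command, []).append(elapsed_seconds)

    def eligible_for_transfer(command):
        """Select a task for possible transfer only if its observed executions
        suggest it will run long enough to repay the transfer cost."""
        history = past_runtimes.get(command)
        if not history:
            return False                     # unknown tasks stay local
        mean_runtime = sum(history) / len(history)
        return mean_runtime >= SHORT_LIVED_CUTOFF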


In LOADIST [SSK95] tasks are user-selected for transfer. Selections are subject to a simple security policy which maintains a list of valid users and a list of valid applications.

Both Condor [LLM88] and Sprite [OCD88] have selection policies that are manual when a user selects a task for remote execution, but automatic when remotely executing tasks are selected for eviction when necessary.

Stealth [SKS92] has a fully automated selection policy which selects tasks for transfer based on CPU and memory availability, and the previous success rates of transferring similar tasks under similar resource-availability conditions. The system is initially trained off-line for each task-type; a number of executions are required so that the scheme can learn whether the task is a good candidate for remote execution. As such it is probably the first load sharing scheme to have some learning ability. However, off-line learning under carefully controlled conditions, as used in Stealth, requires user intervention and wastes valuable processing time.

In MOSIX [BGW93] a process has to execute for a certain minimum amount of time before it can be selected for migration, reducing the likelihood of wasteful transfers involving short-lived processes.

GAMMON [BW89] selects the highest-priority task at the most loaded node for transfer. Apart from instantaneous priority, no other task-specific information is considered. This approach can work well if all tasks are equally suitable for transfer. However: 1) the reason for transferring a task from the busiest node is to even out the load imbalance and hence improve performance, which implies that selection should be primarily based on the amount of load each task is exerting on the node rather than its priority; 2) if any task is transferred away from a busy node, all of the tasks at the node benefit (not just the one that is transferred away). In fact, the task that is transferred could benefit least if preemptive transfers are used (as in GAMMON), since it will experience migration delay. This is a good argument for transferring a lower-priority task.

2.3.4 Transfer policy

This policy determines when load transfers should occur and the mechanism by which transfers take place. It uses the load-state information of nodes to determine whether the


local node should send some of its workload to other nodes, receive some extra work from other nodes, or keep its current workload.

Most transfer policies are threshold based. Typically a threshold will be used to determine whether the load index value signifies that the node is idle or busy, as in DAWGS [CM92]. More sophisticated policies incorporate a pair of thresholds so that a middle zone exists between the idle and busy states [GC93,GS93]. This prevents most of the instability that can occur due to load oscillation when only a single threshold is used.

In sender-initiated policies busy nodes are responsible for initiating a task transfer, whereas in receiver-initiated policies the transfer is initiated by the low-loaded node that is to receive the task. If only newly-arriving tasks are considered for transfer, sender-initiated policies are the simplest to implement. If the new task will increase the workload beyond some threshold then the node is considered to be a sender and starts to search for a transfer partner. The main drawback with this approach is that the burden of finding a transfer partner is placed on the busiest nodes, reducing their performance even further. This can lead to instability. Receiver-initiated algorithms tend to be stable because, as system load increases, the likelihood of a receiver finding a sender with just a few polls increases. For this reason, receiver-initiated policies have been found to outperform sender-initiated policies under heavy system-load conditions. At low system loads there will tend to be more receivers than senders, so receivers may waste cycles searching unsuccessfully for exchange partners. However, since receiver nodes must be low-loaded by definition, the loss of some cycles at these nodes is of low significance. The ease of implementation has most likely been a determining factor in making sender-initiated policies significantly more popular. A detailed comparison of sender-initiated and receiver-initiated policies can be found in [ELZ86B].

Schemes that are sender-initiated include History and LOADIST. Condor's centralised controller implements the transfer policy, deciding which workstations are receivers based on the periodic information it collects.

Schemes that are receiver-initiated include ASLB and Sprite. V implements a distributed 'relative transfer' policy in which the set of lightest-loaded nodes identify themselves as receivers.
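A short sketch of the two-threshold idea is given below; the threshold values are illustrative assumptions rather than the settings used in [GC93] or [GS93].

    # A minimal sketch of a two-threshold transfer policy with a middle zone
    # between the "receiver" (idle) and "sender" (busy) states.
    LOW_THRESHOLD  = 1.0    # below this the node offers to receive work
    HIGH_THRESHOLD = 3.0    # above this the node tries to send work away

    def transfer_role(load_index):
        """Classify the local node; nodes in the middle zone neither send nor
        receive, which damps the oscillation seen with a single threshold."""
        if load_index < LOW_THRESHOLD:
            return "receiver"
        if load_index > HIGH_THRESHOLD:
            return "sender"
        return "neutral"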


Symmetrically-initiated transfer policies have also been proposed. These support load transfers initiated by both busy and low-loaded nodes [LR93]. Symmetrically-initiated algorithms are more complex, but allow the advantages of both sender-initiated and receiver-initiated algorithms to be exploited. Symmetrically-initiated schemes are potentially unstable: there must be a zone between the activation thresholds for the sender and receiver parts of the algorithm so that a node cannot rapidly move between sender and receiver states. [MTS89] found that symmetrically-initiated policies outperform sender-initiated and receiver-initiated policies in the presence of small task-transfer delays. However, as the task-transfer delays were increased, the policies were found to perform almost identically.

2.3.4.1 Transfer mechanism

There are two distinct categories of task transfer mechanism: initial placement mechanisms, in which the task is transferred before execution begins, and preemptive mechanisms, in which executing processes can be relocated.

Initial placement mechanisms

These mechanisms are also referred to as remote execution mechanisms and non-preemptive transfer mechanisms.

Initial placement mechanisms are far less complex than preemptive mechanisms. However, they are also less flexible, since tasks are placed at their final execution site and cannot be further re-located. Thus resources cannot be released during execution except by termination of the process that is using them. Complexity and overhead are reduced since there is no state information to transfer. Reduced complexity means that non-preemptive schemes are considerably simpler to implement than preemptive schemes. Because of this, non-preemptive mechanisms are advocated in [KOW97].

[EAG88] finds that a considerable gain in performance over the no-load-sharing situation can be achieved through the use of load sharing schemes based on non-preemptive process migration.

Initial placement is used in numerous schemes, including Process Server, Coshell, Utopia, History, LOADIST, DAWGS, NEST and the San Luis scheme.
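A minimal sketch of initial placement is shown below. It assumes a shared (distributed) file system and password-less ssh access between workstations; the find_partner helper is assumed to be something like the poll-limited search sketched at the end of section 2.3.2.2. None of these details are taken from the schemes listed above.

    # A minimal sketch of non-preemptive (initial placement) transfer: the
    # task is started remotely before execution begins and never relocated.
    import subprocess

    def place_task(command, find_partner):
        """find_partner() returns a hostname, or None to keep the task local."""
        node = find_partner()
        if node is None:
            return subprocess.Popen(command, shell=True)     # execute locally
        # The executable and its files are reached through the shared file
        # system, so only the command line needs to be shipped to the node.
        return subprocess.Popen(["ssh", node, command])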


A distributed file system is usually employed to provide location-transparent execution of tasks and thus remove the need to transfer the executable file in the task-transfer message. [SSK95] uses initial placement achieved by direct transfer of the executable file prior to execution. The approach generates the same amount of network traffic as the distributed file system approach, but generates twice as much disk traffic, as the file is first written to disk and then read into memory from disk.

Preemptive mechanisms

Preemptive process migration allows tasks to be suspended, moved, and restarted at a new location in an ideally transparent fashion. Usually a task can be moved at any stage during its processing. It is usually possible for a task to move more than once.

Preemptive process migration is a considerably more complex and costly procedure than non-preemptive migration. This is due to the work involved in the saving and moving of process contexts.

The procedure for preemptive process migration is basically: 1) stop (freeze) the process so that its state is consistent whilst it is collected; 2) collect the process state (consisting of virtual memory, registers, open file handles etc.) and transfer it to the new location; 3) restart (unfreeze) the process at the new location. The main problem with this approach is that a process is suspended for the entire time that its state is transferred, which may be several seconds, or even minutes.

The delay caused by freezing can violate the rules of execution transparency. For example, users may notice the delay, messages or signals sent to the process may be lost, or communication partners may time out and report a failure. To overcome the problem of suspending the process for long periods, several schemes have modified the method of process transfer.

The V system [DOK91] uses pre-copying, in which the virtual memory pages are transferred while the process is still executing; the process is then frozen and any pages that have been updated since the start of the migration procedure are re-copied to the new location, after which the process is re-started at the new location. The benefit of this approach is that the suspension time is greatly reduced. One drawback is that the approach is less efficient, as some pages are transferred twice.
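The pre-copying sequence can be sketched as follows; the process and destination interfaces shown are illustrative assumptions and not the actual V system mechanism.

    # A minimal sketch of pre-copy migration: pages are copied while the
    # process keeps running, then the process is frozen and only the pages
    # dirtied in the meantime are re-sent, keeping the suspension short.
    def precopy_migrate(process, destination):
        # First pass: copy all pages while the process continues to execute.
        for page in process.all_pages():
            destination.send_page(page)
        # Freeze, then re-copy only the pages updated during the first pass.
        process.freeze()
        for page in process.dirtied_pages():
            destination.send_page(page)
        destination.send_context(process.cpu_context(), process.open_files())
        destination.restart()
        process.destroy()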


In Sprite [DO91], dirty pages are written to a file, whose name is then passed to the destination node. Since a transparent network file system is provided by Sprite, the file can be accessed as required. The benefit of this approach is that freezing time is reduced.

The Accent distributed operating system [OCD88] employs a lazy copying approach, in which a process is transferred to a new location and immediately re-started. Pages are transferred on demand when page faults occur. This reduces the delay and communication overhead caused by migration, since some pages are never referenced and the work of transferring the ones that are is spread over time after the initial migration. This approach has the advantage that freezing time is minimal, but has the disadvantage that an ongoing residual dependency is created on the originator node. Such dependencies may be tolerable between a pair of nodes if justified by a performance gain and/or greater execution transparency, but further migrations of the same process will compound the dependencies and increase the task's susceptibility to node failure.

Other schemes that use a preemptive process migration mechanism include MOSIX, PMS, the Eindhoven Multiprocessor System (EMPS) [DG92] and ASLB.

There is debate as to whether the additional gains to be made by the use of preemptive process migration justify its additional cost. It is quite clear that preemptive process migration does offer greater flexibility. In [KL88] Krueger and Livny conclude that in many cases preemptive process migration can achieve considerable additional improvement in performance over that achieved by process placement at initiation only, and that the additional costs associated with preemptive process migration can be justified. [HD96] finds that preemptive migration achieves a performance improvement of between 35% and 50% over initial placement. In contrast, several works have stated that the advantage of preemptive migration over non-preemptive migration can be small [EAG88,WKL93,BDM93]. Many factors, including the architecture of the system and its communications sub-system, the predominant granularity of tasks, task arrival and departure rates, the design of the load sharing scheme itself, and the level of performance heterogeneity between nodes, will all affect the relative performance of task transfer mechanisms. With so many variables it is not surprising that different opinions exist.


Besides performance, preemptive migration offers greater flexibility. Dynamic reconfiguration of a system is possible; tasks can be relocated for system maintenance or other reasons.

Preemptive eviction

A number of schemes incorporate special rules to protect the resource requirements of local workstation users. When local (interactive) use is detected, remotely executing tasks are suspended or removed to continue execution at another node.

In Remote Unix [LIT87], local-user activity causes remotely executing tasks to be checkpointed and immediately evicted, to be subsequently restarted elsewhere.

In Condor [LLM88], local-user activity causes remotely executing tasks to be suspended. If the user activity is detected for more than five minutes the task is evicted; otherwise execution is resumed.

In Sprite [DOU91], local-user activity causes remotely executing tasks to be returned to their originator node, from where they can be re-migrated.

In DAWGS [CM92], local-user activity causes remotely executing tasks to be suspended initially. If the user activity is sporadic, execution is resumed after a short period. Otherwise the task is evicted back to the source node, from which it may be re-migrated to another node.

Stealth [SKS92] uses initial placement but also has a preemptive capability. This is used to evict remotely executing tasks when they are being starved of resources by higher-priority local tasks.

2.3.5 Location policy

This policy is responsible for locating a transfer partner. Location policies can be distributed, with each node selecting a transfer location based on locally held information. Examples include DAWGS, LYDIA and MOSIX. Alternatively, policies can be devised having a central resource manager.


Remote Unix [LIT87] is based on a single central resource manager and local schedulers at each node. The central resource manager represents a central point of failure and no recovery mechanism has been implemented. Condor [LLM88], the successor of Remote Unix, also uses a centralised location policy; this time recovery is provided by starting a new co-ordinator at a different node. Resource allocation is centralised in Process Server: servers are organised in clusters and each cluster has a controller [HAG86]. The same approach is used in Utopia [ZZW93], with the addition of an election algorithm to ensure that a controller always exists in each cluster.

In sender-initiated schemes, busy nodes attempt to locate transfer partners that have low load levels. In receiver-initiated schemes, low-loaded nodes attempt to locate a busy node from which to transfer work.

Implementations of location policies include:

• Random. A transfer partner is selected randomly; its load-state is ignored. This can result in useless task transfers when an already busy node receives extra work, but has been shown to provide performance improvements over no load distribution [ELZ86A,KK92]. The performance improvements stem from the fact that only busy nodes transmit load, whilst all nodes are potential receivers. Random location policies work best when there are few heavily loaded nodes and many relatively idle nodes.

• Threshold. The decision is based simply on whether the node's load level is above or below a given threshold [MB91]. The first suitable node found is used [ELZ86A,CHO90]. The strength of this type of policy is that the number of polls required is usually low, and therefore decisions are made quickly. However, it makes no attempt to find the best partner. This approach can lead to marginal-gain transfers.

• Shortest. A group of nodes are polled in turn to determine whether each is a suitable transfer partner. In sender-initiated schemes, the node with the lowest load, below a threshold, is chosen. In receiver-initiated schemes, the node with the highest load is chosen. This approach considers both the load at the group of polled nodes and the load at the local node, and thus works on the basis of a global optimum. This implementation incurs greater overhead, and [ELZ86A] finds that it is only marginally better than a threshold policy. This conclusion is re-affirmed by independent study in [FIN90].


• Greedy. A group of nodes are polled in turn to determine whether each is a suitable transfer partner. No threshold is used; any node that has a lower load than the local node can be selected [CHO90]. This policy works on the basis of a local rather than global optimum [MB91]. This can result in marginal-gain transfers. However, Chowdhury finds the greedy policy better than the threshold policy under most circumstances.

• Global plans. This policy aims to maintain the load at each of the nodes in the system within a band, delta. The smaller this band, the more balanced the system. Scheduling is distributed but the policy works towards a global, rather than local, optimum, with each scheduler taking account of the load sharing actions of other schedulers [KAR95].

• Preferred list. Each node maintains a list of other nodes arranged in order of preference (according to expected load-state). Nodes are polled in this order. The first acceptable transfer partner found is used [SC89].

• Opposite domain. In the Flexible Load Sharing algorithm (FLS) [KKM93], each node maintains a domain of randomly chosen nodes that are currently in the opposite state, i.e. an overloaded node will maintain a domain of under-loaded nodes. Exchange partners are selected from this domain when required.

2.3.6 Design goals

A wide variety of design goals are used; these fall into a number of categories:

1. Maximise the efficiency of resource utilisation. Condor is designed to maximise the utilisation of workstations. Freedman-Sharp attempts to keep nodes optimally loaded, and attempts to avoid resource depletion by withholding some tasks until they can be processed efficiently. The emphasis of Sprite is to encourage resource sharing. PMS attempts to balance the CPU load average across the system. Monash has several design goals, which include facilitating the efficient sharing of resources and system workload, and reducing the development time for distributed applications. GAMMON aims to minimise the occurrence of the wait-while-idle scenario. The performance goal stated in [BG93] is to maximise processor utilisation and the throughput of the system.

2. Minimise mean task response-time. The San Luis scheme aims to support a number of different goals, depending on a particular system's objectives. Potential goals include task response-time minimisation and system throughput maximisation. [EG89]


defines the goal of load sharing as "to reduce the mean response-time by redistributing the workload among the set of processes in the network". The authors suggest that under heavy loads a "keeping idle processors busy" approach (a form of load sharing) outperforms load balancing. The differences between load sharing and load balancing have been discussed in chapter 1.

3. Permit access to remote resources. The primary design goal of Process Server is to facilitate access to a greater computing resource than that available at a single workstation.

2.3.7 Support for heterogeneity

As discussed in chapter 1, heterogeneity can occur in several forms in loosely-coupled distributed systems.

Architectural heterogeneity implies that different executable code is required at different nodes. This form of heterogeneity can be accommodated in a load sharing scheme in several ways, which include:

• Re-compilation from source code after migration. Monash [LS97] supports non-preemptive migration of programs and data in architecturally heterogeneous systems. To aid development across heterogeneous systems, the tool is implemented as an RPC-based client-server system. Support is restricted to applications for which source code or object code is available. The compilation introduces processing overheads and delay to the task's execution.

• The use of interpreted languages, which are not suitable for all applications and are generally less efficient than compiled languages.

• The use of hidden directories, as implemented in the LOCUS distributed operating system [BP85]. For each task a hidden directory contains executables for each architecture present in the system. Prior to task execution, the associated hidden directory is searched for the appropriate executable to suit the execution node's architecture.

Most load sharing schemes do not cater for architectural heterogeneity due to the complexity involved.

Performance heterogeneity means that nodes have the same architecture but different levels of resource, which leads to different performance. For example, nodes may have


different processor speeds, amounts of primary memory and disk space. As discussed in chapter 1, performance heterogeneity is common in loosely-coupled systems.

Load sharing schemes which do not provide specific support for this form of heterogeneity include Condor, Sprite, Gammon and ASLB.

There are a number of schemes which consider CPU-speed heterogeneity only; these include MOSIX [BS85], Process Server [HAG86] and Stealth. The most common approach is to represent the CPU performance of each node as a scalar value which is used to normalise the CPU load values. In Oklahoma [TN95] CPU-speed heterogeneity is supported by calibrating nodes against the speed of the fastest node. This approach is problematic if a new, faster node is subsequently added to the system. In Coshell [FOW93] and San Luis [AEF97] the MIPS rating of nodes is used to adjust for CPU performance heterogeneity.

Utopia [ZZW93] and LSF [LSF97] support disk performance heterogeneity in terms of the speed and size of disks. Memory-size heterogeneity is implicitly supported by measuring the amount of free memory at each node.

The Freedman-Sharp scheme depends on user-supplied information to cater for performance heterogeneity [TS93].

The SPICE butler [DH85] represents the load-state of each processing node as the amount of resource that is available. In this way the scheme implicitly accommodates performance heterogeneity.

2.3.8 Implementation

Implementation approaches can be divided into three categories:

1. Kernel-space implementations, in which the load sharing scheme is fully integrated with the operating system. This has a number of advantages which include: i) load sharing becomes a fundamental feature of the system; ii) all kernel-maintained information is readily available; iii) transparency is relatively easy to achieve since the process's interface with the operating system need not be altered. Disadvantages include development time and cost, and reduced portability. Examples include MOSIX, Sprite, V and EMPS.


2. User-space implementations, in which the load sharing scheme is completely transparent to the operating system. No modifications are required to the operating system. Existing utilities such as Unix's rup and rsh may be employed to simplify development. This approach has the advantage of portability and ease of installation. However, there can be difficulties obtaining values from kernel-held data structures, which may partially account for the fact that many schemes use very little load-state information. This approach may lead to duplication, in user-space modules, of existing kernel functionality. This may in turn lead to complex, inefficient schemes consisting of many interconnected modules. User transparency is harder to achieve and functionality is often lower than in kernel-space implementations. Examples include REM, Condor, PMS, Utopia, LSF, Freedman-Sharp, Calypso, LOADIST and LYDIA.

3. User-space implementations that require some kernel modifications. These schemes use a combination of user-space utilities and kernel hooks. The kernel hooks are usually required to achieve efficient access to kernel-held information concerning the load state of the node, and can often be implemented as additional system calls. An example is DAWGS.

2.3.9 Adaptive behaviour

Adaptive algorithms dynamically adjust their behaviour to suit the state of the system. Note that the ability to learn the resource requirements of tasks is not considered adaptive behaviour in this work.

There are a number of ways in which adaptive behaviour could be used in a load sharing scheme. These include:

• Modification of poll limit. Several works, including [ELZ86B], have concluded that a small number of polls is usually sufficient to locate a transfer partner. Whilst a low poll limit is efficient in general, under particular conditions it may be beneficial to increase the limit. For example, in a receiver-initiated scheme under low system-wide load conditions, it will be difficult to locate a sender node; increasing the poll limit will improve the chances of success.

The accuracy of polled information is related to message delay. As delay increases there comes a point where the information is so old that decisions based on it are no


better than random decisions [MTS89]. Under these circumstances polling activities should be suspended.

• Modification of load sampling frequency. The local load state is periodically sampled in most schemes. It may be beneficial to reduce the frequency of sampling when the node is very heavily loaded, to maximise efficiency.

• Modification of load dissemination rate. A number of schemes use periodic load dissemination. In sender-initiated schemes, under heavy system-wide load conditions it may be beneficial to reduce the frequency of transmission to reduce overheads, since it is unlikely that an idle node will be found.

• Alternation between sender-initiated and receiver-initiated operation to suit the ambient system workload level. Sender-initiated operation is known to be more efficient at low system-wide loads since it is easier to find a suitable receiver. Receiver-initiated operation works better at high system-wide loads since it is easier to find a sender.

[PD97] evaluated five load sharing algorithms, one of which was adaptive. This particular algorithm adjusts the granularity of sub-tasks sent to a particular node according to the response-time of the node, which changes according to its load level. The adaptive algorithm was not the best performer and the authors comment that adaptive algorithms may not work well for load sharing in some types of system. However, the investigation was limited to the performance of a single task-type (matrix multiplication).

[ELZ86B] compares adaptive sender-initiated and receiver-initiated location policies in which threshold values (the maximum number of tasks at a receiver node) are automatically switched between 1 and 2 to suit conditions. Each threshold value is advantageous under certain conditions. The adaptive policies are found to outperform fixed-threshold policies.

SMALL [MW91] uses an ANN to learn load balancing policies. The response-times of tasks are analysed under various load conditions and using different policies. The objective function is to place each task so as to minimise its response-time.

It is reasonable to expect that adaptive schemes should be able to outperform non-adaptive schemes in general. However, care must be taken, since the additional


complexity required to make a scheme adaptive could negate some of the potential benefit.

2.3.10 Performance

This is the most important criterion. The bottom line when comparing load sharing schemes is how well they perform. However, the design goals can differ significantly between schemes and it is hard to meaningfully compare the performance of dissimilar schemes.

Popular performance metrics include:

1. Task response-time, e.g. R-shell [ZF87]. Kremien and Kramer in [KK92] propose a standard measure of performance for load sharing algorithms, in terms of a percentage increase achieved over the no-load-sharing (baseline) performance:

    distance = 100 * (response-time(baseline) - response-time(load sharing)) / response-time(baseline)

2. System throughput. The aim is to complete as many tasks as possible in a given time-frame. Thus this goal is closely related to the minimisation of tasks' response-time.

3. The extent to which load is balanced. In [KAR95] the measure used is the size of the band, delta, of load values at nodes. Balance improves as delta is reduced.

4. [EG89] defines the success rate as simply the ratio of the number of successful transfers (a performance gain achieved) to the total number of transfers. This is a useful definition but fails to take into account the relative size of gains and losses.

5. [LLM88] introduces the concept of leverage as a performance measure. Leverage is the ratio of the remote processing capacity consumed by a transferred task to the processing capacity consumed at the originator node to support the remote execution. Compute-intense tasks generally achieve higher leverage than IO-intense tasks.

Comparison of performance results is problematic: results presented are often highly specific, involving one, or at most a few, different task-types. Test conditions are often


restricted and there are few results reported that relate to testing over a large number of tasks under general operating conditions. Additionally, the format in which results are presented and the aspect of the scheme which is being highlighted by the experiment can differ widely between works. To clarify the discussion, performance results are discussed in four categories:


Performance results expressed as application speedup values

[UH97] reports that LYDIA achieved a performance improvement of 30.5% for a Cholesky factorisation task in a loaded system. The performance improvement falls to zero for an unloaded system.

Results from Gammon [BW89] show an average performance improvement of 9.2% in a small system consisting of only 3 nodes.

Coshell [FOW93] has been shown to provide improvement factors of up to 5 in the performance of a parallel make utility.

[BGW93] presents the results of a few very specific experiments using MOSIX, in which quite impressive speedups were achieved. A large-grained parallel task running over six nodes achieved a speedup factor of 5.96 (best case). A distributed travelling-salesman solution achieved a speedup of 7.64 (best case) when executed in a system of 8 nodes.

Utopia [ZZW93] has achieved a speedup factor of 4.7 for a parallel make in a 5-node system, whilst a speedup of 15.8 was achieved in a 25-node system.

REM has achieved near-linear speedups for a parallel bubble-sort [SHO91].

Experimental results for Sprite [DO91] show speedups of up to a factor of 5 for large-grained, predominantly CPU-intensive parallel tasks in systems of up to 12 nodes.

LSF [LSF97] has achieved response-time improvements of up to 35%. In an extreme case an unspecified program was found to benefit from a speedup factor of 20.

Performance results presented as general system-performance improvements

Results show that the Condor system was able to reclaim 4771 hours out of 12438 hours of unused processing time in a system of 23 workstations during an observation period of unstated duration [LLM88]. The workload consisted of very long-lived tasks, such that a task which used 1 hour of CPU-time was considered relatively short.


The results presented in [WKL93] are concerned with the ability of Stealth to learn the suitability of tasks for remote execution. It is shown to be able to adapt, over about 50 executions, to "sudden" changes in task behaviour.

Monash [LS97]. Results are presented for the execution of three specific, but unidentified, programs across an architecturally heterogeneous system of 2 nodes. The results show that the performance of these tasks was enhanced. However, no detailed information concerning the experimental conditions, such as the actual function of the programs or the resource levels at the processing nodes, is given.

Claims rather than results

Although there are no results available, sales literature for the Freedman-Sharp scheme [TS93] makes some impressive claims, which include: "no matter how many jobs are submitted, computers never become overloaded", and "Load Balancer eliminates resource contention … super-linear speed-ups are not uncommon".

Performance-related findings

Results using History [SVE90] indicate that performance generally increases as more tasks are forced to stay local, the best results being achieved when 80% of tasks were executed at the arrival site. These results conflict with the results from R-shell [ZF87], in which performance was found to increase as fewer tasks were forced to stay local. However, this conflict could be due to differences in the experimental conditions used, especially in the workloads chosen.

Summary comment for performance results

The results presented, and the experimental methods used to generate them, are generally subjective. Such results are unsuitable for making safe comparisons between schemes.

In order to make valid comparisons, a universal objective testing method is required. However, due to the wide range of abilities and limitations of schemes, it would be difficult to devise a universal test.

Overhead


Directly related to performance is the extent of the overheads exerted by a scheme. Overhead arises in the form of processing and communication costs [KAR95]. The costs can be attributed to load measurement, load information dissemination, making scheduling decisions, and task transfers. In general, load sharing schemes should be designed so as to minimise overhead. [EG89] states that a good load sharing policy should minimise the use of control messages and avoid unfruitful task transfers.

[TL89] provides a definition of acceptable overheads for load sharing: the scheme should use less than 1% of the CPU cycles of any node, use less than 1% of the network bandwidth, and take less than 100 ms to select a transfer partner. The last condition is now quite dated due to improvements in processor technology and should be tightened accordingly.

Utopia's load index module (LIM) was found to account for less than 1% of total CPU activity in moderately sized configurations, and reached a maximum of 1.5% of total CPU activity in a simulation of a configuration consisting of 1000 hosts grouped into 20 clusters of 50 nodes each.

The overhead of R-shell is stated vaguely as being "between one and a few percent of CPU-time" [ZF87].

In [GK94] the processing time overhead for the gradient algorithm is considered acceptable at "less than 5%". This is considerably higher than other schemes. It could be argued that the absolute extent of overhead is unimportant so long as it is outweighed by proportionally greater performance benefits.

[CHO90] finds that load sharing is beneficial where the processing time overhead of a task transfer is less than 30% of the task's execution time. Where the processing time overheads exceed 40% of the task's execution time, load sharing performance was found to be worse than that of an unshared system.

2.3.11 Transparency

Transparency is the concept of hiding implementation detail. Distributed systems are complex, having many facets. This complexity leads to a number of different


requirements of transparency. These are described as forms of transparency, each relating to a specific way in which the concept must be applied.

Access transparency: The ability to access local and remote objects with the same operations. This is a fundamental requirement for any load sharing scheme. Access transparency is the most basic form of transparency, being required to implement almost all of the other forms.

Applications that access a particular object cannot be modified at run-time to access that object differently because either the process or the object has moved. It is possible to install a layer between the application and the kernel that can deal with access resolution, sending local requests to the local kernel and remote requests to the corresponding layer at some other node where the required object is located. This approach is common in distributed services; examples include CORBA and Sun's NFS.

Load sharing schemes that use the additional-layer approach include MOSIX, which is based on a three-layer kernel. The middle layer provides transparent communication between the microkernel layer at one node and the Unix emulation layer, possibly at a different node.

Sprite [OCD88] provides a transparent network file system, which allows local and remote resources to be accessed with the same operations.

Location transparency: The ability to access objects without knowledge of their location. Usually this implies naming transparency: names do not contain any location information. Complications arise if distributed applications have built-in location information. As with access transparency, location transparency is a prerequisite for the provision of most other forms of transparency.

Location transparency is supported in the Galaxy distributed operating system, which incorporates a single global name space in which objects' external names are constant and location independent [SMS91]. The LOCUS distributed operating system [WPE83] employs fully transparent names that contain no location information and therefore do not change when the associated object is re-located.
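The interposition-layer idea described above can be sketched as follows; the object registry and the injected local_invoke and remote_invoke callables are illustrative assumptions rather than the mechanism of any particular system.

    # A minimal sketch of an access-transparency layer: applications issue the
    # same call whether the object is local or remote, and the layer decides
    # where to route the request.
    class AccessLayer:
        def __init__(self, local_invoke, remote_invoke):
            self.local_invoke = local_invoke      # callable for local objects
            self.remote_invoke = remote_invoke    # callable(node, ...) for remote objects
            self.locations = {}                   # object name -> node, or None if local

        def register(self, obj_name, node=None):
            self.locations[obj_name] = node

        def access(self, obj_name, operation, *args):
            node = self.locations.get(obj_name)
            if node is None:
                return self.local_invoke(obj_name, operation, *args)
            # Forward to the corresponding layer at the node holding the object.
            return self.remote_invoke(node, obj_name, operation, *args)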


The implementation of process migration is greatly simplified in location-transparent systems, since names are not updated as a result of migration. The Eindhoven Multiprocessor System (EMPS) uses the concept of mailboxes to provide location-transparent IPC [DG92]. When a process is migrated, all IPC is re-directed simply by updating that process's mailbox connections; the mailbox itself does not move. This approach does, however, incur additional communication overhead. REM [SCT88] uses 'virtual' processes to act as communication agents for remotely executing processes.

In the Kerberos authentication system the location of a process has to be known at the application level because tickets contain internet addresses to prevent masquerading. Kerberos is thus incompatible with process migration, as the movement of a process would invalidate all its privileges.

Execution transparency: The requirement that the behaviour of a process should be independent of its execution location. The main implication is that the results produced by a process must be location independent. Execution transparency is essential to all load sharing schemes; it is the primary facilitator for the logical distribution of the processing power of the system.

Execution transparency implies that access to resources such as files should be location and access transparent. PMS [FRE91] relies on the presence of NFS and the Network Information Service (NIS) to ensure that files and user authentication are available throughout the system.

Condor evicts remotely executing tasks when user activity is detected [TL95,LLM88]. A remote task is suspended when user activity is first detected and incurs a five-minute delay, which can violate execution transparency, since IPC messages and signals cannot be received while the task is suspended and can be lost (performance and migration transparency could also be violated). Tasks can also be suspended in the DAWGS scheme. Processes that will not produce the correct results remotely, because they need access to some fixed local resource, must be prevented from migrating.


Migration transparency: The ability to move objects without affecting the operation of applications that use those objects. The definition can be extended in the case of processes by the requirement that the process is unaware of the migration. This form of transparency is a fundamental requirement for load sharing schemes that preemptively relocate processes, or that move files or data structures. However, it does not apply to non-preemptive transfers.

Sprite [DO91] provides migration transparency since post-migration processes continue to operate as though still on their home machines. The node at which a process enters the system is designated the home machine for that process and, once the process is migrated, location-dependent calls to system resources are passed back to the home machine in a manner that is transparent to the executing process.

MOSIX [BS85] provides a preemptive process migration mechanism. A virtual machine environment isolates a process and its environment from the processor in which it executes, and processes are responsible for handling their own affairs, including migration decisions. MOSIX provides partial migration transparency in the sense that although migration is visible to the process, the process will still produce the correct results after migration.

The Process Migration Subsystem (PMS) [FRE91] provides partially transparent preemptive process migration. There are limits in the type of process that can be supported since, although the memory image is preserved during migration, there is no support for the preservation of IPC, file or device handles.

Some schemes simply terminate remotely executing tasks when resources are reclaimed by local processes. Examples of this are Nichols' butler [NIC87] and the SPICE butler [DH85]. This violates migration transparency and is wasteful of resources, as the computation must be restarted elsewhere.

Replication transparency: Multiple copies of objects can be created without any effect of the replication being seen by applications that use the object involved. Replication can only realistically be used if it is transparent. Replication increases availability but is complicated by the need for consistency control where data objects are concerned. Naming


Naming transparency is required since replicas' names must hide details of their location and instance.

Replication is often used to implement failure transparency by having several copies of a process or data at different nodes. This technique is employed by several load sharing schemes including REM [SHO91], which keeps copies of each child process on multiple workstations. The replication is transparent: the first result received from a set of replicated processes is used, and the others are discarded when they arrive.

Failure transparency: Faults are concealed such that applications can continue execution without knowledge that a fault has occurred. Many load sharing schemes provide a degree of failure transparency by replicating processes at several nodes [AHM94,DTJ89], as in REM [SHO88], or by checkpointing executing processes so that, in the event of node failure, they can be re-started on another node, as in Condor [LLM88].

Calypso [BDK94] provides support for parallel applications which are executed as a set of worker processes. Failure transparency can be provided by allowing the number of workers specified to exceed the number needed, in which case the results of computations at the fastest workers will be used and other results discarded. If one worker should fail, the rest will complete the program.

Remote Unix is based on a single central resource manager and local schedulers at each node. The central resource manager represented a single point of failure for which no recovery mechanism was implemented.

The successor of Remote Unix, Condor [LLM88], deals with failure of its centralised coordinator node by starting a new coordinator at another node. This is a largely transparent solution, although there can be a period during which there is no coordinator, which will cause delay to processes awaiting scheduling.


Syntactic transparency: If a load sharing environment is to be transparent to users, there should be no special syntax required to initiate, or communicate with, a remote process. Syntactic transparency is one of the primary forms that constitute user transparency, and is a fundamental requirement for full automation.

Table 2.3.11-1 shows the special syntax required to initiate load sharing in a number of schemes. The use of such syntax requires that the user can decide which tasks are suitable for remote execution and can, to some extent, identify the conditions that make a transfer worthwhile.

Load sharing scheme    Command-line syntax
DAWGS                  dawgs
Freedman-Sharp         lbrun
LOADIST                loadist
Nichols' butler        rem
Sprite                 mig

Table 2.3.11-1  Special syntax required to initiate load sharing

Performance transparency: The ability of a system to maintain acceptable performance despite fluctuations in load level, task-arrival rate and prevalent task-type. A fundamental goal of any load sharing scheme should be to maximise performance according to some criterion. Load sharing should help smooth out peaks and troughs in performance.

This form of transparency also implies that the effect of one process on the performance of another should be minimal. Schemes including Condor and DAWGS, which suspend remotely executing tasks when a higher-priority local task is started, introduce significant delay to the remote tasks. In V, remotely executing tasks are given lower priority than locally initiated tasks, and remotely executing processes can be temporarily starved of resources. Stealth provides preemptive migration to resolve such situations and in this way strives to maintain performance transparency.

The Freedman-Sharp scheme holds back some tasks when all workstations are busy. These tasks are not allowed to start execution until sufficient resource becomes available. This approach is effective in maintaining the overall performance of the system, but can have a significant impact on the response-time of the delayed tasks.

[ZF87] identifies that the reduction in the standard deviation of task response-times is greater than the reduction in the mean response-time of tasks when load sharing is applied.


This indicates that load sharing increases consistency in tasks' response-times, and thus contributes to the performance transparency of a system.

Scaling transparency: The behaviour of a system should not change when the system, or some aspect of it, is expanded.

As distributed applications are scaled up, the communication requirements generally increase. The bandwidth limits of the communication subsystem are eventually reached, potentially causing serious performance degradation.

Efficient design of load sharing algorithms in terms of the communication aspects will aid extensibility. Partitioning nodes into groups or clusters has been proposed [EB94] in order to reduce communication and management overhead, thereby increasing extensibility.

Some schemes assume a pre-determined maximum scale of a system and incorporate such information into their design. For example, Process Server [HAG86] assumes a limit of 50 nodes and its algorithms take this figure into account.

The use of centralised components can cause scaling problems. For example, centralised location policies keep tables of information about the load state at each node, so that transfer partners can be assigned. Such tables will grow as the number of nodes increases, as will the number of requests for service. The time to process each request will grow due to the increased time needed to search the larger table.

Stability is another factor that affects scalability. Stable load sharing algorithms are more efficient since wasteful transfers, caused for example by oscillation, are avoided.

User transparency: In the context of load sharing, user transparency is provided if the user is unaware that load sharing is present. Total user transparency implies that users play no part in the load distribution function. There would be no need to register programs for migration, manually select tasks for transfer, recompile applications with additional libraries, or supply resource-requirement information when submitting tasks. Unfortunately this design goal is very hard to achieve and, as a consequence, most load sharing schemes lack this highly desirable quality to some extent.


There are two distinct types of user to consider: the programmer, who may need special functional support when developing applications for a distributed environment; and the end-user, who submits tasks to the system and requires prompt and correct results. From the programmer's viewpoint, syntactic transparency within programming constructs is an issue. There are several consequences of a lack of transparency to end-users, which include:

• load sharing may only be accessible to expert users who understand the scheme itself and the underlying principles of load sharing;
• users may be prompted for information which they cannot supply;
• users may make poor judgements concerning task behaviour or resource requirements;
• it is inefficient with respect to users' time.

If users are aware that load sharing is present (for example, in DAWGS [CM92] users are sent email when their remote task completes), they certainly should not be expected to understand the way in which load sharing works. The SPICE butler [DH85], for example, requires that users provide detailed information concerning the resource requirements of tasks.

Some load sharing schemes have a non-transparent user interface, but the mechanism may still be transparent. For example, the user may have to specify that a task should be executed remotely but may not be aware of the remote execution location or how it was selected.

The PMS scheme [FRE91] is one of a number of schemes in which processes must register in advance of load sharing. This is achieved by re-linking programs with a load sharing library prior to execution, which provides functionality for the registration and other related activities. Other schemes that require some or all programs to be re-linked in this manner include Utopia and LYDIA. The need for programs to be re-linked implies that such schemes can only be used with applications for which the source or object code is available. This can be quite limiting, and translates into additional work for programmers, systems administrators or users.


The Freedman-Sharp scheme [TS93] maintains a database of programs registered for load sharing, with their user-supplied configuration data, which includes path, memory requirements and priority.

Process Server [HAG86] requires that programmers explicitly invoke load sharing activity.

Condor [LLM88] is one of several schemes in which workstations are considered to have owners. This approach introduces a new aspect of user transparency, in that the owner of a workstation should not notice that their machine is being shared by others. Condor deals with this by giving the workstation owner's processes higher priority than remotely executing tasks, and by evicting remotely executing tasks when the owner reclaims their workstation (i.e. initiates a task). This aspect of user transparency is also raised in [KC91], in which it is suggested that ensuring workstation owners receive satisfactory service, regardless of how much delay is experienced by some processes, is more important than scheduling for global optimisation. The Stealth scheduler allocates whatever resources are required to a workstation owner's processes, and remotely executing processes get whatever is left.

2.3.12 General Discussion

This section deals with a number of important issues that are not addressed elsewhere in the evaluation.

Fairness

One of the goals of many load sharing schemes is to ensure fairness. This is interpreted and implemented differently in various schemes.

In DAWGS [CM92], fair is interpreted simply as: local processes have priority. There is no consideration of fairness between remotely executing tasks. A single node could be the source of many remote tasks, to the extent that another node that attempts to remotely execute a task fails to find a suitable receiver.

In Sprite, fairness is implemented in two ways. Firstly, Sprite gives workstation owners priority over their local resource. Secondly, Sprite implements a fairness policy between users submitting tasks for remote execution [DO91].


This policy is based on the number of workstations allocated to a given user for remote execution. If the central coordinator cannot find an idle node on which to execute a remote request, and there is a user that has absorbed a disproportionate number of remote nodes, one of that user's remotely executing tasks will be evicted back to its originator node to make room for the new task.

In Condor, fairness is also implemented at two levels. As with Sprite, workstation owners are given priority over their local resource. Additionally, Condor employs the up/down algorithm to ensure fairness between workstations submitting remote requests [SKS92]. This algorithm keeps a tally of the number of remote tasks executed from each workstation, incrementing the tally when a task is submitted and decrementing the tally when a task completes or is evicted. When remote resources are depleted and a new remote request is attempted, Condor will evict a task belonging to the workstation with the highest tally back to its originator node.

In Stealth, fairness is assured by prioritising resource allocation to local tasks so that workstation owners do not lose responsiveness while lending their machines for pool use [KC91]. This is similar to the approach taken in V [DOK91], in which remotely executing tasks are run at a lower priority than locally initiated tasks. This is in contrast to the approach taken in Condor and Sprite, where remotely executed tasks are given the same priority as locally initiated tasks but are migrated away when owner activity is detected at the workstation. The Sprite and Condor approaches to fairness could be interpreted as an unnecessary source of complexity which can cause wasteful migrations.

Load oscillation

Load oscillation occurs when tasks are repeatedly transferred from node to node [ELZ86A]. This causes instability, can significantly increase system workload, and can consume a considerable amount of network bandwidth. The tasks involved can be starved of processing time. A special case of this problem occurs when a task ping-pongs between a pair of nodes, as identified in [STA85]. Initial placement schemes only allow a task to be transferred once; such schemes naturally avoid this particular form of oscillation.

MOSIX applies a residency rule to reduce the occurrence of oscillation, only allowing a task to be transferred after it has executed locally for some minimum amount of time.
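A residency rule of this kind amounts to a simple eligibility check made by the local scheduler before a task is offered for transfer. The following sketch is illustrative only: the record structure, the threshold value and the use of wall-clock time are assumptions made here, not details of the MOSIX mechanism.

```python
import time

MIN_RESIDENCY_SECONDS = 5.0  # assumed threshold; a real scheme would tune this value

class TaskRecord:
    """Bookkeeping kept by the local scheduler for each resident task."""
    def __init__(self, task_id):
        self.task_id = task_id
        self.arrived_at = time.time()   # when the task last arrived at this node

    def eligible_for_transfer(self, now=None):
        """A task may only be moved after a minimum period of local execution,
        which damps ping-pong oscillation between a pair of nodes."""
        now = time.time() if now is None else now
        return (now - self.arrived_at) >= MIN_RESIDENCY_SECONDS

def candidates_for_transfer(task_table):
    """Return only those resident tasks that satisfy the residency rule."""
    return [t for t in task_table if t.eligible_for_transfer()]
```

Because the arrival time is reset whenever a task moves, every transfer (including one back to a previous host) restarts the residency clock, so a minimum interval always separates successive movements of the same task.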


In [JM93], time-stamps have been proposed to prevent a single process ping-ponging between nodes. This has the same effect as the MOSIX solution: the timestamps ensure a minimum amount of time passes between movements of any process.

A related issue is flooding, in which several nodes transfer tasks to the same initially low-loaded node, which becomes overloaded as a result. This in turn can cause oscillation as the now overloaded node may initiate transfers of its own.

Load oscillation is treated in detail in section 7.4.1.

Commercial success

Few schemes have become commercially available. These include Condor, Freedman-Sharp and LSF. There are no figures available to indicate the number of systems or users that have benefited from a load sharing implementation.

Load sharing has not been widely accepted to date, despite the significant potential benefits it offers. Limitations in schemes, especially in terms of transparency and flexibility, may have contributed to this slow take-up.


2.4 Conclusion for part 1

2.4.1 Summary of load sharing schemes

Table 2.4.1-1 summarises the key features of 28 load sharing schemes in chronological order. Column abbreviations are expanded in the key below the table.

Scheme            Date  Sched.  Dissem.  Load   Transp.  Flex.  Place.  Perf.het.  Impl.  Task chr.  Learn  Adapt.  Predecessor
Remote Unix       1985  C       ?        S      C        B      IE      N!         U      N          N      N       N
SPICE butler      1985  D       D        C      E        BCMP   IE      T!         U      U          N      N       N
MOSIX             1985  D       P        QS     C        CIMP   P       C          K      N          N      N       MOS
Process Server    1986  CH      D        S!     V        CMP    I!      C!         U!     N          N      N       N
Nichols' butler   1987  G       F!       Q      V        CI     I       N          U      N          N      N       N
R-shell           1987  C/D     B/P      C      ?        ?      I       N          U      N          N      N       N
REM               1988  D       S        Q      P        BCP    I       N          U      N          N      N       N
Condor            1988  C       P        QS     S        BC     IE      N          U      N          N      N       Remote Unix
Sprite            1988  C       S        QS     S        BCMP   PE      N          K      N          N      N       N
Gammon            1989  D       B        Q      ?        ?      ?       N          U      N          N      N       N
History           1990  D       ?        Q      ?        L      I       N          U      P          R      N       N
V                 1991  D       S        U      ?        ?      P       N!         K      N          N      N       N
PMS               1991  D!      P!       Q      C        CM     P       N          U      N          N      N       N
Small             1991  D       ?        C      ?        ?      I       ?          U!     N          N      A       N
Stealth           1992  D       P!       US     FO       A!     IE      C          U!     P          LO     N       N
EMPS              1992  D!      ?        S!     ?        I?     P       N          K      N          N      N       N
DAWGS             1992  D       S        QS     S        BCI    IE      N          M      N          N      N       N
Coshell           1993  D       ?        QS     V        BP     I       C          U      N          N      N       N
Utopia            1993  CH      B/P      C      V        A!     I       P          U      N          N      N       R-shell
LSF               1994  CH!     P!       C!     V        BCP    I!      P          U      U          N      N       Utopia!
ASLB              1994  D       B        Q      ?        ?      P       N          U!     N          N      N       N
Freedman-Sharp    1994  D!      P!       QS     V        BCM    I!      U          U      U          N      N       PMS
Calypso           1994  D!      ?        Q!     V        CP     I!      N!         U      U          N      N       N
LOADIST           1995  C       P        US     S        BCM    I       P          U      N          N      N       N
Oklahoma          1995  D       P        Q      ?        ?      I!      C          U!     N          N      N       N
Monash            1995  D!      P!       Q      C        BCM    I       P          U!     N          N      N       N
LYDIA             1997  D       F        U      C        A      I       N          U      C          R      N       N
San Luis          1997  C       ?        C      E        CR     I!      C          U!     P/U        L      N       N

Table 2.4.1-1  Load sharing schemes feature summary table

Special character key: ? = unclear or information unavailable, ! = inferred, / = various used separately.

Key to columns:
Scheduling: C = centralised, D = distributed, H = hierarchical, G = distributed, using a global free-node registry.
Dissemination (load information): B = broadcast, P = polling, D = demand driven, S = state-change driven, F = uses shared filesystem.
Load index: Q = CPU run-queue, U = CPU utilisation, S = other simple index, C = complex index (multiple metrics).
Transparency: FO = full (except that off-line training is required), P = user transparent but not programmer transparent, S = users must select tasks for transfer, but transparent to programmer, E = user assumed to be an 'expert' and must supply complex information, V = visible to programmers and users, C = requires re-compilation or re-linking with special header files or library files.
Flexibility (task types supported): A = all (and nearly all), B = batch (no I/O), C = CPU intense, I = IPC, L = specifically long-lived tasks, M = memory intensive, P = parallel, R = real time tasks.
Task placement: I = initial placement, P = preemptive transfers, E = preemptive eviction.
Performance heterogeneity support: N = none, C = CPU, T = total, P = partial, U = user supplied / configured.
Implementation: K = kernel level, U = user level implementation, no modifications to kernel, M = user level implementation, kernel modifications needed.
Task characteristics considered: N = not considered, F = full (automatic), P = partial (automatic), C = CPU-time requirement only, U = user provided.
Ability to learn task behaviour: L = learns task resource requirements, N = non-learning, R = learns response-time only, O = learning at least partially off-line.
Adaptive behaviour: A = adaptive (learns load balancing strategies), N = non-adaptive.
Predecessor scheme: name of predecessor scheme, N = none or unknown.


2.4.2 Conclusion

Based on the detailed evaluation of load sharing schemes that has been undertaken, this section identifies the state-of-the-art in load sharing schemes.

The rich-information approach is an emerging trend. Modern workstations consist of a number of resources. To accurately represent the load at such a processing node, multiple metrics are required.

The Utopia, LSF and San Luis schemes use complex load indexes which take account of the load on a number of resources. Simple load indexes, which only represent the load on a single resource, are still popular. Recent schemes which use simple load indexes include ASLB, Oklahoma, Monash and LYDIA.

Two schemes, Stealth and San Luis, exhibit significant learning behaviour. These schemes are able to learn (partially) the resource requirements of tasks. The History and LYDIA schemes learn task response-times only.

[TL89] states that "it is well known that schedulers make better decisions if they can take advantage of information about the nature of the jobs they are scheduling". However, many schemes do not consider the resource requirements of tasks. Such schemes include Condor, Sprite, Gammon, V, DAWGS, Utopia, LOADIST, Oklahoma and Monash. Where information concerning tasks' behaviour is used it is often limited to the response-time or CPU-time requirement. Furthermore, with the exception of the learning schemes discussed above, resource-requirement information is user-provided, as in the SPICE butler, LSF, Freedman-Sharp and Calypso schemes.

A number of schemes make provision for performance heterogeneity. The SPICE butler considered the provision of all major resource types. Other schemes with significant provision for this form of heterogeneity are Utopia, LSF, LOADIST and Monash. Many schemes make no provision for performance heterogeneity, including Condor, Sprite, History, PMS, EMPS, DAWGS, ASLB and LYDIA. Other schemes, including the recent Oklahoma and San Luis schemes, cater only for differences in CPU-speed. The Freedman-Sharp scheme requires that resource provision information is user-provided.


Performance heterogeneity is very common in loosely-coupled systems, and thus a load sharing scheme should at least provide support for CPU-speed heterogeneity and memory-size heterogeneity.

The only published scheme that is significantly adaptive is SMALL [MW91]. This scheme uses an ANN to learn load balancing strategies. However, it does not consider the behaviour or resource needs of individual tasks. Strategies are based only on load-state information.

There are no fully transparent schemes. All load sharing schemes to date require at least one form of user or programmer interaction. These include: re-compilation of tasks' source code, inclusion of special programming constructs, user-supplied information concerning the behaviour of specific tasks, user selection of tasks for transfer, and off-line training. A lack of transparency in load sharing is a barrier to widespread acceptance and use. A scheme which is designed with the assumption that users know how the load sharing scheme works, or even what load sharing is, can realistically only be used in restricted environments such as computing departments at universities. The Stealth scheme is the closest to a fully transparent scheme to date. However, off-line training is needed, requiring a significant number of execution instances for each new task-type. A fully transparent scheme should have few acceptance problems since users need not even be aware of its presence.

A number of schemes including Remote Unix, Sprite, Condor, V, DAWGS and Stealth employ eviction to protect local workstation users' resources. In terms of the local user's experience this may be good. However, in terms of load sharing efficiency and effectiveness it is bad. Eviction causes delay to the tasks concerned, frustration to the owners of those tasks, and increases the number of transfers that take place. Nichols' butler [NIC87] takes the extreme measure of killing off remote tasks when a local user reclaims their workstation. This is wasteful of processing time and renders remote execution undesirable, since the longer a remote task executes for, the greater the chance of it being killed off. This implies that a user is better off keeping large tasks local, which is in conflict with the purpose of many schemes in which long-lived, batch tasks are specifically selected for remote execution.

Dynamic load sharing is achieved by measuring the load at each node in the system and selecting the location for a specific task based on some criterion.


In general this involves placing a new task at the least loaded node. There are some exceptions to this general approach. For example, [KA98] proposes that some processors are left purposefully idle to ensure the correct scheduling of a real-time task-graph in the presence of task execution-time variations.

The majority of schemes achieve task transfers by initial placement. The popularity of initial placement stems from its ease of implementation and low overhead costs. The only recent scheme to incorporate preemptive process migration is ASLB.

The majority of schemes are implemented in user-space. This approach leads to portable, easy-to-develop schemes that are, however, generally less functional and transparent than kernel-space implementations. The most recent kernel-space development was EMPS, reported in 1992. The DAWGS scheme (also 1992) is a mainly user-space implementation which requires some kernel modifications.

Centralised scheduling is popular because it is simple to implement, but the scheduler can become a performance bottleneck and introduces a single point of failure. The majority of the schemes reviewed use distributed scheduling. This can lead to scaling problems if the communications aspects are not efficient. A few schemes, including Process Server, Utopia and LSF, use hierarchical scheduling, in which nodes are grouped in clusters, to promote scalability.

State-change load information dissemination, if appropriately configured, uses fewer messages than periodic dissemination and introduces less delay than demand-driven dissemination. A steady state at a node can cause a long period between update messages. This can be resolved by sending additional update messages after a certain maximum silent period.

Chapter 2 Part two

2.5 Classification of load sharing models

A number of models have been used to investigate load sharing issues. Some of these models have been used to theoretically evaluate load sharing policies and some have formed the basis for load sharing implementations.


Models of tightly-coupled systems

A number of load sharing models are oriented specifically towards tightly-coupled systems. The characteristics of such systems that are of particular relevance to models of load sharing include:

• Nodes are homogenous.
• Resources such as memory may be shared or centrally provided and thus single-resource nodes (i.e. just a CPU) can be a reality.
• Low-cost, high-reliability communication links are used.

As this work is concerned with loosely-coupled systems, models of tightly-coupled systems are not discussed further.

Models of loosely-coupled systems

Models that are oriented specifically towards loosely-coupled systems need to take into account the characteristics of such systems. Characteristics of loosely-coupled systems that are of relevance to models of load sharing include:

• Nodes are interconnected by some form of network technology; communication speeds are therefore considerably inferior to processing speeds.
• Nodes are autonomous, each comprising a number of resources which include CPU, memory, secondary storage and network interface.
• Performance heterogeneity is common.
• Secondary storage may be localised or shared using a distributed file system such as NFS.

A simple taxonomy of load sharing models is drawn up, based on the primary focal point of each model. These focal points include: the process, the processing node, and the system (of processing nodes). The taxonomy is shown in figure 2.5-1.

The differentiation between models of static and dynamic load sharing is significant, and as such it forms the first branch in the simple taxonomy. It is generally accepted in the literature that static load sharing is appropriate in only a limited number of scenarios, for example when details of system load, including the task-arrival rate and the resource requirements of tasks, are known in advance.


Figure 2.5-1  A simple taxonomy of load sharing models (static versus dynamic scheduling models; models that focus on the process, the processing node, or the system; nodes modelled as a single resource or as a collection of resources; homogenous or heterogeneous systems)

Static scheduling models are generally probabilistic in nature and seek to balance load over time. The execution site for a task is chosen based on a probability function that weights the suitability of each node according to some predefined algorithm, perhaps taking into account the processor speed at each node.

Such an approach does not have the ability to react to the fluctuations in load levels that occur in almost all distributed systems. Static scheduling is therefore only suitable in very stable systems, and is generally accepted as being unsatisfactory for dynamic systems.

The majority of loosely-coupled systems exhibit significant dynamic behaviour, having load types and levels which vary with time and cannot be predicted in advance. For these systems dynamic scheduling, in which policy decisions are based on the load-state of nodes, is required.

Several works including [DAN95] refer to policies that base load sharing decisions on load-state information as adaptive. However, for the purpose of this work these are referred to as dynamic. The term adaptive is reserved for policies that adjust their behaviour to suit the system-state, for example by modifying rules or by changing threshold values. Adaptive policies are much less common than dynamic policies.

2.6 Evaluation of models of load sharing

This section critically compares models of load sharing.

Evaluation criteria

This section identifies and justifies the criteria by which models of load sharing will be evaluated:


• Assumptions. The assumptions that underpin a model play a significant role in determining the reality of the model. Simplifications are sometimes required in order to study a specific aspect of a system in detail; however, simplifying assumptions may only hold in specific cases. If a model relies on unrealistic assumptions then the model itself is likely to be unrealistic.
• Focus of the model. The majority of load sharing models are focused on the processing node. There are also models that focus on the system as a whole, and models that focus on the process. Focus forms the second branch of the simple taxonomy.
• Representation of tasks. The extent to which individual tasks' characteristics are taken into account affects the reality of the model.
• Support for performance heterogeneity. Performance heterogeneity is common in loosely-coupled systems. Models should cater for this heterogeneity.
• Representation of processing nodes. The autonomous processing nodes that constitute loosely-coupled systems consist of a collection of resources. The load on resources is unlikely to be uniform across all resource-types at a node. Consideration of the multi-resource nature of processing nodes increases the reality of a model.
• Global scheduling technique. Whether a model uses centralised or distributed scheduling is a significant differentiating factor. Each form of scheduling has its uses, but it is important to ensure that the form used in a particular model is appropriate for the target system.
• Local scheduling discipline. Modern workstations that commonly comprise loosely-coupled systems almost invariably employ an operating system which is based on some form of multiprocessing. This implies that the local scheduling is preemptive. However, first-come-first-served (FCFS) scheduling is easier to incorporate into queuing theory models, and for this reason FCFS scheduling is the most popular form of local scheduling used in models of load sharing.
• Queuing model used. Queuing theory is a particularly popular technique for building models of computer systems. However, there are several different ways in which a given real system can be represented in a queuing model, depending on the simplifying assumptions that are made.
• Performance metrics, findings and simulation results. A key comparison criterion. However, models are used in a wide variety of ways to explore many different load sharing related issues. As such it is difficult to compare the models by the results obtained.


2.6.1 Assumptions

There are a number of common assumptions made when developing models of load sharing. Some of these assumptions are identified and the impact of each is discussed:

• Infinite waiting-queue capacity at nodes [BFS89,LLS98A,CE94]. Strictly this assumption does not hold for any real system since the details of all tasks at a node must be held in memory, which is always finite. It could be argued that the size of memory typically available in modern systems permits sufficiently large queues to be held that a queue overflow scenario will never realistically occur.

However, multi-processing operating systems keep track of all processes by maintaining a process-table that contains all the information needed. These process tables are usually implemented as a linked list, which has no intrinsic size limits, but a strict limit is usually imposed. The process-table limit is set to some value deemed suitable for the given system. Limits in the order of 250 processes are common; some versions of Linux have a default limit of 512.

In real systems there are at least two factors which reduce the likelihood of the process-table limits being reached: (1) systems administrators may tune the value to suit the use environment of the system; (2) even the most powerful workstations will suffer significant performance degradation well before the typical process-table limits are reached. This degradation often acts as a throttle, deterring users from submitting additional tasks. However, factors such as these are very difficult to account for in a model.

In the light of these issues, careful evaluation is needed to determine if it is safe to model the queuing capacity of a specific system as infinite. Models which assume a finite process-table limit include [SN93].

• Scheduling decisions are achieved instantly and at zero cost [KL94]. In reality there is always a processing cost associated with a scheduling decision. It is possible that this processing overhead is very low (less than 1% processing capacity is desirable [TL89]), and in the scheme of things can be considered negligible.

However, the delay incurred while information concerning the load at remote nodes is gathered must be considered.


If this is achieved through polling, there are two messages required per node polled. This communication delay is likely to be more significant than the processing overhead. [BAN93] assumes that instantaneous system-wide state information is readily available at no cost. Several models have considered the costs associated with polling and as a result find that low poll limits are generally more effective [MTS90,ELZ86A] (a sketch of a bounded-polling location policy is given at the end of this list).

• The communication cost of task relocation is negligible [ELZ86A]. In [BAN93] it is assumed that there is no communication cost associated with task transfer. The validity of this assumption depends on the type of task transfer that is supported. Initial placement transfers can be achieved by a single message containing the task name and its arguments, plus an acknowledgement message. This approach assumes a global file name space is provided by NFS or similar. In this situation the communication cost of relocation is at least two messages. If coarse-grained tasks are transferred then the communication cost can usually be considered to be low. However, if initial placement is achieved by transferring the executable file, as in LOADIST, a greater number of messages will be required.

Preemptive task migration always involves at least the transfer of process state and the creation of the process at the recipient node. In these cases the communication cost is likely to be significant.

• Tasks are infinitely divisible [BD96,LD97,LLS98A,LLS98B,GS94]. This is particularly useful when devising models in which tasks are decomposed and executed as parallel fragments, but is unrealistic in general. The model used in [GL91] assumes that the workload may be assigned to the nodes in the system in "all desired proportions" and thus perfect load balancing can be achieved. [HTC88] assumes that the load at each node is "measured by real numbers and is arbitrarily divisible".

There is a contradiction in the use of this assumption in models of loosely-coupled systems in that it implies the distribution of fine-grained load, known to be unsuitable for such systems due to the communications costs involved.

Most simplifying assumptions are optimistic. Results achieved from models based on them are thus also optimistic. Such models are often only applicable to specific (sometimes unlikely) scenarios.
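To make the information-gathering cost referred to above explicit, the sketch below shows a sender-initiated location policy with a bounded poll limit. The function names, threshold and poll limit are hypothetical placeholders: a real implementation would exchange request and reply messages rather than call a local function, which is exactly the cost the zero-cost assumption ignores.

```python
import random

POLL_LIMIT = 3        # assumed poll limit; the models cited above favour small values
LOAD_THRESHOLD = 2    # assumed queue-length threshold for a node to accept work

def probe(node):
    """Stand-in for a two-message poll of a remote node; returns its reported load."""
    return node["queue_length"]

def select_receiver(nodes):
    """Poll at most POLL_LIMIT randomly chosen nodes and return the first one
    that reports a load below the threshold, or None (execute locally).
    Each probe costs two messages, so the message cost is bounded by 2 * POLL_LIMIT."""
    for node in random.sample(nodes, min(POLL_LIMIT, len(nodes))):
        if probe(node) < LOAD_THRESHOLD:
            return node
    return None
```

Bounding the number of probes caps both the delay and the message overhead of each scheduling decision, which is why the cited studies find low poll limits effective.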


2.6.2 Focus of the model

Process-level models

A process is defined as an executing instance of a program. Process behaviour can be extremely complex. All but the simplest programs are configured with input parameters and data. In addition, where preemptive scheduling such as Round Robin (RR) is used, processes must compete with one another for resources. Thus it is very unlikely that any two processes will behave in exactly the same way, even if they are instances of the same program.

Due to this complexity there are few models that focus on the specific behaviour of processes. Rather, a common simplifying assumption is that the workload is homogenous with identical, fixed service-times for each task. A more realistic approach is to assume that all tasks exhibit a similar type of behaviour but with exponentially distributed service times. The Poisson distribution is particularly popular for modelling task-arrival patterns.

[AV98] proposes the interval-order augmentation algorithm for scheduling general task graphs. The scheduling problem is solved in two stages: (1) additional precedence relations are added to the original task graph to ensure that it is an interval-order graph (a tractable case of the general scheduling problem); (2) existing optimal interval-order solutions are then applied. The effects of communication are considered, although there is an assumption that the communication costs are the same as the execution cost for any given task. This implies fine-grained sub-tasking. Randomly generated task graphs with different numbers of nodes and directed edges are used to test the algorithm.

[GI90] assumes homogenous processing nodes with infinite memory. RR scheduling is used. A state-transition model of the high-density CPU and I/O resource use activity of each program is developed, and updated after each execution. The model is used to predict the resource use during the next execution of the program, based on the resource use pattern of the most recent execution and the state-transition model.
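A minimal sketch of a state-transition (first-order Markov) predictor of this general kind is given below. The coarse state names, the counting update rule and the prediction method are assumptions made for illustration; they are not the actual formulation used in [GI90].

```python
from collections import defaultdict

STATES = ("cpu_bound", "io_bound", "idle")  # assumed coarse resource-use states

class ProgramModel:
    """Per-program transition counts, updated after each execution of the program."""
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, observed_states):
        """Record the state sequence observed during the most recent execution."""
        for current, following in zip(observed_states, observed_states[1:]):
            self.counts[current][following] += 1

    def predict_next(self, state):
        """Predict the most likely next state, or the same state if none observed."""
        followers = self.counts.get(state)
        if not followers:
            return state
        return max(followers, key=followers.get)

# Example: after observing one run, predict what tends to follow a CPU-bound phase.
model = ProgramModel()
model.update(["cpu_bound", "io_bound", "cpu_bound", "io_bound", "idle"])
print(model.predict_next("cpu_bound"))  # -> "io_bound"
```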


Processing-node models

A processing node can be defined as an entity that can execute a process. A processing node must by definition have a Central Processing Unit (CPU). Other resources such as primary and secondary storage are usually associated with the processing node, but the physical mapping of nodes to resources can vary between architectures. For example, a multiprocessor machine can be constructed from a number of nodes that each have their own memory, a number of nodes that can each access a shared memory, or a hybrid of the two approaches. Loosely-coupled systems consist of autonomous processing nodes. These consist of a number of resources which include CPU, memory and network interface. They may have additional resources, most notably secondary storage in the form of a disk. It is important that nodes which consist of a number of resources are modelled as such.

A common simplification is to consider the resources at a node collectively and to model each node as a single queue. Examples include [LM82,ELZ86A,ELZ86B,BFS89,BW89,MTS89,CHO90,SN93,CE94,FMP94]. This approach fails to capture the resource diversity of modern workstations or the load on each resource in the node; rather, it assumes that all resources at a node are equally contended for. There may be environments where such a simplification is valid, but in general it represents a significant source of error.

A more comprehensive model which represents the node as a collection of resources is found in [BAN93]. This model is more representative of modern workstations than those models which represent all resources by a single collective index.

System-level models

These generally use system-wide load levels to define current system state. Performance homogenous systems can be easily represented as a centralised queuing model. If there are k nodes, the model can be stated as an M/M/k queuing system. However, this does make certain assumptions about task-arrival rates and service times. Performance homogenous system-level models include [LR93,KL94].

Performance heterogeneous systems are much harder to model, even if each node is assumed to have only a CPU resource. Additional resources greatly increase the complexity of models. Performance heterogeneous system-level models include [SW89,MB91,MGR93]. These models only deal with differences in CPU performance.


2.6.3 Representation of tasks

There are two aspects to the representation of tasks in models: the behaviour of the tasks and the task-arrival pattern.

Task behaviour

There are a number of methods which have been used to represent the behaviour or resource-requirements of tasks. All of these are simplifications which reduce information content:

1. All tasks are identical. [GAB82,LD97] assume homogenous tasks. [ELZ86A,ELZ86B] state that tasks are all of the same type, although the service-times can vary.
2. Task service-times are often represented as an exponential distribution. This is a common approach, used in many models including [FYN88,SW89,CHO90,MTS90,MB91,SN93,LR93,FMP94]. In [LEE94] all tasks are assumed to have the same basic resource requirements but their service times are modelled as an exponential distribution. Models that use the Poisson distribution to represent service times include [LM82]. In [LO86] a detailed workload analysis is carried out. They conclude that an exponential distribution (or any distribution with an exponential tail) is not an accurate representation of tasks' processing requirements.
3. Random service times are modelled in [LLS98A]; tasks are not differentiated in other respects.
4. [KL94] assumes that tasks' service times are known or can be accurately predicted.

Each task-type has specific behaviour, using different amounts of resource. It is not reasonable to assume that the behaviour of one task-type can be predicted by observing a different task-type. Models which are based on a simple representation of tasks are thus unrepresentative of the majority of real systems.

The response-time of a task is dependent in part on the load on resources. This in turn is dependent on the behaviour of the tasks present. Thus it is important to accurately represent workloads in terms of composition. However, this is not done in existing models.

Task-arrival pattern

The representation of the arrival pattern of tasks to the system is of key significance to the accuracy of the model. The performance of many scheduling policies is sensitive to the system load level. The task-arrival rate is commonly modelled using an exponential distribution of inter-arrival times [ELZ86B,BAN93,FMP94]. Models which use the Poisson distribution include [LM82,BFS89,MTS89,MB91,LR93,KL94,LEE94,LLS98A]. Another approach is to use a simple mean arrival-rate to describe the arrival process [ELZ86A,MGR93]. In real systems the task-arrival rate can differ from node to node and from time to time. It is important that policies do not depend on constant task-arrival rates or identical arrival rates across the system for efficient performance. In [ELZ86A] the same average arrival rate at each node is assumed. This is used to signify that in the long term the external load imposed on each node is the same.
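The two most common modelling choices above (exponential service times and a Poisson arrival process) can be realised by drawing inter-arrival and service times from exponential distributions, as in the sketch below. The rate values are illustrative assumptions, not figures taken from the models cited.

```python
import random

ARRIVAL_RATE = 0.5   # lambda: mean of 0.5 task arrivals per time unit (assumed)
SERVICE_RATE = 1.0   # mu: mean service requirement of 1 time unit (assumed)

def generate_workload(num_tasks, seed=0):
    """Generate (arrival_time, service_time) pairs for a synthetic task stream.
    Exponential inter-arrival times produce a Poisson arrival process."""
    rng = random.Random(seed)
    tasks, clock = [], 0.0
    for _ in range(num_tasks):
        clock += rng.expovariate(ARRIVAL_RATE)    # exponential inter-arrival time
        service = rng.expovariate(SERVICE_RATE)   # exponential service time
        tasks.append((clock, service))
    return tasks

workload = generate_workload(5)
```

Note that such a generator reproduces only the arrival pattern and a single service-time distribution; it carries none of the per-task-type resource information whose absence is criticised above.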


2.6.4 Representation of processing nodes

By far the most common representation of processing nodes is as a single queue. In this representation the node is considered a single resource which cannot be subdivided further. The only characteristic that can be used to differentiate processing nodes in such representations is their processing speed. This approach is convenient for the development of simple queuing models, but prevents evaluation of the actual internal resource structure of nodes and of the effects of load on the individual resources. Models in which processing nodes are represented as single-resource nodes include [LM82,ELZ86A,BFS89,CHO90,GL91,LR93,LEE94,CE94,AV98]. In such models the CPU run-queue length is equivalent to the number of tasks at the node and is used as the load index.

If the load on individual resources is to be studied, a more detailed model is required. [BAN93] represents a processing node as a group of resources, each with its own queue, using the central server model of a processing node (see figure 2.6.4-1). The model in [BAN93] assumes that nodes are homogenous.

Figure 2.6.4-1  The central server model of a processing node (a high-level scheduler queue feeds the CPU; tasks block to per-resource queues and unblock back to the CPU; low-level preemption returns tasks to the queue; tasks exit on completion or failure)
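The difference between the two representations can be made concrete with a small data-structure sketch in the spirit of the central server model. The resource names and the load calculations are illustrative assumptions, not the formulation used in [BAN93].

```python
from collections import deque

class SingleQueueNode:
    """The common simplification: the whole node is one queue, and its load
    index is simply the number of queued tasks (the CPU run-queue length)."""
    def __init__(self):
        self.queue = deque()

    def load_index(self):
        return len(self.queue)

class MultiResourceNode:
    """A central-server-style node: each resource has its own queue, so the
    load on CPU, memory and disk can be reported (and contended for) separately."""
    def __init__(self):
        self.resource_queues = {"cpu": deque(), "memory": deque(), "disk": deque()}

    def load_vector(self):
        """Per-resource queue lengths, usable as a complex (multi-metric) load index."""
        return {name: len(q) for name, q in self.resource_queues.items()}
```

The single-queue form supports only a scalar load index, whereas the multi-resource form naturally yields the kind of multi-metric index favoured by the rich-information approach discussed in part one.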


2.6.5 Support for performance heterogeneity

Despite the prevalence of performance heterogeneity in loosely-coupled systems, many models of load sharing for these systems represent them as being performance homogenous. Such models include [LM82,ELZ86B,FYN88,GL91,BAN93,KL94,BD96,AV98].

Poor performance can result when algorithms that are designed for homogenous systems are applied to heterogeneous systems [SW89]. For example, simple round-robin allocation of tasks to nodes might work satisfactorily under certain conditions in a homogenous system. In a system where nodes have different processing capacities, it is desirable to send proportionally more work to the faster nodes.

No load sharing models that provide complex treatment of performance heterogeneity have been found. However, a number of models consider differences in processing speed as a special case. For the queuing models the optimisation problem is thus to determine the amount of load to send to each queue, given the different processing rates of nodes. Such models include [MGR93,GS94,LEE94,HSU96]. [BFS89] builds on the Join-the-Biased-Queue rule introduced by Yum [YUM81] by making it adaptive. The queues are biased to account for processor speed heterogeneity. However, the bias is not constant; rather, it is continually updated using the average queue-length information for each node. Several models take a simplified approach by assuming a number of classes of nodes. [SW89,MTS90] assume there are only two classes of node: fast and slow. The two-speed simplification is a useful way of making the general performance-heterogeneity case tractable, but it relies on a simple representation of processing nodes and imposes a simple binary divide on the performance of the nodes. [SN93] supports both the two-speed case and a realistic case where each node has its own specific processing speed. [MB91] supports this latter case.
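A hedged sketch of the adaptive biased-queue idea follows. The smoothing update and the particular bias formula are assumptions made purely for illustration and are not the exact rule used in [BFS89] or [YUM81]; the intent is only to show a per-node bias that reflects processor speed and adapts to observed queue lengths.

```python
SMOOTHING = 0.2  # weight given to the newest queue-length sample (assumed)

class BiasedQueueSelector:
    """Join-the-biased-queue: queue lengths are offset by a per-node bias so
    that faster nodes appear shorter; the bias adapts as queue lengths are observed."""
    def __init__(self, speeds):
        self.speeds = speeds                      # relative processing speeds per node
        self.avg_len = {n: 0.0 for n in speeds}   # moving average queue length per node

    def observe(self, node, queue_length):
        """Update the moving-average queue length for a node."""
        self.avg_len[node] = ((1 - SMOOTHING) * self.avg_len[node]
                              + SMOOTHING * queue_length)

    def choose(self, queue_lengths):
        """Pick the node whose biased queue length is smallest: the current length
        is discounted by the node's usual backlog and scaled by its speed."""
        def biased(node):
            return (queue_lengths[node] - self.avg_len[node]) / self.speeds[node]
        return min(queue_lengths, key=biased)
```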


2.6.6 Global scheduling technique

Both centralised scheduling and distributed scheduling models of load sharing are used. Centralised scheduling is stable, simple to model, and can be based on simple algorithms. However, centralised scheduling leaves the system vulnerable as there is a single point of failure, and the scheduler can become a bottleneck, both in terms of communication and the processing time required to perform the scheduling, as scale increases. Centralised scheduling models are found in [SW89,MB91,SN93,KL94,LLS98A].

A hierarchical centralised model, which overcomes some of the limits of pure centralisation, is found in [CE94]. Nodes are connected in clusters and load sharing is considered at both the inter-cluster and intra-cluster levels.

Distributed scheduling is highly resilient to the failure of any single component and can be designed to scale well. However, distributed scheduling requires more complex algorithms, and instability can arise if nodes base decisions on insufficient information, or if communication delay causes stale information to be used. Distributed scheduling models include [ELZ86A,BFS89,MTS89,LR93,FMP94].

2.6.7 Local scheduling discipline

Often, to simplify a model, First Come First Served (FCFS) scheduling at nodes is assumed. Examples include [LM82,ELZ86A,BW89,MTS89,CHO90,MTS90,MGR93,KL94,LLS98A]. In such models a processing node has an associated queue. The task at the head of the queue is processed to completion while tasks in the queue wait (see figure 2.6.7-1).

Figure 2.6.7-1  Processing node model based on FCFS scheduling


However, modern workstations are usually equipped with multi-processing operating systems whose preemptive scheduling is based on timesharing; variants of round-robin (RR) scheduling are typically used. FCFS models are not accurate representations of such workstations.

Often simple computation-only tasks are assumed. Clearly FCFS will outperform RR in the case where only compute-intense tasks are concerned. This is because the need for context switching is removed (computation-only tasks can be assumed not to block). This is one reason why FCFS should not be used in models of systems that use preemptive schedulers.

[DAN95] evaluates the relative impact on load sharing performance of three local scheduling strategies: FCFS, shortest job first (SJF), and RR. It is shown that the preemptive policy (RR) has a very different effect to that of the non-preemptive policies (SJF and FCFS) on the sensitivity of load sharing policies to variance in task inter-arrival times and service times. This is further evidence that FCFS models are not representative of systems that use RR.

There are further differences between FCFS scheduling and preemptive timeshare scheduling: (1) waiting-time for tasks is different; (2) FCFS scheduling does not support the concept of blocking and therefore permits a processor to become idle while there are ready tasks; (3) short-term throughput is higher in FCFS than in RR. Consider the simple example shown in table 2.6.7-1, in which three identical tasks queue for service in FCFS and timesharing respectively. In the latter case assume that the time-slice size is small with respect to the task's processing-time requirement. Each task requires one unit of processing time. Throughput is measured at the end of each time unit, and reflects the number of tasks that have completed in that time unit.

System     Task     Processing time  FCFS scheduling                          Time-slice (RR) scheduling
time unit  number   requirement      Wait   Response-time   Throughput        Wait   Response-time   Throughput
1          1        1                0      1               1                 2      3               0
2          2        1                1      2               1                 2      3               0
3          3        1                2      3               1                 2      3               3

Table 2.6.7-1  Differences between FCFS and RR scheduling


As can be seen from this example, the extent to which the use of FCFS scheduling is acceptable in models of loosely-coupled systems depends on the performance goal. If the goal is to maximise throughput, then the FCFS and RR approaches can be equivalent over time. If the performance goal is to minimise task response-time or to minimise task waiting-time, then the approaches are not equivalent.

An improved processing node model that supports task preemption is shown in figure 2.6.7-2.

Figure 2.6.7-2  Processing node model that supports task preemption

2.6.8 Queuing model used

The queuing-theoretic approach is popular for the construction of models of computer systems. The technique allows simple models to be devised which lend themselves well to mathematical analysis. However, in many cases the models are subject to numerous simplifying assumptions, some of which diminish the quality of their representation of reality. Due to the different assumptions that can be made, there are several different ways in which a given real system can be represented.

1. A system of processing nodes can be represented as a system of independent M/M/1 queuing systems, as shown in figure 2.6.8-1.

Figure 2.6.8-1  Parallel independent M/M/1 queuing systems

[MTS89] describes the M/M/1 model as the no-load-balancing model.


This is generally true for physical systems in which nodes operate independently and each has its own arrival task stream. [KL94] uses this model to describe random scheduling for a two-node system (which they call flip-coin scheduling); each node is modelled as an independent M/M/1 system. They also use this model to describe alternate scheduling for a two-node system; in that case each node is modelled as an independent Erlangian-2 arrival-rate system, stated E2/M/1.

2. A system of processing nodes can be represented as a system of centrally scheduled M/M/1 queuing systems, as shown in figure 2.6.8-2.

Figure 2.6.8-2  Parallel, centrally scheduled M/M/1 queues

The parallel, centrally scheduled queues model is found in [LLS98A,BFS89,SW89]. A clustered version of this model is described in [CE94]. In [LEE94] this model is used to determine both the processing capacity assignment for heterogeneous processing nodes, and the load distribution over these nodes, simultaneously. In [BFS89] nodes are assumed to have different processing speeds, so the queue lengths are biased to account for this. The bias values are continuously adjusted based on the moving average queue length at each node. This model is used in [KL94] to describe a join-the-shorter-waiting-time-queue scheduling policy.

3. A system of processing nodes can be represented as a single-queue multiple-server (SQMS) system, as shown in figure 2.6.8-3.

Figure 2.6.8-3  Single-queue multiple-server (SQMS) system


A single queue with service rate nµ is known to be more effective in theory than n homogenous queues each with service rate µ. [LM82] introduces the wait-while-idle problem and shows that when the probability of waiting-while-idle is zero, n M/M/1 queues are as effective as a single M/M/n queue. In all other cases, M/M/n outperforms the multiple queue model (a numerical comparison of the two arrangements is sketched after this list). It follows, theoretically, that for performance heterogeneous systems a single queue with service rate kµ is more effective than m heterogeneous queues whose total service rate is kµ.

Many load sharing models are based on the single queue approach. The SQMS model is found in [MB91,SW89]. In [KL94] an M/M/2 system of this design is described as perfect scheduling. There are two assumptions: that nodes are homogenous and that FCFS scheduling is used at nodes. When these assumptions hold, it can be seen that the scheduler always places the task that is at the head of the queue on the optimum free processor. The more general M/M/k case is described as perfect scheduling in [ELZ86A], although it is noted that M/M/k analysis is optimistic for any load sharing scheme since task transfer costs must be considered. [MTS89] describes the M/M/k model as perfect load sharing with no costs.

4. A system of processing nodes can be represented as a fully distributed queuing system, as shown in figure 2.6.8-4.

Figure 2.6.8-4  Fully distributed queuing system

The fully distributed model works as follows: tasks arrive independently at each node; based on locally held information the node makes a scheduling decision, and the task can be transferred to another node or executed locally. The fully distributed model is found in [MGR93]. In [CE94] the fully distributed model is used at the cluster level. Each cluster has its own arrival and scheduling queues, and consists of a number of processing nodes.


5. A system of n processing nodes can be represented as a closed central server system with n-1 servers, as shown in figure 2.6.8-5.

Figure 2.6.8-5  Closed central server model

A version of the closed central server model is found in [SN93], in which the central server is dedicated and does not process any of the load itself. It is possible to arrange for the central server node to retain some portion of the load for local processing. Centralised scheduling is potentially very accurate because the central server can have full system knowledge; consistent system-wide state knowledge is hard to achieve in distributed scheduling. A centralised approach permits the use of simpler scheduling algorithms. However, the centralised scheduler approach is communication intensive, can cause scaling problems due to the performance limits of the single scheduler node relative to that of the system, and represents a single point of failure.

6. A system of n processing nodes can be represented as an open central server system with n-1 servers, as shown in figures 2.6.8-6 and 2.6.8-7.

Figure 2.6.8-6  Open central server model in which the central node receives external tasks

There are a number of possible variations of the open central server model. These are concerned with: (1) which nodes receive external tasks (two possibilities are shown in figures 2.6.8-6 and 2.6.8-7); and (2) whether the central node retains some of the tasks for local processing, or is dedicated to scheduling over the remaining nodes. In [EG89] two variations of the open central server model are found. In these variations the controller does not receive external tasks, but does retain a proportion of the workload. Each node has a service queue for local execution, and a transmission queue for tasks to be executed remotely.
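The wait-while-idle comparison from item 3 can be illustrated numerically. The sketch below computes the mean response time of n independent M/M/1 queues against a single M/M/n queue using the standard Erlang C formula; the arrival and service rates are illustrative values chosen here, not figures from the works cited.

```python
from math import factorial

def mm1_response(lam, mu):
    """Mean response time of one M/M/1 queue (requires lam < mu)."""
    return 1.0 / (mu - lam)

def mmn_response(lam, mu, n):
    """Mean response time of a single M/M/n queue with total arrival rate lam."""
    a = lam / mu                                  # offered load in Erlangs
    rho = a / n                                   # per-server utilisation (must be < 1)
    tail = (a ** n / factorial(n)) / (1 - rho)
    erlang_c = tail / (sum(a ** k / factorial(k) for k in range(n)) + tail)
    return erlang_c / (n * mu - lam) + 1.0 / mu   # mean wait plus mean service time

n, mu, lam_total = 4, 1.0, 3.0
separate = mm1_response(lam_total / n, mu)        # n independent M/M/1 queues
pooled = mmn_response(lam_total, mu, n)           # one shared M/M/n queue
print(separate, pooled)   # the pooled M/M/n queue gives the lower mean response time
```

For these values the independent queues give a mean response time of 4.0 time units against roughly 1.5 for the pooled queue, because the pooled arrangement never leaves a server idle while a task is waiting. Note that this comparison, like the M/M/k analyses cited above, ignores task transfer costs and is therefore optimistic for any real load sharing scheme.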


In [EG89] two variations of the open central server model are found. In these variations the controller does not receive external tasks, but does retain a proportion of the workload. Each node has a service queue for local execution, and a transmission queue for tasks to be executed remotely.

2.6.9 Performance metrics, simulation results and findings

A popular performance metric, used in [BFS89] and [LEE94] for example, is task response-time (individual task characteristics are not considered).

[LLS98A] derives optimality equations (when performance is averaged over infinite time) for four well-known optimality criteria for load balance in performance-heterogeneous systems (loosely-coupled by implication). These criteria are: all nodes have the same mean number of tasks; all nodes have the same mean waiting queue length; all nodes have the same mean response time; and all nodes have the same mean waiting time. Note that only one of these criteria can be guaranteed to be met at any time. The calculations depend directly on the assumption that load is infinitely divisible, which is unrealistic. Results are provided separately for each of these performance criteria. In summary, for each of the criteria the balance of load is less well maintained as the workload increases. At workload intensities above half of system capacity the results are quite poor.

The results obtained in [ELZ86A] indicate that simple load sharing schemes can provide significant improvement in average task response-times compared to the no-load-sharing case. It is also indicated that simple policies can perform nearly as well as complex policies which use more information. The simplifying assumptions (FCFS scheduling and a homogenous workload described simply by an average service-time and an average task inter-arrival time) must be borne in mind when evaluating this second result.

The model presented in [MTS89] is used to compare sender-, receiver-, and symmetrically-initiated load sharing policies. It is concluded that load sharing can be beneficial in the presence of quite severe delays, even exceeding the mean task service time. At delays of greater than 10 seconds, random load sharing was found to perform as well as any probing policy. With any of the policies, as delay increases there is a cutoff point at which the load sharing performance is no better than the no-load-sharing case.
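The interplay between transfer delay and the benefit of remote execution can be expressed as a simple decision rule. The sketch below is a hypothetical sender-initiated check using queue length as the load index; the function name, parameters and the figures in the examples are illustrative assumptions, not a reproduction of any of the cited policies.

```python
def should_transfer(local_queue_len, remote_queue_len, mean_service_time, transfer_cost):
    """Sender-initiated check: transfer only if the estimated remote response time,
    including the transfer cost, beats the estimated local response time.
    Queue lengths serve as the load index, and every task is assumed to need
    roughly the mean service time (a strong simplification)."""
    local_estimate = (local_queue_len + 1) * mean_service_time
    remote_estimate = (remote_queue_len + 1) * mean_service_time + transfer_cost
    return remote_estimate < local_estimate

# Hypothetical figures: a transfer cost of 30% of the mean service time, the region
# that [CHO90] reports as still being beneficial.
print(should_transfer(local_queue_len=3, remote_queue_len=0,
                      mean_service_time=1.0, transfer_cost=0.3))   # True: the transfer pays off
print(should_transfer(local_queue_len=1, remote_queue_len=1,
                      mean_service_time=1.0, transfer_cost=0.5))   # False: too little difference to cover the cost
```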


In simulations in [ELZ86B], average transfer costs of up to 50% of average task service-time are used. Performance of both sender and receiver policies degrades rapidly as transfer costs exceed 25% of processing cost. The conclusions are that neither sender-initiated nor receiver-initiated policies are best under all circumstances. Sender policies are best under light to moderate system load and where low-cost non-preemptive transfers are available. Receiver policies are best where higher-cost preemptive transfers are used, and where system load is high.

Results in [CHO90] show that load sharing can be beneficial when the time-cost of a task transfer is as much as 30% of the service time of the task concerned. This result is encouraging since it implies that load sharing can be useful for a large proportion of tasks present in general-purpose systems.

[BFS89] found an adaptive algorithm to dramatically outperform both the non-adaptive biased queue algorithm and RR scheduling as the number of simulated users, and hence the workload, was increased. The work also finds that the instantaneous CPU run-queue length is not the most efficient load index for use in complex systems (heterogeneous processing speeds), a simple asymptotic average of a number of periodic samples performing better.

Results in [CE94] show: 1 the interval between load information exchanges directly influences the results of load sharing, with more frequent exchanges improving performance but increasing the communication overhead; and 2 for a system of eight nodes, grouping into four clusters caused greater communication cost and a greater percentage of remotely executed tasks, but lower load imbalance within clusters, than grouping into two clusters.

Three known, centralised scheduling heuristics are evaluated in [SW89]: 1 shortest expected delay (SED), in which tasks are placed to minimise their expected delay, taking the queue lengths and node speeds into account; 2 never queue (NQ), in which tasks are sent to the fastest available server. This policy does not take into account the remaining work at faster nodes and may thus increase the delay experienced by a task by sending it to an idle slow server rather than placing it in a (possibly short) queue for a fast server.


The third heuristic, greedy throughput (GT), attempts to maximise the number of tasks that complete before the next task arrives: a given task is scheduled such that it is most likely to complete before the next task arrival.

It is found that when there are relatively few fast nodes, maintaining a queue for these ensures they are efficiently used. SED tends to over-queue on the faster nodes while NQ under-queues. The GT policy is found to outperform SED and NQ because it takes into account the task-arrival rate.

In [MB91] it is observed that in the case of a single-queue system of heterogeneous nodes, it may be beneficial to wait for a fast server to become free rather than placing a task on the first server to become free, which may be a slow server. The issue is how to determine the amount of time to wait. The penalty for using a slow server is the increase in response-time caused. The penalty for waiting for a fast server is the increased delay experienced by all tasks in the queue while a slow server is idle.

A range of policies is examined: 1 greedy, in which all tasks are placed at the fastest server until its performance advantage is eroded to unity with the next slowest server; 2 never-queue (NQ), in which a task is placed on the fastest idle server; and 3 a new heuristic which is based on the greedy policy but takes into account the CPU utilisation of servers.

The greedy policy is found to send too much work to the fastest servers, beyond the point where their performance has fallen below that of slower nodes. This is caused by the use of a discrete threshold and is more pronounced at low system loads. NQ utilises the slower servers more than the greedy policy. NQ is found to perform quite well at very low system utilisation, poorly at intermediate utilisation, and close to optimal at utilisation levels above 0.8 of system capacity. The new heuristic is found to be very close to optimal at all system utilisations, continuously outperforming the greedy and NQ policies.

These results show that relatively slow nodes can be effectively employed to improve system responsiveness. The policies evaluated all use faster nodes with preference but can send some proportional fraction of load to the slower nodes.

The work in [MGR93] is mainly concerned with the efficient use of probing. One conclusion is that as the heterogeneity in the system increases, the marginal benefits of a higher probe limit increase. This is because the chance of finding a faster node increases with the number of polls.


Another conclusion is that probing becomes less efficient at higher system loads, since more probes are needed to find a suitable receiver node. At these load levels, periodic broadcast is found to have lower overhead. They propose switching between probing and periodic broadcast automatically as system-wide load increases.

[FIN90] measures the proportion of tasks that complete within given deadlines using each of the random, threshold and shortest policies, as introduced in [ELZ86A]. It is found that threshold performed almost as well as shortest over a range of experiments. This would seem to favour simpler policies. However, although the threshold policy is simpler in an algorithmic sense, both threshold and shortest are equally simple in an information-use sense, each making decisions based only on the queue length at nodes. Specific task characteristics are not considered.
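The location and placement rules referred to in this section can be stated compactly. The sketch below gives plausible formulations of the random, threshold and shortest policies (in the spirit of [ELZ86A] and [FIN90]) and of the SED and NQ heuristics evaluated in [SW89]. Probing costs, tie-breaking and the exact parameter choices of the original studies are not modelled, and the NQ fallback used when no node is idle is an assumption.

```python
import random

def random_policy(n_nodes):
    """Random: pick any node; no state information is used."""
    return random.randrange(n_nodes)

def threshold_policy(queue_lens, threshold, probe_limit):
    """Threshold: probe up to probe_limit random nodes and transfer to the first whose
    queue length is below the threshold; otherwise keep the task local (None).
    probe_limit must not exceed the number of nodes."""
    for node in random.sample(range(len(queue_lens)), probe_limit):
        if queue_lens[node] < threshold:
            return node
    return None

def shortest_policy(queue_lens, probe_limit):
    """Shortest: probe a random set of nodes and choose the one with the shortest queue."""
    probed = random.sample(range(len(queue_lens)), probe_limit)
    return min(probed, key=lambda node: queue_lens[node])

def sed_policy(queue_lens, speeds):
    """Shortest expected delay: place the task where (queue length + 1) / speed is smallest."""
    return min(range(len(queue_lens)), key=lambda i: (queue_lens[i] + 1) / speeds[i])

def nq_policy(queue_lens, speeds):
    """Never queue: place the task on the fastest idle node.
    The fallback to SED when no node is idle is an assumption for this sketch."""
    idle = [i for i, q in enumerate(queue_lens) if q == 0]
    if idle:
        return max(idle, key=lambda i: speeds[i])
    return sed_policy(queue_lens, speeds)

# Hypothetical system: four nodes, the first twice as fast as the others.
queues, speeds = [2, 0, 1, 3], [2.0, 1.0, 1.0, 1.0]
print(sed_policy(queues, speeds), nq_policy(queues, speeds))
print(threshold_policy(queues, threshold=2, probe_limit=2), shortest_policy(queues, probe_limit=2))
```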


2.7 Conclusion for part 2

2.7.1 Summary of load sharing models

Table 2.7.1-1 provides a comparison of 22 models of load sharing in chronological order. Comparison criteria are selected by their relevance to the hypothesis.

Table 2.7.1-1 Load sharing models feature summary table

Model reference | Date | Focus of model | Performance heterogeneity | Representation of processing nodes | Scheduling (centralised/distributed) | Scheduling (static/dynamic) | Local scheduling discipline | Task-arrival rate | Task behaviour and/or resource requirements
LM82    | 1982 | N   | N | S | D         | D | FCFS | E(P)   | E(P)
ELZ86A  | 1986 | N   | N | S | D         | D | FCFS | A      | A
ELZ86B  | 1986 | N   | N | S | D         | D | ?    | E      | A
BFS89   | 1989 | N   | C | S | D         | D | FCFS | E(P)   | N
BW89    | 1989 | N   | N | S | D         | D | FCFS | ?      | N
MTS89   | 1989 | N   | N | S | D         | D | FCFS | E(P)   | E
SW89    | 1989 | S   | C | S | C         | D | FCFS | E(P)   | E
FIN90   | 1990 | N   | C | S | D         | D | FCFS | E      | E
CHO90   | 1990 | N   | N | S | D         | D | FCFS | E(P)   | E
MTS90   | 1990 | N   | C | S | D         | D | FCFS | E(P)   | E
GL91    | 1991 | N   | N | S | D         | S | ?    | F      | N
MB91    | 1991 | S   | C | S | C         | D | FCFS | E(P)   | E
BAN93   | 1993 | N   | N | M | D!        | D | RR   | E      | M
SN93    | 1993 | N   | C | S | C(closed) | D | FCFS | closed | E
LR93    | 1993 | S   | N | S | D         | D | ?    | E(P)   | E
MGR93   | 1993 | S   | C | S | D         | D | FCFS | A      | N
KL94    | 1994 | S   | N | S | C         | D | FCFS | E(P)   | K
FMP94   | 1994 | N   | N | S | D         | D | FCFS | E      | E
CE94    | 1994 | N   | N | S | CH        | D | RR   | E      | EG
LEE94   | 1994 | N/S | C | S | D         | S | ?    | E(P)   | E
LLS98A  | 1998 | S   | C | S | C         | S | FCFS | E(P)   | R
AV98    | 1998 | P   | N | S | C!        | D | ?    | G      | G

Special character key: ? = unclear or information unavailable, ! = inferred

Key to columns:
Focus of model: P = process-level, N = processing-node level, S = system-level
Performance heterogeneity: N = none (homogenous), C = CPU-speed
Representation of processing node: S = single resource, M = multiple resources considered
Scheduling (centralised/distributed): C = centralised, D = distributed, H = hierarchical
Scheduling (static/dynamic): S = static, D = dynamic
Local scheduling discipline: FCFS = first come first served, RR = round robin
Task-arrival rate: A = average (mean) arrival-rate used to describe the arrival process, E = exponential inter-arrival time ((P) = Poisson distribution), F = pre-determined fixed number of tasks, G = task arrival is dependent on position and dependencies in the task-graph
Task behaviour: N = not considered, A = average (mean) service-time used to describe tasks, G = task-graph dependencies, E = exponential service-time ((P) = Poisson distribution), K = service times known or can be accurately predicted, R = random service-time, M = task behaviour described by a probability matrix which describes its routing between resources at a node

The most advanced of the models reviewed, in terms of representation of processing nodes, local scheduling and task behaviour, are described in [BAN93]. These models are briefly outlined below.


The three models assume processing nodes consist of a CPU and one or more disks. They use a probability matrix to describe the routing of a task between resources. The first two models consider serial and parallel routing between resources respectively. Serial routing is quite consistent with the traditional state-transition model of a multiprocessing system, in which a task is using (or waiting for) a single resource at a time. Memory can be an exception here, but these models are not concerned with memory. Parallel routing is described as allowing a task to visit only a single resource before it exits the node, and they give an example based on a node that has several processing elements. This type of parallel model could not be applied to a system that has resources other than processors, since every task needs at least access to a CPU.

The third model uses the central server approach to represent a processing node, in which the CPU is the central resource. Other resources are modeled as queues that can be entered on leaving the CPU. On leaving one of these resources the task must return to the CPU queue. This is the most realistic of the three models, most closely representing traditional ready-run-blocked RR scheduling.

The load index used is the number of tasks at nodes. The goal is to maintain all resources at a node equally utilised to avoid any bottleneck. The main problem with this goal is that the task mix in the system may not permit it. For example, there may be one disk-intensive task and three CPU-intensive tasks to spread over three nodes. Various schedules could be devised, all of them poor in terms of this goal.

The performance metric is the average response-time of tasks. The results indicate that load sharing performance is dependent on the number of resources per node, diminishing rapidly as the number of resources increases. However, this is due to the probability of a busy node being able to find a transfer partner, which in turn is dependent on the definition of an idle node and the condition imposed that only idle nodes can accept transfers. The models are based around the precept that a node is idle only when all of its resources are idle.

There is consideration of unequal task-arrival rates at nodes; in this case the performance is found to degrade at a slower rate as the number of resources increases.

One conclusion is that the number of tasks at a node is as good a load metric as any other. However, no specific task resource requirements are considered; all task-dependent parameters are generated probabilistically.
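The use of a routing probability matrix to characterise task behaviour can be illustrated with standard operational analysis; the sketch below is a generic worked example, not the formulation used in [BAN93]. For a central-server view of a node, the expected number of visits a task makes to each resource follows from the routing probabilities by solving V = e + V·P, where e records where the task enters the node.

```python
import numpy as np

def visit_ratios(routing, entry):
    """Expected visits to each resource, from V = entry + V @ routing.
    Rows of 'routing' that sum to less than 1 imply a probability of exiting the node."""
    n = len(entry)
    # Solve V (I - P) = entry, i.e. (I - P)^T V = entry.
    return np.linalg.solve((np.eye(n) - routing).T, entry)

# Hypothetical node with resources [CPU, disk1, disk2]. After a CPU burst a task exits
# with probability 0.5, or moves to disk1 (0.3) or disk2 (0.2); every disk visit returns
# to the CPU, matching the central-server view of a node.
routing = np.array([[0.0, 0.3, 0.2],
                    [1.0, 0.0, 0.0],
                    [1.0, 0.0, 0.0]])
entry = np.array([1.0, 0.0, 0.0])        # tasks enter at the CPU
print(visit_ratios(routing, entry))       # -> [2.0, 0.6, 0.4] expected visits per task
```

Multiplying each visit count by the mean service demand per visit gives the task's total demand on each resource, which is exactly the kind of per-resource characterisation that a single scalar load index cannot convey.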


2.7.2 Conclusion

Existing models of load sharing fail to capture the complexity of real systems in a number of aspects:

• A number of models take into account CPU-speed heterogeneity. However, there are no models which consider differences in the provision of other resources. Many models simply assume performance homogeneity.
• Only one model has been found which represents a processing node as a collection of resources. The other models represent each processing node as a single resource.
• The majority of models are based on dynamic scheduling. However, some recent models have been based on static scheduling despite its known limitations.
• The majority of models assume FCFS local scheduling. Only two models based on RR local scheduling have been found.
• Task-arrival patterns and service-times are commonly modeled as exponential distributions. Few models attempt to cater for task characteristics other than their task-graph dependencies or service times. One model [KL94] assumes that individual-task service times are known or can be accurately predicted. Another model [BAN93] uses probability matrices to describe each task's routing between resources.

Generally, the existing queuing models are suitable for representing load levels at nodes and for balancing those loads in simple situations, such as when nodes and/or tasks are homogenous. These models do not closely resemble the reality of loosely-coupled systems consisting of heterogeneous multi-resource processing nodes and typically having a task mix exhibiting wide variance in behaviour.

A number of desirable features of load sharing models for loosely-coupled systems can be identified:

• The model should support both performance-homogeneous and performance-heterogeneous systems.
• The model should represent each of the major resources that constitute a node as individual entities.
• The model should allow reasoning based on realistic tasks and workloads.
• The model should be based on dynamic global scheduling.
• The model should assume RR local scheduling.


The models described in [BAN93] fully meet three of these requirements and partially meet a further one (see section 2.7.1). Apart from this, no model that fully or partially supports more than two of these features has been found.

