An Architecture for Dynamic Allocation of Compute Cluster Bandwidth

When DNS round-robin is in place, several IP addresses are mapped to the same domain name, and the address the DNS server responds with is selected in round-robin fashion. While this effectively distributes the physical machines to which clients connect, it does not distribute load. A client may look up a server and never use it for a transfer, in which case some servers will serve transfers more often than others. In our approach we only seek out a transfer node when we have a transfer request, so we have a good idea of the load it will incur. More importantly, the DNS approach does not account for the sharing of computational and transfer resources in a dynamically changing environment.

Our work overlaps with concepts from queuing theory: we are allocating resources to support a user load, which is the fundamental problem queues are designed to solve. We are not doing research in queuing theory itself; rather, we act as a consumer of it from a layer above. It is nevertheless important for us to understand how its fundamental concepts affect us, and ongoing work in the field will enable some important extensions to the architecture we propose here.

The backfill algorithms described in "Core Algorithms of the MAUI Scheduler" [15] and "Utilization, Predictability, Workloads, and User Runtime Estimates in Scheduling the IBM SP2 with Backfilling" [16] play a role in node acquisition for our system. The proposed architecture performs best when nodes can be obtained quickly, and most transfer jobs run for a relatively short amount of time; these two characteristics dovetail well with backfill. A queue maintains a list of jobs ordered by priority. To prevent starvation, the order in which the jobs are requested plays a strong role in this priority. The scheduler then matches jobs with nodes, giving higher-priority jobs earlier start times. Since requests vary in wallclock time and in the number of nodes required, this process inevitably leaves time gaps between jobs. Backfilling searches the remaining lower-priority jobs for those that can fit into these gaps. This can result in lower-priority jobs running before higher-priority jobs, but it does not delay any job from its originally scheduled time; it simply uses what would otherwise be idle time. Naturally, jobs that request a small number of nodes and a short wallclock time fit into the most gaps and thus have a higher likelihood of running at an earlier time. Our architecture takes advantage of this by always requesting the smallest number of nodes and a relatively short wallclock time.
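To make the gap-filling concrete, the following is a minimal sketch of conservative backfill over a fixed scheduling horizon. The Job fields, the discrete time grid, and the function names are our own illustrative simplifications, not the MAUI or IBM SP2 algorithms cited above. Reservations are made in priority order, so a lower-priority job placed in the earliest gap it fits can start early without ever delaying a higher-priority reservation.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes: int     # nodes requested
    walltime: int  # requested wallclock time, in scheduler time steps


def backfill_schedule(jobs, total_nodes, horizon=1000):
    """jobs is priority-ordered (highest first); returns {name: start}.

    Conservative backfill: each job, taken in priority order, receives
    the earliest start at which enough nodes are free for its whole
    walltime. Because higher-priority reservations are fixed first, a
    small job can slide into an earlier gap without delaying them.
    Assumes the horizon is long enough to place every job.
    """
    free = [total_nodes] * horizon  # free[t] = nodes free at time t
    starts = {}

    def earliest_start(job):
        for t in range(horizon - job.walltime):
            if all(free[t + d] >= job.nodes for d in range(job.walltime)):
                return t
        raise ValueError("horizon too short for " + job.name)

    for job in jobs:
        t = earliest_start(job)
        starts[job.name] = t
        for d in range(job.walltime):
            free[t + d] -= job.nodes
    return starts


# A (3 nodes) leaves one node idle; C (1 node) backfills into that gap,
# starting before the higher-priority B (4 nodes) without delaying it.
jobs = [Job("A", 3, 5), Job("B", 4, 5), Job("C", 1, 4)]
print(backfill_schedule(jobs, total_nodes=4))  # {'A': 0, 'B': 5, 'C': 0}
```

This is exactly why our strategy of requesting one node for a short wallclock time works well: such a request fits the largest number of gaps and is therefore likely to be granted quickly.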
There has been much research on predicting resource load requirements [17] [18] [19], including network loads [20] [21] [22]. Our system can benefit from this work: if we were able to predict network loads quickly and accurately enough, we could set resources aside ahead of time for network traffic spikes. To allow for this possibility, we have designed into our architecture a way for prediction algorithms to be implemented in a natural and modular fashion. We discuss this topic when we introduce the controller program.

Fast access to nodes is one challenge our architecture faces. We cannot instantly obtain a node because a user's job may be running on it; we must therefore wait for a job to complete before the scheduler will even consider granting access to our system. Ideally we could suspend any running job, use its node for a transfer, and then restart the user's job when we are finished, all without disrupting the job or denying it wallclock time. Research into virtual machines [23] can be used to achieve this: by running each job in a virtual machine, the entire system state can be suspended and even migrated to another host. As this research advances we expect it to provide solutions for immediate node access.

III. ARCHITECTURE

Recall that our goal is to define an architecture that allows well-connected computational clusters to service bandwidth-greedy applications without incurring the cost of dedicated resources. We propose a modular architecture that uses as many existing components as possible. Monitoring software observes the state of the stock components and provides enough information that policy decisions can be made in a separate and noninvasive process. Our architecture defines the following components:

Nodes: Nodes are full computer systems complete with their own operating system, CPU, memory, disk, and NIC. A typical example is a PC running Linux. Nodes are used for servicing user requests. They must be appropriate for both computation and transfer, with fast NICs (e.g., gigabit Ethernet cards) and fast modern CPUs. Nodes should have access to a shared file system. This is a common description for most computational clusters, as on the TeraGrid.

LRM: An admission control queue that has its own means for setting policy for granting access to nodes. The ability to set high priority in order to preempt jobs that have waited longer in the queue, and the ability to suspend running jobs, are important features but not mandatory. Common examples of such LRMs are PBS [24], LSF [25], MAUI, and LoadLeveler [26]. A GRAM [27] interface to the LRM is recommended.

Transfer Service: The component responsible for transferring data in and out of the cluster. The transfer service comprises a frontend service and a set of backend services. The frontend service acts as a contact point for clients and as a load-balancing proxy to the backend services, which move the data. Backend services run on nodes acquired from the LRM. The frontend must be instrumented with monitoring software so that an external process can observe its state and make decisions based upon the load.

Controller: This component is the brain of the system. It observes the frontend transfer service to determine when client load is high or when the backend transfer node pool is idle, at which point it interacts with the LRM either to acquire a new node for use in data transfer or to release an idle node from the transfer pool back into the computational queue. The controller's decisions are based upon a policy set by the site administrator.
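The controller's decision loop can be summarized in a short sketch. This is a minimal illustration under our own assumptions: the Monitor, LRM, and Predictor interfaces and every name below are hypothetical stand-ins for the monitoring software, the LRM/GRAM interface, and a pluggable prediction module, not concrete APIs from the paper. Note how the node request is kept as small and short as possible so that the LRM's backfill can grant it quickly, and how a predictor, when present, can trigger acquisition ahead of an anticipated spike.

```python
import time
from dataclasses import dataclass


@dataclass
class Policy:
    # Thresholds a site administrator would tune.
    high_load: float = 0.8     # frontend utilization above which we grow
    idle_load: float = 0.2     # utilization below which we shrink
    max_transfer_nodes: int = 8
    walltime: int = 30 * 60    # short requests backfill into more gaps


class Controller:
    def __init__(self, monitor, lrm, policy=None, predictor=None):
        self.monitor = monitor      # observes the frontend transfer service
        self.lrm = lrm              # acquires/releases cluster nodes
        self.policy = policy or Policy()
        self.predictor = predictor  # optional, pluggable load predictor
        self.pool = []              # backend transfer nodes we currently hold

    def step(self):
        load = self.monitor.frontend_load()  # e.g., active transfers / capacity
        # A pluggable predictor may anticipate a traffic spike so nodes
        # can be set aside before the load actually arrives.
        if self.predictor is not None:
            load = max(load, self.predictor.expected_load())
        if load > self.policy.high_load and len(self.pool) < self.policy.max_transfer_nodes:
            # Smallest possible request (one node, short walltime) so the
            # LRM's backfill can grant it quickly.
            node = self.lrm.acquire(nodes=1, walltime=self.policy.walltime)
            if node is not None:
                self.pool.append(node)
        elif load < self.policy.idle_load and self.pool:
            # Pool is idle: return a node to the computational queue.
            self.lrm.release(self.pool.pop())

    def run(self, interval=5.0):
        while True:
            self.step()
            time.sleep(interval)
```

Keeping the thresholds in a small, administrator-editable policy object rather than hard-coding them reflects the requirement above that the controller's decisions be driven by site policy.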
