
Condor & Grid

HCC Informal Condor Tutorial


Today’s Activities…

• We will give a short tutorial on using Condor on both HCC and remote resources.

– We will cover basic job submission (like a batch system!).

– We will also cover some advanced techniques that make Condor more useful for managing many, many jobs.

– Hands on! I want to make sure everyone here gets a chance to run their own jobs if they want.


Condor & Grid

• Condor – Resource Scavenger

– Think PBS/SGE

– Has Grid extensions (Condor-G)

• Grid – Open Science Grid (OSG)

– High-throughput Grid

– Serial jobs


Open Science Grid


Open Science Grid vs. TeraGrid

  Open Science Grid               TeraGrid
  High throughput (lots of jobs)  High performance (big jobs)
  Serial jobs                     MPI / OpenMP
  Free signup                     Restricted signup
  Opportunistic                   Resource allocation


OSG Ideal Workflow

• Lots of independent jobs (no MPI/OpenMP)

• Each job runs < 24 hours

• Portable program


Grid Workflow

1. Run: runAutodock Autodock -p protein -l Ligand

2. Create jobs on the submit host.

3. Send jobs for execution to remote sites (Firefly, Wisconsin, Red).

4. Save output to SRM storage.

[Diagram: the runAutodock command fans out into multiple Autodock jobs on the submit host; each remote site runs jobs and returns its output to SRM storage.]


Condor & Grid: Step 1

• Submit file:

  universe = vanilla
  executable = /bin/hostname
  output = host.out
  error = host.err
  log = host.log
  queue


Condor & Grid: Step 2

• Submit job:

  condor_submit host.condor

• Check job:

  [dweitzel@hcc-grid condortest]$ condor_q
  -- Submitter: hcc-grid.unl.edu : : hcc-grid.unl.edu
  ID       OWNER     SUBMITTED   RUN_TIME    ST  PRI  SIZE  CMD
  45288.0  dweitzel  12/9 18:17  0+00:00:00  I   0    0.0   hostname


The Most Important Commands

• Submit a job:

– condor_submit <submit file>

• Check your job’s status:

– condor_q

• Remove your jobs:

– condor_rm
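The whole cycle from Step 1 onward can be sketched as a short script. The submit file below is the one from Step 1; the condor_* calls are shown commented out since they require a Condor installation on the submit host.

```shell
#!/bin/sh
# Write the Step 1 submit file, then (with Condor installed) submit,
# monitor, and, if needed, remove the job.
cat > host.condor <<'EOF'
universe = vanilla
executable = /bin/hostname
output = host.out
error = host.err
log = host.log
queue
EOF

# With Condor available on the submit host:
# condor_submit host.condor    # submit the job
# condor_q                     # check its status
# condor_rm <cluster_id>       # remove it if necessary
echo "wrote host.condor"
```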


Using The Grid

• Condor runs as a batch system on Prairiefire.

– But it doesn’t require running on a cluster.

– It doesn’t even require that we run on nodes Nebraska owns!

• We use a system called “glideinWMS” which submits jobs to remote clusters.

– Each job submitted is actually a Condor worker node!

– The job starts on the remote node, launches Condor, and joins our cluster.

– This way, you need to learn barely any grid-specific details. Just use Condor!


Using The Grid

• By using glideinWMS, we can capture a huge number of slots! The plot below shows the last 24 hrs of activity.


Condor & Grid

• A more complicated submit file:

  universe = vanilla
  executable = /usr/bin/wc
  args = hosts
  output = wordcount.out
  error = wordcount.err
  log = wordcount.log
  should_transfer_files = YES
  when_to_transfer_output = ON_EXIT
  transfer_input_files = /etc/hosts
  queue


Condor & Grid: Step 3

• Prepare for grid submission:

– Get a certificate: https://pki1.doegrids.org/ca/

• (We’ll help you out with this afterward.)

– Initialize a proxy:

  voms-proxy-init --voms hcc:/hcc

– The proxy will expire in 12 hours. If you don’t have a current proxy, all commands will fail!
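Since an expired proxy silently breaks every grid command, it can help to check before submitting. A minimal sketch follows; the one-hour threshold is an arbitrary choice for illustration, and the script degrades gracefully when the VOMS tools are not installed.

```shell
#!/bin/sh
# Sketch: warn when the VOMS proxy is missing or close to expiry.
# MIN_LEFT (one hour, in seconds) is an illustrative threshold.
MIN_LEFT=3600

check_proxy() {
    if ! command -v voms-proxy-info >/dev/null 2>&1; then
        echo "voms-proxy-info not installed; cannot check proxy"
    elif left=$(voms-proxy-info --timeleft 2>/dev/null) \
         && [ "${left:-0}" -ge "$MIN_LEFT" ]; then
        echo "proxy ok: ${left}s remaining"
    else
        echo "proxy missing or expiring; run: voms-proxy-init --voms hcc:/hcc"
    fi
}

# Record the result so submit scripts can inspect it.
check_proxy | tee proxy_check.log
```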


Condor & Grid: Step 4

• Modify for grid submission.

– No changes to the Condor submit file.

– You must submit from glidein.unl.edu.

• You can use your HCC account to log in here, but you must inform us first; we’ll then create a new home directory for you there.

– Modify your job so it does not depend on the shared file system.

• I.e., if you need a software package, you can’t install it yourself – you need our help.


Condor & Grid: Step 5

• Condor will automatically transfer files back and forth.

– This works well for up to 100-200 MB per job.

– Use no more than ~10 input and output files. Zip small files together!

  should_transfer_files = YES
  when_to_transfer_output = ON_EXIT
  transfer_input_files = file1,file2
  transfer_output_files = file3,file4
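The "zip small files together" advice above can be sketched as follows; the file names are made up, and tar is used since it is available everywhere (zip works just as well).

```shell
#!/bin/sh
# Bundle many small input files into one archive so the submit file's
# transfer_input_files line stays short (file names here are made up).
mkdir -p inputs
for i in 1 2 3 4 5; do
    echo "data $i" > "inputs/part$i.txt"
done
tar czf inputs.tar.gz -C inputs .

# The submit file then transfers just the archive:
#   transfer_input_files = inputs.tar.gz
# and the job unpacks it as its first step:
#   tar xzf inputs.tar.gz
echo "bundled $(tar tzf inputs.tar.gz | grep -c part) files"
```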


Running many jobs

• Need to run the same thing 100 times? No problem!

– Or maybe change it slightly each time?

  universe = vanilla
  executable = /usr/bin/wc
  args = hosts
  output = wordcount.out.$(Process)
  error = wordcount.err.$(Process)
  log = wordcount.log
  should_transfer_files = YES
  when_to_transfer_output = ON_EXIT
  transfer_input_files = /etc/hosts
  queue 10
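To see how $(Process) separates the jobs, the sketch below writes the submit file from the slide and previews the output file names Condor will produce: $(Process) expands to 0 through 9 for "queue 10", so each job gets its own output and error file.

```shell
#!/bin/sh
# Write the "queue 10" submit file; $(Process) expands to 0..9,
# giving each of the 10 jobs distinct output and error files.
cat > wordcount.condor <<'EOF'
universe = vanilla
executable = /usr/bin/wc
args = hosts
output = wordcount.out.$(Process)
error = wordcount.err.$(Process)
log = wordcount.log
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = /etc/hosts
queue 10
EOF

# Preview the per-job file names Condor will substitute:
for p in 0 1 2 3 4 5 6 7 8 9; do
    echo "job $p -> wordcount.out.$p"
done
```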


Connecting Jobs with DAG scripts

[Diagram: Job A → Job B → Job C, a linear dependency chain.]


Defining the DAG

• You need to write a *.dag file, which references Condor submit files. This file describes the jobs to run and their dependencies.

• The file below describes the linear dependency graph on the previous slide.

  Job A a.sub
  Job B b.sub
  Job C c.sub
  Parent A Child B
  Parent B Child C
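Putting the pieces together, the sketch below writes the three-job linear DAG from the slide along with minimal stub submit files for A, B, and C (/bin/hostname stands in for real work); the file names are illustrative.

```shell
#!/bin/sh
# Write the linear DAG from the slide plus stub submit files
# (a.sub, b.sub, c.sub) that each run /bin/hostname as placeholder work.
for j in a b c; do
    cat > "$j.sub" <<EOF
universe = vanilla
executable = /bin/hostname
output = $j.out
error = $j.err
log = $j.log
queue
EOF
done

cat > linear.dag <<'EOF'
Job A a.sub
Job B b.sub
Job C c.sub
Parent A Child B
Parent B Child C
EOF

# With Condor installed, submit the whole graph at once:
# condor_submit_dag linear.dag
echo "wrote linear.dag and stub submit files"
```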


Submit & Run

• Instead of using the “condor_submit” command, use “condor_submit_dag”.

• DAGMan can be used for job dependency graphs of up to 100,000 jobs.

– But let me recommend something smaller to start with…


Saving large outputs

• Sometimes the output is too large to save with Condor.

• In this case, you want to use a special protocol called SRM for file transfers.

• The commands you will use start with lcg-*

– “lcg-cp -b -D srmv2 <source> <destination>”

• Does not work recursively. One file at a time.

– “lcg-ls URL”

– The URL you will use is:

• srm://redsrm1.unl.edu:8443/srm/v2/server?SFN=/mnt/hadoop/user/<username>/<directory>/filename
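As a sketch of how the pieces of that URL fit together, the script below composes an lcg-cp command from its parts and only echoes it, since actually running lcg-cp requires a valid grid proxy. The username, directory, and file name are placeholders, not real paths.

```shell
#!/bin/sh
# Compose the SRM destination URL and the lcg-cp command without
# running it (lcg-utils requires a valid grid proxy). The username,
# directory, and file names below are placeholders.
SRM_BASE="srm://redsrm1.unl.edu:8443/srm/v2/server?SFN=/mnt/hadoop"
user="someuser"                         # hypothetical username
src="file://$PWD/output.tar.gz"         # hypothetical local file
dest="$SRM_BASE/user/$user/results/output.tar.gz"

cmd="lcg-cp -b -D srmv2 $src $dest"
echo "$cmd" | tee lcg_cmd.txt
# Remember: lcg-cp copies one file at a time (no recursion).
```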
