Condor & Grid<br />
HCC Informal Condor Tutorial
Today’s Activities…<br />
• We will be providing a short tutorial on using Condor on<br />
both HCC and remote resources.<br />
– We will cover basic job submits (like a batch system!).<br />
– As well as some advanced techniques that make<br />
Condor more useful for managing many, many jobs.<br />
– Hands on! I want to make sure everyone here gets a<br />
chance to run their own jobs if they want.
Condor & Grid<br />
• Condor – Resource Scavenger<br />
– Think PBS/SGE<br />
– Has Grid Extensions (Condor-G)<br />
• Grid – Open Science Grid (OSG)<br />
– High Throughput Grid<br />
– Serial Jobs
Open Science Grid
Open Science Grid vs. TeraGrid<br />
• Open Science Grid: high throughput (lots of jobs), serial jobs,<br />
free signup, opportunistic use.<br />
• TeraGrid: high performance (big jobs), MPI/OpenMP,<br />
restricted signup, resource allocations.
OSG Ideal Workflow<br />
• Lots of independent jobs (no MPI/OpenMP)<br />
• Less than 24 hours per job<br />
• Portable program
Grid Workflow<br />
1. runAutodock: Autodock -p protein -l Ligand<br />
2. Create jobs on the submit host<br />
3. Send jobs for execution to remote sites (Firefly, Wisconsin, Red)<br />
4. Save output to SRM storage
Condor & Grid: Step 1<br />
• Submit file:<br />
universe = vanilla<br />
executable = /bin/hostname<br />
output = host.out<br />
error = host.err<br />
log = host.log<br />
queue
Condor & Grid: Step 2<br />
• Submit job:<br />
condor_submit host.condor<br />
• Check job:<br />
[dweitzel@hcc-grid condortest]$ condor_q<br />
-- Submitter: hcc-grid.unl.edu : : hcc-grid.unl.edu<br />
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD<br />
45288.0 dweitzel 12/9 18:17 0+00:00:00 I 0 0.0 hostname
The Most Important Commands<br />
• Submit a job:<br />
– condor_submit myjob.condor<br />
• Check your job’s status:<br />
– condor_q<br />
• Remove your jobs:<br />
– condor_rm
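• Putting these together, a typical session might look like this<br />
(a sketch; 45288.0 is the example job ID from the condor_q<br />
output two slides back — yours will differ):<br />
condor_submit host.condor<br />
condor_q<br />
condor_rm 45288.0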
Using The Grid<br />
• Condor is run as a batch system on Prairiefire.<br />
– But it doesn’t require running on a cluster.<br />
– It doesn’t even require nodes that Nebraska owns!<br />
• We use a system called “glideinWMS”, which submits jobs<br />
to remote clusters.<br />
– Each submitted job is actually a Condor worker node!<br />
– The job starts on the remote node, launches Condor,<br />
and joins our cluster.<br />
– This way, you need to learn almost nothing grid-specific.<br />
Just use Condor!
Using The Grid<br />
• By using glideinWMS, we can capture a huge number of<br />
slots! The plot below shows the last 24 hrs of activity.
Condor & Grid<br />
• A more complicated submit file. Note that Condor copies each<br />
file in transfer_input_files into the job’s working directory<br />
under its base name, so /etc/hosts appears to the job as “hosts”:<br />
universe = vanilla<br />
executable = /usr/bin/wc<br />
args = hosts<br />
output = wordcount.out<br />
error = wordcount.err<br />
log = wordcount.log<br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
transfer_input_files = /etc/hosts<br />
queue
Condor & Grid: Step 3<br />
• Prepare for grid submission<br />
– Certificate - https://pki1.doegrids.org/ca/<br />
• (We’ll help you out with this afterward)<br />
– Initialize proxy:<br />
voms-proxy-init --voms hcc:/hcc<br />
– This will expire in 12 hours. If you don’t have a<br />
current proxy, all commands will fail!
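– To check whether your proxy is still valid (assumes the<br />
VOMS client tools used above are installed):<br />
voms-proxy-info --timeleft<br />
– Prints the remaining lifetime in seconds; 0 means you<br />
must run voms-proxy-init again.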
Condor & Grid: Step 4<br />
• Modify for grid submission.<br />
– No changes to the condor submit file.<br />
– Must submit from glidein.unl.edu<br />
• You can use your HCC account to log in here, but<br />
you must inform us first; we’ll then create a new<br />
home directory here.<br />
– Modify your job to not depend on the shared file<br />
system.<br />
• I.e., if you need a software package, you can’t install<br />
it yourself – you need our help.
Condor & Grid: Step 5<br />
• Condor will automatically transfer files back and forth.<br />
– This works pretty well for up to 100-200 MB per job.<br />
– No more than ~10 input and output files. Zip small<br />
files together!<br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
transfer_input_files = file1,file2<br />
transfer_output_files = file3,file4
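• One way to stay under the ~10-file limit is to bundle small inputs into a single archive and transfer only that. A minimal sketch (all file names are hypothetical):

```shell
# Bundle many small input files into one archive;
# the submit file then lists only inputs.tar.gz in transfer_input_files,
# and the job unpacks it at startup with: tar xzf inputs.tar.gz
mkdir -p inputs
echo "sample data 1" > inputs/file1
echo "sample data 2" > inputs/file2
tar czf inputs.tar.gz -C inputs .
```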
Running many jobs<br />
• Need to run the same thing 100 times? No problem!<br />
– Or maybe change it slightly each time?<br />
universe = vanilla<br />
executable = /usr/bin/wc<br />
args = hosts<br />
output = wordcount.out.$(Process)<br />
error = wordcount.err.$(Process)<br />
log = wordcount.log<br />
should_transfer_files = YES<br />
when_to_transfer_output = ON_EXIT<br />
transfer_input_files = /etc/hosts<br />
queue 10
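• $(Process) can also vary the input, not just the output<br />
names. A sketch (the files input.0 … input.9 are hypothetical):<br />
args = input.$(Process)<br />
transfer_input_files = input.$(Process)<br />
queue 10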
Connecting Jobs with DAG scripts<br />
Job A → Job B → Job C
Defining the DAG<br />
• You need to write a *.dag file, which will reference<br />
Condor submit files. This file describes the jobs to run and<br />
their dependencies.<br />
• The file below describes the linear-shape dependency<br />
graph on the previous slide.<br />
JOB A a.sub<br />
JOB B b.sub<br />
JOB C c.sub<br />
PARENT A CHILD B<br />
PARENT B CHILD C
Submit & Run<br />
• Instead of using the “condor_submit” command, you want<br />
to use “condor_submit_dag”.<br />
• DAGMan can be used for job dependency graphs of up to<br />
100,000 jobs.<br />
– But let me recommend something smaller to start<br />
with…
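• For example (a sketch; mydag.dag is a hypothetical name for<br />
a file containing the lines from the previous slide):<br />
condor_submit_dag mydag.dag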
Saving large outputs<br />
• Sometimes the output is too large to save with Condor.<br />
• In this case, you want to use a special protocol called SRM<br />
for file transfers.<br />
• The commands you will use start with lcg-*<br />
– “lcg-cp -b -D srmv2 ”<br />
• Does not work recursively. One file at a time.<br />
– “lcg-ls URL”<br />
– The URL you will use is:<br />
• srm://redsrm1.unl.edu:8443/srm/v2/server?SFN=/mnt/hadoop/user///filename
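• For example, to copy a result file into SRM storage (a sketch;<br />
the local path is hypothetical, and VO/USERNAME stand in for<br />
the two path components the slide leaves blank):<br />
lcg-cp -b -D srmv2 file:///home/me/output.dat \<br />
srm://redsrm1.unl.edu:8443/srm/v2/server?SFN=/mnt/hadoop/user/VO/USERNAME/output.dat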