
Introduction to High Performance and High Throughput Computing


Legion Environment and Batch processing

A Simple Serial Job

calculate_pi - Calculates π by numerically integrating a curve.

\[\int_{0}^{1}\frac{4}{1+x^2}\text{d}x=\pi\]
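
As a minimal sketch of the idea (not the course code), the integral can be approximated by a midpoint sum over n intervals:

awk 'BEGIN { n = 100000; h = 1.0/n; s = 0
  for (i = 0; i < n; i++) { x = (i + 0.5)*h; s += 4/(1 + x*x) }
  printf "%.10f\n", s*h }'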

Make a Copy

cd ~/Scratch
cp -r /shared/ucl/apps/examples/calculate_pi_dir ./
cd calculate_pi_dir
make
./calculate_pi

Job Script

#!/bin/bash -l
#$ -l h_rt=0:10:00
#$ -cwd
./calculate_pi

Job Script Defaults

#$ -l h_rt=0:15:00
#$ -l memory=1M
#$ -l tmpfs=10G

Queue Commands

| Command | Purpose                        |
|:-------:|:-------------------------------|
| qsub    | submit a job                   |
| qstat   | view queue status and job info |
| qdel    | stop & delete a job            |
| qrsh    | start an interactive session   |

Submitting Jobs to the Queue

$ qsub submit.sh
Your job 3521045 ("submit.sh") has been submitted

$ qsub -terse submit.sh
3521045

Submitting Jobs to the Queue

  • The special #$ comments in a job script are options passed to qsub
  • Check man qsub for the full list of options
  • Every cluster is a little different
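
For example, the #$ options in the script above could equally be given on the qsub command line (a sketch reusing options already shown):

qsub -l h_rt=0:10:00 -cwd submit.sh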

Viewing Queue

$ qstat

job-ID  prior   name       user         state submit/start at    
-----------------------------------------------------------------
3521045 0.00000 submit.sh  ccaaxxx      qw    01/14/2014 14:51:54

Job States

| Letter | Status       |
|:------:|:-------------|
| q      | queued       |
| w      | waiting      |
| r      | running      |
| E      | error        |
| t      | transferring |
| h      | held         |
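
To filter the view by state, qstat takes a -s option (assuming a standard SGE qstat):

qstat -s r    # show only running jobs
qstat -s p    # show only pending (queued/waiting) jobs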

More Detailed Info

qstat -j 3521045

(gives a lot of output)

(Demo)

Job States - Errors

Most common problems:

  • Working directory does not exist
  • Working directory is not in ~/Scratch (and thus not writable)

Note that qstat -j truncates the end of the error message; try e.g. qexplain 53893 to see the full message.

Removing Jobs

$ qdel 3521045
ccaaxxx has deleted job 3521045
  • removes a queued job
  • stops and removes a running job

Environment Within a Job

Exercise: Run the simple calculate_pi program as a job.

Exercise: To see what environment variables are set by the scheduler, try making a job script that runs env and puts the output in a file.

Exercise: Sort the file, and compare it with your current environment to see what has changed.
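
A possible script for these exercises (a sketch; the output filename is arbitrary):

#!/bin/bash -l
#$ -l h_rt=0:05:00
#$ -cwd
env | sort > env_in_job.txt

Back on the login node, run env | sort > env_on_login.txt and then diff env_on_login.txt env_in_job.txt to see what the scheduler changed.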

Multithreaded Jobs

OpenMP

Make a Copy

  • Go into your Scratch directory
  • Make a copy of the /shared/ucl/apps/examples/openmp_pi_dir directory
  • Build the program using make, and try running it

Something Like

cd ~/Scratch
cp -r /shared/ucl/apps/examples/openmp_pi_dir ./
cd openmp_pi_dir
make
./openmp_pi

Requesting Threads

#$ -pe smp 4
  • tells the scheduler to find you 4 cores and allocate them to your job
  • sets OMP_NUM_THREADS=4, so OpenMP uses only the 4 allocated cores rather than every core on the node
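
The scheduler also sets $NSLOTS to the number of slots it allocated, so a script can check what it was actually given (a sketch):

echo "Job got $NSLOTS cores; OMP_NUM_THREADS is $OMP_NUM_THREADS"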

Requesting Threads

#$ -pe smp 4

Exercise: Try modifying the script from before to run the new program.

Exercise: Run versions with 1, 2, 3, and 4 cores, and compare the timings.
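
One way to capture the timings (a sketch using the shell's time builtin; submit it once per core count, changing the smp number each time):

#!/bin/bash -l
#$ -l h_rt=0:10:00
#$ -pe smp 2
#$ -cwd
time ./openmp_pi

The timing lands in the job's error (.e) file, since time writes to standard error.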

Job Script

#!/bin/bash -l
#$ -l h_rt=0:10:00
#$ -pe smp 4
#$ -cwd
./openmp_pi

Multi-node Jobs

  • Need some method to communicate over the network
  • Most common is MPI

MPI

Make a Copy

cd ~/Scratch
cp -r /shared/ucl/apps/examples/mpi_pi_dir ./
cd mpi_pi_dir
make
./mpi_pi
# Running an MPI program directly like this won't always work on clusters

Requesting Multinode Jobs

#$ -pe mpi 36
  • reserves space for a multi-node job: 36 cores, which may span several nodes
  • sets up environment variables and a machines file listing the allocated nodes

Note that the memory request is per core: each of the requested cores gets the requested amount of memory.
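
Inside such a job you can see which nodes you were given; $PE_HOSTFILE is a standard SGE variable pointing at the machines file (a sketch):

cat "$PE_HOSTFILE"    # one line per allocated node: hostname, slot count, queue, ...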

Job Script

#!/bin/bash -l
#$ -l h_rt=0:10:00
#$ -pe mpi 4
#$ -cwd

gerun ./mpi_pi

Exercise: Try modifying the script from before to run the new program, using 4, 8, 12, and 24 cores and the mpi parallel environment.

Requesting an Array Job

#$ -t 3       # runs a single task, with ID 3
#$ -t 1-3     # runs tasks 1, 2, and 3
#$ -t 1-7:2   # runs tasks 1, 3, 5, and 7 (step of 2)

This queues an array of jobs which only differ in how the $SGE_TASK_ID variable is set.

Exercise: Try modifying the serial job script (calculate_pi) to run 4 jobs as an array.

Exercise: calculate_pi can take an argument to tell it how many steps to use. Try using this with $SGE_TASK_ID to run using 300, 500, and 700 steps.

Job Script

#!/bin/bash -l
#$ -l h_rt=0:10:00
#$ -t 1-4
#$ -cwd

./calculate_pi ${SGE_TASK_ID}0
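
A variant answering the second exercise (a sketch): task IDs 3, 5, and 7 with two zeros appended give 300, 500, and 700 steps.

#!/bin/bash -l
#$ -l h_rt=0:10:00
#$ -t 3-7:2
#$ -cwd

./calculate_pi ${SGE_TASK_ID}00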

Job Arrays and File Performance

The Lustre parallel filesystem performs worst when creating and writing to lots of little files.

Arrays of jobs often create files like this.

To help performance, run this type of job using the local storage on the node, and copy the files over when the job is complete.

Local Storage: $TMPDIR

Job Script

#!/bin/bash -l
#$ -l h_rt=0:10:00
#$ -t 1-40000
#$ -cwd

cd $TMPDIR
$HOME/my_programs/make_lots_of_files \
  --some-option=$SGE_TASK_ID

Job Script

Then either:

cp * $SGE_O_WORKDIR

or

cp -r $TMPDIR $SGE_O_WORKDIR

Job Script

Or, better for lots of files:

cd $SGE_O_WORKDIR

tar -czf $JOB_ID.$SGE_TASK_ID.tar.gz $TMPDIR
zip -r   $JOB_ID.$SGE_TASK_ID.zip    $TMPDIR

Existing Applications

Modules system helps to set up environment for applications.

Check module avail to see what modules exist.

$ module avail
--- /shared/ucl/apps/modulefiles/core ----
gerun             rcps-core/1.0.0
mrxvt/0.5.4       screen/4.2.1
[...]

Existing Applications

(Demo)

Using Modules

Most modules add one or more programs to your $PATH.

$ htop
bash: htop: command not found
$ module load htop
$ htop

You will see a colourful interactive process viewer.

$ module unload htop
$ htop
bash: htop: command not found

Module Contents

$ module show htop

(Demo)

Prerequisites and Conflicts

Some modules depend on or conflict with other modules

Module 'a' depends on one of the module(s) 'b'
Module 'a' conflicts with the currently loaded module(s) 'b'
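
In that case, unload the conflicting module before loading the new one (a sketch, reusing the placeholder names 'a' and 'b' from the messages above):

module unload b
module load a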

(Demo)

Recommended Bundles

e.g. r/recommended loads a collection of other modules and then the R module.
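
To see exactly which modules a bundle pulls in, module show works here too:

module show r/recommended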

Job Script

#!/bin/bash -l
#$ -l h_rt=0:10:00
#$ -cwd
module unload compilers mpi
module load r/recommended
R --no-save --slave <<EOF >r.output.$JOB_ID
runif(50,0,1)
EOF

(generates 50 uniform random numbers between 0 and 1)

Other Schedulers

Other systems (e.g. Emerald) may use a slightly different scheduler, so the job scripts can be slightly different -- consult the relevant documentation. Each pair below shows an SGE option and its PBS equivalent:

| SGE                | PBS                      |
|:-------------------|:-------------------------|
| #$ -pe mpi 24      | #PBS -l nodes=2:ppn=12   |
| #$ -pe smp 12      | #PBS -l nodes=1:ppn=12   |
| #$ -l h_rt=1:00:00 | #PBS -l walltime=1:00:00 |
| #$ -l memory=4G    | #PBS -l mem=4gb          |
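
For example, the earlier MPI job script might look like this under PBS (a sketch: $PBS_O_WORKDIR is standard PBS, but the MPI launcher varies by site -- mpirun is assumed here, since gerun is Legion-specific):

#!/bin/bash
#PBS -l walltime=0:10:00
#PBS -l nodes=1:ppn=4

cd $PBS_O_WORKDIR
mpirun ./mpi_pi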

Quick Reference Sheet

Legion: https://wiki.rc.ucl.ac.uk/mediawiki-1.23.9/images/a/ad/Legionrefsheet.pdf