TTIC compute cluster (a.k.a. gouda) guidelines
July, 2014
This document describes current cluster policies and provides instructions for using our scheduler, Sun Grid Engine (SGE). It is not intended as a complete tutorial; it assumes that the reader is already familiar with the main SGE commands (qsub, qlogin, qstat, qmon, qalter, qdel, etc.). For more information about any of the options described below, please refer to the man pages or the SGE user's manual, or ask Adam, Greg, or Karen.
General policy:
- Users may not ssh directly to individual compute nodes. If you would like to run a single interactive process on a node, please use qlogin to properly reserve a slot.
- Users are limited to a maximum of 1000 submitted jobs at a time. Any extra jobs beyond 1000 may be killed without notice. Note: An array job counts as a single job, so if you need to run >1000 jobs at one time, simply "package" some of them into an array job. See below for a quick way to do this.
- Users should not run anything CPU- or memory-intensive on the head node; use qlogin instead. The head node is intended for running only what is needed to start scheduler jobs. Many users share the head node, and it only takes one or two of them running intensive processes on it to make life difficult for everyone.
- Users should not run any job that generates significant I/O without using $TMPDIR (see "General guidelines and hints" below) to alleviate the stress on the storage systems.
A user violating any of these policies, or abusing the cluster in any other way (e.g., through unreasonably extensive use of the high-priority option), will have their account immediately suspended, with reinstatement at the discretion of the system administrator. A second violation will lead to another suspension, with reinstatement at the discretion of the faculty.
Job priority determination:
The scheduler does not operate in a first-in-first-out mode but rather allocates resources according to priority. This enables fair use of the resources, and avoids users having to "negotiate" cluster division amongst themselves. The priority of a job is determined by a weighted combination of:
- the job's waiting time,
- a fair share criterion, determined by the recent history of cluster use by the user,
- a user-specified priority (low/default/high),
- a resource-dependent component: parallel environment jobs have somewhat higher priority, in proportion to the number of slots they request.
The relative priorities of all of the jobs in the queue are recalculated at each scheduling interval (currently set to 15 seconds).
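Schematically, the combination can be thought of as follows (a sketch only, not SGE's literal internal formula; the weights w_* are set by the administrators):

priority = w_wait x waiting_time + w_share x fair_share + w_user x user_priority + w_pe x slots_requested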
Some details on the priority components and job submission:
- Fairness: Each user has an equal target share of the cluster, and usage (slots x time) is tracked over the last few days with a certain "half-life" (7 days by default). E.g., if I just used 10 slots for an hour, after 7 days this counts as only 5 slots x 1 hour.
- User-set priority: A user may increase or decrease their job's priority, from the default neutral setting, using the options '-l high' or '-l low' (either on the command line or in the SGE script). Please use '-l high' only if a job is particularly urgent, e.g. due to an impending deadline. This high priority option will be disabled if it is over-used.
- Since the scheduler is no longer first-in-first-out, jobs using parallel environments (e.g. MPI) must use the option '-R y' to allow slots to be "reserved" for them; failing to specify this option may lead to a job being "starved" by the scheduler. Example submissions using these options follow below.
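For example (the job script names here are hypothetical):

qsub -l low -l long background_sweep.sge
qsub -l high -l short urgent_deadline_job.sge
qsub -R y -pe mpi 20 -l medium my_mpi_job.sge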
Multiple queues:
There are four queues for jobs of different lengths. The following table gives the specs of the queues. In most cases, there should be no need to keep track of the nodes; they are included here for completeness. Note that the queues are set up in a cascading system, so any node in the very-long queue is also in the long, medium, and short queues.
| Queue name  | CPU time limit | Wall clock time limit | Nodes in queue             | Total # cores in queue |
| very-long.q | 1 week         | 1.5 weeks             | 6, 7-11, 15-16, 29, 30     | 86                     |
| long.q      | 36 hours       | 108 hours             | 0, 1, 4, 5, 12, 14, 27, 28 | 164                    |
| medium.q    | 4 hours        | 12 hours              | 2, 3, 13, 25, 26           | 228                    |
| short.q     | 30 minutes     | 90 minutes            | 23, 24                     | 244                    |

(Because of the cascading setup, the core count for each queue includes the nodes of all longer queues; the node lists show only the nodes added at each level.)
Any job must be submitted to one of the queues by specifying the corresponding length attribute, of the general form '-l <length>':
-l short
-l medium
-l long
-l very-long
Any job not specifying a queue will be submitted to the "very-long" queue. Any job exceeding either the CPU or wall clock time limit in the queue in which it's running will be automatically killed.
Note that you can no longer use options like '-q all.q@compute-0-N' because such queues no longer exist. If for some reason you need to submit to a specific node which is included in a certain queue, say the medium queue, use '-q medium.q@compute-0-N -l medium'.
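For example, to submit a batch job to the medium queue, or to pin it to a specific node that the table above lists in the medium queue (the script name my_job.sge is hypothetical):

qsub -l medium my_job.sge
qsub -q medium.q@compute-0-13 -l medium my_job.sge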
General guidelines and hints:
- Currently, there is no way to reserve a certain amount of memory for a job. This means that it is possible for jobs to use up all of the RAM on a given node and cause it to start swapping, which can significantly slow down all jobs on the node. However, there is a workaround:
- Determine how much memory you will need per task and then use more than one slot for the job; e.g., if the job will eventually need 6G (twice the amount of memory per core on our oldest machines), use 2 slots. To use multiple slots for a non-parallel job, use '-pe serial N', where N is the number of slots needed.
- Use the chart below to determine the set of nodes your job should use and construct your qsub command.
- Nodes 23-30 have 24G (3G per slot): use '-l mem_total=20G'
- Nodes 0-8 have 64G (8G per slot): use '-l mem_total=60G'
- Nodes 9-11 have 128G (10.6G per slot): use '-l mem_total=120G'
- Example 1: Your job will use 8G per task:
'qsub -pe serial 3 -l mem_total=20G jobname.sge'
'qsub -pe serial 1 -l mem_total=60G jobname.sge'
- Example 2: Your job will use 16G per task:
'qsub -pe serial 2 -l mem_total=60G jobname.sge'
- Example 3: Your job will use 32G per task:
'qsub -pe serial 4 -l mem_total=60G jobname.sge'
- Example 4: Your job will use 64G per task:
'qsub -pe serial 7 -l mem_total=120G jobname.sge'
- Example 5: Your job will use 128G per task:
'qsub -pe serial 12 -l mem_total=120G jobname.sge'
- Whenever possible, submit a large number of jobs as a single array job. This will not affect your jobs' priority, and for all practical purposes it behaves as if each of the jobs were submitted independently. However, array jobs tax the scheduler (and therefore the head node) far less, and they make monitoring jobs via qstat or qmon much easier for everyone. distribute.pl is a script that takes as input a list of command lines and qsub options, and generates an array job with each line as a task. See the "Examples" section below for an example.
- It can be difficult to estimate how long a given job will take to run. For this reason, it's a good idea to estimate very conservatively. For example, if you think your job will take more than half the time limit of the medium queue, you may well be better off submitting to the long queue.
- You probably want to use the option '-r y' with all of your jobs. This will enable a job to be rescheduled if the node to which it is assigned dies.
- To get detailed load information on cluster nodes, see http://gouda.uchicago.edu
- When running an MPI job, make sure you generate a machine file from the hosts provided by the scheduler (see mpi_job.sge, and the sketch following this list).
- If you have an I/O-intensive job (i.e. many reads/writes from the fileserver), it can significantly slow down the network. In extreme cases, the only solution is for the sysadmin to kill or hold the jobs. This seems to be rare, so most users should not need to worry about it most of the time. If you do have an I/O-intensive job, you can avoid the problem by either (1) using the local /tmp directory on each node for your I/O, or (2) modifying your code to reduce the I/O. To do option (1):
  - For each job, SGE creates a temporary directory under /tmp (e.g. /tmp/<jobnum>.<queue-length>.q) and exports its path as $TMPDIR.
  - At the beginning of your job, copy any needed files (or at least any that will require significant I/O) into $TMPDIR.
  - Do all of your I/O from/to files in $TMPDIR.
  - Before the end of your job, copy any needed files from $TMPDIR to a permanent location (e.g. your home directory). When the job exits, SGE deletes $TMPDIR.
  See simple_job_using_TMPDIR.sge for an example SGE script (a sketch also follows this list). It is possible for files stored in $TMPDIR to be lost if a job dies; if you are concerned about this, you may want to copy intermediate outputs, logs, etc. to a permanent location from time to time during the job.
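The following is a minimal sketch of the $TMPDIR pattern described above (the input/output file names and the analyze program are hypothetical; see simple_job_using_TMPDIR.sge for the real example):

#!/bin/bash
#$ -l short
#$ -r y
# Copy I/O-heavy inputs into the node-local $TMPDIR that SGE created for this job.
cp ~/data/input.dat $TMPDIR/
# Do all heavy reads/writes against $TMPDIR rather than the fileserver.
./analyze $TMPDIR/input.dat > $TMPDIR/output.dat
# Copy results back to permanent storage before exiting;
# SGE deletes $TMPDIR when the job ends.
cp $TMPDIR/output.dat ~/results/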
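And here is a sketch of the machine-file step for MPI jobs (the machine-file format shown is OpenMPI-style and may differ for your MPI implementation; the program name is hypothetical; see mpi_job.sge for the real example):

#!/bin/bash
#$ -l short
#$ -r y
#$ -R y
#$ -pe mpi 20
# $PE_HOSTFILE lists one line per granted host: hostname, slot count, queue, processors.
awk '{print $1, "slots=" $2}' $PE_HOSTFILE > machines.$JOB_ID
# $NSLOTS is the total number of slots granted by the scheduler.
mpirun -np $NSLOTS -machinefile machines.$JOB_ID ./my_mpi_program
rm machines.$JOB_ID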
Examples:
- Submit a single batch job, with options specified in the job script: qsub job.sge
- Submit a single batch job, with command-line options: qsub -l short -r y simple_job.sge
- Submit a single batch job, with command-line options, using $TMPDIR for I/O: qsub -l short -r y simple_job_using_TMPDIR.sge
- Submit a parallel-environment job, with command-line options: qsub -l short -r y -R y -pe mpi 20 mpi_job.sge
- Submit an array job using distribute.pl: ./distribute.pl distribute_example '-l short -r y'
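For reference, a hand-written array job (roughly what distribute.pl generates for you) might look like the following sketch; '-t' and $SGE_TASK_ID are standard SGE features, and the per-task command is hypothetical:

#!/bin/bash
#$ -l short
#$ -r y
#$ -t 1-500
# SGE runs this script once per task, with $SGE_TASK_ID set to the task index (1..500 here).
./process_chunk --index $SGE_TASK_ID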
Using Theano
If you plan to use Theano on gouda, please add the following two lines to your job script to avoid multithreading:
export OMP_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
If you intend to use multithreading, you can adjust the variables accordingly. For example, if you want to use 4 threads, add the following two lines to your script,
export OMP_NUM_THREADS=4
export OPENBLAS_NUM_THREADS=4
and then submit your job with qsub -pe serial 4.
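Putting it together, a job script for a 4-thread Theano run might look like the following sketch (the Python script name is hypothetical):

#!/bin/bash
#$ -l medium
#$ -r y
#$ -pe serial 4
# Match the thread counts to the number of slots requested above.
export OMP_NUM_THREADS=4
export OPENBLAS_NUM_THREADS=4
python my_theano_script.py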