PBS queuing system for a Linux cluster

( a word of advice from Brian). PBS is the queuing system used to setup jobs and distribute them to nodes and then return the output of the jobs. Installation of PBS is not trivial - the manual should be read for full details. Here is a short run-down of important files and tips for keeping PBS running OK.

/var/spool/PBS contains all PBS information for a node. On all nodes except the master node, this only contains information for the pbs_mom (the client daemon that listens for jobs and tells the server when jobs are finished e.t.c.). Logs for the pbs_mom are found in /var/spool/PBS/mom_logs. When jobs are running, the standard output and error are found in /var/spool/PBS/spool. Any of these files that might not be delivered at the end of the job for whatever reason are left in /var/spool/PBS/undelivered. The pbs_mom is started by /etc/rc.d/init.d/pbs and it's derivatives on boot.

On the master node, two other daemons run: pbs_server and pbs_scheduler. These are also started up by /etc/rc.d/init.d/pbs and it's derivatives on boot.

The setup of the server is controlled by the qmgr command. This starts a shell from which you can query and make changes to the setup. For example, to print the setup use the command "print server" (which can be abreiviated to "p s"). To add a new node use the command "create node galaxy65". Please see the manual and man pages for more information. Only one queue is setup. Jobs are targetted to certain nodes using properties (see below). The status of nodes can be queried using pbsnodes -a. Since there are two CPUs per node, upto two jobs can be run per node.

To send a job (which should be a script of some sort - shell or ICM, so long as it has a #! directive on the first line), use the qsub command. Useful flags to specify (can also be added to the script on the second line with a #PBS directive):

code description
-j oe make the stdout and stderr files become merged on exit of the job.
-l nodes=256 for nodes with 256Mb (except the master node, galaxy1)
-l nodes=512 for nodes with 512Mb (except the master node)
-l nodes=dual for all nodes except the master node

Note that the 256,512 and dual properties are defined using the "set node galaxy?? properties+=????" command in qmgr - this approach can be used to partition the cluster in anyway that is needed (e.g. - 20 nodes for VLS, the rest for BLAST jobs). This is easier in a small-user environment than creating new queues (again see the manual and man pages). Note that it is good practice not to run jobs on the master node since it will be dealing with network traffic and user interaction (and hence has different properties to the other nodes).

Jobs can be checked using the qstat command. The -n option allows one to see which node a jobs is running on. Hints for dealing with PBS woes:

Do not allow users to send jobs that will last shorter than 2-3 minutes. This is a waste of resources and just stresses PBS.
Do not allow users to send jobs more often than every 10-15 seconds. Sometimes even this is too fast.
If a node crashes whilst running a job, qstat will hang for upto 5 minutes. The fix is to do /etc/rc.d/init.d/pbsstop on the master nodem and then go into /var/spool/PBS/server_priv/jobs and delete the two (or four) files that correspond to the lost jobs. These can be found by grep'ing for the node names. The crashed node can be found out by running /usr/sbin/do uptime on the master node. Then run /etc/rc.d/init.d/pbs start to restart the PBS server and all should be fine.
Sometimes a reboot of the cluster will lead to pbsnodes -a showing all ndoes to be down. The fix here is to do /etc/rc.d/init.d/pbs stop on the master node, then run /usr/sbin/do "/etc/rc.d/init.d/pbs restart" to restart the PBS mom's on the slave nodes. Then restart the PBS server with /etc/rc.d/init.d/pbs start

For the details about PBS take a look to the file http://www-itg.lbl.gov/Grid/public/pbs/pbs.v2.3_admin.pdf