(a word of advice from Brian). PBS is the queuing system used to set up jobs, distribute them to nodes, and return the output of the jobs. Installation of PBS is not trivial - the manual should be read for full details. Here is a short run-down of important files and tips for keeping PBS running OK.
/var/spool/PBS contains all PBS information for a node.
On all nodes except the master node, this only contains information for the pbs_mom
(the client daemon that listens for jobs and tells the server when jobs have finished, etc.).
Logs for the pbs_mom are found in /var/spool/PBS/mom_logs.
When jobs are running, the standard output and error are found in /var/spool/PBS/spool.
Any of these files that might not be delivered at the end of the job for whatever reason
are left in /var/spool/PBS/undelivered.
The pbs_mom is started on boot by /etc/rc.d/init.d/pbs and its derivatives.
On the master node, two other daemons run: pbs_server and pbs_sched (the scheduler).
These are also started on boot by /etc/rc.d/init.d/pbs and its derivatives.
The setup of the server is controlled by the qmgr command.
This starts a shell from which you can query and make changes to the setup.
For example, to print the setup use the command "print server" (which can be abbreviated to "p s").
To add a new node use the command "create node galaxy65".
Please see the manual and man pages for more information. Only one queue is set up.
Jobs are targeted to certain nodes using properties (see below).
The status of nodes can be queried using pbsnodes -a.
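Putting the qmgr commands above into a script is also possible: qmgr accepts single commands via its -c flag (check the man page for your version). A sketch, using the example node name galaxy65:

```shell
# Print the current server setup (same as "p s" inside the qmgr shell)
qmgr -c "print server"

# Add a new node
qmgr -c "create node galaxy65"

# Check that the new node now shows up
pbsnodes -a
```

These commands must be run on the master node with a running pbs_server.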
Since there are two CPUs per node, up to two jobs can be run per node.
To send a job (which should be a script of some sort - shell or ICM, so long as it has a #! directive on the first line), use the qsub command.
Useful flags to specify (can also be added to the script on the second line with a #PBS directive):
| flag | description |
|---|---|
| `-j oe` | merge the stdout and stderr files on exit of the job |
| `-l nodes=256` | nodes with 256 MB (except the master node, galaxy1) |
| `-l nodes=512` | nodes with 512 MB (except the master node) |
| `-l nodes=dual` | all nodes except the master node |
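For example, a minimal job script carrying these flags as #PBS directives might look like the following (the job body here is just a placeholder):

```shell
#!/bin/sh
#PBS -j oe
#PBS -l nodes=dual

# The actual work goes here - any shell (or ICM) commands.
echo "Running on $(hostname)"
```

Submit it with `qsub myjob.sh`; the merged stdout/stderr file is returned to the submission directory when the job exits.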
Note that the 256, 512 and dual properties are defined using the "set node galaxy?? properties+=????" command in
qmgr - this approach can be used to partition the cluster in any way that is needed (e.g. 20 nodes for VLS, the rest for BLAST
jobs). This is easier in a small-user environment than creating new queues (again, see the manual and man pages). Note that it is good
practice not to run jobs on the master node, since it will be dealing with network traffic and user interaction (and hence has different
properties to the other nodes).
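As a sketch of that partitioning approach (the node names and the vls/blast property names are hypothetical examples, not properties defined on this cluster):

```shell
# Tag some nodes for VLS work...
qmgr -c "set node galaxy02 properties += vls"
# ...and others for BLAST
qmgr -c "set node galaxy22 properties += blast"

# Jobs then target a partition by requesting the property:
qsub -l nodes=vls vls_job.sh
```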
Jobs can be checked using the qstat command.
The -n option allows one to see which node a job is running on.
Hints for dealing with PBS woes:
If a node crashes while running jobs, run /etc/rc.d/init.d/pbs stop on the master node, then go into /var/spool/PBS/server_priv/jobs and delete the two (or four) files
that correspond to the lost jobs. These can be found by grep'ing for the node names. The crashed node can be identified by
running /usr/sbin/do uptime on the master node. Then run /etc/rc.d/init.d/pbs start to restart the PBS
server and all should be fine.
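The recovery steps above can be sketched as a command sequence run on the master node (galaxy12 stands in for the crashed node; adjust as needed, and double-check the matched files before deleting):

```shell
# Stop the PBS server
/etc/rc.d/init.d/pbs stop

# Find the crashed node (the one that fails to answer uptime)
/usr/sbin/do uptime

# Delete the job files that mention the crashed node
cd /var/spool/PBS/server_priv/jobs
grep -l galaxy12 * | xargs rm

# Restart the PBS server
/etc/rc.d/init.d/pbs start
```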
pbsnodes -a showing all nodes to be down. The fix here is to run
/etc/rc.d/init.d/pbs stop on the master node,
then /usr/sbin/do "/etc/rc.d/init.d/pbs restart" to restart the pbs_mom daemons on the slave nodes,
and finally /etc/rc.d/init.d/pbs start to restart the PBS server.
For full details about PBS, see the admin guide at http://www-itg.lbl.gov/Grid/public/pbs/pbs.v2.3_admin.pdf