Torque is a resource manager that distributes jobs across several computers (a cluster).
Here are some tips for installing Torque on two or more computers.
Installation (server and nodes)
First, follow this thread. It has all the necessary steps to get started.
./configure
make
sudo make install
./torque.setup root
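Do not forget to set up the library path, or you will get errors like this:
pbs_mom: symbol lookup error: pbs_mom: undefined symbol: MaxConnectTimeout
To fix it, add /usr/local/lib to the linker configuration and refresh the cache. A plain "sudo echo ... >> file" fails because the redirection runs without root rights, so use tee:
echo /usr/local/lib | sudo tee -a /etc/ld.so.conf
sudo ldconfig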
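Running the append more than once leaves duplicate lines in ld.so.conf. A minimal sketch of an idempotent version follows; `add_lib_dir` is a hypothetical helper name, not something Torque ships.

```shell
# Hypothetical helper: append a directory to an ld.so.conf-style file
# only if that exact line is not already there.
add_lib_dir() {
    dir=$1
    conf=$2
    # -x: match the whole line, -F: fixed string, -q: no output
    if ! grep -qxF "$dir" "$conf" 2>/dev/null; then
        printf '%s\n' "$dir" >> "$conf"
    fi
}

# Intended use, as root (not run here):
#   add_lib_dir /usr/local/lib /etc/ld.so.conf && ldconfig
```

Afterwards, `ldconfig -p` lists the libraries the linker can see, which is a quick way to check that the Torque library was picked up.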
On the server, list every node and its number of processors in /var/spool/torque/server_priv/nodes, for example:
my_server np=4
node01 np=4
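With more than a couple of nodes, typing the file by hand gets tedious. A small sketch that prints one line per host, assuming every node has the same number of cores; `nodes_file` is a hypothetical helper name.

```shell
# Hypothetical helper: print one "<host> np=<cores>" line per host.
# Assumes all hosts have the same core count (first argument).
nodes_file() {
    np=$1
    shift
    for host in "$@"; do
        printf '%s np=%s\n' "$host" "$np"
    done
}

# Example (not run here):
#   nodes_file 4 my_server node01 | sudo tee /var/spool/torque/server_priv/nodes
```

Once pbs_server has been restarted, `pbsnodes -a` should show each host listed in the file.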
Configure /var/spool/torque/mom_priv/config on the server as well, so that the server can also act as a node.
Configuring the server with these commands is crucial:
# block 1
qmgr -c "set server scheduling=true"
qmgr -c "create queue batch queue_type=execution"
qmgr -c "set queue batch started=true"
qmgr -c "set queue batch enabled=true"
qmgr -c "set queue batch resources_default.nodes=4"
qmgr -c "set queue batch resources_default.walltime=3600"
qmgr -c "set server default_queue=batch"
qmgr -c "set server operators = email@example.com"
qmgr -c "set server keep_completed = 0"
# block 2: before I did this, my jobs would get stuck in the queue. Adjust to your needs. I could not find much information on what each option does exactly
qmgr -c "set queue batch max_running = 8"
qmgr -c "set queue batch resources_max.ncpus = 8"
qmgr -c "set queue batch resources_min.ncpus = 1"
qmgr -c "set queue batch resources_max.nodes = 2"
qmgr -c "set queue batch resources_default.ncpus = 1"
qmgr -c "set queue batch resources_default.neednodes = 1:ppn=1"
qmgr -c "set queue batch resources_default.nodect = 1"
qmgr -c "set queue batch resources_default.nodes = 1"
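Since the values in these blocks need adjusting per cluster, it can help to print the qmgr commands first and review them before running anything. A rough sketch of such a dry run; `queue_cmds` is a hypothetical helper and only prints text.

```shell
# Hypothetical helper: print one qmgr command per "attribute = value"
# argument for the given queue. It prints only; nothing is executed.
queue_cmds() {
    queue=$1
    shift
    for setting in "$@"; do
        printf 'qmgr -c "set queue %s %s"\n' "$queue" "$setting"
    done
}

# Example (not run here): review the output, then pipe it to sh as root.
#   queue_cmds batch "max_running = 8" "resources_max.nodes = 2"
```

After applying the settings, `qmgr -c "print server"` dumps the current server and queue configuration, which makes it easy to confirm what was actually set.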
Kill all pbs processes and restart:
killall pbs_server pbs_mom pbs_sched
pbs_server; pbs_mom; pbs_sched
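After the restart, a tiny test job is a quick way to see whether jobs leave the Q state. A minimal sketch that prints such a job script, assuming the queue is called batch as in the commands above; `test_job` and the job name torque_test are made up for illustration.

```shell
# Hypothetical helper: print a small PBS job script for the given
# queue (default: batch). The job reports its host and sleeps briefly.
test_job() {
    queue=${1:-batch}
    cat <<EOF
#!/bin/sh
#PBS -N torque_test
#PBS -q $queue
#PBS -l nodes=1:ppn=1,walltime=00:01:00
hostname
sleep 30
EOF
}

# Example (not run here):
#   test_job > test_job.sh && qsub test_job.sh && qstat
```

If the job shows R in qstat for about half a minute and then finishes, the server, scheduler and mom are talking to each other.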
If you have a firewall
- Open UDP ports 15001, 15004 and 1023.
- I also had to either open port 22 or give the nodes special scp instructions; otherwise my jobs would stay in E state forever.
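How the ports get opened depends on the firewall in use. A sketch that only prints candidate rules, assuming plain iptables (the notes above do not say which firewall was used); `fw_rules` is a hypothetical helper name.

```shell
# Hypothetical helper: print one iptables ACCEPT rule per port for the
# given protocol. Assumes iptables; adapt for other firewalls.
# It prints only; nothing is applied.
fw_rules() {
    proto=$1
    shift
    for port in "$@"; do
        printf 'iptables -A INPUT -p %s --dport %s -j ACCEPT\n' "$proto" "$port"
    done
}

# Example (not run here): review the output, then run it as root.
#   fw_rules udp 15001 15004 1023
```

The ports above are the ones listed in these notes. Whether TCP also needs opening on a given installation is not covered here, so check the mom and server logs if nodes still cannot connect.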
To instruct the node (mom) to use the correct ssh port and take advantage of NFS:
$pbsserver my_server
$usecp *:/mnt/shared_disk /mnt/shared_disk
$rcpcmd /usr/bin/scp -P 2232
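One pitfall when writing this file from a script: the directives start with a dollar sign, so a double-quoted echo makes the shell expand them to nothing. A sketch that avoids this with single-quoted printf formats; `mom_config` is a hypothetical helper and its three arguments are placeholders.

```shell
# Hypothetical helper: print a mom config for the given server name,
# shared directory and scp port. The single-quoted format strings stop
# the shell from expanding $pbsserver, $usecp and $rcpcmd.
mom_config() {
    server=$1
    shared=$2
    scp_port=$3
    printf '$pbsserver %s\n' "$server"
    printf '$usecp *:%s %s\n' "$shared" "$shared"
    printf '$rcpcmd /usr/bin/scp -P %s\n' "$scp_port"
}

# Example (not run here):
#   mom_config my_server /mnt/shared_disk 2232 | sudo tee /var/spool/torque/mom_priv/config
```

Restart pbs_mom afterwards so it rereads the file.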
- Install Torque on each node as above.
- Do not forget to fix the library path on each node as well.
- Configure each node's mom_priv/config with the contents shown above, adapted to your needs.
Quick summary of errors I encountered
- Everything was set up properly, but jobs would stay in Q (queued) status or go directly to C (completed) state without ever running. Torque would send emails to root saying “Job Deleted because it would never run” and “Not enough of the right type of nodes available”. The solution is to use block 2 of the “set queue” commands above.
- Jobs would run but stay in E (exit) state forever, clogging the queue. The solution is to open the correct ports in the firewall. Ports 15001, 15004 and 1023 on the server must be open to the nodes.
- Jobs would still stay in E (exit) state forever, even after those ports were opened. The solution is to either open port 22 or configure the node to run scp over a port that is open.
- pbs_mom was up on the node, but the node appeared as “down” under pbsnodes. This was a firewall problem. For some reason, a new installation of Torque required opening port 15096. Check /var/spool/torque/mom_logs/ and see what the error message is. Torque was nice enough to say which port on the server it was trying to connect to.
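For the last item, it helps to know which log file to look at. A small sketch, assuming the mom log files are named after the date in YYYYMMDD form (an assumption worth verifying with ls on your installation); `mom_log_path` is a hypothetical helper name.

```shell
# Hypothetical helper: print the path of the mom log for a given day
# (default: today). Assumes one log file per day named YYYYMMDD.
mom_log_path() {
    day=${1:-$(date +%Y%m%d)}
    printf '/var/spool/torque/mom_logs/%s\n' "$day"
}

# Example (not run here): show the most recent messages on the node.
#   sudo tail -n 50 "$(mom_log_path)"
```

On the server side, `pbsnodes -l` lists the nodes currently marked down or offline, so the two can be checked together.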