#StackBounty: #linux #cluster-computing #torque Torque cannot communicate with host

Bounty: 50

I have been attempting to set up the Torque scheduler for a small cluster. I followed the steps for configuring the server at http://docs.adaptivecomputing.com/torque/archive/3-0-2/1.2configuring_torque_on_server.php

However, when I attempt

qterm -t quick

I get the following error:

$ sudo qterm -t quick
Unable to communicate with Terra(192.168.1.25)
Cannot connect to specified server host 'Terra'.
qterm: could not connect to server '' (111) Connection refused 

but the server starts just fine. However, when I attempt to run a command that runs on multiple nodes, such as

qsub -l nodes=2:ppn=4 /home/user/scripts/someScript

it prints out something like

7.Terra

where Terra is the name of the head node, which is also a compute node in the cluster. This isn’t the problem. The problem is that the job does not run, nor does it produce any output anywhere.
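One check that might narrow down the "Connection refused" (111) from qterm is whether anything is listening on the server port at all. Below is a minimal sketch, not something I have verified against this cluster: the helper name port_open is hypothetical, it relies on bash's /dev/tcp pseudo-device and the coreutils timeout command, and it assumes pbs_server uses Torque's default port 15001.

```shell
#!/bin/bash
# Hypothetical helper: succeed (exit 0) if a TCP connection to HOST:PORT
# can be opened within 3 seconds, fail otherwise.
# Assumes bash was built with /dev/tcp support and that `timeout` is available.
port_open() {
    local host=$1 port=$2
    # The child shell opens a socket on fd 3 and exits; only its status matters.
    timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
}
```

For example, `port_open Terra 15001 || echo "nothing listening on Terra:15001"` would distinguish a server that is not listening from a name-resolution or authorization problem. The same probe could be pointed at the MOM ports (15002 and 15003) on each node.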

The torque server log: https://ptpb.pw/EaKo

The terra node log: https://ptpb.pw/9w5M

and the Marte log: https://ptpb.pw/o4PT

I can get it to run with a PBS script, but only with one node:

#!/bin/bash
#PBS -l pmem=1gb,nodes=1:ppn=4
#PBS -m abe
cd Documents/
wc -l largeTest.csv

Here is the output of qstat after submitting a job:

Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
16.Terra                   testPerformance  justin                 0 R batch

The output of pbsnodes -a:

Terra
 state = free
 power_state = Running
 np = 4
 properties = Tower
 ntype = cluster
 status = opsys=linux,uname=Linux Terra 4.17.14-arch1-1-ARCH #1 SMP PREEMPT Thu Aug 9 11:56:50 UTC 2018 x86_64,sessions=11525 22029,nsessions=2,nusers=1,idletime=57964,totmem=8111556kb,availmem=7539284kb,physmem=8111556kb,ncpus=4,loadave=0.00,gres=,netload=30570521372,state=free,varattr= ,cpuclock=Fixed,macaddr=e0:3f:49:44:72:20,version=6.1.1.1,rectime=1534937388,jobs=
 mom_service_port = 15002
 mom_manager_port = 15003
 gpus = 1

Marte
 state = free
 power_state = Running
 np = 4
 properties = NFSServer
 ntype = cluster
 status = opsys=linux,uname=Linux Marte 4.18.1-arch1-1-ARCH #1 SMP PREEMPT Wed Aug 15 21:11:55 UTC 2018 x86_64,sessions=366 556 563,nsessions=3,nusers=2,idletime=58140,totmem=7043404kb,availmem=6703808kb,physmem=7043404kb,ncpus=4,loadave=0.02,gres=,netload=36500663511,state=free,varattr= ,cpuclock=Fixed,macaddr=c8:5b:76:4a:65:91,version=6.1.1.1,rectime=1534937359,jobs=
 mom_service_port = 15002
 mom_manager_port = 15003

and the contents of /var/spool/torque/server_priv/nodes:

Terra np=4 gpus=1 Tower
Marte np=4 NFSServer
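As a sanity check that a request like nodes=2:ppn=4 is at least satisfiable on paper, the nodes file can be scanned for entries declaring enough processors. The following is only a sketch: the function name count_nodes_with_np is hypothetical, and it assumes a node without an explicit np= token counts as np=1.

```shell
#!/bin/bash
# Hypothetical helper: print how many nodes in a Torque nodes file
# declare at least MIN_NP processors through an np=N token.
count_nodes_with_np() {
    local nodes_file=$1 min_np=$2
    awk -v min="$min_np" '
        NF == 0 || $1 ~ /^#/ { next }       # skip blank lines and comments
        {
            np = 1                          # assumed default when np= is absent
            for (i = 2; i <= NF; i++)
                if ($i ~ /^np=[0-9]+$/)
                    np = substr($i, 4) + 0  # numeric part after "np="
            if (np >= min) count++
        }
        END { print count + 0 }             # prints 0 when nothing matched
    ' "$nodes_file"
}
```

Running `count_nodes_with_np /var/spool/torque/server_priv/nodes 4` against the file above should report both nodes, which suggests the request itself is valid and the problem lies in communication between the server and the MOMs rather than in the node definitions.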

Edit: Here are the most recent logs as well

Mom Log for Node: https://ptpb.pw/DhKi

Mom Log for head node: https://ptpb.pw/MTlD

and the server log: https://ptpb.pw/HPkE

