#StackBounty: #cluster #hpc #beegfs BeegFS cluster – One node dead, need to move on without it

Bounty: 50

As the title suggests, a cluster consisting of four machines has one dead machine in the rack.

The cluster is set up with buddy mirroring for redundancy, so the data should still be intact. The dead machine is the secondary in its mirror group. How does one start the cluster and ignore the warnings and errors caused by the unreachable machine?

As it currently stands, all beegfs-meta servers are running and all remaining beegfs-storage services are running, but none of the beegfs-clients will start:

Jul31 17:42:31 *mount(44691) [Remoting (stat storage targets)] >> Error target (storage): 401; Msg: Communication error
Jul31 17:42:31 *mount(44691) [Mount sanity check] >> Retrieval of storage server free space info failed. Are the storage servers running and registered at the management daemon? Did you remove a storage target directory on a server? (Error: Communication error)
Jul31 17:42:31 *mount(44691) [App (stop components)] >> Stopping components...
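To pin down which storage target the client is complaining about, the ID can be pulled out of the log lines above. This is only a minimal sketch: the regular expression and the `failing_targets` helper are my own assumptions based on the three lines quoted here, not anything BeegFS ships.

```python
import re

# Assumed pattern, derived only from the log excerpt quoted in this question.
TARGET_ERROR = re.compile(
    r"Error target \((?P<kind>\w+)\): (?P<target_id>\d+); Msg: (?P<msg>.+)$"
)


def failing_targets(log_text):
    """Return (kind, id, message) tuples for every 'Error target' line found."""
    found = []
    for line in log_text.splitlines():
        match = TARGET_ERROR.search(line)
        if match:
            found.append(
                (match["kind"], int(match["target_id"]), match["msg"].strip())
            )
    return found


# The client log lines from the question, verbatim.
LOG = """\
Jul31 17:42:31 *mount(44691) [Remoting (stat storage targets)] >> Error target (storage): 401; Msg: Communication error
Jul31 17:42:31 *mount(44691) [Mount sanity check] >> Retrieval of storage server free space info failed. Are the storage servers running and registered at the management daemon? Did you remove a storage target directory on a server? (Error: Communication error)
Jul31 17:42:31 *mount(44691) [App (stop components)] >> Stopping components...
"""

if __name__ == "__main__":
    print(failing_targets(LOG))  # [('storage', 401, 'Communication error')]
```

Here the only ID reported is 401, which I would expect to belong to the dead machine; that can be checked against the target list held by the management daemon.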

Ideally, I would like to not remove the dead node but somehow disable it, as it will eventually come back after a hardware fix.

