I have two Debian servers connected to a shared NFS server.
nfs.internal.com:/volume1/SHARE on /share type nfs (rw,nosuid,relatime,sync,vers=3,rsize=131072,wsize=131072,namlen=255,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,soft,noac,nordirplus,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=172.16.0.100,mountvers=3,mountport=892,mountproto=tcp,fsc,lookupcache=none,local_lock=none,addr=172.16.0.100)
Occasionally, the load on one of the servers increases. There seems to be a tipping point where the load slows NFS operations enough that the load runs away, and we eventually hit the process limit (1500) configured for the service running on that server.
When the load increases on one server, the other is not affected, so I don’t believe networking or the NFS server to be the culprit.
Under normal conditions, with a load of ~10:
   ops/s       rpc bklog
20637.000          0.000
getattr:      ops/s       kB/s      kB/op     retrans   avg RTT (ms)   avg exe (ms)
          10453.000   2334.176      0.223    0 (0.0%)          0.339          0.467
access:       ops/s       kB/s      kB/op     retrans   avg RTT (ms)   avg exe (ms)
           4893.000   1139.078      0.233    0 (0.0%)          0.338          0.469
Under "medium" load of ~100:
   ops/s       rpc bklog
13374.200          0.000
read:         ops/s       kB/s      kB/op     retrans   avg RTT (ms)   avg exe (ms)
             10.600   1202.169    113.412    0 (0.0%)          0.623         26.868
write:        ops/s       kB/s      kB/op     retrans   avg RTT (ms)   avg exe (ms)
              7.600      5.462      0.719    0 (0.0%)          0.553          7.395
I don’t have the output saved, but at the highest load levels (around 1500), the avg exe time reaches 4-5 seconds.
From what I understand, an exe value that is much higher than RTT means that the requests are queuing on the local server rather than waiting on the NFS server.
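To put numbers on that reading, here is a minimal Python sketch that subtracts avg RTT from avg exe for the figures quoted above. It assumes avg exe includes the RTT, so the difference approximates the time a request spends on the client side. The function name local_wait_ms is made up for illustration.

```python
# Samples copied from the nfsiostat output above: (avg RTT ms, avg exe ms).
samples = {
    "getattr (load ~10)": (0.339, 0.467),
    "access (load ~10)": (0.338, 0.469),
    "read (load ~100)": (0.623, 26.868),
    "write (load ~100)": (0.553, 7.395),
}


def local_wait_ms(rtt_ms, exe_ms):
    """Approximate client-side time per op, assuming exe includes the RTT."""
    return exe_ms - rtt_ms


for op, (rtt, exe) in samples.items():
    wait = local_wait_ms(rtt, exe)
    share = 100.0 * wait / exe  # portion of exe not explained by the RTT
    print(f"{op:20s} client-side ~{wait:7.3f} ms ({share:4.1f}% of exe)")
```

On these numbers the client-side share is roughly a quarter of exe at load ~10, but over 90% for the read and write samples at load ~100.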
It sounds like the number of simultaneous requests to the NFS server is specified by sunrpc.tcp_slot_table_entries. In recent kernel versions this value is dynamically increased up to sunrpc.tcp_max_slot_table_entries; however, I have never seen it go above the default of two.
Could this be the cause of the performance issues, and what would prevent this value from scaling as it should?
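For reference, the two settings mentioned above can be read with a short Python sketch like the one below. It assumes the sysctls are exposed under /proc/sys/sunrpc (only the case when the sunrpc module is loaded); the helper name read_slot_settings is hypothetical. These files hold the configured values, and this sketch cannot show how many slots a given mount is actually using at any moment.

```python
from pathlib import Path

# Assumed location of the sunrpc sysctls; present only when sunrpc is loaded.
SUNRPC_DIR = Path("/proc/sys/sunrpc")


def read_slot_settings(base=SUNRPC_DIR):
    """Return the two slot table sysctls as ints, or None where unreadable."""
    settings = {}
    for name in ("tcp_slot_table_entries", "tcp_max_slot_table_entries"):
        try:
            settings[name] = int((Path(base) / name).read_text().strip())
        except (OSError, ValueError):
            # File missing, unreadable, or not a plain integer.
            settings[name] = None
    return settings


if __name__ == "__main__":
    for name, value in read_slot_settings().items():
        print(f"sunrpc.{name} = {value}")
```

Running it on both servers, once at normal load and once during an incident, would at least confirm that the two hosts are configured identically.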