Recently I encountered a very puzzling problem. We are running AI training jobs on an NPU device. The training usually reads and writes data on an NFS directory, and during training we found that the network device went down.
We tried executing ip link set dev xxxx up, only to find that it has no effect on the actual state of the device. It stays down until we reboot the machine.
From the system log (Ubuntu 18.04) and the NIC driver log, we know something is wrong with either the card or its driver. The driver keeps complaining with messages like "update mac stats fail, get mac pkt stats fail".
The most interesting and difficult part is figuring out which application behavior triggers this issue and how it makes that happen.
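To narrow down the trigger, one thing we could do is record exactly when the link state changes and compare those timestamps with the training job's own logs. Below is a minimal sketch of that idea, not something we have validated on the affected machine. It assumes the standard Linux sysfs files /sys/class/net/<iface>/operstate and /sys/class/net/<iface>/carrier. The function names, the polling interval, and the sysfs_root parameter are hypothetical choices made for illustration.

```python
import time
from pathlib import Path


def read_link_state(iface, sysfs_root="/sys/class/net"):
    """Return (operstate, carrier) for iface; a value is None if unreadable.

    Reading 'carrier' can fail while the interface is administratively
    down, so a read error is reported as None instead of raising.
    """
    base = Path(sysfs_root) / iface
    values = []
    for name in ("operstate", "carrier"):
        try:
            values.append((base / name).read_text().strip())
        except OSError:
            values.append(None)
    return tuple(values)


def watch(iface, interval=1.0, max_polls=None, sysfs_root="/sys/class/net"):
    """Poll the link state and print a timestamped line on every change.

    Returns the list of distinct states observed, in order. With
    max_polls=None the loop runs until interrupted.
    """
    last = None
    polls = 0
    changes = []
    while max_polls is None or polls < max_polls:
        current = read_link_state(iface, sysfs_root)
        if current != last:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            print(f"{stamp} {iface}: operstate={current[0]} carrier={current[1]}")
            changes.append(current)
            last = current
        polls += 1
        time.sleep(interval)
    return changes
```

Running watch("xxxx") in a separate terminal during training (with xxxx replaced by the real interface name) would give timestamps to line up against the NFS read/write phases and the driver's error messages.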
How can a user-space application cause the NIC to go down? Is it possible to figure out what happens there without having the source code of the NIC driver?
Please help me explain this, give me some suggestions on how to figure it out, or point me to some documentation on it.