#StackBounty: #ubuntu #physical-environment Debugging Unexpected System Shutdown

Bounty: 50

I’m attempting to debug a recurring unexpected system shutdown, which sometimes occurs when the machine is under load, but I cannot get it to happen reliably. My current hypothesis are:

  • drawing too much power from the wall
  • thermal issues
  • undiscovered hardware issue

The Hardware + Software + Workload

A list of hardware on my machine can be found here: https://pcpartpicker.com/user/erotemic/saved/#view=WKpmD3

The relevant bits are:

  • CPU: Intel i9-11900K with a Noctua NH-d15 Air Cooler
  • GPU0: RTX 3090 (attached to monitors)
  • GPU1: GTX 1080ti
  • PSU: EVGA T2 1600 W 80+ Titanium

I’m running stock Ubuntu 21.04

There are a few different workloads I will stress the machine with.

  • ethermine – which used both GPUs.
  • BOINC – with climateprediction.net and World Community Grid (set to use 90% of the CPU as long as the machine isn’t in use)
  • Custom Machine Learning workflows with PyTorch.

I haven’t been using ethermine recently, I’ve been running my ML workloads.

The Wattage Hypothesis

I’ve measured the wattage of system, and it consumes rougly 700-800 Watts as measured by a Kill-O-Watt P3 (this includes the monitors and everything else plugged into the surge protector). I’m living in an old American building that was converted into appartments. So, I’m not 100% sure about the capacity of the circuit, but assuming everything is up to code (which I’m not convinced it is) the circuit should be able to accomidate 1800 watts. Other electronics in the room are a 10 Watt lamp and a 989 Watt AC. So this brushes right up against the 1800 Watt limit. At first I was convinced this must be the culprit, but one night when it was cooler, I started a workload and unplugged the AC, and in the morning the power off, so this hypothesis no longer explains the symptoms well.

Additionally, I thought my cheap "Quirky Pivot Power" surge protector might be an issue, so I’ve ordred a Tripp Lite ISOBAR6Ultra, which hopefully will be higher quality, but it has not arrived yet, and I don’t think that is the problem.

The Thermal Hypothesis

I’m currently leaning more towards the thermals being the problem, but when I search the logs I don’t see any information related to thermal related shutdowns.

I’ve been using psensor to monitor temperatures and dump logs to disk every 300 seconds (so the logged temps might not include the high temperature that caused the shutdown).

I’ve graphed the temperatures that were logged around the most recent shutdown, which occured around 2021-08-18 at 3:00am:

enter image description here

Note, that I was intentionally not using the RTX 3090 here in an effort to prevent such issues, but it seems even the 1080ti runs will trip whatever condition is causing this shtudown.

The CPU logs a max temperature here of 93C, but I’ve seen the temperature record maximum temperatures of close to 99C, and the critical temperature as reproted by "sensors" is 100C. So, given that the CPU temperatures are increasing right before the shutdown occured, and the logging interval is every 5 minutes, it could very well be the case that the system hit a critical temperature and shutdown before the next log occurred.

But I’m still unsatisfied with this. First running journalctl -g 'temperature|critical' -b -2 as suggested in https://unix.stackexchange.com/questions/502226/how-do-you-find-out-if-a-linux-machine-overheated-before-the-previous-boot-and-w gives no indication that the system logged a temperature issue.

The result of journalctl -b -1

Aug 18 02:46:57 toothbrush smartd[1857]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 84 to 71
Aug 18 02:46:57 toothbrush smartd[1857]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 84 to 72
Aug 18 02:46:57 toothbrush smartd[1857]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 20 to 19
Aug 18 02:46:57 toothbrush smartd[1857]: Device: /dev/sdc [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 83 to 74
Aug 18 02:46:57 toothbrush smartd[1857]: Device: /dev/sdc [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 20 to 23
Aug 18 02:46:57 toothbrush smartd[1857]: Device: /dev/sdd [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 67 to 77
Aug 18 02:46:57 toothbrush smartd[1857]: Device: /dev/sdd [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 27 to 23
Aug 18 02:47:00 toothbrush boinc[3170]: 18-Aug-2021 02:47:00 [---] Suspending computation - CPU is busy
Aug 18 02:47:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r344465819 t8087795, 64bit:1), syncing.
Aug 18 02:47:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r398433847 t7744494, 64bit:1), syncing.
Aug 18 02:47:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r349747452 t8371229, 64bit:1), syncing.
Aug 18 02:48:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r390229100 t7980049, 64bit:1), syncing.
Aug 18 02:48:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r352409333 t7226854, 64bit:1), syncing.
Aug 18 02:48:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r508920330 t10538384, 64bit:1), syncing.
Aug 18 02:48:50 toothbrush boinc[3170]: 18-Aug-2021 02:48:50 [---] Resuming computation
Aug 18 02:49:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r261199946 t4894398, 64bit:1), syncing.
Aug 18 02:49:01 toothbrush boinc[3170]: 18-Aug-2021 02:49:01 [---] Suspending computation - CPU is busy
Aug 18 02:49:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r251680223 t6509690, 64bit:1), syncing.
Aug 18 02:49:21 toothbrush boinc[3170]: 18-Aug-2021 02:49:21 [---] Resuming computation
Aug 18 02:49:31 toothbrush boinc[3170]: 18-Aug-2021 02:49:31 [---] Suspending computation - CPU is busy
Aug 18 02:49:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r346528983 t5840449, 64bit:1), syncing.
Aug 18 02:50:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r560923145 t12173867, 64bit:1), syncing.
Aug 18 02:50:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r567474866 t11497897, 64bit:1), syncing.
Aug 18 02:50:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r519892497 t10585216, 64bit:1), syncing.
Aug 18 02:51:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r553040012 t11503711, 64bit:1), syncing.
Aug 18 02:51:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r489967052 t11999909, 64bit:1), syncing.
Aug 18 02:51:31 toothbrush boinc[3170]: 18-Aug-2021 02:51:31 [---] Resuming computation
Aug 18 02:51:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r279491189 t4690385, 64bit:1), syncing.
Aug 18 02:51:41 toothbrush boinc[3170]: 18-Aug-2021 02:51:41 [---] Suspending computation - CPU is busy
Aug 18 02:52:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r233899151 t4847426, 64bit:1), syncing.
Aug 18 02:52:01 toothbrush boinc[3170]: 18-Aug-2021 02:52:01 [---] Resuming computation
Aug 18 02:52:11 toothbrush boinc[3170]: 18-Aug-2021 02:52:11 [---] Suspending computation - CPU is busy
Aug 18 02:52:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r268957755 t5537306, 64bit:1), syncing.
Aug 18 02:52:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r333913668 t7187733, 64bit:1), syncing.
Aug 18 02:53:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r450294755 t8957939, 64bit:1), syncing.
Aug 18 02:53:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r264028304 t5582071, 64bit:1), syncing.
Aug 18 02:53:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r379501357 t8308167, 64bit:1), syncing.
Aug 18 02:54:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r364408338 t9670592, 64bit:1), syncing.
Aug 18 02:54:12 toothbrush boinc[3170]: 18-Aug-2021 02:54:12 [---] Resuming computation
Aug 18 02:54:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r410359086 t6437227, 64bit:1), syncing.
Aug 18 02:54:22 toothbrush boinc[3170]: 18-Aug-2021 02:54:22 [---] Suspending computation - CPU is busy
Aug 18 02:54:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r266936223 t4903133, 64bit:1), syncing.
Aug 18 02:54:42 toothbrush boinc[3170]: 18-Aug-2021 02:54:42 [---] Resuming computation
Aug 18 02:54:52 toothbrush boinc[3170]: 18-Aug-2021 02:54:52 [---] Suspending computation - CPU is busy
Aug 18 02:55:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r258961514 t5642594, 64bit:1), syncing.
Aug 18 02:55:01 toothbrush CRON[313877]: pam_unix(cron:session): session opened for user root by (uid=0)
Aug 18 02:55:01 toothbrush CRON[313878]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Aug 18 02:55:01 toothbrush CRON[313877]: pam_unix(cron:session): session closed for user root
Aug 18 02:55:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r485119089 t10059003, 64bit:1), syncing.
Aug 18 02:55:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r482961424 t9750792, 64bit:1), syncing.
Aug 18 02:56:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r334697691 t7035018, 64bit:1), syncing.
Aug 18 02:56:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r449591310 t9490996, 64bit:1), syncing.
Aug 18 02:56:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r415820654 t10568703, 64bit:1), syncing.
Aug 18 02:56:43 toothbrush boinc[3170]: 18-Aug-2021 02:56:43 [---] Resuming computation
Aug 18 02:56:53 toothbrush boinc[3170]: 18-Aug-2021 02:56:53 [---] Suspending computation - CPU is busy
Aug 18 02:57:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r326675026 t4890602, 64bit:1), syncing.
Aug 18 02:57:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r461180383 t10357149, 64bit:1), syncing.
Aug 18 02:57:23 toothbrush boinc[3170]: 18-Aug-2021 02:57:23 [---] Resuming computation
Aug 18 02:57:33 toothbrush boinc[3170]: 18-Aug-2021 02:57:33 [---] Suspending computation - CPU is busy
Aug 18 02:57:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r311496530 t5584467, 64bit:1), syncing.
Aug 18 02:58:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r345401175 t6977056, 64bit:1), syncing.
Aug 18 02:58:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r413257951 t8468887, 64bit:1), syncing.
Aug 18 02:58:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r430901546 t9350168, 64bit:1), syncing.
Aug 18 02:59:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r316409469 t6532987, 64bit:1), syncing.
Aug 18 02:59:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r496502797 t11915940, 64bit:1), syncing.
Aug 18 02:59:24 toothbrush boinc[3170]: 18-Aug-2021 02:59:24 [---] Resuming computation

The result of cat /var/log/syslog near the shutdown is:

Aug 18 02:52:11 toothbrush boinc[3170]: 18-Aug-2021 02:52:11 [---] Suspending computation - CPU is busy
Aug 18 02:52:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r268957755 t5537306, 64bit:1), syncing.
Aug 18 02:52:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r333913668 t7187733, 64bit:1), syncing.
Aug 18 02:53:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r450294755 t8957939, 64bit:1), syncing.
Aug 18 02:53:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r264028304 t5582071, 64bit:1), syncing.
Aug 18 02:53:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r379501357 t8308167, 64bit:1), syncing.
Aug 18 02:54:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r364408338 t9670592, 64bit:1), syncing.
Aug 18 02:54:12 toothbrush boinc[3170]: 18-Aug-2021 02:54:12 [---] Resuming computation
Aug 18 02:54:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r410359086 t6437227, 64bit:1), syncing.
Aug 18 02:54:22 toothbrush boinc[3170]: 18-Aug-2021 02:54:22 [---] Suspending computation - CPU is busy
Aug 18 02:54:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r266936223 t4903133, 64bit:1), syncing.
Aug 18 02:54:42 toothbrush boinc[3170]: 18-Aug-2021 02:54:42 [---] Resuming computation
Aug 18 02:54:52 toothbrush boinc[3170]: 18-Aug-2021 02:54:52 [---] Suspending computation - CPU is busy
Aug 18 02:55:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r258961514 t5642594, 64bit:1), syncing.
Aug 18 02:55:01 toothbrush CRON[313878]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Aug 18 02:55:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r485119089 t10059003, 64bit:1), syncing.
Aug 18 02:55:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r482961424 t9750792, 64bit:1), syncing.
Aug 18 02:56:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r334697691 t7035018, 64bit:1), syncing.
Aug 18 02:56:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r449591310 t9490996, 64bit:1), syncing.
Aug 18 02:56:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r415820654 t10568703, 64bit:1), syncing.
Aug 18 02:56:43 toothbrush boinc[3170]: 18-Aug-2021 02:56:43 [---] Resuming computation
Aug 18 02:56:53 toothbrush boinc[3170]: 18-Aug-2021 02:56:53 [---] Suspending computation - CPU is busy
Aug 18 02:57:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r326675026 t4890602, 64bit:1), syncing.
Aug 18 02:57:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r461180383 t10357149, 64bit:1), syncing.
Aug 18 02:57:23 toothbrush boinc[3170]: 18-Aug-2021 02:57:23 [---] Resuming computation
Aug 18 02:57:33 toothbrush boinc[3170]: 18-Aug-2021 02:57:33 [---] Suspending computation - CPU is busy
Aug 18 02:57:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r311496530 t5584467, 64bit:1), syncing.
Aug 18 02:58:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r345401175 t6977056, 64bit:1), syncing.
Aug 18 02:58:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r413257951 t8468887, 64bit:1), syncing.
Aug 18 02:58:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r430901546 t9350168, 64bit:1), syncing.
Aug 18 02:59:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r316409469 t6532987, 64bit:1), syncing.
Aug 18 02:59:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r496502797 t11915940, 64bit:1), syncing.
Aug 18 02:59:24 toothbrush boinc[3170]: 18-Aug-2021 02:59:24 [---] Resuming computation
Aug 18 09:25:52 toothbrush systemd-modules-load[472]: Inserted module 'lp'
Aug 18 09:25:52 toothbrush systemd-modules-load[472]: Inserted module 'ppdev'
Aug 18 09:25:52 toothbrush systemd-modules-load[472]: Inserted module 'parport_pc'
Aug 18 09:25:52 toothbrush systemd-modules-load[472]: Inserted module 'msr'
Aug 18 09:25:52 toothbrush kernel: [    0.000000] microcode: microcode updated early to revision 0x40, date = 2021-04-11
Aug 18 09:25:52 toothbrush lvm[461]:   2 logical volume(s) in volume group "vgubuntu" monitored
Aug 18 09:25:52 toothbrush kernel: [    0.000000] Linux version 5.11.0-25-generic (buildd@lgw01-amd64-044) (gcc (Ubuntu 10.3.0-1ubuntu1) 10.3.0, GNU ld (GNU Binutils for Ubuntu) 2.36.1) #27-Ubuntu SMP Fri Jul 9 23:06:29 UTC 2021 (Ubuntu 5.11.0-25.27-generic 5.11.22)
Aug 18 09:25:52 toothbrush kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.11.0-25-generic root=/dev/mapper/vgubuntu-root ro quiet splash vt.handoff=7
Aug 18 09:25:52 toothbrush kernel: [    0.000000] KERNEL supported cpus:
Aug 18 09:25:52 toothbrush systemd[1]: Starting Flush Journal to Persistent Storage...
Aug 18 09:25:52 toothbrush kernel: [    0.000000]   Intel GenuineIntel
Aug 18 09:25:52 toothbrush kernel: [    0.000000]   AMD AuthenticAMD
Aug 18 09:25:52 toothbrush kernel: [    0.000000]   Hygon HygonGenuine
Aug 18 09:25:52 toothbrush kernel: [    0.000000]   Centaur CentaurHauls
Aug 18 09:25:52 toothbrush kernel: [    0.000000]   zhaoxin   Shanghai  
Aug 18 09:25:52 toothbrush systemd[1]: Finished Load Kernel Modules.

Something of interest here is that the last log before the shutdown was Aug 18 02:59:24 toothbrush boinc[3170]: 18-Aug-2021 02:59:24 [---] Resuming computation, indicating the BOINC was about to start running a CPU intensive process.

Running cat /var/log/kern.log and looking in the nearby time gives less info:

Aug 17 23:47:21 toothbrush kernel: [100858.782842] pcieport 0000:00:01.0: AER: Corrected error received: 0000:00:01.0
Aug 17 23:47:21 toothbrush kernel: [100858.782850] pcieport 0000:00:01.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Aug 17 23:47:21 toothbrush kernel: [100858.782851] pcieport 0000:00:01.0:   device [8086:4c01] error status/mask=00000001/00002000
Aug 17 23:47:21 toothbrush kernel: [100858.782852] pcieport 0000:00:01.0:    [ 0] RxErr                  (First)
Aug 18 00:00:01 toothbrush kernel: [101618.605604] audit: type=1400 audit(1629259201.304:83): apparmor="DENIED" operation="capable" profile="/usr/sbin/cupsd" pid=3302495 comm="cupsd" capability=12  capname="net_admin"
Aug 18 00:00:05 toothbrush kernel: [101622.407042] audit: type=1400 audit(1629259205.104:84): apparmor="DENIED" operation="capable" profile="/usr/sbin/cups-browsed" pid=3302502 comm="cups-browsed" capability=23  capname="sys_nice"
Aug 18 09:25:52 toothbrush kernel: [    0.000000] microcode: microcode updated early to revision 0x40, date = 2021-04-11
Aug 18 09:25:52 toothbrush kernel: [    0.000000] Linux version 5.11.0-25-generic (buildd@lgw01-amd64-044) (gcc (Ubuntu 10.3.0-1ubuntu1) 10.3.0, GNU ld (GNU Binutils for Ubuntu) 2.36.1) #27-Ubuntu SMP Fri Jul 9 23:06:29 UTC 2021 (Ubuntu 5.11.0-25.27-generic 5.11.22)
Aug 18 09:25:52 toothbrush kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.11.0-25-generic root=/dev/mapper/vgubuntu-root ro quiet splash vt.handoff=7
Aug 18 09:25:52 toothbrush kernel: [    0.000000] KERNEL supported cpus:

Running: last -x | head | tac

joncrall :0           :0               Mon Aug 16 19:46 - crash (1+13:38)
runlevel (to lvl 5)   5.11.0-25-generi Mon Aug 16 19:47 - 09:26 (1+13:39)
joncrall pts/3        tmux(11727).%0   Mon Aug 16 20:41 - 21:49  (01:07)
joncrall pts/23       tmux(3215922).%0 Tue Aug 17 23:46 - crash  (09:39)
reboot   system boot  5.11.0-25-generi Wed Aug 18 09:25   still running
joncrall :0           :0               Wed Aug 18 09:25   still logged in
runlevel (to lvl 5)   5.11.0-25-generi Wed Aug 18 09:26   still running

I forget exactly what the "crash" vs "still running" column means in last reboot, so I’m not sure how to interpret this, or if there is any diagnostic info here.

So, if it is thermals, I don’t think the system is logging it.

My Questions + Summary:

So, my machine is turning off, and I’m not sure if its thermals, power, or something else. In an effort to mitigate thermal issues, I installed an additional 4 fans in my case, which how has 2 intake fans in the lower front, 1 intake at the bottom front, 2 outakes towards the top back, and 1 outake in the back. The NH-d15 has both fans installed on it (I double checked the orientations).

  • Are there other logs I can check to debug thermal issues?
  • Was I a fool to use air cooling, could this just be temperature fluctuations that would be mitigated by a AOI CPU water cooler?
  • Is there any other hypothesis that I’m not considering?

My next steps:

I’m not completely out of debugging options on my own, but I’ve been dealing with this problem for awhile now, and I wanted to (1) summarize the list of steps I’ve taken and (2) see if anyone has any insight that I’m not considering. I’m getting close to the end of my leads, and I’ve bought a bunch of additional hardware that might not even be addressing the issue, so I want to get input before I try the AOI (which is the next real debugging step on my list)

  • I need to ask my landlord about the eletricity setup in the appartment. Not exactly sure what I’ll be able to get from this, but I’d like to verify that the wall circuit should be able to handle the load I’m putting on it.
  • I’ll attempt another ML run tonight with BOINC completely disabled. Of course that won’t really narrow anything down, because it will lower both power and thermal usage, but if the machine does or does not shut down multiple nights in a row while BOINC is disabled and I’m running ML workflows, then I suppose that is some information.
  • I’m waiting for my new surge protector + power strip to arrive. I doubt that will fix anything, but my cheapo strip is probably in need of an upgrade anyway.

Update 2021-08-23

I’ve run an experiment several times where I run the ML workload + BOINC while having the case side panel opened. This results in lower thermals, but the system still experiences the unexpected shutdowns:

Here is a graph of thermals from my last run (where I record a temperature every 30 seconds instead of every 5 minutes as in the previous graph):

enter image description here

This graph is showing that the thermals were nowhere near critical ranges when the shutdown happened. This is the 3rd time I’ve done the experiment, so I’m starting to think the thermal hypothesis might be wrong.

For reference, this plot shows the temperature readings from the aformentioned experiments. I’ve drawn black lines where a shutdown occurred:

enter image description here

Additionally, I’ve noted the shutdown happening when nothing else on the circuit was plugged in, so my machine should have had all 1800 Watts to itself, and it can only realisticly use roughly half of that, so the wattage hypothesis is out.

This leaves me stumped.

Update 2021-09-19

I’m still getting the issue, but only when I’m running my machine-learning scripts. The shutdown does not occur if I run ethermine and BOINC all night, but if I shut everything else down and try to train a neural network, it will eventually hard shutdown.

I watched my KillOWatt wall meter as this was happening and the wattage was around 690W when it powered off. The temperatures were nowhere near critical.

If I target my ML workload at my 1080ti instead of my RTX3090 it does not occur.

I’m now thinking either:

  • The MSI 3090 I is forcing the machine to shutdown somehow
  • The EVGA 1600W PSU is faulty and thinks it is overdrawing (even though measurements show it is nowhere even close to doing so) and causing the shutdown
  • There is a MOBO issue
  • A myriad of any other hardware issue

Still trying to figure out next steps: maybe MSI or EVGA support? The GPU does draw a lot of power, but it is well within the PSU limit, so idk what’s causing the issue.

Update 2021-09-25

The issue is still happening. I’ve recorded 3 videos that display exactly what happens:

https://www.youtube.com/watch?v=Ue4XHcusqto

In this video the draw was 191W and the GPU temp was 65C. The CPU was at 76% load. The higest core temp was 87C.

https://www.youtube.com/watch?v=LPwaI1SRlXk

In this video, I underclocked the GPU with GWE (green with envy) to no avail. I downclocked the memory -200MHz on the GPU and -100Mhz on the Mem, and I’ve limited power to 200W. The draw from the GPU was: 138.19W and the temp was 41C at shutdown. The CPU was at 79% load, the highest core temp was 87C.

https://www.youtube.com/watch?v=yQ7i-8Kp6xg

In the last video the underclock is disabled. the CPU was at 78% and the higest core temp was 91C. The GPU was at 225W and 52C.

I don’t know how the wattage could be the issue because I can run ethermine for days without a problem (and it draws 340W and runs at 68C), and there is no issue with that. I also ran stress -c 16 to max out my CPU usage and that didn’t cause an issue either. I don’t know what it is about my pytorch workflow that is causing my system to hard-shutdown. But I’m desperate for answers.


Get this bounty!!!

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.