#StackBounty: #google-cloud-platform #kubernetes #google-kubernetes-engine Google Cloud Kuberbetes run-away systemd 100% CPU usage

Bounty: 50

Last week, after upgrading our GKE cluster to Kubernetes 1.13.6-gke.13, some of our nodes started to fail due to high CPU usage. Pods on these nodes would be CPU starved, work poorly and get killed due to failing liveness checks.

This is what top shows when we SSH into a problem node:

top - 10:11:27 up 5 days, 22 min,  1 user,  load average: 23.71, 21.90, 20.32
Tasks: 858 total,   3 running, 854 sleeping,   0 stopped,   1 zombie
%Cpu(s): 33.5 us, 30.9 sy,  0.0 ni, 35.5 id,  0.0 wa,  0.0 hi,  0.1 si,  0.0 st
MiB Mem :  30157.0 total,  14125.6 free,   4771.2 used,  11260.1 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  24762.7 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
      1 root      20   0  114636  22564   4928 R  65.2   0.1   5254:53 systemd
    356 message+  20   0  110104  98784   2916 S  33.8   0.3   2705:49 dbus-daemon
   1236 root      20   0 3174876 147328  57464 S  22.0   0.5   1541:29 kubelet
    366 root      20   0   12020   3864   3396 S  21.6   0.0   1705:21 systemd-logind
   1161 root      20   0 5371428 586832  26756 S  18.7   1.9   1848:18 dockerd
    432 root      20   0 5585144  49040  13420 S  11.1   0.2 848:54.06 containerd
  23797 root      20   0  994620   8784   6088 S   3.0   0.0  96:58.79 containerd-shim
  45359 root      20   0  994620   8968   5600 S   3.0   0.0 100:28.46 containerd-shim
  35913 root      20   0 1068352   8192   5600 S   2.3   0.0 104:54.12 containerd-shim
  10806 root      20   0  994620   8908   5596 S   2.0   0.0 102:57.45 containerd-shim
  15378 root      20   0  994620   9084   5600 S   2.0   0.0 102:24.08 containerd-shim
  33141 root      20   0  994620   8856   5848 S   2.0   0.0  95:26.89 containerd-shim
  34299 root      20   0  994620   8824   5848 S   2.0   0.0  90:55.28 containerd-shim
  48411 root      20   0  994620   9488   6216 S   2.0   0.0  95:38.56 containerd-shim
1824641 root      20   0 1068352   6836   5716 S   2.0   0.0  65:45.81 containerd-shim
  10257 root      20   0  994620   9404   5596 S   1.6   0.0 101:10.45 containerd-shim
  15400 root      20   0 1068352   8916   6216 S   1.6   0.0  97:47.99 containerd-shim
  22895 root      20   0 1068352   8408   5472 S   1.6   0.0 102:55.97 containerd-shim
  29969 root      20   0  994620   9124   5600 S   1.6   0.0  98:32.32 containerd-shim
  34720 root      20   0  994620   8444   5472 S   1.6   0.0  97:23.98 containerd-shim
  10073 root      20   0 1068352   9796   6152 S   1.3   0.0 100:54.30 containerd-shim

To attempt to resolve the issue we recreated all the nodes. We created a new pool with equivalent resources and migrated all pods over by scaling down the old pool to 0 nodes. (This was difficult because at least 2 of our previous nodes failed to shut down, even after a long time. In the end we had to shut down the underlying VMs to kill those nodes.) At first this seemed to help, node CPU usage and load averages were low, but then the problem returned.

Next we created yet another pool, this time with twice as much CPU. It didn’t help. Some nodes still had extremely high CPU usage for systemd, dbus-daemon, kubelet etc.

We kept that setup though. Now we are running with tons of extra CPU per node and, although expensive, that masks the problem (there’s enough CPU to also run actual pods in addition to the problematic system services).

How do we find out what’s actually wrong here?


Snippet from journalctl -u kubelet:

Jul 04 05:49:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 05:49:34.849808    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 05:54:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 05:54:34.850598    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 05:59:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 05:59:34.851797    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 06:04:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 06:04:34.858344    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 06:09:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 06:09:34.859626    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 06:14:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 06:14:34.861142    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 06:19:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 06:19:34.862185    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 06:24:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 06:24:34.863160    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 06:29:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 06:29:34.864156    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 06:34:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 06:34:34.865041    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 06:39:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 06:39:34.866044    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 06:44:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 06:44:34.866969    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 06:49:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 06:49:34.867979    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 06:54:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 06:54:34.869429    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 06:59:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 06:59:34.870359    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 07:04:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 07:04:34.871190    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 07:09:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 07:09:34.872063    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 07:14:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 07:14:34.873240    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 07:19:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 07:19:34.874123    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 07:24:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 07:24:34.875010    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 07:29:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 07:29:34.876612    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 07:34:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 07:34:34.877420    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 07:39:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 07:39:34.878368    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 07:44:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 07:44:34.879221    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 07:49:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 07:49:34.880239    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 07:54:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 07:54:34.881172    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 07:59:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 07:59:34.881868    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 08:04:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 08:04:34.882653    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 08:09:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 08:09:34.883432    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 08:14:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 08:14:34.884175    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 08:19:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 08:19:34.885043    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 08:24:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 08:24:34.885845    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Jul 04 08:29:34 gke-cluster0-pool-1-9c29a343-vqtq kubelet[1236]: I0704 08:29:34.886690    1236 container_manager_linux.go:434] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service

Output from strace -c -p `pidof systemd`:

strace: Process 1 attached
strace: Process 1 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 24.20    0.109101           5     22827           lstat
 17.45    0.078654           9      8610           sendmsg
 15.51    0.069914           7     10549           read
 10.96    0.049406           3     17310           getpid
  6.45    0.029075           8      3806           openat
  6.08    0.027385           7      3783           readlinkat
  5.76    0.025945           3      7579           fstat
  4.47    0.020167           4      5700           getrandom
  3.62    0.016301           4      3806           close
  3.14    0.014133           7      1892           access
  2.28    0.010278           5      1924        11 stat
  0.03    0.000145           4        33           epoll_wait
  0.02    0.000089           4        22           readlink
  0.01    0.000029           3        11           prctl
  0.01    0.000029           3        11           clock_gettime
  0.01    0.000027           2        11           getgid
  0.01    0.000026           2        11           geteuid
  0.00    0.000020           2        11           getuid
  0.00    0.000020           2        11           getegid
------ ----------- ----------- --------- --------- ----------------
100.00    0.450744                 87907        11 total

Since dbus is very active, I took a look at that too: dbus-monitor --system --profile | head -n 20

dbus-monitor: unable to enable new-style monitoring: org.freedesktop.DBus.Error.AccessDenied: "Rejected send message, 1 matched rules; type="method_call", sender=":1.165" (uid=5004 pid=769854 comm="dbus-monitor --system --profile ") interface="org.freedesktop.DBus.Monitoring" member="BecomeMonitor" error name="(unset)" requested_reply="0" destination="org.freedesktop.DBus" (bus)". Falling back to eavesdropping.
#type    timestamp    serial    sender    destination    path    interface    member
#                    in_reply_to
sig    1562263380.765023    2    org.freedesktop.DBus    :1.165    /org/freedesktop/DBus    org.freedesktop.DBus    NameAcquired
sig    1562263380.953812    132870362    :1.157    <none>    /org/freedesktop/systemd1/unit/var_2dlib_2ddocker_2doverlay2_2d5d1be6e08bfe7552d6f9ee50a943eca88dd0dd749ec248594aa0be91879a2cdb_2dmerged_2emount    org.freedesktop.DBus.Properties    PropertiesChanged
sig    1562263380.957890    132870363    :1.157    <none>    /org/freedesktop/systemd1/unit/var_2dlib_2ddocker_2doverlay2_2d5d1be6e08bfe7552d6f9ee50a943eca88dd0dd749ec248594aa0be91879a2cdb_2dmerged_2emount    org.freedesktop.DBus.Properties    PropertiesChanged
sig    1562263380.957912    132870364    :1.157    <none>    /org/freedesktop/systemd1/unit/var_2dlib_2ddocker_2doverlay2_2d539cd8b8367913182e438aa1b3b05714c8f3f131e20bcadabdeb850c375a8125_2dmerged_2emount    org.freedesktop.DBus.Properties    PropertiesChanged
sig    1562263380.957918    132870365    :1.157    <none>    /org/freedesktop/systemd1/unit/var_2dlib_2ddocker_2doverlay2_2d539cd8b8367913182e438aa1b3b05714c8f3f131e20bcadabdeb850c375a8125_2dmerged_2emount    org.freedesktop.DBus.Properties    PropertiesChanged
sig    1562263380.957923    132870366    :1.157    <none>    /org/freedesktop/systemd1/unit/var_2dlib_2ddocker_2dcontainers_2d4c04ee8d1bf693ff2c9300b198b2b47bbf2c240265af5b9edc1f07b6cbd0c1ce_2dmounts_2dshm_2emount    org.freedesktop.DBus.Properties    PropertiesChanged
sig    1562263380.958014    132870367    :1.157    <none>    /org/freedesktop/systemd1/unit/var_2dlib_2ddocker_2dcontainers_2d4c04ee8d1bf693ff2c9300b198b2b47bbf2c240265af5b9edc1f07b6cbd0c1ce_2dmounts_2dshm_2emount    org.freedesktop.DBus.Properties    PropertiesChanged
sig    1562263380.958020    132870368    :1.157    <none>    /org/freedesktop/systemd1/unit/var_2dlib_2ddocker_2doverlay2_2d872dd2c0a63a9f3b5c9c5e4972e06fcf89d28b4c7f59aea7ea4d38f5a6bf0db6_2dmerged_2emount    org.freedesktop.DBus.Properties    PropertiesChanged
sig    1562263380.958025    132870369    :1.157    <none>    /org/freedesktop/systemd1/unit/var_2dlib_2ddocker_2doverlay2_2d872dd2c0a63a9f3b5c9c5e4972e06fcf89d28b4c7f59aea7ea4d38f5a6bf0db6_2dmerged_2emount    org.freedesktop.DBus.Properties    PropertiesChanged
sig    1562263380.958029    132870370    :1.157    <none>    /org/freedesktop/systemd1/unit/home_2dkubernetes_2dcontainerized_5fmounter_2drootfs_2dvar_2dlib_2dkubelet_2dpods_2d2f4e6eae_5cx2d9e51_5cx2d11e9_5cx2db4ee_5cx2d42010a80000f_2dvolumes_2dkubernetes_2eio_5cx7esecret_2dfluentd_5cx2dgcp_5cx2dtoken_5cx2dzfrkb_2emount    org.freedesktop.DBus.Properties    PropertiesChanged
sig    1562263380.958111    132870371    :1.157    <none>    /org/freedesktop/systemd1/unit/home_2dkubernetes_2dcontainerized_5fmounter_2drootfs_2dvar_2dlib_2dkubelet_2dpods_2d2f4e6eae_5cx2d9e51_5cx2d11e9_5cx2db4ee_5cx2d42010a80000f_2dvolumes_2dkubernetes_2eio_5cx7esecret_2dfluentd_5cx2dgcp_5cx2dtoken_5cx2dzfrkb_2emount    org.freedesktop.DBus.Properties    PropertiesChanged
sig    1562263380.958117    132870372    :1.157    <none>    /org/freedesktop/systemd1/unit/var_2dlib_2dkubelet_2dpods_2d2f4e6eae_5cx2d9e51_5cx2d11e9_5cx2db4ee_5cx2d42010a80000f_2dvolumes_2dkubernetes_2eio_5cx7esecret_2dfluentd_5cx2dgcp_5cx2dtoken_5cx2dzfrkb_2emount    org.freedesktop.DBus.Properties    PropertiesChanged
sig    1562263380.958121    132870373    :1.157    <none>    /org/freedesktop/systemd1/unit/var_2dlib_2dkubelet_2dpods_2d2f4e6eae_5cx2d9e51_5cx2d11e9_5cx2db4ee_5cx2d42010a80000f_2dvolumes_2dkubernetes_2eio_5cx7esecret_2dfluentd_5cx2dgcp_5cx2dtoken_5cx2dzfrkb_2emount    org.freedesktop.DBus.Properties    PropertiesChanged
sig    1562263380.958126    132870374    :1.157    <none>    /org/freedesktop/systemd1/unit/var_2dlib_2ddocker_2doverlay2_2d667d713fe45ea0af609c85fbfd3aafbca9494574fe10509bda8cd3d13d1e6b66_2dmerged_2emount    org.freedesktop.DBus.Properties    PropertiesChanged
sig    1562263380.958223    132870375    :1.157    <none>    /org/freedesktop/systemd1/unit/var_2dlib_2ddocker_2doverlay2_2d667d713fe45ea0af609c85fbfd3aafbca9494574fe10509bda8cd3d13d1e6b66_2dmerged_2emount    org.freedesktop.DBus.Properties    PropertiesChanged
sig    1562263380.958229    132870376    :1.157    <none>    /org/freedesktop/systemd1/unit/var_2dlib_2ddocker_2doverlay2_2d0a9f899f23c0965b43cae81dd04f46a78e4b1b063fa5679323146b06932474a9_2dmerged_2emount    org.freedesktop.DBus.Properties    PropertiesChanged
sig    1562263380.958233    132870377    :1.157    <none>    /org/freedesktop/systemd1/unit/var_2dlib_2ddocker_2doverlay2_2d0a9f899f23c0965b43cae81dd04f46a78e4b1b063fa5679323146b06932474a9_2dmerged_2emount    org.freedesktop.DBus.Properties    PropertiesChanged
sig    1562263380.958238    132870378    :1.157    <none>    /org/freedesktop/systemd1/unit/var_2dlib_2ddocker_2doverlay2_2d875734c124694cd54a7c26516f740cedcb853a18ff5805a89669747c1188b65d_2dmerged_2emount    org.freedesktop.DBus.Properties    PropertiesChanged


Get this bounty!!!

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.