I have a server running Ubuntu 18.04 which has been experiencing huge CPU spikes that nearly bring Apache to a halt once or twice a day. The server runs a couple of websites – all PHP & MySQL driven applications. Here’s a bit more detail on things I’ve looked into:
MySQL: Slow query log is enabled and set to log queries taking longer than 1 second. Reviewing this log after a spike reveals nothing in particular. No long-running queries to speak of.
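For reference, the slow-log setup described above corresponds to something like the following in my.cnf (the log file path is assumed; only the 1-second threshold is from my actual config):

```
[mysqld]
slow_query_log      = 1
slow_query_log_file = /var/log/mysql/mysql-slow.log
long_query_time     = 1
```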
CRON: I’ve reviewed all user cron jobs running on the server and there’s nothing that happens during the times when these spikes occur. There are only a couple of CPU intensive jobs and they run around 3am and take approx 5 minutes to complete.
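To double-check the schedule overlap, I filtered crontab entries whose hour field could fire during the spike window. This is a rough sketch (the 8–9am window and the script names in the sample input are hypothetical, not my real jobs):

```shell
# Print crontab-format lines whose hour field (field 2) could fire
# between 08:00 and 09:59: either '*' or an explicit 8 or 9.
filter_crons() {
  awk '$1 ~ /^[0-9*]/ && ($2 == "*" || $2 ~ /(^|,)8(,|$)/ || $2 ~ /(^|,)9(,|$)/)'
}

# Synthetic example input (not my actual crontab):
printf '%s\n' \
  '0 3 * * * /usr/local/bin/backup.sh' \
  '*/5 * * * * /usr/local/bin/healthcheck.sh' \
  '30 8 * * 1 /usr/local/bin/report.sh' | filter_crons
```

Run against every user’s crontab plus /etc/cron.d, this turned up nothing scheduled in the spike window.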
PHP: max_execution_time is set to 60 seconds and memory_limit to 64M (this is a 16GB server which typically doesn’t come close to maxing out memory usage).
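In php.ini terms, the two limits as configured on this server:

```
max_execution_time = 60
memory_limit = 64M
```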
APACHE: Our host (Linode) has a tool called Longview which shows various diagnostics related to Apache. Despite the huge spike in resources consumed, requests seem to be happening at a normal rate. Manually inspecting the access logs confirms this. Here’s a screenshot of the Apache tab in Longview showing a spike in Workers, CPU and RAM this morning – as well as a relatively normal rate of Requests:
I’ve also added flags in the Apache access logs to show time and I/O data about each request. The end of the LogFormat is
time:%T input:%I output:%O. None of the request or response sizes are unusually large (1MB might be the largest response I saw, and that was for an image). The only thing standing out is the "time taken to serve the request" value – the %T flag. At a certain point in the morning, many seemingly normal requests take 5–10 minutes to complete for no apparent reason.
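This is how I pulled the slow requests out of the access log. The sample log lines below are synthetic (my real LogFormat has more fields before the time:/input:/output: tail), but the filter scans for the time: field by name, so the position in the line doesn’t matter:

```shell
# Filter access-log lines (stdin) whose time:%T value exceeds a
# threshold in seconds. Finds the time: field by scanning each line.
slow_requests() {
  awk -v limit="${1:-60}" '{
    for (i = 1; i <= NF; i++)
      if ($i ~ /^time:[0-9]+$/) {
        split($i, t, ":")
        if (t[2] + 0 > limit) print
      }
  }'
}

# Usage with synthetic log lines (not real entries from my server):
printf '%s\n' \
  'GET /index.php 200 time:2 input:412 output:8120' \
  'GET /report.php 200 time:431 input:398 output:5220' | slow_requests 60
```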
I’m completely stumped at this point. Where can I go from here to diagnose the event that’s triggering this?