We’re trying to diagnose, from inside the VM, why a virtual machine appears to be running slowly.

The situation:

  • We have an IIS-hosted application on Windows Server 2008 R2 running on a 6-core, 12 GB virtual machine.
  • We have a database running on a SQL Server 2008 R2 SP3 cluster (16+ cores, >16 GB RAM).
  • The application is processing a queue of messages against this database. Processing consists mostly of query/fetch and update, maybe a dozen or so roundtrips per message. A limited number of threads (3) are assigned to this workload, which is synchronous, so the threads do block on database responses.
  • The database is apparently lightly loaded: only a few percent of maximum CPU load.
  • To our knowledge, both the database and the VM host are in the same datacentre.

The database reports considerable time spent waiting on async network I/O (the ASYNC_NETWORK_IO wait type), i.e. waiting for the application to consume data.
The application VM is also apparently lightly loaded: ~20% CPU.
The infrastructure is not owned by us. Our only access is via RDP to the virtual machine and SQL Server Management Studio to the database cluster. We do not have sufficient privileges to run a profiler, though we do record performance counters for both the database and the VM.

A few weeks ago, the message processing rate abruptly dropped by 70-80%. As far as we are aware, nothing had changed: the application had not been modified or reconfigured, and the performance counters don’t indicate any change in load characteristics. The owners of the infrastructure have stated that nothing has changed at their end.

As part of a restart process, the application was made to reload its message queue. This involves one simple SELECT of a few hundred thousand rows, which are then read into in-memory structures. The database serviced the SELECT in a few seconds but then waited ~10 minutes for the application to read the result set. This is a single-threaded operation involving very simple deserialisation, and should not take more than a couple of minutes on this hardware.

My current theory is that network latency has increased in some way, but ping only reports ‘<1ms’ and we don’t have a baseline in any case. hrPing reports times between 0.5 and 2ms from the application server to the database.
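
To sanity-check the latency theory, here is a minimal back-of-the-envelope sketch in Python. It assumes exactly 12 round trips per message (our "dozen or so"), 3 synchronous threads, and, as a deliberate simplification, no time spent on anything other than network waits unless `work_ms` is supplied. The function name and the `work_ms` parameter are illustrative only, and the results are upper bounds, not measurements.

```python
# Rough upper bound on message throughput when every database round trip
# blocks a worker thread. All figures are assumptions taken from the
# description above, not measurements.

def max_messages_per_second(threads, roundtrips_per_message, rtt_ms, work_ms=0.0):
    """Upper bound on messages/second for synchronous workers.

    work_ms is a hypothetical per-message cost outside network waits
    (query execution plus application processing); it defaults to zero.
    """
    per_message_ms = roundtrips_per_message * rtt_ms + work_ms
    return threads * 1000.0 / per_message_ms


if __name__ == "__main__":
    # 0.5 ms and 2 ms are the extremes hrPing reported; 1 ms is a midpoint.
    for rtt_ms in (0.5, 1.0, 2.0):
        rate = max_messages_per_second(
            threads=3, roundtrips_per_message=12, rtt_ms=rtt_ms
        )
        print(f"RTT {rtt_ms} ms -> at most {rate:.0f} messages/s")
```

Under these assumptions the bound falls from 500 to 125 messages/s as the round-trip time moves from 0.5 ms to 2 ms, a 75% reduction. So a shift within the range hrPing already reports could, in principle, account for a drop of the size we saw, provided round-trip time dominates the per-message cost. Without a baseline we cannot confirm that it does.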

Another possibility is that the VM’s real CPU capacity has in some way decreased, but I’d expect that to manifest as increased ‘apparent’ load.
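
One way we could probe this without profiler access is to time a fixed, purely CPU-bound workload and compare the result with the same script run on hardware we trust. The sketch below assumes a Python interpreter can be installed on the VM, which may not be permitted in our case. The workload, function names and iteration counts are arbitrary choices for illustration. It only shows how long a fixed amount of work takes; it cannot say why, and guest-side CPU counters may not reflect time withheld by the hypervisor.

```python
import time


def fixed_work(iterations=2_000_000):
    """Run a deterministic CPU-bound loop and return its checksum."""
    total = 0
    for i in range(iterations):
        total = (total + i * i) % 1_000_003
    return total


def time_fixed_work(repeats=5, iterations=2_000_000):
    """Return the wall-clock seconds taken by each repeat of the workload."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fixed_work(iterations)
        timings.append(time.perf_counter() - start)
    return timings


if __name__ == "__main__":
    timings = time_fixed_work()
    # A large gap between best and worst, or a best time far slower than on
    # comparable hardware, would be consistent with reduced real CPU capacity.
    print("best: %.3f s, worst: %.3f s" % (min(timings), max(timings)))
```

Running this at intervals over a day and logging the timings would at least give us the baseline we currently lack.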

Are there any other avenues of investigation available to us?

