#StackBounty: #windows-10 #cpu #performance #hyper-threading How does the Windows 10 scheduler deal with Hyper Threading since Core Par…

Bounty: 50

I’m running Windows 10 (1607) on an Intel Xeon E3-1231v3 CPU (Haswell, 4 physical cores, 8 logical cores).

When I first had Windows 7 installed on this machine, I could observe that four out of eight logical cores were parked until an application needed more than 4 threads. One can check with Windows resource monitor whether cores are parked or not (example).
As far as I understand, this is an important technique to keep the threads balanced across physical cores, as explained on the Microsoft website: “the Core Parking algorithm and infrastructure is also used to balance processor performance between logical processors on Windows 7 client systems with processors that include Intel Hyper-Threading Technology.

However after upgrading to Windows 10, I noticed that there is no core parking. All logical cores are active all the time and when you run an application using less than four threads you can see how the scheduler equally distributes them across all logical cpu cores. Microsoft employees have confirmed that Core Parking is disabled in Windows 10.

But I wonder why? What was the reason for this? Is there a replacement and if yes, how does it look like? Has Microsoft implemented a new scheduler strategy that made core parking obsolete?


Appendix:

Here is an example on how core parking introduced in Windows 7 can benefit performance (in comparison to Vista which didn’t have core parking feature yet). What you can see is that on Vista, HT (Hyper Threading) harms performance while on Windows 7 it doesn’t:

enter image description here

enter image description here

(source)

I tried to enable Core Parking as mentioned here, but what I observed was that the Core Parking algorithm isn’t Hyper Threading aware anymore. It parked cores 4,5,6,7, while it should have parked core 1,3,5,7 to avoid that threads are assigned to the same physical core. Windows enumerates cores in such a way that two successive indices belong to the same physical core. Very strange. It seems Microsoft has messed this up fundamentally. And no one noticed…

Furthermore, I did some CPU benchmarks using exactly 4 threads.

CPU affinity set to all cores (Windows defualt):

Average running time: 17.094498, standard deviation: 2.472625

CPU affinity set to every other core (so that it runs on different physical cores, best possible scheduling):

Average running time: 15.014045, standard deviation: 1.302473

CPU affinity set to the worst possible scheduling (four logical cores on two physical cores):

Average running time: 20.811493, standard deviation: 1.405621

So there is a performance difference. And you can see that the Windows defualt scheduling ranks between the best and worst possible scheduling, as we would expect it to happen with a non-hyperthreading aware scheduler. However, as pointed out in the comments, there may be other causes responsible for this, like fewer context switches, inference by monitoring applications, etc. So we still don’t have a definitive answer here.

Source code for my benchmark:

#include <stdlib.h>
#include <Windows.h>
#include <math.h>

double runBenchmark(int num_cores) {
  int size = 1000;
  double** source = new double*[size];
  for (int x = 0; x < size; x++) {
    source[x] = new double[size];
  }
  double** target = new double*[size * 2];
  for (int x = 0; x < size * 2; x++) {
    target[x] = new double[size * 2];
  }
  #pragma omp parallel for num_threads(num_cores)
  for (int x = 0; x < size; x++) {
    for (int y = 0; y < size; y++) {
      source[y][x] = rand();
    }
  }
  #pragma omp parallel for num_threads(num_cores)
  for (int x = 0; x < size-1; x++) {
    for (int y = 0; y < size-1; y++) {
      target[x * 2][y * 2] = 0.25 * (source[x][y] + source[x + 1][y] + source[x][y + 1] + source[x + 1][y + 1]);
    }
  }
  double result = target[rand() % size][rand() % size];
  for (int x = 0; x < size * 2; x++) delete[] target[x];
  for (int x = 0; x < size; x++) delete[] source[x];
  delete[] target;
  delete[] source;
  return result;
}

int main(int argc, char** argv)
{
  int num_cores = 4;
  system("pause");  // So we can set cpu affinity before the benchmark starts 
  const int iters = 1000;
  double avgElapsedTime = 0.0;
  double elapsedTimes[iters];
  for (int i = 0; i < iters; i++) {
    LARGE_INTEGER frequency;
    LARGE_INTEGER t1, t2;
    QueryPerformanceFrequency(&frequency);
    QueryPerformanceCounter(&t1);
    runBenchmark(num_cores);
    QueryPerformanceCounter(&t2);
    elapsedTimes[i] = (t2.QuadPart - t1.QuadPart) * 1000.0 / frequency.QuadPart;
    avgElapsedTime += elapsedTimes[i];
  }
  avgElapsedTime = avgElapsedTime / iters;
  double variance = 0;
  for (int i = 0; i < iters; i++) {
    variance += (elapsedTimes[i] - avgElapsedTime) * (elapsedTimes[i] - avgElapsedTime);
  }
  variance = sqrt(variance / iters);
  printf("Average running time: %f, standard deviation: %f", avgElapsedTime, variance);
  return 0;
}


Get this bounty!!!