I am trying to learn Bayesian methods, and to that end, I picked up an application of interest to me to develop the concepts in practice.
Suppose I wrote an initial version of a performance-sensitive piece of software, and want to optimize its execution time. I may have a baseline version and an "improved" version (or at least, I suspect it may be an improvement — I need to measure).
I’m looking to quantify how likely it is that this new version is actually be an improvement (as opposed to being equivalent or possibly even worse than the baseline), as well as how much — is it 20% faster? 100% faster? 10% slower? Also I’d like to give credible intervals rather than only point estimates of the speedup.
To that end, I time a number of runs of the two version of the software, trying to keep all other factors the same (input data, hardware, OS, etc.) I also try to kill every running app and service, and even turn off networking, to make sure that, to the extent possible by modern feature-heavy code, these apps have the CPU all to themselves. I also disable Turbo Boost on my CPU to prevent CPU clock rate changes over time and temperature, and run my fans at maximum to minimize the change of CPU thermal throttling (and in practice my computer’s thermal solution is good enough that I’ve never seen this happen). I’ve tried to restrict the portion of the code being measured to the computational part only, so no I/O to add variability.
Despite my best efforts, this is not an embedded system with a single-core processor running on bare metal, so there is some variability, possibly due to OS processes that remain and take up a little bit of CPU, CPU affinity of processes, as well as microarchitectural sources of variability such as caches, out of order execution and hyperthreading.
Current model and code
Currently I’m using the BEST model, implemented by the following code in Python using PyMC3 (heavily inspired by the linked document), in case it is of interest. The arguments are timings of the baseline version (
baseline) and improved version (
def statistical_analysis(baseline, opt): # Inspired by https://docs.pymc.io/notebooks/BEST.html y = pd.DataFrame( dict( value=np.r_[baseline, opt], group=np.r_[['baseline']*len(baseline), ['opt']*len(opt)] ) ) μ_m = y.value.mean() μ_s = y.value.std() σ_low = µ_s/1000 σ_high = µ_s*1000 with pm.Model() as model: baseline_mean = pm.Normal('baseline_mean', mu=μ_m, sd=1000*μ_s) opt_mean = pm.Normal('opt_mean', mu=μ_m, sd=1000*μ_s) baseline_std = pm.Uniform('baseline_std', lower=µ_s/1000, upper=1000*µ_s) opt_std = pm.Uniform('opt_std', lower=µ_s/1000, upper=1000*µ_s) ν = pm.Exponential('ν_minus_one', 1/29.) + 1 λ_baseline = baseline_std**-2 λ_opt = opt_std**-2 dist_baseline = pm.StudentT('baseline', nu=ν, mu=baseline_mean, lam=λ_baseline, observed=baseline) dist_opt = pm.StudentT('opt', nu=ν, mu=opt_mean, lam=λ_opt, observed=opt) diff_of_means = pm.Deterministic('difference of means', baseline_mean - opt_mean) ratio_of_means = pm.Deterministic('ratio of means', baseline_mean/opt_mean) trace = pm.sample(draws=3000,tune=2000) baseline_hdi = az.hdi(trace['baseline_mean']) baseline_out = (baseline_hdi, trace['baseline_mean'].mean(), baseline_hdi) opt_hdi = az.hdi(trace['opt_mean']) opt_out = (opt_hdi, trace['opt_mean'].mean(), opt_hdi) speedup_hdi = az.hdi(trace['ratio of means']) speedup = (speedup_hdi, trace['ratio of means'].mean(), speedup_hdi) dif = trace['difference of means'] > 0 prob = (dif > 0).sum()/len(dif) return (baseline_out, opt_out, speedup, prob)
prob variable indicates how likely it is that a difference exists, and
speedup includes the mean as well as 95% HDI for the ratio of execution time of the baseline version to the improved version. The remaining variables are the mean as well as 95% HDI of the execution time of baseline and improved versions.
Issues with the model
The BEST model assumes a Student t-distribution for the values of the execution time, but I have a hunch this is not an adequate modeling assumption.
Given a certain piece of code, one could in principle tally up every single instruction executed, and figure out exactly how fast an "undisturbed" CPU could run it, given the amount of execution resources like ALUs and load/store units, the latency of each instruction, etc. Therefore, there exists a minimum value, bounded by the CPU hardware capabilities, such that the code will never run faster than this. We cannot measure this minimum, though, because the measurements are contaminated by the sources of noise previously mentioned.
Thus, I’d like to think that my model should be the sum of a constant value (the minimum) and some distribution with positive values only, and probably a heavy tailed one, seeing as some outlier event may happen during the execution of the code (the system decides to update an app, or run a backup, or whatever).
Edit: some data
To give an idea of the kind of distribution that may be found in practice, I measured 5000 executions of the serial and a parallel version of the same code, for the same input data, and generated histograms for both, with 250 bins each. I’m not claiming this is necessarily representative, but it shows how inadequate the Student t-distribution is for this problem.
First, the serial version:
And now for the parallel version:
This leads me to the question:
What are some distributions that might be a good fit to this model?