#StackBounty: #estimation #inference #standard-error Standard error of estimated covariance

Bounty: 50

Let $X_1,\dots,X_n$ and $Y_1,\dots,Y_n$ be two independent random samples from $\mathcal{N}(\mu, \sigma^2)$ where both $\mu$ and $\sigma$ are unknown parameters.

I estimate their covariance using:
$$\hat{\operatorname{cov}}(X, Y) = \operatorname{E}\big[(X_i - \operatorname{E}[X])(Y_i - \operatorname{E}[Y])\big]$$

replacing $\operatorname{E}[X]$ and $\operatorname{E}[Y]$ with the corresponding sample means.

How do I calculate the standard error of $\hat{\operatorname{cov}}(X, Y)$?
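For illustration, one generic way I could get such a standard error numerically (a sketch that resamples the $(X_i, Y_i)$ pairs used in the estimator above; not necessarily the closed-form expression I’m after) would be a bootstrap:

import numpy as np

def bootstrap_se_of_cov(x, y, n_boot=10000, seed=0):
    # Resample the (X_i, Y_i) pairs with replacement, recompute the sample
    # covariance each time, and take the SD of the replicates as an estimate
    # of the standard error of the covariance estimator.
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    covs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        covs[b] = np.cov(x[idx], y[idx])[0, 1]
    return covs.std(ddof=1)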


Get this bounty!!!

#StackBounty: #distributions #bayesian #modeling #inference Bayesian modeling of software execution time

Bounty: 100

Introduction

I am trying to learn Bayesian methods, and to that end, I picked up an application of interest to me to develop the concepts in practice.

Context

Suppose I wrote an initial version of a performance-sensitive piece of software, and want to optimize its execution time. I may have a baseline version and an "improved" version (or at least, I suspect it may be an improvement — I need to measure).

I’m looking to quantify how likely it is that this new version is actually an improvement (as opposed to being equivalent to, or possibly even worse than, the baseline), as well as by how much: is it 20% faster? 100% faster? 10% slower? I’d also like to give credible intervals rather than only point estimates of the speedup.

To that end, I time a number of runs of the two versions of the software, trying to keep all other factors the same (input data, hardware, OS, etc.). I also try to kill every running app and service, and even turn off networking, to make sure that, to the extent possible with modern feature-heavy code, these apps have the CPU to themselves. I also disable Turbo Boost on my CPU to prevent clock-rate changes over time and temperature, and run my fans at maximum to minimize the chance of CPU thermal throttling (and in practice my computer’s thermal solution is good enough that I’ve never seen this happen). I’ve tried to restrict the portion of the code being measured to the computational part only, so there is no I/O to add variability.

Despite my best efforts, this is not an embedded system with a single-core processor running on bare metal, so there is some variability, possibly due to OS processes that remain and take up a little bit of CPU, CPU affinity of processes, as well as microarchitectural sources of variability such as caches, out of order execution and hyperthreading.

Current model and code

Currently I’m using the BEST model, implemented by the following code in Python using PyMC3 (heavily inspired by the linked document), in case it is of interest. The arguments are timings of the baseline version (baseline) and improved version (opt):

import numpy as np
import pandas as pd
import pymc3 as pm
import arviz as az


def statistical_analysis(baseline, opt):
    # Inspired by https://docs.pymc.io/notebooks/BEST.html
    y = pd.DataFrame(
        dict(
            value=np.r_[baseline, opt],
            group=np.r_[['baseline']*len(baseline), ['opt']*len(opt)]
        )
    )

    # pooled mean and standard deviation, used to set broad priors
    μ_m = y.value.mean()
    μ_s = y.value.std()
    σ_low = μ_s/1000
    σ_high = μ_s*1000

    with pm.Model() as model:
        baseline_mean = pm.Normal('baseline_mean', mu=μ_m, sd=1000*μ_s)
        opt_mean = pm.Normal('opt_mean', mu=μ_m, sd=1000*μ_s)
        baseline_std = pm.Uniform('baseline_std', lower=σ_low, upper=σ_high)
        opt_std = pm.Uniform('opt_std', lower=σ_low, upper=σ_high)
        ν = pm.Exponential('ν_minus_one', 1/29.) + 1
        λ_baseline = baseline_std**-2
        λ_opt = opt_std**-2

        dist_baseline = pm.StudentT('baseline', nu=ν, mu=baseline_mean,
                                    lam=λ_baseline, observed=baseline)
        dist_opt = pm.StudentT('opt', nu=ν, mu=opt_mean,
                               lam=λ_opt, observed=opt)

        diff_of_means = pm.Deterministic('difference of means',
                                         baseline_mean - opt_mean)
        ratio_of_means = pm.Deterministic('ratio of means',
                                          baseline_mean/opt_mean)

        trace = pm.sample(draws=3000,tune=2000)

        baseline_hdi = az.hdi(trace['baseline_mean'])
        baseline_out = (baseline_hdi[0],
                        trace['baseline_mean'].mean(),
                        baseline_hdi[1])

        opt_hdi = az.hdi(trace['opt_mean'])
        opt_out = (opt_hdi[0], trace['opt_mean'].mean(), opt_hdi[1])

        speedup_hdi = az.hdi(trace['ratio of means'])
        speedup = (speedup_hdi[0],
                   trace['ratio of means'].mean(),
                   speedup_hdi[1])

        # posterior probability that the baseline mean exceeds the optimized mean
        prob = (trace['difference of means'] > 0).mean()

    return (baseline_out, opt_out, speedup, prob)

The prob variable indicates how likely it is that a difference exists (more precisely, the posterior probability that the baseline mean exceeds the optimized mean), and speedup contains the mean as well as the 95% HDI of the ratio of the execution time of the baseline version to that of the improved version. The remaining variables are the mean as well as the 95% HDI of the execution times of the baseline and improved versions.

Issues with the model

The BEST model assumes a Student t-distribution for the values of the execution time, but I have a hunch this is not an adequate modeling assumption.

Given a certain piece of code, one could in principle tally up every single instruction executed, and figure out exactly how fast an "undisturbed" CPU could run it, given the amount of execution resources like ALUs and load/store units, the latency of each instruction, etc. Therefore, there exists a minimum value, bounded by the CPU hardware capabilities, such that the code will never run faster than this. We cannot measure this minimum, though, because the measurements are contaminated by the sources of noise previously mentioned.

Thus, I’d like to think that my model should be the sum of a constant value (the minimum) and some distribution with positive values only, probably a heavy-tailed one, seeing as some outlier event may happen during the execution of the code (the system decides to update an app, or run a backup, or whatever).
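For illustration, here is a minimal sketch of what such a "minimum plus positive noise" model could look like in PyMC3, using the shifted Wald (inverse Gaussian) purely as a placeholder for the positive, skewed component; the priors and variable names are ad hoc assumptions, not part of the BEST model above:

def shifted_model_analysis(baseline, opt):
    # Sketch only: execution time = unknown per-version minimum (alpha shift)
    # plus positive, right-skewed noise. Any other strictly positive,
    # heavy-tailed likelihood could be slotted in the same way.
    # (Uses the numpy/pymc3 imports from the snippet above.)
    with pm.Model() as model:
        # the minimum must lie below every observed timing
        t_min_baseline = pm.Uniform('t_min_baseline', lower=0, upper=np.min(baseline))
        t_min_opt = pm.Uniform('t_min_opt', lower=0, upper=np.min(opt))

        # weakly informative priors on the noise parameters
        mu_baseline = pm.HalfNormal('mu_baseline', sd=10*np.std(baseline))
        mu_opt = pm.HalfNormal('mu_opt', sd=10*np.std(opt))
        lam_baseline = pm.HalfNormal('lam_baseline', sd=10*np.std(baseline))
        lam_opt = pm.HalfNormal('lam_opt', sd=10*np.std(opt))

        pm.Wald('baseline', mu=mu_baseline, lam=lam_baseline,
                alpha=t_min_baseline, observed=baseline)
        pm.Wald('opt', mu=mu_opt, lam=lam_opt,
                alpha=t_min_opt, observed=opt)

        # one possible definition of the speedup: ratio of the estimated minima
        pm.Deterministic('ratio of minima', t_min_baseline/t_min_opt)

        trace = pm.sample(draws=3000, tune=2000)
    return trace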

Edit: some data

To give an idea of the kind of distribution that may be found in practice, I measured 5000 executions of the serial version and of a parallel version of the same code, for the same input data, and generated histograms for both, with 250 bins each. I’m not claiming this is necessarily representative, but it shows how inadequate the Student t-distribution is for this problem.

First, the serial version:

Histogram of serial version

And now for the parallel version:

Histogram of parallel version

The question

This leads me to the question:

What are some distributions that might be a good fit to this model?


Get this bounty!!!

#StackBounty: #distributions #p-value #inference #data-transformation #controlling-for-a-variable Generate null distribution from pvalues

Bounty: 50

I have a set of experiments to which I apply Fisher’s exact test to statistically infer changes in cellular populations.
Some of the data are dummy experiments that model our control experiments and thereby describe the null model.

However, due to some experimental variation, most of the control experiments reject the null hypothesis at $p_{val} < 0.05$. Some of the null hypotheses of the actual experimental conditions are also rejected at $p_{val} < 0.05$; however, these p-values are orders of magnitude lower than those of my control conditions, which indicates a stronger effect of these experimental conditions. I am not aware of a proper method to quantify these changes and statistically infer them.

An example of what the data looks like:

ID      Pval            Condition
B0_W1   2.890032e-16    CTRL 
B0_W10  7.969311e-38    CTRL
B0_W11  8.078795e-25    CTRL    
B0_W12  2.430554e-80    TEST1 
B0_W2   3.149525e-30    TEST2
B1_W1   3.767914e-287   TEST3
B1_W10  3.489684e-56    TEST4
B1_W10  3.489684e-56    TEST5

One idea I had:

  1. Select the CTRL conditions and let $X = -\ln(p_{val})$, which will distribute the transformed data as an exponential distribution.
  2. Use MLE to find the $\lambda$ parameter of that exponential distribution. This will be my null distribution.
  3. Apply the same transformation to the rest of the $p_{val}$ that correspond to the test conditions.
  4. Use the cdf of the null distribution to get the new "adjusted p-values".

This will essentially give a new $\alpha$ threshold for the original p-values and transform the results accordingly using the null distribution’s cdf. Are these steps correct? Is using MLE to find the rate correct, or does it violate some of the assumptions needed to achieve my end goal? Any other approaches I could try?
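Mechanically, the steps would look something like this (a sketch using SciPy on the example rows above; whether the upper tail of the fitted exponential is the right quantity to report is part of what I’m asking):

import numpy as np
import pandas as pd
from scipy import stats

# the example rows from above
df = pd.DataFrame({
    'ID': ['B0_W1', 'B0_W10', 'B0_W11', 'B0_W12', 'B0_W2', 'B1_W1', 'B1_W10', 'B1_W10'],
    'Pval': [2.890032e-16, 7.969311e-38, 8.078795e-25, 2.430554e-80,
             3.149525e-30, 3.767914e-287, 3.489684e-56, 3.489684e-56],
    'Condition': ['CTRL', 'CTRL', 'CTRL', 'TEST1', 'TEST2', 'TEST3', 'TEST4', 'TEST5'],
})

# steps 1-2: X = -ln(p) for the CTRL conditions; the exponential-rate MLE is 1/mean(X)
x_ctrl = -np.log(df.loc[df.Condition == 'CTRL', 'Pval'].to_numpy())
lam_hat = 1.0 / x_ctrl.mean()

# steps 3-4: transform the test p-values and evaluate them against the fitted null,
# here via the upper tail P(X >= x), i.e. 1 - CDF of Exponential(lam_hat)
x_test = -np.log(df.loc[df.Condition != 'CTRL', 'Pval'].to_numpy())
adjusted_p = stats.expon.sf(x_test, scale=1.0/lam_hat)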


Get this bounty!!!

#StackBounty: #regression #inference #linear-model #matlab #standard-error How to best find standard error *across* linear regression f…

Bounty: 50

So I have a scenario with n = 8 subjects, each of which has a different source of noise in their respective observations $y$. For example, consider the following:

num_datasets = 8;

x = [1:20]';

% define matrix for the response for the 8 different datasets
Y = repmat(x, 1, num_datasets) * nan;

for i = 1:num_datasets
    Y(:,i) = 2*x + unifrnd(3,8)*randn(size(x));
end

So clearly each observation/subject has the same true slope but different amounts/sources of noise. Now, I know that the standard error for the linear regression fit has the form:

$\sigma\sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum_{i=1}^n (x_i - \bar{x})^2}}$

where $\sigma$ represents the standard deviation of the residuals of the fit, $n$ represents the number of samples in the observation (in my example above this would be 20, not 8), $(x^* - \bar{x})$ represents the distance of each $x_i$ sample from the mean (which is why the standard error increases hyperbolically as you deviate from the mean), and ${\sum_{i=1}^n (x_i - \bar{x})^2}$ is simply the sum of squared deviations of $x$ from its mean.

However, if I interpret this equation correctly, I think this gives the standard error across the dimension of $x$, and doesn’t directly tell me the standard error across subjects. In other words, I suspect it wouldn’t be a good idea to use this formula for each subject and then take the mean standard error (please correct me if I am wrong). So I have 2 questions:

  1. What would be the best way to calculate the standard error across subjects? Would it simply be to perform the fit for each subject, and then take the standard deviation of the fits? (See the sketch after this list.)
  2. What would the shape of the standard error of the fit look like, and what is the intuition behind that? Would it still be hyperbolic? I don’t think it would, but actually really not sure.
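For reference, here is a minimal sketch of the naive per-subject approach from question 1 (in Python/NumPy rather than MATLAB): fit a line per subject, then summarize the spread of the slopes across subjects.

import numpy as np

rng = np.random.default_rng(0)
n_subjects, n_samples = 8, 20
x = np.arange(1, n_samples + 1, dtype=float)

# simulate data analogous to the MATLAB snippet above: common true slope of 2,
# subject-specific noise scale drawn from Uniform(3, 8)
Y = np.column_stack([2*x + rng.uniform(3, 8)*rng.standard_normal(n_samples)
                     for _ in range(n_subjects)])

# fit a straight line per subject and keep the slope estimates
slopes = np.array([np.polyfit(x, Y[:, i], 1)[0] for i in range(n_subjects)])

# spread of the fitted slopes across subjects, and the corresponding
# standard error of the mean slope
slope_sd = slopes.std(ddof=1)
slope_se = slope_sd / np.sqrt(n_subjects)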


Get this bounty!!!

#StackBounty: #machine-learning #classification #inference #accuracy What is a better way to inference on chained machine learning mode…

Bounty: 50

I’m trying to figure out a better way to run inference with my audio gender classification models.

I have created 2 models: (1) model1 predicts adult/child, (2) model2 predicts male/female.

My current approach is:

  1. predict using model1, if “child”, done, else continue
  2. predict using model2, get “male” or “female”

If my model1 has an accuracy (on validation) of 85% and my model2 has an accuracy of 93%, then this is the expected accuracy for my approach:

  • child:
    $$P(\text{model1 correct}) = 0.85 = 85\%$$

  • male/female
    $$P(\text{model2 correct} \mid \text{model1 correct}) = \frac{0.93 \times 0.85}{0.85} = 0.93$$
    since the two models are independent though chained
    $$P(\text{model2 correct}) = P(\text{model2 correct} \mid \text{model1 correct})\,P(\text{model1 correct}) = 0.93 \times 0.85 = 0.79$$

Is the calculation above reasonable? In this way, however, my male/female confidence is much lower after adding the child class. Is there a better way I can use these 2 models for the 3-class classification?
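One alternative I’m considering (a sketch with made-up helper names, assuming both models can output class probabilities rather than hard labels) is to chain the probabilities instead of the decisions, which gives a proper 3-class distribution:

import numpy as np

def chained_probabilities(p_child, p_male_given_adult):
    # combine the two binary models multiplicatively instead of
    # hard-thresholding model1 before calling model2
    p_adult = 1.0 - p_child
    return np.array([
        p_child,                               # child
        p_adult * p_male_given_adult,          # male
        p_adult * (1.0 - p_male_given_adult),  # female
    ])

# example: model1 gives P(child) = 0.30, model2 gives P(male | adult) = 0.80
print(chained_probabilities(0.30, 0.80))  # [0.3, 0.56, 0.14], sums to 1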


Get this bounty!!!

#StackBounty: #multivariate-analysis #inference #multilevel-analysis #proportion #analysis Comparing changes in a response variable ove…

Bounty: 50

Say you have a variable, like average house cost. You have values of this variable at multiple sites in a city, and want to see how it fluctuates with respect to the surrounding business-type proportions (multiple classes). For example, at one site the average home cost is 10 and the surrounding proportions of business types are 30% restaurants, 60% government, and 10% agriculture. What are recommended ways to analyze how the housing cost changes with respect to different proportions (over sites)? I am looking for statistical procedures, and have so far come across Redundancy Analysis. However, I’m having trouble figuring out whether this is exactly what I’m looking for, or whether it is optimal. I’ve also looked into canonical analysis and have the same confusion.

Further, I have different proportions over different spatial scales. For example, around each site I have drawn a circle and calculated the business proportions within the circle. Then I can make the circle a little bigger and follow the same approach. Finally, I also have data over many years, so I am hoping to add a temporal aspect to this analysis. Obviously I need to start with a single year and circle (proportion) to keep things simple first, but I am looking for guidance on sifting through these layers in the most efficient and methodical way.
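As a "keep it simple first" starting point (a sketch only, with made-up numbers echoing the example site above, and not a substitute for redundancy or canonical analysis), one could regress house cost on the business-type proportions for a single year and a single buffer size:

import pandas as pd
import statsmodels.api as sm

# hypothetical layout: one row per site, for a single year and a single buffer radius;
# the first row echoes the example in the question, the rest are made up
sites = pd.DataFrame({
    'restaurant':  [0.30, 0.10, 0.50, 0.25, 0.40],
    'government':  [0.60, 0.70, 0.20, 0.45, 0.30],
    'agriculture': [0.10, 0.20, 0.30, 0.30, 0.30],
    'avg_house_cost': [10.0, 12.0, 9.0, 11.0, 9.5],
})

# since the proportions sum to 1, drop one category (agriculture here) to
# avoid perfect collinearity, treating it as the reference composition
X = sm.add_constant(sites[['restaurant', 'government']])
fit = sm.OLS(sites['avg_house_cost'], X).fit()
print(fit.summary())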


Get this bounty!!!

#StackBounty: #regression #hypothesis-testing #multiple-regression #inference #linear-model Working with Covid-19 Dataset in R, Checkin…

Bounty: 100

I’m working on a dataset of individuals with the Covid-19 infection. The dataset consists of 2010 individuals and four columns:

  • deceased: if the person died of corona (1:yes, 0:no)
  • sex: male/female
  • age: age of the person (ranging from 2 to 99 years old)
  • country: which country the person is from (France, Japan, Korea, Indonesia)

The dataset can be loaded with the following code:

id = "1CA1RPRYqU9oTIaHfSroitnWrI6WpUeBw"
d.corona = read.csv(sprintf("https://docs.google.com/uc?id=%s&export=download",id),header = T)

I’m trying to answer the following questions:

  • Is age a greater risk factor for males than for females?
  • Is age a greater risk factor for the French population than for the Korean population?
  • It seems like the French population is at much higher risk of dying from Covid-19 than the other countries (just from inspecting data). Does this seem like a valid result, or could it be influenced by the way the data were collected?

I was thinking it would be a good idea to fit some appropriate models to answer the questions. I already fit a logistic regression model with glm, but I don’t see how I could use that to answer these specific questions. For the very last question, I believe the mean age in France is significantly higher than in the other countries, which would explain why the probability of dying from the virus seems higher in France based on the data. But I’m sure there could be a better explanation.
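For concreteness, here is one way the interaction questions could be encoded in a single model (shown in Python/statsmodels as a sketch; the same formula works with R’s glm(..., family = binomial), and the coefficient names in the comments depend on the default reference levels, so treat them as assumptions):

import pandas as pd
import statsmodels.formula.api as smf

# same file as in the R snippet above, loaded into a pandas DataFrame
url = ("https://docs.google.com/uc?id=1CA1RPRYqU9oTIaHfSroitnWrI6WpUeBw"
       "&export=download")
d_corona = pd.read_csv(url)

# age*sex and age*country interactions let the first two questions be read off
# single coefficients (and tested via their Wald z / p-values)
fit = smf.logit('deceased ~ age * sex + age * country', data=d_corona).fit()
print(fit.summary())
# e.g. the 'age:sex[T.male]' row asks whether age is a greater risk factor for males,
# and 'age:country[T.Korea]' (France being the alphabetical reference level) compares
# the age effect in Korea with that in France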


Get this bounty!!!

#StackBounty: #estimation #inference #fisher-information #efficiency Deriving C-R inequality from H-C-R bound

Bounty: 50

As mentioned in the title, I want to derive the Cramér-Rao lower bound from the Hammersley-Chapman-Robbins lower bound for the variance of a statistic $T$.
The statement of the H-C-R lower bound is the following:

Let $\mathbf{X} \sim f_{\theta}(\cdot)$ where $\theta \in \Theta \subseteq \mathbb{R}^k$. Suppose $T(\mathbf{X})$ is an unbiased estimator of $\tau(\theta)$ where $\tau \colon \Theta \to \mathbb{R}$. Then we have,
\begin{equation}
\text{Var}_{\theta}(T) \ge \sup_{\Delta \in \mathcal{H}_{\theta}} \frac{[\tau(\theta + \Delta) - \tau(\theta)]^2}{\mathbb{E}_{\theta}\left(\frac{f_{\theta + \Delta}}{f_{\theta}} - 1\right)^2}
\end{equation}

where $\mathcal{H}_{\theta} = \{\alpha \in \Theta \colon \text{support of } f \text{ at } \theta + \alpha \subseteq \text{support of } f \text{ at } \theta\}$.

Now when $k = 1$ and the regularity conditions hold, taking $\Delta \to 0$ gives the following inequality,
\begin{equation}
\text{Var}_{\theta}(T) \ge \frac{[\tau'(\theta)]^2}{\mathbb{E}_{\theta} \left( \frac{\partial}{\partial \theta} \log f_{\theta}(\mathbf{X}) \right)^2}
\end{equation}

which is exactly the C-R inequality in the univariate case.

However, I want to derive the general form of the C-R inequality from the H-C-R bound, i.e. when $k > 1$, but I have not been able to do it. I was able to figure out that we would have to use $\mathbf{0} \in \mathbb{R}^k$ instead of $0$ and $\|\Delta\|$ to obtain the derivatives, which was obvious anyway, but I couldn’t get to any expression remotely similar to the C-R inequality. One of the difficulties arises while dealing with the squares: in the univariate case we were able to take the limit inside and, as a result, got the square of the derivative, while in the multivariate case we cannot take the limit inside, because the derivative would be a vector and we would end up with an expression containing the square of a vector, which is absurd.

How do I derive the C-R inequality in this general case?
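The kind of argument I suspect is needed (an outline under the usual regularity conditions, which I have not been able to make rigorous) is to restrict $\Delta$ to a ray $ta$ for a fixed direction $a \in \mathbb{R}^k$ and let $t \to 0$, which keeps everything scalar:
\begin{equation}
\text{Var}_{\theta}(T) \ge \lim_{t \to 0} \frac{[\tau(\theta + ta) - \tau(\theta)]^2}{\mathbb{E}_{\theta}\left(\frac{f_{\theta + ta}}{f_{\theta}} - 1\right)^2} = \frac{\big(a^{\top} \nabla \tau(\theta)\big)^2}{a^{\top} I(\theta)\, a}
\end{equation}
Taking the supremum over directions, $\sup_{a \neq 0} \frac{(a^{\top} \nabla \tau(\theta))^2}{a^{\top} I(\theta) a} = \nabla \tau(\theta)^{\top} I(\theta)^{-1} \nabla \tau(\theta)$ (a Cauchy–Schwarz argument), would then recover the multivariate C-R bound.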


Get this bounty!!!

#StackBounty: #anova #generalized-linear-model #inference #multiple-comparisons #ancova ANCOVA: Measured weight of 5 participants 5 tim…

Bounty: 50

I measured the weight of 5 participants, 5 times each, on 3 different scales, each accurate to 5 decimal places.

In total, I took: [5 participants] x [5 times each] x [3 different scales] = 75 measurements.

EDIT 1: My null hypothesis is that the weight measured by the three weighing scales is the same. I’d like to be able to say, with the help of a statistical test: 1) whether or not any of the weighing scales is different, and 2) if any of the three scales is indeed different, which one?

Since weight is a continuous response variable, if I wanted to use ANCOVA to test for differences between the scales, should I use participant ID (e.g. 1, 2, 3, 4, 5) as a covariate and run the ANCOVA?

Any pointers to the right statistical test for such an analysis and the R functions I should look at would be greatly appreciated!

EDIT 2: For some additional context: the data have already been acquired as described above and we cannot acquire new data. Looking at the data in retrospect, I first thought of ANOVA and then wondered whether participant ID should be included as a covariate, since measurements were repeated 5 times for each participant on all the scales.

Thanks to @whuber for teasing out the details and enriching the question.
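In case it helps to see the repeated-measures structure written down, here is a sketch (in Python/statsmodels, although the question asks about R, where lme4’s lmer(weight ~ scale + (1 | participant)) would be the analogue; the file and column names are assumptions):

import pandas as pd
import statsmodels.formula.api as smf

# assumed layout: one row per measurement (75 rows) with columns
# weight, scale (the three devices) and participant (1-5)
df = pd.read_csv('weights.csv')  # hypothetical file name

# scale as the fixed effect of interest, participant as a grouping factor
# (random intercept) rather than a numeric covariate
fit = smf.mixedlm('weight ~ C(scale)', data=df, groups=df['participant']).fit()
print(fit.summary())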


Get this bounty!!!

#StackBounty: #bayesian #estimation #inference #prevalence Optimization of pool size and number of tests for prevalence estimation via …

Bounty: 100

I’m trying to devise a protocol for pooling lab tests from a cohort in order to get prevalence estimates using as few reagents as possible.

Assuming perfect sensitivity and specificity (including them in the answer would be a plus), if I group testing material in pools of size $s$, then given an underlying (I don’t like the term "real") mean probability $p$ of the disease, the probability of a pool being positive is:

$$p_w = 1 - (1 - p)^s$$

if I run $w$ such pools the probability of having $k$ positive wells given a certain prevalence is:

$$p(k \mid w, p) = \binom{w}{k} \left[1 - (1 - p)^s\right]^k (1 - p)^{s(w-k)}$$

that is, $k \sim \operatorname{Binom}(w, 1 - (1 - p)^s)$.

To get $p$ I just need to maximize the likelihood $p(k \mid w, p)$ or use the formula $1 - \sqrt[s]{1 - k/w}$ (not really sure about this second one…).
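As a quick numerical check that the closed form and the likelihood maximization agree (a sketch with made-up values for $s$, $w$ and $k$):

import numpy as np
from scipy import stats, optimize

s, w, k = 10, 40, 12   # made-up pool size, number of pools, positive pools

# closed form from above: 1 - (1 - k/w)^(1/s)
p_closed = 1 - (1 - k/w)**(1/s)

# direct maximization of the likelihood of k ~ Binom(w, 1 - (1 - p)^s)
def neg_log_lik(p):
    return -stats.binom.logpmf(k, w, 1 - (1 - p)**s)

p_mle = optimize.minimize_scalar(neg_log_lik, bounds=(1e-9, 1 - 1e-9),
                                 method='bounded').x

print(p_closed, p_mle)  # the two agree up to numerical tolerance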

My question is, how do I optimize $s$ (maximize) and $w$ (minimize) according to a prior on $p$ in order to have the most precise estimates, below a certain level of error?


Get this bounty!!!