## #StackBounty: #machine-learning #confidence-interval #p-value #conditional-probability Finding a confidence factor for a calculation?

### Bounty: 50

I have a population in which some members have an event A and others don’t. Event A is my target class. I also have a set of variables/features for the population that I can use in a modeling (supervised learning) setting. Let’s say one of the features is age. I’d like to quantify the impact of age on event A in a very intuitive way. Assume the population size is 2000, of which 100 have event A and the rest don’t. I somehow came up with a cut point for age, e.g. less than 40 years old versus greater than 40 years old. Here is the distribution of the population:

``````
                   Have event A    Don't have event A
less than 40            20               100
greater than 40         80              1800
``````

To show the impact of age on the event, I compute: P(have event A | age less than 40) / P(have event A | age greater than 40)
= (20/120) / (80/1880)

However, I’d like to find something like a p-value for this calculation. How can I do that?
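
The ratio above is usually called a relative risk. One possible way to attach an interval and a p-value to it is sketched below in R. This is a minimal sketch and not a definitive answer: the Wald interval on the log scale and the choice of `chisq.test()` and `fisher.test()` are assumptions about what "something like a p-value" should mean here, and the object names (`tab`, `rr`, `ci_rr`) are made up for illustration.

```r
# 2x2 table from the question: rows = age group, columns = event status
tab <- matrix(c(20,  100,
                80, 1800),
              nrow = 2, byrow = TRUE,
              dimnames = list(age   = c("<40", ">40"),
                              event = c("A", "no A")))

n_row <- rowSums(tab)                 # group sizes: 120 and 1880
risk  <- tab[, "A"] / n_row           # P(A | age group)
rr    <- risk[["<40"]] / risk[[">40"]]  # relative risk, about 3.92

# Approximate 95% Wald interval for the relative risk, built on the log scale
se_log_rr <- sqrt(1 / tab["<40", "A"] - 1 / n_row[["<40"]] +
                  1 / tab[">40", "A"] - 1 / n_row[[">40"]])
ci_rr <- exp(log(rr) + c(-1, 1) * qnorm(0.975) * se_log_rr)

# Two common tests of "no association between age group and event A"
p_chisq  <- chisq.test(tab)$p.value   # Pearson chi-squared with continuity correction
p_fisher <- fisher.test(tab)$p.value  # Fisher's exact test

rr; ci_rr; p_chisq; p_fisher
```

Both tests address the null hypothesis that the event rate is the same in the two age groups, which is equivalent to a relative risk of 1. Note that choosing the age cut point after looking at the data would make any such p-value optimistic.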


## #StackBounty: #regression #correlation #p-value #assumptions Difference between the assumptions underlying a correlation and a regressi…

### Bounty: 50

My question grew out of a discussion with @whuber in the comments of a different question.

Specifically, @whuber’s comment was as follows:

One reason it might surprise you is that the assumptions underlying a correlation test and a regression slope test are different–so even when we understand that the correlation and slope are really measuring the same thing, why should their p-values be the same? That shows how these issues go deeper than simply whether $r$ and $\beta$ should be numerically equal.

This got me thinking about it and I came across a variety of interesting answers. For example, I found this question “Assumptions of correlation coefficient” but can’t see how this would clarify the comment above.

I found more interesting answers about the relationship of Pearson’s $r$ and the slope $\beta$ in a simple linear regression (see here and here for example) but none of them seem to answer what @whuber was referring to in his comment (at least not in a way that is apparent to me).

Question 1: What are the assumptions underlying a correlation test and a regression slope test?

For my 2nd question consider the following outputs in `R`:

``````
model <- lm(Employed ~ Population, data = longley)
summary(model)

Call:
lm(formula = Employed ~ Population, data = longley)

Residuals:
    Min      1Q  Median      3Q     Max
-1.4362 -0.9740  0.2021  0.5531  1.9048

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   8.3807     4.4224   1.895   0.0789 .
Population    0.4849     0.0376  12.896 3.69e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.013 on 14 degrees of freedom
Multiple R-squared:  0.9224,    Adjusted R-squared:  0.9168
F-statistic: 166.3 on 1 and 14 DF,  p-value: 3.693e-09
``````

And the output of the `cor.test()` function:

``````
with(longley, cor.test(Population, Employed))

        Pearson's product-moment correlation

data:  Population and Employed
t = 12.8956, df = 14, p-value = 3.693e-09
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.8869236 0.9864676
sample estimates:
      cor
0.9603906
``````

As can be seen from the `lm()` and `cor.test()` output, Pearson’s correlation coefficient $r$ and the slope estimate ($\beta_1$) are quite different, 0.96 vs. 0.485, respectively, but the t-values and the p-values are the same.

I then tried to calculate by hand the t-values for $r$ and $\beta_1$, which are the same despite $r$ and $\beta_1$ being different. And that’s where I get stuck, at least for $r$:

Calculate the sums of squares of $x$ and $y$ and the sum of their cross-products, which are needed for the slope ($\beta_1$) in a simple linear regression:

``````
x <- longley$Population; y <- longley$Employed
xbar <- mean(x); ybar <- mean(y)
ss.x <- sum((x-xbar)^2)
ss.y <- sum((y-ybar)^2)
ss.xy <- sum((x-xbar)*(y-ybar))
``````

Calculate the least-squares estimate of the regression slope, $\beta_1$ (there is a proof of this in Crawley’s R Book 1st edition, page 393):

``````
b1 <- ss.xy/ss.x
b1
# [1] 0.4848781
``````

Calculate the standard error for $\beta_1$:

``````
ss.residual <- sum((y-model$fitted)^2)
n <- length(x) # SAMPLE SIZE
k <- length(model$coef) # NUMBER OF MODEL PARAMETERS (i.e. b0 and b1)
df.residual <- n-k
ms.residual <- ss.residual/df.residual # RESIDUAL MEAN SQUARE
se.b1 <- sqrt(ms.residual/ss.x)
se.b1
# [1] 0.03760029
``````

And the t-value and p-value for $\beta_1$:

``````
t.b1 <- b1/se.b1
p.b1 <- 2*pt(-abs(t.b1), df=n-2)
t.b1
# [1] 12.89559
p.b1
# [1] 3.693245e-09
``````

What I don’t know at this point, and this is Question 2, is how to calculate the same t-value using $r$ instead of $\beta_1$ (perhaps in baby steps).

Since the alternative hypothesis of `cor.test()` is that the true correlation is not equal to 0 (see the `cor.test()` output above), I would expect something like the Pearson correlation coefficient $r$ divided by the “standard error of the Pearson correlation coefficient” (similar to `b1/se.b1` above). But what would that standard error be, and why?

Maybe this has something to do with the aforementioned assumptions underlying a correlation test and a regression slope test?!
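
For reference, the standard textbook form of the test statistic for $H_0: \rho = 0$ is $t = r\sqrt{n-2}/\sqrt{1-r^2}$, so the quantity playing the role of the standard error is $\sqrt{(1-r^2)/(n-2)}$. The sketch below checks this numerically against the `longley` results above. The names `se.r`, `t.r`, `p.r` and `b1.from.r` are introduced here only for illustration, and the sketch does not address Question 1 about the differing assumptions.

```r
x <- longley$Population; y <- longley$Employed
n <- length(x)

r <- cor(x, y)                        # Pearson correlation, about 0.9604

# Quantity used as the "standard error" of r under H0: rho = 0
se.r <- sqrt((1 - r^2) / (n - 2))

t.r <- r / se.r                       # same t-value as for the slope
p.r <- 2 * pt(-abs(t.r), df = n - 2)  # two-sided p-value

# The slope is a rescaled correlation: b1 = r * sd(y) / sd(x)
b1.from.r <- r * sd(y) / sd(x)

c(t = t.r, p = p.r, b1 = b1.from.r)
```

The agreement of the two t-values is algebraic: substituting $b_1 = r\,s_y/s_x$ and the residual mean square $(1-r^2)\,SS_y/(n-2)$ into `b1/se.b1` makes the scale factors $s_x$ and $s_y$ cancel, leaving $r\sqrt{n-2}/\sqrt{1-r^2}$.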


## #StackBounty: #hypothesis-testing #p-value #multiple-comparisons #power #false-discovery-rate How to calculate FDR and Power?

### Bounty: 50

Would anybody give a detailed numerical example of an FDR and power calculation? For example, given a set of p-values, calculate the threshold using the BH procedure, then calculate the FDR and power. I need a complete example of all this. You may use these p-values: 0.010, 0.013, 0.014, 0.190, 0.350, 0.500, 0.630, 0.670, 0.750, 0.810

thanks
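
A minimal sketch of the Benjamini-Hochberg (BH) step-up procedure on the ten p-values above, written in R. Two things here are assumptions and not part of the question: the level `alpha = 0.05`, and the vector `is_alt` marking which hypotheses are truly non-null. The latter is entirely hypothetical and is included only because the realized false discovery proportion and the power cannot be computed from p-values alone; they require knowing (or simulating) the truth.

```r
p <- c(0.010, 0.013, 0.014, 0.190, 0.350, 0.500, 0.630, 0.670, 0.750, 0.810)
alpha <- 0.05                    # assumed FDR level
m <- length(p)

# BH step-up: find the largest k with p_(k) <= (k / m) * alpha
ord     <- order(p)
bh_line <- seq_len(m) / m * alpha
below   <- which(p[ord] <= bh_line)
k       <- if (length(below) > 0) max(below) else 0

rejected <- rep(FALSE, m)
if (k > 0) rejected[ord[seq_len(k)]] <- TRUE

# Cross-check against R's built-in BH adjustment
p_adj <- p.adjust(p, method = "BH")

# HYPOTHETICAL ground truth, for illustration only
is_alt <- c(TRUE, TRUE, FALSE, TRUE, rep(FALSE, 6))

fdp   <- sum(rejected & !is_alt) / max(1, sum(rejected))  # false discovery proportion
power <- sum(rejected & is_alt) / sum(is_alt)             # fraction of true effects found

k; which(rejected); round(p_adj, 4); fdp; power
```

With these numbers only the third ordered p-value falls under the BH line (0.014 against 3/10 × 0.05 = 0.015), so the step-up rule rejects the three smallest p-values. The FDR itself is the expectation of the false discovery proportion over repeated experiments, so a single data set gives one realization of it and not the FDR.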


## #StackBounty: #regression #confidence-interval #p-value #bootstrap #nonlinear-regression Efficient nonparametric estimation of confiden…

### Bounty: 50

I’m estimating parameters for a complex, “implicit” nonlinear model $f(\mathbf{x}, \boldsymbol{\theta})$. It’s “implicit” in the sense that I don’t have an explicit formula for $f$: its value is the output of a complex computational fluid dynamics (CFD) code. After NLS regression, I had a look at the residuals, and they don’t look normal at all. Also, I’m having a lot of issues with estimating their variance-covariance matrix: the methods available in `nlstools` fail with an error.

I suspect the assumption of normally distributed parameter estimators is not valid, so I would like to use some nonparametric method to estimate confidence intervals, $p$-values and confidence regions for the three parameters of my model. I thought of the bootstrap, but other approaches are welcome, so long as they don’t rely on normality of the parameter estimators. Would this work:

1. Given a data set $D=\{P_i=(\mathbf{x}_i,f_i)\}_{i=1}^N$, generate data sets $D_1,\dots,D_m$ by sampling with replacement from $D$.
2. For each $D_i$, use NLS (nonlinear least squares) to estimate the model parameters $\boldsymbol{\theta}^*_i=(\theta^*_{1i},\theta^*_{2i},\theta^*_{3i})$.
3. I now have empirical distributions for the NLS parameter estimators. The sample mean of this distribution would be the bootstrap estimate of my parameters; the 2.5% and 97.5% quantiles would give me confidence intervals. I could also make scatterplot matrices of each parameter against the others, and get an idea of the correlation among them. This is the part I like the most, because I believe that one parameter is weakly correlated with the others, while the remaining two are extremely strongly correlated with each other.

Is this correct? Then how do I compute the $p$-values: what is the null for nonlinear regression models? For example, for parameter $\theta_3$, is it that $\theta_3=0$ and the other two are not? How would I compute the $p$-value for such a hypothesis from my bootstrap sample $\boldsymbol{\theta}^*_1,\dots,\boldsymbol{\theta}^*_m$? I don’t see the connection with the null…

Also, each NLS fit takes quite some time (let’s say a few hours) because I need to run my fluid dynamics code $p \times N$ times, where $N$ is the size of $D$ and $p$ is about 40 in my case. The total CPU time for the bootstrap is then $40 \times N \times m$ times the time of a single CFD run, which is a lot. I need a faster way. What can I do? I thought of building a metamodel for my CFD code (for example, a Gaussian process model) and using that for bootstrapping instead of the CFD code. What do you think? Would that work?
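
The three resampling steps listed above can be sketched in R as follows. Since the CFD code is not available, the sketch swaps in a cheap three-parameter exponential model as a stand-in; that model, its "true" parameter values, the noise level, the starting values and `m = 200` are all assumptions made for illustration. It shows only the case-resampling percentile bootstrap, not the p-value or metamodel parts of the question.

```r
set.seed(1)

# Stand-in for the expensive model (HYPOTHETICAL): y = a * exp(-b * x) + c
N <- 60
x <- seq(0, 5, length.out = N)
y <- 2 * exp(-1.2 * x) + 0.5 + rnorm(N, sd = 0.05)
dat <- data.frame(x = x, y = y)

# One NLS fit; returns NA for all parameters if the optimizer fails
fit_nls <- function(d) {
  tryCatch(
    coef(nls(y ~ a * exp(-b * x) + c, data = d,
             start = list(a = 1.5, b = 1, c = 0.3))),
    error = function(e) rep(NA_real_, 3)
  )
}

# Steps 1 and 2: resample cases with replacement and refit each time
m <- 200
boot_est <- t(replicate(m, fit_nls(dat[sample(N, replace = TRUE), ])))
colnames(boot_est) <- c("a", "b", "c")
boot_est <- boot_est[complete.cases(boot_est), , drop = FALSE]

# Step 3: percentile intervals and correlation among the estimates
ci <- apply(boot_est, 2, quantile, probs = c(0.025, 0.975))
boot_cor <- cor(boot_est)

ci; boot_cor
# pairs(boot_est) would give the scatterplot matrix mentioned in step 3
```

Wrapping each fit in `tryCatch()` is a design choice for this sketch: some resampled data sets may not converge, and silently dropping them can bias the intervals, so the number of failed fits is worth reporting alongside the results.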


## #StackBounty: #hypothesis-testing #t-test #p-value Estimating "population p-value" $\Pi$ using an observed p-value

### Bounty: 100

I asked a similar question last month, but from the responses, I see how the question can be asked more precisely.

Let’s suppose a population of the form

$$X \sim \mathcal{N}(100 + t_{n-1} \times \sigma / \sqrt{n}, \sigma)$$

in which $t_{n-1}$ is the Student $t$ quantile based on a specific value of a parameter $\Pi$ ($0<\Pi<1$). For the sake of the illustration, we could suppose that $\Pi$ is 0.025.

When performing a one-sided $t$ test of the null hypothesis $H_0: \mu = 100$ on a sample taken from that population, the expected $p$ value is $\Pi$, irrespective of sample size (as long as simple randomized sampling is used).

I have 4 questions:

1. Is the $p$ value a maximum likelihood estimator (MLE) of $\Pi$? (Conjecture: yes, because it is based on a $t$ statistic which is based on a likelihood ratio test);

2. Is the $p$ value a biased estimator of $\Pi$? (Conjecture: yes, because (i) MLEs tend to be biased, and (ii) based on simulations, I noted that the median value of many $p$s is close to $\Pi$ but the mean value of many $p$s is much larger);

3. Is the $p$ value a minimum variance estimator of $\Pi$? (Conjecture: yes in the asymptotic case, but no guarantee for a given sample size);

4. Can we get a confidence interval around a given $p$ value by using the confidence interval of the observed $t$ value (this is done using the non-central Student $t$ distribution with $n-1$ degrees of freedom and non-centrality parameter $t$) and computing the $p$ values of the lower and upper bound $t$ values? (Conjecture: yes, because both the non-central Student $t$ quantiles and the $p$ values of a one-sided test are continuous increasing functions)
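
The simulation mentioned in question 2 could look roughly like the R sketch below. Several choices are assumptions not stated in the question: the sample size `n = 20`, `sigma = 15`, the number of replicates, reading the second argument of $\mathcal{N}$ as a standard deviation, and taking $t_{n-1}$ to be the upper quantile `qt(1 - Pi, n - 1)`. The sketch only compares the mean and median of the simulated $p$ values and does not settle any of the four conjectures.

```r
set.seed(42)

Pi    <- 0.025   # "population p-value" from the question
n     <- 20      # assumed sample size
sigma <- 15      # assumed population standard deviation

# Population mean shifted from 100 by the t quantile times the standard error
mu <- 100 + qt(1 - Pi, df = n - 1) * sigma / sqrt(n)

# Draw many samples and record the one-sided p-value of H0: mu = 100
n_sim  <- 5000
p_vals <- replicate(n_sim, {
  x <- rnorm(n, mean = mu, sd = sigma)
  t.test(x, mu = 100, alternative = "greater")$p.value
})

c(mean = mean(p_vals), median = median(p_vals), Pi = Pi)
```

Because the distribution of $p$ values under this alternative is strongly right-skewed, the mean and the median can differ a lot, which is why the two summaries are reported side by side.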

