## #StackBounty: #distributions #variance #interpretation #standard-deviation #reporting Variance of a derived magnitude

### Bounty: 50

I’m wondering how to present results in a report (and how to interpret them).

Let $$Y = f(\mathbf{X})$$ be a random variable. Of course, if we derive its PDF $$f_{Y}(y)$$, we could present it in the report and the reader would have all the information about that variable. But suppose we compute its variance approximately (without going through its PDF) by using the formula
$$\begin{equation*} \text{Var}\left(Y\right) = \sum_{i=1}^{n} \left( \left. \frac{\partial y}{\partial x_i} \right\vert_{\mu_i} \sigma_{X_{i}} \right)^2 \end{equation*}$$
When we present the result as $$(\mu_{Y} - \sigma_{Y}, \mu_{Y} + \sigma_{Y})$$, we aren’t really giving any information about how it’s distributed.
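As a rough illustration of the propagation formula above, here is a minimal sketch that evaluates it numerically and compares it with a Monte Carlo estimate. The function `propagated_variance`, the example `Y = X1 * X2`, and the chosen means and standard deviations are all assumptions made up for the illustration; the inputs are taken to be independent and Gaussian.

```python
import numpy as np

def propagated_variance(f, mu, sigma, h=1e-6):
    """First-order estimate of Var(f(X)) for independent inputs X_i."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    grad = np.empty_like(mu)
    for i in range(mu.size):
        step = np.zeros_like(mu)
        step[i] = h
        # Central difference approximates the partial derivative at mu.
        grad[i] = (f(mu + step) - f(mu - step)) / (2 * h)
    return float(np.sum((grad * sigma) ** 2))

# Hypothetical derived magnitude: Y = X1 * X2
def product(x):
    return x[0] * x[1]

mu = [2.0, 3.0]      # assumed means
sigma = [0.1, 0.2]   # assumed standard deviations

var_linear = propagated_variance(product, mu, sigma)

# Monte Carlo check with independent Gaussian inputs (an assumption).
rng = np.random.default_rng(0)
samples = rng.normal(mu, sigma, size=(200_000, 2))
var_mc = np.var(samples[:, 0] * samples[:, 1], ddof=1)

print(var_linear, var_mc)
```

The two numbers should be close here because the relative uncertainties are small; the formula only gives the variance and says nothing about the shape of the distribution of `Y`.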

When the standard deviation of a variable is given without its PDF, are we supposed to interpret it through some inequality (Chebyshev’s, for example) that gives us a bound for a confidence interval?

I’m asking this because I took two laboratory courses that reported magnitudes as described above and now, in a probability course, I’ve learned that a function of Gaussian-distributed variables doesn’t, in general, follow a Gaussian distribution. So I want to know what the point is of reporting a standard deviation for an unknown distribution: is it to obtain a bound for the confidence interval (via Chebyshev, the only one that works for any distribution), or are there other reasons?
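To make the comparison concrete, the sketch below tabulates the coverage that Chebyshev’s inequality guarantees for an interval of ±k standard deviations next to the coverage the same interval would have if the variable were Gaussian. The function names are hypothetical and chosen only for this illustration.

```python
from math import erf, sqrt

def chebyshev_min_coverage(k):
    """Lower bound on P(|Y - mu| < k*sigma) for any finite-variance distribution."""
    return max(0.0, 1.0 - 1.0 / k**2)

def gaussian_coverage(k):
    """P(|Y - mu| < k*sigma) when Y is exactly Gaussian."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(k, round(chebyshev_min_coverage(k), 4), round(gaussian_coverage(k), 4))
```

For k = 1 the Chebyshev bound is 0, so a ±1σ interval carries no guaranteed coverage without further assumptions about the distribution.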

The question implicitly asks if what I’m saying is correct. If it’s not clear what I mean, please leave a comment so I can make an attempt to clarify.


## #StackBounty: #confidence-interval #standard-deviation #experiment-design #coefficient-of-variation #harmonic-mean Sample of rates: mea…

### Bounty: 50

A performance test of a software application measures the maximum rate (operations per second) that the application can handle. The test is repeated multiple times, with each iteration yielding a measured maximum rate, so we have a sample of rates to be analyzed statistically. The analysis should determine whether the result is useful (maximum dispersion as per the CoV, minimum confidence as per the size of the bootstrapped confidence interval), and the results should be summarized in a single number. The sample is rather small (<= 10).

1. AFAIK the harmonic mean should be used to summarize such rates instead of the arithmetic mean or am I wrong?
2. Do I also have to use variants of other statistical measures because the sample consists of rates? That is, are specialized formulas needed for the standard deviation and the coefficient of variation that differ from the commonly used ones? The common examples for the SD and the CoV use the arithmetic mean (expected value). Would I have to replace the arithmetic mean in these cases? I found this question, which is why I think specialized approaches are needed, but I know far too little about statistics to judge my specific case.
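For reference, here is a minimal sketch of the quantities mentioned above, computed on a made-up sample of rates (the values in `rates` are hypothetical, not measurements). It does not settle which mean is appropriate: as a general rule the harmonic mean suits rates measured over equal amounts of work, and the arithmetic mean suits rates measured over equal time spans, so the choice depends on how the test defines a run.

```python
import numpy as np
from statistics import harmonic_mean

# Hypothetical sample of maximum rates (operations per second).
rates = np.array([980.0, 1010.0, 995.0, 1023.0, 970.0, 1002.0, 988.0, 1015.0])

arith = rates.mean()
harm = harmonic_mean(rates.tolist())

# Coefficient of variation using the usual sample SD and arithmetic mean.
cov = rates.std(ddof=1) / arith

# Percentile bootstrap confidence interval for the arithmetic mean.
rng = np.random.default_rng(42)
boot = rng.choice(rates, size=(10_000, rates.size), replace=True)
boot_means = boot.mean(axis=1)
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

print(arith, harm, cov, (ci_low, ci_high))
```

With a dispersion this small the two means are nearly identical, so the choice matters more when the rates vary widely.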


## #StackBounty: #machine-learning #predictive-models #prediction #standard-deviation #confusion-matrix Compute standard deviation of accu…

### Bounty: 50

The following pseudocode outlines the problem as I have it

```
for each random seed in S
    randomise the data
    for k in 1 to 5
        create test / training data
        fit the model to the training data
        generate score
```

Therefore I will have $$S * 5$$ individual accuracy scores. My end score is an
average of these for which I would like to know the standard deviation.
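One way to look at the spread of such scores is sketched below, using placeholder values in place of real accuracies (the array `scores` is randomly generated purely for illustration). It reports the sample standard deviation over all scores and, separately, over the per-seed means. This is an assumption-laden sketch rather than a recommended estimator: scores from folds that share training data are not independent, so a naive standard error built from these numbers may understate the real uncertainty.

```python
import numpy as np

S, K = 3, 5  # number of seeds and folds, as in the pseudocode

# Placeholder accuracies, one row per seed and one column per fold.
rng = np.random.default_rng(0)
scores = rng.uniform(0.70, 0.80, size=(S, K))

mean_score = scores.mean()

# Spread over all S * K individual scores.
sd_all = scores.std(ddof=1)

# Spread of the per-seed averages (each row averaged over its folds).
per_seed_means = scores.mean(axis=1)
sd_between_seeds = per_seed_means.std(ddof=1)

print(mean_score, sd_all, sd_between_seeds)
```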

# original post

The following code represents my problem :

```
# S is the total number of random seeds to use
S = 3
# the size of each category, so original data will have 2n rows
n = 100
# number of "folds" to use
K = 5
# sample data
set.seed(2019)
original_data = data.frame(
  x = c(rnorm(n, 0.457, 0.01), c(rnorm(n, 0.508, 0.11))),
  y = c(rep(0, n), rep(1, n))
)
# will be a data frame to store the results.
results = NULL
iteration = 1
for(s in 1:S){
  set.seed(s)
  rnd = sample(1:(2*n))
  # get randomised data
  td = original_data[rnd,]
  for(k in 1:K){
    # get test and training data
    trainset = td[1:140,]
    testset  = td[-(1:140),]
    # fit model and get scores
    m = glm(y ~ x, data = trainset, family = "binomial")
    # get probabilities and predicted values
    model_probabilities = predict(m, newdata=testset,
                                  type="response")
    model_predictions   = 1 * ( model_probabilities >= 0.5)
    # store results
    results = rbind(results, data.frame(
      seed = s, k = k, iteration = iteration,
      probability = model_probabilities,
      prediction   = model_predictions,
      observed     = testset$y
    ))
    iteration = iteration + 1
  }
}

# table of predicted and observed
t = table(results$prediction, results$observed)
# convert into percentages
t = 100 * round(prop.table(t),3)
# compute the accuracy
accuracy = t[1,1] + t[2,2]
accuracy
```

With the output of :

```
> accuracy
[1] 51.1
> dim(results)
[1] 900   6
```

I want to know how to calculate the standard deviation for this accuracy measure.

## edit – choice of $$n$$

I’m still interested in the answer to this question; I’m not sure whether additional information is required.

Initially I thought that I should just use

$$\sqrt{ \frac{p(1-p)}{n} }$$

Where $$n =$$ number of rows in test set.

This doesn’t seem to take into account that the accuracy score is averaged across many iterations, and I can’t find any literature on this.
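For scale, the sketch below evaluates the binomial formula for the numbers that appear in the post: an accuracy of 51.1%, a single test set of 60 rows (200 rows minus 140 training rows), and the 900 pooled predictions. The helper name `binomial_se` is hypothetical. Treating all 900 pooled predictions as independent is an assumption that does not hold here, since the same rows are scored repeatedly, so the pooled value is best read as a lower reference point rather than as the answer.

```python
from math import sqrt

def binomial_se(p, n):
    """Standard error of a proportion p estimated from n independent trials."""
    return sqrt(p * (1 - p) / n)

p = 0.511  # overall accuracy reported in the post

se_single = binomial_se(p, 60)    # one test set of 60 rows
se_pooled = binomial_se(p, 900)   # all pooled predictions, assumed independent

print(round(se_single, 4), round(se_pooled, 4))
```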


## #HackerRank: Correlation and Regression Lines solutions

```
import numpy as np
import scipy as sp
from scipy.stats import norm
```

### Correlation and Regression Lines – A Quick Recap #1

Here are the test scores of 10 students in physics and history:

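| Student | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Physics Scores | 15 | 12 | 8 | 8 | 7 | 7 | 7 | 6 | 5 | 3 |
| History Scores | 10 | 25 | 17 | 11 | 13 | 17 | 20 | 13 | 9 | 15 |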

Compute Karl Pearson’s coefficient of correlation between these scores. Compute the answer correct to three decimal places.

Output Format

In the text box, enter the floating point/decimal value required. Do not leave any leading or trailing spaces. Your answer may look like: `0.255`

This is NOT the actual answer – just the format in which you should provide your answer.

```
physicsScores = [15, 12,  8,  8,  7,  7,  7,  6, 5,  3]
historyScores = [10, 25, 17, 11, 13, 17, 20, 13, 9, 15]
# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r
print(np.corrcoef(historyScores, physicsScores)[0, 1])
```
```
0.144998154581
```
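As a cross-check, the same coefficient can be computed directly from the definition of Pearson’s r (sum of products of deviations divided by the square root of the product of the sums of squared deviations). This is a sketch that assumes nothing beyond NumPy; the variable names are chosen for the illustration.

```python
import numpy as np

physics = np.array([15, 12, 8, 8, 7, 7, 7, 6, 5, 3], dtype=float)
history = np.array([10, 25, 17, 11, 13, 17, 20, 13, 9, 15], dtype=float)

# Deviations from the respective means.
dx = physics - physics.mean()
dy = history - history.mean()

# Pearson's r from its definition.
r = np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

print(round(r, 3))  # 0.145
```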

### Correlation and Regression Lines – A Quick Recap #2

Here are the test scores of 10 students in physics and history:

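| Student | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Physics Scores | 15 | 12 | 8 | 8 | 7 | 7 | 7 | 6 | 5 | 3 |
| History Scores | 10 | 25 | 17 | 11 | 13 | 17 | 20 | 13 | 9 | 15 |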

Compute the slope of the line of regression obtained while treating Physics as the independent variable. Compute the answer correct to three decimal places.

Output Format

In the text box, enter the floating point/decimal value required. Do not leave any leading or trailing spaces. Your answer may look like: `0.255`

This is NOT the actual answer – just the format in which you should provide your answer.

```
sp.stats.linregress(physicsScores, historyScores).slope
```
```
0.20833333333333331
```
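The slope can also be obtained by hand from the least-squares formula, slope = Sxy / Sxx, with Physics as the independent variable. The sketch below uses only the standard library; the names `s_xy` and `s_xx` are chosen for the illustration.

```python
physics = [15, 12, 8, 8, 7, 7, 7, 6, 5, 3]
history = [10, 25, 17, 11, 13, 17, 20, 13, 9, 15]

n = len(physics)
mean_x = sum(physics) / n
mean_y = sum(history) / n

# Sum of cross-deviations and sum of squared deviations of x.
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(physics, history))
s_xx = sum((x - mean_x) ** 2 for x in physics)

slope = s_xy / s_xx
print(round(slope, 3))  # 0.208
```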

### Correlation and Regression Lines – A quick recap #3

Here are the test scores of 10 students in physics and history:

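| Student | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Physics Scores | 15 | 12 | 8 | 8 | 7 | 7 | 7 | 6 | 5 | 3 |
| History Scores | 10 | 25 | 17 | 11 | 13 | 17 | 20 | 13 | 9 | 15 |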

When a student scores 10 in Physics, what is his probable score in History? Compute the answer correct to one decimal place.

Output Format

In the text box, enter the floating point/decimal value required. Do not leave any leading or trailing spaces. Your answer may look like: `0.255`

This is NOT the actual answer – just the format in which you should provide your answer.

```
def predict(pi, x, y):
    # fit y on x, then evaluate the fitted line at x = pi
    slope, intercept, rvalue, pvalue, stderr = sp.stats.linregress(x, y)
    return slope * pi + intercept

predict(10, physicsScores, historyScores)
```
```
15.458333333333332
```
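The prediction can be cross-checked with a degree-1 polynomial fit, which should give the same least-squares line. This sketch uses `np.polyfit`, a different routine from the one used in the solution above.

```python
import numpy as np

physics = [15, 12, 8, 8, 7, 7, 7, 6, 5, 3]
history = [10, 25, 17, 11, 13, 17, 20, 13, 9, 15]

# A degree-1 fit returns the coefficients as [slope, intercept].
slope, intercept = np.polyfit(physics, history, 1)

prediction = slope * 10 + intercept
print(round(prediction, 1))  # 15.5
```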

### Correlation and Regression Lines – A Quick Recap #4

The two regression lines of a bivariate distribution are:

`4x - 5y + 33 = 0` (line of y on x)

`20x - 9y - 107 = 0` (line of x on y).

Estimate the value of `x` when `y = 7`. Compute the correct answer to one decimal place.

Output Format

In the text box, enter the floating point/decimal value required. Do not leave any leading or trailing spaces. Your answer may look like: `7.2`

This is NOT the actual answer – just the format in which you should provide your answer.

```
# Line of y on x: 4x - 5y + 33 = 0    =>  y = (4x + 33) / 5
# Line of x on y: 20x - 9y - 107 = 0  =>  x = (9y + 107) / 20
# To estimate x from a given y, use the line of x on y.
t = 7
print((9 * t + 107) / 20)
```
```
8.5
```

### Correlation and Regression Lines – A Quick Recap #5

The two regression lines of a bivariate distribution are:

`4x - 5y + 33 = 0` (line of y on x)

`20x - 9y - 107 = 0` (line of x on y).

Find the variance of y when σx = 3.

Compute the correct answer to one decimal place.

Output Format

In the text box, enter the floating point/decimal value required. Do not leave any leading or trailing spaces. Your answer may look like: `7.2`

This is NOT the actual answer – just the format in which you should provide your answer.

#### Q.3. If the two regression lines of a bivariate distribution are 4x – 5y + 33 = 0 and 20x – 9y – 107 = 0,

• calculate the arithmetic means of x and y respectively;
• estimate the value of x when y = 7;
• find the variance of y when σx = 3.

##### Solution

We have,

4x – 5y + 33 = 0 => y = 4x/5 + 33/5 ... (1)

And

20x – 9y – 107 = 0 => x = 9y/20 + 107/20 ... (2)

(i) Both regression lines pass through the point of means, so solving (1) and (2) simultaneously gives mean of x = 13 and mean of y = 17. [Ans.]

(ii) Second line is line of x on y

x = (9/20) × 7 + (107/20) = 170/20 = 8.5 [Ans.]

(iii) byx = r(σy/σx), where r = √(byx · bxy) = √{(4/5)(9/20)} = 0.6. So 4/5 = 0.6 × σy/3 => σy = (4/5)(3/0.6) = 4.

Variance of y = σy² = 16 [Ans.]
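The three parts of the solution can be verified numerically. The sketch below solves the two lines as a linear system for the means, then recomputes r, the estimate of x at y = 7, and the variance of y. Variable names are chosen for the illustration.

```python
import numpy as np
from math import sqrt

# Rewrite the lines as 4x - 5y = -33 and 20x - 9y = 107.
A = np.array([[4.0, -5.0], [20.0, -9.0]])
b = np.array([-33.0, 107.0])

# Both regression lines pass through the point of means.
mean_x, mean_y = np.linalg.solve(A, b)

b_yx = 4 / 5    # slope of the line of y on x
b_xy = 9 / 20   # slope of the line of x on y

# r is the geometric mean of the two regression coefficients
# (positive here because both coefficients are positive).
r = sqrt(b_yx * b_xy)

# Estimate x when y = 7 using the line of x on y.
x_at_7 = b_xy * 7 + 107 / 20

# b_yx = r * sigma_y / sigma_x  =>  sigma_y = b_yx * sigma_x / r
sigma_x = 3
sigma_y = b_yx * sigma_x / r
var_y = sigma_y ** 2

print(mean_x, mean_y, x_at_7, round(var_y, 1))
```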