#StackBounty: #distributions #variance #interpretation #standard-deviation #reporting Variance of a derived magnitude

Bounty: 50

I’m wondering how to present results in a report (and how to interpret them).

Let $Y = f(\mathbf{X})$ be a random variable. Of course, if we derive its PDF $f_{Y}(y)$, we could present it in the report and the reader would have all the information about that variable. But suppose we compute its variance approximately (without going through its PDF) by using the formula
\begin{equation*}
\text{Var}\left(Y\right)
= \sum_{i=1}^{n}
\left(
\left. \frac{\partial y}{\partial x_i} \right\vert_{\mu_i} \sigma_{X_{i}}
\right)^2
\end{equation*}

When we present the result as $(\mu_{Y} - \sigma_{Y}, \mu_{Y} + \sigma_{Y})$, we aren’t really giving any information about how it’s distributed.
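As a rough illustration of how the first-order formula behaves, the sketch below compares it with a Monte Carlo estimate for a hypothetical function y = x1 * x2 with independent Gaussian inputs. The function, the means, and the standard deviations are assumptions chosen for the example, not values from the question.

```python
import math
import random
import statistics

# Hypothetical example: Y = X1 * X2 with independent Gaussian inputs.
mu = (10.0, 5.0)      # assumed means of X1 and X2
sigma = (0.2, 0.1)    # assumed standard deviations of X1 and X2


def f(x1, x2):
    return x1 * x2


# First-order propagation: partial derivatives evaluated at the means.
# For this particular f, dY/dX1 = x2 and dY/dX2 = x1.
grad_at_mu = (mu[1], mu[0])
var_linear = sum((g * s) ** 2 for g, s in zip(grad_at_mu, sigma))
sd_linear = math.sqrt(var_linear)

# Monte Carlo estimate of the same standard deviation (fixed seed for repeatability).
rng = random.Random(0)
samples = [
    f(rng.gauss(mu[0], sigma[0]), rng.gauss(mu[1], sigma[1]))
    for _ in range(100_000)
]
sd_mc = statistics.stdev(samples)

print(f"first-order SD: {sd_linear:.4f}")
print(f"Monte Carlo SD: {sd_mc:.4f}")
```

The two numbers agree closely here because the relative uncertainties are small. Neither of them says anything about the shape of the distribution of Y.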

When the standard deviation of a variable is given without its PDF, are we supposed to interpret it through some inequality (Chebyshev’s, for example) that gives us a bound for a confidence interval?


I’m asking this because I took two laboratory courses in which magnitudes were reported as described above, and now, in a probability course, I’ve learned that a function of Gaussian-distributed variables doesn’t, in general, follow a Gaussian distribution. So I want to know what the point is of reporting the standard deviation of an unknown distribution: is it to obtain a bound for the confidence interval (via Chebyshev’s inequality, the only one that works for any distribution), or are there other reasons?
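To make the comparison concrete, the sketch below tabulates the distribution-free Chebyshev lower bound on P(|Y - μ| < kσ) next to the coverage a Gaussian would give for the same k. It only illustrates the two standard formulas; it does not settle which interpretation a given report intends. The function names are made up for this example.

```python
import math


def chebyshev_lower_bound(k):
    """Lower bound on P(|Y - mu| < k*sigma) for any distribution with finite variance."""
    return max(0.0, 1.0 - 1.0 / k**2)


def gaussian_coverage(k):
    """P(|Y - mu| < k*sigma) when Y is exactly Gaussian."""
    return math.erf(k / math.sqrt(2.0))


for k in (1, 2, 3):
    print(
        f"k={k}: Chebyshev >= {chebyshev_lower_bound(k):.3f}, "
        f"Gaussian = {gaussian_coverage(k):.3f}"
    )
```

For k = 1 the Chebyshev bound is 0, so a one-sigma interval carries no distribution-free coverage guarantee on its own.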

The question implicitly asks if what I’m saying is correct. If it’s not clear what I mean, please leave a comment so I can make an attempt to clarify.



#StackBounty: #confidence-interval #standard-deviation #experiment-design #coefficient-of-variation #harmonic-mean Sample of rates: mea…

Bounty: 50

A performance test of a software application measures the maximum rate (operations per second) that the application can handle. The test is repeated multiple times, and each iteration yields one measured maximum rate, so we have a sample of rates to analyze statistically. We want to check whether the result is usable (dispersion below a maximum, measured by the CoV; confidence above a minimum, measured by the width of the bootstrapped confidence interval) and to summarize the results in a single number. The sample is rather small (<= 10).

  1. AFAIK the harmonic mean should be used to summarize such rates instead of the arithmetic mean. Or am I wrong?
  2. Do I also have to use variants of other statistical measures because the sample consists of rates? That is, are specialized formulas needed for the standard deviation and the coefficient of variation, different from the commonly used ones? The common examples for the SD and the CoV use the arithmetic mean (expected value). Would I have to replace the arithmetic mean in these cases? I found this question, which is why I think specialized approaches are needed, but I know far too little about statistics to judge my specific case.
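The quantities mentioned above can be computed with the Python standard library, as in the following minimal sketch. The sample of rates is invented for illustration, `bootstrap_ci` is a hypothetical helper, and the sketch does not decide which mean is the appropriate summary. It uses the conventional CoV (sample SD over the arithmetic mean) and a percentile bootstrap.

```python
import random
import statistics

# Hypothetical sample of maximum rates (operations per second), one per test run.
rates = [1180.0, 1225.0, 1192.0, 1210.0, 1168.0, 1240.0, 1201.0, 1215.0]

arithmetic_mean = statistics.mean(rates)
harmonic_mean = statistics.harmonic_mean(rates)

# Conventional coefficient of variation: sample SD divided by the arithmetic mean.
cov = statistics.stdev(rates) / arithmetic_mean


def bootstrap_ci(data, stat, n_resamples=5000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for stat(data)."""
    rng = random.Random(seed)
    estimates = sorted(
        stat(rng.choices(data, k=len(data))) for _ in range(n_resamples)
    )
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return estimates[lo_idx], estimates[hi_idx]


ci_low, ci_high = bootstrap_ci(rates, statistics.harmonic_mean)

print(f"arithmetic mean: {arithmetic_mean:.1f}")
print(f"harmonic mean:   {harmonic_mean:.1f}")
print(f"CoV:             {cov:.4f}")
print(f"95% bootstrap CI for the harmonic mean: ({ci_low:.1f}, {ci_high:.1f})")
```

With dispersion this low the two means are nearly identical. The gap between them grows as the spread of the rates grows.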



#StackBounty: #machine-learning #predictive-models #prediction #standard-deviation #confusion-matrix Compute standard deviation of accu…

Bounty: 50

edit – more information about what the code given should represent

The following pseudocode outlines the problem as I have it

for each random seed in S
    randomise the data
    for k in 1 to 5
        create test / training data
        fit the model to the training data
        generate score 

Therefore I will have $S \times 5$ individual accuracy scores. My end score is an
average of these, for which I would like to know the standard deviation.


original post

The following code represents my problem :

# S is the total number of random seeds to use
S = 3
# the size of each category, so original data will have 2n rows
n = 100
# number of "folds" to use
K = 5
# sample data
set.seed(2019)
original_data = data.frame(
  x = c(rnorm(n, 0.457, 0.01), c(rnorm(n, 0.508, 0.11))),
  y = c(rep(0, n), rep(1, n))
)
# will be a data frame to store the results. 
results = NULL
iteration = 1
for(s in 1:S){
  set.seed(s)
  rnd = sample(1:(2*n))
  # get randomised data
  td = original_data[rnd,]
  for(k in 1:K){
    # get test and training data
    # (note: this split does not depend on k, so every "fold" of a given seed is identical)
    trainset = td[1:140,]
    testset  = td[-(1:140),]
    # fit model and get scores
    m = glm(y ~ x, data = trainset, family = "binomial")
    # get probabilities and predicted values
    model_probabilities = predict(m, newdata=testset, 
                                type="response")
    model_predictions   = 1 * ( model_probabilities >= 0.5)
    # store results
    results = rbind(results, data.frame(
      seed = s, k = k, iteration = iteration,
      probability = model_probabilities, 
      prediction   = model_predictions,
      observed     = testset$y
    ))
    iteration = iteration + 1
  }
}

# table of predicted and observed
t = table(results$prediction, results$observed)
# convert into percentages 
t = 100 * round(prop.table(t),3)
# compute the accuracy 
accuracy = t[1,1] + t[2,2]
accuracy

With the output of :

> accuracy
[1] 51.1
> dim(results)
[1] 900   6

I want to know how to calculate the standard deviation for this accuracy measure.


edit – choice of $n$

I’m still interested in the answer to this question; I’m not sure if additional information is required.

Initially I thought that I should just use

$$
\sqrt{
\frac{p(1-p)}{n}
}
$$

where $n$ is the number of rows in the test set.

This doesn’t seem to take into account that the accuracy score is averaged across many iterations, and I can’t find literature on this.
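For comparison, the sketch below computes three candidate quantities from a set of per-fold accuracies: the empirical SD of the scores, a naive standard error of their mean, and the binomial formula above for a single test set. The scores are invented for illustration. The naive standard error assumes independent scores, which is unlikely to hold when folds share training data, so it is a reference point and not a recommended answer.

```python
import math
import statistics

# Hypothetical per-fold accuracies: S = 3 seeds x K = 5 folds = 15 scores.
scores = [
    0.52, 0.48, 0.55, 0.50, 0.47,   # seed 1
    0.53, 0.49, 0.51, 0.54, 0.46,   # seed 2
    0.50, 0.52, 0.49, 0.53, 0.51,   # seed 3
]
n_test = 60  # rows in each test set, as in the R code (200 - 140)

mean_accuracy = statistics.mean(scores)

# Empirical spread of the individual scores.
sd_scores = statistics.stdev(scores)

# Naive standard error of the mean, treating the scores as independent.
# Folds that share training data are correlated, so this is likely optimistic.
se_mean_naive = sd_scores / math.sqrt(len(scores))

# Binomial standard error for a single test set of size n_test.
se_binomial = math.sqrt(mean_accuracy * (1 - mean_accuracy) / n_test)

print(f"mean accuracy:       {mean_accuracy:.4f}")
print(f"SD of the scores:    {sd_scores:.4f}")
print(f"naive SE of mean:    {se_mean_naive:.4f}")
print(f"binomial SE (n={n_test}): {se_binomial:.4f}")
```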



#HackerRank: Correlation and Regression Lines solutions

import numpy as np
import scipy as sp
from scipy.stats import norm

Correlation and Regression Lines – A Quick Recap #1

Here are the test scores of 10 students in physics and history:

Physics Scores 15 12 8 8 7 7 7 6 5 3

History Scores 10 25 17 11 13 17 20 13 9 15

Compute Karl Pearson’s coefficient of correlation between these scores. Compute the answer correct to three decimal places.

Output Format

In the text box, enter the floating point/decimal value required. Do not leave any leading or trailing spaces. Your answer may look like: 0.255

This is NOT the actual answer – just the format in which you should provide your answer.

physicsScores=[15, 12,  8,  8,  7,  7,  7,  6, 5,  3]
historyScores=[10, 25, 17, 11, 13, 17, 20, 13, 9, 15]
print(np.corrcoef(historyScores,physicsScores)[0][1])
0.144998154581
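As a cross-check on the library call, the coefficient can also be computed straight from its definition, r = Sxy / sqrt(Sxx · Syy). This is a minimal sketch using only the standard library; `pearson_r` is a name made up for this example.

```python
import math

physics = [15, 12, 8, 8, 7, 7, 7, 6, 5, 3]
history = [10, 25, 17, 11, 13, 17, 20, 13, 9, 15]


def pearson_r(x, y):
    """Pearson correlation from sums of centered products."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(physics, history)
print(round(r, 3))  # 0.145
```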

Correlation and Regression Lines – A Quick Recap #2

Here are the test scores of 10 students in physics and history:

Physics Scores 15 12 8 8 7 7 7 6 5 3

History Scores 10 25 17 11 13 17 20 13 9 15

Compute the slope of the line of regression obtained while treating Physics as the independent variable. Compute the answer correct to three decimal places.

Output Format

In the text box, enter the floating point/decimal value required. Do not leave any leading or trailing spaces. Your answer may look like: 0.255

This is NOT the actual answer – just the format in which you should provide your answer.

sp.stats.linregress(physicsScores,historyScores).slope
0.20833333333333331

Correlation and Regression Lines – A quick recap #3

Here are the test scores of 10 students in physics and history:

Physics Scores 15 12 8 8 7 7 7 6 5 3

History Scores 10 25 17 11 13 17 20 13 9 15

When a student scores 10 in Physics, what is his probable score in History? Compute the answer correct to one decimal place.

Output Format

In the text box, enter the floating point/decimal value required. Do not leave any leading or trailing spaces. Your answer may look like: 0.255

This is NOT the actual answer – just the format in which you should provide your answer.

def predict(x_new, x, y):
    # fit the least-squares line y = slope * x + intercept, then evaluate it at x_new
    slope, intercept, rvalue, pvalue, stderr = sp.stats.linregress(x, y)
    return slope * x_new + intercept

predict(10,physicsScores,historyScores)
15.458333333333332
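The slope and the prediction can also be reproduced without SciPy from the least-squares formulas slope = Sxy / Sxx and intercept = mean(y) - slope · mean(x). This is a minimal sketch; `fit_line` is a name made up for this example.

```python
physics = [15, 12, 8, 8, 7, 7, 7, 6, 5, 3]
history = [10, 25, 17, 11, 13, 17, 20, 13, 9, 15]


def fit_line(x, y):
    """Least-squares line of y on x; returns (slope, intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(physics, history)
prediction = slope * 10 + intercept

print(round(slope, 3))       # 0.208
print(round(prediction, 1))  # 15.5
```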

Correlation and Regression Lines – A Quick Recap #4

The two regression lines of a bivariate distribution are:

4x – 5y + 33 = 0 (line of y on x)

20x – 9y – 107 = 0 (line of x on y).

Estimate the value of x when y = 7. Compute the correct answer to one decimal place.

Output Format

In the text box, enter the floating point/decimal value required. Do not leave any leading or trailing spaces. Your answer may look like: 7.2

This is NOT the actual answer – just the format in which you should provide your answer.

x=[i for i in range(0,20)]

'''
    4x - 5y + 33 = 0
    x = ( 5y - 33 ) / 4
    y = ( 4x + 33 ) / 5
    
    20x - 9y - 107 = 0
    x = (9y + 107)/20
    y = (20x - 107)/9
'''
t=7
print( ( 9 * t + 107 ) / 20 )
8.5

Correlation and Regression Lines – A Quick Recap #5

The two regression lines of a bivariate distribution are:

4x – 5y + 33 = 0 (line of y on x)

20x – 9y – 107 = 0 (line of x on y).

Find the variance of y when σx = 3.

Compute the correct answer to one decimal place.

Output Format

In the text box, enter the floating point/decimal value required. Do not leave any leading or trailing spaces. Your answer may look like: 7.2

This is NOT the actual answer – just the format in which you should provide your answer.

http://www.mpkeshari.com/2011/01/19/lines-of-regression/

Q.3. If the two regression lines of a bivariate distribution are 4x – 5y + 33 = 0 and 20x – 9y – 107 = 0,

  • calculate the arithmetic means of x and y respectively.
  • estimate the value of x when y = 7.
  • find the variance of y when σx = 3.

Solution:

We have,

4x – 5y + 33 = 0 => y = 4x/5 + 33/5 ... (i)

and

20x – 9y – 107 = 0 => x = 9y/20 + 107/20 ... (ii)

(i) Both regression lines pass through the point (mean of x, mean of y), so solving (i) and (ii) simultaneously gives mean of x = 13 and mean of y = 17. [Ans.]

(ii) The second line is the line of x on y, so at y = 7:

x = (9/20) × 7 + (107/20) = 170/20 = 8.5 [Ans.]

(iii) r = √(byx · bxy) = √{(4/5)(9/20)} = 0.6, and byx = r(σy/σx), so 4/5 = 0.6 × σy/3 => σy = (4/5)(3/0.6) = 4. [Ans.]

Variance of y = σy² = 4² = 16.
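The three parts of the solution can be checked numerically. This is a minimal sketch in which each line is stored as coefficients (a, b, c) of a·x + b·y + c = 0; that representation and the variable names are choices made for this example.

```python
import math

# Lines written as a*x + b*y + c = 0.
a1, b1, c1 = 4.0, -5.0, 33.0      # 4x - 5y + 33 = 0   (line of y on x)
a2, b2, c2 = 20.0, -9.0, -107.0   # 20x - 9y - 107 = 0 (line of x on y)

# Regression coefficients read off the two lines.
b_yx = -a1 / b1   # y = (4/5)x + 33/5
b_xy = -b2 / a2   # x = (9/20)y + 107/20

# Both lines pass through (mean_x, mean_y): solve the 2x2 system by Cramer's rule.
det = a1 * b2 - a2 * b1
mean_x = (b1 * c2 - b2 * c1) / det
mean_y = (a2 * c1 - a1 * c2) / det

# Estimate x at y = 7 from the line of x on y.
x_at_7 = mean_x + b_xy * (7 - mean_y)

# r has the sign of the regression coefficients (positive here).
r = math.sqrt(b_yx * b_xy)

# b_yx = r * sigma_y / sigma_x, so sigma_y = b_yx * sigma_x / r.
sigma_x = 3.0
sigma_y = b_yx * sigma_x / r
var_y = sigma_y ** 2

print(round(mean_x, 1), round(mean_y, 1))  # 13.0 17.0
print(round(x_at_7, 1))                    # 8.5
print(round(var_y, 1))                     # 16.0
```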