#StackBounty: #machine-learning #cross-validation #covariance #monte-carlo #approximation Estimate correlation between data and a data-…

Bounty: 50

Say that I want to estimate the mean of a function $f$, $\mathbb{E}[f(X)]$, given some input distribution $x \sim P(x)$. I don't know anything about the form of $f$ except that it is smooth and continuous. I can do this with conventional Monte Carlo estimation:

$\mathbb{E}[f(X)] \approx \frac{1}{N}\sum_{i=1}^N f(x_i)$.

The control variate technique can be used to decrease the variance of the estimate:

$\mathbb{E}[f(X)] \approx \frac{1}{N} \sum_{i=1}^N \left( f(x_i) + \alpha\left(\hat{f}(x_i) - \mathbb{E}[\hat{f}(x)]\right)\right)$.

The variance of the estimate is minimized when $\alpha = -\mathrm{CORR}\left(f(x), \hat{f}(x)\right)\sqrt{\frac{\mathrm{VAR}(f(x))}{\mathrm{VAR}(\hat{f}(x))}}$,

where $\mathrm{CORR}$ is the Pearson correlation coefficient and $\mathrm{VAR}$ is the variance operator.
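
For intuition, here is a minimal sketch of the estimator in Python, with an analytic surrogate chosen purely for illustration ($\hat{f}(x) = x$, whose mean under $x \sim U(0, 10)$ is 5). Note that estimating $\alpha$ from the same samples, as done here, already hints at the difficulty discussed below:

import numpy as np

rng = np.random.default_rng(0)

f = lambda x: x * np.sin(x)    # stand-in "truth", illustrative only
f_hat = lambda x: x            # toy surrogate with a known mean
f_hat_mean = 5.0               # E[f_hat(X)] for X ~ U(0, 10)

x = rng.uniform(0, 10, 1000)

# optimal alpha from the formula above, estimated from the samples
alpha = -np.corrcoef(f(x), f_hat(x))[0, 1] * np.std(f(x)) / np.std(f_hat(x))

plain_mc = np.mean(f(x))
cv_mc = np.mean(f(x) + alpha * (f_hat(x) - f_hat_mean))
print(plain_mc, cv_mc)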

It would be nice to use a data-fit model as my $\hat{f}$ function. The problem is that, because $\hat{f}$ is fit to the data, it is not clear to me how I can accurately estimate the Pearson correlation between $f(x)$ and $\hat{f}(x)$.

To reiterate, the goal is to estimate the Pearson correlation between the truth and the surrogate, $\mathrm{CORR}\left(f(x), \hat{f}(x)\right)$, given the probability density $P(x)$, so that I can decrease the variance of the Monte Carlo approximation of $\mathbb{E}[f(x)]$. The problem is that my surrogate will usually have a correlation close to 1 at the observed points, since these are the points used to build the surrogate, and the correlation will be lower at interpolated points. So the estimation procedure needs to be designed not to over-estimate the correlation. Ultimately I am looking for the best method to estimate this correlation.

The only approach I can think of is something like cross-validation. I can leave one or more observations out, construct a data-fit model with the remaining observations, then compute the correlation between the observed points I left out and the values predicted by the data-fit model.

Is cross-validation a valid approach to estimating the correlation between a surrogate and observed data? Is there a best practice here?

Everything below this line is to illustrate my question. Feel free to skip this part.


Illustrative Example

Here, I implemented a leave-one-out cross-validation approach to estimating the correlation, assuming that the input $x$ is one-dimensional and uniformly distributed between 0 and 10.

I used a Gaussian process as the surrogate, though the only property that matters here is that it interpolates across the observed data. For each point in my observations, I left that point out, fit the Gaussian process to the remaining data points, and recorded the surrogate's prediction alongside the truth value. I decided not to leave the end points out, so that I was only using the Gaussian process for interpolation. I used these pairs of truth and predicted values to estimate the correlation between the function and the surrogate.

This was just my first take on the problem. Ultimately, I'd like to experiment with other cross-validation approaches (one possibility is sketched below), though I'm not sure what else to try besides cross-validation, or whether cross-validation is even a valid approach to this problem.
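
For example, a K-fold variant of the same idea, using the same GP setup as the full implementation that follows, might look like this sketch (the kernel and the shapes of x and y mirror the code below; this is an untested variation, not part of the original experiment):

import numpy as np
from sklearn.model_selection import KFold
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as C

def kfold_corr(x, y, n_splits=5):
    # x: (n, 1) array of inputs, y: (n,) array of truth values
    kernel = C(1.0, (1e-3, 1e3)) * RBF(10, (1e-2, 1e2))
    truths, preds = [], []
    for train, test in KFold(n_splits, shuffle=True, random_state=0).split(x):
        gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=9)
        gp.fit(x[train], y[train])         # fit on the retained folds
        preds.extend(gp.predict(x[test]))  # predict the held-out fold
        truths.extend(y[test])
    return np.corrcoef(truths, preds)[0, 1]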


Coded implementation

import numpy as np
from matplotlib import pyplot as plt

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as C

np.random.seed(5)

# this is the truth. In my real problem, there is no known functional form and I have limited observations.
def f(x):
    return x * np.sin(x) - np.cos(5 * x) * np.sqrt(x)

# ----------------------------------------------------------------------
#  assume 20 observed inputs that are uniformly distributed
XL = 0
XU = 10
x_observed = np.atleast_2d(sorted(np.random.uniform(XL, XU, 20)))

# record of predictions 
y_pred = []

# Instantiate kernel
kernel = C(1.0, (1e-3, 1e3)) * RBF(10, (1e-2, 1e2))

# estimate correlation between surrogate and truth
#   make the surrogate but leave one out at a time. Always use the edge points.
for ii in range(1, x_observed.shape[1]-1):

    # prepare truth data, leaving one point out at a time but never the end points
    X = x_observed[:, list(range(0, ii)) + list(range(ii+1, x_observed.shape[1]))].T
    y = f(X.T).ravel()

    # Fit to data using Maximum Likelihood Estimation of the parameters
    gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=9)
    gp.fit(X, y[:, None])

    # Record the surrogate's prediction at the single held-out point
    y_pred.append(gp.predict(x_observed[:, ii:ii+1], return_std=False)[0][0])

# Create surrogate using all observed data points
X = x_observed.T
y = f(X.T).ravel()
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=9)
gp.fit(X, y[:, None])

# create points to estimate true correlation (we are using a uniform distribution)
xs = np.random.uniform(XL, XU, 100000)

# create domain for plotting
xb = np.linspace(XL, XU, 10000)

# estimate correlation
est_corr = np.corrcoef(f(x_observed)[0, 1:-1], y_pred)[0][1]

# calculate the true correlation
true_corr = np.corrcoef(f(xs), gp.predict(np.atleast_2d(xs).T)[:, 0])[0][1]

# estimate mean with monte carlo
hf = np.mean(f(x_observed)[0, 1:-1])

# estimate mean with GP using small number of samples
gphf = np.mean(gp.predict(np.atleast_2d(x_observed).T)[:, 0])

# estimate mean with GP using large number of samples
gpmean = np.mean(gp.predict(np.atleast_2d(xs).T)[:, 0])

# estimate the control variate value
alpha_est = -1 * np.corrcoef(f(x_observed)[0, 1:-1], y_pred)[0][1] * np.std(f(x_observed)[0, 1:-1]) / np.std(gp.predict(np.atleast_2d(x_observed).T)[:, 0])

# calculate the true control variate value
alpha = -1 * np.corrcoef(f(xs), gp.predict(np.atleast_2d(xs).T)[:, -1])[0][1] * np.std(f(xs)) / np.std(gp.predict(np.atleast_2d(xs).T)[:, 0])

print("I estimate the correlation of your surrogate is %04.3f" % est_corr)
print("True correlation: %04.3f" % true_corr)
print("True mean %f"%np.mean(f(xs)))
print("Monte Carlo Estimated mean %f"%hf)
print("Conntrol Variate Mean with estimated Correlation %f"% ( hf + alpha_est * (gphf - gpmean) ))
print("Conntrol Variate Mean with actual Correlation %f"% ( hf + alpha * (gphf - gpmean) ))
print("-------")
print("Approach | Error")
print("CV (estimated corr) | %f"%np.abs(np.mean(f(xs)) - (hf + alpha_est * (gphf - gpmean) )))
print("CV (actual corr) | %f"%np.abs(np.mean(f(xs)) - (hf + alpha * (gphf - gpmean) )))
print("Monte Carlo | %f"%np.abs(np.mean(f(xs)) - (hf)))

# Plot the truth, the surrogate, and the observed points
plt.plot(xb, f(xb), label='$f(x)$')
plt.plot(x_observed[0, :], f(x_observed[0, :]), 'r.', markersize=10, label=u'Observed $f(x)$')
plt.plot(xb, gp.predict(np.atleast_2d(xb).T), label=r'$\widehat{f}(x)$')
plt.xlabel('$x$')
plt.legend(loc='upper left')
plt.savefig('fvsfhat')
plt.clf()


# plot the points used to estimate the correlation and the points used to calculate the true correlation
plt.scatter(f(xs), gp.predict(np.atleast_2d(xs).T), c='k', marker='o', s=1, label='Data used to calculate the actual correlation.')
plt.scatter((x_observed)[0, 1:-1], y_pred, c='purple', marker='*', label='Data used to calculate the correlation estimate.\nThese were obtained using leave-one-out "cross-validation."')
plt.xlabel(r'$f(x)$')
plt.ylabel(r'$\widehat{f}(x)$')
plt.legend()
plt.savefig('correlationData')
plt.clf()

Output:

I estimate the correlation of your surrogate is 0.962

True correlation: 0.856

True mean 0.823682

Monte Carlo Estimated mean 0.226071

Control Variate Mean with estimated Correlation 0.698964

Control Variate Mean with actual Correlation 0.720262


Approach | Error

CV (estimated corr) | 0.124718

CV (actual corr) | 0.103420

Monte Carlo | 0.597611

(In the above output, both control variate estimates were more accurate than the conventional Monte Carlo estimator. This trend should hold for most random number generator seeds.)

[Figure: the truth $f(x)$, the surrogate $\widehat{f}(x)$, and the observed points]
[Figure: data used to calculate the true correlation and the estimated correlation]

I want to accurately estimate the correlation between my truth and surrogate given the prescribed input distribution. In the above example, I overestimated the correlation (my estimate, 0.96, was much greater than the actual correlation, which I calculated as 0.86) by leaving one observation out at a time, recording the surrogate's prediction at the held-out point, and then computing the correlation between the predicted values and the truth values.


I compared the accuracy of the control variate Monte Carlo method using the leave-one-out cross-validation estimate of the correlation against a version where I somehow know a very accurate estimate of the correlation, even at small sample sizes. Here, I used 6 samples to estimate the mean value. The plot compares the absolute errors of the two control variate Monte Carlo approaches: one assuming an accurate estimate of the correlation, the other using the leave-one-out cross-validation estimate.

import numpy as np
from matplotlib import pyplot as plt

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as C

np.random.seed(5)

# this is the truth. In my real problem, there is no known functional form and I have limited observations.
def f(x):
    return x * np.sin(x) - np.cos(5 * x) * np.sqrt(x)

# ----------------------------------------------------------------------
#  assume observed inputs that are uniformly distributed
XL = 0
XU = 10
NSAMPS = 6

def experiment():
    # make observations
    x_observed = np.atleast_2d(sorted(np.random.uniform(XL, XU, NSAMPS)))

    # record of predictions 
    y_pred = []

    # Instantiate kernel
    kernel = C(1.0, (1e-3, 1e3)) * RBF(10, (1e-2, 1e2))

    # estimate correlation between surrogate and truth
    #   make the surrogate but leave one out at a time. Always use the edge points.
    for ii in range(1, x_observed.shape[1]-1):

        # prepare truth data, leaving one point out at a time but never the end points
        X = x_observed[:, list(range(0, ii)) + list(range(ii+1, x_observed.shape[1]))].T
        y = f(X.T).ravel()

        # Fit to data using Maximum Likelihood Estimation of the parameters
        gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=9)
        gp.fit(X, y[:, None])

        # Record the surrogate's prediction at the single held-out point
        y_pred.append(gp.predict(x_observed[:, ii:ii+1], return_std=False)[0][0])

    # Create surrogate using all observed data points
    X = x_observed.T
    y = f(X.T).ravel()
    gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=9)
    gp.fit(X, y[:, None])

    # create points to estimate true correlation (we are using a uniform distribution)
    xs = np.random.uniform(XL, XU, 100000)


    # estimate correlation
    est_corr = np.corrcoef(f(x_observed)[0, 1:-1], y_pred)[0][1]

    # calculate the true correlation
    true_corr = np.corrcoef(f(xs), gp.predict(np.atleast_2d(xs).T)[:, 0])[0][1]

    # estimate mean with monte carlo
    hf = np.mean(f(x_observed)[0, 1:-1])

    # estimate mean with GP using small number of samples
    gphf = np.mean(gp.predict(np.atleast_2d(x_observed).T)[:, 0])

    # estimate mean with GP using large number of samples
    gpmean = np.mean(gp.predict(np.atleast_2d(xs).T)[:, 0])

    # estimate the control variate value
    alpha_est = -1 * np.corrcoef(f(x_observed)[0, 1:-1], y_pred)[0][1] * np.std(f(x_observed)[0, 1:-1]) / np.std(gp.predict(np.atleast_2d(x_observed).T)[:, 0])

    # calculate the true control variate value
    alpha = -1 * np.corrcoef(f(xs), gp.predict(np.atleast_2d(xs).T)[:, -1])[0][1] * np.std(f(xs)) / np.std(gp.predict(np.atleast_2d(xs).T)[:, 0])

    # return the absolute errors of the two control variate estimates
    # (estimated-correlation version first, large-N version second)
    true_mean = np.mean(f(xs))
    return (np.abs(true_mean - (hf + alpha_est * (gphf - gpmean))),
            np.abs(true_mean - (hf + alpha * (gphf - gpmean))))

exps = [experiment() for _ in range(30)]
x, y = zip(*exps)
plt.scatter(x, y)
plt.title("Estimated Monte Carlo Estimate Errors.nThe Estimated means used %i samples."%NSAMPS)
plt.xlabel('Absolute Error, Leave-One-Out Approach')
plt.ylabel('Absolute Error, Using Correlation Esimated With Very Large N')
plt.gca().set_aspect('equal', adjustable='box')
plt.draw()
plt.savefig('errorPlot')
plt.clf()

[Figure: absolute error of the leave-one-out approach vs. the large-N correlation estimate over 30 repeated experiments]



#StackBounty: #hypothesis-testing #correlation #covariance #monte-carlo #measurement-error Assessing significance when measurements are…

Bounty: 50

I am trying to estimate a function $f(x)$ at $x=0.1, 0.2, 0.3, 0.4, 0.5$.
I make this estimate through some complex procedure based on a set of independent but noisy data (with known uncertainties). The outcome is a set of correlated estimates.

The null hypothesis of my study is that $f(x) = 0$. I want to assess, based on any/all of these measurements, whether $f(x) \neq 0$ at some/any $x$.

I have run $N$ Monte Carlo simulations of the measurement procedure by repeating the measurement with random realizations of the data (since the uncertainty of the resulting estimates is not known analytically). Now I have $N$ sets of estimates at these five points. I can use the $N$ trials to estimate the mean and variance of each $f(x)$ as well as the covariance between the different measurements.

Taking the mean and standard deviation across all of the $N$ trials, I find that each of the measurements is roughly $f(x) = -0.2 \pm 0.1$. At the usual thresholds of significance, each would be a $2\sigma$ result and therefore not significantly different from 0. However, since all of the measurements are below zero, perhaps the combined result is significant. I do not know how to assess this, given that all of the measurements are correlated.
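
For what it's worth, one common way to fold the correlations into a single test is a Wald-type statistic built from the Monte Carlo covariance estimate; under approximate normality it is roughly $\chi^2$ with 5 degrees of freedom under the null. A sketch, assuming trials is the hypothetical $(N, 5)$ array of Monte Carlo estimates described above (whether the trial covariance or the covariance of the mean is appropriate depends on what each trial represents):

import numpy as np
from scipy import stats

def wald_test(trials):
    # trials: (N, 5) array, one row per Monte Carlo realization,
    # one column per evaluation point x = 0.1, ..., 0.5
    mu = trials.mean(axis=0)              # estimate at each x
    cov = np.cov(trials, rowvar=False)    # 5x5 covariance across trials
    stat = mu @ np.linalg.solve(cov, mu)  # Wald statistic mu' Cov^{-1} mu
    return stat, stats.chi2.sf(stat, df=mu.size)  # approximate p-value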



#StackBounty: #math #3d #computer-vision #covariance #kalman-filter How to Project 3D Covariance Matrix to a given Image Plane (Pose)

Bounty: 100

I have a 3×3 covariance matrix for a 3D point and I want to know the equivalent 2D covariance (for $u, v$ in the image plane), given the image pose $[X_c, Y_c, Z_c, q_0, q_1, q_2, q_3]$.

There's a long (geometric) way: the 3D covariance can be viewed as a 3D ellipsoid, projecting it onto the plane gives a 2D ellipse, and lastly the ellipse can be converted back to a 2D matrix. But this is long.

Any direct way to solve this algebraically would help.

P.S.: any clues or references to a solution (no need for code) will also help, and I will write up an answer with code (in C++).

I also tagged kalman-filter, because I think it's related.
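
One algebraic clue (not a full answer): to first order, a covariance propagates through any map $u = g(p)$ as $\Sigma_u = J \Sigma_p J^\top$, where $J$ is the Jacobian of $g$; this is the same linearization an extended Kalman filter uses, which is presumably why the tag feels related. A sketch in Python for brevity, under an assumed pinhole model with focal lengths fx, fy (the pose enters by first rotating the world-frame covariance into the camera frame with the rotation matrix from the quaternion):

import numpy as np

def project_cov(p_cam, cov_cam, fx, fy):
    # p_cam: 3D point in camera coordinates; cov_cam: its 3x3 covariance,
    # already rotated into the camera frame (cov_cam = R @ cov_world @ R.T)
    x, y, z = p_cam
    # Jacobian of the pinhole projection u = fx*x/z, v = fy*y/z
    J = np.array([[fx / z, 0.0, -fx * x / z**2],
                  [0.0, fy / z, -fy * y / z**2]])
    return J @ cov_cam @ J.T  # 2x2 covariance in (u, v)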



#StackBounty: #correlation #covariance #independence #covariance-matrix Correlation matrix of rows, correcting for correlation matrix o…

Bounty: 50

Given a matrix of observations (rows) × variables (columns), can we compute the correlation matrix of the rows, corrected for the correlation matrix of the columns? The goal would be to avoid inflation in the correlation p-values, since the t-test for the Pearson correlation assumes that the columns/variables are independent.

Intuitively, this could be accomplished with a weighted correlation, where e.g. if a pair of variables are nearly perfectly correlated, they would each be down-weighted by a factor of 2.



#StackBounty: #probability #mathematical-statistics #estimation #multivariate-analysis #covariance Methods to prove that a guess for th…

Bounty: 50

Suppose we are interested in the covariance matrix $\Sigma$ of a few MLE estimators $\hat\theta_1, \hat\theta_2, \cdots, \hat\theta_n$. For each $j$, $\hat\theta_j$ is normally distributed and estimated from data. The data is multivariate normal with known covariance and mean $\vec{0}$.

The problem is, I obtained the covariance matrix $\Sigma$ heuristically because it was impossible to compute directly. Now I want to prove that I have found the correct expression. What are some methods which would prove that I have found the correct covariance matrix?
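
One sanity check, short of a proof, is a simulation study: repeatedly draw datasets from the known data distribution, recompute the MLEs, and compare the empirical covariance of the estimates with the conjectured $\Sigma$. A sketch, with a hypothetical mle() that maps a dataset to the vector of estimates:

import numpy as np

rng = np.random.default_rng(0)

def check_sigma(mle, data_cov, sigma_guess, n_obs=500, n_rep=2000):
    # mle: hypothetical function mapping an (n_obs, dim) dataset to estimates
    dim = data_cov.shape[0]
    thetas = np.array([
        mle(rng.multivariate_normal(np.zeros(dim), data_cov, size=n_obs))
        for _ in range(n_rep)])
    emp = np.cov(thetas, rowvar=False)        # empirical covariance of MLEs
    return np.max(np.abs(emp - sigma_guess))  # largest entrywise discrepancy

The empirical covariance only matches up to Monte Carlo error, so this can falsify a guess but never strictly prove it; agreement that tightens as n_rep grows is usually taken as strong evidence.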



#StackBounty: #covariance #covariance-matrix #graphical-model #graph-theory Given an adjacency matrix, how can we fit a covariance matr…

Bounty: 150

Suppose that I generate a k-regular graph like the following:

game <- sample_k_regular(k, r)
game <- as.matrix(as_adj(game))

Then, based on this adjacency matrix, suppose I replace the $1$'s with a correlation parameter $\rho$ and replace all the $0$'s on the diagonal with $1$. Then this is a covariance matrix for something like a multivariate normal. However, this way of creating a matrix can result in a non-positive-definite matrix. Is there a way to create a covariance matrix structure based on an adjacency matrix, putting a correlation parameter $\rho$ where there are ties and $1$'s on the diagonal for a common variance, without running into the positive definiteness problem? Thanks.
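
One fact that may help: for a k-regular graph, the matrix described here, $\Sigma = I + \rho A$, is strictly diagonally dominant (hence positive definite) whenever $|\rho| < 1/k$, since each row has exactly $k$ off-diagonal entries of size $|\rho|$. More generally, $I + \rho A$ is positive definite exactly when $1 + \rho\lambda > 0$ for every eigenvalue $\lambda$ of $A$, so for $\rho > 0$ the sharp bound is $\rho < 1/|\lambda_{\min}(A)|$. A quick numerical check, in Python with networkx standing in for igraph:

import numpy as np
import networkx as nx

k, n, rho = 3, 20, 0.25                     # |rho| < 1/k, so PD is guaranteed
A = nx.to_numpy_array(nx.random_regular_graph(k, n, seed=1))
Sigma = np.eye(n) + rho * A                 # 1's on the diagonal, rho on ties
print(np.linalg.eigvalsh(Sigma).min() > 0)  # True: all eigenvalues positive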



#StackBounty: #covariance #spatial #kriging Determining covariance of irregularly spaced spatial data

Bounty: 50

I'm comparing the concentration $C$ of a contaminant in the same spatial region at two time points, 2000 and 2010, with sample sizes of $N_{2000} = 51$ and $N_{2010} = 26$ (not all the samples are from the same locations), means of $\mu(C)_{2000} = 47$ and $\mu(C)_{2010} = 27$ (determined by block kriging of all point observations), and variances of $V(C)_{2000} = 89$ and $V(C)_{2010} = 68$ (kriging variance). To determine whether there has been any significant change over the last 10 years, we first need to determine the variance of the change in the area:

$V(\Delta C) = V(C)_{2000} + V(C)_{2010} - V(C)_{2000,2010}$

where $V(\Delta C)$ is the variance of the change over time, and $V(C)_{2000,2010}$ is the covariance between the two temporal samples. Does anyone know how to determine the $V(C)_{2000,2010}$ term in the above equation?



#HackerRank: Correlation and Regression Lines solutions

import numpy as np
import scipy as sp
from scipy.stats import norm

Correlation and Regression Lines – A Quick Recap #1

Here are the test scores of 10 students in physics and history:

Physics Scores 15 12 8 8 7 7 7 6 5 3

History Scores 10 25 17 11 13 17 20 13 9 15

Compute Karl Pearson’s coefficient of correlation between these scores. Compute the answer correct to three decimal places.

Output Format

In the text box, enter the floating point/decimal value required. Do not leave any leading or trailing spaces. Your answer may look like: 0.255

This is NOT the actual answer – just the format in which you should provide your answer.

physicsScores=[15, 12,  8,  8,  7,  7,  7,  6, 5,  3]
historyScores=[10, 25, 17, 11, 13, 17, 20, 13, 9, 15]
print(np.corrcoef(historyScores,physicsScores)[0][1])
0.144998154581

Correlation and Regression Lines – A Quick Recap #2

Here are the test scores of 10 students in physics and history:

Physics Scores 15 12 8 8 7 7 7 6 5 3

History Scores 10 25 17 11 13 17 20 13 9 15

Compute the slope of the line of regression obtained while treating Physics as the independent variable. Compute the answer correct to three decimal places.

Output Format

In the text box, enter the floating point/decimal value required. Do not leave any leading or trailing spaces. Your answer may look like: 0.255

This is NOT the actual answer – just the format in which you should provide your answer.

sp.stats.linregress(physicsScores,historyScores).slope
0.20833333333333331

Correlation and Regression Lines – A Quick Recap #3

Here are the test scores of 10 students in physics and history:

Physics Scores 15 12 8 8 7 7 7 6 5 3

History Scores 10 25 17 11 13 17 20 13 9 15

When a student scores 10 in Physics, what is his probable score in History? Compute the answer correct to one decimal place.

Output Format

In the text box, enter the floating point/decimal value required. Do not leave any leading or trailing spaces. Your answer may look like: 0.255

This is NOT the actual answer – just the format in which you should provide your answer.

def predict(pi, x, y):
    slope, intercept, rvalue, pvalue, stderr = sp.stats.linregress(x, y)
    return slope * pi + intercept

predict(10,physicsScores,historyScores)
15.458333333333332

Correlation and Regression Lines – A Quick Recap #4

The two regression lines of a bivariate distribution are:

4x - 5y + 33 = 0 (line of y on x)

20x - 9y - 107 = 0 (line of x on y).

Estimate the value of x when y = 7. Compute the correct answer to one decimal place.

Output Format

In the text box, enter the floating point/decimal value required. Do not leave any leading or trailing spaces. Your answer may look like: 7.2

This is NOT the actual answer – just the format in which you should provide your answer.


'''
    4x - 5y + 33 = 0
    x = ( 5y - 33 ) / 4
    y = ( 4x + 33 ) / 5
    
    20x - 9y - 107 = 0
    x = (9y + 107)/20
    y = (20x - 107)/9
'''
t=7
print( ( 9 * t + 107 ) / 20 )
8.5

Correlation and Regression Lines – A Quick Recap #5

The two regression lines of a bivariate distribution are:

4x - 5y + 33 = 0 (line of y on x)

20x - 9y - 107 = 0 (line of x on y).

Find the variance of y when σx = 3.

Compute the correct answer to one decimal place.

Output Format

In the text box, enter the floating point/decimal value required. Do not leave any leading or trailing spaces. Your answer may look like: 7.2

This is NOT the actual answer – just the format in which you should provide your answer.

http://www.mpkeshari.com/2011/01/19/lines-of-regression/

Q.3. If the two regression lines of a bivariate distribution are 4x - 5y + 33 = 0 and 20x - 9y - 107 = 0,

  • calculate the arithmetic means of x and y respectively;
  • estimate the value of x when y = 7;
  • find the variance of y when σx = 3.

Solution:

We have,

4x - 5y + 33 = 0 => y = 4x/5 + 33/5 ... (i)

And

20x - 9y - 107 = 0 => x = 9y/20 + 107/20 ... (ii)

(i) Solving (i) and (ii) we get mean of x = 13 and mean of y = 17. [Ans.]

(ii) The second line is the line of x on y:

x = (9/20) × 7 + (107/20) = 170/20 = 8.5 [Ans.]

(iii) byx = r(σy/σx), where r = √(byx · bxy) = √((4/5)(9/20)) = 0.6; so 4/5 = 0.6 × σy/3 => σy = (4/5)(3/0.6) = 4 [Ans.]

Hence the variance is σy² = 4² = 16.
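
The arithmetic can be double-checked in a couple of lines:

b_yx, b_xy, sigma_x = 4 / 5, 9 / 20, 3
r = (b_yx * b_xy) ** 0.5        # 0.6
sigma_y = b_yx * sigma_x / r    # 4.0
print(sigma_y ** 2)             # 16.0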