# #StackBounty: #machine-learning #cross-validation #covariance #monte-carlo #approximation Estimate correlation between data and a data-…

### Bounty: 50

Say that I want to estimate the mean of a function \$f\$, \$mathbb{E}[f(X)]\$, given some input distribution \$xsim P(x)\$. I don’t know anython about the form of \$f\$ except that it is smooth and continuous. I can do this with a conventional Monte Carlo estimation:

\$mathbb{E}[f(X)] approx frac{1}{N}sum_{i=1}^N f(x_i)\$.

The control variate technique can be used to decrease that variance of the estimate:

\$mathbb{E}[f(X)] approx frac{1}{N} sum_{i=1}^N left( f(x_i) + alphaleft(hat{f}(x_i) – mathbb{E}[hat{f}(x)]right)right)\$.

The varariance of the estiamte is minimized when \$alpha=-mathrm{CORR}left(f(x), hat{f}(x)right)sqrt{frac{mathrm{VAR}(f(x))}{mathrm{VAR}(hat{f}(x))}}\$,

where \$mathrm{CORR}\$ is the Peason correlation coefficient and \$mathrm{VAR}\$ is the variance operator.

It would be nice to use a data fit model as my \$hat{f}\$ function. The problem is that \$hat{f}\$ is a data fit model, so it is not clear to me how I can accurately estiamate the Pearson correlation between \$f(x)\$ and \$hat{f}(x)\$.

To reiterate, the goal is to estimate the Pearson correlation between the truth and the surrogate, \$mathrm{COR}left(f(x), hat{f}(x)right)\$, given the probability density \$P(x)\$, so that I can decrease the variance of the Monte Carlo approximation of \$mathbb{E}[f(x)]\$. The problem is that my surrogate will usually have a correlation close to 1 at the observed points, since these are the points used to make the surrogate, and the correlation will be lower at interpolated points. So I need to design the correlation estimation procedure to not over-estimate the correlation. Ultimately I am looking for a method to best estimate the correlation.

The only approach I can think of is something like cross-validation. I can leave one or more observations out, construct a data-fit model with the remaining observations, then compute the correlation between the observed points I left out and the values predicted by the data-fit model.

Is cross validation a valid approach to estimating the correlation between a surrogate and observed data? Is there a best practice here?

Everything below this line is to illustrate my question. Feel free to skip this part.

Illustrative Example

Here, I implemented a leave-one-out cross-validation approach to estimating the correlation, assuming that the input \$x\$ is one-dimensional and uniformly distributed between 0 and 10.

I used a Gaussian Process as the surrogate, though the important aspect of that is that it interpolates across the observed data. For each point in my observations, I left one out and fit the Gaussian Process to the remaining data points, then recorded the surrogate’s prediction and the truth value. I decided not to leave the end points out so that I was only using the Gaussian Process for interpolation. I used these pairs of truth and predicted values to estimate the correlation between the function and the surrogate.

This was just my take on the problem. Ultimately, I’d like to experiment with other cross-validating approaches. I’m not sure what other approaches I could try besides cross-validation. I’m not sure if cross-validation is even a valid approach to this problem.

Coded implementation

``````import numpy as np
from matplotlib import pyplot as plt

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as C

np.random.seed(5)

# this is the truth. In my real problem, there is no known functional form and I have limited observations.
def f(x):
return x * np.sin(x) - np.cos(5 * x) * np.sqrt(x)

# ----------------------------------------------------------------------
#  assume 20 observed inputs that are uniformly distributed
XL = 0
XU = 10
x_observed = np.atleast_2d(sorted(np.random.uniform(XL, XU, 20)))

# record of predictions
y_pred = []

# Instanciate kernel
kernel = C(1.0, (1e-3, 1e3)) * RBF(10, (1e-2, 1e2))

# estimate correlation between surrogate and truth
#   make the surrogate but leave one out at a time. Always use the edge points.
for ii in range(1, x_observed.shape[1]-1):

# prepare truth data leaving each one out at a time but never excluding the sides
X = x_observed[:, list(range(0, ii)) + list(range(ii+1, x_observed.shape[1]))].T
y = f(X.T).ravel()

# Fit to data using Maximum Likelihood Estimation of the parameters
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=9)
gp.fit(X, y[:, None])

# Make the prediction over all observations
y_pred.append(gp.predict(x_observed[:, ii:ii+1], return_std=False)[0][0])

# Create surrogate using all observed data points
X = x_observed.T
y = f(X.T).ravel()
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=9)
gp.fit(X, y[:, None])

# create points to estimate true correlation (we are using a uniform distribution)
xs = np.random.uniform(XL, XU, 100000)

# create domain for plotting
xb = np.linspace(XL, XU, 10000)

# estimate correlation
est_corr = np.corrcoef(f(x_observed)[0, 1:-1], y_pred)[0][1]

# calculate tru correlation
true_corr = np.corrcoef(f(xs), gp.predict(np.atleast_2d(xs).T)[:, 0])[0][1]

# estimate mean with monte carlo
hf = np.mean(f(x_observed)[0, 1:-1])

# estimate mean with GP using small number of samples
gphf = np.mean(gp.predict(np.atleast_2d(x_observed).T)[:, 0])

# estimate mean with GP using large number of samples
gpmean = np.mean(gp.predict(np.atleast_2d(xs).T)[:, 0])

# estimate the control variate value
alpha_est = -1 * np.corrcoef(f(x_observed)[0, 1:-1], y_pred)[0][1] * np.std(f(x_observed)[0, 1:-1]) / np.std(gp.predict(np.atleast_2d(x_observed).T)[:, 0])

# calculate the true control variate value
alpha = -1 * np.corrcoef(f(xs), gp.predict(np.atleast_2d(xs).T)[:, -1])[0][1] * np.std(f(xs)) / np.std(gp.predict(np.atleast_2d(xs).T)[:, 0])

print("I estimate the correlation of your surrogate is %04.3f" % est_corr)
print("True correlation: %04.3f" % true_corr)
print("True mean %f"%np.mean(f(xs)))
print("Monte Carlo Estimated mean %f"%hf)
print("Conntrol Variate Mean with estimated Correlation %f"% ( hf + alpha_est * (gphf - gpmean) ))
print("Conntrol Variate Mean with actual Correlation %f"% ( hf + alpha * (gphf - gpmean) ))
print("-------")
print("Approach | Error")
print("CV (estimated corr) | %f"%np.abs(np.mean(f(xs)) - (hf + alpha_est * (gphf - gpmean) )))
print("CV (actual corr) | %f"%np.abs(np.mean(f(xs)) - (hf + alpha * (gphf - gpmean) )))
print("Monte Carlo | %f"%np.abs(np.mean(f(xs)) - (hf)))

# Plot the truth, the surrogate, and the observed points
plt.plot(xb, f(xb), label='\$f(x)\$')
plt.plot(x_observed[0, :], f(x_observed[0, :]), 'r.', markersize=10, label=u'Observed \$f(x)\$')
plt.plot(xb, gp.predict(np.atleast_2d(xb).T), label=r'\$widehat{f}(x)\$')
plt.xlabel('\$x\$')
plt.legend(loc='upper left')
plt.savefig('fvsfhat')
plt.clf()

# plot the points used to estimate the correlation and the points used to calculate the true correlation
plt.scatter(f(xs), gp.predict(np.atleast_2d(xs).T), c='k', marker='o', s=1, label='Data used to calculate actual corerelation.')
plt.scatter((x_observed)[0, 1:-1], y_pred, c='purple', marker='*', label='Data used to calculate the correlation estimate.nThese were obtained using leave-one-out "cross-validation."')
plt.xlabel(r'\$f(x)\$')
plt.ylabel(r'\$widehat{f}(x)\$')
plt.legend()
plt.savefig('correlationData')
plt.clf()
``````

Output:

I estimate the correlation of your surrogate is 0.962

True correlation: 0.856

True mean 0.823682

Monte Carlo Estimated mean 0.226071

Conntrol Variate Mean with estimated Correlation 0.698964

Conntrol Variate Mean with actual Correlation 0.720262

Approach | Error

CV (estimated corr) | 0.124718

CV (actual corr) | 0.103420

Monte Carlo | 0.597611

(In the above output, both control variate estimates were more accurate than the conventional Monte Carlo estimator. This trend should hold for most random number generator seeds.)

I want to accurately estimate the correlation between my truth and surrogate given the prescribed input distribution. In the above example, I overestimated the correlation (my estimate, 0.96, was much greater than the actual correlation, which I calculated as 0.86) by leaving one observation out at a time, recording the surrogate’s prediction of the out point, then computing the correlation between the predicted points and the truth values.

I compared the accuracy of the control variate Monte Carlo method using the leave-one-out cross-validation estimate of the correlation to if I somehow knew a very accurate estimate of the correlation even for small sample sizes. Here, I used 6 samples to estimate the mean value. The plot compares the absolute errors associated with the control variate Monte Carlo approaches assuming that I have an accurate estimation of the correlation against if I use the leave-one-out cross-validation estimate of the correlation.

``````import numpy as np
from matplotlib import pyplot as plt

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as C

np.random.seed(5)

# this is the truth. In my real problem, there is no known functional form and I have limited observations.
def f(x):
return x * np.sin(x) - np.cos(5 * x) * np.sqrt(x)

# ----------------------------------------------------------------------
#  assume observed inputs that are uniformly distributed
XL = 0
XU = 10
NSAMPS = 6

def experiment():
# make observations
x_observed = np.atleast_2d(sorted(np.random.uniform(XL, XU, NSAMPS)))

# record of predictions
y_pred = []

# Instanciate kernel
kernel = C(1.0, (1e-3, 1e3)) * RBF(10, (1e-2, 1e2))

# estimate correlation between surrogate and truth
#   make the surrogate but leave one out at a time. Always use the edge points.
for ii in range(1, x_observed.shape[1]-1):

# prepare truth data leaving each one out at a time but never excluding the sides
X = x_observed[:, list(range(0, ii)) + list(range(ii+1, x_observed.shape[1]))].T
y = f(X.T).ravel()

# Fit to data using Maximum Likelihood Estimation of the parameters
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=9)
gp.fit(X, y[:, None])

# Make the prediction over all observations
y_pred.append(gp.predict(x_observed[:, ii:ii+1], return_std=False)[0][0])

# Create surrogate using all observed data points
X = x_observed.T
y = f(X.T).ravel()
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=9)
gp.fit(X, y[:, None])

# create points to estimate true correlation (we are using a uniform distribution)
xs = np.random.uniform(XL, XU, 100000)

# create domain for plotting
xb = np.linspace(XL, XU, 10000)

# estimate correlation
est_corr = np.corrcoef(f(x_observed)[0, 1:-1], y_pred)[0][1]

# calculate tru correlation
true_corr = np.corrcoef(f(xs), gp.predict(np.atleast_2d(xs).T)[:, 0])[0][1]

# estimate mean with monte carlo
hf = np.mean(f(x_observed)[0, 1:-1])

# estimate mean with GP using small number of samples
gphf = np.mean(gp.predict(np.atleast_2d(x_observed).T)[:, 0])

# estimate mean with GP using large number of samples
gpmean = np.mean(gp.predict(np.atleast_2d(xs).T)[:, 0])

# estimate the control variate value
alpha_est = -1 * np.corrcoef(f(x_observed)[0, 1:-1], y_pred)[0][1] * np.std(f(x_observed)[0, 1:-1]) / np.std(gp.predict(np.atleast_2d(x_observed).T)[:, 0])

# calculate the true control variate value
alpha = -1 * np.corrcoef(f(xs), gp.predict(np.atleast_2d(xs).T)[:, -1])[0][1] * np.std(f(xs)) / np.std(gp.predict(np.atleast_2d(xs).T)[:, 0])

return( ( hf + alpha_est * (gphf - gpmean) ), ( hf + alpha * (gphf - gpmean) ))

exps = [experiment() for _ in range(30)]
x, y = zip(*exps)
plt.scatter(np.abs(x), np.abs(y))
plt.title("Estimated Monte Carlo Estimate Errors.nThe Estimated means used %i samples."%NSAMPS)
plt.xlabel('Absolute Error, Leave-One-Out Approach')
plt.ylabel('Absolute Error, Using Correlation Esimated With Very Large N')