*Bounty: 50*

*Bounty: 50*

Say that I want to estimate the mean of a function $f$, $mathbb{E}[f(X)]$, given some input distribution $xsim P(x)$. I don’t know anython about the form of $f$ except that it is smooth and continuous. I can do this with a conventional Monte Carlo estimation:

$mathbb{E}[f(X)] approx frac{1}{N}sum_{i=1}^N f(x_i)$.

The control variate technique can be used to decrease that variance of the estimate:

$mathbb{E}[f(X)] approx frac{1}{N} sum_{i=1}^N left( f(x_i) + alphaleft(hat{f}(x_i) – mathbb{E}[hat{f}(x)]right)right)$.

The varariance of the estiamte is minimized when $alpha=-mathrm{CORR}left(f(x), hat{f}(x)right)sqrt{frac{mathrm{VAR}(f(x))}{mathrm{VAR}(hat{f}(x))}}$,

where $mathrm{CORR}$ is the Peason correlation coefficient and $mathrm{VAR}$ is the variance operator.

It would be nice to use a data fit model as my $hat{f}$ function. The problem is that it is not clear to me if it is possible to accurately or reliably estimate the Pearson correlation between the truth, $f(x)$, and my data fit model, $hat{f}(x)$.

To reiterate, the goal is to estimate the Pearson correlation between the truth and the surrogate, $mathrm{COR}left(f(x), hat{f}(x)right)$, given the probability density $P(x)$, so that I can confidently decrease the variance of the Monte Carlo approximation of $mathbb{E}[f(x)]$ using the control variate technique with a data-fit model, using a small number of observed $f$, which were observed with random samples of $P(x)$. The problem is that my surrogate will usually have a correlation close to 1 at the observed points, since these are the points used to make the surrogate, and the correlation will be lower at interpolated points. If my surrogate is exceptionally poor, I could increase the variance of my Monte Carlo estimator. So I need to design the correlation estimation procedure to not over-estimate the correlation. Maybe there is some trade-off between the expected variance reduction and the variance of the variance reduction? Ultimately I am looking for a method to best estimate the correlation and decrease the variance of the Monte Carlo estimator.

The only approach I can think of is something like cross-validation. I can leave one or more observations out, construct a data-fit model with the remaining observations, then compute the correlation between the observed points I left out and the values predicted by the data-fit model. But, cross validation does not seem like a reliable solution to me—overestimation of the correlation could lead to an increase in the variance of the control variate Monte Carlo estimator.

**Is there a reliable approach, cross validation or otherwise, for estimating the correlation between a data-fit model and observed data in this setting?**

Everything below this line is to illustrate my question. Feel free to skip this part.

**Illustrative Example**

Here, I implemented a leave-one-out cross-validation approach to estimating the correlation, assuming that the input $x$ is one-dimensional and uniformly distributed from 0 to 10 and $f$ has a smooth and hard to predict form. I used a one-dimensional input to illustrate my point here, but my real problem has a much larger input dimension.

I used a Gaussian Process as the surrogate, though the important aspect of that is that it interpolates across the observed data. For each point in my observations, I left one out and fit the Gaussian Process to the remaining data points, then recorded the surrogate’s prediction and the truth value. I decided not to leave the end points out so that I was only using the Gaussian Process for interpolation. I used these pairs of truth and predicted values to estimate the correlation between the function and the surrogate.

This was just my take on the problem. Ultimately, I’d like to experiment with other cross-validating approaches. I’m not sure what other approaches I could try besides cross-validation. I’m not sure if cross-validation is even a valid approach to this problem.

Coded implementation

```
import numpy as np
from matplotlib import pyplot as plt
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as C
np.random.seed(5)
# this is the truth. In my real problem, there is no known functional form and I have limited observations.
def f(x):
return x * np.sin(x) - np.cos(5 * x) * np.sqrt(x)
# ----------------------------------------------------------------------
# assume 20 observed inputs that are uniformly distributed
XL = 0
XU = 10
x_observed = np.atleast_2d(sorted(np.random.uniform(XL, XU, 20)))
# record of predictions
y_pred = []
# Instanciate kernel
kernel = C(1.0, (1e-3, 1e3)) * RBF(10, (1e-2, 1e2))
# estimate correlation between surrogate and truth
# make the surrogate but leave one out at a time. Always use the edge points.
for ii in range(1, x_observed.shape[1]-1):
# prepare truth data leaving each one out at a time but never excluding the sides
X = x_observed[:, list(range(0, ii)) + list(range(ii+1, x_observed.shape[1]))].T
y = f(X.T).ravel()
# Fit to data using Maximum Likelihood Estimation of the parameters
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=9)
gp.fit(X, y[:, None])
# Make the prediction over all observations
y_pred.append(gp.predict(x_observed[:, ii:ii+1], return_std=False)[0][0])
# Create surrogate using all observed data points
X = x_observed.T
y = f(X.T).ravel()
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=9)
gp.fit(X, y[:, None])
# create points to estimate true correlation (we are using a uniform distribution)
xs = np.random.uniform(XL, XU, 100000)
# create domain for plotting
xb = np.linspace(XL, XU, 10000)
# estimate correlation
est_corr = np.corrcoef(f(x_observed)[0, 1:-1], y_pred)[0][1]
# calculate tru correlation
true_corr = np.corrcoef(f(xs), gp.predict(np.atleast_2d(xs).T)[:, 0])[0][1]
# estimate mean with monte carlo
hf = np.mean(f(x_observed)[0, 1:-1])
# estimate mean with GP using small number of samples
gphf = np.mean(gp.predict(np.atleast_2d(x_observed).T)[:, 0])
# estimate mean with GP using large number of samples
gpmean = np.mean(gp.predict(np.atleast_2d(xs).T)[:, 0])
# estimate the control variate value
alpha_est = -1 * np.corrcoef(f(x_observed)[0, 1:-1], y_pred)[0][1] * np.std(f(x_observed)[0, 1:-1]) / np.std(gp.predict(np.atleast_2d(x_observed).T)[:, 0])
# calculate the true control variate value
alpha = -1 * np.corrcoef(f(xs), gp.predict(np.atleast_2d(xs).T)[:, -1])[0][1] * np.std(f(xs)) / np.std(gp.predict(np.atleast_2d(xs).T)[:, 0])
print("I estimate the correlation of your surrogate is %04.3f" % est_corr)
print("True correlation: %04.3f" % true_corr)
print("True mean %f"%np.mean(f(xs)))
print("Monte Carlo Estimated mean %f"%hf)
print("Conntrol Variate Mean with estimated Correlation %f"% ( hf + alpha_est * (gphf - gpmean) ))
print("Conntrol Variate Mean with actual Correlation %f"% ( hf + alpha * (gphf - gpmean) ))
print("-------")
print("Approach | Error")
print("CV (estimated corr) | %f"%np.abs(np.mean(f(xs)) - (hf + alpha_est * (gphf - gpmean) )))
print("CV (actual corr) | %f"%np.abs(np.mean(f(xs)) - (hf + alpha * (gphf - gpmean) )))
print("Monte Carlo | %f"%np.abs(np.mean(f(xs)) - (hf)))
# Plot the truth, the surrogate, and the observed points
plt.plot(xb, f(xb), label='$f(x)$')
plt.plot(x_observed[0, :], f(x_observed[0, :]), 'r.', markersize=10, label=u'Observed $f(x)$')
plt.plot(xb, gp.predict(np.atleast_2d(xb).T), label=r'$widehat{f}(x)$')
plt.xlabel('$x$')
plt.legend(loc='upper left')
plt.savefig('fvsfhat')
plt.clf()
# plot the points used to estimate the correlation and the points used to calculate the true correlation
plt.scatter(f(xs), gp.predict(np.atleast_2d(xs).T), c='k', marker='o', s=1, label='Data used to calculate actual corerelation.')
plt.scatter((x_observed)[0, 1:-1], y_pred, c='purple', marker='*', label='Data used to calculate the correlation estimate.nThese were obtained using leave-one-out "cross-validation."')
plt.xlabel(r'$f(x)$')
plt.ylabel(r'$widehat{f}(x)$')
plt.legend()
plt.savefig('correlationData')
plt.clf()
```

Output:

I estimate the correlation of your surrogate is 0.962

True correlation: 0.856

True mean 0.823682

Monte Carlo Estimated mean 0.226071

Conntrol Variate Mean with estimated Correlation 0.698964

Conntrol Variate Mean with actual Correlation 0.720262

Approach | Error

CV (estimated corr) | 0.124718

CV (actual corr) | 0.103420

Monte Carlo | 0.597611

(In the above output, both control variate estimates were more accurate than the conventional Monte Carlo estimator. This trend should hold for most random number generator seeds.)

I want to accurately estimate the correlation between my truth and surrogate given the prescribed input distribution. In the above example, I overestimated the correlation (my estimate, 0.96, was much greater than the actual correlation, which I calculated as 0.86) by leaving one observation out at a time, recording the surrogate’s prediction of the out point, then computing the correlation between the predicted points and the truth values.

I compared the accuracy of the control variate Monte Carlo method using the leave-one-out cross-validation estimate of the correlation to a scenario where I somehow knew a very accurate estimate of the correlation, even only with a small sample size available. Here, I used 6 samples to estimate the mean value. The plot compares the absolute errors associated with the control variate Monte Carlo approaches assuming that I have an accurate estimation of the correlation against if I use the leave-one-out cross-validation estimate of the correlation.

```
import numpy as np
from matplotlib import pyplot as plt
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as C
np.random.seed(5)
# this is the truth. In my real problem, there is no known functional form and I have limited observations.
def f(x):
return x * np.sin(x) - np.cos(5 * x) * np.sqrt(x)
# ----------------------------------------------------------------------
# assume observed inputs that are uniformly distributed
XL = 0
XU = 10
NSAMPS = 6
def experiment():
# make observations
x_observed = np.atleast_2d(sorted(np.random.uniform(XL, XU, NSAMPS)))
# record of predictions
y_pred = []
# Instanciate kernel
kernel = C(1.0, (1e-3, 1e3)) * RBF(10, (1e-2, 1e2))
# estimate correlation between surrogate and truth
# make the surrogate but leave one out at a time. Always use the edge points.
for ii in range(1, x_observed.shape[1]-1):
# prepare truth data leaving each one out at a time but never excluding the sides
X = x_observed[:, list(range(0, ii)) + list(range(ii+1, x_observed.shape[1]))].T
y = f(X.T).ravel()
# Fit to data using Maximum Likelihood Estimation of the parameters
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=9)
gp.fit(X, y[:, None])
# Make the prediction over all observations
y_pred.append(gp.predict(x_observed[:, ii:ii+1], return_std=False)[0][0])
# Create surrogate using all observed data points
X = x_observed.T
y = f(X.T).ravel()
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=9)
gp.fit(X, y[:, None])
# create points to estimate true correlation (we are using a uniform distribution)
xs = np.random.uniform(XL, XU, 100000)
# create domain for plotting
xb = np.linspace(XL, XU, 10000)
# estimate correlation
est_corr = np.corrcoef(f(x_observed)[0, 1:-1], y_pred)[0][1]
# calculate tru correlation
true_corr = np.corrcoef(f(xs), gp.predict(np.atleast_2d(xs).T)[:, 0])[0][1]
# estimate mean with monte carlo
hf = np.mean(f(x_observed)[0, 1:-1])
# estimate mean with GP using small number of samples
gphf = np.mean(gp.predict(np.atleast_2d(x_observed).T)[:, 0])
# estimate mean with GP using large number of samples
gpmean = np.mean(gp.predict(np.atleast_2d(xs).T)[:, 0])
# estimate the control variate value
alpha_est = -1 * np.corrcoef(f(x_observed)[0, 1:-1], y_pred)[0][1] * np.std(f(x_observed)[0, 1:-1]) / np.std(gp.predict(np.atleast_2d(x_observed).T)[:, 0])
# calculate the true control variate value
alpha = -1 * np.corrcoef(f(xs), gp.predict(np.atleast_2d(xs).T)[:, -1])[0][1] * np.std(f(xs)) / np.std(gp.predict(np.atleast_2d(xs).T)[:, 0])
return( ( hf + alpha_est * (gphf - gpmean) ), ( hf + alpha * (gphf - gpmean) ))
exps = [experiment() for _ in range(30)]
x, y = zip(*exps)
plt.scatter(np.abs(x), np.abs(y))
plt.title("Estimated Monte Carlo Estimate Errors.nThe Estimated means used %i samples."%NSAMPS)
plt.xlabel('Absolute Error, Leave-One-Out Approach')
plt.ylabel('Absolute Error, Using Correlation Esimated With Very Large N')
plt.gca().set_aspect('equal', adjustable='box')
plt.draw()
plt.savefig('errorPlot')
plt.clf()
```