#StackBounty: #maximum-likelihood #inference #loss-functions #decision-theory #risk Model fitting vs minimizing expected risk

Bounty: 50

I’m confused about the mechanics of model fitting vs minimizing risk in decision theory. There are numerous resources online, but I can’t seem to find a straight answer to what I’m confused about.

Model fitting (via e.g. maximum log-likelihood):

Suppose I have some data pairs $\{(x_1, y_1), \dots, (x_N, y_N)\}$ and I want to come up with a parametric probability density modelling the target $y$ given $x$: $$p(y|x; \theta)$$

which I use to estimate the true conditional distribution of the data, say $p_\text{true}(y|x)$. I can do so via some procedure, e.g. maximizing the log-likelihood:

$$\max_\theta \sum_i \log p(y_i | x_i; \theta)$$

Then, on future unseen $x$, we can give e.g. confidence intervals for the corresponding $y$ given $x$, or just report $y_\text{guess} = y_\text{mode} = \arg\max_y p(y|x; \theta)$. Both $y$ and $x$ can be continuous and/or discrete.

Decision theory:

A problem arises when we want a point estimate of $y$ and the cost associated with each estimate is not captured purely by which value is most frequent or expected, i.e. for a particular application we need to do better than picking the mode $$y_\text{guess} = \arg\max_y p(y|x;\theta)$$ or the expected value $$\mu_y = \mathbb{E}_{p(y|x;\theta)}[Y|x].$$

So suppose I fit a model using maximum likelihood, and then I want to make point predictions. Since I must pick a single point, I can predict the point which minimizes the expected cost; I choose the $y_\text{guess}$ with the lowest average cost over all $y'$:

$$\begin{aligned} y_\text{guess} &= \operatorname*{arg\,min}_{y}\int_{y'} L(y, y')\, p(y'|x; \theta)\, dy' \\ &= \operatorname*{arg\,min}_{y}\, \mathbb{E}_{p(y'|x;\theta)}\Big[L(y, y')\Big] \end{aligned}$$

This is the extent of my understanding of decision theory. It’s a step taken after one has fit a model, when the model gives an entire distribution over $y$ and one has a loss function $L(y, y')$ but needs a point estimate $y_\text{guess}$.


  • If the loss $L(y_\text{guess}, y')$ is what we actually care about minimizing, then why not use the following fitting procedure instead of maximum likelihood:

$$\min_{\theta} \sum_i \int_{y'} L(y_i, y')\, p(y'|x_i; \theta)\, dy'$$

that is, minimize the expected loss under the parametric model $p(y|x; \theta)$? My current understanding is that this approach is called "expected risk minimization" and is sometimes done in practice, but the parametric model would then lose its interpretation as an approximation to the true distribution $p_\text{true}(y|x)$. Is my understanding correct? Are there any problems with doing this?
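As a concrete illustration of why the post-fitting decision step matters, here is a minimal sketch. It assumes a fitted Gaussian $p(y|x;\theta)$ at some fixed $x$ and a hypothetical asymmetric loss (both made up for illustration), and shows that the expected-loss minimizer differs from the mode/mean:

```python
import numpy as np

rng = np.random.default_rng(0)

# Suppose the fitted model at a particular x is p(y|x; theta) = N(0, 1),
# so mode = mean = 0
samples = rng.normal(0.0, 1.0, size=200_000)  # draws from the fitted model

# Hypothetical asymmetric loss: under-predicting costs 9x more than over-predicting
def loss(y_guess, y):
    return np.where(y > y_guess, 9.0 * (y - y_guess), y_guess - y)

# Monte Carlo estimate of the expected loss on a grid, then take the arg-min
grid = np.linspace(-3, 3, 601)
expected_loss = np.array([loss(g, samples).mean() for g in grid])
y_guess = grid[np.argmin(expected_loss)]
```

For this pinball-type loss the minimizer is (up to Monte Carlo error) the $9/10$ quantile of the predictive distribution, roughly $1.28$ for a standard normal, rather than the mode at $0$.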

Get this bounty!!!

#StackBounty: #neural-networks #maximum-likelihood #regularization #kullback-leibler Label smoothing and KL divergence

Bounty: 50

I am reading the paper Regularizing Neural Networks by Penalizing Confident Output Distributions, where the authors introduce label smoothing in section 3.2. For a neural network that produces a conditional distribution $p_\theta(y|x)$ over classes $y$ given an input $x$ through a softmax function, the label smoothing loss function is defined as:

$$\mathcal{L}(\theta) = -\sum \log p_\theta(y|x) - D_{\mathrm{KL}}(u \| p_\theta(y|x))$$

where $D_{\mathrm{KL}}$ refers to the KL divergence and $u$ to the uniform distribution. However, my understanding is that minimising this expression would in fact attempt to maximise the KL divergence, and since this is a measure of the dissimilarity between the posterior distribution and the uniform distribution, it would encourage the opposite of smoothing. Where is my understanding falling down here?

Additional investigation

Trying to get to the bottom of this I noticed a few things. In the next line of the paper the authors mention that

By reversing the direction of the KL divergence, $D_{\mathrm{KL}}(p_\theta(y|x) \| u)$, we recover the confidence penalty.

Where, for entropy function $H$ and constant $\beta$, the confidence penalty is defined as
$$\mathcal{L}(\theta) = -\sum \log p_\theta(y|x) - \beta H(p_\theta(y|x))$$

However when I do the derivation myself I obtain

$$\mathcal{L}(\theta) = -\sum \log p_\theta(y|x) + H(p_\theta(y|x))$$

Since the experiments all use positive-valued $\beta$’s, this suggests to me that perhaps the original equation is in fact a typo and should add the KL divergence rather than subtract it.

I have checked all the versions of the paper I could find online and the original label smoothing equation always subtracts the KL divergence.
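A quick numerical check of the identity underlying the "reversed direction" remark: for $K$ classes, $D_{\mathrm{KL}}(p_\theta \| u) = \log K - H(p_\theta)$, so adding $\beta D_{\mathrm{KL}}(p_\theta \| u)$ to the negative log-likelihood equals subtracting $\beta H(p_\theta)$ up to the constant $\beta \log K$, i.e. it reproduces the confidence penalty. The probability vector below is an arbitrary made-up softmax output:

```python
import numpy as np

def entropy(p):
    return -np.sum(p * np.log(p))

def kl(p, q):
    return np.sum(p * np.log(p / q))

K = 5
u = np.full(K, 1.0 / K)                      # uniform distribution over K classes
p = np.array([0.6, 0.2, 0.1, 0.05, 0.05])    # made-up softmax output

# D_KL(p || u) = log K - H(p): penalizing this KL is the same, up to the
# constant log K, as penalizing -H(p), i.e. rewarding high entropy
assert np.isclose(kl(p, u), np.log(K) - entropy(p))
```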


#StackBounty: #maximum-likelihood #least-squares #covariance #uncertainty #hessian Parameter uncertainty in least squares optimization…

Bounty: 50

Given a least squares optimization problem of the form:

$$C(\lambda) = \sum_i \|y_i - f(x_i, \lambda)\|^2$$

I have found in multiple questions/answers (e.g. here) that an estimate for the covariance of the parameters can be computed from the inverse rescaled Hessian at the minimum point:

$$\mathrm{cov}(\hat\lambda) = \hat{H}^{-1} \hat\sigma_r^2 = \hat{H}^{-1}\, \frac{\sum_i \|y_i - f(x_i, \hat\lambda)\|^2}{N_{\mathrm{DOF}}}$$

While I understand why the covariance is related to the inverse Hessian (Fisher information), I haven’t found a derivation or explanation anywhere for the $\hat\sigma_r^2$ term, although it seems reasonable to me on intuitive grounds.

Could anybody explain the need for the rescaling by the residual variance and/or provide a reference?
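A small self-contained sketch of the recipe (a linear model for simplicity, so the Jacobian is exact; all names and values are illustrative): the covariance estimate is the inverse Gauss–Newton Hessian scaled by the residual variance $\hat\sigma_r^2 = \mathrm{RSS}/N_{\mathrm{DOF}}$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from y = 2*x + 1 + noise; f(x, lambda) = lambda_0*x + lambda_1
x = np.linspace(0.0, 1.0, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=x.size)

# For a linear f the Jacobian of the residuals w.r.t. lambda is constant
J = np.column_stack([x, np.ones_like(x)])

# Least-squares estimate of lambda
lam_hat, *_ = np.linalg.lstsq(J, y, rcond=None)

# Residual variance: RSS divided by the degrees of freedom N - p
resid = y - J @ lam_hat
sigma_r2 = resid @ resid / (x.size - J.shape[1])

# Covariance estimate: sigma_r^2 * (J^T J)^{-1}, where J^T J is the
# Gauss-Newton approximation of the Hessian of C at the minimum
cov = sigma_r2 * np.linalg.inv(J.T @ J)
```

Without the $\hat\sigma_r^2$ rescaling, $(J^\top J)^{-1}$ alone would implicitly assume unit-variance observation noise; the rescaling plugs in the estimated noise level.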


#StackBounty: #maximum-likelihood #inference #fisher-information Connection between Fisher information and variance of score function

Bounty: 100

The Fisher information’s connection with the negative expected Hessian at $\theta_{\mathrm{MLE}}$ provides insight in the following way: at the MLE, high curvature implies that an estimate of $\theta$ even slightly different from the true MLE would have resulted in a very different likelihood.
$$\mathbf{I}(\theta) = -\frac{\partial^{2}}{\partial\theta_{i}\,\partial\theta_{j}}\, l(\theta), \qquad 1 \leq i, j \leq p$$

This is good, as that means that we can be relatively sure about our estimate.

The other connection of Fisher information to variance of the score, when evaluated at the MLE is less clear to me.
$$I(\theta) = E\left[\left(\frac{\partial}{\partial\theta} l(\theta)\right)^{2}\right]$$

The implication is: high Fisher information $\Rightarrow$ high variance of the score function at the MLE.

Intuitively, this means that the score function is highly sensitive to the sampling of the data, i.e. we are likely to get a non-zero gradient of the likelihood had we sampled different data. This seems to have a negative implication to me. Don’t we want the score function $= 0$ to be highly robust to different samplings of the data?

Lower Fisher information, on the other hand, would indicate that the score function has low variance at the MLE and mean zero. This implies that regardless of the sample drawn, we will get a gradient of the log-likelihood of zero (which is good!).

What am I missing?
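The two definitions can be checked against each other by simulation. For a single Bernoulli($p$) observation both give $I(p) = 1/(p(1-p))$; a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 0.3
x = rng.binomial(1, p, size=200_000).astype(float)

# Score of one Bernoulli observation, evaluated at the true parameter:
# d/dp log[p^x (1-p)^(1-x)] = x/p - (1-x)/(1-p)
score = x / p - (1.0 - x) / (1.0 - p)

# Negative second derivative of the log-likelihood per observation
neg_hessian = x / p**2 + (1.0 - x) / (1.0 - p)**2

fisher_exact = 1.0 / (p * (1.0 - p))   # both definitions give this value
fisher_from_var = score.var()          # variance-of-score estimate (mean ~ 0)
fisher_from_curv = neg_hessian.mean()  # expected-curvature estimate
```

Note that the score still averages to zero at the true parameter even when its variance (the Fisher information) is large; the variance describes how sharply the per-sample gradients spread around zero, not a failure of the first-order condition.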


#StackBounty: #probability #mathematical-statistics #maximum-likelihood #inference #birthday-paradox Birthday puzzle

Bounty: 50

I need some help with Bayesian statistics likelihoods. Consider the following question. Given a number of persons, each person $p$ knows $n(p)$ other persons – $p$’s neighbourhood $N(p)$. Knowing a person $p' \in N(p)$ may imply (as an assumption) that $p$ knows whether $p'$ was born on the same day as $p$ (neglecting the year). The probability that there are $k$ persons $p' \in N(p)$ born on the same day as $p$ is

$$P_1(X = k \mid n(p) = N) = \binom{N}{k}\alpha^k(1-\alpha)^{N-k}$$

with $\alpha = 1/365$.

Now consider a survey where each person $p$ was asked how many of the persons $p$ knows were born on the same day as $p$. Let the result of the survey (the evidence) be a distribution $P_2(X = a)$ giving the frequency that the answer was $a$.

By which (Bayesian) argument can we tell which distribution $P_3(X = N)$ of neighbourhood sizes is the most probable one to have yielded $P_2$? Here $P_3(X = N)$ gives the probability that a person has a neighbourhood of size $N$.

Before that: Is this question even well-posed and correctly worded? Which tacit assumptions have to be made explicit, possibly?

Edit: Maybe it’s easier to ask and answer what the most probable mean neighbourhood size $\overline{N}$ is, assuming that the sizes are distributed a) normally, b) Poisson, c) scale-free.
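For the Poisson case in the edit there is a shortcut worth noting: a Poisson($\overline{N}$) mixture of Binomial($N, \alpha$) counts is again Poisson, with mean $\overline{N}\alpha$, so the mean survey answer identifies $\overline{N}$ directly. A simulation sketch (the assumed mean size of 150 is made up):

```python
import numpy as np

rng = np.random.default_rng(2)
alpha = 1.0 / 365.0

# Simulate the survey assuming Poisson-distributed neighbourhood sizes
nbar_true = 150
sizes = rng.poisson(nbar_true, size=100_000)   # n(p) for each respondent
answers = rng.binomial(sizes, alpha)           # same-birthday counts per person

# Binomial thinning of a Poisson is Poisson: answers ~ Poisson(nbar * alpha),
# so the method-of-moments / ML estimate of nbar is mean(answers) / alpha
nbar_hat = answers.mean() / alpha
```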


#StackBounty: #maximum-likelihood #exponential-distribution #likelihood-ratio Likelihood Ratio for two-sample Exponential distribution …

Bounty: 50

Let $X$ and $Y$ be two independent random variables with respective pdfs:

$$f(x;\theta_i) = \begin{cases} \frac{1}{\theta_i} e^{-x/\theta_i} & 0<x<\infty,\ 0<\theta_i<\infty \\ 0 & \text{elsewhere} \end{cases}$$

for $i=1,2$. Two independent samples, each of size $n$, are drawn in order to test $H_0: \frac{\theta_1}{\theta_2} = k$ against $H_1: \frac{\theta_1}{\theta_2} \neq k$, for a known constant $k$.

$(1)$ Find the likelihood ratio test statistic $\lambda = \lambda(X_1, \dots, X_n, Y_1, \dots, Y_n)$.

$(2)$ Use the large-sample approximation for the null distribution of $-2\log\Lambda$, and the duality of testing and interval estimation for the LRT, with $\alpha = 0.95$ and $n=100$, $\overline{X} = 2$, $\overline{Y} = 1$, to describe a confidence set for $H_0$.

Attempted answer $(1)$: With the help of Likelihood Ratio for two-sample Exponential distribution and $\theta_1 = k\theta_2$, writing $\theta := \theta_1$,

$$L_{H_0} = k^{n}\theta^{-2n}\cdot \exp\left\{-\theta^{-1}\left(\sum x_i + k\sum y_i\right)\right\},$$

and the MLE is

$$\hat\theta_0 = \frac{\sum x_i + k\sum y_i}{2n} = \frac{1}{2}\left(\overline{x} + k\overline{y}\right)$$

So $$L_{H_0}(\hat\theta_0) = k^{n}(\hat\theta_0)^{-2n}\cdot e^{-2n}$$

Under the alternative, the likelihood is

$$L_{H_1} = \theta_1^{-n}\cdot \exp\left\{-\theta_1^{-1}\sum x_i\right\}\cdot \theta_2^{-n}\cdot \exp\left\{-\theta_2^{-1}\sum y_i\right\}$$

and the MLEs are

$$\hat\theta_1 = \frac{\sum x_i}{n} = \bar{x}, \qquad \hat\theta_2 = \frac{\sum y_i}{n} = \bar{y}$$

$$L_{H_1}(\hat\theta_1, \hat\theta_2) = (\hat\theta_1)^{-n}(\hat\theta_2)^{-n}\cdot e^{-2n}$$

To get the LRT:

$$\frac{L_{H_0}(\hat\theta_0)}{L_{H_1}(\hat\theta_1, \hat\theta_2)} = \frac{k^{n}\cdot(\hat\theta_0)^{-2n}}{(\hat\theta_1)^{-n}(\hat\theta_2)^{-n}} = \frac{k^{n}\cdot(\hat\theta_1)^{n}(\hat\theta_2)^{n}}{(\hat\theta_0)^{2n}} = \frac{k^{n}\cdot(\overline{x})^{n}(\overline{y})^{n}}{\left(\frac{1}{2}(\overline{x} + k\overline{y})\right)^{2n}} = \frac{k^{n}\cdot 2^{2n}\cdot(\overline{x})^{n}(\overline{y})^{n}}{(\overline{x} + k\overline{y})^{2n}} = \Lambda$$ Is this correct?

Attempted answer $(2)$: With the help of $(1)$: the large-sample approximation says $-2\ln(\Lambda)$ converges in distribution to $\chi^{2}_{\nu}$ with $\nu = 2 - 1 = 1$ degree of freedom (the difference in the number of free parameters).
The test is two-sided in $\theta_1/\theta_2$, but the acceptance region is one-sided in $-2\ln\Lambda$: for $\alpha = 0.95$ the critical value is $\chi^2_{1,\,0.95} = 3.841$. We have $-2\ln\Lambda \le 3.841 \iff -2\ln\left(\dfrac{k^{100}\cdot 2^{200}\cdot 2^{100}}{\left(2+k\right)^{200}}\right) \le 3.841$, which is not easy to solve by hand, and since it’s a textbook problem I think I have a mistake somewhere.
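The confidence set can at least be located numerically. A sketch, treating $-2\ln\Lambda$ (with the expression from part (1), $n=100$, $\overline{x}=2$, $\overline{y}=1$) as a function of $k$ and scanning for the region below the $\chi^2_1$ critical value:

```python
import numpy as np

n, xbar, ybar, crit = 100, 2.0, 1.0, 3.841   # chi^2_1 critical value at 0.95

def neg2_log_lambda(k):
    # -2 log of Lambda = k^n * 2^(2n) * xbar^n * ybar^n / (xbar + k*ybar)^(2n)
    log_lam = (n * np.log(k) + 2 * n * np.log(2.0)
               + n * np.log(xbar * ybar) - 2 * n * np.log(xbar + k * ybar))
    return -2.0 * log_lam

# Scan a grid of k; the confidence set is where -2 log Lambda <= critical value
ks = np.linspace(0.5, 8.0, 20_000)
inside = ks[neg2_log_lambda(ks) <= crit]
conf_set = (inside.min(), inside.max())
```

The statistic is zero at $k = \overline{x}/\overline{y} = 2$ (the unconstrained MLE of the ratio), and the confidence set is an interval around it, roughly $(1.5, 2.6)$ for these numbers.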


#StackBounty: #machine-learning #probability #maximum-likelihood #kullback-leibler #variational-bayes Why use KL-Divergence as loss ove…

Bounty: 50

I have come across this statement several times now

Maximizing likelihood is equivalent to minimizing KL-Divergence

(Sources: Kullback–Leibler divergence and Maximum likelihood as minimizing the dissimilarity between the empirical distribution and the model distribution)

I would like to know: in applications such as VAEs, why use KL divergence rather than MLE?
In which applications would you choose one over the other? And is there any specific reason for it, given both are equivalent?
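The equivalence itself is a two-line identity: the KL divergence from the empirical distribution $\hat{p}$ to the model $q_\theta$ equals the expected negative log-likelihood minus the entropy of $\hat{p}$, which is constant in $\theta$. A quick numerical check with made-up distributions:

```python
import numpy as np

p_emp = np.array([0.5, 0.3, 0.2])   # made-up empirical class frequencies
q = np.array([0.4, 0.4, 0.2])       # made-up model distribution

kl = np.sum(p_emp * np.log(p_emp / q))       # D_KL(p_emp || q)
exp_nll = -np.sum(p_emp * np.log(q))         # expected negative log-likelihood
entropy = -np.sum(p_emp * np.log(p_emp))     # constant in the model parameters

# D_KL(p_emp || q) = expected NLL - H(p_emp): minimizing either over the
# model gives the same optimum, since the entropy term is model-independent
assert np.isclose(kl, exp_nll - entropy)
```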


#StackBounty: #maximum-likelihood #expectation-maximization #latent-variable Consistency of a hard EM type approach to dealing with lat…

Bounty: 50

Suppose we observe a sample from the $d$-dimensional random vectors $Y_{1,t},\dots,Y_{N,t}$ for $t=1,\dots,T$.

Suppose further that the data generating process (DGP) is the following, for $t=1,\dots,T$:
$$Y_{1,t}=f_1(Z_t,\theta)$$
$$Y_{2,t}=f_2(Z_t,\theta)+\varepsilon_{2,t},\hspace{1cm}\varepsilon_{2,t}\sim\text{N}(0,\sigma_2 I_{d\times d})$$
$$Y_{3,t}=f_3(Z_t,\theta)+\varepsilon_{3,t},\hspace{1cm}\varepsilon_{3,t}\sim\text{N}(0,\sigma_3 I_{d\times d})$$
$$\vdots$$
$$Y_{N,t}=f_N(Z_t,\theta)+\varepsilon_{N,t},\hspace{1cm}\varepsilon_{N,t}\sim\text{N}(0,\sigma_N I_{d\times d})$$
$$Z_t\sim G(\theta)$$
where $G$ is some distribution, and where $Z_t,\varepsilon_{2,t},\varepsilon_{3,t},\dots,\varepsilon_{N,t}$ are independent both within and across time periods.

Suppose finally that all of the functions $f_1,f_2,\dots,f_N$ are bijections.

We would like to obtain a consistent estimate of $\theta,\sigma_2,\dots,\sigma_N$. (Consistent as $T\rightarrow\infty$, with $N,d$ fixed.)

One approach is the following:

Given a guess of $\theta,\sigma_2,\dots,\sigma_N$, and observations $y_{1,t},\dots,y_{N,t}$ for $t=1,\dots,T$, we can solve the non-linear system $y_{1,t}=f_1(z_t,\theta)$ to find $z_1,\dots,z_T$. We can then calculate the log-likelihood as the sum of Gaussian terms coming from the equations for $y_{2,t},\dots,y_{N,t}$, plus a term coming from the likelihood of $z_t$, plus a final Jacobian term coming from the non-linear transformation from $y_{1,t}$ to $z_t$. Using this, we can choose $\theta,\sigma_2,\dots,\sigma_N$ to maximize the likelihood.

This approach is computationally expensive, and numerically challenging due to inaccuracies in the inner non-linear equation solve, which mess up numerical derivative calculations. (Assume that analytic derivatives are impossible here.)

An alternative approach, inspired by the Hard EM algorithm and the particular structure of the model would be the following:

Rather than estimating the actual model, we instead "pretend" that the DGP for $Y_{1,t}$ was actually:
$$Y_{1,t}=f_1(Z_t,\theta)+\varepsilon_{1,t},\hspace{1cm}\varepsilon_{1,t}\sim\text{N}(0,\sigma_1 I_{d\times d})$$
The DGP for $Y_{2,t},\dots,Y_{N,t},Z_t$ stays the same. This new model is even harder to estimate exactly, as we would have to integrate out the latent variables: the true period-$t$ likelihood of this model is of the form
$$\int \left[\prod_{i=1}^{N}\phi\!\left(y_{i,t};f_i(z,\theta),\sigma_i I_{d\times d}\right)\right] dG(z;\theta),$$
where $\phi(\cdot;\mu,\Sigma)$ denotes the multivariate normal density. But we can instead do the Hard EM cheat, and choose $\theta,\sigma_1,\dots,\sigma_N$ AND $z_1,\dots,z_T$ to maximize
$$\sum_{t=1}^{T}\left[\sum_{i=1}^{N}\log\phi\!\left(y_{i,t};f_i(z_t,\theta),\sigma_i I_{d\times d}\right)+\log g(z_t;\theta)\right],$$
where $g$ is the density of $G$. This is computationally easier than the first approach, at least in my specific set-up.

The question is whether this delivers a consistent estimate of $\theta,\sigma_2,\dots,\sigma_N$ as $T\rightarrow\infty$.

My hope is that it does, for the following reason. If the true DGP is really as stated, then the likelihood can be made infinite by sending $\sigma_1\rightarrow 0$. At this point, the optimal $z_1,\dots,z_T$ will exactly equal the result of solving $y_{1,t}=f_1(z_t,\theta)$ for $z_t$. So it looks like the new model nests the original one. (In practice, inevitable misspecification will ensure that $\sigma_1>0$, but this is not a particularly bad thing as long as the estimator would be consistent were the model correctly specified.)

Is this argument correct? I cannot see how the Jacobian term from the first approach would appear under the second approach, which makes me worry. Plus, optimising over the values of latent variables always seems fundamentally dodgy!
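For what it's worth, the Hard EM alternation can be prototyped on a toy instance. The sketch below uses hypothetical bijections $f_1(z,\theta)=\theta+z$ and $f_2(z,\theta)=2\theta+z$ with $d=1$, $N=2$, pins $\sigma_1$ at a small value to sidestep the $\sigma_1\rightarrow 0$ degeneracy discussed above, and alternates closed-form maximization over the $z_t$ with maximization over $\theta,\sigma_2$. It only illustrates the mechanics; it is not evidence of consistency in general:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy DGP (d = 1, N = 2): y1 observed without noise, as in the true model
T, theta_true, sigma2_true = 2000, 1.5, 0.3
z_true = rng.normal(size=T)                       # Z_t ~ N(0, 1)
y1 = theta_true + z_true                          # f_1(z, theta) = theta + z
y2 = 2 * theta_true + z_true + rng.normal(scale=sigma2_true, size=T)

sig1 = 0.05                                       # pinned, to avoid degeneracy
theta = np.mean(y2 - y1)                          # method-of-moments start
sig2 = 1.0
for _ in range(100):
    # Maximize over z_t given parameters: the objective is quadratic in z_t
    w = 1 / sig1**2 + 1 / sig2**2 + 1.0           # "+ 1" from the N(0,1) prior
    zhat = ((y1 - theta) / sig1**2 + (y2 - 2 * theta) / sig2**2) / w
    # Maximize over theta given zhat (weighted least squares, closed form)
    theta = np.mean((y1 - zhat) / sig1**2 + 2 * (y2 - zhat) / sig2**2) \
            / (1 / sig1**2 + 4 / sig2**2)
    # Maximize over sigma_2 given theta and zhat, with a floor for stability
    sig2 = max(np.sqrt(np.mean((y2 - 2 * theta - zhat) ** 2)), 1e-3)
```

On this linear-Gaussian toy the alternation recovers $\theta$ and $\sigma_2$ close to their true values, but that says nothing about the missing-Jacobian concern in the non-linear case.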
