## #StackBounty: #normal-distribution #bootstrap #central-limit-theorem Distribution of bootstrap and central limit theorem

### Bounty: 50

Let’s take a simple example: we have 100000 observations, and we want to estimate the mean.

In theory, by the central limit theorem, the sampling distribution of the estimator is approximately normal.

We can also use bootstrapping to estimate the distribution of the mean estimator: we resample the data many times with replacement, compute the mean of each resample, and obtain a distribution.

Now, my question is: is the normal distribution a good approximation for the bootstrap distribution?
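To make the comparison concrete, here is a minimal sketch that puts the bootstrap distribution of the mean next to the CLT-based normal approximation. The data are hypothetical (a skewed exponential sample), the sample size is reduced from 100000 to 2000 only to keep the pure-Python run short, and the helper name `bootstrap_means` is an illustration, not something from the question.

```python
import random
import statistics


def bootstrap_means(data, n_resamples, rng):
    """Means of n_resamples resamples of data, each drawn with replacement."""
    n = len(data)
    return [statistics.fmean(rng.choices(data, k=n)) for _ in range(n_resamples)]


rng = random.Random(0)  # fixed seed so the run is reproducible

# Hypothetical skewed data (exponential with mean 1).
data = [rng.expovariate(1.0) for _ in range(2000)]

boot = bootstrap_means(data, 1000, rng)

# CLT-based normal approximation: centre = sample mean, spread = s / sqrt(n).
clt_mean = statistics.fmean(data)
clt_sd = statistics.stdev(data) / len(data) ** 0.5

boot_mean = statistics.fmean(boot)
boot_sd = statistics.stdev(boot)

print(f"CLT:       mean={clt_mean:.4f}  sd={clt_sd:.4f}")
print(f"bootstrap: mean={boot_mean:.4f}  sd={boot_sd:.4f}")
```

If the two centres and spreads agree closely, the normal curve is at least matching the first two moments of the bootstrap distribution; comparing quantiles of `boot` with normal quantiles would be the natural next check for the shape.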


## #StackBounty: #normal-distribution #maximum-likelihood #factor-analysis #canonical-correlation MLE for \$sigma^2\$ in inter-battery fact…

### Bounty: 100

NB: I’ve asked a related question here, but did not get the answer I needed. I’m asking again with more detail in hopes that those details matter.

Inter-battery factor analysis (IBFA) is similar to probabilistic CCA (Bach and Jordan, 2006) except that it explicitly models a shared latent variable $$\mathbf{z}_0$$ and view-specific latent variables $$\mathbf{z}_1$$ and $$\mathbf{z}_2$$:

$$\begin{aligned} \mathbf{z}_0 &\sim \mathcal{N}_{K_0}(\mathbf{0}, \mathbf{I}), \\ \mathbf{z}_1 &\sim \mathcal{N}_{K_1}(\mathbf{0}, \mathbf{I}), \\ \mathbf{z}_2 &\sim \mathcal{N}_{K_2}(\mathbf{0}, \mathbf{I}), \\ \mathbf{x}_1 \mid \mathbf{z}_0, \mathbf{z}_1, \mathbf{z}_2 &\sim \mathcal{N}_{P_1}(\mathbf{W}_1 \mathbf{z}_0 + \mathbf{B}_1 \mathbf{z}_1, \sigma_1^2 \mathbf{I}), \\ \mathbf{x}_2 \mid \mathbf{z}_0, \mathbf{z}_1, \mathbf{z}_2 &\sim \mathcal{N}_{P_2}(\mathbf{W}_2 \mathbf{z}_0 + \mathbf{B}_2 \mathbf{z}_2, \sigma_2^2 \mathbf{I}). \end{aligned} \tag{1}$$

In (Klami and Kaski 2007) (see Table 1 on p. 10), the authors propose EM updates for $$\mathbf{W}_i$$, $$\mathbf{B}_i$$, and $$\sigma_i^2$$ for $$i \in \{1,2\}$$. I can derive all the EM updates except for the EM update for $$\sigma_i^2$$.

To show you what I mean, we can find the optimal update for $$\mathbf{W}_1$$ and $$\mathbf{W}_2$$ by integrating out the view-specific latent variables. For example, to integrate out $$\mathbf{z}_1$$:

$$\int p(\mathbf{x}_1, \mathbf{x}_2, \mathbf{z}_0, \mathbf{z}_1, \mathbf{z}_2) \, d\mathbf{z}_1 = p(\mathbf{x}_2 \mid \mathbf{z}_0, \mathbf{z}_2) p(\mathbf{z}_2) p(\mathbf{z}_0) \int p(\mathbf{x}_1 \mid \mathbf{z}_0, \mathbf{z}_1) p(\mathbf{z}_1) \, d\mathbf{z}_1. \tag{2}$$

Notice that $$\mathbf{z}_0$$ is a constant in the integration. Let $$\tilde{\mathbf{x}}_1 = \mathbf{x}_1 - \mathbf{W}_1 \mathbf{z}_0$$. Then we can easily integrate

$$\int p(\tilde{\mathbf{x}}_1 \mid \mathbf{z}_1) p(\mathbf{z}_1) \, d\mathbf{z}_1 \tag{3}$$

since both densities are Gaussian:

$$\begin{aligned} \tilde{\mathbf{x}}_1 \mid \mathbf{z}_0 &\sim \mathcal{N}_{P_1}(\mathbf{0}, \mathbf{B}_1 \mathbf{B}_1^{\top} + \sigma_1^2 \mathbf{I}), \\ &\Downarrow \\ \mathbf{x}_1 \mid \mathbf{z}_0 &\sim \mathcal{N}_{P_1}(\mathbf{W}_1 \mathbf{z}_0, \mathbf{B}_1 \mathbf{B}_1^{\top} + \sigma_1^2 \mathbf{I}). \end{aligned} \tag{4}$$
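As a sanity check on the marginalisation step, the following sketch simulates the shifted variable and compares its empirical covariance with the form in Eq. (4). Everything numeric here is hypothetical: a toy instance with two observed dimensions, one view-specific latent dimension, and made-up values for the loading column and noise level.

```python
import random

rng = random.Random(1)  # fixed seed for reproducibility

# Hypothetical toy instance: P_1 = 2 observed dims, K_1 = 1 latent dim.
B1 = [1.0, 0.5]   # the single column of B_1
sigma1 = 0.3      # noise standard deviation, so sigma_1^2 = 0.09


def sample_x_tilde():
    """Draw x_tilde = B_1 z_1 + noise, i.e. x_1 - W_1 z_0 with z_0 held fixed."""
    z1 = rng.gauss(0.0, 1.0)
    return [b * z1 + rng.gauss(0.0, sigma1) for b in B1]


n = 50000
draws = [sample_x_tilde() for _ in range(n)]

# Empirical second-moment matrix (the mean is zero by construction).
emp = [[sum(d[i] * d[j] for d in draws) / n for j in range(2)] for i in range(2)]

# Covariance predicted by Eq. (4): B_1 B_1^T + sigma_1^2 I.
theory = [[B1[i] * B1[j] + (sigma1 ** 2 if i == j else 0.0) for j in range(2)]
          for i in range(2)]

print("empirical:", emp)
print("theory:   ", theory)
```

The two matrices should agree up to Monte Carlo error, which is all the check is meant to show; it says nothing about the EM updates themselves.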

Notice that if $$\mathbf{B}_1 \mathbf{B}_1^{\top} + \sigma_1^2 \mathbf{I}$$ were full rank, we could write it as $$\boldsymbol{\Psi}_1$$ as in probabilistic CCA. If we applied this same logic to $$\mathbf{x}_2$$, we would get the same generative model as probabilistic CCA:

$$\begin{aligned} \mathbf{z}_0 &\sim \mathcal{N}_{K_0}(\mathbf{0}, \mathbf{I}), \\ \mathbf{x}_1 \mid \mathbf{z}_0 &\sim \mathcal{N}_{P_1}(\mathbf{W}_1 \mathbf{z}_0, \mathbf{B}_1 \mathbf{B}_1^{\top} + \sigma_1^2 \mathbf{I}), \\ \mathbf{x}_2 \mid \mathbf{z}_0 &\sim \mathcal{N}_{P_2}(\mathbf{W}_2 \mathbf{z}_0, \mathbf{B}_2 \mathbf{B}_2^{\top} + \sigma_2^2 \mathbf{I}). \end{aligned} \tag{5}$$

Thus, the optimal updates for $$\mathbf{W}_1$$ and $$\mathbf{W}_2$$ are found in Section 4.1 ("EM algorithm") in (Bach and Jordan, 2006).

Furthermore, to find the optimal $$\mathbf{B}_1$$ and $$\mathbf{B}_2$$, we can integrate out the shared latent variable. Let $$\hat{\mathbf{x}}_1 = \mathbf{x}_1 - \mathbf{B}_1 \mathbf{z}_1$$, then we apply the same trick as before to get:

$$\mathbf{x}_1 \mid \mathbf{z}_1 \sim \mathcal{N}_{P_1}(\mathbf{B}_1 \mathbf{z}_1, \mathbf{W}_1 \mathbf{W}_1^{\top} + \sigma_1^2 \mathbf{I}). \tag{6}$$

Since we’ve integrated out the dependencies between the two models, we essentially have two probabilistic PCA/factor analysis models. Now the EM update for both $$\mathbf{B}_i$$ is the same as for probabilistic PCA. See equation $$27$$ in (Tipping and Bishop, 1999).

I’ve confirmed that everything so far matches what’s in Klami’s paper.

Question: I don’t know how to derive the EM update for $$\sigma_i^2$$. As I mentioned in my previous post, if the covariance matrix in Eq. $$6$$ were just $$\sigma^2 \mathbf{I}$$, then the MLE would just be what you’d get for probabilistic PCA. However, we have to deal with this term:

$$(\mathbf{WW}^{\top} + \sigma^2 \mathbf{I})^{-1}. \tag{7}$$

I don’t know how to compute the derivative of this term w.r.t. $$\sigma^2$$, or even how to isolate $$\sigma^2$$, since the Woodbury matrix formula will keep $$\sigma^2$$ inside the inverse. In the other post, the accepted answer claims there is no closed-form solution for $$\sigma_i^2$$. I’m hoping that by providing the full modeling problem, someone can see something I have overlooked.
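For reference, the derivative itself follows from standard matrix-calculus identities. The notation $$\mathbf{C}$$, $$\mathbf{U}$$ and $$\boldsymbol{\Lambda}$$ below is introduced only for this note and does not come from the papers cited above. Writing $$\mathbf{C} = \mathbf{WW}^{\top} + \sigma^2 \mathbf{I}$$, so that $$\partial \mathbf{C} / \partial \sigma^2 = \mathbf{I}$$:

$$\begin{aligned} \frac{\partial \mathbf{C}^{-1}}{\partial \sigma^2} &= -\mathbf{C}^{-1} \frac{\partial \mathbf{C}}{\partial \sigma^2} \mathbf{C}^{-1} = -\mathbf{C}^{-2}, \\ \frac{\partial \log |\mathbf{C}|}{\partial \sigma^2} &= \operatorname{tr}\!\left(\mathbf{C}^{-1}\right), \\ \mathbf{WW}^{\top} = \mathbf{U} \boldsymbol{\Lambda} \mathbf{U}^{\top} \;&\Rightarrow\; \mathbf{C}^{-1} = \mathbf{U} (\boldsymbol{\Lambda} + \sigma^2 \mathbf{I})^{-1} \mathbf{U}^{\top}. \end{aligned}$$

The last line (an eigendecomposition with orthogonal $$\mathbf{U}$$) reduces the inverse to scalar terms $$1/(\lambda_j + \sigma^2)$$. This shows how $$\sigma^2$$ enters, but it does not by itself give a closed-form update.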

Klami’s MLE update for $$\sigma_i^2$$ is almost the MLE update for $$\sigma^2$$ in probabilistic PCA (see Eq. $$28$$ in (Tipping and Bishop, 1999)). However, he subtracts $$\mathbf{W} \mathbf{W}^{\top}$$, which suggests to me that he’s somehow transforming Eq. $$6$$ before applying the probabilistic PCA updates.


## #StackBounty: #distributions #logistic #normal-distribution #pdf #model Relationship between a logistic decision function and Gaussian …

### Bounty: 50

Imagine an experiment in which an observer has to discriminate between two stimulus categories at different contrast levels $$|x|$$. As $$|x|$$ becomes lower, the observer will be more prone to making perceptual mistakes. The stimulus category is coded in the sign of $$x$$. I’m interested in the relationship between two different ways of modeling the observer’s "perceptual noise" based on their choices in a series of stimulus presentations.

The first way would be to fit a logistic function

$$p_1(x) = \frac{1}{1+e^{-\beta \cdot x}}$$

where $$p_1(x)$$ is the probability of choosing the stimulus category with positive sign ($$S^+$$). Here, $$\beta$$ would reflect the degree of perceptual noise.

A second way would be to assume that the observer has Gaussian noise $$\mathcal{N}(0,\sigma)$$ around each observation of $$x$$ and then compute the probability of choosing $$S^+$$ by means of the cumulative distribution function as follows:

$$p_2(x) = \frac{1}{\sigma\sqrt{2\pi}}\int\limits_{z=0}^{\infty}e^{-\frac{(z-x)^2}{2\sigma^2}} \, dz$$

In this case, $$\sigma$$ would be an estimate of the perceptual noise.

I have a hunch that both these approaches are intimately related, but I’m not sure how. Is it an underlying assumption of the logistic function that the noise is normally distributed? Is there a formula that describes the relationship between $$\beta$$ of $$p_1(x)$$ and $$\sigma$$ of $$p_2(x)$$? Are, in the end, $$p_1(x)$$ and $$p_2(x)$$ essentially identical, and could $$p_1$$ be derived from $$p_2$$?
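A quick numerical comparison can make the question concrete. The sketch below uses the fact that $$p_2(x)$$ equals the standard normal CDF evaluated at $$x/\sigma$$, and pairs it with a logistic curve whose slope is set by the classical scaling constant 1.702. That constant, the value of `sigma`, and the grid are assumptions chosen for illustration; the two curves are close under this pairing but not identical, since the logistic function is the CDF of a logistic rather than a Gaussian distribution.

```python
import math


def p1(x, beta):
    """Logistic choice probability."""
    return 1.0 / (1.0 + math.exp(-beta * x))


def p2(x, sigma):
    """Gaussian-noise choice probability: P(x + noise > 0) = Phi(x / sigma)."""
    return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))


sigma = 2.0            # hypothetical noise level
beta = 1.702 / sigma   # classical logistic-to-normal scaling (an assumption here)

grid = [i / 100.0 for i in range(-1000, 1001)]  # x from -10 to 10
max_gap = max(abs(p1(x, beta) - p2(x, sigma)) for x in grid)
print(f"largest |p1 - p2| on the grid: {max_gap:.4f}")
```

Under this pairing the largest gap on the grid is roughly 0.01, so in practice the two models are hard to tell apart from choice data, and $$\beta$$ is approximately inversely proportional to $$\sigma$$.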


## #StackBounty: #bayesian #normal-distribution #variance #gamma-distribution #inverse-gamma Bayesian estimation of the variance

### Bounty: 50

The mean of the Gamma distribution is $$\alpha/\beta$$, while the mean of the Inverse Gamma is $$\beta/(\alpha-1)$$. Similarly, the mode of the Gamma is $$(\alpha-1)/\beta$$, but the mode of the Inverse Gamma is $$\beta/(\alpha+1)$$.

How does this relate to the title of the question?

Well, if we are given data $$X$$ assumed to be normally distributed, the population variance is given by:
$$\sigma_{pop}^2=\frac{\sum(x_i-\bar x)^2}{n}\equiv\frac{s_n^2}{n}$$
However, if we take a Bayesian approach and choose a non-informative Normal-Inverse-Gamma conjugate prior for the variance (i.e. $$\alpha_0\rightarrow0, \beta_0\rightarrow0$$), we have that the marginal distribution of $$\sigma^2$$ is also Inverse-Gamma distributed, with $$\alpha=n/2, \beta=s_n^2/2$$ and mean:
$$E[\sigma^2] = \frac{s_n^2/2}{n/2-1}\neq\sigma_{pop}^2 \quad(!)$$
On the other hand, if one uses an uninformative Normal-Gamma prior:
$$E[\tau] = \frac{n/2}{s_n^2/2} = \frac{1}{\sigma_{pop}^2}$$
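To see the size of the discrepancy, here is a small arithmetic sketch that evaluates the point summaries for the posterior parameters stated above. The numbers (10 observations, sum of squared deviations 20) are hypothetical, the function name `summaries` is an illustration, and the Inverse-Gamma mode uses the standard formula $$\beta/(\alpha+1)$$.

```python
def summaries(n, s_n2):
    """Point summaries under the posterior stated above (alpha = n/2, beta = s_n2/2)."""
    alpha, beta = n / 2.0, s_n2 / 2.0
    return {
        "sigma2_pop": s_n2 / n,
        "E[sigma2]": beta / (alpha - 1.0),     # Inverse-Gamma mean (needs alpha > 1)
        "mode[sigma2]": beta / (alpha + 1.0),  # Inverse-Gamma mode
        "E[tau]": alpha / beta,                # Gamma mean
        "mode[tau]": (alpha - 1.0) / beta,     # Gamma mode (needs alpha >= 1)
    }


# Hypothetical numbers: n = 10 observations, sum of squared deviations s_n^2 = 20.
out = summaries(10, 20.0)
for name, value in out.items():
    print(f"{name:>13s} = {value:.4f}")
```

With these inputs the posterior mean of $$\sigma^2$$ sits above $$\sigma_{pop}^2$$ and its mode sits below it, while the mean of $$\tau$$ is exactly the reciprocal of $$\sigma_{pop}^2$$, which is the asymmetry the questions below are about.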

Assuming the above is correct, I have a couple of questions:

1. I realize that $$E[1/X]\neq1/E[X]$$, yet I’m not sure why $$E[\tau]$$ specifically should yield the "correct" result. What’s wrong with using $$E[\sigma^2]$$?
2. The frequentist approach would lead to using the sample variance, which seems to match neither approach with an uninformative prior. What, if any, is the significance of the particular prior that would result in the sample variance for both $$\tau$$ and $$\sigma$$?
3. What is the significance of the mode, i.e. the MAP estimator of $$\sigma^2$$ or $$\tau$$? Again, the two differ, and I don’t believe I’ve seen either being used in practice.


### Bounty: 50

I am writing alerts to monitor the sign-up conversion rate for an app. Sign-up conversion rate here means the percentage of users who open the app and end up creating an account. Usually, this is around 35–45% (meaning 35–45% of users who open the app create an account).

I want an alert to fire if it detects a significant drop in this conversion rate, for example due to a buggy release where new users can’t sign up. I have the following:

• appOpenedCount: Number of users who opened the app

• signupCount: Number of users who created an account

• conversionRate: signupCount / appOpenedCount

• period: How far in the past to look, i.e. the time window used for the appOpenedCount and signupCount data. Usually, we want this to be the past 1 hour so that the alert is urgent.

So based on the above, how do I find the best condition to trigger an alert with minimal false positives? I have months of past data for analysis. The system will check for the alert every 5 minutes.

My current condition: if appOpenedCount > 100 and conversionRate < 0.32 (the 2nd percentile) in the past hour, fire an alert. However, I’m noticing a lot of false positives, so I’m thinking we could do better. Should I use something like the 0.2th percentile instead? The conditions can be very flexible. For example, I can use week-over-week analysis, where I compare against the previous week’s data.

Here is a graph I made that might be useful. Each data point shows the conversion rate and the app-open count for one hour (over the course of a month). As you can see, the more app opens there are, the less the conversion rate varies. App-open count is lowest at night and highest around noon.
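Since the spread of the hourly rate shrinks as the number of app opens grows, one option is a threshold that widens when traffic is low instead of a fixed cutoff. The sketch below is one way to do that under a binomial assumption; it is not a tuned rule. The function name `should_alert`, the baseline of 0.40 (picked from the middle of the 35–45% range), the z multiplier of 3, and the minimum of 100 opens are all illustrative assumptions; `app_opened` and `signups` stand in for appOpenedCount and signupCount.

```python
import math


def should_alert(app_opened, signups, baseline=0.40, z=3.0, min_opens=100):
    """Return True when the observed conversion rate is significantly below baseline.

    baseline, z and min_opens are illustrative defaults, not tuned values.
    """
    if app_opened < min_opens:
        return False  # too little traffic to judge
    rate = signups / app_opened
    # Binomial standard error of the rate if the true rate were `baseline`.
    se = math.sqrt(baseline * (1.0 - baseline) / app_opened)
    return rate < baseline - z * se


print(should_alert(1000, 300))  # 30% on 1000 opens
print(should_alert(120, 37))    # about 31% on only 120 opens
print(should_alert(50, 0))      # below min_opens
```

The first call fires because 30% is far below the baseline relative to the noise expected at 1000 opens; the second does not, because a similar rate on 120 opens is within normal variation. In practice the baseline could be taken from historical data for the same hour of the week, which would fold in the week-over-week idea.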


## #StackBounty: #distributions #normal-distribution #matlab #random-generation #skew-normal Generating skew-normal distribution in Matlab

### Bounty: 50

My apologies if this is a trivial question, but I have been having trouble with this for a while now.

I need to use a skew-normal distribution for research in MATLAB, and the only way I found after googling was to use pearsrnd, as described here.

Now, I did the math and wrote a helper function, skewnormal, in MATLAB as follows:

    %% Helper function: moments of the skew-normal distribution, for use with pearsrnd
    function [m, s, sk, kurt] = skewnormal(a, e, w)
    % a: shape, e: location, w: scale
    c = sqrt(2/pi);                      % used repeatedly below
    d = a/sqrt(1 + a*a);                 % delta = a/sqrt(1 + a^2)
    m = e + d*w*c;                       % mean
    s = w*sqrt(1 - d^2*c^2);             % standard deviation (not the variance)
    sk = (4 - pi)/2*(d*c*w/s)^3;         % skewness
    kurt = 3 + 2*(pi - 3)*(d*c*w/s)^4;   % kurtosis, MATLAB convention (normal = 3)
    end

Then, when I use the above in my code and inspect the distribution type returned by pearsrnd, it is $$1$$, which is apparently the four-parameter beta distribution in the Pearson system.

I looked for existing answers here that address this directly, but I did not find any.

Can anyone fix my attempt at generating Skew-Normal distribution, since I am clearly doing something wrong?
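One alternative that avoids moment matching altogether is to sample the skew-normal directly from its stochastic representation: with two independent standard normals $$U_0, U_1$$ and $$\delta = a/\sqrt{1+a^2}$$, the variable $$e + w(\delta |U_0| + \sqrt{1-\delta^2}\, U_1)$$ is skew-normal with shape $$a$$, location $$e$$ and scale $$w$$. The sketch below is in Python rather than MATLAB, and the parameter values are hypothetical; the same two-normal construction can be written in MATLAB with `randn`.

```python
import math
import random
import statistics


def skew_normal_sample(a, e, w, rng):
    """One draw from a skew-normal with shape a, location e, scale w."""
    d = a / math.sqrt(1.0 + a * a)
    u0 = rng.gauss(0.0, 1.0)
    u1 = rng.gauss(0.0, 1.0)
    return e + w * (d * abs(u0) + math.sqrt(1.0 - d * d) * u1)


rng = random.Random(2)   # fixed seed for reproducibility
a, e, w = 4.0, 0.0, 1.0  # hypothetical parameter values
xs = [skew_normal_sample(a, e, w, rng) for _ in range(50000)]

# Theoretical mean and standard deviation, same formulas as in skewnormal().
d = a / math.sqrt(1.0 + a * a)
c = math.sqrt(2.0 / math.pi)
m_theory = e + d * w * c
s_theory = w * math.sqrt(1.0 - d * d * c * c)

print(f"mean: sample={statistics.fmean(xs):.4f} theory={m_theory:.4f}")
print(f"sd:   sample={statistics.stdev(xs):.4f} theory={s_theory:.4f}")
```

The sample mean and standard deviation should land close to the values the helper function computes, which also serves as a check on those formulas. Unlike a Pearson-system draw, which only matches the first four moments, these samples follow the skew-normal distribution itself.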


## #StackBounty: #normal-distribution #pdf #likelihood #uniform #sufficient-statistics Sufficient statistics in the uniform distribution c…

### Bounty: 50

I am currently studying sufficient statistics. My notes say the following:

A statistic $$T(\mathbf{Y})$$ is sufficient for $$\theta$$ if, and only if, for all $$\theta \in \Theta$$,

$$L(\theta; \mathbf{y}) = g(T(\mathbf{y}), \theta) \times h(\mathbf{y}),$$

where the function $$g(\cdot)$$ depends on $$\theta$$ and the statistic $$T(\mathbf{Y})$$, while the function $$h(\cdot)$$ does not contain $$\theta$$.

Sufficient statistics are not unique:

Any one-to-one transformation of a sufficient statistic is again a sufficient statistic.

Sufficiency depends on the model:

Let $$Y_1, \dots, Y_n$$ be a sample from $$N(\mu, \sigma^2)$$, where $$\sigma^2 > 0$$ is known. The only unknown parameter is $$\mu = E[Y]$$.
$$T(\mathbf{Y}) = \sum_{i = 1}^n Y_i$$ or $$T(\mathbf{Y}) = \bar{Y}$$ are sufficient statistics for $$\mu$$.

Let $$Y_1, \dots, Y_n$$ be a sample from a $$\text{Uniform}[\mu - 1, \mu + 1]$$ distribution. The only unknown parameter is $$\mu = E[Y]$$.
In this case, $$T(\mathbf{Y}) = \sum_{i = 1}^n Y_i$$ or $$T(\mathbf{Y}) = \bar{Y}$$ are not sufficient statistics for $$\mu$$.

I don’t understand why $$T(\mathbf{Y}) = \sum_{i = 1}^n Y_i$$ or $$T(\mathbf{Y}) = \bar{Y}$$ are sufficient statistics for $$\mu$$ in the normally distributed case, but not in the uniformly distributed case. I know that the distinguishing feature of the uniform distribution is that its density is constant on its support, unlike the normal distribution, so I strongly suspect that this has something to do with it; although, as I said, I’m not sure precisely why.
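A small numerical illustration may help here. It relies on the likelihood given in the accompanying example of this post, which is non-zero exactly when $$y_{(n)} - 1 \le \mu \le y_{(1)} + 1$$. The two samples and the function name `likelihood_support` are hypothetical, chosen so that both samples have the same sum and mean.

```python
def likelihood_support(ys):
    """Interval of mu on which the Uniform[mu - 1, mu + 1] likelihood is non-zero."""
    return (max(ys) - 1.0, min(ys) + 1.0)


# Two hypothetical samples with the same sum and mean (both zero).
sample_a = [0.0, 0.0]
sample_b = [-0.9, 0.9]

print(sum(sample_a), likelihood_support(sample_a))
print(sum(sample_b), likelihood_support(sample_b))
```

Both samples have sum zero, yet the first allows any $$\mu$$ between about -1 and 1 while the second only allows $$\mu$$ between about -0.1 and 0.1. The likelihood therefore depends on the data through more than the sum, so the sum (or the mean) cannot play the role of $$T(\mathbf{y})$$ in the factorisation, whereas the pair $$(y_{(1)}, y_{(n)})$$ can.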

An accompanying example for the uniformly distributed case is as follows:

Example

Let $$Y_1, \dots, Y_n$$ be i.i.d. $$U[\mu - 1, \mu + 1]$$. Each has the density

$$f_\mu (y) = \begin{cases} 1/2 & \text{if } \mu - 1 \le y \le \mu + 1 \\ 0 & \text{otherwise}, \end{cases}$$

where $$\mu \in \Theta = \mathbb{R} = (-\infty, \infty)$$. The likelihood is given by

\begin{align} L(\mu; \mathbf{y}) = \prod_{i = 1}^n f_{\mu} (y_i) &= \begin{cases} 1/2^n & \text{if } \mu - 1 \le y \le \mu + 1 \\ 0 & \text{otherwise} \end{cases} \\ &= \begin{cases} 1/2^n & \text{if } \mu - 1 \le y_{(1)} \le \dots \le y_{(n)} \le \mu + 1 \\ 0 & \text{otherwise} \end{cases} \\ &= \begin{cases} 1/2^n & \text{if } y_{(n)} - 1 \le \mu \le y_{(1)} + 1 \\ 0 & \text{otherwise} \end{cases} \end{align}

where $$(y_{(1)} \le \dots \le y_{(n)})$$ are the order statistics of $$(y_1, \dots, y_n)$$.

The only part of this example that is unclear to me is the last case:

$$\begin{cases} 1/2^n & \text{if } y_{(n)} - 1 \le \mu \le y_{(1)} + 1 \\ 0 & \text{otherwise} \end{cases}$$

Specifically, I don’t understand where $$y_{(n)} - 1 \le \mu \le y_{(1)} + 1$$ came from; its equivalence to the two cases that came before it is not clear to me.
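For what it is worth, the step appears to be a rearrangement of the two outer inequalities in the chain, each solved for $$\mu$$. The middle inequalities between order statistics hold automatically, so only the smallest and largest observations constrain $$\mu$$:

$$\begin{aligned} \mu - 1 \le y_{(1)} \;&\Longleftrightarrow\; \mu \le y_{(1)} + 1, \\ y_{(n)} \le \mu + 1 \;&\Longleftrightarrow\; y_{(n)} - 1 \le \mu, \\ \text{so} \quad \mu - 1 \le y_{(1)} \le \dots \le y_{(n)} \le \mu + 1 \;&\Longleftrightarrow\; y_{(n)} - 1 \le \mu \le y_{(1)} + 1. \end{aligned}$$

The condition on the data given $$\mu$$ has simply been rewritten as a condition on $$\mu$$ given the data, which is the natural form for a likelihood.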

I would greatly appreciate it if people would please take the time to clarify this.


## #StackBounty: #normal-distribution #poisson-distribution #likelihood #sufficient-statistics Finding the form \$g(T(mathbf{y}), lambda)…

### Bounty: 50

I’m studying some notes that present examples of sufficiency:

Let $$Y_1, \dots, Y_n$$ be i.i.d. $$N(\mu, \sigma^2)$$. Note that $$\sum_{i = 1}^n (y_i - \mu)^2 = \sum_{i = 1}^n (y_i - \bar{y})^2 + n(\bar{y} - \mu)^2$$. Hence

\begin{align} L(\mu, \sigma; \mathbf{y}) &= \prod_{i = 1}^n \dfrac{1}{\sqrt{2\pi \sigma^2}}e^{-\frac{1}{2\sigma^2}(y_i - \mu)^2} \\ &= \dfrac{1}{(2\pi \sigma^2)^{n/2}}e^{-\frac{1}{2\sigma^2}\sum_{i = 1}^n (y_i - \bar{y})^2}e^{-\frac{1}{2\sigma^2}n(\bar{y} - \mu)^2} \end{align}

From Theorem 1, it follows that $$T(\mathbf{Y}) = (\bar{Y}, \sum_{i = 1}^n (Y_i - \bar{Y})^2)$$ is a sufficient statistic for $$(\mu, \sigma)$$.

Theorem 1 is presented as follows:

A statistic $$T(\mathbf{Y})$$ is sufficient for $$\theta$$ if, and only if, for all $$\theta \in \Theta$$

$$L(\theta; \mathbf{y}) = g(T(\mathbf{y}), \theta) \times h(\mathbf{y})$$

where the function $$g(\cdot)$$ depends on $$\theta$$ and the statistic $$T(\mathbf{Y})$$, while the function $$h(\cdot)$$ does not contain $$\theta$$.

Theorem 1 implies that if the likelihood $$L(\theta; \mathbf{y})$$ depends on the data only through $$T(\mathbf{y})$$, then $$T(\mathbf{Y})$$ is a sufficient statistic for $$\theta$$ and $$h(\mathbf{y}) \equiv 1$$.

For reference to another example, here is a Poisson example that I recently posted:

Let $$Y_1, \dots, Y_n$$ be i.i.d. $$\text{Pois}(\lambda)$$. Then

\begin{align} L(\lambda; \mathbf{y}) &= \prod_{i = 1}^n e^{-\lambda} \dfrac{\lambda^{y_i}}{y_i!} \\ &= e^{-\lambda n} \dfrac{\lambda^{\sum_{i = 1}^n y_i}}{\prod_{i = 1}^n y_i!} \\ &= g(T(\mathbf{y}), \lambda) \times h(\mathbf{y}) \end{align}

where $$T(\mathbf{y}) = \sum_{i = 1}^n y_i$$, $$g(T(\mathbf{y}), \lambda) = e^{-\lambda n} \lambda^{T(\mathbf{y})}$$ and $$h(\mathbf{y}) = \dfrac{1}{\prod_{i = 1}^n y_i!}$$
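The Poisson factorisation can be checked numerically. The sketch below computes the likelihood directly as a product and again as $$g \times h$$ using the definitions just given; the counts in `ys` and the values of `lam` are hypothetical.

```python
import math


def poisson_likelihood(lam, ys):
    """Direct product of Poisson probabilities."""
    return math.prod(math.exp(-lam) * lam ** y / math.factorial(y) for y in ys)


def g(t, lam, n):
    """Factor that depends on the data only through t = sum(ys)."""
    return math.exp(-lam * n) * lam ** t


def h(ys):
    """Factor that does not involve lambda."""
    return 1.0 / math.prod(math.factorial(y) for y in ys)


ys = [2, 0, 3, 1]  # hypothetical counts
for lam in (0.5, 1.5, 4.0):
    direct = poisson_likelihood(lam, ys)
    factored = g(sum(ys), lam, len(ys)) * h(ys)
    print(lam, direct, factored)
```

The two columns agree for every value of `lam`, and `g` never sees the individual counts, only their sum, which is what sufficiency of the sum means in practice.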

There are three things that I don’t understand here:

1. How is it that $$\sum_{i = 1}^n (y_i - \mu)^2 = \sum_{i = 1}^n (y_i - \bar{y})^2 + n(\bar{y} - \mu)^2$$? EDIT: Answered here.

2. If, for $$L(\theta; \mathbf{y})$$, we require the form $$g(T(\mathbf{y}), \theta) \times h(\mathbf{y})$$, then, for $$L(\mu, \sigma; \mathbf{y})$$, what form do we require? Trying to think of this myself, I thought of three potentially correct forms: $$g(T(\mathbf{y}), (\mu, \sigma)) \times h(\mathbf{y})$$, $$g(T(\mathbf{y}), (\sigma, \mu)) \times h(\mathbf{y})$$, or $$g(T(\mathbf{y}), \mu, \sigma) \times h(\mathbf{y})$$.

3. Related to 2., comparing the first example to the Poisson example, I don’t understand the conclusion of the first example. How does $$T(\mathbf{Y}) = (\bar{Y}, \sum_{i = 1}^n (Y_i - \bar{Y})^2)$$ satisfy the form $$g(T(\mathbf{y}), \lambda) \times h(\mathbf{y})$$?
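As a worked sketch of the pieces these three points refer to: the identity in point 1 comes from adding and subtracting $$\bar{y}$$ inside the square, and the cross term vanishes because $$\sum_{i = 1}^n (y_i - \bar{y}) = 0$$. The component names $$T_1, T_2$$ below are introduced only for this note.

$$\begin{aligned} \sum_{i = 1}^n (y_i - \mu)^2 &= \sum_{i = 1}^n \big((y_i - \bar{y}) + (\bar{y} - \mu)\big)^2 \\ &= \sum_{i = 1}^n (y_i - \bar{y})^2 + 2(\bar{y} - \mu)\sum_{i = 1}^n (y_i - \bar{y}) + n(\bar{y} - \mu)^2 \\ &= \sum_{i = 1}^n (y_i - \bar{y})^2 + n(\bar{y} - \mu)^2. \end{aligned}$$

Writing $$T(\mathbf{y}) = (T_1, T_2) = (\bar{y}, \sum_{i = 1}^n (y_i - \bar{y})^2)$$ and treating $$\theta = (\mu, \sigma)$$ as a single vector-valued parameter, the normal likelihood above can then be read as

$$g(T(\mathbf{y}), \theta) = \dfrac{1}{(2\pi \sigma^2)^{n/2}} e^{-\frac{T_2}{2\sigma^2}} e^{-\frac{n (T_1 - \mu)^2}{2\sigma^2}}, \qquad h(\mathbf{y}) \equiv 1.$$

Under this reading the three candidate forms in point 2 differ only in notation, and the $$\lambda$$ in point 3 is replaced by the pair $$(\mu, \sigma)$$.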

I would greatly appreciate it if people would please take the time to clarify these points.

