## #StackBounty: #intuition #naive-bayes #discriminant-analysis Intuition for why LDA is a special case of naive Bayes

### Bounty: 50

The naive Bayes classifier assumes the regressors to be mutually independent, while linear discriminant analysis (LDA) allows them to be correlated. James et al. "An Introduction to Statistical Learning" (2nd edition, 2021) section 4.5 (bottom of p. 159) claim that LDA is in fact a special case of the naive Bayes classifier (admitting that the fact is not at all obvious — with which I agree, and hence my question). What is the intuition?

Get this bounty!!!

## #StackBounty: #regression #econometrics #intuition #instrumental-variables #endogeneity Question about Instrumental variables, endogene…

### Bounty: 50

I have seen this notation used to describe the instrumental variables framework, and I want to make sure I understand it. Here $$y$$ is the dependent variable, $$x$$ is the treatment, and $$z$$ is the instrument:

$$y = f(x,\epsilon)$$

$$x = g(z,\eta)$$

and the endogeneity structure is defined as: $$\operatorname{cov}(\epsilon,\eta)\neq 0$$, $$\operatorname{cov}(z,\epsilon)=0$$, $$\operatorname{cov}(z,\eta)=0$$

I want to make sure I understand what this is saying.

1. First, is any variable $$z$$ that satisfies these conditions an instrument?

2. If I approximate these functions with linear equations, say $$x = \pi z + \eta$$, is this saying that we can partition the entire variation of $$x$$ into the variation explained by $$z$$ and the remaining variation $$\eta$$, and that the endogeneity can be expressed as $$\operatorname{cov}(\epsilon,\eta)\neq 0$$? I am confused because usually this is simply expressed as $$\operatorname{cov}(x,\epsilon)\neq 0$$, and I am not familiar with writing it all in terms of errors. Is this the same, since I can just plug in the model for $$x$$ to get $$\operatorname{cov}(\pi z + \eta,\epsilon) = \operatorname{cov}(\eta,\epsilon)$$, given the exogeneity of $$z$$?

3. Is this equivalent to saying there exists some set of variables $$r$$ with $$r\in\epsilon$$ and $$r\in\eta$$, i.e. omitted variables that determine both $$x$$ and $$y$$?
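The substitution in point 2 can be checked numerically. The following is a minimal simulation sketch, not part of the original notation: the coefficients `pi` and `beta`, the shared omitted variable `r`, and the linear form of both equations are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical data-generating process (all numbers are illustrative).
r = rng.normal(size=n)            # omitted variable entering both error terms
z = rng.normal(size=n)            # instrument, drawn independently of r
eta = r + rng.normal(size=n)      # first-stage error
eps = r + rng.normal(size=n)      # structural error
pi, beta = 0.8, 2.0
x = pi * z + eta                  # linear first stage
y = beta * x + eps                # linear structural equation


def cov(a, b):
    """Sample covariance between two 1-D arrays."""
    return np.cov(a, b)[0, 1]


cov_x_eps = cov(x, eps)           # the usual statement of endogeneity
cov_eta_eps = cov(eta, eps)       # the same thing written in terms of errors
beta_ols = cov(x, y) / cov(x, x)  # biased, because cov(x, eps) != 0
beta_iv = cov(z, y) / cov(z, x)   # simple IV ratio using z

print(cov_x_eps, cov_eta_eps, beta_ols, beta_iv)
```

Under these assumptions the two covariances agree up to sampling noise, because the $$\pi z$$ part of $$x$$ is uncorrelated with $$\epsilon$$ by construction.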


## #StackBounty: #covariance #intuition Population covariance, are these two formulas equivalent?

### Bounty: 50

For the population covariance, you can write it as:

$$\sigma_{x,y} = \frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}{N}$$

where $$N$$ is the population size, or in expected values:

$$E[(x-\mu_x)(y-\mu_y)]$$

Are these two formulations actually equivalent? If you had the entire population of size $$N$$ and plugged it into the first equation, would the result equal the true expected value?

I am just confused about why the former equation is used to denote the population value if this is so. Is it just an intuitive way to formulate the population covariance?
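One way to compare the two expressions is to treat a finite population as a distribution that puts probability $$1/N$$ on each unit. The sketch below does that for a made-up population of five pairs; the data values are purely illustrative.

```python
import numpy as np

# Hypothetical finite population of N = 5 (x, y) pairs.
x = np.array([1.0, 2.0, 4.0, 7.0, 11.0])
y = np.array([3.0, 1.0, 6.0, 8.0, 12.0])
N = len(x)

# First formula: sum of cross-products of deviations, divided by N.
cov_sum = np.sum((x - x.mean()) * (y - y.mean())) / N

# Second formula: E[(x - mu_x)(y - mu_y)], with each unit having probability 1/N.
p = np.full(N, 1.0 / N)
mu_x, mu_y = np.sum(p * x), np.sum(p * y)
cov_expect = np.sum(p * (x - mu_x) * (y - mu_y))

# NumPy's covariance with ddof=0 also divides by N rather than N - 1.
cov_numpy = np.cov(x, y, ddof=0)[0, 1]

print(cov_sum, cov_expect, cov_numpy)
```

For a finite population viewed this way, the two formulas are the same computation. The expectation form is the more general one, since it also covers distributions that are not a finite list of equally likely units.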


## #StackBounty: #distributions #interaction #intuition #h-statistic Intuitive explanation of Friedman's H-statistic

### Bounty: 50

What is the cleanest, easiest way to explain Friedman’s H-statistic to a non-STEM person? What does it mean intuitively?

While exploring feature interaction I came across Friedman’s H-statistic.

Mathematically, the H-statistic proposed by Friedman and Popescu for the interaction between features $$j$$ and $$k$$ is:

$$H^2_{jk}=\sum_{i=1}^n\left[PD_{jk}(x_{j}^{(i)},x_k^{(i)})-PD_j(x_j^{(i)})-PD_k(x_{k}^{(i)})\right]^2/\sum_{i=1}^n{PD}^2_{jk}(x_j^{(i)},x_k^{(i)})$$

The partial dependence function for regression is defined as:

$$\hat{f}_{x_S}(x_S)=E_{x_C}\left[\hat{f}(x_S,x_C)\right]=\int\hat{f}(x_S,x_C)\,d\mathbb{P}(x_C)$$

It’s a concept that I have difficulty in articulating.

Can someone please explain it using simple examples?
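As a small illustration of the formula above, the following sketch estimates $$H^2_{jk}$$ by brute force for two toy prediction functions, one purely additive and one with a product term. The helper names `partial_dependence` and `h_squared`, the toy functions, and the choice to center each partial dependence function at mean zero are assumptions made for this example.

```python
import numpy as np


def partial_dependence(f, X, cols):
    """Centered partial dependence of f on X[:, cols], evaluated at every row."""
    n = X.shape[0]
    pd = np.empty(n)
    for i in range(n):
        X_mod = X.copy()
        X_mod[:, cols] = X[i, cols]   # fix the chosen features at row i's values
        pd[i] = f(X_mod).mean()       # average the prediction over the other features
    return pd - pd.mean()             # center at zero


def h_squared(f, X, j, k):
    """Share of the joint partial dependence not explained by the two single effects."""
    pd_jk = partial_dependence(f, X, [j, k])
    pd_j = partial_dependence(f, X, [j])
    pd_k = partial_dependence(f, X, [k])
    return np.sum((pd_jk - pd_j - pd_k) ** 2) / np.sum(pd_jk ** 2)


rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))


def additive(X):
    # No interaction: the effect of feature 0 does not depend on feature 1.
    return X[:, 0] + 2 * X[:, 1] + X[:, 2]


def interacting(X):
    # Pure interaction between features 0 and 1.
    return X[:, 0] * X[:, 1] + X[:, 2]


h2_additive = h_squared(additive, X, 0, 1)
h2_interacting = h_squared(interacting, X, 0, 1)
print(h2_additive, h2_interacting)
```

In this toy setup the additive function gives a value that is zero up to floating-point error, while the product function gives a value close to one, which matches the reading of $$H^2_{jk}$$ as the fraction of the joint effect that the two individual effects cannot account for.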


## #StackBounty: #normal-distribution #conditional-expectation #intuition #multivariate-distribution Interpretation of multivariate condit…

### Bounty: 50

I’ve been reading over this multivariate Gaussian conditional proof, trying to make sense of how the mean and variance of a Gaussian conditional were derived. I’ve come to accept that unless I allocate a dozen or so hours to refreshing my linear algebra knowledge, it’s out of my reach for the time being.

That being said, I’m looking for a conceptual explanation of what these equations represent:

$$\mu_{1|2} = \mu_1 + \Sigma_{1,2} \Sigma^{-1}_{2,2}(x_2 - \mu_2)$$

I read the first as "Take $$\mu_1$$ and augment it by some factor, which is the covariance scaled by the precision (measure of how closely $$X_2$$ is clustered about $$\mu_2$$, maybe?) and projected onto the distance of the specific $$x_2$$ from $$\mu_2$$."

$$\Sigma_{1|2} = \Sigma_{1,1} - \Sigma_{1,2} \Sigma^{-1}_{2,2} \Sigma_{1,2}$$

I read the second as, "take the variance about $$\mu_1$$ and subtract some factor, which is covariance squared scaled by the precision about $$x_2$$."

In either case, the precision $$\Sigma^{-1}_{2,2}$$ seems to be playing a really important role.

A few questions:

• Am I right to treat precision as a measure of how closely observations are clustered about the expectation?
• Why is the covariance squared in the latter equation? (Is there a geometric interpretation?) So far, I’ve been treating $$\Sigma_{1,2} \Sigma^{-1}_{2,2}$$ as a ratio, (a/b), and so this ratio acts to scale the (second) $$\Sigma_{1,2}$$, essentially accounting for/damping the effect of the covariance; I don’t know if this is valid.
• Anything else you’d like to add/clarify?
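The two formulas can be evaluated directly on a small example. The sketch below uses a made-up three-dimensional Gaussian with $$x_1$$ as the first coordinate and $$x_2$$ as the remaining two; the numbers in `mu`, `Sigma`, and `x2` are illustrative assumptions. With matrix blocks, the last factor in the variance formula is usually written $$\Sigma_{2,1} = \Sigma_{1,2}^\top$$, which is what the code uses.

```python
import numpy as np

# Hypothetical parameters of a 3-dimensional Gaussian (values are illustrative).
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[4.0, 1.2, 0.8],
                  [1.2, 2.0, 0.5],
                  [0.8, 0.5, 1.0]])

# Partition: block 1 is the first coordinate, block 2 is the other two.
S11, S12 = Sigma[:1, :1], Sigma[:1, 1:]
S21, S22 = Sigma[1:, :1], Sigma[1:, 1:]

# B = Sigma_12 Sigma_22^{-1}: the coefficients from regressing x1 on x2.
B = np.linalg.solve(S22, S21).T

x2 = np.array([-1.0, 1.5])              # observed value of block 2
mu_cond = mu[:1] + B @ (x2 - mu[1:])    # conditional mean
Sigma_cond = S11 - B @ S21              # conditional variance

# Cross-check: the conditional variance is the inverse of the (1,1) block
# of the full precision matrix.
precision = np.linalg.inv(Sigma)
Sigma_cond_via_precision = np.linalg.inv(precision[:1, :1])

print(mu_cond, Sigma_cond, Sigma_cond_via_precision)
```

Read this way, $$\Sigma_{1,2}\Sigma^{-1}_{2,2}$$ acts like a regression slope: the mean shifts by the slope times how far $$x_2$$ is from $$\mu_2$$, and the variance shrinks by the part of $$\Sigma_{1,1}$$ that the regression on $$x_2$$ explains.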


## #StackBounty: #econometrics #intuition #instrumental-variables Intuitive understanding of instrumental variables for natural experiments

### Bounty: 50

I am wondering if my understanding of instrumental variables to exploit natural experiments is correct, or if I am misunderstanding something.

Is the logic as follows: by using an instrument, you are now comparing the outcomes of those who received higher levels of treatment because they had higher exposure to the instrument to those who received lower levels of treatment because they had lower exposure to the instrument, but these latter units would have received higher treatment had they been more exposed to the instrument?

So should I think of it intuitively as, to some degree, a randomized experiment on a subset of units?
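The comparison described above can be sketched with a simulated binary instrument. Everything in this example is an assumption made for illustration: the take-up rule, the constant treatment effect `true_effect`, and the unobserved confounder `u`. The estimate computed is the simple ratio of the outcome gap to the treatment gap across instrument groups (often called the Wald estimator).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400_000

u = rng.normal(size=n)             # unobserved confounder
z = rng.integers(0, 2, size=n)     # as-good-as-random binary instrument

# Treatment take-up depends on both the instrument and the confounder.
d = ((0.8 * z + u + rng.normal(size=n)) > 0.5).astype(float)

true_effect = 1.5                  # hypothetical constant treatment effect
y = true_effect * d + 2.0 * u + rng.normal(size=n)

# Naive comparison of treated and untreated units is confounded by u.
naive = y[d == 1].mean() - y[d == 0].mean()

# Compare units by instrument exposure instead.
reduced_form = y[z == 1].mean() - y[z == 0].mean()   # outcome gap by z
first_stage = d[z == 1].mean() - d[z == 0].mean()    # treatment gap by z
wald = reduced_form / first_stage

print(naive, wald)
```

In this simulation only the units whose treatment status changes with `z` contribute to the treatment gap between the two instrument groups, which is the sense in which the comparison resembles an experiment on a subset of units.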


## #StackBounty: #intuition #similarities #kullback-leibler #cross-entropy What's an intuitive way to understand how KL divergence dif…

### Bounty: 50

The general intuition I have seen for KL divergence is that it computes the difference in expected length sampling from distribution $$P$$ with an optimal code for $$P$$ versus sampling from distribution $$P$$ with an optimal code for $$Q$$.

This makes sense as a general intuition as to why it’s a similarity metric between two distributions, but there are a number of similarity metrics between two distributions. There must be some underlying assumptions based on how it chooses to assign distance versus other metrics.

This seems fundamental to understanding when to use KL divergence. Is there a good intuition for understanding how KL divergence differs from other similarity metrics?
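The coding intuition above corresponds to writing KL divergence as cross-entropy minus entropy. The following sketch checks that identity on two made-up discrete distributions; the probability vectors `p` and `q` are illustrative assumptions, and logarithms are base 2 so that the units are bits.

```python
import numpy as np


def entropy_bits(p):
    """Expected code length when sampling from p and using an optimal code for p."""
    return -np.sum(p * np.log2(p))


def cross_entropy_bits(p, q):
    """Expected code length when sampling from p but using an optimal code for q."""
    return -np.sum(p * np.log2(q))


def kl_bits(p, q):
    """KL divergence D(p || q) in bits."""
    return np.sum(p * np.log2(p / q))


# Hypothetical distributions over four symbols.
p = np.array([0.5, 0.25, 0.125, 0.125])
q = np.array([0.7, 0.1, 0.1, 0.1])

extra_bits = cross_entropy_bits(p, q) - entropy_bits(p)
kl_pq = kl_bits(p, q)
kl_qp = kl_bits(q, p)

print(extra_bits, kl_pq, kl_qp)
```

The first two printed values agree, and the third differs from them: swapping the roles of $$P$$ and $$Q$$ changes the answer, so the quantity is not symmetric in its two arguments.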

