#StackBounty: #maximum-likelihood #inference #loss-functions #decision-theory #risk Model fitting vs minimizing expected risk

Bounty: 50

I’m confused about the mechanics of model fitting vs minimizing risk in decision theory. There are numerous resources online, but I can’t seem to find a straight answer to what I’m confused about.

Model fitting (via e.g. maximum log-likelihood):

Suppose I have some data pairs $\{(x_1, y_1), \ldots, (x_N, y_N)\}$ and I want to come up with a parametric probability density modelling target $y$ given $x$: $$p(y \mid x; \theta)$$

which I use to estimate the true conditional distribution of the data, say $p_\text{true}(y \mid x)$. I can do so via some procedure, e.g. maximizing the log-likelihood:

$$\max_\theta \sum_i \log p(y_i \mid x_i; \theta)$$

Then on future unseen data for $x$, we can give e.g. confidence intervals for its corresponding $y$ given $x$, or just report $y_\text{guess} = y_\text{mode} = \arg\max_y p(y \mid x; \theta)$. $y$ and $x$ can both be continuous and/or discrete.

Decision theory:

A problem arises when we want a point estimate of $y$ and the cost associated with each estimate is not captured purely by which value is most frequent or expected, i.e. we need to do better than picking the mode $$y_\text{guess} = \arg\max_y p(y \mid x;\theta)$$ or the expected value $$\mu_y = \mathbb{E}_{p(y \mid x;\theta)}[Y \mid x]$$ for a particular application.

So suppose I fit a model using maximum likelihood and then want to make point predictions. Since I must pick a single point, I can predict the point which minimizes expected cost; I choose the $y_\text{guess}$ with the lowest average cost over all possible outcomes $y'$:

$$
\begin{aligned}
y_\text{guess} &= \operatorname*{arg\,min}_{y} \int_{y'} L(y, y') \, p(y' \mid x; \theta) \, dy' \\
&= \operatorname*{arg\,min}_{y} \mathbb{E}_{p(y' \mid x;\theta)}\Big[L(y, y')\Big]
\end{aligned}
$$

This is the degree to which I understand decision theory: it is a step taken after the model has been fit, used to pick a point estimate of $y$ given a loss function $L(y, y')$, when the model gives an entire distribution over $y$ but we need a single point estimate, $y_\text{guess}$.
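As a concrete illustration of this step (my own toy sketch, not from any particular source): take the fitted predictive density to be a Gaussian, the loss to be asymmetric, and minimize a Monte Carlo estimate of the expected loss over a grid of candidate points. All numbers here are hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical fitted predictive density p(y'|x; theta): N(2.0, 1.5^2).
# Hypothetical asymmetric loss: under-prediction costs 3x over-prediction.
predictive = stats.norm(loc=2.0, scale=1.5)

def loss(y_guess, y_true):
    err = y_true - y_guess
    return np.where(err > 0, 3.0 * err, -err)

# Monte Carlo estimate of E[L(y, Y')] for a grid of candidate point estimates.
samples = predictive.rvs(size=100_000, random_state=0)
candidates = np.linspace(-3.0, 7.0, 1001)
expected_loss = [loss(c, samples).mean() for c in candidates]
y_guess = candidates[int(np.argmin(expected_loss))]
print(y_guess)  # lands above the mean/mode, since undershooting is costlier
```

Because the loss penalizes undershooting three times as heavily, the expected-loss minimizer is a point above the mean/mode, which is exactly the sense in which the decision step "does better" than reporting $y_\text{mode}$ or $\mu_y$.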

Questions:

  • If the loss $L(y_\text{guess}, y')$ is what we actually care about minimizing, then why not use the following fitting procedure instead of maximum likelihood:

$$\min_{\theta} \sum_i \int_{y'} L(y_i, y') \, p(y' \mid x_i; \theta) \, dy'$$

that is, minimize the expected loss under the parametric model $p(y \mid x; \theta)$? My current understanding is that this approach is called "expected risk minimization" and is sometimes done in practice, but that the parametric model in this case would lose its interpretation as an approximation to the true distribution $p_\text{true}(y \mid x)$. Is my understanding correct? Are there any problems with doing this?
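For concreteness, here is a toy sketch of that objective with a hypothetical Gaussian family and squared loss; the data, the family, and the optimizer choice are all illustrations, not part of the question.

```python
import numpy as np
from scipy.optimize import minimize

# Toy version of the proposed objective (data and model family made up):
# p(y'|x; theta) = N(a + b*x, s^2), squared loss L(y, y') = (y - y')^2.
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=200)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, size=200)

# For this family, E_{y'~N(mu, s^2)}[(y_i - y')^2] = (y_i - mu_i)^2 + s^2
# in closed form, so the sum-over-i objective becomes:
def expected_loss(theta):
    a, b, log_s = theta
    mu = a + b * x
    return np.mean((y - mu) ** 2 + np.exp(2.0 * log_s))  # exp(2*log_s) = s^2

res = minimize(expected_loss, x0=np.zeros(3))
print(res.x)  # (a, b) -> least-squares fit, while log_s is pushed toward -inf
```

The sketch shows one concrete problem with the proposal: for squared loss the expected loss decomposes as $(y_i - \mu_i)^2 + s^2$, so the objective is minimized by driving $s \to 0$, i.e. the fitted "density" collapses toward a point mass and stops resembling $p_\text{true}(y \mid x)$ as a distribution.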


Get this bounty!!!

#StackBounty: #probability #inference #terminology #causality #causal-diagram Determining direct cause with some "realization"…

Bounty: 100

I’m bouncing around "Causality" by Judea Pearl.

On page 222 it offers this definition of a direct cause:

"$X$ is a direct cause of $Y$" if there exist two values $x$ and $x’$ of $X$ and a value $u$ of $U$ such that $Y_{xr}(u) not= Y_{x’r}(u)$ where $r$ is some realization of $V setminus {X, Y}$.

My questions are:

  1. What is a "realization"? Is it the same as Wikipedia’s Realization (probability) definition?
  2. What does the $\setminus$ symbol mean in the context of the two functions $X$ and $Y$? Can you give me an example?
  3. Finally, how do I use this definition in practice?

Let’s say I have two structural causal models:

  1. $X \rightarrow Y$
  2. $X \rightarrow Q \rightarrow Y$

How does this definition of direct cause allow me to discover that $X$ is a direct cause in the first case, and $X$ is not a direct cause in the second case?
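Not an authoritative answer, but here is how I would operationalize the definition on two toy binary models; the structural equations below are my own hypothetical choices, purely to make the quantifiers concrete.

```python
# Toy sketch of using the definition (binary variables; U is the background
# variable). Y_{xr}(u): force X = x, force V \ {X, Y} to realization r,
# then evaluate Y at background u.

# Model 1: X -> Y, with Y := X XOR U. V \ {X, Y} is empty, so r is trivial.
def Y_model1(x, u):
    return x ^ u

# Model 2: X -> Q -> Y, with Q := X and Y := Q XOR U. Here V \ {X, Y} = {Q},
# so a realization r fixes Q to some value q, and Y_{xq}(u) ignores x.
def Y_model2(x, q, u):
    return q ^ u

# Model 1: there exist x, x', u with Y_x(u) != Y_x'(u) -> X IS a direct cause.
assert any(Y_model1(0, u) != Y_model1(1, u) for u in (0, 1))

# Model 2: for every realization q and every u, changing x never changes Y
# -> X is NOT a direct cause of Y (its influence is mediated by Q).
assert all(Y_model2(0, q, u) == Y_model2(1, q, u)
           for q in (0, 1) for u in (0, 1))
```

The key move is that in the chain model the realization $r$ holds the mediator $Q$ fixed, so wiggling $X$ with everything else pinned down can no longer reach $Y$.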


Get this bounty!!!


#StackBounty: #inference #reinforcement-learning #online-algorithms #multiarmed-bandit Bandit-like setup but taking max reward over seq…

Bounty: 50

Similar to my other question Bandit-like setup but taking max reward over multiple heads?, I’m interested in situations like the Multi-Armed Bandit setup, except where the reward is aggregated in a maximising way rather than by summation.

This question is similarly about an altered scenario where the per-round reward is the maximum of a set of choices, but it allows me more information than my previous one.

Imagine per ’round’ I can sequentially choose up to $k$ of my $n$ options. My choice of the next option can depend both on the results of the previous ones this round and on the usual dependence on the entire history of plays to this point. After $k$ pulls, my reward is the maximum of the outcomes of those $k$. Then a new round begins. My overall reward is the sum over rounds.

As in the other question, potentially there is a treatment as a $\binom{n}{k}$-armed bandit while taking care of the correlations between these ‘combo arms’.

An additional approach for this special case could be to treat the initial choice as if $k=1$ (i.e. any standard MAB algorithm), then treat subsequent choices also as $k=1$ MABs, but yielding ‘incremental reward’ of max(0, new_reward - current_best_reward). That would presumably require some special alteration to the reward estimation to zero out any reward mass below the ‘current max’. But I haven’t worked out how to do this appropriately yet and how it might interact with explore-exploit properties of existing algorithms.
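To make that proposal concrete, here is a rough sketch of the incremental-reward idea with an epsilon-greedy base algorithm. The environment (Gaussian outcomes around made-up arm means) and all constants are hypothetical, and this is my reading of the proposal, not a vetted algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, rounds, eps = 100, 10, 2000, 0.1
true_means = rng.uniform(0, 1, n)   # hypothetical per-arm "quality"
counts = np.zeros(n)
est = np.zeros(n)                   # running mean of *incremental* reward per arm

total = 0.0
for _ in range(rounds):
    best, chosen = 0.0, set()       # outcomes assumed roughly nonnegative
    for _ in range(k):
        # never revisit an arm within a round (a repeat yields nothing new)
        avail = [a for a in range(n) if a not in chosen]
        if rng.random() < eps:
            arm = int(rng.choice(avail))            # explore
        else:
            arm = max(avail, key=lambda a: est[a])  # exploit
        chosen.add(arm)
        outcome = rng.normal(true_means[arm], 0.1)
        incr = max(0.0, outcome - best)             # incremental reward
        counts[arm] += 1
        est[arm] += (incr - est[arm]) / counts[arm]
        best = max(best, outcome)
    total += best

print(total / rounds)               # average per-round max reward
```

Note the known gap: the running mean of incremental reward mixes pulls made against different ‘current best’ baselines (a first pull and a tenth pull face very different thresholds), which is exactly the unresolved estimation issue mentioned above.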


As an illustrative example, I am a lone treasure-hunter. Every day the sea washes in junk or treasure to various of $n=100$ beaches and nearby islands. I have time to search $k=10$ beaches in order, but my pockets are only big enough to carry the single best thing I find! At the end of the day, the sea washes everything away. I take whatever best thing I found and add it to my treasure stash, and get ready for the next day. (I am not clever or rich enough to get bigger pockets. Travel time between beaches is fixed and I won’t find anything better by searching the same beach multiple times on the same day.)


Get this bounty!!!

#StackBounty: #probability #inference #references #descriptive-statistics #teaching What are some references to teach statistics to bus…

Bounty: 50

I am going to teach a Statistics course next year and I should cover the basics of probability and statistics for undergrad students in business. They don’t have any background in probability, so I should start with the basics, then cover topics like descriptive statistics, confidence intervals, hypothesis testing, and regression.
I would be thankful if you could recommend some books that offer many real-world, tangible examples and give the intuition behind each topic. Although I personally like a course with rigorous math, I should discuss the motivation and intuition behind the material.


Get this bounty!!!

#StackBounty: #deep-learning #image-classification #convolutional-neural-network #distributed #inference Distributed inference for imag…

Bounty: 50

I would like to take the output of an intermediate layer of a CNN (layer G) and feed it to an intermediate layer of a wider CNN (layer H) to complete the inference.

Challenge: The two layers G, H have different dimensions and thus it can’t be done directly.
Solution: Use a third CNN (call it r) which will take as input the output of layer G and output a valid input for layer H.
Then both the weights of layer G and r will be tuned using the loss function:

$$L(W_G, W_r) = \operatorname{MSE}(\text{output of layer } H, \text{ output of } r)$$

My question: Will this method change only layer G’s weights along with r’s weights? Does the whole system require fine-tuning afterwards to update the weights of the other layers?
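For what it’s worth, here is a minimal PyTorch sketch of the wiring as I understand it; every shape, layer, and the target construction below is a hypothetical stand-in for the two real CNNs.

```python
import torch
import torch.nn as nn

wide_front  = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU())  # wide CNN up to layer H's input
small_front = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())  # small CNN up to layer G
adapter_r   = nn.Conv2d(16, 64, 1)  # r: maps layer G's output to a valid input for layer H

# Only W_G and W_r are optimized, matching L(W_G, W_r) in the question.
opt = torch.optim.Adam(list(small_front.parameters()) +
                       list(adapter_r.parameters()), lr=1e-3)

x = torch.randn(8, 3, 32, 32)       # dummy batch
with torch.no_grad():
    target = wide_front(x)          # what layer H would normally receive

loss = nn.functional.mse_loss(adapter_r(small_front(x)), target)
opt.zero_grad()
loss.backward()
opt.step()                          # gradients reach only r and layer G
```

In this sketch only `small_front` (layer G) and `adapter_r` are updated, because the target is built under `torch.no_grad()` and the optimizer holds only their parameters; whether the layers downstream of H then need fine-tuning is exactly the open question.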


Get this bounty!!!

#StackBounty: #maximum-likelihood #inference #fisher-information Connection between Fisher information and variance of score function

Bounty: 100

The Fisher information’s connection with the negative expected Hessian at $\theta_{\text{MLE}}$ provides insight in the following way: at the MLE, high curvature implies that an estimate of $\theta$ even slightly different from the true MLE would have resulted in a very different likelihood.
$$
\mathbf{I}(\theta) = -\frac{\partial^{2}}{\partial\theta_{i}\,\partial\theta_{j}} l(\theta), \qquad 1 \leq i, j \leq p
$$

This is good, as that means that we can be relatively sure about our estimate.

The other connection, of Fisher information to the variance of the score when evaluated at the MLE, is less clear to me.
$$I(\theta) = E\left[\left(\frac{\partial}{\partial\theta} l(\theta)\right)^{2}\right]$$

The implication is: high Fisher information implies high variance of the score function at the MLE.

Intuitively, this means that the score function is highly sensitive to the sampling of the data, i.e. we are likely to get a non-zero gradient of the likelihood had we sampled different data. This seems to have a negative implication to me. Don’t we want the score function $= 0$ to be highly robust to different samplings of the data?

A lower Fisher information, on the other hand, would indicate that the score function has low variance at the MLE and has mean zero. This implies that regardless of the sampling distribution, we will get a gradient of the log-likelihood of zero (which is good!).
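One way to probe this is a quick simulation of the identity under an assumed model (a Bernoulli setup I made up for illustration): draw many datasets, evaluate the score at the true parameter, and compare its variance to the Fisher information $n / (p(1-p))$.

```python
import numpy as np

# Toy check of Var(score) = Fisher information, for n i.i.d. Bernoulli(p)
# draws: l(p) = sum_i [x_i log p + (1 - x_i) log(1 - p)], so the score at
# the true p is  S = X/p - (n - X)/(1 - p),  with X = sum_i x_i.
rng = np.random.default_rng(0)
p, n, reps = 0.3, 50, 200_000
x_sum = rng.binomial(n, p, size=reps)      # one sufficient statistic per dataset
score = x_sum / p - (n - x_sum) / (1 - p)

print(score.mean())  # ~ 0: the score has mean zero at the true parameter
print(score.var())   # ~ n / (p * (1 - p)) = 238.1, the Fisher information
```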

What am I missing?


Get this bounty!!!

#StackBounty: #hypothesis-testing #bayesian #estimation #inference Posterior calculation on binomial distribution using quadratic loss …

Bounty: 50

Question:

Let $x$ be a binomial variate with parameters $n$ and $p$ ($0 < p < 1$). Using a quadratic error loss function and a prior distribution of $p$ as $\pi(p) = 1$, obtain the Bayes estimate for $p$.

Hey, lately I have been teaching myself Bayes estimators (in relation to statistical inference). We have
$$f(x \mid p) = C^{n}_{x} p^x (1-p)^{n-x}$$

Since the prior distribution is 1, the joint distribution of $x$ and $p$ is just

$$f(x, p) = C^{n}_{x} p^x (1-p)^{n-x}$$

Now, the posterior distribution is directly proportional to the joint distribution of $x$ and $p$:

$$f(p \mid x) \propto C^{n}_{x}\, p^x (1-p)^{n-x}$$

$$f(p \mid x) \propto p^x (1-p)^{n-x}$$

$$f(p \mid x) \propto p^{x+1-1} (1-p)^{n-x+1-1}$$

$$f(p \mid x) \sim \text{Beta}(x+1,\; n-x+1)$$

As we know, the expected value of the posterior distribution is

$$E(p \mid x) = \frac{x+1}{n+2}$$

Now, can someone help with calculating the Bayes risk using the quadratic loss function? I have no idea how to proceed.
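In case it helps as a hint (a standard identity, stated here as a sketch rather than a full worked answer): under quadratic loss the Bayes estimate is the posterior mean $\hat{p} = \frac{x+1}{n+2}$, and its posterior expected loss is the posterior variance, which for the $\text{Beta}(x+1,\, n-x+1)$ posterior above is

$$E\big[(p - \hat{p})^2 \mid x\big] = \operatorname{Var}(p \mid x) = \frac{(x+1)(n-x+1)}{(n+2)^2 (n+3)}$$

using $\operatorname{Var}\big(\text{Beta}(a, b)\big) = \frac{ab}{(a+b)^2(a+b+1)}$ with $a = x+1$, $b = n-x+1$; the Bayes risk is then the average of this posterior variance over the marginal distribution of $x$.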


Get this bounty!!!