#StackBounty: #normal-distribution #stata #beta-distribution Convert normal random variable into beta random variable in STATA

Bounty: 50

I need to generate two random variables – lognormal and beta distributed – while ensuring that the correlation between the two variables is -0.3.

I generated two normal random variables with -0.3 correlation as follows:

matrix C = (1, -0.3 \ -0.3, 1)
drawnorm x y, means(0.921 0) sds(0.174 1) corr(C)
// Here x is a normal rv with mean 0.921 and sd 0.174, while y is a standard normal rv.

Converting x to lognormal is simple:

gen price = exp(x)
// price is now lognormal(0.921, 0.174)

The problem is converting $y \sim N(0,1)$ to $\text{Beta}(\alpha,\beta)$. Is there a way to do it?
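
One standard approach is the probability integral transform: since y is standard normal, u = Φ(y) is uniform on (0, 1), and feeding u through the inverse beta CDF gives a Beta(α, β) variable. The rank correlation with x survives these monotone transforms, though the Pearson correlation will drift somewhat from -0.3. A minimal sketch of the idea in Python/scipy, with hypothetical shape parameters α = 2, β = 5 (in Stata the analogous built-ins should be normal() for the standard normal CDF and invibeta() for the inverse beta CDF):

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# correlated normals as in the Stata code: x ~ N(0.921, 0.174^2), y ~ N(0, 1), corr = -0.3
cov = [[0.174**2, -0.3 * 0.174],
       [-0.3 * 0.174, 1.0]]
x, y = rng.multivariate_normal([0.921, 0.0], cov, size=100_000).T

price = np.exp(x)                  # lognormal(0.921, 0.174)
u = stats.norm.cdf(y)              # probability integral transform: u ~ Uniform(0, 1)
b = stats.beta.ppf(u, 2.0, 5.0)    # hypothetical alpha = 2, beta = 5

print(stats.pearsonr(price, b))    # close to, but not exactly, -0.3 (the transforms are nonlinear)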


Get this bounty!!!

#StackBounty: #normal-distribution #normalization #z-score Accounting for differences between judge scoring

Bounty: 100

Consider a competition with 10,000 entrants and 200 judges. Each entrant gets scored on a scale of 0-100 by 2 different judges for a total of 20,000 scores.

I want to remove any judge-to-judge variation in means and standard deviations. To do this I’m computing a Z-score for each judge’s scores and converting that to a T-score to put it back on a 0-100 scale.

In R I’m doing:

df$z_score <- ave(df$score, df$judge, FUN = function(x) as.numeric(scale(x)))

df$t_score <- df$z_score * 10 + 50  # rescale each judge's z-scores to mean 50, sd 10

In Python the code would be:

from scipy import stats

df['Z-Score'] = df.groupby('judge')['score'].transform(lambda x: stats.zscore(x, ddof=1))

df['T-Score'] = df['Z-Score'] * 10 + 50

However, for a variety of reasons, some judges only scored a handful of entrants. Let’s say between 3 and 20.

Is it valid to calculate a Z-score/T-score for those particular judges’ scores, given that the mean and standard deviation may be poorly estimated from such a small sample, or should I run a different test?
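
A quick way to see how shaky the small-sample judges are is to look at the standard error of each judge’s mean; a short pandas sketch, assuming the df, judge, and score names from the snippets above:

import numpy as np

per_judge = df.groupby('judge')['score'].agg(['count', 'mean', 'std'])
# standard error of each judge's mean; with n = 3 it is large, so
# z-scores built from that mean and sd inherit the noise
per_judge['se_mean'] = per_judge['std'] / np.sqrt(per_judge['count'])
print(per_judge[per_judge['count'] < 20].sort_values('count'))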


Get this bounty!!!

#StackBounty: #normal-distribution #descriptive-statistics #outliers #boosting #extreme-value Decision trees, Gradient boosting and nor…

Bounty: 50

I have a question regarding the normality of predictors. I have 100,000 observations in my data. The problem I am analysing is a classification problem: 5% of the observations (5,000) are assigned to class 1 and the remaining 95,000 to class 0, so the data is highly imbalanced. However, the class 1 observations are expected to have extreme values.

  • What I have done is trim the top 1% and bottom 1% of the data, removing any possible mistakes in the entry of such data.
  • I have also winsorised the data at the 5% and 95% levels (which I have checked and is an accepted practice when dealing with the kind of data that I have).

So:
I plot a density plot of one variable after no outlier manipulation:
[density plot: no outlier treatment]

Here is the same variable after trimming the data at the 1% level:
[density plot: after trimming at 1%]

Here is the variable after being trimmed and then winsorised:
[density plot: after trimming and winsorising]

My question is how I should approach this problem.

First question: should I stop at trimming the data, or should I continue and winsorise as well, to condense the extreme values into more meaningful values (since even after trimming the data I am still left with what I feel are extreme values)? If I just leave the data after trimming, I am left with long tails in the distribution, like the following (however, the observations that I am trying to classify mostly fall at the tail ends of these plots).
[density plots: long tails after trimming]

Second question: since decision trees and gradient-boosted trees decide on splits, does the distribution matter? What I mean is: if the tree splits on a variable at, say, <= -10 (using the plots above), then according to plot 2 (after trimming) and plot 3 (after winsorising) all firms with values <= -10 will be classified as class 1.

Consider the decision tree I created below.

[decision tree]

My argument is: regardless of the spikes in the data (created by winsorisation), the decision tree will make the classification at all observations <= 0. So the distribution of that variable should not matter in making the split; it should only affect the value at which the split occurs, and I should not lose too much predictive power in these tails?
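
This intuition can be checked directly: tree splits depend only on the ordering of feature values, so any strictly increasing transform of a predictor yields the same partitions of the data, only with different split values. (Winsorising is not strictly increasing, since it creates ties at the caps, but capped observations still fall together on one side of any interior split.) A small sketch with synthetic data and scikit-learn; all names and values are hypothetical:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
x = 5 * rng.standard_normal(10_000)               # hypothetical long-tailed predictor
label = (x <= -10) | (rng.random(10_000) < 0.01)  # rare class concentrated in the left tail
X = x.reshape(-1, 1)

tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, label)
tree_mono = DecisionTreeClassifier(max_depth=3, random_state=0).fit(np.tanh(X / 10), label)

# same partition of the data, even though the split *values* differ
print((tree_raw.predict(X) == tree_mono.predict(np.tanh(X / 10))).all())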


Get this bounty!!!

#StackBounty: #probability #distributions #normal-distribution #expected-value #ratio Direct way of calculating $mathbb{E} left[ fra…

Bounty: 150

Considering the following random vectors:

\begin{align}
\textbf{h} &= [h_{1}, h_{2}, \ldots, h_{M}]^{T} \sim \mathcal{CN}\left(\textbf{0}_{M}, d\textbf{I}_{M \times M}\right), \\[8pt]
\textbf{w} &= [w_{1}, w_{2}, \ldots, w_{M}]^{T} \sim \mathcal{CN}\left(\textbf{0}_{M}, \frac{1}{p}\textbf{I}_{M \times M}\right), \\[8pt]
\textbf{y} &= [y_{1}, y_{2}, \ldots, y_{M}]^{T} \sim \mathcal{CN}\left(\textbf{0}_{M}, \left(d + \frac{1}{p}\right)\textbf{I}_{M \times M}\right),
\end{align}

where $\textbf{y} = \textbf{h} + \textbf{w}$ and therefore $\textbf{y}$ and $\textbf{h}$ are not independent.

I’m trying to find the following expectation:

$$\mathbb{E}\left[\frac{\textbf{h}^{H}\textbf{y}\textbf{y}^{H}\textbf{h}}{\|\textbf{y}\|^{4}}\right],$$

where $\|\textbf{y}\|^{4} = (\textbf{y}^{H}\textbf{y})(\textbf{y}^{H}\textbf{y})$. Note that $\textbf{h}^{H}\textbf{y}\textbf{y}^{H}\textbf{h} = |\textbf{h}^{H}\textbf{y}|^{2}$, so the ratio inside the expectation is always real and nonnegative.

In order to find the desired expectation, I’m applying the following approximation:

$$\mathbb{E}\left[\frac{\textbf{x}}{\textbf{z}}\right] \approx \frac{\mathbb{E}[\textbf{x}]}{\mathbb{E}[\textbf{z}]} - \frac{\text{cov}(\textbf{x},\textbf{z})}{\mathbb{E}[\textbf{z}]^{2}} + \frac{\mathbb{E}[\textbf{x}]}{\mathbb{E}[\textbf{z}]^{3}}\text{var}(\textbf{z}).$$

However, applying this approximation to the desired expectation is time consuming and error prone, as it involves expansions with many terms.

I have been wondering if there is a more direct/smarter way of finding the desired expectation.

$\textbf{UPDATE 21-04-2018}$: I’ve run a simulation to identify the shape of the pdf of the ratio inside the expectation operator, and as can be seen below it looks much like the pdf of a Gaussian random variable. Additionally, I’ve noticed that the ratio is always real valued; the imaginary part is always equal to zero.

Is there another kind of approximation that can be used to find the expectation (an analytical/closed-form result, not only the simulated value of the expectation), given that the pdf looks Gaussian and can probably be approximated as such?

[plot: pdf of the ratio inside the expectation operator]
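
For reference, a short Monte Carlo sketch of the simulation described in the update; the values of M, d, and p are hypothetical, since the question does not fix them:

import numpy as np

# hypothetical parameter values
M, d, p = 8, 1.0, 10.0
n_trials = 200_000
rng = np.random.default_rng(1)

def cn(variance, size):
    # circularly symmetric complex normal CN(0, variance * I)
    return np.sqrt(variance / 2) * (rng.standard_normal(size) + 1j * rng.standard_normal(size))

h = cn(d, (n_trials, M))
w = cn(1 / p, (n_trials, M))
y = h + w

num = np.abs(np.sum(np.conj(h) * y, axis=1)) ** 2   # h^H y y^H h = |h^H y|^2, always real
den = np.sum(np.abs(y) ** 2, axis=1) ** 2           # ||y||^4
print((num / den).mean())                           # simulated value of the expectation
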
$\textbf{UPDATE 24-04-2018}$: I’ve found an approximation for the case where $\textbf{h}$ and $\textbf{y}$ are independent:

$$\mathbb{E}\left[\frac{\textbf{h}_{l}^{H}\textbf{y}_{k}\textbf{y}_{k}^{H}\textbf{h}_{l}}{\|\textbf{y}_{k}\|^{4}}\right] = \frac{d_{l}[(M+1)(M-2)+4M+6]}{\zeta_{k}M(M+1)^{2}}$$

where $\zeta_{k} = d_{k} + \frac{1}{p}$, $\textbf{h}_{l} \sim \mathcal{CN}\left(\textbf{0}_{M}, d_{l}\textbf{I}_{M \times M}\right)$ and $\textbf{h}_{k} \sim \mathcal{CN}\left(\textbf{0}_{M}, d_{k}\textbf{I}_{M \times M}\right)$. Note that $\textbf{y}_{k} = \textbf{h}_{k} + \textbf{w}$ and that $\textbf{h}_{k}$ and $\textbf{h}_{l}$ are independent.

I have used the following approximation:
$$\mathbb{E}\left[\frac{\textbf{x}}{\textbf{z}}\right] \approx \frac{\mathbb{E}[\textbf{x}]}{\mathbb{E}[\textbf{z}]} - \frac{\text{cov}(\textbf{x},\textbf{z})}{\mathbb{E}[\textbf{z}]^{2}} + \frac{\mathbb{E}[\textbf{x}]}{\mathbb{E}[\textbf{z}]^{3}}\text{var}(\textbf{z}).$$


Get this bounty!!!
