## #StackBounty: #probability #classification #ensemble #feature-weighting How to describe most important features of ensemble model as li…

### Bounty: 100

I have created 3 different models whose output is a class probability in a binary classification problem. The models are a bit different, drawing importance from different features. I have, of course, a single data matrix as the source for this exercise, where 70% of the data is used as the training sample.

How can one summarize the importance of different feature values to the final class probability estimate if, besides this class probability estimate, only the data matrix and the list of features used are known?

Individual models can of course be explained by different methods, but how can one explain the averaged ensemble predictions?
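One model-agnostic option for the averaged prediction is permutation importance computed directly on the ensemble output. A minimal sketch with three stand-in models (the sigmoid scorers below are made up for illustration; they are not the asker's models):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))  # hypothetical data matrix with 3 features

# Three stand-in "models": each returns a class-1 probability,
# weighting the features differently (sigmoid of a linear score)
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

models = [lambda X: sigmoid(2.0 * X[:, 0]),
          lambda X: sigmoid(X[:, 0] + X[:, 1]),
          lambda X: sigmoid(0.5 * X[:, 2])]

def ensemble_proba(X):
    # The averaged ensemble prediction
    return np.mean([m(X) for m in models], axis=0)

# Permutation importance on the ensemble output: shuffle one feature
# and measure how much the averaged probability moves on average.
base = ensemble_proba(X)
importance = []
for j in range(X.shape[1]):
    X_perm = X.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])
    importance.append(np.mean(np.abs(ensemble_proba(X_perm) - base)))
```

The same loop works with any fitted models: replace the lambdas with calls to each model's predict-probability method and average, so the explanation targets the ensemble output rather than any single member.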

Get this bounty!!!

## #StackBounty: #probability #statistical-significance #mathematical-statistics #experiment-design Significance level example from "…

### Bounty: 50

Fisher immediately realized that this argument fails because every
possible result with the 6 pairs has probability (1/2)^6 = 1/64, so
every result is significant at 5%. Fisher avoided this absurdity by
saying that any outcome with just 1 W and 5 R’s, no matter where that
W occurred, is equally suggestive of discriminatory powers and so
should be included. There are 6 such possibilities, including the
actual outcome, so the relevant probability for (a) above is 6(1/2)^6
= 6/64 = .094, so now the result is not significant at 5%.

I do not understand how 1/64 is significant at 5% but 6/64 is not. It makes more sense to me that the bigger of the two numbers would be deemed significant, as it describes something that happens more often.

What is wrong with my reasoning?
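As a sanity check on the arithmetic, assuming the 6-pair setup from the quote:

```python
from math import comb

n = 6                            # six pairs, each guessed correctly with prob 1/2
p_single = 0.5 ** n              # probability of any one specific outcome: 1/64
p_event = comb(n, 1) * p_single  # 6 outcomes with exactly one W: 6/64 ≈ 0.094
```

Fisher's point, as quoted, is that the relevant p-value is the probability of the whole event (all outcomes at least as suggestive as the one observed), not of the single observed sequence; that event probability, 6/64, exceeds 0.05 even though every individual sequence has probability 1/64.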

## #StackBounty: #probability #distance-functions #metric #measure-theory When does the Wasserstein metric attain inequality WLOG?

### Bounty: 50

I’m reading a classic paper [1] that describes a version of the Wasserstein metric (aka the Mallows metric), defined as follows. Let $F$ and $G$ be probability distributions on $\mathbb{R}^B$, and let $U \sim F$ and $V \sim G$ be $B$-valued RVs with marginal distributions $F$ and $G$ and an arbitrary joint distribution. Then:

$$d_p(F,G) := \inf\limits_{\substack{U \sim F \\ V \sim G}} E\Big[ \| U-V \|^p \Big]^{1/p}$$

The paper says the infimum is always attained for some joint distribution of \$(U,V)\$ and then goes on to prove various results by considering only the case in which the infimum is attained, “without loss of generality”. For example (Lemma 8.6):

In view of Lemma 8.1 [stating that the infimum is attained and that $d_2(\cdot,\cdot)$ is a metric], assume without loss of generality that $(U, V)$ are independent and $$d_p(F, G) = E\Big[ \| U-V \|^p \Big]^{1/p}$$

First, I don’t understand why this holds WLOG. Second, I’m confused about why the infimum would be attained for independent $(U,V)$: intuitively, wouldn’t the opposite be true? I would expect the “distance” to be smallest when $(U,V)$ are as correlated as their marginals allow.
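The intuition behind the second question can be checked numerically in one dimension, where pairing samples by rank (the comonotone coupling) is known to attain the infimum for $d_2$. A small sketch with illustrative Gaussian marginals (these distributions are my choice, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
u = rng.normal(0.0, 1.0, n)   # U ~ F = N(0, 1)
v = rng.normal(1.0, 2.0, n)   # V ~ G = N(1, 4)

# Independent coupling: pair the samples as drawn
cost_indep = np.mean((u - v) ** 2)

# Comonotone (quantile) coupling: pair sorted samples
cost_sorted = np.mean((np.sort(u) - np.sort(v)) ** 2)
```

For these Gaussians the comonotone cost approaches $(\mu_1-\mu_2)^2 + (\sigma_1-\sigma_2)^2 = 2$, well below the independent cost of $6$, which matches the intuition that the minimizing coupling is maximally correlated rather than independent.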

## References

[1] Bickel, P. J., & Freedman, D. A. (1981). Some asymptotic theory for the bootstrap. The Annals of Statistics, 1196-1217.

## #StackBounty: #probability #distributions #pdf #quantiles #cdf Find the PDF from quantiles

### Bounty: 50

I have been presented with a problem of this kind:
suppose I know the values of $k$ quantiles of a continuous random variable $X$,

$$X_{1\%} = x_1,\; X_{5\%} = x_2,\; \dots,\; X_{99\%} = x_{k}$$

so that

$$ F_X(x_1) = 1\%,\; F_X(x_2) = 5\%,\; \dots,\; F_X(x_k) = 99\% $$

From this information I want to draw the chart of the PDF.

I thought that I could proceed this way:

- interpolate $F_X(x)$ to get a smooth CDF (for instance, spline interpolation);
- take the (numerical) derivative of the smoothed CDF at some points to obtain the PDF.
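As a sketch of these two steps (with made-up quantile values roughly matching a standard normal), a shape-preserving interpolator such as `PchipInterpolator` keeps the interpolated CDF monotone, so its derivative stays non-negative:

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

# Hypothetical known quantiles: probability levels and corresponding values
probs = np.array([0.01, 0.05, 0.25, 0.50, 0.75, 0.95, 0.99])
xs = np.array([-2.33, -1.64, -0.67, 0.00, 0.67, 1.64, 2.33])

# Step 1: monotone spline through (x, F(x)) -> smooth CDF
cdf = PchipInterpolator(xs, probs)

# Step 2: differentiate the spline -> PDF on a grid
pdf = cdf.derivative()
grid = np.linspace(xs[0], xs[-1], 200)
density = pdf(grid)
```

A plain cubic spline can overshoot and produce a locally decreasing CDF (i.e. a negative "density"), which is why a monotone scheme is the safer default here.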

Are there other more direct methods to address this problem? Do you think my solution is solid?

Thank you.

## #StackBounty: #r #probability #distributions #weibull Fitting weibull distribution to percentage vs proportion data

### Bounty: 50

EDIT

After studying for a bit, I now understand what I need to do: I am trying to fit a Weibull regression to the data below.

``````
x <- structure(list(year = c(2016L, 2016L, 2016L, 2016L, 2016L, 2016L,
2016L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2014L,
2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2013L, 2013L,
2013L, 2013L, 2013L, 2013L, 2013L, 2012L, 2012L, 2012L, 2012L,
2012L, 2012L, 2012L, 2012L, 2016L, 2016L, 2016L, 2016L, 2016L,
2016L, 2015L, 2015L, 2015L, 2015L, 2015L, 2014L, 2014L, 2014L,
2014L, 2014L, 2014L, 2013L, 2013L, 2013L, 2013L, 2013L, 2012L,
2012L, 2012L, 2012L, 2012L, 2012L, 2016L, 2016L, 2016L, 2016L,
2016L, 2016L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L,
2014L, 2014L, 2014L, 2014L, 2014L, 2013L, 2013L, 2013L, 2013L,
2013L, 2013L, 2012L, 2012L, 2012L, 2012L, 2012L, 2016L, 2016L,
2016L, 2016L, 2016L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L,
2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2013L, 2013L, 2013L,
2013L, 2013L, 2013L, 2013L, 2012L, 2012L, 2012L, 2012L, 2012L,
2012L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2015L, 2015L,
2015L, 2015L, 2015L, 2015L, 2014L, 2014L, 2014L, 2014L, 2014L,
2014L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2012L, 2012L,
2012L, 2012L, 2012L, 2012L, 2016L, 2016L, 2016L, 2016L, 2016L,
2016L, 2015L, 2015L, 2015L, 2015L, 2015L, 2014L, 2014L, 2014L,
2014L, 2014L, 2014L, 2013L, 2013L, 2013L, 2013L, 2013L, 2012L,
2012L, 2012L, 2012L, 2012L, 2016L, 2016L, 2016L, 2016L, 2016L,
2016L, 2016L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2014L,
2014L, 2014L, 2014L, 2014L, 2014L, 2013L, 2013L, 2013L, 2013L,
2013L, 2012L, 2012L, 2012L, 2012L, 2012L, 2016L, 2016L, 2016L,
2016L, 2016L, 2016L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L,
2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2013L, 2013L, 2013L,
2013L, 2013L, 2013L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L,
2016L, 2016L, 2016L, 2016L, 2016L, 2015L, 2015L, 2015L, 2015L,
2015L, 2015L, 2015L, 2014L, 2014L, 2014L, 2014L, 2014L, 2013L,
2013L, 2013L, 2013L, 2013L, 2012L, 2012L, 2012L, 2012L, 2012L,
2012L, 2016L, 2016L, 2016L, 2016L, 2016L, 2015L, 2015L, 2015L,
2015L, 2015L, 2014L, 2014L, 2014L, 2014L, 2014L, 2013L, 2013L,
2013L, 2013L, 2013L, 2012L, 2012L, 2012L, 2012L, 2012L, 2016L,
2016L, 2016L, 2016L, 2016L, 2016L, 2015L, 2015L, 2015L, 2015L,
2015L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2013L, 2013L,
2013L, 2013L, 2013L, 2013L, 2012L, 2012L, 2012L, 2012L, 2012L,
2016L, 2016L, 2016L, 2016L, 2015L, 2015L, 2015L, 2015L, 2015L,
2015L, 2014L, 2014L, 2014L, 2014L, 2014L, 2013L, 2013L, 2013L,
2013L, 2013L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L
), week.id = c(17L, 18L, 19L, 20L, 21L, 22L, 23L, 18L, 19L, 20L,
21L, 22L, 23L, 24L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 18L,
19L, 20L, 21L, 22L, 23L, 24L, 17L, 18L, 19L, 20L, 21L, 22L, 23L,
24L, 19L, 20L, 21L, 22L, 23L, 24L, 20L, 21L, 22L, 23L, 24L, 19L,
20L, 21L, 22L, 23L, 24L, 20L, 21L, 22L, 23L, 24L, 19L, 20L, 21L,
22L, 23L, 24L, 19L, 20L, 21L, 22L, 23L, 24L, 18L, 19L, 20L, 21L,
22L, 23L, 24L, 19L, 20L, 21L, 22L, 23L, 19L, 20L, 21L, 22L, 23L,
24L, 19L, 20L, 21L, 22L, 23L, 19L, 20L, 21L, 22L, 23L, 19L, 20L,
21L, 22L, 23L, 24L, 18L, 19L, 20L, 21L, 22L, 23L, 18L, 19L, 20L,
21L, 22L, 23L, 24L, 18L, 19L, 20L, 21L, 22L, 23L, 19L, 20L, 21L,
22L, 23L, 24L, 19L, 20L, 21L, 22L, 23L, 24L, 19L, 20L, 21L, 22L,
23L, 24L, 19L, 20L, 21L, 22L, 23L, 24L, 19L, 20L, 21L, 22L, 23L,
24L, 18L, 19L, 20L, 21L, 22L, 23L, 19L, 20L, 21L, 22L, 23L, 18L,
19L, 20L, 21L, 22L, 23L, 19L, 20L, 21L, 22L, 23L, 19L, 20L, 21L,
22L, 23L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 18L, 19L, 20L, 21L,
22L, 23L, 18L, 19L, 20L, 21L, 22L, 23L, 18L, 19L, 20L, 21L, 22L,
18L, 19L, 20L, 21L, 22L, 18L, 19L, 20L, 21L, 22L, 23L, 18L, 19L,
20L, 21L, 22L, 23L, 18L, 19L, 20L, 21L, 22L, 23L, 18L, 19L, 20L,
21L, 22L, 23L, 18L, 19L, 20L, 21L, 22L, 23L, 21L, 22L, 23L, 24L,
25L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 20L, 21L, 22L, 23L, 24L,
20L, 21L, 22L, 23L, 24L, 19L, 20L, 21L, 22L, 23L, 24L, 20L, 21L,
22L, 23L, 24L, 21L, 22L, 23L, 24L, 25L, 20L, 21L, 22L, 23L, 24L,
20L, 21L, 22L, 23L, 24L, 20L, 21L, 22L, 23L, 24L, 20L, 21L, 22L,
23L, 24L, 25L, 22L, 23L, 24L, 25L, 26L, 20L, 21L, 22L, 23L, 24L,
25L, 20L, 21L, 22L, 23L, 24L, 25L, 21L, 22L, 23L, 24L, 25L, 22L,
23L, 24L, 25L, 21L, 22L, 23L, 24L, 25L, 26L, 21L, 22L, 23L, 24L,
25L, 21L, 22L, 23L, 24L, 25L, 20L, 21L, 22L, 23L, 24L, 25L, 26L
), location.id = c(41L, 41L, 41L, 41L, 41L, 41L, 41L, 41L, 41L,
41L, 41L, 41L, 41L, 41L, 41L, 41L, 41L, 41L, 41L, 41L, 41L, 41L,
41L, 41L, 41L, 41L, 41L, 41L, 41L, 41L, 41L, 41L, 41L, 41L, 41L,
41L, 41L, 43L, 43L, 43L, 43L, 43L, 43L, 43L, 43L, 43L, 43L, 43L,
43L, 43L, 43L, 43L, 43L, 43L, 43L, 43L, 43L, 43L, 43L, 43L, 43L,
43L, 43L, 43L, 43L, 42L, 42L, 42L, 42L, 42L, 42L, 42L, 42L, 42L,
42L, 42L, 42L, 42L, 42L, 42L, 42L, 42L, 42L, 42L, 42L, 42L, 42L,
42L, 42L, 42L, 42L, 42L, 42L, 42L, 35L, 35L, 35L, 35L, 35L, 35L,
35L, 35L, 35L, 35L, 35L, 35L, 35L, 35L, 35L, 35L, 35L, 35L, 35L,
35L, 35L, 35L, 35L, 35L, 35L, 35L, 35L, 35L, 35L, 35L, 31L, 31L,
31L, 31L, 31L, 31L, 31L, 31L, 31L, 31L, 31L, 31L, 31L, 31L, 31L,
31L, 31L, 31L, 31L, 31L, 31L, 31L, 31L, 31L, 31L, 31L, 31L, 31L,
31L, 31L, 52L, 52L, 52L, 52L, 52L, 52L, 52L, 52L, 52L, 52L, 52L,
52L, 52L, 52L, 52L, 52L, 52L, 52L, 52L, 52L, 52L, 52L, 52L, 52L,
52L, 52L, 52L, 50L, 50L, 50L, 50L, 50L, 50L, 50L, 50L, 50L, 50L,
50L, 50L, 50L, 50L, 50L, 50L, 50L, 50L, 50L, 50L, 50L, 50L, 50L,
50L, 50L, 50L, 50L, 50L, 50L, 51L, 51L, 51L, 51L, 51L, 51L, 51L,
51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L,
51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 29L, 29L, 29L,
29L, 29L, 29L, 29L, 29L, 29L, 29L, 29L, 29L, 29L, 29L, 29L, 29L,
29L, 29L, 29L, 29L, 29L, 29L, 29L, 29L, 29L, 29L, 29L, 29L, 17L,
17L, 17L, 17L, 17L, 17L, 17L, 17L, 17L, 17L, 17L, 17L, 17L, 17L,
17L, 17L, 17L, 17L, 17L, 17L, 17L, 17L, 17L, 17L, 17L, 21L, 21L,
21L, 21L, 21L, 21L, 21L, 21L, 21L, 21L, 21L, 21L, 21L, 21L, 21L,
21L, 21L, 21L, 21L, 21L, 21L, 21L, 21L, 21L, 21L, 21L, 21L, 21L,
22L, 22L, 22L, 22L, 22L, 22L, 22L, 22L, 22L, 22L, 22L, 22L, 22L,
22L, 22L, 22L, 22L, 22L, 22L, 22L, 22L, 22L, 22L, 22L, 22L, 22L,
22L), cumulative.percentage = c(4.1, 14, 31.9, 60, 81.2, 94.9,
100, 5.1, 26.3, 66, 80.7, 91, 98.1, 100, 1.2, 7, 25.6, 47, 73.1,
90, 98.9, 100, 2.2, 20.3, 47.2, 77.7, 95, 99.1, 100, 0.5, 3,
22.2, 46, 77.3, 97, 98.9, 100, 1, 9, 38, 79, 94, 100, 6, 28,
77, 92, 100, 3, 14, 46, 85, 99, 100, 6, 27, 63, 95, 100, 4, 14,
36, 78, 98, 100, 11, 31, 66, 88, 98, 100, 5, 17, 35, 67, 86,
98, 100, 7, 28, 62, 90, 100, 5, 22, 51, 81, 99, 100, 9, 20, 46,
83, 100, 14, 35, 73, 96, 100, 8, 26, 66, 93, 99, 100, 2, 12,
31, 61, 95, 100, 5, 21, 49, 81, 92, 98, 100, 1, 10, 31, 68, 93,
100, 9, 28, 63, 85, 96, 100, 2, 14, 52, 80, 93, 100, 1, 15, 41,
74, 91, 100, 11, 31, 70, 85, 95, 100, 2, 10, 56, 88, 99, 100,
1, 14, 43, 79, 98, 100, 6, 19, 59, 91, 100, 1, 12, 39, 70, 96,
100, 9, 28, 67, 89, 100, 12, 30, 67, 88, 100, 1, 4, 26, 63, 87,
99, 100, 1, 20, 62, 88, 98, 100, 4, 23, 56, 83, 99, 100, 1, 16,
55, 83, 100, 2, 22, 63, 91, 100, 4.51, 31.4, 67.73, 88.38, 97.59,
100, 1.7, 14.3, 38, 83.7, 95.6, 100, 2, 8.6, 20.4, 67.2, 93.3,
100, 0.8, 9.1, 50.6, 86, 97.2, 100, 1.8, 17.4, 50.3, 77.8, 97.4,
100, 27, 67, 91, 98, 100, 1, 19, 57, 78, 88, 95, 100, 6, 28,
65, 88, 100, 3, 17, 53, 82, 100, 1, 9, 34, 71, 90, 100, 4, 15,
35, 76, 100, 10, 43, 79, 98, 100, 3, 22, 63, 87, 100, 7, 29,
69, 92, 100, 3, 26, 62, 89, 100, 3, 11, 21, 45, 81, 100, 9, 20,
37, 71, 100, 7, 20, 59, 84, 96, 100, 6, 18, 63, 86, 94, 100,
8, 27, 61, 88, 100, 12, 34, 69, 100, 2, 18, 39, 64, 94, 100,
7, 23, 47, 78, 100, 3, 20, 46, 79, 100, 7, 17, 31, 59, 71, 93,
100)), class = "data.frame", row.names = c(NA, -348L))
``````

This is my data, which shows the pooled cumulative percentage of growth in a variable from multiple locations and years; hence it goes from 0% to 100%. It is not always the same week that reaches 100%. What I want to do is fit a Weibull regression to each location and iteratively change the shape and scale parameters in order to get the best match between the observed and predicted values. Hence, for each location I will have different shape and scale parameters.
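As a minimal sketch of the per-location idea (in Python rather than R for illustration, with one location-year hard-coded from the data above), least-squares fitting of the two-parameter Weibull CDF to a cumulative curve could look like:

```python
import numpy as np
from scipy.optimize import curve_fit

def weibull_cdf(t, shape, scale):
    # Two-parameter Weibull CDF: F(t) = 1 - exp(-(t/scale)^shape)
    return 1.0 - np.exp(-(t / scale) ** shape)

# One location-year from the data above (location 41, year 2016),
# converted from cumulative percentage to proportion
weeks = np.array([17., 18., 19., 20., 21., 22., 23.])
cum_prop = np.array([4.1, 14, 31.9, 60, 81.2, 94.9, 100]) / 100

# Shift time so the curve starts just before the first observed week
t = weeks - weeks[0] + 1

params, _ = curve_fit(weibull_cdf, t, cum_prop, p0=[2.0, 4.0])
shape, scale = params
resid = np.max(np.abs(weibull_cdf(t, *params) - cum_prop))
```

The analogous fit in R would use `nls` with the same CDF formula and starting values, looped over locations, so each location gets its own shape and scale.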

Thank you to the many members here who helped articulate the question clearly. I hope I can get some advice on how to do a Weibull regression in R. Thank you.

## #StackBounty: #r #probability #distributions #weibull Fitting Weibull distribution in R

### Bounty: 50

My sample data:

``````
dat <- structure(list(cum.per.plant = c(0.051, 0.263, 0.66, 0.807, 0.91,
0.981, 1, 0.012, 0.07, 0.256, 0.47, 0.731, 0.9, 0.989, 1, 0.022,
0.203, 0.472, 0.777, 0.95, 0.991, 1, 0.005, 0.03, 0.222, 0.46,
0.773, 0.97, 0.989, 1, 0.06, 0.28, 0.77, 0.92, 1, 0.03, 0.14,
0.46, 0.85, 0.99, 1, 0.06, 0.27, 0.63, 0.95, 1, 0.04, 0.14, 0.36,
0.78, 0.98, 1, 0.05, 0.17, 0.35, 0.67, 0.86, 0.98, 1, 0.07, 0.28,
0.62, 0.9, 1, 0.05, 0.22, 0.51, 0.81, 0.99, 1, 0.09, 0.2, 0.46,
0.83, 1, 0.08, 0.26, 0.66, 0.93, 0.99, 1, 0.02, 0.12, 0.31, 0.61,
0.95, 1, 0.05, 0.21, 0.49, 0.81, 0.92, 0.98, 1, 0.01, 0.1, 0.31,
0.68, 0.93, 1, 0.02, 0.14, 0.52, 0.8, 0.93, 1, 0.01, 0.15, 0.41,
0.74, 0.91, 1, 0.11, 0.31, 0.7, 0.85, 0.95, 1, 0.02, 0.1, 0.56,
0.88, 0.99, 1, 0.06, 0.19, 0.59, 0.91, 1, 0.01, 0.12, 0.39, 0.7,
0.96, 1, 0.09, 0.28, 0.67, 0.89, 1, 0.12, 0.3, 0.67, 0.88, 1,
0.01, 0.2, 0.62, 0.88, 0.98, 1, 0.04, 0.23, 0.56, 0.83, 0.99,
1, 0.01, 0.16, 0.55, 0.83, 1, 0.02, 0.22, 0.63, 0.91, 1, 0.017,
0.143, 0.38, 0.837, 0.956, 1, 0.02, 0.086, 0.204, 0.672, 0.933,
1, 0.008, 0.091, 0.506, 0.86, 0.972, 1, 0.018, 0.174, 0.503,
0.778, 0.974, 1, 0.01, 0.19, 0.57, 0.78, 0.88, 0.95, 1, 0.06,
0.28, 0.65, 0.88, 1, 0.03, 0.17, 0.53, 0.82, 1, 0.01, 0.09, 0.34,
0.71, 0.9, 1, 0.1, 0.43, 0.79, 0.98, 1, 0.03, 0.22, 0.63, 0.87,
1, 0.07, 0.29, 0.69, 0.92, 1, 0.03, 0.26, 0.62, 0.89, 1, 0.09,
0.2, 0.37, 0.71, 1, 0.07, 0.2, 0.59, 0.84, 0.96, 1, 0.06, 0.18,
0.63, 0.86, 0.94, 1, 0.08, 0.27, 0.61, 0.88, 1, 0.02, 0.18, 0.39,
0.64, 0.94, 1, 0.07, 0.23, 0.47, 0.78, 1, 0.03, 0.2, 0.46, 0.79,
1, 0.07, 0.17, 0.31, 0.59, 0.71, 0.93, 1), loc.id = c(7L, 7L,
7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L,
7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 9L, 9L, 9L, 9L,
9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L,
9L, 9L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L,
10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 11L,
11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L,
11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 6L, 6L,
6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L,
6L, 6L, 6L, 6L, 6L, 6L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 4L, 4L, 4L, 4L, 4L,
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L,
4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L,
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L,
12L, 12L, 12L, 12L, 12L, 12L, 12L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 8L,
8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L,
8L, 8L, 8L, 8L, 8L, 8L), year.id = c(4L, 4L, 4L, 4L, 4L, 4L,
4L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 3L, 3L, 3L,
3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 4L, 4L,
4L, 4L, 4L, 4L, 4L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L,
1L, 1L, 1L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 4L, 3L, 3L, 3L, 3L, 3L,
3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 4L, 4L,
4L, 4L, 4L, 4L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L,
1L, 1L, 1L, 1L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 3L, 3L, 3L, 3L, 3L,
3L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 4L, 4L, 4L, 4L, 4L,
4L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L,
1L, 4L, 4L, 4L, 4L, 4L, 4L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L,
2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 4L, 4L,
3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L,
4L, 4L, 4L, 4L, 4L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 1L,
1L, 1L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 3L, 3L, 3L, 3L, 3L, 3L, 2L,
2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 4L,
3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), time.id = c(2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L,
5L, 6L, 7L, 8L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L,
6L, 7L, 8L, 4L, 5L, 6L, 7L, 8L, 3L, 4L, 5L, 6L, 7L, 8L, 4L, 5L,
6L, 7L, 8L, 3L, 4L, 5L, 6L, 7L, 8L, 2L, 3L, 4L, 5L, 6L, 7L, 8L,
3L, 4L, 5L, 6L, 7L, 3L, 4L, 5L, 6L, 7L, 8L, 3L, 4L, 5L, 6L, 7L,
3L, 4L, 5L, 6L, 7L, 8L, 2L, 3L, 4L, 5L, 6L, 7L, 2L, 3L, 4L, 5L,
6L, 7L, 8L, 2L, 3L, 4L, 5L, 6L, 7L, 3L, 4L, 5L, 6L, 7L, 8L, 3L,
4L, 5L, 6L, 7L, 8L, 3L, 4L, 5L, 6L, 7L, 8L, 3L, 4L, 5L, 6L, 7L,
8L, 3L, 4L, 5L, 6L, 7L, 2L, 3L, 4L, 5L, 6L, 7L, 3L, 4L, 5L, 6L,
7L, 3L, 4L, 5L, 6L, 7L, 2L, 3L, 4L, 5L, 6L, 7L, 2L, 3L, 4L, 5L,
6L, 7L, 2L, 3L, 4L, 5L, 6L, 2L, 3L, 4L, 5L, 6L, 2L, 3L, 4L, 5L,
6L, 7L, 2L, 3L, 4L, 5L, 6L, 7L, 2L, 3L, 4L, 5L, 6L, 7L, 2L, 3L,
4L, 5L, 6L, 7L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 4L, 5L, 6L, 7L,
8L, 4L, 5L, 6L, 7L, 8L, 3L, 4L, 5L, 6L, 7L, 8L, 5L, 6L, 7L, 8L,
9L, 4L, 5L, 6L, 7L, 8L, 4L, 5L, 6L, 7L, 8L, 4L, 5L, 6L, 7L, 8L,
6L, 7L, 8L, 9L, 10L, 4L, 5L, 6L, 7L, 8L, 9L, 4L, 5L, 6L, 7L,
8L, 9L, 5L, 6L, 7L, 8L, 9L, 5L, 6L, 7L, 8L, 9L, 10L, 5L, 6L,
7L, 8L, 9L, 5L, 6L, 7L, 8L, 9L, 4L, 5L, 6L, 7L, 8L, 9L, 10L)), class = "data.frame", row.names = c(NA,
-279L))
``````

This data has four columns:

`cum.per.plant`: cumulative proportion of area planted with a crop (hence goes from 0 to 1)

`loc.id`: locations where data was collected

`year.id`: the years when the data was collected

`time.id`: id of the time points when the data was collected.

Hence, for loc.id 7 and year.id 4, planting begins in week 2 and reaches 100% in week 8.

I want to implement the paragraph below from this paper (if you want to read it):
https://www.dropbox.com/s/v36i8npfwbutiro/Yang%20et%20al.%202017.pdf?dl=0

``````
Preliminary analysis of the planting data indicates that once
planting is initiated, the cumulative proportion of fields planted for a
crop in a year at the county level follows a sigmoid pattern, but this can
be modified by planting delays due to soil being too wet. We thus fitted
the observed data to the following modified Weibull distribution function

ProportionFields = 1 - exp(-((DOY - DOYplanting.initiation - Days.no.plant)/a)^b)

where ProportionFields is the cumulative proportion of fields that have
been planted in a county, DOY is a calendar day of year (DOY >=
DOYplanting.initiation), DOYplanting.initiation is the
calendar day of year of the earliest planting, and Days.no.plant
is the total number of days when planting does not occur, due to soil being
too wet, since the start of planting. a is a scale parameter and b is a
shape parameter. DOY - DOYplanting.initiation - Days.no.plant represents
the effective number of days available for planting since the start of
planting.
``````

I want to use the above approach, so I planned to do this:

1) Fit a distribution to the data. In the above example they fitted a Weibull distribution, so I did the same:

``````
library(fitdistrplus)
fw <- fitdist(dat$cum.per.plant, "weibull")
summary(fw)  # shape: 1.2254029, scale: 0.6022573
``````

My first question is:
1) Will the shape and scale parameters be affected by the time step? I.e., if this data were collected at the daily level, would my shape and scale parameters be divided by a certain factor?
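On question 1, one relevant property of the Weibull CDF is that the time unit only rescales the scale parameter, while the shape is unit-free: if the scale is $a$ in weeks, the same curve in days has scale $7a$. A quick check of this identity (with arbitrary illustrative numbers, not the asker's data):

```python
import math

def weibull_cdf(t, shape, scale):
    # F(t) = 1 - exp(-(t/scale)^shape)
    return 1.0 - math.exp(-(t / scale) ** shape)

shape, scale_weeks = 1.5, 3.0
t_weeks = 2.0
# Same event expressed in days: time and scale both multiplied by 7
lhs = weibull_cdf(t_weeks, shape, scale_weeks)
rhs = weibull_cdf(7.0 * t_weeks, shape, 7.0 * scale_weeks)
```

Note this says nothing about whether the *fitted* parameters change when the sampling resolution changes; it only shows how the scale parameter transforms under a change of units.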

Now, after I get the parameters, I want to implement this equation to calculate the proportion planted each day for a given year and location.

``````
prop.planted <- 1 - exp(-(x / scale)^shape)
``````

where x = (day of year) - (day of year when planting started) - (number of days with no planting since the start of planting).

I have the data needed to calculate x.

Is my understanding of the equation and of the above paper correct?
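Under that reading of the paper's equation, a direct translation (with hypothetical numbers; `scale` and `shape` would come from the fit) is:

```python
import math

def prop_planted(doy, doy_start, days_no_plant, scale, shape):
    # Effective planting days since initiation, excluding no-plant days
    x = doy - doy_start - days_no_plant
    if x <= 0:
        return 0.0
    return 1.0 - math.exp(-(x / scale) ** shape)

# Hypothetical values: planting started DOY 120, 5 wet days, fitted a=20, b=2
p = prop_planted(doy=150, doy_start=120, days_no_plant=5, scale=20.0, shape=2.0)
```

The same shifted-time trick carries over to any other CDF (e.g. a beta CDF scaled to the planting window): compute x first, then evaluate the chosen distribution function at x.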

EDIT:

Suppose the data follows a beta distribution (and not a Weibull distribution). How can I implement the factor where I calculate x in the beta distribution?

Thank you.

## #StackBounty: #probability #bayesian #python #inference #posterior How to structure this problem in a bayesian paradigm? (Updating the …

### Bounty: 50

I have some code that randomly generates a number from `m_min=10` to `m_max=100` (apologies if this nomenclature is unconventional) for a total of `(m_max - m_min) + 1 = 91` positions (`n_positions`), each with equal probability of being picked (i.e. a uniform distribution in the form of a probability vector).

I’m running a function on each number and receiving a score; if it’s better than the previous score, I want to update the weights for each position. Instead of discretely updating the weights, I also want to update the weights of the neighboring positions, for a smooth curve.

My thoughts for implementing was to do the following:

(1) update the weights for `position_k`;

(2) fit a curve to a beta distribution (this part gets tricky); and then

(3) when I draw another number from `m_min` to `m_max`, call it `query_num`, I will divide it by `n_positions` (i.e. the total number of positions) to get my `x` for calculating the probability of drawing that value from the beta distribution.

I believe this is a Bayesian problem, since I am using a uniform distribution as my prior and updating the posterior to get a beta distribution (if this is incorrect, please inform me).

I am clearly missing a fundamental step in the logic.

My ultimate goal is to “update” the posterior each time `condition == True` by adding mass to the regions around `query_num`, recomputing the probabilities based on a beta distribution, and, in this case, plotting the transformation.

I can’t use a simple `stats.beta.fit` because I don’t have draws from the distribution, only probabilities for the `x` values that I am updating.

``````
import sys
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy import stats
from scipy.optimize import curve_fit

# Defaults
m_min = 10
m_max = 100

# Uniform probabilities
n_positions = m_max - m_min + 1
positions = np.arange(m_min, m_max + 1)
x = np.linspace(0, 1, n_positions)
# Prior probabilities
probs_num = np.ones(n_positions)/n_positions  # Each value is 0.01098901

# Placeholder function
def func(x):
    # This isn't actually what I'm doing, but it lets the code run for this example
    return 15 < x < 30

n_iter = 100
update_value = 1e-3
with plt.style.context("seaborn-white"):
    fig, ax = plt.subplots()
    ax.plot(x, probs_num, label="Prior", color="black")
    for i, color in enumerate(sns.color_palette(n_colors=n_iter)):
        rs = np.random.RandomState(i)
        # Get the index into the position vector, which is why I subtract `m_min`
        query_num = rs.choice(positions, size=1, replace=False, p=probs_num)[0] - m_min
        # Again, the actual condition is more complicated
        condition = func(query_num)
        # If it meets the criteria, update the "mass" in this area of the distribution
        if condition:
            probs_num[query_num] += update_value
            # Normalize the values to probabilities again
            probs_num = probs_num/probs_num.sum()
            # Fit the curve to a beta distribution (this is where it gets messed up)
            params_fitted, _ = curve_fit(lambda x, a, b: stats.beta.pdf(x, a, b),
                                         xdata=x,
                                         ydata=probs_num,
                                         )
            ax.plot(x, stats.beta(*params_fitted).pdf(x), color=color, alpha=0.5, label=i)
    legend_kws = {'bbox_to_anchor': (1, 0.5), 'edgecolor': 'black', 'facecolor': 'white', 'fontsize': 10, 'frameon': True, 'loc': 'center left'}
    ax.legend(**legend_kws)
``````
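One possible workaround for the `curve_fit` step, assuming the weights are kept as a normalized discrete density on [0, 1]: match the first two weighted moments to a beta distribution instead of fitting the pdf pointwise. A sketch with a made-up weight vector:

```python
import numpy as np

# Hypothetical weight vector on a grid over [0, 1]
x = np.linspace(0, 1, 91)
w = np.ones_like(x)
w[15:30] += 0.1          # extra mass added to one region, as in the update step
w = w / w.sum()          # normalize to a discrete density

# Method-of-moments fit of a beta distribution to the weighted grid
mean = np.sum(w * x)
var = np.sum(w * (x - mean) ** 2)
common = mean * (1 - mean) / var - 1   # equals a + b for a beta fit
a, b = mean * common, (1 - mean) * common
```

This avoids the nonlinear least-squares step entirely: the fitted beta reproduces the weighted mean and variance of the probability vector by construction, which is usually enough for a smooth summary curve.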

Versions:

``````
Python v3.6.4, NumPy v1.14.2, SciPy v1.0.1
``````

## #StackBounty: #probability #bayesian #risk #medicine Convert classifier output for disease to probability using Bayes

### Bounty: 50

Method 1

If I am given a classifier for some disease that takes patient characteristics as input and has some sensitivity and specificity, I can then use Bayes’ rule to convert (say classifier output = test = 1 instead of 0):

P(disease | test = 1) = P(test = 1 | dz) P(dz) / P(test = 1)

This can be rewritten in terms of sensitivity and specificity, along with the prior.

So, even if all I have is a classifier (which made a decision), I can get some probability estimate by considering the classifier itself as a kind of medical test.
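Method 1 in code, with illustrative numbers (90% sensitivity, 95% specificity, 1% prior prevalence; none of these come from the question):

```python
def posterior_prob(sens, spec, prevalence):
    # P(disease | test = 1) via Bayes' rule:
    # P(test = 1) = sens * prev + (1 - spec) * (1 - prev)
    p_pos = sens * prevalence + (1 - spec) * (1 - prevalence)
    return sens * prevalence / p_pos

# Hypothetical classifier treated as a medical test
p = posterior_prob(sens=0.90, spec=0.95, prevalence=0.01)
```

With these numbers the posterior is only about 0.15 despite the seemingly strong test, which is exactly the low-prevalence effect the Bayes-rule conversion captures.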

Method 2

The better approach is to develop an estimator directly for

P(disease | patient characteristics)

Classifiers are heavily criticized in medicine, and I agree that it’s a poor choice to make a decision without patient and physician utilities, but why can’t we just use the first method to convert the classifiers to probabilities by treating the classifier as a medical test with a sensitivity and specificity?

## #StackBounty: #probability #distributions #normal-distribution #expected-value #ratio Direct way of calculating \$mathbb{E} left[ fra…

### Bounty: 150

Considering the following random vectors:

\begin{align}
\mathbf{h} &= [h_{1}, h_{2}, \ldots, h_{M}]^{T} \sim \mathcal{CN}\left(\mathbf{0}_{M}, d\,\mathbf{I}_{M \times M}\right), \\[8pt]
\mathbf{w} &= [w_{1}, w_{2}, \ldots, w_{M}]^{T} \sim \mathcal{CN}\left(\mathbf{0}_{M}, \frac{1}{p}\mathbf{I}_{M \times M}\right), \\[8pt]
\mathbf{y} &= [y_{1}, y_{2}, \ldots, y_{M}]^{T} \sim \mathcal{CN}\left(\mathbf{0}_{M}, \left(d + \frac{1}{p}\right)\mathbf{I}_{M \times M}\right),
\end{align}

where $\mathbf{y} = \mathbf{h} + \mathbf{w}$ and therefore $\mathbf{y}$ and $\mathbf{h}$ are not independent.

I’m trying to find the following expectation:

$$\mathbb{E}\left[ \frac{\mathbf{h}^{H} \mathbf{y}\mathbf{y}^{H} \mathbf{h}}{\|\mathbf{y}\|^{4}} \right],$$

where $\|\mathbf{y}\|^{4} = (\mathbf{y}^{H}\mathbf{y})(\mathbf{y}^{H}\mathbf{y})$.

In order to find the desired expectation, I’m applying the following approximation:

$$\mathbb{E}\left[\frac{\mathbf{x}}{\mathbf{z}}\right] \approx \frac{\mathbb{E}[\mathbf{x}]}{\mathbb{E}[\mathbf{z}]} - \frac{\operatorname{cov}(\mathbf{x},\mathbf{z})}{\mathbb{E}[\mathbf{z}]^{2}} + \frac{\mathbb{E}[\mathbf{x}]}{\mathbb{E}[\mathbf{z}]^{3}}\operatorname{var}(\mathbf{z}).$$

However, applying this approximation to the desired expectation is time-consuming and error-prone, as it involves expansions with many terms.

I have been wondering if there is a more direct/smarter way of finding the desired expectation.
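A Monte Carlo check is a useful companion to any closed-form attempt here. A sketch with illustrative values of $M$, $d$, $p$ (note the ratio equals $|\mathbf{h}^H\mathbf{y}|^2/\|\mathbf{y}\|^4$, which is automatically real, consistent with the observation in the update below):

```python
import numpy as np

rng = np.random.default_rng(0)
M, d, p = 8, 1.0, 2.0   # illustrative dimensions/variances
n_draws = 20_000

def cgauss(var, size, rng):
    # Circularly-symmetric complex Gaussian with per-entry variance `var`
    return np.sqrt(var / 2) * (rng.standard_normal(size) + 1j * rng.standard_normal(size))

h = cgauss(d, (n_draws, M), rng)
w = cgauss(1.0 / p, (n_draws, M), rng)
y = h + w

num = np.abs(np.sum(np.conj(h) * y, axis=1)) ** 2   # h^H y y^H h = |h^H y|^2
den = np.sum(np.abs(y) ** 2, axis=1) ** 2           # ||y||^4
estimate = np.mean(num / den)
```

Comparing `estimate` across $(M, d, p)$ against a candidate closed-form expression is a cheap way to validate (or rule out) any approximation before investing in the algebra.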

$\textbf{UPDATE 21-04-2018}$: I’ve created a simulation in order to identify the shape of the pdf of the ratio inside the expectation operator, and it looks much like the pdf of a Gaussian random variable. Additionally, I’ve noticed that the ratio always results in real-valued terms; the imaginary part is always equal to zero.

Is there another kind of approximation that can be used to find the expectation (an analytical/closed-form result, not only the simulated value of the expectation), given that the pdf looks Gaussian and probably can be approximated as such?

$\textbf{UPDATE 24-04-2018}$: I’ve found an approximation for the case where $\mathbf{h}$ and $\mathbf{y}$ are independent.

$$\mathbb{E}\left[ \frac{\mathbf{h}_{l}^{H} \mathbf{y}_{k} \mathbf{y}_{k}^{H} \mathbf{h}_{l}}{\|\mathbf{y}_{k}\|^{4}} \right] = \frac{d_{l}[(M+1)(M-2)+4M+6]}{\zeta_{k}M(M+1)^{2}}$$

where $\zeta_{k} = d_{k} + \frac{1}{p}$, $\mathbf{h}_{l} \sim \mathcal{CN}\left(\mathbf{0}_{M}, d_{l}\mathbf{I}_{M \times M}\right)$ and $\mathbf{h}_{k} \sim \mathcal{CN}\left(\mathbf{0}_{M}, d_{k}\mathbf{I}_{M \times M}\right)$. Note that $\mathbf{y}_{k} = \mathbf{h}_{k} + \mathbf{w}$ and that $\mathbf{h}_{k}$ and $\mathbf{h}_{l}$ are independent.

I have used the following approximation:
$$\mathbb{E}\left[\frac{\mathbf{x}}{\mathbf{z}}\right] \approx \frac{\mathbb{E}[\mathbf{x}]}{\mathbb{E}[\mathbf{z}]} - \frac{\operatorname{cov}(\mathbf{x},\mathbf{z})}{\mathbb{E}[\mathbf{z}]^{2}} + \frac{\mathbb{E}[\mathbf{x}]}{\mathbb{E}[\mathbf{z}]^{3}}\operatorname{var}(\mathbf{z}).$$
