#StackBounty: #probability #simulation #martingale Reverse-time martingale for non-polynomial approximating functions

Bounty: 50

Let $f(\lambda):(0,1)\to(0,1)$ be a continuous function. Given a coin with unknown probability of heads $\lambda$, sample the probability $f(\lambda)$. One way to do so is to build randomized upper and lower bounds that converge to $f(\lambda)$ on average. These bounds serve as an unbiased estimator of $f(\lambda)$; the algorithm returns 1 with probability equal to the estimate, and 0 otherwise.

Part of the reverse-time martingale algorithm of Łatuszyński et al. (2009/2011) (see "General Factory Functions") to simulate a factory function $f(\lambda)$ is as follows. For each $n$ starting at 1:

  1. Flip the input coin, and compute the $n$th lower and upper bounds of $f$ given the number of heads so far; call them $L$ and $U$.

  2. Compute the $(n-1)$th lower and upper bounds of $f$ given the number of heads so far; call them $L^*$ and $U^*$. (These bounds must be the same regardless of the outcomes of future coin flips, and the interval $[L^*, U^*]$ must equal or entirely contain the interval $[L, U]$.)

More technically (Algorithm 4):

  1. obtain $L_n$ and $U_n$ given $F_{0, n-1}$,
  2. compute $L^*_n = \mathbb{E}[L_{n-1} \mid F_n]$ and $U^*_n = \mathbb{E}[U_{n-1} \mid F_n]$,

where $F_n$ is a filtration that depends on $L_n$ and $U_n$.

Though the paper, as well as the section on general factory functions linked above, shows how this algorithm can be implemented for polynomials, these parts of the algorithm appear to work for any two sequences of functions that converge to $f$, where $L$ or $L^*$ and $U$ or $U^*$ are their lower and upper bound approximations. An example for polynomials follows:

  1. Given the number of heads $H_n$, $L_n$ is the $H_n$th Bernstein coefficient of the $n$th lower approximating polynomial, and $U_n$ is the $H_n$th Bernstein coefficient of the $n$th upper approximating polynomial.
  2. $L^*_n$ is the $H_n$th Bernstein coefficient of the $(n-1)$th lower approximating polynomial, and $U^*_n$ is the $H_n$th Bernstein coefficient of the $(n-1)$th upper approximating polynomial, after elevating both polynomials to degree $n$. (A code sketch of these two steps appears below.)
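
To make the polynomial case concrete, here is a minimal sketch in Python, assuming the approximation scheme supplies the Bernstein coefficients of the $n$th lower and upper approximating polynomials; `lower_coeffs` and `upper_coeffs` are hypothetical stand-ins for that scheme, not names from the paper.

```python
def elevate(b):
    # Rewrite a degree-m polynomial in Bernstein form (coefficients b[0..m])
    # as an equivalent degree-(m+1) polynomial (standard degree elevation).
    m = len(b) - 1
    out = []
    for k in range(m + 2):
        left = b[k - 1] if k >= 1 else 0.0
        right = b[k] if k <= m else 0.0
        out.append((k / (m + 1)) * left + (1 - k / (m + 1)) * right)
    return out

def bounds_at_step(n, heads, lower_coeffs, upper_coeffs):
    # lower_coeffs(n) / upper_coeffs(n): the n+1 Bernstein coefficients of the
    # nth lower/upper approximating polynomial (hypothetical helpers).
    L = lower_coeffs(n)[heads]                   # step 1: L_n
    U = upper_coeffs(n)[heads]                   # step 1: U_n
    Lstar = elevate(lower_coeffs(n - 1))[heads]  # step 2: L*_n
    Ustar = elevate(upper_coeffs(n - 1))[heads]  # step 2: U*_n
    return L, U, Lstar, Ustar
```

The question is then what replaces the degree-elevation step (the conditional expectation in step 2) when the approximating functions are not polynomials.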

But how do these steps work when the approximating functions (the functions that converge to $f$) are other than polynomials? Specifically, what if the approximating functions are—

  • rational functions with integer coefficients?
  • rational functions with rational coefficients?
  • arbitrary approximating functions?

REFERENCES:

  • Łatuszyński, K., Kosmidis, I., Papaspiliopoulos, O., Roberts, G.O., "Simulating events of unknown probabilities via reverse time martingales", Random Structures and Algorithms, 2011 (preprint 2009).

Get this bounty!!!

#StackBounty: #probability #sampling #simulation #permutation Probabilities arising from permutations

Bounty: 50

Certain interesting probability functions can arise from permutations. For example, permutations that are sorted or permutations that form a cycle.

Inspired by the so-called von Neumann schema given in a paper called "On Buffon machines and numbers" by Flajolet and colleagues (2010), we can describe the following algorithm (sketched in code below) that produces a discrete random number based on a permutation class:

  1. Create an empty list.
  2. If the list is empty, generate a random number distributed as $D$, call it $\delta$. Otherwise, generate a random number distributed as $E$. Either way, append the random number to the end of the list.
  3. Let $n$ be the number of items in the list minus 1. If the items in the list do not form a permutation that meets the permutation class’s requirements, return $n$. Otherwise, go to step 2.
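
Here is a minimal sketch of this procedure in Python, assuming $D = E = \text{uniform}(0,1)$ by default; `meets_requirements` and `sorted_descending` are illustrative names for the permutation-class test, not anything from the Flajolet et al. paper.

```python
import random

def permutation_class_sampler(meets_requirements,
                              sample_d=random.random, sample_e=random.random):
    # One run of the algorithm above: returns (n, delta), where delta is the
    # first number generated (distributed as D) and n is the returned index.
    items = []
    while True:
        x = sample_d() if not items else sample_e()  # step 2
        items.append(x)
        n = len(items) - 1                           # step 3
        if not meets_requirements(items):
            return n, items[0]

def sorted_descending(items):
    # Permutation class: the values generated so far are in descending order.
    return all(items[i] >= items[i + 1] for i in range(len(items) - 1))

n, delta = permutation_class_sampler(sorted_descending)
# With D = E = uniform(0,1), P(n is odd) should be 1 - exp(-1), about 0.632.
```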

If $D$ and $E$ are both uniform(0, 1), this algorithm returns the number $n$ with the following probability:

$$\begin{aligned}
G(n) &= \left(1-\frac{V(n+1)}{V(n)\,(n+1)}\right)\left(1-\sum_{j=0}^{n-1} G(j)\right) \\
&= \frac{V(n)\,(n+1)-V(n+1)}{V(0)\,(n+1)!},
\end{aligned}$$

where $V(n) \in (0,n!]$ is the number of permutations of size $n$ that meet the permutation class's requirements. $V(n)$ can be a sequence associated with an exponential generating function (EGF) for the kind of permutation involved in the algorithm. (Examples of permutation classes include permutations whose numbers are sorted in descending order, or permutations whose first number is highest.) For example, if we use the class of permutations sorted in descending order, the EGF is $\exp(\lambda)$, so that $V(n) = 1$.

For this algorithm, if $D$ and $E$ are both uniform(0, 1), the probability that the generated $n$

  • is odd is $1-1/\mathrm{EGF}(1)$, or
  • is even is $1/\mathrm{EGF}(1)$, or
  • is less than $k$ is $\frac{V(0)-V(k)/k!}{V(0)}$.

Thus, for example, if we allow sorted permutations, the algorithm returns an odd number with probability that is exactly $1-\exp(-1)$.
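
As a quick numerical check of the formula above (a sketch, using the sorted-descending class, for which $V(n)=1$ and $V(0)=1$):

```python
from math import exp, factorial

def G(n, V):
    # G(n) = (V(n)*(n+1) - V(n+1)) / (V(0) * (n+1)!), as in the display above.
    return (V(n) * (n + 1) - V(n + 1)) / (V(0) * factorial(n + 1))

def V(n):
    # Sorted (descending) permutations: EGF exp(lambda), so V(n) = 1 for all n.
    return 1

p_odd = sum(G(n, V) for n in range(1, 60, 2))
print(p_odd, 1 - exp(-1))  # both are approximately 0.6321
```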

Depending on the permutation class, the distributions $D$ and $E$, and which values of $n$ we care about, different probabilities and different distributions of $\delta$ will arise. For example:

  • If the class is sorted permutations, both $D$ and $E$ are the uniform distribution, and given that the return value $n$ is odd, it has been known since von Neumann’s 1951 algorithm that $\delta$ has a truncated exponential distribution.
  • If the class is sorted permutations, both $D$ and $E$ are arbitrary distributions, and given that the return value $n$ is odd, then Forsythe (1972) and Monahan (1979) have characterized $\delta$’s distribution function.

See the tables in my section "Probabilities Arising from Certain Permutations" for further examples.

For these reasons, it seems to me that this algorithm can open the door to new and exact samplers for continuous and discrete distributions, including new and exact ways to sample certain irrational probabilities. (And I list many of them in "Bernoulli Factory Algorithms".) And this is why I ask the following questions:

For a given permutation class, a given distribution $D$, and a given distribution $E$

  • what is the probability that the algorithm will return a particular $n$?
  • what is the probability that the algorithm will return an $n$ that belongs to a particular class of values (such as odd numbers or even numbers)?
  • what is the probability that $\delta$ is less than $x$ given that the algorithm returns $n$ (or one of a particular class of values of $n$)?

Note that the last part of the question is equivalent to: What is the CDF of $\delta$’s distribution given that $n$ is returned?

REFERENCES:

  • Forsythe, G.E., "Von Neumann’s Comparison Method for Random Sampling from the Normal and Other Distributions", Mathematics of Computation 26(120), October 1972.
  • Monahan, J., "Extensions of von Neumann’s method for generating random variables", Mathematics of Computation 33 (1979): 1065-1069.


Get this bounty!!!

#StackBounty: #probability #self-study #expected-value #mean-absolute-deviation an upper bound of mean absolute difference?

Bounty: 50

Let $X$ be an integrable random variable with CDF $F$ and inverse CDF $F^*$, and let $Y$ be an independent copy of $X$. Prove that $$E|X-Y| \leq \frac{2}{\sqrt{3}}\sigma,$$ where $\sigma=\sqrt{\operatorname{Var}(X)} = \sqrt{E[(X-\mu)^2]}$.

I am looking for some hint for this proof.

What I’ve got so far is $E|X-Y|=2\int_{0}^{1}(2u-1)F^*(u)\,du$, but I am not sure whether this is the right direction.

I also noticed that $\frac{2}{\sqrt{3}}$ may be related to the variance of the uniform distribution.
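
For what it's worth, if the integral representation above is correct, one natural way to connect it to the constant (a sketch, not a verified proof) is Cauchy–Schwarz on $[0,1]$, using $\int_0^1 (2u-1)\,du = 0$ and $\int_0^1 (2u-1)^2\,du = 1/3$:

$$E|X-Y| = 2\int_0^1 (2u-1)\bigl(F^*(u)-\mu\bigr)\,du \le 2\left(\int_0^1 (2u-1)^2\,du\right)^{1/2}\left(\int_0^1 \bigl(F^*(u)-\mu\bigr)^2\,du\right)^{1/2} = \frac{2}{\sqrt{3}}\,\sigma,$$

since $\int_0^1 (F^*(u)-\mu)^2\,du = E[(X-\mu)^2] = \sigma^2$.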


Get this bounty!!!

#StackBounty: #probability #distributions #high-dimensional Law of the norm of the empirical mean of uniforms on the sphere?

Bounty: 50

Let $U_1,\dots,U_n$ be i.i.d., uniform on the Euclidean sphere in $\mathbb{R}^d$, which we denote $S^{d-1}$.
I am searching for the law of
$$M=\left\|\frac{1}{n}\sum_{i=1}^n U_i \right\|.$$

Attempts:

I don’t know whether it is possible to determine the law exactly. I need a lower bound on $M$, so I tried U-statistic deviation bounds, but the bound I obtained is not sufficient: I want to be able to say that with some small probability $M$ is greater than $\sqrt{\lambda/n}$ for some $\lambda \le n$ (with U-statistics I am limited to $\lambda \le 1$).

  • U-Stat approach: if we take the square, we obtain
    $$M^2=\left\|\frac{1}{n}\sum_{i=1}^n U_i \right\|^2 = \frac{1}{n} +\frac{1}{n^2}\sum_{i \neq j} \langle U_i, U_j\rangle .$$
    For one individual term $\langle U_i, U_j\rangle$, we know what the law is (related to "Distribution of scalar products of two random unit vectors in $D$ dimensions" and https://math.stackexchange.com/questions/2977867/x-coordinate-distribution-on-the-n-sphere), but these terms are not independent, hence handling the sum is not easy.

  • Another possibility is to write that
    $$M^2=\left\|\frac{1}{n}\sum_{i=1}^n U_i \right\|^2 = \frac{1}{n^2} +\frac{2}{n^2}\Big\langle U_1, \sum_{i=2}^n U_i\Big\rangle + \left\|\frac{1}{n}\sum_{i=2}^n U_i \right\|^2 .$$
    Then, by independence of the $U_i$’s, we have
    $$\mathbb{P}\left(\left\|\frac{1}{n}\sum_{i=1}^n U_i \right\|^2 \le x\right)=\mathbb{E}\left[ \mathbb{P}\left(\frac{1}{n^2} +\frac{2}{n^2}\Big\langle U_1, \sum_{i=2}^n U_i\Big\rangle + \left\|\frac{1}{n}\sum_{i=2}^n U_i \right\|^2 \le x \,\Big|\, U_2,\dots, U_n\right)\right],$$
    and we know that the law of $(1+\langle U_1, \sum_{i=2}^n U_i\rangle)/2$ is a $\mathrm{Beta}(d/2, d/2)$ distribution (conditionally on $U_2,\dots,U_n$), hence
    $$\mathbb{P}\left(\left\|\frac{1}{n}\sum_{i=1}^n U_i \right\|^2 \le x\right)=\mathbb{E}\left[ F_{\mathrm{Beta}(d/2,d/2)}\left( \frac{n^2x}{4}+\frac{1}{4}- \frac{1}{4}\left\|\sum_{i=2}^n U_i\right\|^2 \right)\right],$$
    and from this I want to find an upper bound, but I don’t think I will succeed with this line of thought. (A quick simulation sketch follows below.)
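
Not an answer, but a quick Monte Carlo sketch of $M$ (in Python with NumPy, using the standard Gaussian-normalization trick to sample uniformly on $S^{d-1}$) can at least be used to test candidate bounds numerically:

```python
import numpy as np

def sample_M(n, d, trials=10_000, rng=None):
    # Draws of M = || (1/n) * sum_i U_i || with U_1, ..., U_n uniform on S^{d-1}.
    rng = np.random.default_rng() if rng is None else rng
    g = rng.standard_normal((trials, n, d))
    u = g / np.linalg.norm(g, axis=2, keepdims=True)   # rows uniform on the sphere
    return np.linalg.norm(u.mean(axis=1), axis=1)

m = sample_M(n=100, d=50)
# E[M^2] = 1/n exactly, since E<U_i, U_j> = 0 for i != j and ||U_i|| = 1, so
# m**2 should average about 0.01 here; empirical quantiles of m can be compared
# against candidate lower bounds of the form sqrt(lambda/n).
print((m ** 2).mean(), np.quantile(m, [0.05, 0.5, 0.95]))
```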


Get this bounty!!!

#StackBounty: #r #probability #maximum-likelihood #roc How do I evaluate the likelihood of the binormal smoothed ROC curve?

Bounty: 50

As I understand it, the binormal model for ROC curves assumes that the decision variable can be monotonically transformed so that both the case and control values are normally distributed. Under this assumption there is a simple relationship giving the sensitivity from the specificity:

$$\Phi^{-1}(SE) = a+b\,\Phi^{-1}(SP)$$

where $\Phi$ is the normal CDF, and SE and SP are sensitivity and specificity. The R package pROC (Robin et al. 2011) fits a linear model to the observed SE/SP values to get $a$ and $b$, and then calculates the smoothed curve from that.

My question is: how do you evaluate the likelihood of this ROC curve on some holdout points (a test set)? As an example, suppose we fit kernel density estimates to the case ($\bar{D}$) and control ($D$) points, and call the resulting pdfs $k_{\bar{D}}$ and $k_{D}$, with hyperparameters $\Theta$. We could then evaluate (I think) the likelihood of the overall smoothing on holdout sets $X$ and $\bar{X}$ as:

$$\mathcal{L}(X,\bar{X}, \Theta) = \prod_{X} k_{D}(x_i)\prod_{\bar{X}} k_{\bar{D}}(\bar{x}_i)$$
and compare different options for $k$.
I’m not sure how to do the same thing for the binormal model.
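
For reference, the KDE version of the likelihood above is easy to write down; here is a minimal sketch assuming SciPy's `gaussian_kde`, with placeholder variable names (the open part of the question is what the analogous per-point density would be for the binormal fit):

```python
import numpy as np
from scipy.stats import gaussian_kde

def holdout_log_likelihood(train_cases, train_controls,
                           test_cases, test_controls, bw=None):
    # Fit one KDE per group on the training values, then score the held-out
    # values of each group under that group's KDE (the log of the product above).
    k_case = gaussian_kde(train_cases, bw_method=bw)
    k_control = gaussian_kde(train_controls, bw_method=bw)
    return (np.sum(np.log(k_case(test_cases)))
            + np.sum(np.log(k_control(test_controls))))
```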


Get this bounty!!!

#StackBounty: #time-series #probability #stochastic-processes How to get this analytical result for probability of wait times

Bounty: 50

I’m working with a continuous-time stochastic process, where a particular event may happen at some time $t$ with an unknown underlying distribution.

One "run" of a simulation of this process will result in a series of event times for each time the event happened within the run. So the output is just $[t_1, t_2, \dots, t_n]$.

From this output I’m trying to calculate a metric I’ll call $u$, which is defined as "the probability that, if you choose a random time $t$ within a run and look within the time range $[t, t+L]$ (for a pre-specified $L$), at least one event occurred in that range".

I’ve found some documentation (from an employee long gone from the company) that gives an analytical form for $u$ and I’ve verified that this form aligns very well with experimental data, but I haven’t been able to recreate the deductions that lead to this form.

The analytical form makes use of a probability density function of wait times $f(t)$, where the wait time is simply the time between consecutive events. So the experimental wait times are simply $[t_1, t_2-t_1, t_3-t_2, \dots, t_n - t_{n-1}]$.

The form I’m given is: $u = 1 - \frac{\int_{t=L}^{\infty} (t-L)f(t)\,dt}{\int_{t=0}^{\infty} t f(t)\,dt}$, where $t$ is the wait time.

It’s clear that $\frac{\int_{t=L}^{\infty} (t-L)f(t)\,dt}{\int_{t=0}^{\infty} t f(t)\,dt}$ is the probability that no events occur in this random time range of length $L$, but I’m still not clear on how the exact terms are arrived at.

In my attempt to make sense of it I’ve reconstructed it into $u = 1 - \frac{E[\,t-L \mid t > L\,]\,P(t > L)}{E[t]}$,

which makes some intuitive sense to me, but I still can’t find a way to start with the original problem and arrive at any of these forms of the analytical solution.
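
As a sanity check (not a derivation), a quick simulation reproduces the formula numerically; this sketch assumes i.i.d. exponential wait times, for which the analytical form reduces to $u = 1 - e^{-L/\text{mean}}$:

```python
import numpy as np

rng = np.random.default_rng(0)
mean_wait, L, T = 2.0, 1.0, 200_000.0

# one long run with i.i.d. exponential wait times
waits = rng.exponential(mean_wait, size=int(2 * T / mean_wait))
events = np.cumsum(waits)
events = events[events < T]

# empirical u: pick random times t and check for an event in [t, t+L]
t = rng.uniform(0, T - L, size=100_000)
idx = np.searchsorted(events, t)
in_range = idx < len(events)
hit = np.zeros_like(t, dtype=bool)
hit[in_range] = events[idx[in_range]] <= t[in_range] + L
u_empirical = hit.mean()

# analytical form: int_L^inf (t-L) f(t) dt = mean*exp(-L/mean); int_0^inf t f(t) dt = mean
u_analytic = 1 - np.exp(-L / mean_wait)
print(u_empirical, u_analytic)  # the two should agree to a couple of decimal places
```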

Any guidance on this would be greatly appreciated.


Get this bounty!!!

#StackBounty: #probability #classification #regression-strategies #scoring-rules Brier score of calibrated probs is worse than non cali…

Bounty: 50

The question is related to
probability calibration and Brier score

I have faced the following issue. I have a random forest binary classifier, and I then apply isotonic regression to calibrate the probabilities. The result is the following:

[figure omitted: calibration results with Brier scores]

The question: why is the Brier score of the calibrated probabilities a bit worse than that of the non-calibrated probabilities? What could the problem be?
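
For context, here is a minimal sketch of one way such a comparison might be set up with scikit-learn; the synthetic data is a stand-in, and this is not necessarily the setup that produced the figure above:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# synthetic stand-in data; replace with the real features and binary labels
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

# uncalibrated random forest probabilities on the held-out set
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_tr, y_tr)
p_raw = rf.predict_proba(X_te)[:, 1]

# isotonic calibration fit with internal cross-validation on the training set only
cal = CalibratedClassifierCV(RandomForestClassifier(n_estimators=500, random_state=0),
                             method="isotonic", cv=5).fit(X_tr, y_tr)
p_cal = cal.predict_proba(X_te)[:, 1]

print(brier_score_loss(y_te, p_raw), brier_score_loss(y_te, p_cal))
```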


Get this bounty!!!

#StackBounty: #time-series #probability #classification #bernoulli-distribution #sequential-pattern-mining Sequential classification, c…

Bounty: 50

What is the best way to combine outputs from a binary classifier, which outputs probabilities, and is applied to a sequence of non-iid inputs?

Here’s a scenario: Say I have a classifier which does an OK, but not great, job of classifying whether or not a cat is in an image. I feed the classifier frames from a video, and get as output a sequence of probabilities, near one if a cat is present, near zero if not.

Each of the inputs is clearly not independent. If a cat is present in one frame, it’s most likely it will be present in the next frame as well. Say I have the following sequence of predictions from the classifier (obviously there are more than six frames in one hour of video)

  • 12pm to 1pm: $[0.1, 0.3, 0.6, 0.4, 0.2, 0.1]$
  • 1pm to 2pm: $[0.1, 0.2, 0.45, 0.45, 0.48, 0.2]$
  • 2pm to 3pm: $[0.1, 0.1, 0.2, 0.1, 0.2, 0.1]$

The classifier answers the question, “What is the probability a cat is present in this video frame?” But can I use these outputs to answer the following questions?

  1. What is the probability there was a cat in the video between 12 and 1pm? Between 1 and 2pm? Between 2pm and 3pm?
  2. Given say, a day of video, what is the probability that we have seen a cat at least once? Probability we have seen a cat exactly twice?

My first attempt at this problem is to simply threshold the classifier at, say, 0.5, in which case, for question 1, we would decide there was a cat between 12pm and 1pm, but not between 1pm and 3pm, despite the fact that between 1pm and 2pm the sum of the probabilities is much higher than between 2pm and 3pm.

I could also imagine this as a sequence of Bernoulli trials, where one sample is drawn for each probability output from the classifier. Given a sequence, one could simulate this to answer these questions. Maybe this is unsatisfactory though, because it treats each frame as iid? I think a sequence of high probabilities should provide more evidence for the presence of a cat than the same high probabilities in a random order.
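
Here is a minimal sketch of that Bernoulli-trials idea, under the (admittedly unsatisfying) assumption that frames are independent given the classifier outputs; the function names are illustrative:

```python
import numpy as np

def prob_at_least_one(probs):
    # P(at least one positive frame) if frame i were an independent Bernoulli(p_i).
    probs = np.asarray(probs, dtype=float)
    return 1.0 - np.prod(1.0 - probs)

def prob_count_equals(probs, k, n_sim=100_000, rng=None):
    # Monte Carlo estimate of P(exactly k positive frames) under the same
    # independence assumption; note this counts frames, not distinct appearances.
    rng = np.random.default_rng() if rng is None else rng
    draws = rng.random((n_sim, len(probs))) < np.asarray(probs, dtype=float)
    return float(np.mean(draws.sum(axis=1) == k))

print(prob_at_least_one([0.1, 0.3, 0.6, 0.4, 0.2, 0.1]))     # 12pm to 1pm block, ~0.89
print(prob_at_least_one([0.1, 0.2, 0.45, 0.45, 0.48, 0.2]))  # 1pm to 2pm block, ~0.91
```

Under this simplification the 1pm to 2pm block actually scores slightly higher than the 12pm to 1pm one, which matches the intuition above better than a 0.5 threshold does; it still treats the frames as independent, though, which is the concern raised in the question.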


Get this bounty!!!

#StackBounty: #probability #distributions #joint-distribution #symmetry #exchangeability When $(X_1-X_0, X_1-X_2)sim (X_2-X_0, X_2-X_1…

Bounty: 100

Consider a bivariate distribution function $P: \mathbb{R}^2\rightarrow [0,1]$. I have the following question:

Are there necessary and sufficient conditions on $P$ (or on its marginals) ensuring that
$$
\exists \text{ a random vector $(X_0,X_1,X_2)$ such that }
$$

$$
(X_1-X_0, X_1-X_2)\sim (X_2-X_0, X_2-X_1)\sim (X_0-X_1, X_0-X_2)\sim P\,?
$$


Remarks:

(I) $(X_1-X_0, X_1-X_2)\sim (X_2-X_0, X_2-X_1)\sim (X_0-X_1, X_0-X_2)$ does not imply that some of the random variables among $X_1, X_2, X_0$ are degenerate.

For example, $(X_1-X_0, X_1-X_2)\sim (X_2-X_0, X_2-X_1)\sim (X_0-X_1, X_0-X_2)$ is implied by $(X_0, X_1, X_2)$ exchangeable.

(II) The symbol "$\sim$" denotes "DISTRIBUTED AS".


My thoughts: among the necessary conditions, I would list the following: let $P_1,P_2$ be the two marginals of $P$. Then it should be that
$$
\begin{cases}
P_1 \text{ is symmetric around zero, i.e., } P_1(a)=1-P_1(-a) \ \forall a \in \mathbb{R},\\
P_2 \text{ is symmetric around zero, i.e., } P_2(a)=1-P_2(-a) \ \forall a \in \mathbb{R}.
\end{cases}
$$
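
For what it's worth, the necessity of these marginal symmetry conditions follows directly from the assumed distributional identities, by comparing single components:

$$X_1-X_0 \sim X_0-X_1 = -(X_1-X_0), \qquad X_1-X_2 \sim X_2-X_1 = -(X_1-X_2),$$

so each marginal of $P$ must be the law of a random variable that is equal in distribution to its negative.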

Should $P$ itself also be symmetric about zero?

Are these conditions also sufficient? If not, what else should be added to get an exhaustive set of sufficient and necessary conditions?


Get this bounty!!!
