#StackBounty: #probability #self-study #expected-value #mean-absolute-deviation an upper bound of mean absolute difference?

Bounty: 50

Let $X$ be an integrable random variable with CDF $F$ and inverse CDF $F^*$, and let $Y$ be an independent copy of $X$. Prove $$E|X-Y| \leq \frac{2}{\sqrt{3}}\sigma,$$ where $\sigma=\sqrt{\mathrm{Var}(X)} = \sqrt{E[(X-\mu)^2]}$.

I am looking for a hint for this proof.

What I’ve got is $E|X-Y|=2\int_{0}^{1}(2u-1)F^*(u)\,du$, but I am not sure whether this is the correct direction.

I also noticed that $\frac{2}{\sqrt{3}}$ may be related to the variance of the uniform distribution.
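(Not part of the proof, but as a quick numerical sanity check of the quantile identity and of the claimed bound, here is a small R sketch; the Exponential(1) distribution, for which $\sigma = 1$, is just an arbitrary test case.)

    # Check E|X-Y| = 2 * Integral_0^1 (2u-1) F*(u) du and the 2/sqrt(3)*sigma bound
    # for an Exponential(1) distribution (sigma = 1), purely as a sanity check.
    set.seed(1)
    n <- 1e6
    x <- rexp(n); y <- rexp(n)

    mc  <- mean(abs(x - y))                                            # Monte Carlo E|X-Y|
    quo <- 2 * integrate(function(u) (2*u - 1) * qexp(u), 0, 1)$value  # quantile-form identity
    bnd <- 2 / sqrt(3)                                                 # upper bound, sigma = 1

    c(monte_carlo = mc, quantile_form = quo, bound = bnd)              # ~1, ~1, ~1.155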


Get this bounty!!!

#StackBounty: #self-study #pca #linear-model #functional-data-analysis Asymptotic properties of functional models

Bounty: 100

When working in Functional Data Analysis, a classical "preprocessing" step is to represent the "observations" using a B-spline expansion:

$$
X_i(t) \approx \sum_{j=1}^J \lambda_{ij} f_j(t) \qquad i=1, \ldots, n
$$

where $J$ is the number of elements in the basis and $f_1, \ldots, f_J$ are suitably defined B-spline functions.
Then, statistical methods are performed by working directly on the coefficients $\{\lambda_{ij}\}$.
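(For concreteness, a minimal sketch of this preprocessing step in R, using splines::bs as the B-spline basis and least squares for the coefficients; the simulated curves and the choice J = 15 are arbitrary assumptions, not part of the question.)

    library(splines)

    set.seed(1)
    n  <- 50                        # number of curves
    J  <- 15                        # number of B-spline basis functions (arbitrary)
    tt <- seq(0, 1, length = 200)   # common evaluation grid

    # simulate noisy functional observations X_i(t)
    X <- t(sapply(1:n, function(i)
      rnorm(1, 1, 0.2) * sin(2 * pi * tt) + rnorm(length(tt), 0, 0.05)))

    # B-spline design matrix: one column per basis function f_j
    B <- bs(tt, df = J, intercept = TRUE)

    # least-squares basis coefficients lambda_{ij}, one row per curve
    Lambda <- t(apply(X, 1, function(xi) coef(lm(xi ~ B - 1))))
    dim(Lambda)   # n x J: subsequent methods (FPCA, regression) act on these coefficients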

My question is whether there are asymptotic guarantees that, as the number of observations $n$ and the truncation level $J$ increase to $+\infty$, the statistical methods converge to a "true" idealized solution.

In particular, I’m interested in function-on-function regression and functional PCA.
I know the literature is huge, but it would be great to have some papers to start from!


Get this bounty!!!

#StackBounty: #probability #self-study #conditional-probability Find the overall probability that the pairwise difference of consecutiv…

Bounty: 50

$X_1,X_2,\ldots,X_N$ are consecutive RVs with $X_1\sim\mathcal N(\mu_1,\sigma^2_1),\ldots,X_N\sim\mathcal N(\mu_N,\sigma^2_N)$. The probability that the difference between two consecutive RVs is less than a constant $T$ is

$$P(Y_j<T)=\Phi\left(\frac{T-\mu_{Y_j}}{\sigma_{Y_j}}\right),\qquad j=1,2,\ldots,N-1$$

where $Y_j = X_{j+1}-X_{j}$, $\mu_{Y_j}=\mu_{j+1}-\mu_j$, and $\sigma_{Y_j}^2=\sigma_{j+1}^2 + \sigma_j^2$ for $j=1,2,\ldots,N-1$.
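(Purely to make the setup concrete, a small R sketch of these marginal probabilities; the values for the means, variances and the threshold are made up.)

    # Marginal P(Y_j < T) for each consecutive difference (illustrative numbers only)
    mu     <- c(1.0, 1.4, 1.9, 2.1)   # mu_1, ..., mu_N
    sigma2 <- c(0.2, 0.3, 0.25, 0.4)  # sigma^2_1, ..., sigma^2_N
    thr    <- 1                       # the constant T

    mu_d <- diff(mu)                                    # means of Y_j = X_{j+1} - X_j
    sd_d <- sqrt(sigma2[-1] + sigma2[-length(sigma2)])  # sd of Y_j
    pnorm((thr - mu_d) / sd_d)                          # Phi((T - mu_{Y_j}) / sigma_{Y_j})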

[EDIT]

  1. How can I find the overall probability $\mathbb P(Y_j<T),\ \forall j$?
  2. And is it possible to find the overall probability when each $Y_j<T$ is conditioned on $Y_{j-1}<T$, $\forall j$?


Get this bounty!!!

#StackBounty: #self-study #bayesian #continuous-data #uniform #analytical Two dependent uniformly distributed continuous variables and …

Bounty: 50

I am trying to solve the following exercise from Judea Pearl’s Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference.

2.2. A billiard table has unit length, measured from left to right. A ball is rolled on this table, and when it stops, a partition is placed at its stopping position, a distance $x$ from the left end of the table. A second ball is now rolled between the left end of the table and the partition, and its stopping position, $y$, is measured.

a. Answer qualitatively: How does knowledge of $y$ affect our belief about $x$? Is $x$ more likely to be near $y$, far from $y$, or near the midpoint between $y$ and 1?

b. Justify your answer for (a) by quantitative analysis. Assume the stopping position is uniformly distributed over the feasible range.

For b., I clearly need to use Bayes’ theorem:

$$
P(X|Y) = \dfrac{P(Y|X)\,P(X)}{P(Y)}
$$

where I expressed

$$
P(X) \sim U[0,1] =
\begin{cases}
1, & \text{where } 0 \leq x \leq 1\\
0, & \text{else}
\end{cases}
\\
P(Y|X) \sim U[0,x] =
\begin{cases}
1/x, & \text{where } 0 \leq y \leq x\\
0, & \text{else}
\end{cases}
$$

I tried getting $P(Y)$ by integrating the numerator over $X$.

$$
\int_{-\infty}^{\infty} P(Y|X)P(X)\,dx = \int_{0}^{1}P(Y|X)\cdot 1\, dx = \int_{0}^{1}\dfrac{1}{x}\, dx
$$

But the integral doesn’t converge.

I also tried to figure out the numerator itself, but I don’t see how $\frac{1}{x}$ can represent $P(X|Y)$.

Where did I go wrong?
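(As a numerical companion to the setup, not an answer: the two-stage process can be simulated directly and the conditional behaviour of $x$ given $y$ inspected empirically. The sample size and the conditioning window below are arbitrary.)

    # Simulate the two-ball process: x ~ U(0,1), then y | x ~ U(0, x)
    set.seed(1)
    n <- 1e6
    x <- runif(n)
    y <- runif(n, 0, x)

    # empirical distribution of x among draws with y close to some fixed y0
    y0 <- 0.3
    x_given_y <- x[abs(y - y0) < 0.005]
    hist(x_given_y, breaks = 50, freq = FALSE,
         main = "Empirical density of x given y near 0.3", xlab = "x")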


Get this bounty!!!

#StackBounty: #self-study #autocorrelation Mean square of x-component of uniformly distributed circle points

Bounty: 50

Recently I was looking at a paper where the velocity auto-correlation function
$$C(t) = \langle v_x(t)\, v_x(0) \rangle = \langle \cos\theta(t)\,\cos\theta(0) \rangle$$
was being considered for a number of point particles with velocities distributed uniformly on $S^1$ at time zero (here $\langle \cdot \rangle$ denotes an average over initial conditions). In the above, $\theta(t)$ is the angle w.r.t. the horizontal at time $t$. In their plot of $C(t)$ vs. $t$ I noticed that $C(0) \ne 1/2$ (there were multiple plots). However I don’t see this when trying to generate a uniform velocity distribution, so I assume I am doing something wrong here.

I generate velocities uniformly on $S^1$ as follows:

    N <- 10^4                                          # number of samples
    r <- runif(N, 0, 1)                                # uniform radii in [0,1]
    theta <- runif(N, -pi, pi)                         # uniform angles
    P <- cbind(sqrt(r)*cos(theta), sqrt(r)*sin(theta)) # uniform points in the unit disk
    L <- sqrt(P[,1]*P[,1] + P[,2]*P[,2]) 
    V <- P/L 

Another method I’ve seen is:

    X1 <- rnorm(N, 0, 1)                               # standard normal coordinates
    X2 <- rnorm(N, 0, 1)
    R  <- sqrt(X1*X1 + X2*X2)                          # radii
    V  <- matrix(rbind(X1/R, X2/R), N, 2, byrow=TRUE)  # N x 2 matrix of unit vectors

Or using the pracma package:

    pracma::rands(N, r=1.0, N=1.0)

Firstly, can someone confirm that these are appropriate methods for generating points uniformly on the unit circle?

In all cases mean(V[,1] * V[,1]) returns $\approx 1/2$.
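(The value $1/2$ can also be checked without generating any points, by integrating $\cos^2\theta$ against the uniform density on $[-\pi,\pi]$; a quick quadrature check, not from the paper.)

    # E[cos(Theta)^2] for Theta ~ U(-pi, pi), by numerical quadrature
    integrate(function(th) cos(th)^2 / (2 * pi), lower = -pi, upper = pi)$value
    # ~ 0.5, matching the Monte Carlo estimates above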

Moreover if $\Theta \sim \mathcal{U}[-\pi, \pi]$ and $V = \cos^2 \Theta$ then it has the pdf $$f_V(x) = \dfrac{-2}{\pi} \dfrac{d}{dx} (\sin^{-1} \sqrt{x}) = \dfrac{-1}{\sqrt{x(1-x)}}$$ which has mean value
$$\int_0^1 x f_V(x)\, dx = 1/2.$$

Is this correct?

Edit:

The issue with the final calculation is that uniform points on the circle are not formed by simply taking the cosine and sine of uniformly distributed angles, so it must be incorrect.


Get this bounty!!!

#StackBounty: #self-study #survival #interpretation #weibull Interpretation of Weibull Accelerated Failure Time Model Output

Bounty: 50

In this case study I have to assume a baseline Weibull distribution, and I’m fitting an Accelerated Failure Time model, which I will later interpret in terms of both hazard ratios and survival times.

The data looks like this.

    head(data1.1)

       TimeSurv IndSurv Treat Age
    1    6 days       1     D  27
    2   33 days       1     D  43
    3  361 days       1     I  36
    4  488 days       1     I  54
    5  350 days       1     D  49
    6  721 days       1     I  49
    7 1848 days       0     D  32
    8  205 days       1     D  47
    9  831 days       1     I  24
    10 260 days       1     I  38

I’m fitting a model using the function WeibullReg() in R. The survival object is built using TimeSurv as the time measure and IndSurv as the censoring indicator. The covariates considered are Treat and Age.

My issue is with understanding the output properly:

    wei1 = WeibullReg(Surv(TimeSurv, IndSurv) ~ Treat + Age, data = data1.1)
    wei1

    $formula
    Surv(TimeSurv, IndSurv) ~ Treat + Age

    $coef
                Estimate           SE
    lambda  0.0009219183 0.0006803664
    gamma   0.9843411517 0.0931305471
    TreatI -0.5042111027 0.2303038312
    Age     0.0180225253 0.0089632209

    $HR
                  HR       LB       UB
    TreatI 0.6039819 0.384582 0.948547
    Age    1.0181859 1.000455 1.036231

    $ETR
                 ETR        LB        UB
    TreatI 1.6690124 1.0574337 2.6343045
    Age    0.9818574 0.9644488 0.9995801

    $summary

    Call:
    survival::survreg(formula = formula, data = data, dist = "weibull")
                   Value Std. Error     z      p
    (Intercept)  7.10024    0.41283 17.20 <2e-16
    TreatI       0.51223    0.23285  2.20  0.028
    Age         -0.01831    0.00913 -2.01  0.045
    Log(scale)   0.01578    0.09461  0.17  0.868

    Scale= 1.02 

    Weibull distribution
    Loglik(model)= -599.1   Loglik(intercept only)= -604.1
        Chisq= 9.92 on 2 degrees of freedom, p= 0.007 
    Number of Newton-Raphson Iterations: 5 
    n= 120

I don’t really get how Scale = 1.02 relates to Log(scale) = 0.015. And given that the p-value for Log(scale) is large and non-significant, and that the documentation of the function shows the conversions it makes using the scale, am I to assume that the values of the alphas are also not to be trusted (considering they were reached using the scale value)?
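(To make the question concrete, here is how I understand the two parameterizations to line up, assuming the standard survreg-to-Weibull-PH conversion; this appears to be what WeibullReg reports, since the numbers match the output above, but that is an assumption on my part.)

    # Reconstructing the $coef block from the survreg output, assuming the usual
    # AFT -> Weibull PH conversion (numbers taken from the output shown above)
    intercept <- 7.10024
    b_treat   <- 0.51223
    b_age     <- -0.01831
    log_scale <- 0.01578

    scale  <- exp(log_scale)           # 1.0159..., printed as "Scale= 1.02"
    gamma  <- 1 / scale                # Weibull shape: 0.9843
    lambda <- exp(-intercept / scale)  # Weibull rate:  0.000922

    -b_treat / scale                   # PH-scale coefficient for TreatI: -0.5042
    -b_age / scale                     # PH-scale coefficient for Age:     0.0180
    exp(-b_treat / scale)              # hazard ratio for TreatI:          0.604
    exp(b_treat)                       # event time ratio (ETR):           1.669

This at least makes explicit where the scale value enters each reported quantity.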


Get this bounty!!!

#StackBounty: #machine-learning #svm #kernel #self-study Kernels in SVM primal form

Bounty: 50

In the SVM primal form, we have a cost function that is:

$$J(\mathbf{w}, b) = C {\displaystyle \sum\limits_{i=1}^{m} \max\left(0, 1 - y^{(i)} (\mathbf{w}^t \cdot \mathbf{x}^{(i)} + b)\right)} \quad + \quad \dfrac{1}{2} \mathbf{w}^t \cdot \mathbf{w}$$

When using the kernel trick, we have to apply $\phi$ to our input data $x^{(i)}$. So our new cost function will be:

$$J(\mathbf{w}, b) = C {\displaystyle \sum\limits_{i=1}^{m} \max\left(0, 1 - y^{(i)} (\mathbf{w}^t \cdot \phi(\mathbf{x}^{(i)}) + b)\right)} \quad + \quad \dfrac{1}{2} \mathbf{w}^t \cdot \mathbf{w}$$

But following Andrew Ng’s machine learning course, after selecting all training examples as landmarks to apply the Gaussian kernel $K$, he rewrites the cost function this way:

[image: the cost function rewritten in terms of the feature vectors $f^{(i)}$]

where $f^{(i)}=(1, K(x^{(i)}, l^{(1)}), K(x^{(i)}, l^{(2)}), \ldots, K(x^{(i)}, l^{(m)}))$ is an $(m+1)$-dimensional vector ($m$ is the number of training examples). So I have two questions:

  • The two cost functions are quite similar, but the latter uses $f^{(i)}$ and the former $\phi(x^{(i)})$. How is $f^{(i)}$ related to $\phi(x^{(i)})$? In the case of Gaussian kernels, I know that the mapping function $\phi$ maps our input data space to an infinite-dimensional space, so $\phi(x^{(i)})$ must be an infinite-dimensional vector, but $f^{(i)}$ has only $m+1$ dimensions.
  • When using kernels, since there is no dot product in the primal form that can be computed by the kernel function, is it faster to solve the dual form with an algorithm like SMO than to solve the primal form with gradient descent?
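(To make the construction of $f^{(i)}$ above concrete, a small R sketch that builds these feature vectors with every training example as a landmark; the toy data and the bandwidth are arbitrary assumptions.)

    # f^(i) = (1, K(x^(i), l^(1)), ..., K(x^(i), l^(m))) with landmarks l^(j) = x^(j)
    set.seed(1)
    m <- 100
    X <- matrix(rnorm(2 * m), m, 2)   # toy training inputs x^(1), ..., x^(m)
    sigma <- 1                        # Gaussian kernel bandwidth (arbitrary)

    gauss_kernel <- function(a, b) exp(-sum((a - b)^2) / (2 * sigma^2))

    # m x m kernel matrix: entry (i, j) = K(x^(i), l^(j))
    K <- outer(1:m, 1:m, Vectorize(function(i, j) gauss_kernel(X[i, ], X[j, ])))

    F_feat <- cbind(1, K)             # row i is f^(i), an (m+1)-dimensional vector
    dim(F_feat)                       # m x (m+1)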


Get this bounty!!!

#StackBounty: #self-study #algorithms #entropy #information-theory #maximum-entropy How do I prove conditional entropy is a good measur…

Bounty: 200

This question is a follow-up to Does “expected entropy” make sense?, which you don’t have to read, as I’ll reproduce the relevant parts here. Let’s begin with the statement of the problem:

A student has to pass an exam with $k$ questions, each to be answered by yes or no, on a subject he knows nothing about. Assume the correct answers are independent, each being yes or no with probability one half. The student is allowed to take mock exams which have the same questions as the real exam. After each mock exam the teacher tells the student how many right answers he got, and when the student feels ready, he takes the real exam. How many mock exams on average (i.e., in expectation) must the student take to ensure he gets every single question correct in the real exam, and what should his optimal strategy be?

I have proposed an entropy-based strategy in that question, but for it to work, it must first be established that conditional entropy is a good measure of the information still to be recovered.

Here is a more concrete statement of my question. Suppose a student Alice has already taken 3 mock exams and got incomplete information about the answers. In a parallel universe, another student Bob has also taken 3 mock exams, but his strategy and insight about the answers may differ from those of Alice. At this point, both Alice and Bob have a conditional distribution of the answers based on the outcomes of their previous mock exams. I wonder if it can be proved that “the entropy of the conditional distribution from the perspective of Alice is greater than or equal to that of Bob” implies “the minimum expected number of mock exams to be taken by Alice is greater than or equal to that of Bob”.

Intuitively it makes sense because more entropy means more uncertainty and thus more attempts required, but I have no idea how to attack it. As a side note, this will be my bachelor’s thesis, so please just leave hints/pointers instead of spoiling too much 🙂


Get this bounty!!!

#StackBounty: #self-study #stochastic-processes Are the following function families the families of the probability densities of some s…

Bounty: 100

I am a beginner in stochastic processes and am trying to learn this branch of math. I have a few questions about the exercises I solved, and I would like to ask whether my reasoning is sound and whether the solution is correct. The exercise states:

Are the following function families the families of the probability densities of some stochastic process?

(a) $$f_n(\mathbf{t}_n,\mathbf{x}_n)=\left\{\begin{matrix}
\frac{1}{t_1t_2\cdots t_n} & \text{for}\; 0 \leq x_i \leq t_i,\; i=1,2,\ldots,n\\
0 & \text{otherwise}
\end{matrix}\right.$$

(b) $$f_n(\mathbf{t}_n,\mathbf{x}_n)=\left\{\begin{matrix}
a_1a_2\cdots a_n\cdot \exp(-a_1x_1-a_2x_2-\ldots-a_nx_n) & \text{for}\; x_1>0,\,x_2>0,\ldots,x_n>0\\
0 & \text{otherwise}
\end{matrix}\right.$$

where $\mathbf{t}_n=(t_1,t_2,\ldots,t_n)$, $\mathbf{x}_n=(x_1,x_2,\ldots,x_n)$, $n=1,2,\ldots$, $a_1=t_1$, $a_i=t_i-t_{i-1}$.

My solution was to integrate $f_n(\mathbf{t}_n,\mathbf{x}_n)$ with respect to some $x_i$ and see whether the outcome depends on $t_i$. If it does, the function is not a density; if it does not depend on $t_i$, it can be a density of some stochastic process.

(a)

$$\int_0^{t_i}f_n(\mathbf{t}_n,\mathbf{x}_n)\,dx_i = \int_0^{t_i}\frac{1}{t_1 t_2 \cdots t_n}\,dx_i = \frac{1}{t_1 t_2 \cdots t_n} \int_0^{t_i}dx_i = \frac{x_i\big|_0^{t_i}}{t_1 t_2 \cdots t_n} = \frac{t_i-0}{t_1 t_2 \cdots t_n} = \frac{1}{t_1 t_2 \cdots t_{i-1}t_{i+1}\cdots t_n} $$

(b)

$$\int_0^{+\infty}f_n(\mathbf{t}_n,\mathbf{x}_n)\,dx_i= \int_0^{+\infty} a_1a_2\cdots a_n\cdot \exp(-a_1x_1-a_2x_2-\ldots-a_nx_n)\,dx_i=$$
$$\prod_{k=1}^n a_k\int_0^{+\infty} \exp(-a_1x_1-a_2x_2-\ldots-a_nx_n)\,dx_i =$$
$$\prod_{k=1}^n a_k \cdot \exp\Big(-\sum_{j=1}^{i-1}a_jx_j-\sum_{j=i+1}^na_jx_j\Big)\int_0^{+\infty} \exp(-a_ix_i)\,dx_i =$$
$$\prod_{k=1}^n a_k \cdot \exp\Big(-\sum_{j=1}^{i-1}a_jx_j-\sum_{j=i+1}^na_jx_j\Big)\frac{1}{-a_i}\exp(-a_ix_i)\Big|_0^{+\infty}=$$
$$\prod_{k=1}^n a_k \cdot \exp\Big(-\sum_{j=1}^{i-1}a_jx_j-\sum_{j=i+1}^na_jx_j\Big)\frac{1}{-a_i}[0-1] =$$
$$\prod_{k=1}^n a_k \cdot \exp\Big(-\sum_{j=1}^{i-1}a_jx_j-\sum_{j=i+1}^na_jx_j\Big)\frac{1}{a_i}= $$
$$\prod_{k=1,\,k\neq i}^n a_k \cdot \exp\Big(-\sum_{j=1}^{i-1}a_jx_j-\sum_{j=i+1}^na_jx_j\Big)$$

The function family given in (a) can be a density, whereas (b) cannot, as the coefficients $a_i$ depend on $t_{i-1}$ and $t_i$.

Is this correct?
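(As a quick numerical cross-check of case (a) for $n=2$: integrating the two-dimensional density over $x_2$ should give $1/t_1$, independent of $t_2$; the values of $t_1, t_2$ below are arbitrary.)

    # Case (a), n = 2: integrate f_2 = 1/(t1*t2) over x2 in [0, t2]
    t1 <- 2; t2 <- 5
    f2 <- function(x2) rep(1 / (t1 * t2), length(x2))
    integrate(f2, 0, t2)$value   # = 0.5 = 1/t1, i.e. it does not depend on t2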


Get this bounty!!!