#StackBounty: #self-study #bayesian #continuous-data #uniform #analytical Two dependent uniformly distributed continuous variables and …

Bounty: 50

I am trying to solve the following exercise from Judea Pearl’s Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference.

2.2. A billiard table has unit length, measured from left to right. A ball is rolled on this table, and when it stops, a partition is placed at its stopping position, a distance $x$ from the left end of the table. A second ball is now rolled between the left end of the table and the partition, and its stopping position, $y$, is measured.

a. Answer qualitatively: How does knowledge of $y$ affect our belief about $x$? Is $x$ more likely to be near $y$, far from $y$, or near the midpoint between $y$ and 1?

b. Justify your answer for (a) by quantitative analysis. Assume the stopping position is uniformly distributed over the feasible range.

For b., I clearly need to use Bayes’ theorem:

$$
P(X|Y) = \dfrac{P(Y|X)\,P(X)}{P(Y)}
$$

where I expressed

$$
P(X) \sim U[0,1] =
\begin{cases}
1, & \text{where } 0 \leq x \leq 1\\
0, & \text{else}
\end{cases}
\\
P(Y|X) \sim U[0,x] =
\begin{cases}
1/x, & \text{where } 0 \leq y \leq x\\
0, & \text{else}
\end{cases}
$$

I tried getting $P(Y)$ by integrating the numerator over $X$.

$$
\int_{-\infty}^{\infty} P(Y|X)P(X)\,dx = \int_{0}^{1}P(Y|X)\cdot 1\, dx = \int_{0}^{1}\dfrac{1}{x}\, dx
$$

But the integral doesn’t converge.

I also tried to figure out the numerator itself, but I don’t see how $\frac{1}{x}$ can represent $P(X|Y)$.

Where did I go wrong?
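As a sanity check of the setup, the two rolls are easy to simulate; the sketch below (a minimal Python sketch, variable names mine) compares the empirical distribution of $y$ against the marginal obtained by integrating over the feasible range of $x$ only, i.e. $p(y) = \int_y^1 \frac{1}{x}\,dx = -\ln y$, whose CDF is $F(t) = t - t\ln t$. Note how the simulation enforces $y \leq x$, the same constraint that sets the limits of the integral.

```python
import math
import random

random.seed(0)
n = 200_000

ys = []
for _ in range(n):
    x = random.random()          # first ball: X ~ U(0, 1)
    y = random.uniform(0.0, x)   # second ball: Y | X = x ~ U(0, x)
    ys.append(y)

# Candidate marginal from integrating only over the feasible range
# y <= x <= 1: p(y) = -ln(y), with CDF F(t) = t - t*ln(t).
t = 0.5
empirical = sum(y <= t for y in ys) / n
theoretical = t - t * math.log(t)
print(empirical, theoretical)  # both close to 0.847
```

The agreement suggests the divergence above comes from the limits of integration rather than from the densities themselves.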


Get this bounty!!!

#StackBounty: #self-study #autocorrelation Mean square of x-component of uniformly distributed circle points

Bounty: 50

Recently I was looking at a paper where the velocity auto-correlation function
$$C(t) = \langle v_x(t)\, v_x(0) \rangle = \langle \cos\theta(t)\, \cos \theta(0) \rangle$$
was being considered for a number of point particles with velocities distributed uniformly on $S^1$ at time zero (here $\langle \cdot \rangle$ denotes an average over initial conditions). In the above, $\theta(t)$ is the angle w.r.t. the horizontal at time $t$. In their plots of $C(t)$ vs. $t$ I noticed that $C(0) \ne 1/2$ (there were multiple plots). However, I don’t see this when trying to generate a uniform velocity distribution, so I assume I am doing something wrong here.

I generate velocities uniformly on $S^1$ as follows:

    N <- 10^4                                          # number of samples
    r <- runif(N, 0, 1)                                # uniform radii in [0,1]
    theta <- runif(N, -pi, pi)                         # uniform angles
    P <- cbind(sqrt(r)*cos(theta), sqrt(r)*sin(theta)) # uniform points in the unit disk
    L <- sqrt(P[,1]*P[,1] + P[,2]*P[,2]) 
    V <- P/L 

Another method I’ve seen is:

    X1 <- rnorm(N, 0, 1)                               # two independent standard normals
    X2 <- rnorm(N, 0, 1)
    R  <- sqrt(X1*X1 + X2*X2)                          # radius of each point
    V  <- matrix(rbind(X1/R, X2/R), N, 2, byrow=TRUE)  # normalise to unit length

Or using the pracma package:

    pracma::rands(N, r=1.0, N=1.0)

Firstly, can someone confirm that these are appropriate methods for generating points uniformly on the unit circle?

In all cases mean(V[,1] * V[,1]) returns $\approx 1/2$.

Moreover, if $\Theta \sim \mathcal{U}[-\pi, \pi]$ and $V = \cos^2 \Theta$, then it has the pdf $$f_V(x) = \dfrac{-2}{\pi} \dfrac{d}{dx} \left(\sin^{-1} \sqrt{x}\right) = \dfrac{-1}{\sqrt{x(1-x)}}$$ which has mean value
$$\int_0^1 x f_V(x)\, dx = 1/2.$$

Is this correct?
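As an independent numerical cross-check of the $1/2$ value (a Python sketch, separate from the R code above): for $\Theta$ uniform on $[-\pi, \pi]$ we have $\mathbb{E}[\cos^2\Theta] = \frac{1}{2\pi}\int_{-\pi}^{\pi}\cos^2\theta\, d\theta = 1/2$ exactly, since $\cos^2\theta = (1+\cos 2\theta)/2$, and simulation agrees:

```python
import math
import random

random.seed(1)
n = 100_000

total = 0.0
for _ in range(n):
    theta = random.uniform(-math.pi, math.pi)  # uniform angle on the circle
    total += math.cos(theta) ** 2              # v_x^2 of the unit vector (cos t, sin t)

mean_vx2 = total / n
print(mean_vx2)  # close to 0.5
```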

Edit:

The issue with the final calculation is that uniform points on the circle are not formed by simply taking the cosine and sine of uniformly distributed angles, so it must be incorrect.



#StackBounty: #self-study #survival #interpretation #weibull Interpretation of Weibull Accelerated Failure Time Model Output

Bounty: 50

In this case study I have to assume a baseline Weibull distribution, and I’m fitting an Accelerated Failure Time model, which I will later interpret in terms of both the hazard ratio and survival time.

The data looks like this.

    head(data1.1)

       TimeSurv IndSurv Treat Age
    1     6 days       1     D  27
    2    33 days       1     D  43
    3   361 days       1     I  36
    4   488 days       1     I  54
    5   350 days       1     D  49
    6   721 days       1     I  49
    7  1848 days       0     D  32
    8   205 days       1     D  47
    9   831 days       1     I  24
    10  260 days       1     I  38

I’m fitting a model using the function WeibullReg() in R. The survival object is built reading TimeSurv as the time measure and IndSurv as the censoring indicator. The covariates considered are Treat and Age.

My issue deals with understanding the output properly:

    wei1 = WeibullReg(Surv(TimeSurv, IndSurv) ~ Treat + Age, data=data1.1)
    wei1

    $formula
    Surv(TimeSurv, IndSurv) ~ Treat + Age

    $coef
                Estimate           SE
    lambda  0.0009219183 0.0006803664
    gamma   0.9843411517 0.0931305471
    TreatI -0.5042111027 0.2303038312
    Age     0.0180225253 0.0089632209

    $HR
                  HR       LB       UB
    TreatI 0.6039819 0.384582 0.948547
    Age    1.0181859 1.000455 1.036231

    $ETR
                 ETR        LB        UB
    TreatI 1.6690124 1.0574337 2.6343045
    Age    0.9818574 0.9644488 0.9995801

    $summary

    Call:
    survival::survreg(formula = formula, data = data, dist = "weibull")
                   Value Std. Error     z      p
    (Intercept)  7.10024    0.41283 17.20 <2e-16
    TreatI       0.51223    0.23285  2.20  0.028
    Age         -0.01831    0.00913 -2.01  0.045
    Log(scale)   0.01578    0.09461  0.17  0.868

    Scale= 1.02

    Weibull distribution
    Loglik(model)= -599.1   Loglik(intercept only)= -604.1
        Chisq= 9.92 on 2 degrees of freedom, p= 0.007
    Number of Newton-Raphson Iterations: 5
    n= 120

I don’t really understand how Scale = 1.02 relates to Log(scale) = 0.01578. And since the p-value for Log(scale) is large and non-significant, given the conversions the function’s documentation describes, am I to assume that the values of the alphas are also not to be trusted (considering they were computed using the scale value)?
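For orientation, the $coef, $HR, and $ETR blocks are deterministic transformations of the survreg summary printed above; a sketch of what those conversions amount to (Python, using the printed numbers; the relations shown are the standard Weibull AFT-to-PH conversions, not quoted from the function's source):

```python
import math

# Values from the survreg summary above
intercept = 7.10024
log_scale = 0.01578
coef_treat = 0.51223                      # AFT coefficient for TreatI

scale = math.exp(log_scale)               # survreg's Scale (printed rounded as 1.02)
gamma = 1.0 / scale                       # Weibull shape: 0.9843...
lam = math.exp(-intercept / scale)        # Weibull rate lambda: 0.000922...
hr_treat = math.exp(-coef_treat / scale)  # hazard ratio: 0.6040...
etr_treat = math.exp(coef_treat)          # event time ratio: 1.6690...
print(gamma, lam, hr_treat, etr_treat)
```

Note that Scale = 1.02 is simply exp(0.01578), so the two numbers are consistent; the z-test on Log(scale) only tests the null hypothesis scale = 1 (i.e. shape = 1, an exponential baseline), which is a separate question from the reliability of the converted values.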



#StackBounty: #machine-learning #svm #kernel #self-study Kernels in SVM primal form

Bounty: 50

In the SVM primal form, we have the cost function:

$$J(\mathbf{w}, b) = C {\displaystyle \sum\limits_{i=1}^{m} \max\left(0, 1 - y^{(i)} (\mathbf{w}^t \cdot \mathbf{x}^{(i)} + b)\right)} \quad + \quad \dfrac{1}{2} \mathbf{w}^t \cdot \mathbf{w}$$

When using the kernel trick, we have to apply $\phi$ to our input data $\mathbf{x}^{(i)}$. So our new cost function will be:

$$J(\mathbf{w}, b) = C {\displaystyle \sum\limits_{i=1}^{m} \max\left(0, 1 - y^{(i)} (\mathbf{w}^t \cdot \phi(\mathbf{x}^{(i)}) + b)\right)} \quad + \quad \dfrac{1}{2} \mathbf{w}^t \cdot \mathbf{w}$$

But following Andrew Ng’s machine learning course, after selecting all training examples as landmarks for the Gaussian kernel $K$, he rewrites the cost function this way:

[image omitted: the cost function rewritten with $f^{(i)}$ in place of $\mathbf{x}^{(i)}$]

where $f^{(i)}=(1, K(x^{(i)}, l^{(1)}), K(x^{(i)}, l^{(2)}), \dots, K(x^{(i)}, l^{(m)}))$ is an $(m+1)$-dimensional vector ($m$ is the number of training examples). So I have two questions:

  • The two cost functions are quite similar, but the latter uses $f^{(i)}$ and the former $\phi(x^{(i)})$. How is $f^{(i)}$ related to $\phi(x^{(i)})$? In the case of Gaussian kernels, I know that the mapping function $\phi$ maps our input data space to an infinite-dimensional space, so $\phi(x^{(i)})$ must be an infinite-dimensional vector, but $f^{(i)}$ has only $m+1$ dimensions.
  • When using kernels, as there is no dot product in the primal form that can be computed by the kernel function, is it faster to solve the dual form with an algorithm like SMO than to solve the primal form with gradient descent?
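The landmark construction itself is short to write down. A minimal sketch (Python; the toy data and function names are mine) of the $(m+1)$-dimensional vector $f^{(i)}$ when every training point doubles as a landmark:

```python
import math

def gaussian_kernel(x, l, sigma=1.0):
    # K(x, l) = exp(-||x - l||^2 / (2 sigma^2))
    sq_dist = sum((xi - li) ** 2 for xi, li in zip(x, l))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

# Toy training set of m = 3 points; landmarks = training points
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]

def feature_vector(x, landmarks):
    # f = (1, K(x, l^(1)), ..., K(x, l^(m))): always m+1 entries,
    # regardless of the (possibly infinite) dimension of phi(x)
    return [1.0] + [gaussian_kernel(x, l) for l in landmarks]

f0 = feature_vector(X[0], X)
print(f0)  # [1.0, 1.0, 0.6065..., 0.1353...]
```

This makes the dimension mismatch in the first question concrete: $f^{(i)}$ lives in $\mathbb{R}^{m+1}$ no matter what space $\phi$ maps into.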



#StackBounty: #self-study #algorithms #entropy #information-theory #maximum-entropy How do I prove conditional entropy is a good measur…

Bounty: 200

This question is a follow-up of Does “expected entropy” make sense?, which you don’t have to read as I’ll reproduce the relevant parts. Let’s begin with the statement of the problem

A student has to pass an exam, with $k$ questions to be answered by yes or no, on a subject he knows nothing about. Assume the questions are independently distributed with a half-half probability of being either yes or no. The student is allowed to take mock exams that have the same questions as the real exam. After each mock exam the teacher tells the student how many right answers he got, and when the student feels ready, he takes the real exam. How many mock exams, on average (i.e., in expectation), must the student take to ensure he can get every single question correct in the real exam, and what should his optimal strategy be?

I have proposed an entropy-based strategy in that question, but for it to work, it must be first established that conditional entropy is a good measure of information to be recovered.

Here is a more concrete statement of my question. Suppose a student Alice has already taken 3 mock exams and got incomplete information about the answers. In a parallel universe, another student Bob has also taken 3 mock exams, but his strategy and insight about the answers may differ from those of Alice. At this point, both Alice and Bob have a conditional distribution of the answers based on the outcomes of their previous mock exams. I wonder if it can be proved that “the entropy of the conditional distribution from the perspective of Alice is greater than or equal to that of Bob” implies “the minimum expected number of mock exams still to be taken by Alice is greater than or equal to that of Bob”.

Intuitively it makes sense because more entropy means more uncertainty and thus more attempts required, but I have no idea how to attack it. As a side note, this will be my bachelor’s thesis, so please just leave hints/pointers instead of spoiling too much 🙂
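Since the post asks for hints only, here is just a sketch of the bookkeeping behind that conditional distribution: with a uniform prior, after any sequence of mock exams the posterior is uniform over the answer keys consistent with the reported scores, so its entropy is $\log_2$ of the number of remaining keys. A toy Python illustration (all names and numbers are mine, not from the problem):

```python
import itertools
import math

k = 4                    # toy exam with k yes/no questions
truth = (1, 0, 1, 1)     # hidden answer key (1 = yes, 0 = no)

def score(guess, key):
    # number of questions answered correctly
    return sum(g == t for g, t in zip(guess, key))

# Mock exams submitted so far, and the scores the teacher reports
mocks = [(0, 0, 0, 0), (1, 1, 0, 0)]
scores = [score(m, truth) for m in mocks]

# Keys still consistent with every reported score
consistent = [key for key in itertools.product((0, 1), repeat=k)
              if all(score(m, key) == s for m, s in zip(mocks, scores))]

entropy_bits = math.log2(len(consistent))
print(consistent, entropy_bits)  # 2 keys remain -> 1 bit of uncertainty
```

Comparing Alice and Bob then becomes a statement about how these consistent sets (and their sizes) shrink under different querying strategies.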



#StackBounty: #self-study #stochastic-processes Are the following function families the families of the probability densities of some s…

Bounty: 100

I am a beginner in stochastic processes, and I am trying to learn this branch of math. I have a few questions about the exercises I solved. I would like to ask whether my reasoning was proper and whether the solution is good. The exercise states:

Are the following function families the families of the probability densities of some stochastic process?

(a) $$f_n(\mathbf{t}_n,\mathbf{x}_n)=\begin{cases}
\frac{1}{t_1t_2\cdots t_n} & \text{for } 0 \leq x_i \leq t_i,\; i=1,2,\dots,n\\
0 & \text{otherwise}
\end{cases}$$

(b) $$f_n(\mathbf{t}_n,\mathbf{x}_n)=\begin{cases}
a_1a_2\cdots a_n\cdot \exp(-a_1x_1-a_2x_2-\dots-a_nx_n) & \text{for } x_1>0,x_2>0,\dots,x_n>0\\
0 & \text{otherwise}
\end{cases}$$

where $\mathbf{t}_n=(t_1,t_2,\dots,t_n)$, $\mathbf{x}_n=(x_1,x_2,\dots,x_n)$, $n=1,2,\dots$, $a_1=t_1$, $a_i=t_i-t_{i-1}$.

My solution was to integrate $f_n(\mathbf{t}_n,\mathbf{x}_n)$ with respect to some $x_i$ and see whether the outcome depends on $t_i$. If it does, the function is not a density; if it doesn’t depend on $t_i$, it can be a density of some stochastic process.

(a)

$$\int_0^{t_i}f_n(\mathbf{t}_n,\mathbf{x}_n)\,dx_i = \int_0^{t_i}\frac{1}{t_1 t_2 \cdots t_n}\,dx_i = \frac{1}{t_1 t_2 \cdots t_n} \int_0^{t_i}dx_i = \frac{x_i\big|_0^{t_i}}{t_1 t_2 \cdots t_n} = \frac{t_i-0}{t_1 t_2 \cdots t_n} = \frac{1}{t_1 t_2 \cdots t_{i-1}t_{i+1}\cdots t_n}$$

(b)

$$\int_0^{+\infty}f_n(\mathbf{t}_n,\mathbf{x}_n)\,dx_i = \int_0^{+\infty} a_1a_2\cdots a_n\cdot \exp(-a_1x_1-a_2x_2-\dots-a_nx_n)\,dx_i =$$
$$\prod_{k=1}^n a_k\int_0^{+\infty} \exp(-a_1x_1-a_2x_2-\dots-a_nx_n)\,dx_i =$$
$$\prod_{k=1}^n a_k \cdot \exp\Big(-\sum_{j=1}^{i-1}a_jx_j-\sum_{j=i+1}^n a_jx_j\Big)\int_0^{+\infty} \exp(-a_ix_i)\,dx_i =$$
$$\prod_{k=1}^n a_k \cdot \exp\Big(-\sum_{j=1}^{i-1}a_jx_j-\sum_{j=i+1}^n a_jx_j\Big)\,\frac{1}{-a_i}\exp(-a_ix_i)\Big|_0^{+\infty} =$$
$$\prod_{k=1}^n a_k \cdot \exp\Big(-\sum_{j=1}^{i-1}a_jx_j-\sum_{j=i+1}^n a_jx_j\Big)\,\frac{1}{-a_i}[0-1] =$$
$$\prod_{k=1}^n a_k \cdot \exp\Big(-\sum_{j=1}^{i-1}a_jx_j-\sum_{j=i+1}^n a_jx_j\Big)\,\frac{1}{a_i} =$$
$$\prod_{k=1,\,k\neq i}^n a_k \cdot \exp\Big(-\sum_{j=1}^{i-1}a_jx_j-\sum_{j=i+1}^n a_jx_j\Big)$$

The function given in (a) can be a density, whereas (b) cannot, as the coefficients $a_i$ depend on $t_{i-1}$ and $t_i$.

Is it good?
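The $t_1$-dependence in family (b) can also be seen numerically. A small Python sketch (the concrete values are mine) marginalises $x_1$ out of $f_2$ for two choices of $t_1$ and compares with what a consistent family would require at time $t_2$:

```python
import math

def f2_b(t1, t2, x1, x2):
    # family (b) with n = 2: a1 = t1, a2 = t2 - t1
    a1, a2 = t1, t2 - t1
    return a1 * a2 * math.exp(-a1 * x1 - a2 * x2)

def integrate(f, lo, hi, steps=20_000):
    # simple midpoint rule; the integrand decays fast, so [0, 50] suffices
    h = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * h) for i in range(steps)) * h

t2, x2 = 3.0, 0.7

# Marginalise x1 out of f2 for two different values of t1
m1 = integrate(lambda x1: f2_b(1.0, t2, x1, x2), 0.0, 50.0)
m2 = integrate(lambda x1: f2_b(2.0, t2, x1, x2), 0.0, 50.0)

# A consistent family would give f1(t2, x2) = t2 * exp(-t2 * x2) both times
f1 = t2 * math.exp(-t2 * x2)
print(m1, m2, f1)  # m1 != m2: the marginal depends on t1
```

The two marginals disagree with each other and with $f_1(t_2, x_2)$, which is the numerical face of the conclusion above.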



#StackBounty: #regression #self-study #linear How to prove $\beta_0$ has minimum variance among all unbiased linear estimator: Simple L…

Bounty: 50

Under the conditions of the simple linear regression model ($Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$), the ordinary least squares estimators have minimum variance among all unbiased linear estimators.

To prove the OLS estimator $\hat{\beta}_1 = \sum{k_iy_i}$ has minimum variance, we start by setting $\tilde{\beta}_1 = \sum{c_iy_i}$ and we show that the variance of $\tilde{\beta}_1$ can only be larger than that of $\hat{\beta}_1$ if $c_i \neq k_i$.

Similarly, I am trying to prove that $\hat{\beta}_0$ has minimum variance among all unbiased linear estimators, and I am told that the proof starts similarly.

I know that the OLS estimator is $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}$.

How do I start the proof: by constructing another linear estimator $\tilde{\beta}_0$? Is $\tilde{\beta}_0 = c\bar{y} - \hat{\beta}_1\bar{x}$ a linear estimator?
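One standard opening move (a sketch of the start, assuming the usual fixed-design setup, not the full proof): take a fully general linear estimator and impose unbiasedness for every value of $(\beta_0, \beta_1)$:

```latex
\tilde{\beta}_0 = \sum_i c_i y_i,
\qquad
\mathbb{E}\big[\tilde{\beta}_0\big]
  = \beta_0 \sum_i c_i + \beta_1 \sum_i c_i x_i ,
```

so unbiasedness forces $\sum_i c_i = 1$ and $\sum_i c_i x_i = 0$. From there one can mimic the $\hat{\beta}_1$ proof: note that $\hat{\beta}_0 = \sum_i \big(\tfrac{1}{n} - k_i \bar{x}\big) y_i$, write $c_i$ as that OLS weight plus a deviation $d_i$, and show the cross term in $\mathrm{Var}(\tilde{\beta}_0)$ vanishes.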



#StackBounty: #self-study #convergence How to show that quadratic mean convergence implies convergence of the expected value?

Bounty: 50

I am reading Larry Wasserman’s All of Statistics, and exercise 2 in chapter 6 asks for a proof that, given a sequence of random variables $X_1, X_2, \dots$, we have $X_n \xrightarrow{\text{QM}} b$ if and only if

$$
\begin{align}
& \lim_{n \rightarrow \infty} \mathbb{E}(X_n) = b & \text{and } & & \lim_{n \rightarrow \infty} \mathbb{V}(X_n) = 0.
\end{align}
$$

I’m getting stuck proving the forward direction. I started by expanding the definition of quadratic mean convergence as follows. By assumption, we have
$$
\lim_{n \rightarrow \infty} \mathbb{E}(X_n-b)^2 = 0.
$$

And then by linearity of expectation we have,
$$
\lim_{n \rightarrow \infty} \mathbb{E}(X_n-b)^2 = \lim_{n \rightarrow \infty} \left[\mathbb{E}(X_n^2) - 2b\, \mathbb{E}(X_n) + b^2\right] = 0.
$$

This is where I get stuck. It seems like we will somehow get that $\mathbb{E}(X_n)$ has to converge to $b$, but I don’t see how.
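A hint for the stuck point: instead of expanding the square directly, add and subtract $\mathbb{E}(X_n)$ inside it, which yields the bias–variance decomposition:

```latex
\mathbb{E}\big[(X_n - b)^2\big]
  = \mathbb{E}\big[\big(X_n - \mathbb{E}(X_n) + \mathbb{E}(X_n) - b\big)^2\big]
  = \mathbb{V}(X_n) + \big(\mathbb{E}(X_n) - b\big)^2 ,
```

where the cross term vanishes because $\mathbb{E}\big[X_n - \mathbb{E}(X_n)\big] = 0$. Both remaining terms are nonnegative, so if their sum tends to $0$, each must tend to $0$ separately.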



#StackBounty: #time-series #self-study #references #markov-process #transition-matrix How to interpret clusters on Markov chain time ch…

Bounty: 50

I have a discrete-time Markov chain. The Markov chain is aperiodic (because self-loops exist) and is irreducible.

I have found the mean recurrence time (left graph) and then sorted mean recurrence time (right graph).
[images omitted: mean recurrence time (left) and sorted mean recurrence time (right)]
On both the left and the right graph, one can see three ‘clusters’ (groups). I think this is not a typical case. Maybe the transition matrix has a specific form?

My question is:
How to interpret obtained clusters for Markov chain time characteristics?

Edit.

I have plotted the original graph with the three ‘clusters’.

[image omitted: the original graph partitioned into the three clusters]

  cluster vertexN edgeN     density diameter
       1      35   105  0.088235294  1.30119
       2      23    12  0.023715415  1.00000
       3      46    10  0.004830918  2.00000

The density of the original graph is 0.0229649.
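For readers wondering where the mean recurrence times come from: in an irreducible, aperiodic chain the mean recurrence time of state $i$ equals $1/\pi_i$, where $\pi$ is the stationary distribution. A toy Python sketch of that computation (the matrix below is mine, not the chain from the post):

```python
# Mean recurrence times from the stationary distribution: m_i = 1 / pi_i.
# Toy 3-state chain with self-loops (irreducible and aperiodic):
P = [[0.5, 0.25, 0.25],
     [0.2, 0.6,  0.2],
     [0.3, 0.3,  0.4]]

pi = [1.0 / 3.0] * 3
for _ in range(1000):  # power iteration: pi <- pi P until convergence
    pi = [sum(pi[j] * P[j][i] for j in range(3)) for i in range(3)]

mean_recurrence = [1.0 / p for p in pi]
print(mean_recurrence)
```

Clusters in these values then correspond to groups of states with similar stationary mass, which is one concrete structure to look for in the transition matrix.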

Reference: Meyn, S. P., and Tweedie, R. L. (2005), Markov Chains and Stochastic Stability.

