#StackBounty: #r #regression #circular-statistics Interpreting circular-linear regression coefficient

Bounty: 50

I’m trying to use the circular package in R to regress a circular response variable on a linear predictor, and I do not understand the coefficient value I’m getting. I’ve spent considerable time searching in vain for an explanation that I can understand, so I’m hoping somebody here may be able to help.

Here’s an example:

library(circular)

# simulate data
x <- 1:100
set.seed(123)
y <- circular(seq(0, pi, pi/99) + rnorm(100, 0, .1))

# fit model
m <- lm.circular(y, x, type="c-l", init=0)

> coef(m)
[1] 0.02234385

I don’t understand this coefficient of 0.02 — I would expect the slope of the regression line to be very close to pi/100, as it is in garden variety linear regression:

> coef(lm(y~x))[2]
         x
0.03198437

Does the circular regression coefficient not represent the change in response angle per unit change in the predictor variable? Perhaps the coefficient needs to be transformed via some link function to be interpretable in radians? Or am I thinking about this all wrong? Thanks for any help you can offer.
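
In case it helps anyone answer: my current guess (and it is only a guess; I haven’t confirmed it against the package documentation, and field names like m$mu are assumptions about the fitted object) is that type = "c-l" uses the Fisher–Lee link, mean direction = mu + 2*atan(beta*x), so the coefficient would be on the atan scale rather than radians per unit of x. A rough check along those lines:

mu_hat   <- as.numeric(m$mu)        # assumed intercept direction of the fit
beta_hat <- as.numeric(coef(m))

# implied mean direction under the assumed link
fitted_dir <- mu_hat + 2*atan(beta_hat*x)

# implied local slope in radians per unit x (derivative of the link),
# which is no longer a single constant as in ordinary linear regression
local_slope <- 2*beta_hat/(1 + (beta_hat*x)^2)
range(local_slope)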


Get this bounty!!!

#StackBounty: #regression #hypothesis-testing #interaction #regression-coefficients #permutation-test How to do permutation test on mod…

Bounty: 100

Given the following model as an example:

$$Y=\beta_0+\beta_A\cdot A+\beta_B\cdot B+\beta_{AB}\cdot A \cdot B+\epsilon$$

In alternative notation:

$$Y\sim A + B + A:B$$

The main question:

When permuting entries of variable $A$ to test its coefficient ($\beta_A$) in a model, should an interaction term that includes it, such as $B\cdot A$, be recomputed as well?

Secondary question:

And what about testing the $B\cdot A$ interaction term coefficient ($\beta_{AB}$)? Are its permutations computed regardless of the variables $A$ and $B$?

A bit of context:

I want to perform a test on the coefficients of a model (it’s a canonical correlation analysis, but the question is applicable to any linear model including interactions).

I’m trying my hand at permutation tests. While it’s fairly straightforward to test the canonical correlation itself, how to do the same with the variable scores, or coefficients, is a bit unclear to me when including an interaction term.

I’ve read How to test an interaction effect with a non-parametric test (e.g. a permutation test)?, but my question is much more practical.
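
To make the two options concrete, here is a minimal sketch (my own, with simulated data rather than the actual CCA) of the scheme in which the raw $A$ column is permuted; refitting Y ~ A * B then rebuilds the interaction from the permuted $A$ automatically. Whether this, or a residual-permutation scheme such as Freedman–Lane based on a reduced model, is the right thing to do is exactly what I am asking:

set.seed(1)
n <- 200
A <- rnorm(n); B <- rnorm(n)
Y <- 1 + 0.5*B + 0.3*A*B + rnorm(n)   # A has no main effect here
dat <- data.frame(Y, A, B)

t_obs <- coef(summary(lm(Y ~ A*B, dat)))["A", "t value"]

t_perm <- replicate(2000, {
  d <- dat
  d$A <- sample(d$A)                  # permute raw A; A:B is recomputed on refit
  coef(summary(lm(Y ~ A*B, d)))["A", "t value"]
})

mean(abs(t_perm) >= abs(t_obs))       # permutation p-value under this scheme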


Get this bounty!!!

#StackBounty: #r #regression #estimation #piecewise-linear How to estimate parameters of a (almost) linear model from unpaired observat…

Bounty: 50

I have this model:

$a_i=\operatorname{mod}(\lfloor i\cdot T+\operatorname{Normal}(0,\sigma_a)\rfloor,Q)$

$b_i=\operatorname{mod}(\lfloor a_i+D+\operatorname{Normal}(0,\sigma_b)\rfloor,Q)$

with

$i=1,\dots,N$

$N\in\mathbb{N}$

$Q\in\mathbb{N}$

$T\in\mathbb{R},\ T>0$

$D\in\mathbb{R},\ D>0$

$D>T$

$\operatorname{Normal}(\mu,\sigma)$ is a random number drawn from a Normal distribution with mean $\mu$ and variance $\sigma^2$.

$\operatorname{mod}(x,y)$ is the modulo operator; in some programming languages it is written x % y.

$\lfloor x \rfloor$ is the floor function.

Given the $N$ paired observations $(a_i,b_i)$ and $Q$, I need to find $T$ and $D$, and I would like to have an idea about $\sigma_a$ and $\sigma_b$.

My naïve solution is:

  1. Ignore $\sigma_a$ and $\sigma_b$, setting them to $0$;
  2. an estimate for $T$ is the median of the list composed by the $N-1$ values $a_i-a_{i-1}$;
  3. an estimate for $D$ is the median of the list composed by the $N$ values $b_i-a_{i}$.

This R code:

# parameters of the generating model
N=1000
T=200
D=3000
Q=2^16
sigma_a=1
sigma_b=2

i=seq(1,N)

# generate the paired observations (a_i, b_i) as defined above
a=floor(i*T+rnorm(N,0,sigma_a))%%Q

b=floor(a+D+rnorm(N,0,sigma_b))%%Q

# naive estimates: T from consecutive differences of a, D from b - a
print(sprintf("Estimate for T: %f",median(diff(a))))
print(sprintf("Estimate for D: %f",median(b-a)))

gives this output:

[1] "Estimate for T: 200.000000"
[1] "Estimate for D: 2999.000000"

Now, I would like to remove the assumption that the observations are paired, i.e., I will just have all the $a_j$ and all the $b_k$ but I will ignore the correspondences between them, so for example, with $N=3$, $T=200$, $D=3000$, $\sigma_a=0$ and $\sigma_b=0$ I will have $a=\{200,400,600\}$ and $b=\{3400,3600,3200\}$.

Is the problem still solvable? How?
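
To make the unpaired setting concrete, here is a rough sketch (my own, reusing the simulation above) in which the pairing is destroyed by shuffling b:

N=1000; T=200; D=3000; Q=2^16
sigma_a=1; sigma_b=2
i=seq(1,N)

a=floor(i*T+rnorm(N,0,sigma_a))%%Q
b=floor(a+D+rnorm(N,0,sigma_b))%%Q

# b is now observed in arbitrary order; the correspondence to a is lost
b_unpaired=sample(b)

# T can still be estimated from consecutive differences of a,
# but median(b_unpaired-a) is no longer a sensible estimate of D
print(sprintf("Estimate for T: %f",median(diff(a))))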


Get this bounty!!!

#StackBounty: #regression #correlation #p-value #assumptions Difference between the assumptions underlying a correlation and a regressi…

Bounty: 50

My question grew out of a discussion with @whuber in the comments of a different question.

Specifically, @whuber ‘s comment was as follows:

One reason it might surprise you is that the assumptions underlying a correlation test and a regression slope test are different–so even when we understand that the correlation and slope are really measuring the same thing, why should their p-values be the same? That shows how these issues go deeper than simply whether $r$ and $\beta$ should be numerically equal.

This got me thinking about it, and I came across a variety of interesting answers. For example, I found this question “Assumptions of correlation coefficient” but can’t see how this would clarify the comment above.

I found more interesting answers about the relationship of Pearson’s $r$ and the slope $\beta$ in a simple linear regression (see here and here for example) but none of them seem to answer what @whuber was referring to in his comment (at least it’s not apparent to me).

Question 1: What are the assumptions underlying a correlation test and a regression slope test?

For my 2nd question consider the following outputs in R:

model <- lm(Employed ~ Population, data = longley)
summary(model)

Call:
lm(formula = Employed ~ Population, data = longley)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.4362 -0.9740  0.2021  0.5531  1.9048 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   8.3807     4.4224   1.895   0.0789 .  
Population    0.4849     0.0376  12.896 3.69e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.013 on 14 degrees of freedom
Multiple R-squared:  0.9224,    Adjusted R-squared:  0.9168 
F-statistic: 166.3 on 1 and 14 DF,  p-value: 3.693e-09

And the output of the cor.test() function:

with(longley, cor.test(Population, Employed))

    Pearson's product-moment correlation

data:  Population and Employed
t = 12.8956, df = 14, p-value = 3.693e-09
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.8869236 0.9864676
sample estimates:
      cor 
0.9603906 

As can be seen from the lm() and cor.test() output, Pearson’s correlation coefficient $r$ and the slope estimate ($\beta_1$) are quite different, 0.96 vs. 0.485, respectively, but the t-value and the p-values are the same.

Then I also tried to see if I am able to calculate the t-values for $r$ and $\beta_1$, which are the same despite $r$ and $\beta_1$ being different. And that’s where I get stuck, at least for $r$:

Calculate the slope ($\beta_1$) in a simple linear regression using the sums of squares and cross-products of $x$ and $y$:

x <- longley$Population; y <- longley$Employed
xbar <- mean(x); ybar <- mean(y)
ss.x <- sum((x-xbar)^2)
ss.y <- sum((y-ybar)^2)
ss.xy <- sum((x-xbar)*(y-ybar))

Calculate the least-squares estimate of the regression slope, $\beta_{1}$ (there is a proof of this in Crawley’s R Book 1st edition, page 393):

b1 <- ss.xy/ss.x                        
b1
# [1] 0.4848781

Calculate the standard error for $\beta_1$:

ss.residual <- sum((y-model$fitted)^2)
n <- length(x) # SAMPLE SIZE
k <- length(model$coef) # NUMBER OF MODEL PARAMETER (i.e. b0 and b1)
df.residual <- n-k
ms.residual <- ss.residual/df.residual # RESIDUAL MEAN SQUARE
se.b1 <- sqrt(ms.residual/ss.x)
se.b1
# [1] 0.03760029

And the t-value and p-value for $\beta_1$:

t.b1 <- b1/se.b1
p.b1 <- 2*pt(-abs(t.b1), df=n-2)
t.b1
# [1] 12.89559
p.b1
# [1] 3.693245e-09

What I don’t know at this point, and this is Question 2, is how to calculate the same t-value using $r$ instead of $\beta_1$ (perhaps in baby-steps)?

I assume that since cor.test()‘s alternative hypothesis is whether the true correlation is not equal to 0 (see cor.test() output above), I would expect something like the Pearson correlation coefficient $r$ divided by the “standard error of the Pearson correlation coefficient” (similar to the b1/se.b1 above)?! But what would that standard error be and why?
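
For what it’s worth, here is a numerical check of that guess (my own sketch, continuing the code above): the usual standard error attached to $r$ under the null hypothesis of zero correlation is $\sqrt{(1-r^2)/(n-2)}$, and dividing $r$ by it reproduces the same t-value.

r <- cor(x, y)
se.r <- sqrt((1 - r^2)/(n - 2))   # standard error of r under H0: rho = 0
t.r <- r/se.r
t.r
# should reproduce 12.89559, i.e. the same value as t.b1 and cor.test()
p.r <- 2*pt(-abs(t.r), df = n - 2)
p.r
# should reproduce 3.693245e-09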

Maybe this has something to do with the aforementioned assumptions underlying a correlation test and a regression slope test?!


Get this bounty!!!

#StackBounty: #regression #standard-error Settle a bet: Errors in prediction / regression

Bounty: 50

In my work, I am comparing “predicted” values to “theoretically true” values. To calculate one predicted value, I take some $N_{R_i}$ samples from an area in space, $R_i$ (i.e. the samples are from different locations in a region). I do some calculations on the samples to calculate a predicted value for each sample. Then I average the predictions to calculate an average predicted value for the region. There are around 10 regions in total, so 10 predicted values.

The theoretically true values are back-calculated using a very different methodology. Theoretically true values are only available many years after the original samples are taken (that’s why we bother to make predictions). There can only be one theoretically true value per region $R_i$. That’s why I averaged samples from within each region to compare to the single theoretically true value per region.

I hope you are with me so far.

So, what I have is a small data set of about 10 predicted and “true” values. The correlation between them is strong.

Now, what if we expand into a new region? I can take some new samples and make a prediction for that region.

I want to know how to calculate the uncertainty in my prediction.

I think I am after “Uncertainty in the mean” since my prediction is an average of $n$ samples.

$$
SE_x=\frac{s}{\sqrt{n}}
$$
So, I think I can say that if I calculate a predicted value, $x$, there is a 68% chance for the “true value” to be within 1 $SE_x$ of $x$. Is this correct?

My colleague thinks we should be interested in the standard error for a predicted value:

$$
s_{y_p}=s^2_e\bigg(1+\frac{1}{n}+\frac{(x-\overline{x})^2}{\sum(x-\overline{x})^2}\bigg)
$$
(she found that equation here: http://courses.ncssm.edu/math/Talks/PDFS/Standard%20Errors%20for%20Regression%20Equations.pdf)

Who is right? Or maybe a better question is: what will the 2nd formula give me that the first will not?
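
To see what each formula measures in practice, here is a toy sketch (my own; made-up numbers, not the real regional data) that computes (i) the standard error of a new region’s mean prediction from its own samples and (ii) a regression-based standard error for a single new observation, the latter taken from predict() rather than the formula by hand:

set.seed(1)
pred_region <- rnorm(10, 50, 10)                       # 10 regional predictions
true_region <- 5 + 0.9*pred_region + rnorm(10, 0, 3)   # 10 "theoretically true" values
fit <- lm(true_region ~ pred_region)

new_samples <- rnorm(25, 55, 8)   # hypothetical samples from a new region
x_new <- mean(new_samples)

# (i) "uncertainty in the mean": spread of the new region's own samples
se_mean <- sd(new_samples)/sqrt(length(new_samples))

# (ii) standard error for predicting a single new "true" value at x_new,
# which is roughly what the second formula is getting at
pr <- predict(fit, newdata = data.frame(pred_region = x_new), se.fit = TRUE)
se_pred <- sqrt(pr$se.fit^2 + summary(fit)$sigma^2)

c(se_mean = se_mean, se_pred = se_pred)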


Get this bounty!!!

#StackBounty: #r #regression #lm R – Transformations of continuous predictors, using predicted values

Bounty: 50

In the past, I have been assessing the relationship between outcome and continuous predictors without taking other predictors into account. I have also been playing around with a way to determine that same relationship when taking other model predictors into account using the predict function…but can’t get my head around a couple of things. Probably not the best example, but I’ve replicated the problem with the IRIS dataset (using Sepal.Length as the outcome variable):

library(ggplot2)
irisdata <- iris 

Here is what I might use to explore the relationship between sepal.length and petal.width; and determine whether a transformation is required (in this case I might just keep as linear).

  ggplot(irisdata, aes(x=Sepal.Length, y=Petal.Width)) +
    geom_point(shape=1) +    
    stat_smooth(method = "loess", color = 'red', size = 1.3, span = 0.5)   +
    stat_smooth(method = "lm", formula = y ~ poly(x, 3), size = 1, color = 'magenta', se = FALSE) +
    geom_smooth(method = "lm", color = 'purple', se = FALSE)

I’m interested in whether that relationship will change when including my other model variables. Here’s the final model excluding petal.width:

irismodel <- lm(Sepal.Length~Sepal.Width+Petal.Length+Species, data=irisdata)
summary(irismodel)
irisdata$predictedlength <- predict(irismodel, irisdata, type = "response")

And here is what I may use to see if the relationship has changed (in this case, both relationships look similar):

  ggplot(irisdata, aes(x=predictedlength, y=Petal.Width)) +
    geom_point(shape=1) +    
    stat_smooth(method = "loess", color = 'red', size = 1.3, span = 0.5)   +
    stat_smooth(method = "lm", formula = y ~ poly(x, 3), size = 1, color = 'magenta', se = FALSE) +
    geom_smooth(method = "lm", color = 'purple', se = FALSE)

Finally, when I include petal.width in the final model, petal.width is a significant variable:

  irismodel2 <- lm(Sepal.Length ~ Petal.Width + Sepal.Width+Petal.Length+Species, data= irisdata)
  summary(irismodel2)

However, when I include it as a predictor with ‘predictedlength’, it becomes non-significant:

  irismodel3 <- lm(Sepal.Length ~ Petal.Width + predictedlength, data= irisdata)
  summary(irismodel3)

I guess there are two questions here:

  1. Why does petal.width ‘lose’ statistical significance when included in the model with the predicted value (i.e. irismodel3)?
  2. What is a reasonable approach for determining a correct continuous transformation? When considering transformations, should the impact of other predictors be taken into account? (See the sketch after this list.) In this example petal.width more or less looks the same, but in more complex models I’ve built, the transformation requirements have changed (maybe I need to add a degree to a polynomial etc). I guess this is mostly due to a collinearity issue.
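
One common way to look at this (a sketch of my own, not necessarily the “right” answer to my question) is an added-variable (partial regression) plot: plot the residuals of the outcome given the other predictors against the residuals of Petal.Width given the same predictors. By the Frisch–Waugh result, the least-squares slope in this plot equals Petal.Width’s coefficient in irismodel2, so it shows the adjusted relationship directly:

res_y <- resid(lm(Sepal.Length ~ Sepal.Width + Petal.Length + Species, data = irisdata))
res_x <- resid(lm(Petal.Width  ~ Sepal.Width + Petal.Length + Species, data = irisdata))

ggplot(data.frame(res_x, res_y), aes(x = res_x, y = res_y)) +
  geom_point(shape = 1) +
  stat_smooth(method = "loess", color = "red", size = 1.3, span = 0.5) +
  geom_smooth(method = "lm", color = "purple", se = FALSE)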

Thanks

Output from code below:

summary(irismodel)

                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)        2.39039    0.26227   9.114 5.94e-16 ***
Sepal.Width        0.43222    0.08139   5.310 4.03e-07 ***
Petal.Length       0.77563    0.06425  12.073  < 2e-16 ***
Speciesversicolor -0.95581    0.21520  -4.442 1.76e-05 ***
Speciesvirginica  -1.39410    0.28566  -4.880 2.76e-06 ***

Multiple R-squared:  0.8633,    Adjusted R-squared:  0.8595 

summary(irismodel2)

                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)        2.17127    0.27979   7.760 1.43e-12 ***
Petal.Width       -0.31516    0.15120  -2.084  0.03889 *  
Sepal.Width        0.49589    0.08607   5.761 4.87e-08 ***
Petal.Length       0.82924    0.06853  12.101  < 2e-16 ***
Speciesversicolor -0.72356    0.24017  -3.013  0.00306 ** 
Speciesvirginica  -1.02350    0.33373  -3.067  0.00258 ** 

Multiple R-squared:  0.8673,    Adjusted R-squared:  0.8627 

summary(irismodel3)

                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -0.30055    0.35236  -0.853    0.395    
Petal.Width     -0.07546    0.07406  -1.019    0.310    
predictedlength  1.06692    0.07337  14.541   <2e-16 ***

Multiple R-squared:  0.8643,    Adjusted R-squared:  0.8624 

[The two scatterplots produced by the ggplot calls above were attached here as images.]


Get this bounty!!!

#StackBounty: #regression #hypothesis-testing #anova #experiment-design #methodology Arguments/Advantages of Additive Model Constructio…

Bounty: 50

Assume I perform a psychological experiment with a number of manipulations, each hypothesized to influence the dependent variable. For instance, we perform a mixed-design short-term memory experiment where:

  • DV = Number of letters recalled
  • IV1 = Number of letters
  • IV2 = Number of distractions
  • IV3 = Delay time between presentation of stimuli and recall
  • IV4 = Whether memory is being stored internally (biologically) or externally (i.e., pen and paper)

Assuming the presence of interactions between terms, is there any argument for constructing a number of analytical models (e.g., ~IV1*IV2, ~IV1*IV2*IV4, ~IV1*IV2*IV3*IV4) in order to best understand the phenomena? This is in the context of performing successive mixed-design ANOVAs to come to multiple conclusions regarding the experiment. I have a basic understanding of regression/multivariate regression and recognize the folly of including unnecessary variables – but if they’re all hypothesized to exert some effect, isn’t the final model the most sound?
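
For the "does a sequence of models buy anything?" part, here is a minimal sketch (simulated data and hypothetical variable names of my own; it ignores the repeated-measures structure, which a real mixed-design analysis would have to respect) of fitting the candidate models and comparing them with sequential F tests instead of interpreting each in isolation:

set.seed(1)
n <- 200
d <- data.frame(IV1 = rnorm(n), IV2 = rnorm(n), IV3 = rnorm(n),
                IV4 = factor(sample(c("internal", "external"), n, replace = TRUE)))
d$DV <- with(d, 2 + IV1 + 0.5*IV2 + 0.3*IV1*IV2 + (IV4 == "external") + rnorm(n))

m12   <- lm(DV ~ IV1*IV2, data = d)
m124  <- lm(DV ~ IV1*IV2*IV4, data = d)
m1234 <- lm(DV ~ IV1*IV2*IV3*IV4, data = d)

anova(m12, m124, m1234)   # sequential F tests of the terms added at each step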


Get this bounty!!!

#StackBounty: #regression #self-study #linear-model #independence Implications of mean dependence in the classical linear model

Bounty: 50

Consider the classical linear model

(1) $Y_i=X_i'\beta+\epsilon_i$

(2) $(Y_i, X_i)_{i=1}^n$ i.i.d.

(3) $E(\epsilon_i\mid X_1,\dots, X_n)=0$


Could you help me to show step by step that

(1), $(Y_i, X_i)_{i=1}^n$ mutually independent, (3) $\Rightarrow$ $E(\epsilon_i\mid X_i)=0$


Could you help me to show step by step that

(1), $(Y_i, X_i)_{i=1}^n$ mutually independent, $E(\epsilon_i\mid X_i)=0$ $\Rightarrow$ (3)


Also, is it true that

(1), (2), $\epsilon_i \perp X_i$ $\Rightarrow$ $\epsilon_i \perp (X_1,\dots, X_n)$

? Why?
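
For what it’s worth, the first implication looks like a one-line application of the law of iterated expectations (my own sketch, so treat it as a starting point rather than a full solution), since $\sigma(X_i)\subseteq\sigma(X_1,\dots,X_n)$:

$$E(\epsilon_i\mid X_i)=E\big(E(\epsilon_i\mid X_1,\dots,X_n)\mid X_i\big)=E(0\mid X_i)=0.$$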


Get this bounty!!!

#StackBounty: #regression #survival #modeling #cox-model #hazard Cumulative hazard in the setting of Cox regression of repeated events

Bounty: 50

Cox regression is commonly extended to estimate repeated events processes (for a quick review see [1] and [2]).

In Clark et al’s first article in their excellent review series of survival analysis [3] cumulative hazard is explained in the following manner:

The interpretation of H(t) is difficult, but perhaps the easiest way to think of H(t) is as the cumulative force of mortality, or the number of events that would be expected for each individual by time t if the event were a repeatable process.

In light of [3]: It seems logical to conclude that in the setting of Cox regression of repeated events, the cumulative hazard represents the expected number of events given the covariates.

I am especially interested in the relationship between the Nelson-Aalen estimator of the cumulative hazard and the mean cumulative function or mean cumulative count [4]. Assume there are no competing risks.

What is the relationship, if any, between the mean cumulative function and cumulative hazard in the setting of Cox regression for repeated events?
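
As a purely numerical illustration of the question (a sketch of my own, not taken from the cited papers; field names such as fit$cumhaz are assumptions about the survfit object, and survfit may warn that an id/cluster is needed for a correct variance, but only the point estimate matters here), one can compare the Nelson–Aalen cumulative hazard from a counting-process fit with a hand-rolled mean cumulative count on simulated recurrent-event data:

library(survival)

set.seed(42)
n <- 50
# simulate recurrent events for n subjects followed over [0, 10], no terminal event
dat <- do.call(rbind, lapply(seq_len(n), function(id) {
  times <- sort(runif(rpois(1, lambda = 3), 0, 10))
  data.frame(id    = id,
             start = c(0, times),
             stop  = c(times, 10),
             event = c(rep(1, length(times)), 0))
}))

# Nelson-Aalen cumulative hazard from a counting-process fit
fit <- survfit(Surv(start, stop, event) ~ 1, data = dat)
na  <- data.frame(time = fit$time, cumhaz = fit$cumhaz)

# hand-rolled mean cumulative count: events per subject accumulated over time
ev  <- sort(dat$stop[dat$event == 1])
mcc <- data.frame(time = ev, mcc = seq_along(ev)/n)

# with everyone at risk for the whole window the two step functions coincide;
# censoring, terminal events and covariates are where they start to differ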

  1. Modelling recurrent events: a tutorial for analysis in epidemiology. Amorim LD, Cai J. Int J Epidemiol. 2015 Feb;44(1):324-33. doi: 10.1093/ije/dyu222.
  2. Survival analysis for recurrent event data: an application to childhood infectious diseases. Kelly PJ, Lim LL. Stat Med. 2000 Jan 15;19(1):13-33.
  3. Survival analysis part I: basic concepts and first analyses. Clark TG, Bradburn MJ, Love SB, Altman DG. Br J Cancer. 2003 Jul 21;89(2):232-8.
  4. Estimating the Burden of Recurrent Events in the Presence of Competing Risks: The Method of Mean Cumulative Count. Dong et al. Am J Epidemiol. 2015 Apr 1; 181(7): 532–540.


Get this bounty!!!

#StackBounty: #regression #ridge-regression #shrinkage #penalized If shrinkage is applied in a clever way, does it always work better f…

Bounty: 50

Suppose I have two estimators $\widehat{\beta}_1$ and $\widehat{\beta}_2$ that are consistent estimators of the same parameter $\beta_0$ and such that
$$\sqrt{n}(\widehat{\beta}_1 -\beta_0) \stackrel{d}{\rightarrow} \mathcal{N}(0, V_1), \quad \sqrt{n}(\widehat{\beta}_2 -\beta_0) \stackrel{d}{\rightarrow} \mathcal{N}(0, V_2)$$
with $V_1 \leq V_2$ in the p.s.d. sense. Thus, asymptotically $\widehat{\beta}_1$ is more efficient than $\widehat{\beta}_2$. These two estimators are based on different loss functions.

Now I want to look for some shrinkage techniques to improve finite-sample properties of my estimators.

Suppose that I found a shrinkage technique that improves the estimator $\widehat{\beta}_2$ in a finite sample and gives me an MSE equal to $\widehat{\gamma}_2$. Does this imply that I can find a suitable shrinkage technique to apply to $\widehat{\beta}_1$ that will give me an MSE no greater than $\widehat{\gamma}_2$?

In other words, if shrinkage is applied cleverly, does it always work better for more efficient estimators?
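
Here is a toy version of the question (my own sketch; it only explores shrinking toward zero by a constant factor, a much narrower family than "any clever shrinkage technique"): the sample mean and the sample median are both consistent for the same location parameter, with the mean being more efficient under normality, and we compare the best MSE each achieves over a grid of shrinkage factors.

set.seed(1)
beta0 <- 0.5; n <- 30; reps <- 5000
w_grid <- seq(0, 1, by = 0.05)   # shrinkage factors: estimate is w * (mean or median)

sims <- replicate(reps, { x <- rnorm(n, mean = beta0); c(mean(x), median(x)) })

mse <- function(est, w) mean((w*est - beta0)^2)
mse_mean   <- sapply(w_grid, function(w) mse(sims[1, ], w))
mse_median <- sapply(w_grid, function(w) mse(sims[2, ], w))

c(best_shrunk_mean = min(mse_mean), best_shrunk_median = min(mse_median))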


Get this bounty!!!