#StackBounty: #regression #ordinal-data Guidance on when to use cumulative vs. stopping ratio vs. continuation ratio vs. adjacent categ…

Bounty: 200

When doing ordinal regression, what should I consider when deciding whether to use the cumulative probability, stopping ratio, continuation ratio, or adjacent category families? I know that the proportional odds assumption is key here, but I am looking for more detail and for any other considerations. Thank you.
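
To make the comparison concrete, here is a minimal R sketch of fitting the same ordinal outcome under all four families, using the VGAM package (my choice for illustration; the data and variable names are made up) so that the fits can be compared, e.g. by AIC:

library(VGAM)

set.seed(1)
d <- data.frame(x = rnorm(300))
d$y <- cut(d$x + rlogis(300), breaks = c(-Inf, -1, 0, 1, Inf),
           labels = c("a", "b", "c", "d"), ordered_result = TRUE)

fits <- list(
  cumulative   = vglm(y ~ x, family = cumulative(parallel = TRUE), data = d),
  stopping     = vglm(y ~ x, family = sratio(parallel = TRUE),     data = d),
  continuation = vglm(y ~ x, family = cratio(parallel = TRUE),     data = d),
  adjacent     = vglm(y ~ x, family = acat(parallel = TRUE),       data = d)
)
sapply(fits, AIC)

Here parallel = TRUE imposes the proportional-odds-type ("parallelism") constraint within each family; setting it to FALSE relaxes that constraint, which is part of what I am trying to weigh up.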


Get this bounty!!!

#StackBounty: #regression #machine-learning Artificial neural networks EQUIVALENT to linear regression with polynomial features?

Bounty: 100

I want to improve my understanding of neural networks and their benefits compared to other machine-learning algorithms. My understanding is outlined below, and my question is:

Can you correct and supplement my understanding please? 🙂

My understanding:

(1) An artificial neural network is a function that predicts output values from input values. According to the Universal Approximation Theorem (https://en.wikipedia.org/wiki/Universal_approximation_theorem), it can approximate essentially any (sufficiently well-behaved) prediction function, given enough neurons.

(2) The same is true for linear regression if you add polynomials of the input values as additional inputs, since any (sufficiently smooth) function can be approximated well by polynomials (compare the Taylor expansion).

(3) This means that, in a sense (with respect to the best achievable fit), the two methods are equivalent.

(4) Hence, their main difference lies in which method lends itself to better computational implementation; in other words, with which method you can find good values for the parameters that eventually define the prediction function faster from training examples.
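
To ground points (2)–(4), here is a minimal R sketch (made-up 1-D data; the polynomial degree and hidden-layer size are arbitrary choices) comparing a polynomial-features linear model with a small single-hidden-layer network from the nnet package:

library(nnet)

set.seed(1)
x <- seq(-3, 3, length.out = 200)
y <- sin(2 * x) + rnorm(200, sd = 0.2)

poly_fit <- lm(y ~ poly(x, degree = 7))                    # linear model with polynomial features
net_fit  <- nnet(y ~ x, data = data.frame(x, y),
                 size = 10, linout = TRUE, maxit = 500, trace = FALSE)

mean((fitted(poly_fit) - sin(2 * x))^2)                    # approximation error, polynomial
mean((net_fit$fitted.values - sin(2 * x))^2)               # approximation error, network

Both families can represent the underlying curve well; in line with point (4), what seems to differ in practice is how easily good parameter values are found and how the fits behave outside the training range.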

I welcome any thoughts, comments and recommendations to other links or books to improve my thinking.


Get this bounty!!!

#StackBounty: #regression #optimization #lasso #lars #glmmlasso Why under joint least squares direction is it possible for some coeffic…

Bounty: 50

I think I understand how LARS regression works. It basically adds a feature to the model when it is more correlated with the residuals than the features currently in the model. Then, after adding the feature, it increases the coefficients in the joint least squares direction (which is the same as the equiangular, least angle, direction).

If the coefficients are increased in the joint least squares direction, then doesn’t that mean that they can’t decrease? Joint least squares means that the $\beta$’s move such that $\sum \beta_i^2$ is as low as possible, but the $\beta$’s must be increasing.

I’ve seen some plots where the $\beta$’s seem to be decreasing as LARS is finding its solution path. For example, the original paper shows the following plot at the top of page 4:

[Figure: LARS coefficient paths, showing betas decreasing as it finds the solution]

Am I misunderstanding something about the LARS algorithm? Perhaps I’m not understanding how the joint least squares direction and the equiangular direction can be one and the same?
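
For what it’s worth, here is a minimal sketch with the lars package (using the diabetes data shipped with it) that reproduces this kind of plot, so the coefficient paths can be inspected directly:

library(lars)

data(diabetes)                         # predictors in diabetes$x, response in diabetes$y
fit <- lars(diabetes$x, diabetes$y, type = "lar")
plot(fit)                              # piecewise-linear coefficient paths; some slopes are negative
coef(fit)                              # coefficients at each step along the path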


Get this bounty!!!

#StackBounty: #regression #mixed-model #residuals #bias #two-step-estimation Fitting a fixed effect model to the residuals from a mixed…

Bounty: 50

In some statistical analyses (e.g., in genetics), it may make sense to perform a two-step regression analysis. In this analysis, the dependent variable is regressed against several independent variables. The residuals are taken from this first regression and modeled against a final independent variable (e.g., a SNP in a genetic association study). Demissie et al. discuss possible bias under this two-stage design, but this bias arises only if the final independent variable is correlated with one of the independent variables from the first model. However, what if the first model is a mixed effects model and the residuals from the mixed effects model are then regressed against the final covariate? Would there be an issue with incorporating the random effects first and then regressing a remaining fixed-effect variable on the resulting residuals?
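
A minimal simulated sketch of the two-stage procedure I have in mind (made-up data and variable names; lme4 for stage 1), compared with the corresponding one-stage fit:

library(lme4)

set.seed(1)
n   <- 500
d   <- data.frame(fam = factor(rep(1:50, each = 10)),   # grouping factor (e.g. family)
                  x1  = rnorm(n),
                  snp = rbinom(n, 2, 0.3))              # additive 0/1/2 genotype coding
u   <- rnorm(50)[d$fam]                                 # family-level random intercepts
d$y <- 1 + 0.5 * d$x1 + 0.3 * d$snp + u + rnorm(n)

stage1 <- lmer(y ~ x1 + (1 | fam), data = d)            # stage 1: SNP deliberately left out
d$r    <- residuals(stage1)

stage2 <- lm(r ~ snp, data = d)                         # stage 2: residuals regressed on the SNP
joint  <- lmer(y ~ x1 + snp + (1 | fam), data = d)      # one-stage fit, for comparison

coef(summary(stage2))["snp", ]
fixef(joint)["snp"]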


Get this bounty!!!

#StackBounty: #regression #heteroscedasticity #phylogeny Consequences of heteroskedasticity for regression with correlated errors

Bounty: 50

What are the statistical consequences of heteroskedasticity for regression models where the errors are correlated, e.g., due to spatial or phylogenetic autocorrelation?

For example, consider a phylogenetic regression model of skull length vs. body mass, where errors are correlated due to phylogeny, and heteroskedasticity is induced because measurement error in skull length is positively correlated with body mass.
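
As a concrete (if simplified) setup, here is a minimal nlme sketch with made-up data and variable names, using a spatial correlation structure for brevity; a phylogenetic correlation structure (e.g. one of the corStruct classes provided by the ape package) could be plugged into the same correlation argument. The weights argument lets the error variance grow with body mass:

library(nlme)

set.seed(1)
d <- data.frame(lon = runif(80), lat = runif(80), mass = runif(80, 1, 10))
d$skull <- 2 + 0.8 * d$mass + rnorm(80, sd = 0.2 * d$mass)   # noise sd grows with mass

fit <- gls(skull ~ mass, data = d,
           correlation = corExp(form = ~ lon + lat),   # correlated errors (spatial here)
           weights     = varPower(form = ~ mass))      # variance modelled as a power of mass
summary(fit)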


Get this bounty!!!

#StackBounty: #regression #logistic #goodness-of-fit #degrees-of-freedom #hosmer-lemeshow-test Degrees of freedom of $chi^2$ in Hosmer…

Bounty: 100

The test statistic for the Hosmer-Lemeshow test (HLT) for goodness of fit (GOF) of a logistic regression model is defined as follows:

The sample is then split into $d=10$ deciles, $D_1, D_2, \dots, D_d$; per decile, one computes the following quantities:

  • $O_{1d}=\displaystyle\sum_{i \in D_d} y_i$, i.e. the observed number of positive cases in decile $D_d$;
  • $O_{0d}=\displaystyle\sum_{i \in D_d} (1-y_i)$, i.e. the observed number of negative cases in decile $D_d$;
  • $E_{1d}=\displaystyle\sum_{i \in D_d} \hat{\pi}_i$, i.e. the estimated number of positive cases in decile $D_d$;
  • $E_{0d}=\displaystyle\sum_{i \in D_d} (1-\hat{\pi}_i)$, i.e. the estimated number of negative cases in decile $D_d$;

where $y_i$ is the observed binary outcome for the $i$-th observation and $\hat{\pi}_i$ the estimated probability for that observation.

The test statistic is then defined as:

$X^2 = \displaystyle\sum_{h=0}^{1} \sum_{g=1}^{d} \frac{(O_{hg}-E_{hg})^2}{E_{hg}} = \sum_{g=1}^{d} \left( \frac{O_{1g} - n_g \hat{\pi}_g}{\sqrt{n_g \hat{\pi}_g (1-\hat{\pi}_g)}} \right)^2,$

where $\hat{\pi}_g$ is the average estimated probability in decile $g$ and $n_g$ is the number of companies in the decile.

According to Hosmer-Lemeshow (see this link) this statistic has (under certain assumptions) a $\chi^2$ distribution with $(d-2)$ degrees of freedom.

On the other hand, if I were to define a contingency table with $d$ rows (corresponding to the deciles) and 2 columns (corresponding to the true/false binary outcome), then the test statistic for the $\chi^2$ test on this contingency table would be the same as the $X^2$ defined above; however, in the contingency-table case, this test statistic is $\chi^2$ with $(d-1)(2-1)=d-1$ degrees of freedom. So one degree of freedom more!

How can one explain this difference in the number of degrees of freedom?
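
For concreteness, here is a minimal base-R sketch (simulated, made-up data) that computes the statistic defined above by grouping on deciles of the fitted probabilities and then refers it to the $\chi^2$ distribution with $d-2$ degrees of freedom that Hosmer and Lemeshow prescribe:

set.seed(1)
n <- 1000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * x))

fit   <- glm(y ~ x, family = binomial)
pihat <- fitted(fit)

d      <- 10
decile <- cut(pihat, breaks = quantile(pihat, probs = seq(0, 1, length.out = d + 1)),
              include.lowest = TRUE, labels = FALSE)

O1 <- tapply(y, decile, sum)         # observed positives per decile
E1 <- tapply(pihat, decile, sum)     # expected positives per decile
ng <- tapply(y, decile, length)      # decile sizes

X2 <- sum((O1 - E1)^2 / (E1 * (1 - E1 / ng)))   # equals the double sum over both outcome columns
pchisq(X2, df = d - 2, lower.tail = FALSE)      # reference distribution with d - 2 degrees of freedom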

EDIT: additions after reading comments:

@whuber

They say (see Hosmer D.W., Lemeshow S. (1980), A goodness-of-fit test for the multiple logistic regression model. Communications in Statistics, A10, 1043-1069) that there is a theorem, demonstrated by Moore and Spruill, from which it follows that if (1) the parameters are estimated using likelihood functions for ungrouped data and (2) the frequencies in the $2 \times g$ table depend on the estimated parameters, i.e. the cells are random rather than fixed, then, under appropriate regularity conditions, the goodness-of-fit statistic under (1) and (2) is that of a central chi-square with the usual reduction of degrees of freedom due to estimated parameters, plus a sum of weighted chi-square variables.

Then, if I understand their paper correctly, they try to find an approximation for this ‘correction term’ which, as far as I can tell, is this weighted sum of chi-square random variables, and they do so by running simulations; but I must admit that I do not fully understand what they say there, hence my question: why are these cells random, and how does that influence the degrees of freedom? Would it be different if I fixed the borders of the cells and then classified the observations into these fixed cells based on the estimated score? In that case the cells are not random, though the ‘content’ of each cell is.

@Frank Harrell: couldn’t it be that the ‘shortcomings’ of the Hosmer-Lemeshow test that you mention in your comments below are just a consequence of the approximation by the weighted sum of chi-squares?


Get this bounty!!!

#StackBounty: #regression #missing-data #fractional-polynomial How can missing data be dealt with when using splines or fractional poly…

Bounty: 200

I am reading Multivariable Model Building: A Pragmatic Approach to Regression Analysis Based on Fractional Polynomials for Modelling Continuous Variables by Patrick Royston and Willie Sauerbrei. So far, I am impressed and it’s an interesting approach I had not considered before.

But the authors do not deal with missing data. Indeed, on p. 17 they say that missing data “introduces many additional problems. Not considered here.”

Does multiple imputation work with fractional polynomials?

FP is, in some ways (but not all), an alternative to splines. Is it easier to deal with missing data in spline regression?
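
For what it’s worth, here is a minimal sketch of the multiple-imputation route with mice (made-up data and variable names). I use a natural spline with a fixed number of degrees of freedom, because keeping the same spline specification in every imputed dataset makes the coefficients conformable for pooling; re-selecting an FP transformation within each imputed dataset (e.g. with the mfp package) is exactly where the difficulty I am asking about seems to arise:

library(mice)
library(splines)

set.seed(1)
d <- data.frame(x = rnorm(200), z = rnorm(200))
d$y <- 1 + sin(d$x) + 0.5 * d$z + rnorm(200, sd = 0.3)
d$x[sample(200, 40)] <- NA                       # introduce missingness in x

imp  <- mice(d, m = 5, printFlag = FALSE)        # multiple imputation
fits <- with(imp, lm(y ~ ns(x, df = 3) + z))     # same spline specification in every imputed dataset
summary(pool(fits))                              # pooled estimates via Rubin's rules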


Get this bounty!!!

#StackBounty: #regression #time-series #predictive-models #radial-basis #rbf-network fit radial basis function network for time series …

Bounty: 50

Many publications (example) suggest training radial basis function networks (RBFs) as follows:

  1. Use unsupervised learning to determine a set of bump locations
  2. Use LMS algorithm to train output weights

Let us say we use a self-organizing map (SOM) for step 1. My understanding is that the original $n$-dimensional vectors (V1s) are mapped to the winning nodes’ prototype vectors (V2s). The V2s can then be used to perform the linear regression that yields the output weights of the RBF network. Is my understanding correct?
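
To make my understanding explicit, here is a minimal R sketch of the two-step fit, with k-means standing in for the SOM in step 1 (an assumption made for brevity) and ordinary least squares standing in for iterative LMS in step 2; the data, the number of centres, and the common bandwidth are all made up:

set.seed(42)
x <- seq(0, 10, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.1)

k       <- 8
centers <- kmeans(x, centers = k)$centers        # step 1: bump locations (k-means instead of a SOM)
sigma   <- diff(range(x)) / k                    # a crude common width for all bumps

Phi <- sapply(centers, function(ctr) exp(-(x - ctr)^2 / (2 * sigma^2)))   # Gaussian design matrix
w   <- lm.fit(cbind(1, Phi), y)$coefficients     # step 2: output weights by linear least squares

yhat <- cbind(1, Phi) %*% w                      # RBF network predictions on the training inputs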


Get this bounty!!!

#StackBounty: #r #regression #circular-statistics Interpreting circular-linear regression coefficient

Bounty: 50

I’m trying to use the circular package in R to perform regression of a circular response variable on a linear predictor, and I do not understand the coefficient value I’m getting. I’ve spent considerable time searching in vain for an explanation that I can understand, so I’m hoping somebody here may be able to help.

Here’s an example:

library(circular)

# simulate data
x <- 1:100
set.seed(123)
y <- circular(seq(0, pi, pi/99) + rnorm(100, 0, .1))

# fit model
m <- lm.circular(y, x, type="c-l", init=0)

> coef(m)
[1] 0.02234385

I don’t understand this coefficient of 0.02 — I would expect the slope of the regression line to be very close to pi/100, as it is in garden variety linear regression:

> coef(lm(y~x))[2]
         x
0.03198437

Does the circular regression coefficient not represent the change in response angle per unit change in the predictor variable? Perhaps the coefficient needs to be transformed via some link function to be interpretable in radians? Or am I thinking about this all wrong? Thanks for any help you can offer.


Get this bounty!!!