#StackBounty: #regression #least-squares #sufficient-statistics Sufficient Statistic for $\beta$ in OLS

Bounty: 100

I have the classical regression model

$$y = X\beta + \epsilon$$
$$\epsilon \sim N(0, \sigma^2 I)$$

where $X$ is taken to be fixed (not random), and $\hat\beta$ is the OLS estimate for $\beta$.

It is known that the pair $(y^T y, X^T y)$ is a complete sufficient statistic for $x_0^T \beta$, for some input $x_0$.

Can we conclude that $(y^T y, X^T y)$ is also a sufficient statistic for $\beta$, and why? I think for this to work $X^T X$ should be full rank. A one-to-one transformation of a sufficient statistic is still a sufficient statistic, but that only gives sufficiency for $x_0^T \beta$. On what basis can we conclude the sufficiency of $\hat\beta$ for $\beta$ itself?
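
For reference, the claimed sufficiency follows from the factorization theorem applied to the normal likelihood of the fixed-design model above, whose exponent expands as

$$f(y;\beta,\sigma^2) = (2\pi\sigma^2)^{-n/2}\exp\!\left(-\frac{1}{2\sigma^2}\left(y^T y - 2\beta^T X^T y + \beta^T X^T X \beta\right)\right),$$

which depends on the data only through $(y^T y, X^T y)$; the pair is therefore sufficient for $(\beta, \sigma^2)$ jointly.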


Get this bounty!!!

#StackBounty: #r #regression #interaction How to evaluate all possible contrasts of an interaction effect

Bounty: 50

Let’s say I have an experiment where I pair people up with individuals of either the same political orientation or a different one. I track before-and-after measures of a respondent’s belief in global warming.

Consider the following specification:

library(estimatr)

fit <- lm_robust(
  belief_global_warming ~ politics * partner_politics + sex,  # '*' expands to both main effects plus their interaction
  data = data,
  clusters = team_id,
  se_type = "stata"
)

Let’s say the output for the coefficients of interest look like so:

                                               Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper  DF
politicsRepublican                              6.3265     2.6573  2.3808  0.01893   1.0625  11.5905 114
partner_politicsRepublican                      1.2334     1.5024  0.8210  0.41338  -1.7428   4.2096 114
politicsRepublican:partner_politicsRepublican  -6.5873     2.9706 -2.2175  0.02857 -12.4720  -0.7026 114

So, the reference group auto-selected by R is Democrats paired with Democrats. I want to be able to say that a Democrat paired with a Republican has a different response in the DV than any of the other combinations (RD, RR, DD).

What is the appropriate way to do the following:

(1) Compare mixed groups (a Republican paired with a Democrat or a Democrat paired with a Republican) against all other groupings to detect whether before-and-after changes are significant relative to all reference groups, while also controlling for multiple tests.

My thought was to just set a different reference group and rerun all the possible combinations, but I remembered Stata’s contrast command and wonder if there’s a parsimonious R equivalent (a sketch of one option appears after question (2) below).

(2) Is putting the DV in as a before-and-after change the appropriate method? I have heard a suggestion about keeping the DV as a level and adding before and after dummies (like a diff-in-diff approach), but I’m not sure I understand it.
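
For (1), one parsimonious option, roughly analogous to Stata’s contrast, is the emmeans package. The sketch below assumes emmeans accepts the lm_robust fit (if not, the same calls work on a plain lm refit), and the custom contrast weights are illustrative; check them against the printed grid order before using them.

library(emmeans)

# Estimated marginal means for every politics x partner_politics cell
emm <- emmeans(fit, ~ politics * partner_politics)
emm  # print to confirm the cell order before writing custom contrasts

# All pairwise comparisons among the four cells (DD, RD, DR, RR),
# with an adjustment for multiple testing
pairs(emm, adjust = "tukey")

# A custom contrast: mixed pairings vs. same-party pairings
# (weights assume the printed order D:D, R:D, D:R, R:R)
contrast(emm, method = list("mixed vs same" = c(-0.5, 0.5, 0.5, -0.5)))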


Get this bounty!!!

#StackBounty: #regression #bayesian #propensity-scores Use of svyglm for a weighted regression in a bayesian framework

Bounty: 50

I am using the twang package in R to balance two groups by creating propensity scores, which are then used as weights in svyglm for a weighted regression of the two groups.

However, I would like to use the weights in a Bayesian GLM, since this is the model employed earlier in the analysis as well. How could I implement this, or is there a package that allows for propensity-weighted regression in a Bayesian context?

Edit: I have read that the weights parameter in Stan is not equivalent to the one in svyglm; however, it seems that brms allows for survey-weighted regression in the same manner as svyglm does. Is that correct?
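
For concreteness, passing the twang weights to brms would look roughly like the sketch below. The variable names (y, treat, covariate1, ps_weight, analysis_data) are illustrative, and note that brms treats these as likelihood (importance) weights, which is not the same thing as reproducing svyglm's design-based standard errors.

library(brms)

fit_bayes <- brm(
  y | weights(ps_weight) ~ treat + covariate1,  # propensity-score weights enter the likelihood
  data   = analysis_data,
  family = gaussian()
)
summary(fit_bayes)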


Get this bounty!!!

#StackBounty: #regression #cross-validation #xgboost Cross Validation Results Interpretation (XGBoost model)

Bounty: 50

I have a regression model using XGBoost for which I was getting great MAE and MAPE results on my test dataset.

mape: 2.515660669106389
mae: 90591.77886478149

Thinking that it was too good to be true, I ran 10-fold cross-validation on the training dataset and got the following results. The fold-level MAE values are plotted as a histogram below.

import numpy as np
import matplotlib.pyplot as plt
from xgboost import XGBRegressor
from sklearn.model_selection import KFold, cross_val_score

xg = XGBRegressor(learning_rate=0.5, n_estimators=50, max_depth=4, random_state=4)
kfold = KFold(n_splits=10, shuffle=True, random_state=7)  # random_state requires shuffle=True
results = cross_val_score(xg, X_train, Y_train, cv=kfold, scoring='neg_mean_absolute_error')

# Scores come back as negative MAE on the scaled target; take absolute values
# and map them back to the original scale with the previously fitted scaler_y
results_y = scaler_y.inverse_transform(np.abs(results.reshape(-1, 1)))
print(results_y)

plt.hist(results_y, bins=20)
plt.xlabel('MAE')
plt.ylabel('Number of folds')
plt.show()

Results (MAE):

[[1737985.90765678]
 [ 466277.11674066]
 [  47184.70876369]
 [ 129014.99538841]
 [  23133.30322564]
 [  44112.92209214]
 [  69724.235821  ]
 [ 119278.83633742]
 [  39059.981985  ]
 [   8856.48620648]]

[Histogram of the cross-validated MAE values]

So my questions are:

1) Have I over-trained on my test dataset for some reason?

2) Is the distribution of the cross-validated results reasonable? If it is not, what should I be seeing?

3) If I have over-trained for some reason, what are the ways to mitigate this, and what could be some of the reasons? Specifically with regard to XGBoost.

Thank you.


Get this bounty!!!

#StackBounty: #regression #logistic #residuals Regressing Logistic Regression Residuals on other Regressors

Bounty: 50

With OLS regression applied to a continuous response, one can build up the multiple regression equation by sequentially running regressions of the residuals on each covariate. My question is: is there a way to do this with logistic regression via logistic regression residuals?

That is, if I want to estimate $\Pr(Y = 1 \mid x, z)$ using the standard generalized linear modeling approach, is there a way to run a logistic regression against $x$, obtain pseudo-residuals $R_1$, and then regress $R_1$ on $z$ to get an unbiased estimator of the logistic regression coefficients? References to textbooks or literature would be appreciated.
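
For the OLS case being referenced, the residual-regression (Frisch–Waugh–Lovell) construction looks roughly like the sketch below, on simulated data with illustrative variables x and z:

set.seed(1)
n <- 500
x <- rnorm(n)
z <- rnorm(n)
y <- 1 + 2 * x - 3 * z + rnorm(n)

# Coefficient on z from the full multiple regression
coef(lm(y ~ x + z))["z"]

# Same coefficient via residuals: partial both y and z on x, then regress the residuals
ry <- resid(lm(y ~ x))
rz <- resid(lm(z ~ x))
coef(lm(ry ~ rz))["rz"]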


Get this bounty!!!

#StackBounty: #regression #optimization #gan #generative-models Can a GAN-like architecture be used for maximizing the value of a regre…

Bounty: 100

I can’t seem to convince myself why a GAN model similar to regGAN couldn’t be modified to maximize a regression predictor (see the image below). By changing the loss function to the difference between the current predicted value and the maximum predicted value generated so far, wouldn’t gradient descent converge such that the generator builds the inputs that maximize the prediction of the discriminator CNN?

In math terms, the loss calculation would look like:

  yhat = current prediction
  ymax = best prediction achieved yet
  Loss = ymax - yhat
  if Loss < 0 then Loss = 0; ymax = yhat
  Back-propagate the loss using SGD

If the current predicted value is higher than the maximum predicted so far, then the loss is 0 and the maximum is updated. Essentially, we are changing the objective from generating inputs that look real to generating inputs that optimize the complex function encoded in the CNN.

[Figures: GAN network architecture; discriminator]


Get this bounty!!!

#StackBounty: #r #regression #machine-learning #predictive-models #r-squared Multiple Regression, good P-value, but Low R2

Bounty: 50

I am trying to build a model in R to predict Conversion Rate (CR) based on age, gender, and interest (and also the campaign_Id):

The CR values look like this:

[Plot of the CR values]

The correlation coefficients are not very promising:

rcorr(as.matrix(data.numeric))

correlations with CR:

xyz_campaign_id (-0.19), age (-0.10), gender (-0.04), interest (-0.03)

So, the model below:

library(caret)  # loaded for later modelling experiments; the split below uses base R

set.seed(100)
TrainIndex <- sample(seq_len(nrow(data)), size = floor(0.8 * nrow(data)))
data.train <- data[TrainIndex, ]
data.test  <- data[-TrainIndex, ]
nrow(data.test)

model <- lm(CR ~ age + gender + interest + xyz_campaign_id, data = data.train)

does not have a good adjusted R-squared (0.046):

Call:
lm(formula = CR ~ age + gender + interest + xyz_campaign_id, 
    data = data.train)

Residuals:
    Min      1Q  Median      3Q     Max 
-18.636 -11.858  -4.087   0.115  96.421 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     47.231250   6.287738   7.512  1.4e-13 ***
age35-39         1.214713   1.916649   0.634  0.52639    
age40-44        -1.971037   1.986316  -0.992  0.32131    
age45-49        -3.064858   1.866713  -1.642  0.10097    
genderM          3.709192   1.412311   2.626  0.00878 ** 
interest         0.030384   0.027617   1.100  0.27154    
xyz_campaign_id -0.037856   0.006076  -6.231  7.1e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 21.16 on 907 degrees of freedom
Multiple R-squared:  0.05237,   Adjusted R-squared:  0.04611 
F-statistic: 8.355 on 6 and 907 DF,  p-value: 7.81e-09

I also understand that I should probably convert “interest” from numeric to a factor (I have tried that too, although I used all 40 interest levels, which is not ideal).
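
For concreteness, the factor conversion and an out-of-sample check on the held-out test set might look like the sketch below; the column names follow the model formula above, and the RMSE line is just one way to compare candidate models.

# Convert interest to a factor, keeping the test set on the training levels
data.train$interest <- factor(data.train$interest)
data.test$interest  <- factor(data.test$interest, levels = levels(data.train$interest))

model_f <- lm(CR ~ age + gender + interest + xyz_campaign_id, data = data.train)

# Out-of-sample error on the held-out 20%, comparable across candidate models
pred <- predict(model_f, newdata = data.test)
sqrt(mean((data.test$CR - pred)^2, na.rm = TRUE))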

So, based on the provided information, is there any way to improve the model? What other models should I try besides linear models to make sure that I have a good predictive model?

If you need more information, the challenge is available Here. Data is Here


Get this bounty!!!

#StackBounty: #regression #regression-coefficients How to deal with various sample sizes in the calculation of a predictor variable?

Bounty: 50

Let’s say one of the predictor variables in a regression model is 3-point shooting percentage. However, some of the observations (players) have only one or two attempts while others have many more. In a regression model, what are some techniques so that ability demonstrated over a large number of attempts is given relatively more weight than ability demonstrated over fewer attempts?

For example…

Player      3PA 3PM 3P%
Player 2    174 52  29.9%
Player 3    156 64  41.0%
Player 4    4   3   75.0%
Player 5    134 45  33.6%

Player #4’s low sample size makes his percentage less reliable than the other observations, but it’s not meaningless. Are there transformations or other techniques for handling this?


Get this bounty!!!

#StackBounty: #regression #econometrics #panel-data #random-effects-model #units Which is the dimension (or units) of the predicted ran…

Bounty: 100

Consider a simple panel data model (or multilevel model) with random effects. For context, consider a wage regression, where the dependent variable $\ln(y_{it})$ is the natural log of the wage, measured in £ per hour. The regression to be estimated is:

$$\ln(y_{it}) = X_{it}\beta + \zeta_{i} + \eta_{t} + \epsilon_{it}$$

where $\zeta_i$ and $\eta_t$ represent individual heterogeneity and year effects, respectively, and $\epsilon_{it}$ is white noise (or idiosyncratic error).

You estimate the above model and obtain predictions of the random effects. I have three related questions.

Question 1:

What are the dimensions/units of the two error components? Do they have the same units as the dependent variable (which actually has no units, because a logarithm is dimensionless)? If so, is there a formal proof of this?

Question 2:

If the answer to Q1 is yes, does that mean that $\exp(\zeta_i)$ and $\exp(\eta_t)$ are measured in £ per hour?

Question 3:

But then, how can we go back to the theory? For instance, my theory could assume that workers are paid according to their productivity. Therefore, the wage can somehow be decomposed as something like

$$ y_{it} = \omega_t h_{it} $$

where $h_{it}$ is productivity (output per hour) and $\omega_t$ is the pay rate per unit of output, i.e. £ per unit of output, which combined give £ per hour. Thus, if one wanted to use such a wage regression to recover those two elements, it seems impossible to do so, because everything we measure is always in the same units as the left-hand-side variable. We can therefore never go back to the theory.

To put it differently, say the answer to Q1 is yes (as I expect it to be). Then let’s exponentiate the regression:

$$ y_{it} = \exp(X_{it}\beta)\,\exp(\zeta_i)\,\exp(\eta_t)\,\exp(\epsilon_{it}) $$

So, $y_{it}$ is measured in £ per hour. How do we get the same units from the right-hand side? If the exponentials of the two random effects (and of the error term) are measured in £ per hour (Q2), then it is up to $\exp(X_{it}\beta)$ to balance the units of the equation. But for this to be the case, the units of the latter would have to be $\left(\frac{\text{hour}}{£}\right)^2$, which looks totally arbitrary. Furthermore, how can we ever go back to the theory and write the resulting estimates in terms of productivity and pay per unit of output? (Q3)


Get this bounty!!!

#StackBounty: #regression #t-test #simulation #model-comparison #f-test Is this an appropriate way to compare simulated and measured da…

Bounty: 50

I have a probably fairly basic question which I couldn’t find an answer to on here or anywhere else. Would really appreciate any thoughts on this 🙂

I am modelling the propagation of sound through a building. I have an observational dataset with ~20 observations, each observation representing the total volume over a set time period for a different point within the building. I have then built three different simulation models which produce estimates of the same total volume variable at each of the same points within the building. These data are all single-point estimates only (the models are deterministic and the observational data was produced with an instrument that doesn’t give an indication of error) rather than estimates with a confidence interval etc.

The simulations are statistical – think Excel-type model with deterministic equations (though it’s not in Excel)

How could I test: (1) is there a significant difference between each set of simulated results and the observational data, and (2) which simulation model is the closest match to the observational data?

My initial thoughts were:

  • To test for differences: an F-test for equal variances, then an unpaired t-test for a difference in means (taking the variance result into account), for one model vs. the observations at a time

  • To pick the model that most closely represents reality: a set of single OLS regressions, using the F-statistic or R2 to select the set of modelled data (X) that best ‘predicts’ the observed data (Y). I am thinking that because I am not interested in the regression coefficients themselves or in any out-of-sample predictive or explanatory power, it doesn’t make sense to hone the model in any way (transformations, standard errors, different varieties of regression, etc.) – the ‘perfect’ model here would be y = x. However, that also feels a bit sketchy?

Some other thoughts which I wondered about but not sure about them as I can’t find much info on them:

  • Some comparison of the sum of squared differences (or absolute differences) between the observations and the modelled results?
  • Something about testing whether the regression slope coefficient is different from 1 (as 1 = a ‘perfect’ model result where it exactly predicts the observed data) – however, I haven’t seen this done and can’t quite get my head around how it would work or what it would show (a rough sketch of some of these checks appears after this list)
  • Picking the best model based on the t-statistic from the t-test results
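
As a rough, self-contained sketch of a few of these checks on simulated stand-in data (all names and numbers are illustrative, not taken from the question):

set.seed(42)
observed <- rnorm(20, mean = 60, sd = 8)            # ~20 measured volumes
model_a  <- observed + rnorm(20, 0, 3)              # a model close to reality
model_b  <- 0.8 * observed + 15 + rnorm(20, 0, 5)   # a more biased model

# F-test for equal variances, then a two-sample t-test, one model at a time
var.test(model_a, observed)
t.test(model_a, observed)

# Sum of squared differences / RMSE as a simple closeness measure
sapply(list(A = model_a, B = model_b),
       function(m) sqrt(mean((m - observed)^2)))

# Regress observed on modelled values; a 'perfect' model would have slope 1
fit_a <- lm(observed ~ model_a)
confint(fit_a)["model_a", ]   # does the 95% interval contain 1?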

As a secondary question, it looks like certain models are better at predicting specific parts of the building than others, which is quite interesting. I was thinking of running a t-test for those specific parts of the building (i.e. cutting the dataset down to those observations only) to test whether, for example, model A differs significantly from the observed data in room A but not in room B.

I’d really appreciate any thoughts, ideas, suggestions etc.! Ideally I am looking for something that is relatively simple to implement and interpret because (as you can probably tell) my statistical knowledge is (at the moment!) fairly basic. Thank you in advance! 🙂

PS it is pretty ‘obvious’ from eyeballing the data that some models produce results that are closer to the real data than others, but there are some which are close to each other and I’d like to try approaching this in a more rigorous way.


Get this bounty!!!