#StackBounty: #r #regression #non-linear-regression #piecewise How to perform piece wise/spline regression for longitudinal temperature…

Bounty: 50

I have temperature time series panel data and I intend to run a piecewise regression or cubic spline regression on it. I first looked into piecewise regression concepts and their basic implementation in R on SO, which gave me an initial idea of how to proceed with my workflow. In my first attempt I tried a spline regression using the splines::ns function from the splines package, but I did not get the plot I expected. For my purposes, a baseline regression, a piecewise regression, or a spline regression could all work.

Based on a post about piecewise regression on SO, I understand that I need to find appropriate breakpoints or knots before running a piecewise regression, so that the fitted line can follow the trend in the data. I adapted that solution, but it is still not efficient and it does not give me the plot I want.

Here is the general picture of my panel data specification: the first rows shown below contain my dependent variables, expressed in natural log terms, and my independent variables: average temperature, total precipitation, and 11 temperature bins, each with a 3-degree-Celsius width (< -6, -6 to -3, -3 to 0, …, > 21).

reproducible example:

Here is reproducible data, simulated to resemble my actual temperature time series panel data:

set.seed(123)  # make the simulated data reproducible

dat <- data.frame(index = rep(c('dex111', 'dex112', 'dex113', 'dex114', 'dex115'), each = 30),
                  year = 1980:2009,
                  region = rep(c('Berlin', 'Stuttgart', 'Böblingen', 'Wartburgkreis', 'Eisenach'), each = 30),
                  ln_gdp_percapita = rep(sample.int(40, 30), 5),
                  ln_gva_agr_perworker = rep(sample.int(45, 30), 5),
                  temperature = rep(sample.int(50, 30), 5),
                  precipitation = rep(sample.int(60, 30), 5),
                  bin_1 = rep(sample.int(32, 30), 5), bin_2 = rep(sample.int(34, 30), 5),
                  bin_3 = rep(sample.int(36, 30), 5), bin_4 = rep(sample.int(38, 30), 5),
                  bin_5 = rep(sample.int(40, 30), 5), bin_6 = rep(sample.int(42, 30), 5),
                  bin_7 = rep(sample.int(44, 30), 5), bin_8 = rep(sample.int(46, 30), 5),
                  bin_9 = rep(sample.int(48, 30), 5), bin_10 = rep(sample.int(50, 30), 5),
                  bin_11 = rep(sample.int(52, 30), 5))

update:

Note that each bin covers an equal-width temperature interval except for the two open-ended extremes, and each bin records the number of days that fall in the respective temperature interval.
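
For readers unfamiliar with this kind of binning, here is a minimal sketch of how such day counts could be built from daily records; daily, daily_temp, region and year are hypothetical names, and the breaks mirror the 3 °C bins described above:

# hypothetical daily data: one row per region-day with a daily mean temperature
breaks <- c(-Inf, seq(-6, 21, by = 3), Inf)      # < -6, -6 to -3, ..., > 21  (11 bins)
daily$bin <- cut(daily$daily_temp, breaks = breaks)
# days per region-year falling in each bin -> one column per bin, as in dat above
bin_counts <- as.data.frame.matrix(table(paste(daily$region, daily$year), daily$bin))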

Basically, I want to fit a spline regression to my data (say, with one dependent variable such as ln_gdp_percapita and multiple independent variables such as bin_1, bin_2, …, bin_11).

my attempt to run spline regression:

Here is what I did so far to run spline regression on temperature time series data:

fit_sp <- smooth.spline(dat$bin_6, dat$ln_gdp_percapita, nknots = 4)
pred   <- predict(fit_sp, dat$bin_6)$y   # predict.smooth.spline() returns a list with $x and $y
summary(pred)

but this is not what I want to do. So I also tried the following:

library(dplyr)
library(ggplot2)

sp2 <- lm(ln_gva_agr_perworker ~ splines::ns(bin_6, df = 2, knots = 3), data = dat)
df_ <- mutate(dat, smooth = fitted(sp2))
predict(sp2)
ggplot(df_, aes(bin_6, ln_gva_agr_perworker)) + geom_line(aes(bin_6, smooth))

but it did not render the plot I want. My understanding of piecewise regression is that we locate the breakpoints and fit separate regressions to the observations on either side of each breakpoint, which gives a gradual, piecewise-linear trend. I did not get such a result, so perhaps I should focus on finding a more efficient workflow.

In general, I want to see how the agriculture and industry sectors respond to daily temperature, and running a simple regression on the temperature bin panel data is not sufficient to reach a conclusion. I believe a restricted spline regression, a smoothing spline, or a baseline regression might produce better estimates.
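
As an illustration of the restricted spline idea, here is a minimal sketch on the simulated dat above, using splines::ns on average temperature; the covariate choice and df = 4 are assumptions rather than a recommendation:

# natural (restricted) cubic spline in average temperature, with region and year controls
fit_rcs <- lm(ln_gdp_percapita ~ splines::ns(temperature, df = 4) +
                precipitation + factor(region) + factor(year),
              data = dat)
summary(fit_rcs)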

desired regression output metrics (inspired by a related paper’s appendix):

Here is the general picture of my desired regression output metric:

        interval          b_hi      b         b_low     st error    t_stat   p-value
bin1    (-23.87:-18.44)   -0.0129   -0.0306   -0.0483   0.0090257   -3.39    0.001
bin2    (-18.44:-13.02)   -0.0050   -0.0096   -0.0141   0.0023336   -4.1     0
bin3    (-13.02:-7.59)    -0.0040   -0.0057   -0.0075   0.0008904   -6.44    0
bin4    (-7.59:-2.17)      0.0030    0.0021    0.0011   0.000492     4.23    0
bin5    (-2.17:3.26)      -0.0007   -0.0012   -0.0018   0.0002781   -4.48    0
bin6    (3.26:8.69)        0.0000    0.0000    0.0000
bin7    (8.69:14.11)       0.0008    0.0001   -0.0005   0.0003502    0.41    0.681
bin8    (14.11:19.54)      0.0010    0.0000   -0.0010   0.0005107    0.06    0.956
bin9    (19.54:24.96)      0.0028    0.0016    0.0005   0.0005737    2.85    0.004
bin10   (24.96:30.39)     -0.0031   -0.0057   -0.0083   0.0013075   -4.36    0
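
One possible route to a table of this shape, sketched on the simulated dat above: fit the bin regression with lm(), drop one bin (the zero row for bin6 suggests it is the omitted reference, so bin_6 is dropped here), and compute cluster-robust standard errors. The sandwich and lmtest packages, the covariate list, and clustering by region are assumptions:

library(sandwich)
library(lmtest)

fit <- lm(ln_gdp_percapita ~ bin_1 + bin_2 + bin_3 + bin_4 + bin_5 +
            bin_7 + bin_8 + bin_9 + bin_10 + bin_11 +
            precipitation + factor(region) + factor(year),
          data = dat)

vc <- vcovCL(fit, cluster = dat$region)        # covariance matrix clustered by region
ct <- coeftest(fit, vcov. = vc)                # b, st error, t_stat, p-value
ci <- coefci(fit, vcov. = vc, level = 0.95)    # b_low, b_hi
round(cbind(ct[, 1:4], ci)[grep("^bin", rownames(ct)), ], 4)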

desired scatter plot:

Here is the desired scatter plot I want to achieve (a simulated plot inspired by a related paper’s figure):

[figure omitted: scatter plot of estimated bin coefficients with a 95% confidence band]

In this plot, the black points and connecting line are the estimated regression coefficients (from either the baseline or the restricted spline regression), and the dotted blue lines are the 95% confidence interval based on clustered standard errors.
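
A minimal sketch of such a figure with ggplot2, reusing the ct and ci objects from the clustered-SE sketch above (the styling is only an approximation of the description: black points and line for the estimates, dotted blue lines for the 95% interval):

library(ggplot2)

bins <- grep("^bin", rownames(ct), value = TRUE)
res <- data.frame(bin   = factor(bins, levels = bins),
                  b     = ct[bins, 1],
                  b_low = ci[bins, 1],
                  b_hi  = ci[bins, 2])

ggplot(res, aes(x = bin, y = b, group = 1)) +
  geom_point(colour = "black") +
  geom_line(colour = "black") +
  geom_line(aes(y = b_low), linetype = "dotted", colour = "blue") +
  geom_line(aes(y = b_hi),  linetype = "dotted", colour = "blue") +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "temperature bin", y = "estimated effect on ln GDP per capita")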

How do I accomplish my desired output efficiently and with minimal code? Could someone point me in the right direction? Any more thoughts? Thanks in advance!



#StackBounty: #regression #outliers #robust #diagnostic Iterative outlier diagnostic

Bounty: 50

I am working on outlier diagnostics and I have a question about the best way to conduct them. Irrespective of the way an outlier is defined (i.e., statistical index, threshold), some of my colleagues argue that we should check for outliers in our dataset only once, discard them, and then conduct our main analysis. However, it makes more sense to me to treat outlier diagnostics as an iterative procedure: check for outliers, discard them (assuming they are influential points, of course), repeat these two steps until no outlier emerges from the dataset, and then perform the main analysis. My understanding of iterative outlier diagnostics seems consistent with Parrinello et al. (2016)’s method of iterative outlier removal.

For instance, I am using the median absolute deviation (MAD; e.g., Leys et al., 2013) and I chose to consider an observation an outlier if its absolute deviation from the median of the dataset was at least 3 MAD. My initial sample size was N = 36 and I detected three outliers. According to my colleagues, I should have discarded these three observations and conducted my main analysis without checking whether new outliers appeared in my reduced sample (N = 33). Their main argument is that otherwise I would detect too many outliers by the end of the iterative diagnostic. However, it does not make sense to me to check for outliers only once, so I checked iteratively until no outlier remained according to my initial threshold (i.e., an absolute deviation from the median of at least 3 MAD). I noticed that conducting the diagnostic iteratively was almost equivalent to conducting a single diagnostic with a less conservative threshold (i.e., 2 MAD instead of 3 MAD). Maybe a single diagnostic with a less conservative threshold would be a better, time-saving alternative to an iterative diagnostic with a more conservative threshold. What do you think about that?
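
To make the comparison concrete, here is a minimal sketch of the iterative 3 MAD rule on a generic numeric vector; the helper function and the simulated data are purely illustrative:

remove_outliers_iteratively <- function(x, k = 3) {
  repeat {
    med <- median(x)
    cutoff <- k * mad(x, center = med)   # mad() rescales by 1.4826 by default
    keep <- abs(x - med) <= cutoff
    if (all(keep)) return(x)             # stop when no observation exceeds the threshold
    x <- x[keep]
  }
}

set.seed(1)
x <- c(rnorm(33), 8, 9, 10)                     # 33 ordinary values plus 3 gross outliers
length(remove_outliers_iteratively(x, k = 3))   # sample size left after iterative removal
length(x[abs(x - median(x)) <= 2 * mad(x)])     # single pass with the laxer 2 MAD rule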

I know that discarding outliers is not necessarily the best way to manage them and that using robust methods would certainly be a better alternative. However, I am not familiar with robust methods and I still have to analyze my data. In addition, although Fox (1991, p. 40) advised against mindless outlier deletion and argued in favor of robust methods, he also pointed out that a non-robust analysis with thoughtful outlier deletion is almost equivalent to a robust analysis in which all outliers are downweighted to 0.

Thanks in advance for your thoughts and advice

References

Fox, J. (1991). Regression Diagnostics: An Introduction. Newbury Park, CA: SAGE Publications.

Leys, C., Ley, C., Klein, O., Bernard, P., & Licata, L. (2013). Detecting outliers: Do not use standard deviation around the mean, use absolute deviation around the median. Journal of Experimental Social Psychology, 49(4), 764‑766. https://doi.org/10.1016/j.jesp.2013.03.013

Parrinello, C. M., Grams, M. E., Sang, Y., Couper, D., Wruck, L. M., Li, D., … Coresh, J. (2016). Iterative Outlier Removal: A Method for Identifying Outliers in Laboratory Recalibration Studies. Clinical Chemistry, 62(7), 966‑972. https://doi.org/10.1373/clinchem.2016.255216



#StackBounty: #regression #multivariate-analysis #distance #regression-strategies Determine rate of change in dissimilarity (distance)?

Bounty: 50

I have repeated-measures plant abundance data for 37 forest plots across 80 years, involving 50+ species of plants.

  • The data are structured as:
    • Columns = different species,
    • Rows = separate samples [Plot-Year combos],
    • Each cell = abundance (i.e., basal area) of the given species in the given sample.

    Simplified Example (from here):

    > abund.data
    
        Plot Year Sp1 Sp2 Sp3 Sp4
      1   P1    1   1   2   0   0
      2   P2    1   1   0   3   2
      3   P3    1   0   2   1   0
      4   P1    2   1   2   0   0
      5   P2    2   1   0   3   2
      6   P3    2   0   2   1   0
    

I’ve calculated a Bray-Curtis dissimilarity (distance) matrix from these data.

Continuing the example:
library(ecodist)
distance(abund.data[, -c(1:2)], 'bray')

          1         2         3         4         5
2 0.7777778                                        
3 0.3333333 0.7777778                              
4 0.0000000 0.7777778 0.3333333                    
5 0.7777778 0.0000000 0.7777778 0.7777778          
6 0.3333333 0.7777778 0.0000000 0.3333333 0.7777778

I want to calculate the rate at which plots change in community composition over time.

I had originally run a non-metric multidimensional scaling (NMDS) ordination and wanted to simply calculate changes in NMDS space.

  • i.e., I wanted to create change vectors between plot points in subsequent years (I did so here) and then compare the lengths between years using some sort of regression….

    ChangeVectorLength ~ Time | Plot

However, I don’t think this is valid because of the rank-based construction of the NMDS ordination.

Is there a way I could do something similar but using the “raw” distance (dissimilarity) values?

  • For example (using the example data above): I want to quantify how much the community of species (as a whole) in Plot P1 has changed from Year 1 to Year 2.
    • However, because the distance matrix is simply a matrix of pairwise distances between all points, I’m not sure how to go about quantifying change in “distance space” (one possible sketch follows below).
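
A minimal sketch of one possible approach: compute the Bray-Curtis distance between consecutive surveys within each plot, then regress that change on time. The lme4 mixed model at the end is just one assumed way to pool plots, and it needs several survey years per plot (fine for the full 80-year data, not for the toy example):

library(ecodist)

change <- do.call(rbind, lapply(split(abund.data, abund.data$Plot), function(p) {
  p <- p[order(p$Year), ]
  d <- as.matrix(distance(p[, -(1:2)], "bray"))             # pairwise Bray-Curtis within the plot
  data.frame(Plot  = p$Plot[-1],
             Year  = p$Year[-1],
             years = diff(p$Year),                          # years elapsed since previous survey
             bray  = d[cbind(2:nrow(p), 1:(nrow(p) - 1))])  # distance to the previous survey
}))

# rate of compositional change: dissimilarity per year, pooled across plots
library(lme4)
fit <- lmer(bray / years ~ Year + (1 | Plot), data = change)
summary(fit)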



#StackBounty: #regression #interaction #fixed-effects-model Sensible to remove interaction term in this instance, to interpret a main e…

Bounty: 50

Predictor A is a continuous predictor.

Predictor B is a dummy coded categorical predictor with three levels.

I run a regression including PA, PB and PA*PB.

The results indicate a significant effect of PB (level 3 vs. level 1) and an interaction between PA and PB (level 2 vs. level 1).

I understand that so long as the interaction is in the model, the effect of PB (level 3 vs. level 1) is only a simple effect. Does it make sense to remove the interaction between PA and PB in order to interpret the main effect of PB (level 3 vs. level 1)? Or am I stuck only being able to talk about the simple effect of PB (3 vs 1) because another level of PB is involved in an interaction?
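
For concreteness, a minimal sketch with hypothetical names (y, a continuous PA, a three-level factor PB) showing the two specifications being compared:

fit_int  <- lm(y ~ PA * PB, data = d)    # PA, PB, and the PA:PB interaction
fit_main <- lm(y ~ PA + PB, data = d)    # main effects only
anova(fit_main, fit_int)                 # joint test of the interaction terms
summary(fit_int)$coefficients            # PB contrasts here are simple effects at PA = 0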



#StackBounty: #regression #predictive-models #multivariate-regression Predicting a single response variable with a multivariate regress…

Bounty: 50

For my research I am interested in performing inferential and predictive analysis on a single response variable. I have already made one univariate and one multivariate regression model. In the multivariate model, the response variable which I am particularly interested in is modeled together with another response variable. The multivariate model that I use can be described as a vector autoregressive model with random effects.

For the inferential part of my research, I am already aware that my multivariate model has some advantages over the univariate model. However, I would like to ask the following:

‘Does it make any sense to predict the single response variable I am interested in with my multivariate regression model rather than with my univariate regression model?’

I have read some research in which the authors show that predicting, for example, multiple macroeconomic series can benefit from a multivariate model rather than multiple univariate models. In my case I have a multivariate regression model in which I am actually interested in only one of the two response variables. The only reason I built a multivariate model in the first place was the inferential part of my research.



#StackBounty: #regression #lasso #convergence #high-dimensional High-dimensional regression: why is $log p/n$ special?

Bounty: 100

I am trying to read up on the research in the area of high-dimensional regression, where $p \gg n$. The term $\log p/n$ seems to appear often in the rates of convergence for regression estimators.

For example, here, equation (17) says that the lasso fit $\hat{\beta}$ satisfies
$$ \frac{1}{n}\|X\hat{\beta} - X\beta\|_2^2 = O_P\left(\sigma \sqrt{\frac{\log p}{n}}\, \|\beta\|_1\right). $$

Usually, this also implies that $\log p$ should be smaller than $n$.

  1. Is there any intuition as to why this ratio $\log p/n$ is so prominent?
  2. Also, it seems from the literature that the high-dimensional regression problem gets complicated when $\log p \geq n$. Why is that?
  3. Is there a good reference that discusses the issues of how fast $p$ and $n$ should grow compared to each other?




#StackBounty: #scikit-learn #regression #supervised-learning #hyperparameter #hyperparameter-tuning Optimising Kernel parameters using …

Bounty: 50

I want to optimize the kernel parameters (hyper-parameters) of scikit-learn’s GaussianProcessRegressor using my training data. Following is my query:

My training datasets are:

X: 2-D Cartesian coordinates as the input data

y: radio signal strength (RSS) at those 2-D coordinate points as the observed output

What I’ve done so far:

I’ve installed Python and scikit-learn, and I’ve successfully tested the sample code. I’m able to predict RSS at test points using the training data. I use a squared exponential kernel.

What I want to do:

I want to train the kernel parameters (hyper-parameters) with different optimization algorithms such as gradient descent, swarm intelligence, and trust-region-reflective algorithms.

What I learned and What help I am asking for:

I’ve learned that, in scikit-learn’s GaussianProcessRegressor class, the optimizer argument is where I can plug in my own optimization algorithm. Since it must be a callable, I need to write my own function for it. Can I use an existing library of optimization algorithms with the GaussianProcessRegressor class? Are such libraries available for Python? Could anybody provide sample code for using a kernel parameter optimization algorithm in GaussianProcessRegressor? I’ve also learned that the training data are used to optimize the hyper-parameters. Could anybody provide some insight into how the training data relate to the optimization algorithm?

Thank you!



#StackBounty: #regression #regression-coefficients #categorical-encoding difference between dummy variable categories that weren't …

Bounty: 50

Assume we have a categorical variable, one-hot encoded, with three or more categories: {race1, race2, ..., race_n}.
To avoid the dummy variable trap, assume we omitted race1. Knowing the coefficients of race2, ..., race_n lets us compare each of them to race1.

What I’m trying to figure out is how to compare race2 with race3 within the same model, without rerunning the regression.
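
A minimal sketch of the standard approach, assuming the dummies race2, ..., race_n are columns in the data and race1 is the omitted reference: the race2 vs. race3 contrast and its standard error come straight from the fitted coefficients and their covariance matrix, with no refit (the model below and its three dummies are hypothetical):

fit <- lm(y ~ race2 + race3 + race4, data = d)   # race1 omitted as the reference category
b <- coef(fit)
V <- vcov(fit)

diff_23 <- b["race2"] - b["race3"]
se_23   <- sqrt(V["race2", "race2"] + V["race3", "race3"] - 2 * V["race2", "race3"])
c(difference = unname(diff_23), t = unname(diff_23 / se_23))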


