#StackBounty: #r #inference Statistical analyses for choices behaviour, free and restricted context evolution

Bounty: 50

I’m looking for help with a fairly complex system.
Data is my dataset. For one travel Type I’m studying seating choice behaviour: I record the walked distance 'Effort' from the entrance to each subject's chosen seat, and Type_s is the type of that seat.

There is also a possible crowding effect on these distances: the more a subject perceives the nearby places as taken, the more likely he is to walk further in order to sit alone and keep more options. There may also be a slight tendency for seats to be exhausted in order of nearest distance. Finally, there are two Floor levels with almost the same layout, but one implies additional Effort.

P_exhaustion measures how exhausted the places are overall; when 100 % is reached, no places are left.

ind is a chronological indicator: if it equals 1, for example, that row is the first person's choice of distance/place. For every distance we have the same set of places, "f" (door), "m" (middle), "c" (corridor), except for the "s" (solo) place installed at the extremity of the bus.

Questions :

- Will a subject provide more `Effort` depending on `Type_s`?
- Does seat exhaustion happen nearest-first (near-to-near)?
- Does a higher `P_exhaustion` imply more effort, or is there a crossing effect in which well-distributed choices (to be alone) neutralise the distance imposed by the last remaining places?
- Could we somehow test how behaviour differs between free choice, restricted choice and no choice?

My only attempt at explaining the variation in Effort is below; I’m not sure it is the best test to use:

aov.ex1 = aov(Effort~ Type_s +  P_exhaustion + Floor,data=Data)  #do the analysis of variance
summary(aov.ex1) 

I’m looking for help to formalize these effects and to test them statistically.
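
One possible extension of the aov above, as a sketch only (the numeric recoding of the exhaustion bins to their midpoints and the interaction term are my own assumptions, not a definitive analysis):

Data <- droplevels(Data)                                    # drop unused factor levels
Data$p_mid <- (as.numeric(Data$P_exhaustion) - 0.5) * 10    # (0,10] -> 5, (10,20] -> 15, ...

fit <- lm(Effort ~ Type_s * p_mid + Floor, data = Data)
anova(fit)      # does the effect of exhaustion on Effort differ across seat types?
summary(fit)    # coefficient-level detail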

Data=structure(list(Type = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("", "29v1", 
"29v2", "80v1", "80v2"), class = "factor"), ind = 1:82, Effort = c(3, 
10, 1, 5, 4, 2, 1, 8, 2, 13, 1, 6, 3, 5, 2, 1, 3, 8, 3, 6, 1, 
3, 7, 1, 2, 2, 2, 1, 2, 5, 4, 1, 6, 3, 6, 8, 2, 3, 4, 7, 3, 2, 
6, 2, 3, 7, 1, 5, 4, 1, 4, 3, 2, 3, 5, 5, 2, 1, 1, 5, 8, 7, 2, 
2, 4, 3, 4, 4, 2, 2, 10, 7, 5, 3, 3, 5, 7, 5, 3, 4, 5, 4), Type_s = structure(c(4L, 
6L, 4L, 4L, 4L, 4L, 6L, 3L, 6L, 4L, 3L, 4L, 4L, 4L, 6L, 6L, 4L, 
4L, 3L, 4L, 6L, 4L, 4L, 6L, 4L, 3L, 4L, 4L, 6L, 4L, 4L, 3L, 4L, 
3L, 4L, 4L, 4L, 4L, 4L, 3L, 3L, 3L, 4L, 4L, 3L, 3L, 4L, 3L, 3L, 
3L, 3L, 3L, 6L, 4L, 3L, 3L, 3L, 6L, 4L, 3L, 4L, 4L, 3L, 3L, 4L, 
3L, 3L, 3L, 4L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 3L, 3L, 3L, 3L, 
4L), .Label = c("", "*", "c", "f", "m", "s"), class = "factor"), 
    P_exhaustion = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
    3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 
    5L, 5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 7L, 7L, 7L, 7L, 
    7L, 7L, 7L, 7L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 9L, 9L, 9L, 
    9L, 9L, 9L, 9L, 9L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 
    10L), .Label = c("(0,10]", "(10,20]", "(20,30]", "(30,40]", 
    "(40,50]", "(50,60]", "(60,70]", "(70,80]", "(80,90]", "(90,100]"
    ), class = "factor"), Floor = structure(c(2L, 2L, 2L, 2L, 
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 2L, 2L, 3L, 2L, 2L, 
    3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 
    3L, 3L, 3L, 3L, 2L, 3L, 2L, 2L, 3L, 2L, 2L, 2L, 3L, 2L, 2L, 
    3L, 3L, 2L, 3L, 2L, 2L, 2L, 2L, 3L, 2L, 2L, 2L, 3L, 2L, 3L, 
    3L, 3L, 3L, 2L, 3L, 3L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
    3L, 3L, 3L), .Label = c("", "down", "up"), class = "factor")), .Names = c("Type", 
"ind", "Effort", "Type_s", "P_exhaustion", "Floor"), class = "data.frame", row.names = c(1L, 
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 
16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L, 
29L, 30L, 31L, 32L, 33L, 34L, 35L, 36L, 37L, 38L, 39L, 40L, 41L, 
42L, 43L, 44L, 45L, 46L, 47L, 48L, 49L, 50L, 51L, 52L, 53L, 54L, 
55L, 56L, 57L, 58L, 59L, 60L, 61L, 62L, 63L, 64L, 65L, 66L, 67L, 
68L, 69L, 70L, 71L, 73L, 75L, 76L, 79L, 80L, 81L, 82L, 84L, 85L, 
86L, 91L))


Get this bounty!!!

#StackBounty: #r #regression #non-linear-regression #piecewise How to perform piece wise/spline regression for longitudinal temperature…

Bounty: 50

Here I have temperature time-series panel data, and I intend to run a piecewise regression or a cubic spline regression on it. I first looked into piecewise regression concepts and their basic implementation in R on SO, which gave me an initial idea of how to proceed. In my first attempt I tried to run a spline regression using the splines::ns function from the splines package, but I didn’t get the right plot. Baseline regression, piecewise regression or spline regression would all work for me.

Based on the SO post about piecewise regression, I understand that I need to find appropriate breakpoints or knots before running the regression, so that it renders a plausible trend line for the data. I adopted that piecewise solution, but it is still not efficient and I can’t get my desired plot.

Here is the general picture of my panel data: the dependent variables (shown in the first row below) are expressed in natural log terms, and the independent variables are average temperature, total precipitation and 11 temperature bins, each with a bin width (i.e. bin window) of 3 degrees Celsius (<-6, -6~-3, -3~0, …, >21).

reproducible example:

Here is reproducible data simulated to resemble my actual temperature time-series panel data:

dat= data.frame(index = rep(c('dex111', 'dex112', 'dex113','dex114','dex115'), each = 30), year =1980:2009,
                region= rep(c('Berlin','Stuttgart','Böblingen','Wartburgkreis','Eisenach'), each=30),
                ln_gdp_percapita=rep(sample.int(40, 30), 5),ln_gva_agr_perworker=rep(sample.int(45, 30), 5),
                temperature=rep(sample.int(50, 30), 5), precipitation=rep(sample.int(60, 30), 5),
                bin_1=rep(sample.int(32, 30), 5), bin_2=rep(sample.int(34, 30), 5), bin_3=rep(sample.int(36, 30), 5),
                bin_4=rep(sample.int(38, 30), 5), bin_5=rep(sample.int(40, 30), 5), bin_6=rep(sample.int(42, 30), 5),
                bin_7=rep(sample.int(44, 30), 5),bin_8=rep(sample.int(46, 30), 5), bin_9=rep(sample.int(48, 30), 5),
                bin_10=rep(sample.int(50, 30), 5), bin_11=rep(sample.int(52, 30), 5))

update:

note that each bin covers an equally sized temperature interval, except for the two extreme bins, and each bin gives the number of days falling in the respective temperature interval.

Basically, I want to fit spline regression on my data (let’s say, choose one dependent variable such as ln_gdp_percapita, and multiple independent variables such as bin_1,bin_2,…, bin_11).

my attempt to run spline regression:

Here is what I did so far to run spline regression on temperature time series data:

fit_sp <- smooth.spline(dat$bin_6, dat$ln_gdp_percapita, nknots = 4)
pred <- predict(fit_sp, dat$bin_6)$y
summary(pred)

but this is not what I want to do, so I also tried the following:

library(dplyr)
library(ggplot2)

sp2 <- lm(ln_gva_agr_perworker ~ splines::ns(bin_6, df = 2, knots = c(3)), data = dat)
df_ <- mutate(dat, smooth = fitted(sp2))
stats::predict(sp2)
ggplot(df_, aes(bin_6, ln_gva_agr_perworker)) + geom_line(aes(bin_6, smooth))

but it didn’t render the plot I want. My understanding of piecewise regression is that you find the breakpoint and then run separate regressions on the observations to its left and to its right, which gives a gradual, piecewise-linear trend. I did not get such a result, so perhaps I should focus on finding a more efficient workflow.

In general, I want to see how the agriculture and industry sectors respond to daily temperature, and a simple regression on the temperature-bin panel data is not sufficient to reach a conclusion. I believe a restricted spline regression, a smoothing spline or a baseline regression might produce better estimates.

desired regression output metrics (inspired by a related paper’s appendix):

Here is the general picture of my desired regression output metric:

bin     range           b_hi    b       b_low   st error    t_stat  p-value
bin1    (-23.87:-18.44) -0.0129 -0.0306 -0.0483 0.0090257   -3.39   0.001
bin2    (-18.44:-13.02) -0.0050 -0.0096 -0.0141 0.0023336   -4.1    0
bin3    (-13.02:-7.59)  -0.0040 -0.0057 -0.0075 0.0008904   -6.44   0
bin4    (-7.59:-2.17)   0.0030  0.0021  0.0011  0.000492    4.23    0
bin5    (-2.17:3.26)    -0.0007 -0.0012 -0.0018 0.0002781   -4.48   0
bin6    (3.26:8.69) 0.0000  0.0000  0.0000          
bin7    (8.69:14.11)    0.0008  0.0001  -0.0005 0.0003502   0.41    0.681
bin8    (14.11:19.54)   0.0010  0.0000  -0.0010 0.0005107   0.06    0.956
bin9    (19.54:24.96)   0.0028  0.0016  0.0005  0.0005737   2.85    0.004
bin10   (24.96:30.39)   -0.0031 -0.0057 -0.0083 0.0013075   -4.36   0

desired scatter plot:

Here is the desired scatter plot I want to achieve (a simulated plot inspired by a related paper’s figure):

[figure omitted: scatter plot of estimated bin coefficients with 95% confidence bands]

In this plot, the black point-and-line series shows the estimated regression coefficients (from either the baseline or the restricted spline regression), and the dotted blue lines are the 95% confidence interval based on clustered standard errors.

How do I accomplish my desired output with minimal code and efficiently? Could someone point me in the right direction? Any more thoughts? Thanks in advance!
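
For what it is worth, here is one minimal sketch of how the per-bin coefficients and the plot could be produced with plain pooled OLS (my simplifying assumptions: no region or year fixed effects, no clustered standard errors, and bin_6 left out as the reference category as in the table above; for the panel structure one would probably switch to plm and cluster-robust standard errors):

library(ggplot2)

bins <- paste0("bin_", c(1:5, 7:11))   # bin_6 omitted as the reference category
form <- reformulate(c(bins, "precipitation"), response = "ln_gdp_percapita")
fit  <- lm(form, data = dat)

# coefficient table: point estimate with 95% confidence limits, one row per bin
ci  <- confint(fit)[bins, ]
out <- data.frame(bin = factor(bins, levels = bins),
                  b = coef(fit)[bins], b_low = ci[, 1], b_hi = ci[, 2])
print(out, digits = 3)

# response-function plot: point estimates (black) with confidence bands (dotted blue)
ggplot(out, aes(bin, b, group = 1)) +
  geom_point() +
  geom_line() +
  geom_line(aes(y = b_low), linetype = "dotted", colour = "blue") +
  geom_line(aes(y = b_hi),  linetype = "dotted", colour = "blue") +
  labs(x = "temperature bin", y = "estimated effect on ln_gdp_percapita")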


Get this bounty!!!

#StackBounty: #r #logistic #logit #effect-size #odds-ratio Comparison of two odds ratios: Take 2

Bounty: 50

I would like to test the difference of two odds ratios given the following R-output.

To do so, I would need to take the difference of the log odds and obtain the standard error (outlined here: Statistical test for difference between two odds ratios?).

One of the predictor variables is continuous and I am not sure how I could compute the values required for SE(logOR).

Could someone please explain whether the output I have is conducive to this method?
Results

f1 = with(data = imp, glm(Y ~ X1 + X2, family = binomial(link = "logit")))

s01=summary(pool(f1))

s01

                     est        se         t       df   Pr(>|t|) 

   (Intercept) -1.7805826 0.1857663 -9.585070 391.0135 0.00000000 
   X1           0.2662796 0.1308970  2.034268 390.4602 0.04259997  
   X2           0.6757952 0.3869652  1.746398 395.6098 0.08151794 

cbind(exp(s01[, c("est", "lo 95", "hi 95")]), pval = s01[, "Pr(>|t|)"])

                             est     lo 95     hi 95       pval
              (Intercept) 0.1685399 0.1169734 0.2428389 0.00000000
              X1          1.3051000 1.0089684 1.6881459 0.04259997
              X2          1.9655955 0.9185398 4.2062035 0.08151794
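
For reference, a minimal sketch of the test from the linked thread, plugged with the pooled estimates above (a rough approximation: it treats the two estimates as independent and ignores their covariance, and the odds ratio for the continuous predictor X1 is per one-unit increase):

# log odds ratios and standard errors copied from the pooled summary above
b1 <- 0.2662796; se1 <- 0.1308970   # X1 (continuous predictor)
b2 <- 0.6757952; se2 <- 0.3869652   # X2

diff_log_or <- b1 - b2
se_diff     <- sqrt(se1^2 + se2^2)  # assumes the two estimates are independent

z <- diff_log_or / se_diff
p <- 2 * pnorm(-abs(z))             # two-sided p-value for H0: OR_X1 = OR_X2

c(diff = diff_log_or, se = se_diff, z = z, p.value = p)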


Get this bounty!!!

#StackBounty: #r #lme4-nlme #random-effects-model The mathematical representation of a nested random effect term

Bounty: 50

Suppose that a dependent variable $y$ is measured at the unit level (level 1), that units are nested within units of type $A$ (level $2$), and that units of type $A$ are nested within units of type $B$ (level $3$).

Suppose that I fit the following formula:

y ~ "FIXED EFFECTS [my syntax]" + (1 + x | B/A)

where $x$ is some predictor at level $1$.

My understanding is that the mathematical representation of such a formula is the following. Is it correct?


In what follows, $y_{b,a,i}$ is the output of the $i$th data point in unit $a$ of $A$ nested in unit $b$ of $B$. This data point has a corresponding predictor $x_{b,a,i}$.

$y_{b,a,i} = \text{fixed effects} + u_{b} + u_{b,a} + (\beta_{b} + \beta_{b,a})x_{b,a,i}$

where

$u_{b} \sim N(0, \sigma_{B})$

$u_{b,a} \sim N(0, \sigma)$

$\beta_{b} \sim N(0, \rho_{B})$

$\beta_{b,a} \sim N(0, \rho)$

That is, $\sigma_{B}$ is the standard deviation of the level-$3$ ($B$) terms. On the other hand, given any unit $b$ at level $3$ and any unit $a$ at level $2$ contained in it, the standard deviation of the corresponding term is $\sigma$; that is, $\sigma$ is the same constant for all level-$2$ units.


Is this correct? (I based this reasoning on a related presentation on page 136 of Linear Mixed Models: A Practical Guide Using Statistical Software.) If it is, is there any way to make $\sigma$ depend on which unit of level $A$ the data point belongs to?
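
For reference, here is a small lme4 sketch (using a hypothetical data frame d with columns y, x, A and B) of how the nested term expands; as far as I know, this parameterisation gives one common variance for the B:A-level terms, matching the constant $\sigma$ above, and the standard syntax does not allow that variance to differ across individual units of $A$:

library(lme4)

# the nested specification ...
m1 <- lmer(y ~ x + (1 + x | B/A), data = d)

# ... expands to one random-effects term per grouping level, so this fits the same model
m2 <- lmer(y ~ x + (1 + x | B) + (1 + x | B:A), data = d)

VarCorr(m1)   # one variance/correlation block for B, one for the B:A combinations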


Get this bounty!!!

#StackBounty: #r #dplyr dplyr 0.7.5 change in select() behavior

Bounty: 50

select() in dplyr 0.7.5 returns a different result from dplyr 0.7.4 when using a named vector to specify columns.

library(dplyr)                               
df <- data.frame(a = 1:5, b = 6:10, c = 11:15)
print(df)                                     
#>   a  b  c
#> 1 1  6 11
#> 2 2  7 12
#> 3 3  8 13
#> 4 4  9 14
#> 5 5 10 15

# a named vector
cols <- c(x = 'a', y = 'b', z = 'c')          
print(cols)                                   
#>  x   y   z 
#> "a" "b" "c"

# with dplyr 0.7.4
# returns column names with vector values
select(df, cols)                              
#>   a  b  c
#> 1 1  6 11
#> 2 2  7 12
#> 3 3  8 13
#> 4 4  9 14
#> 5 5 10 15

# with dplyr 0.7.5
# returns column names with vector names
select(df, cols)                              
#>   x  y  z
#> 1 1  6 11
#> 2 2  7 12
#> 3 3  8 13
#> 4 4  9 14
#> 5 5 10 15

Is this a bug or a feature?


Get this bounty!!!

#StackBounty: #r #dplyr #weighted #summarize #rolling Rolling weighted mean across two factor levels or time points

Bounty: 50

I would like to create a rolling 2-quarter average for alpha, bravo and charlie (and lots of other variables). My research keeps taking me to the zoo and lubridate packages, but I always seem to end up with rolling means within a single variable or grouping.

set.seed(123)

dates <-  c("Q4'15", "Q1'16", "Q2'16","Q3'16", "Q4'16", "Q1'17", "Q2'17" ,"Q3'17", "Q4'17","Q1'18")

df <- data.frame(dates = sample(dates, 100,  replace = TRUE, prob=rep(c(.03,.07,.03,.08, .05),2)), 
                           alpha = rnorm(100, 5), bravo = rnorm(100, 10), charlie = rnorm(100, 15))

I’m looking for something like

x <- df %>% mutate_if(is.numeric, funs(rollmean(., 2, align='right', fill=NA)))

Desired result: a weighted average across “Q4’15” & “Q1’16”, “Q1’16” & “Q2’16”, etc for each column of data (alpha, bravo, charlie). Not looking for the average of the paired quarterly averages.

Here is what the averages would be for the Q4'15 & Q1'16 time point:

df %>% filter(dates %in% c("Q4'15", "Q1'16")) %>%  select(-dates) %>% summarise_all(mean)


Get this bounty!!!

#StackBounty: #r #panel-data #r-squared #plm Reference or a mathematical equation for R squared and adjusted R squared for panel models…

Bounty: 50

I’m using the plm package in R and I’m trying to find out exactly how R squared and adjusted R squared are calculated for panel models in plm.

I looked at this plm man page, on which I can view the source code, but it did not have any references or further links, and the source code is a bit too complex for me to deconstruct. Normally the man pages are very good with references or links, but I can’t seem to find my way here. Any help would be appreciated; I realize this might be incredibly simple.
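
For reference, the generic definitions are below; what I am unsure about is exactly which (possibly transformed) data, residuals and degrees of freedom plm plugs into them for each model type (e.g. the within-transformed series for the fixed effects model):

$R^2 = 1 - \dfrac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}, \qquad R^2_{adj} = 1 - (1 - R^2)\,\dfrac{n - 1}{n - k}$

where $n$ is the number of observations and $k$ the number of estimated parameters (degrees-of-freedom conventions differ between implementations).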


Get this bounty!!!

#StackBounty: #r #pandas #data-cleaning Tidy data in panda dataframe

Bounty: 100

There is a quite famous article by H. Wickham, Tidy data, where he defines a certain type of cleaned data, calls it (dataframe-)tidy, and illustrates it on several examples using R. At the end, he compares his own definition of dataframe-tidy to other possible ways of achieving tidiness and mentions array-tidy in the following paragraph (but does not give any further explanations or examples):

Fortunately, because there are many efficient tools for working with high-dimensional arrays, even sparse ones, such an array-tidy
format is not only likely to be quite compact and efficient, it should
also be able to easily connect with the mathematical basis of
statistics. This, in fact, is the approach taken by the pandas Python
data analysis library (McKinney 2010).

What does array-tidy data mean and why does he imply that this is somehow the default for the pandas library in Python?


Get this bounty!!!

#StackBounty: #r #time-series #predictive-models #intermittent-time-series #crostons-method With non-contractual customers, what method…

Bounty: 100

I have customer data for 18 months in the format as follows:

CustomerID   Date    Total_Spend Spend_Cat_1 Spend_Cat_2 Spend_Cat_3 Spend_Cat_4
    1       1/1/2013     373         202           0          17         154
    1       1/18/2013    103          92          11           0           0
    1       8/2/2013     476         330         146           0           0
    1       12/12/2013   332         216         116           0           0
    2       1/1/2013     399         204         195           0           0
    2       1/13/2013    485         315           0           0         170
    3       1/1/2013     326         238           0          22          66
    4       1/1/2013     354         184         170           0           0

I have done my research towards finding a predictive model that could forecast each customer’s total spend as well as categorized spend (Spend_Cat_1, 2, 3 & 4) in each of the next 12 months or 4 quarters. The closest I could come was Croston’s method (I am using R for the programming part), but I am not sure whether my data qualifies as “intermittent demand”. Even if I have found the correct model, I am not sure how to apply it, as there is little help available on this topic on the internet.

If it is needed: I have around 7,000 rows for 2,000 different customers.
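
As a starting point, here is a minimal sketch of applying Croston’s method to a single customer (my assumptions: the data frame is called df with the columns shown above, dates are in month/day/year format, and spend is aggregated to monthly totals with explicit zeros for months without a purchase):

library(dplyr)
library(forecast)

# monthly total spend for one customer
cust <- df %>%
  filter(CustomerID == 1) %>%
  mutate(month = format(as.Date(Date, "%m/%d/%Y"), "%Y-%m")) %>%
  group_by(month) %>%
  summarise(spend = sum(Total_Spend))

# 18 observed months, zero spend where the customer was inactive
all_months <- format(seq(as.Date("2013-01-01"), by = "month", length.out = 18), "%Y-%m")
y <- ts(ifelse(all_months %in% cust$month,
               cust$spend[match(all_months, cust$month)], 0),
        frequency = 12, start = c(2013, 1))

fc <- croston(y, h = 12)   # Croston's method: 12-month-ahead forecast of intermittent demand
fc$mean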


Get this bounty!!!

#StackBounty: #r #bootstrap #simulation #garch #finance UPDATED: How to pass standardized bootstrapped residuals in a forward simulatio…

Bounty: 100

I am trying to use the Filtered Historical Simulation (FHS) approach to VaR and ES using the amazing rugarch package (with the S&P 500 data included in the package), following the FHS steps recommended by this paper:

Brandolini, D., & Colucci, S. (2013). Backtesting value-at-risk: A
comparison between filtered bootstrap and historical simulation. The
Journal of Risk Model Validation, 6(4), 3-16

The steps, according to the paper, are:

  1. Fitting the best (ARMA-GARCH) model
  2. Standardization of residuals
  3. Bootstrapping of standardised residuals
  4. Passing of bootstrapped residuals in a forward simulation using the estimated ARMA-GARCH model
  5. Estimation of returns
  6. VaR and ES estimations

I am currently stuck on the step 3. Here’s my current work in progress:

library(zoo)
library(forecast)
library(rugarch)



data(sp500ret)   # load the S&P 500 return data shipped with rugarch

sp500 <- zoo(sp500ret$SP500RET * 100)

plot(sp500)

# Get the best ARIMA model for the mean modelling of the GARCH model
forecast::auto.arima(sp500, trace = TRUE,
                     test = "kpss", ic = c("bic"))


############################# SPECIFY GARCH MODEL #########################################
model_spec <- rugarch::ugarchspec(variance.model = list(model = "sGARCH",
                                                        garchOrder = c(1,1)),
                                  mean.model = list(armaOrder = c(0,2)), 
                                  distribution.model = "std")

############################# FIT MODEL ##################################################
model_fit <- rugarch::ugarchfit(spec =  model_spec, data = sp500)


############################## BACKTESTING OF MODEL##################################
model_roll <- rugarch::ugarchroll(spec = model_spec, data = sp500,
                                  n.ahead = 1,
                                  n.start = 150, refit.every = 50, 
                                  refit.window = "recursive"
                                  )
report(model_roll, type = "VaR", VaR.alpha = 0.01, conf.level = 0.95)

############################ STANDARDISE RESIDUALS ####################################
standardized_resid <- model_fit@fit$residuals / model_fit@fit$sigma # correct?


########################## Bootstrapping of standardised residuals###################
set.seed(123)


myz <- matrix(sample(standardized_resid, size = 1000000, replace = TRUE), ncol = 10)


############Passing of bootstrapped residuals in a forward simulation################ 
#using the estimated ARMA-GARCH model and the assumed Student t distribution



  sim1<- ugarchsim(model_fit, n.sim = 100000, m.sim = 10, 
                   startMethod ="sample", 
                   custom.dist = list(name = "sample", distfit = myz,type = "myz"),
                   rseed = 10)

##################################Estimation of Hypothetical returns ##################
#No clue  


################################## VaR and ES estimates ################################
# on say 1,000 simulations for 5 trading days

Can someone help with steps 3 to 5, where I need to pass the bootstrapped standardized residuals into a forward simulation and then compute the VaR and ES estimates?

Thanks in advance.
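
One possible way to carry out steps 4 to 6, as a sketch only (my own assumptions: a 5-day horizon, 1,000 simulated paths and a 1% level, reusing the custom.dist mechanism already used above; not necessarily the paper’s exact procedure):

# steps 3-4: bootstrap one standardized residual per simulated day and path,
# then feed the matrix to ugarchsim via custom.dist
n_days  <- 5
n_paths <- 1000

z_boot <- matrix(sample(standardized_resid, size = n_days * n_paths, replace = TRUE),
                 nrow = n_days, ncol = n_paths)

sim_fhs <- ugarchsim(model_fit, n.sim = n_days, m.sim = n_paths,
                     startMethod = "sample",
                     custom.dist = list(name = "sample", distfit = z_boot),
                     rseed = 10)

# step 5: simulated daily returns (an n_days x n_paths matrix), cumulated per path
ret_sim <- sim_fhs@simulation$seriesSim
cum_ret <- colSums(ret_sim)        # simple sum of (percentage) returns as an approximation

# step 6: empirical VaR and ES over the 5-day horizon
alpha  <- 0.01
VaR_5d <- quantile(cum_ret, probs = alpha)
ES_5d  <- mean(cum_ret[cum_ret <= VaR_5d])
c(VaR = VaR_5d, ES = ES_5d)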


Get this bounty!!!