#StackBounty: #time-series #modeling #econometrics #stationarity #seasonality Decomposing U.S. Imports of Goods by Customs Basis from C…

Bounty: 50

I’m working on a project that aims to analyze the dataset U.S. Imports of Goods by Customs Basis from China (IMPCH). The main goal is forecasting, but we also want to draw some inference from the data.

As far as I know, the first step is to decompose the time series. There are at least five decomposition methods, and the remainder (residual) from every one of them passes the unit-root tests, indicating that it is stationary, which is what we want. However, I now have to pick one of them to proceed with, and they all seem good.

Which one should I choose: decompose, HoltWinters, forecast::ets, stl, or seasonal::seas?

# dataset
# https://fred.stlouisfed.org/series/IMPCH

IMPCH <- read.csv("IMPCH.csv")
imp <- IMPCH$IMPCH
imp.ts <- ts(imp, frequency = 12, start = 1985)
log.imp.ts <- log(imp.ts)
ts.plot(log.imp.ts)

[Plot: the log-transformed IMPCH series]

## Decomposition
library(tseries)

# Stock decompose
log.imp.ts.dcps <- decompose(log.imp.ts)
plot(log.imp.ts.dcps)
log.imp.ts.dcps.remainder <- na.remove(log.imp.ts.dcps$random)
adf.test(log.imp.ts.dcps.remainder)
pp.test(log.imp.ts.dcps.remainder)
kpss.test(log.imp.ts.dcps.remainder)

[Plot: decompose() components of the log series]

# Holt-Winters Filtering
log.imp.ts.hw <- HoltWinters(log.imp.ts)
plot(log.imp.ts.hw)
log.imp.ts.hw.remainder <- resid(log.imp.ts.hw)
adf.test(log.imp.ts.hw.remainder)
pp.test(log.imp.ts.hw.remainder)
kpss.test(log.imp.ts.hw.remainder)

[Plot: Holt-Winters filtering of the log series]

# ETS
library(forecast)
log.imp.ts.ets <- ets(log.imp.ts)
plot(log.imp.ts.ets)
log.imp.ts.ets.remainder <- resid(log.imp.ts.ets)
adf.test(log.imp.ts.ets.remainder)
pp.test(log.imp.ts.ets.remainder)
kpss.test(log.imp.ts.ets.remainder)

[Plot: ETS model components]

# STL
log.imp.ts.stl <- stl(log.imp.ts, s.window = "periodic", robust = TRUE)
plot(log.imp.ts.stl)
log.imp.ts.stl.remainder <- log.imp.ts.stl$time.series[, "remainder"]
adf.test(log.imp.ts.stl.remainder)
pp.test(log.imp.ts.stl.remainder)
kpss.test(log.imp.ts.stl.remainder)

[Plot: STL decomposition of the log series]

# X-13ARIMA-SEATS
library(seasonal)
log.imp.ts.seas <- seas(log.imp.ts)
plot(log.imp.ts.seas)
log.imp.ts.seas.remainder <- resid(log.imp.ts.seas)
adf.test(log.imp.ts.seas.remainder)
pp.test(log.imp.ts.seas.remainder)
kpss.test(log.imp.ts.seas.remainder)

[Plot: X-13ARIMA-SEATS output]
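
Since every remainder passes the unit-root tests, stationarity alone cannot separate the five methods. A minimal sketch of one further comparison, using the remainder objects created above: check how much autocorrelation and variance each method leaves behind (the lag of 24 is just an illustrative choice).

# Compare the remainders left by the five decompositions
remainders <- list(
  decompose   = log.imp.ts.dcps.remainder,
  HoltWinters = log.imp.ts.hw.remainder,
  ets         = log.imp.ts.ets.remainder,
  stl         = log.imp.ts.stl.remainder,
  seas        = log.imp.ts.seas.remainder
)

# A smaller residual variance and a non-significant Ljung-Box test
# (i.e. little leftover autocorrelation) point to a cleaner decomposition
sapply(remainders, function(r) c(
  variance    = var(r),
  ljung_box_p = Box.test(r, lag = 24, type = "Ljung-Box")$p.value
))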


Get this bounty!!!

#StackBounty: #r #regression #time-series #arima #autocorrelation How to specify an autoregressive correlation structure in a linear mo…

Bounty: 50

I want to fit a multiple regression model that accounts for temporally-correlated errors (i.e., some sort of autoregressive correlation structure, like those provided by ARMA or ARIMA models). However, the problem I have is that my data set isn’t a simple continuous time series—i.e., one that has only one data point at every single time step. In my data set, each time step may contain one, several or no data points.

Because of this, I cannot use standard autoregressive correlation structures in R such as corAR1() or corARMA() (which can be used in nlme::gls() using form = ~ time), or the R function auto.arima() with xreg regression.
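
For concreteness, here is a minimal sketch of the standard specification and of why it breaks down here. The data are simulated and the column names (bin, env, diversity) are made up for illustration; env stands in for an environmental covariate.

library(nlme)

# Toy data: several (or no) observations per time bin, so the time index repeats
bin <- rep(1:10, times = c(2, 1, 0, 3, 2, 1, 2, 0, 1, 2))
set.seed(1)
dat <- data.frame(
  bin       = bin,
  env       = rnorm(length(bin)),        # hypothetical environmental covariate
  diversity = rnorm(length(bin))         # hypothetical diversity estimate
)

# The usual continuous-series specification: AR(1) errors indexed by time.
# With repeated time values within a group this is expected to fail, which is
# exactly the problem described above.
try(
  gls(diversity ~ env, data = dat,
      correlation = corAR1(form = ~ bin))
)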

It might help to be more specific about my data set. I am analysing estimates of biological diversity (numbers of animal species, standardised for sampling effort, estimated using the fossil record) for roughly equally-sized geographic regions through geological time (i.e., over millions of years). I am using multiple regression to understand the role of various environmental variables on diversity.

Time points/steps actually correspond to so-called “time bins”—lumped intervals of geological time with roughly equal durations (for example, the Late Cretaceous is one time bin). In some time bins, I have diversity estimates for several different geographic regions. In others, only one estimate is available, or even no estimates, if there are no known fossils. The identities/locations of geographic regions are not constant through time: they vary because the geographic distribution of fossil sites varies through time.

I think it would be reasonable to treat time steps with multiple data points as representing multiple measurements from the same subject (planet Earth).

Is there any way to specify an autoregressive correlation structure for this kind of more complex data set? I’m happy to use whatever modelling approach is necessary.

EDIT: The answer posted by user “HCQ” suggested that I should fit a multilevel model with a random effect for geographic regions. This suggestion doesn’t solve the problem, because the geographic identities of the data points are not constant through time (see the revised question above).


Get this bounty!!!

#StackBounty: #time-series #treatment-effect #relative-risk Assess temporary effect of treatment

Bounty: 50

Imagine that I have a treatment that reduces the likelihood of response to a stimulus. This could be anything you like, but the simplest example is a treatment (e.g., hand washing, mask wearing, etc.) that prevents disease upon exposure.

For simplicity, I built an (overly complicated) model of the risk of responding to the stimulus over time under the two treatments (all of this is done in R using the tidyverse):

library(tidyverse)  # tibble, tidyr, dplyr and ggplot2 are used below

day_prob <-
  list(
    A = c(0.06, 0.06, pmax(dchisq(1:13, 3)*1, 0.06))
    , B = c(0.06, 0.06, pmax(dchisq(seq(1, 25, 2), 5)*6, 0.06))
  )

This can be plotted to show the risk:

day_prob %>%
  lapply(function(x){
    tibble(
      Day = 1:length(x)
      , Prob = x
    )
  }) %>%
  bind_rows(.id = "Group") %>%
  ggplot(aes(x = Day
             , y = Prob
             , col = Group)) +
  geom_line() +
  geom_point() +
  scale_color_brewer(palette = "Dark2")

Plot:

[Plot: risk of response by day for groups A and B]

Note that there is a similar baseline risk for both groups (0.06), some latency after exposure (risk rises on day 3), and that the risk falls back to baseline over time.

Now, assuming that I randomize individuals into the two treatments, what is my best approach to identifying this effect? I could just analyze each day separately (a sketch of that per-day approach follows the sample data below), though that raises some repeated-measures questions, since (unlike in the sample data below) a positive individual is probably more (or even less) likely to test positive on subsequent days.

I’ve searched through a number of alternative questions, but nothing seems to quite capture this issue. I could try repeated-measures analyses, but the latency and return-to-baseline should both swamp my effect if they are sufficiently long. Further, it would be ideal to actually know when the effect is observed.

A sample data set:

make_obs <- function(group, day){
  sapply(1:length(group), function(idx){
    rbinom(1, 1, day_prob[[group[idx]]][day[idx]] )
  })
}

set.seed(12345)
example_data <-
  tibble(
    Group = rep(c("A", "B"), each = 50)
    , Ind = 1:100
    , Day = 1
  ) %>%
  complete(nesting(Group, Ind), Day = 1:15) %>%
  mutate(
    Obs = make_obs(Group, Day)
  )

Plotting to show the observed data:

example_data %>%
  group_by(Group, Day) %>%
  summarise(Prop_pos = mean(Obs)) %>%
  ungroup() %>%
  ggplot(aes(x = Day
             , y = Prop_pos
             , col = Group)) +
  geom_line() +
  scale_color_brewer(palette = "Dark2")

[Plot: observed proportion positive by day and group]
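
As a minimal sketch of the naive day-by-day approach mentioned above (using example_data from the code above, and ignoring the repeated-measures issue for the moment): fit a separate logistic regression of Obs on Group for each day and look at when the group effect shows up.

per_day <- example_data %>%
  group_by(Day) %>%
  group_modify(~ {
    fit <- glm(Obs ~ Group, family = binomial, data = .x)
    # p-value for the difference between groups on this day
    tibble(p_group = coef(summary(fit))["GroupB", "Pr(>|z|)"])
  }) %>%
  ungroup()

per_day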


Get this bounty!!!

#StackBounty: #r #time-series #hypothesis-testing #model-selection Which model for my data?

Bounty: 50

I have this data:

Group Time  Size
A 1 0.56
A 2 0.97
A 3 1.33
A 4 1.75
B 1 0.12
B 2 0.24
B 3 0.31
B 4 0.47
B 5 0.51
B 6 0.69
B 7 0.73
B 8 0.85
C 1 0.16
C 2 0.23
C 3 0.38
C 4 0.49
C 5 0.53
C 6 0.66
C 7 0.78
C 8 0.81

Here is the respective plot:

[Plot: Size vs. Time for groups A, B and C]

Now I would like to test the three groups for differences in slope and intercept. I cannot use simple linear regression since these are time series and the data points are not independent of each other.

Here are the additional tests I performed on the linear model:

# Input holds the table shown above
Data  = read.table(textConnection(Input), header = TRUE)

model = lm(Size ~ Time + Group, data = Data)

Shapiro-Wilk test for normality:

shapiro.test(residuals(model))

p=0.001288 (not normally distributed)

Breusch-Pagan test for equal variances:

bptest(model)

p=0.016 (variances not equal)

Since the residuals are not normally distributed and the variances are not equal, an ANOVA (for example) cannot be performed. Furthermore, the residuals are autocorrelated according to the Durbin-Watson test:

dwtest(model)

p=0.001065 (residuals autocorrelated)

Which model would be suitable for my problem (probably a multilevel linear model?), and which R packages could I use for the analysis?
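
For illustration only, here is a minimal sketch of one candidate: a generalized least squares fit from nlme with group-specific intercepts and slopes and an AR(1) error structure within each group. Whether AR(1) errors are an adequate description of the dependence here is an assumption, not something the question establishes.

library(nlme)

# Data exactly as in the table above
Data <- data.frame(
  Group = rep(c("A", "B", "C"), times = c(4, 8, 8)),
  Time  = c(1:4, 1:8, 1:8),
  Size  = c(0.56, 0.97, 1.33, 1.75,
            0.12, 0.24, 0.31, 0.47, 0.51, 0.69, 0.73, 0.85,
            0.16, 0.23, 0.38, 0.49, 0.53, 0.66, 0.78, 0.81)
)

# Time * Group lets both intercepts and slopes differ between groups;
# corAR1(form = ~ Time | Group) models first-order autocorrelation
# within each group's series
fit <- gls(Size ~ Time * Group, data = Data,
           correlation = corAR1(form = ~ Time | Group))
summary(fit)
anova(fit)   # sequential tests, including the Time:Group (slope) terms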


Get this bounty!!!

#StackBounty: #python-3.x #pandas #deep-learning #time-series #lstm Time Series Forecasting using LSTM

Bounty: 50

I am new to deep learning, and I am trying to predict the Amount for the next 6 months using an LSTM. I have looked at a few blogs online, such as this one. However, I don’t know how to predict Amount for each of my groups, i.e. A, B, C and D, for 6 months.

Any blog post or helpful code with explanation is appreciated!

import pandas as pd
data = {'Date':['2017-01-01', '2017-01-01','2017-01-01','2017-01-01','2017-01-01','2017-01-01','2017-01-01','2017-01-01',
               '2017-02-01', '2017-02-01','2017-02-01','2017-02-01','2017-02-01','2017-02-01','2017-02-01','2017-02-01'],'Group':['A','A','B','B','C','C','D','D','A','A','B','B','C','C','D','D'],
       'Amount':['12.1','13.2','15.1','10.7','12.9','9.0','5.6','6.7','4.3','2.3','4.0','5.6','7.8','2.3','5.6','8.9']}
df = pd.DataFrame(data)
print (df)


Get this bounty!!!

#StackBounty: #python-3.x #time-series #pandas-groupby #facebook-prophet How to calculate SMAPE for groups in time-series data in python?

Bounty: 50

My data looks like the following, and I am using Facebook’s fbprophet for prediction. Next, I would like to calculate SMAPE for each group in my dataframe. I found the function described by a Kaggle user here, but I am not sure how to implement it in my current code so that SMAPE is calculated for each group. Additionally, I know that fbprophet has a validation function, but I would like to calculate SMAPE for each group myself.

Note: I am new to Python, so please provide an explanation along with the code.

Dataset

import pandas as pd
data = {'Date':['2017-01-01', '2017-01-01','2017-01-01','2017-01-01','2017-01-01','2017-01-01','2017-01-01','2017-01-01',
               '2017-02-01', '2017-02-01','2017-02-01','2017-02-01','2017-02-01','2017-02-01','2017-02-01','2017-02-01'],'Group':['A','A','B','B','C','C','D','D','A','A','B','B','C','C','D','D'],
       'Amount':['12.1','13.2','15.1','10.7','12.9','9.0','5.6','6.7','4.3','2.3','4.0','5.6','7.8','2.3','5.6','8.9']}
df = pd.DataFrame(data)
print (df)

Code so far…

from fbprophet import Prophet  # in newer releases the package is named "prophet"

def get_prediction(df):
    prediction = {}
    df = df.rename(columns={'Date': 'ds','Amount': 'y', 'Group': 'group'})
    df=df.groupby(['ds','group'])['y'].sum()
    df=pd.DataFrame(df).reset_index()
    list_articles = df.group.unique()

    for group in list_articles:
        article_df = df.loc[df['group'] == group]
        # Prophet's default uncertainty interval is 80%; pass interval_width=0.95 for 95%
        my_model = Prophet(weekly_seasonality= True, daily_seasonality=True,seasonality_prior_scale=1.0)
        my_model.fit(article_df)
        future_dates = my_model.make_future_dataframe(periods=6, freq='MS')
        forecast = my_model.predict(future_dates)
        prediction[group] = forecast
        my_model.plot(forecast)
    return prediction


Get this bounty!!!

#StackBounty: #time-series #trend #discrete-data Checking assumptions for Mann-Kendall trend test

Bounty: 50

I have data on counts of images submitted with requests over time. I would like to test the null hypothesis of no trend against the alternative hypothesis that there is a downward trend.

My data is a sequence of numbers [2,3,5,6,2,4,5…], where each number represents the number of images submitted with a request. The order is chronological.

There are two things I am concerned about.

First, how can I check the assumption of no serial dependence? The data are discrete, not continuous – I am not too sure how to test for serial dependence in this case. Is a simple ACF/PACF plot a good test?
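
A minimal sketch of that first check in R, with a simulated stand-in for the real counts (x is a hypothetical name): plot the ACF/PACF and run a Ljung-Box test, which are reasonable first looks even for discrete data.

# x stands in for the ~200 chronologically ordered counts (values 1 to 8)
set.seed(1)
x <- sample(1:8, 200, replace = TRUE)

acf(x)    # sample autocorrelation function
pacf(x)   # partial autocorrelation function

# Portmanteau test for serial dependence over the first 10 lags
Box.test(x, lag = 10, type = "Ljung-Box")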

Second, I have around 200 observations, but the range of possible values is small (X runs from 1 to 8), so I have a lot of tied groups. Is this a problem for the Mann-Kendall trend test?


Get this bounty!!!

#StackBounty: #r #time-series #forecasting #multivariate-analysis #var VAR – Convert single differenced forecast into actual value

Bounty: 50

    # packages used below
    library(lubridate)   # ymd(), decimal_date()
    library(forecast)    # ndiffs()
    library(tseries)     # adf.test()
    library(vars)        # VARselect(), VAR(), serial.test(), arch.test()
    library(lmtest)      # grangertest()

    #read data
    offered_contact <- read.csv(file="si_act_claim_off_act.csv", header=TRUE)
    offered_contact_ts <- ts(offered_contact, freq=365.25/7,  start=decimal_date(ymd("2016-1-4")))

    sold_item <- offered_contact_ts[, 2:2]
    offered_contacts <- offered_contact_ts[, 4:4]

    plot.ts(offered_contact_ts[,2:4])
    summary(offered_contact_ts[,2:4])

    #Run Augmented Dickey-Fuller & kpss tests to determine stationarity and differences to achieve stationarity.
    ndiffs(sold_item, alpha = 0.05, test = c("kpss"))
    adf.test(sold_item,k=12)
    ndiffs(offered_contacts, alpha = 0.05, test = c("kpss"))
    adf.test(offered_contacts,k=12)

    #Difference to achieve stationarity
    sold_item_diff = diff(sold_item, differences = 1)
    offered_contact_diff = diff(offered_contacts, differences = 1)



    #combine the data
    offered_contact_ts_diff = cbind(sold_item_diff, offered_contact_diff)
    plot.ts(offered_contact_ts_diff)

    #Lag optimisation
    VARselect(offered_contact_ts_diff, lag.max = 12, type = "both")

    #Vector autoregression with lags set according to results of lag optimisation 
    var1 <- VAR(offered_contact_ts_diff,p=1)
    serial.test(var1,lags.pt =10,type="PT.asymptotic")

    #Vector autoregression with lags set according to results of lag optimisation
    var2 <- VAR(offered_contact_ts_diff,p=2)
    serial.test(var2,lags.pt =10,type="PT.asymptotic")

    #Vector autoregression with lags set according to results of lag optimisation
    var3 <- VAR(offered_contact_ts_diff,p=3)
    serial.test(var3,lags.pt =10,type="PT.asymptotic")

    #Vector autoregression with lags set according to results of lag optimisation
    var4 <- VAR(offered_contact_ts_diff,p=4)
    serial.test(var4,lags.pt =10,type="PT.asymptotic")

    #Vector autoregression with lags set according to results of lag optimisation
    var5 <- VAR(offered_contact_ts_diff,p=5)
    serial.test(var5,lags.pt =10,type="PT.asymptotic")

    #Vector autoregression with lags set according to results of lag optimisation
    var6 <- VAR(offered_contact_ts_diff,p=6)
    serial.test(var6,lags.pt =10,type="PT.asymptotic")

    #Vector autoregression with lags set according to results of lag optimisation
    var7 <- VAR(offered_contact_ts_diff,p=7)
    serial.test(var7,lags.pt =10,type="PT.asymptotic")


    #at p=7, I get p-value as 0.1253
    #Portmanteau Test (asymptotic)

    #data:  Residuals of VAR object var7
    #Chi-squared = 17.695, df = 12, p-value = 0.1253

    #ARCH test (Autoregressive conditional heteroscedasticity)
    arch.test(var7, lags.multi =10)
    summary(var7)

    grangertest(sold_item_diff ~ offered_contact_diff, order = 1)
    grangertest(offered_contact_diff ~ sold_item_diff , order = 1)

    prd <- predict(var7, n.ahead = 35, ci = 0.95, dumvar = NULL)
    print(prd)


  • I want to convert this forecast, which I obtained from the single-differenced data, back to the original scale. How do I do it (see the sketch after this list)? My attempt so far:
forecast_si <- apply(rbind(sold_item_actual_prev,forecast_diff_sold_item),2,cumsum)
  • How do I include external factors like holidays or disruptive events in VAR?

  • How do I add back the seasonality component?
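
On the first two bullets, here is a minimal sketch of what this could look like. It assumes the objects created above (prd, sold_item, offered_contact_ts_diff) and that the forecast components in prd are named after the columns of offered_contact_ts_diff; the holiday dummy in the second part is purely hypothetical.

# 1. Undifference: the level forecast is the last observed level plus the
#    cumulative sum of the forecast differences
fc_diff  <- prd$fcst$sold_item_diff[, "fcst"]      # forecast of the differences
fc_level <- tail(sold_item, 1) + cumsum(fc_diff)   # forecast on the original scale

# 2. Exogenous regressors (e.g. holiday dummies): pass them to VAR() via
#    'exogen' and their future values to predict() via 'dumvar'
# holiday        <- matrix(..., ncol = 1, dimnames = list(NULL, "holiday"))  # one row per observation
# var7x          <- VAR(offered_contact_ts_diff, p = 7, exogen = holiday)
# future_holiday <- matrix(..., ncol = 1, dimnames = list(NULL, "holiday"))  # one row per forecast step
# prd_x          <- predict(var7x, n.ahead = 35, ci = 0.95, dumvar = future_holiday)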


Get this bounty!!!
