#StackBounty: #time-series #normal-distribution #variance #mean Are power law relations between means and standard deviations inherent …

Bounty: 100

In a paper I recently submitted for publication, I document a power-law relation between the means and standard deviations of several time series. That is, plotting the log of the mean of each of these (stationary) series against the log of its standard deviation gives a straight, positively sloped line (with a non-zero y-axis intercept).
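
As a minimal sketch of the relation described above, the following Python snippet simulates a few normally distributed series whose standard deviations are set to follow sd = c * mean^b, then recovers b as the slope of an ordinary least-squares line in log-log space (log sd regressed on log mean). The constants c and b, the list of means, and the helper name fit_loglog_slope are illustrative assumptions, not values from the paper. The relation is imposed here by construction; the normal family itself treats the mean and the standard deviation as separate parameters.

```python
import math
import random
import statistics


def fit_loglog_slope(means, sds):
    """OLS slope and intercept of log(sd) regressed on log(mean)."""
    xs = [math.log(m) for m in means]
    ys = [math.log(s) for s in sds]
    x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


random.seed(0)
c, b = 0.5, 0.8  # assumed power-law constants (hypothetical)
true_means = [10, 20, 50, 100, 200, 500, 1000]

sample_means, sample_sds = [], []
for mu in true_means:
    sigma = c * mu ** b  # standard deviation tied to the mean by construction
    series = [random.gauss(mu, sigma) for _ in range(2000)]
    sample_means.append(statistics.fmean(series))
    sample_sds.append(statistics.stdev(series))

slope, intercept = fit_loglog_slope(sample_means, sample_sds)
print(f"estimated exponent: {slope:.3f}, estimated log(c): {intercept:.3f}")
```

The fitted slope estimates the exponent b and the intercept estimates log(c), which is why a non-zero intercept appears whenever c differs from 1.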

When researching for this paper I scoured the internet for any possible statistical or mathematical explanation for this behavior, but found none, and could recall nothing from my own training in statistics that would explain this either. I discovered variance functions and Bartlett’s identities along the way, but this still fell far short of explaining the relation I was documenting. The data I am dealing with are all normally distributed.

My paper was rejected, and one of the main grounds for rejection given by the editor was that the power law relation between means and standard deviations I had observed is "inherently true of more or less normally distributed sets of data".

Can someone please explain to me what the editor is talking about? Do power law relations trivially exist between the means and standard deviations of different normally distributed sets of data?

Edit: Some details on the data: each data set is a stationary yearly time series, and the number of observations is the same in each series. My log-log plot of the means against their respective standard deviations follows below. In this graphic, the different shapes and colors of the points correspond to different commodity groups.

enter image description here


Get this bounty!!!

#StackBounty: #time-series #normal-distribution #variance #mean Are log-linear relations between means and standard deviations inherent…

Bounty: 100

In a paper I recently submitted for publication, I document a log-linear relation between the means and standard deviations of several time series. That is, plotting the log of the mean of each of these (stationary) series against the log of its standard deviation gives a straight, positively sloped line (with a non-zero y-axis intercept).

When researching for this paper I scoured the internet for any possible statistical or mathematical explanation for this behavior, but found none, and could recall nothing from my own training in statistics that would explain this either. I discovered variance functions and Bartlett’s identities along the way, but this still fell far short of explaining the relation I was documenting. The data I am dealing with are all normally distributed.

My paper was rejected, and one of the main grounds for rejection given by the editor was that the log-linear relation between means and standard deviations I had observed is "inherently true of more or less normally distributed sets of data".

Can someone please explain to me what the editor is talking about? Does a positive log-linear relation trivially exist between the means and standard deviations of different normally distributed sets of data?



#StackBounty: #python #time-series #regression #predictive-modeling Multivariate time series forecast with VAR confusion

Bounty: 50

I am new to time-series forecasting. I am working on a task in which I have a data set containing samples of approx. 15 variables for every hour over several years. I also have a test data set (which continues at the time step right after the training data ends) containing values for all the variables except one. My task is to build a model on the training data that can predict that one variable in the test data set.

From reading online, I understood I could use vector autoregression (VAR). I have read many tutorials such as this one. I understand most of it except one thing: when it comes to predicting, the tutorials predict all the variables. I would like to do something different: predict just the one target variable, taking into account the values of the other variables in the test data set.

To illustrate this, let’s say Var Z is my target variable and this is my training set:

       Var X    Var Y     Var Z  
 Day 1     11       20       30
 Day 2     22       40       60
 Day 3     33       60       90

Then this is my test set for which I want to predict Var Z:

       Var X    Var Y     Var Z  
 Day 4     44       80       ??
 Day 5     55       84       ??
 Day 6     66       88       ??

But in the tutorials I have seen so far, they always predict all variables!

Question: How do I specify that I want to forecast only a single variable for certain timestamps, taking into account the values of the other variables at those timestamps? Is VAR not the right tool to use?

I would be most grateful if someone could point me in the right direction. I use Python.
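
One way to read the setup above is as a regression with exogenous inputs rather than a full VAR: the target is regressed on its own lag and on the same-time values of the other variables, which are known in the test period. The sketch below does this with plain NumPy least squares on synthetic data. The coefficients, the single lag, and the helper name design are assumptions for illustration, and this is a swapped-in technique, not VAR itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n, split = 500, 400  # total length and train/test boundary (assumed)

# Synthetic stand-ins for Var X, Var Y and the target Var Z.
x = rng.normal(size=n)
y = rng.normal(size=n)
z = np.zeros(n)
for t in range(1, n):
    z[t] = 0.5 * z[t - 1] + 2.0 * x[t] - 1.0 * y[t] + rng.normal(scale=0.1)


def design(z_prev, x_now, y_now):
    """Design matrix with columns [1, z_{t-1}, x_t, y_t]."""
    return np.column_stack([np.ones_like(x_now), z_prev, x_now, y_now])


# Fit on the training period only.
A_train = design(z[: split - 1], x[1:split], y[1:split])
coef, *_ = np.linalg.lstsq(A_train, z[1:split], rcond=None)

# Forecast the test period: X and Y are observed, Z is fed back recursively.
z_hat = []
prev = z[split - 1]  # last observed target value
for t in range(split, n):
    pred = float(coef @ np.array([1.0, prev, x[t], y[t]]))
    z_hat.append(pred)
    prev = pred
z_hat = np.array(z_hat)

rmse = float(np.sqrt(np.mean((z_hat - z[split:]) ** 2)))
print("coefficients:", np.round(coef, 3), "test RMSE:", round(rmse, 3))
```

Forecasts are produced recursively: each predicted Z becomes the lag for the next step, while X and Y are read directly from the test set.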



#StackBounty: #time-series #outliers #anomaly-detection #moving-window Optimal window size for contextual outlier detection

Bounty: 100

I am looking for methods to detect univariate contextual outliers in time series data. One example application is data from industrial plants with different (unknown) operating modes, slow trends, or shifts, but no seasonal effects.

In the following graph, the contextual outliers above and below the trend are clearly visible.

enter image description here

Most global outlier detection methods can be used with a sliding window approach. However, a method that automatically derives the optimal window size from the data, or even provides an adaptive window size, would be beneficial.
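
To make the sliding-window idea concrete, here is a minimal sketch of one common variant, a Hampel-style filter that compares each point with the median and the median absolute deviation (MAD) of its neighbourhood. The function name hampel_flags, the half-window of 10 and the 3-sigma threshold are assumptions chosen for illustration. The window size is still picked by hand here, which is exactly the parameter the question asks how to choose.

```python
import math
import statistics


def hampel_flags(values, half_window=10, n_sigmas=3.0):
    """Return indices whose distance from the local median exceeds
    n_sigmas robust standard deviations (MAD scaled by 1.4826)."""
    flagged = []
    for i, v in enumerate(values):
        lo = max(0, i - half_window)
        hi = min(len(values), i + half_window + 1)
        window = values[lo:hi]  # truncated near the series edges
        med = statistics.median(window)
        mad = statistics.median([abs(w - med) for w in window])
        if abs(v - med) > n_sigmas * 1.4826 * mad:
            flagged.append(i)
    return flagged


# Synthetic series: slow trend plus a small oscillation, with two injected outliers.
series = [0.01 * i + 0.1 * math.sin(i) for i in range(200)]
series[50] += 3.0
series[120] -= 3.0

flagged = hampel_flags(series)
print(flagged)
```

Because the median and MAD are computed locally, a point can be flagged relative to its context even if its value would be unremarkable elsewhere in the series.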




#StackBounty: #time-series #outliers #anomaly-detection Contextual Outlier Detection Method

Bounty: 100

I am looking for an overview of methods to detect univariate contextual outliers in time series data. One example application is data from industrial plants with different (unknown) operating modes or slow trends, but no seasonal effects.

In the following graph, the contextual outliers above and below the trend are clearly visible.

enter image description here

Most global outlier detection methods can be used with a window-based approach. However, a method that automatically considers the size of the context would be beneficial.

Which methods are recommended for that purpose?



#StackBounty: #time-series #modeling #econometrics #panel-data #random-effects-model Is my model defensible for my setting? Help checki…

Bounty: 50

I have panel data for 150 countries, with daily data from the start of 2018. I want to estimate the effect of the lags of one variable (x) on another variable (y), accounting for some covariates. I have selected a model I think works, but I’m not sure, and I could use some help checking that I’m not violating any key assumptions of the model.

Here is my model

(Sorry for the length.) I am using a linear regression model that includes lags of x and y as explanatory variables, with random coefficients and random effects for each unit. Specifically:

  • The form of the equation is y_{i,t} = a_i + j_1 * y_{i,t-1} + j_2 * y_{i,t-2} + B_1 * x_{i,t-1} + B_2 * x_{i,t-2} + …
  • a_i is a unit-specific intercept / random effect (not a fixed effect, as explained below)
  • In the equation, y is regressed on the past values of y, in addition to the past values of x
  • There are also 8 other covariates / other explanatory variables with the same form as x
  • One issue is that while y, x vary day by day, some covariates vary much more slowly – e.g. year by year.
  • I run a separate regression for each unit, which includes an intercept
  • I use the weighted average of the coefficients (each unit has its own coefficients); see page 204 here
  • The coefficients are just estimated with Ordinary Least Squares.
  • I can either find a way to Wald-test the coefficients of x, or just use the variance of the coefficients of x to test for significance.
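
The per-unit estimation and averaging steps in the list above can be sketched as follows. This is a simplified illustration on simulated data: it uses one lag of y and x, an unweighted average instead of the weighted one described above, and the usual mean-group standard error (cross-unit standard deviation divided by the square root of the number of units). All numeric values and variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n_units, n_obs = 30, 300
true_rho, true_beta_mean = 0.4, 1.0  # assumed values for the simulation

unit_coefs = []
for _ in range(n_units):
    a_i = rng.normal()                        # unit-specific intercept
    beta_i = rng.normal(true_beta_mean, 0.2)  # unit-specific slope on lagged x
    x = rng.normal(size=n_obs)
    y = np.zeros(n_obs)
    for t in range(1, n_obs):
        y[t] = a_i + true_rho * y[t - 1] + beta_i * x[t - 1] + rng.normal()

    # Separate OLS for this unit: y_t on [1, y_{t-1}, x_{t-1}].
    design = np.column_stack([np.ones(n_obs - 1), y[:-1], x[:-1]])
    coef, *_ = np.linalg.lstsq(design, y[1:], rcond=None)
    unit_coefs.append(coef)

unit_coefs = np.array(unit_coefs)              # shape (n_units, 3)
mg_estimate = unit_coefs.mean(axis=0)          # unweighted average across units
mg_se = unit_coefs.std(axis=0, ddof=1) / np.sqrt(n_units)
t_stats = mg_estimate / mg_se

print("average coefficients:", np.round(mg_estimate, 3))
print("standard errors:     ", np.round(mg_se, 3))
```

The t-statistic on the averaged x coefficient is one simple version of "using the variance of the coefficients of x to test for significance"; it does not address the non-stationarity concern raised below.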

Here are my concerns about the model

Some reasons I chose this model, and some concerns I had about it:

  • The data are non-stationary, and their order of integration is 1 (first-differencing the data yields stationary data). I think this could be a problem but I’m not sure how to fix it. The Toda-Yamamoto procedure recommends only modeling with p+1 lags of x and only Wald-testing the first p lags, but I’m not sure if that applies here.
  • Because of the equation’s form, the error term correlates with the first-difference of y. I think this is called Nickell bias, which I know disappears as T grows – my T is large so I think it should be okay.
  • x is lagged because it cannot have an immediate effect on y – the effect could be delayed by between 1 and 14 days (from domain knowledge)
  • The Granger Causality test is an intuitive choice for my data, and this equation has the same form as the equation used in the Granger test. Intuitively, if x has a causal effect on y, x should affect y beyond the past values of y alone.
  • I tested for serial correlation in the residuals and there isn’t any, for any of the panel units.



#StackBounty: #keras #time-series #forecasting #recurrent-neural-network Problems to understand how to create the input data for time s…

Bounty: 50

I just started to use recurrent neural networks (RNN) with Keras for time-series forecasting and I found this tutorial Forecasting with RNN. I have difficulties understanding how to build the training data both regarding the syntax and the format of the input data.

Here is the code:

import pandas as pd
import numpy as np
import tensorflow as tf

from tensorflow import keras
from matplotlib import pyplot as plt

# Read the data for the parameters from a csv file
df = pd.read_csv("C:/Users/Python/Data/tutorial_electricityPrice.csv", sep =",")

#Delete the first column as it is not used in the tutorial for forecasting
del df['datetime']


data = df.values

n_steps = 168

series_reshaped =  np.array([data[i:i + (n_steps+24)].copy() for i in range(len(data) - (n_steps+24))])


X_train = series_reshaped[:43800, :n_steps] 
X_valid = series_reshaped[43800:52560, :n_steps] 
X_test = series_reshaped[52560:, :n_steps] 
Y = np.empty((61134, n_steps, 24))  
for step_ahead in range(1, 24 + 1):     
   Y[..., step_ahead - 1] =   series_reshaped[..., step_ahead:step_ahead + n_steps, 0]
 
Y_train = Y[:43800] 
Y_valid = Y[43800:52560] 
Y_test = Y[52560:]

np.random.seed(42)
tf.random.set_seed(42)

model6 = keras.models.Sequential([
    keras.layers.SimpleRNN(20, return_sequences=True, input_shape=[None, 6]),
    keras.layers.SimpleRNN(20, return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(24))
])

model6.compile(loss="mean_squared_error", optimizer="adam", metrics=['mean_absolute_percentage_error'])
history = model6.fit(X_train, Y_train, epochs=10,batch_size=64,
                    validation_data=(X_valid, Y_valid))

So in this case 168 hours of the past are used (n_steps) to make a prediction for the next 24 hours of electricity prices. 6 features are used.

I have problems understanding both the format and the syntax used to create the input data for the RNN.

Format question

I uploaded a screenshot of the dimensions of the data arrays from the Spyder Variable Explorer. Basically, the full data array ‘series_reshaped’ has size (61134, 192, 6). The input data X_train has size (43800, 168, 6): the first dimension is the time slot, the second dimension is the past time slots used for prediction, and the third dimension is the 6 features for each of the 168 past time slots. The labels Y_train have size (43800, 168, 24). Here I do not understand why the second dimension is 168. As far as I understand, for each of the 168 past values * 6 features of the input data, we have 24 target values. So why is the second dimension not 168*6 = 1008, given that we map 1008 inputs to 24 outputs?

Syntax question

I do not really understand how these lines work in Python:

for step_ahead in range(1, 24 + 1):     
   Y[..., step_ahead - 1] =   series_reshaped[..., step_ahead:step_ahead + n_steps, 0]

Why does this create a Y array of dimension (61134, 168, 24) and transfer the correct data into it? The index step_ahead only takes values from 1 to 24, and we assign 168 values from the past values of series_reshaped to 24 entries of the second dimension of Y. So why do we only assign the values to the 24 entries of the second dimension of Y and not to the full 168 entries? And why are we looking into the past data of the series_reshaped array (second dimension)? For me these lines are extremely confusing, although they apparently do the right thing. Can anyone tell me a little more about the syntax of those lines?
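
A scaled-down version of the loop makes the indexing easier to inspect. The sketch below uses 4 past steps, a horizon of 2 and 3 features instead of 168, 24 and 6; these small sizes and the synthetic data array are assumptions for illustration. In `Y[..., step_ahead - 1]` the ellipsis stands for the first two axes (window and time step), so each pass fills one slice along the last axis, and `windows[..., a:b, 0]` takes feature 0 over a shifted range of time steps.

```python
import numpy as np

n_steps, horizon, n_features = 4, 2, 3  # small stand-ins for 168, 24, 6

# Fake data: 20 time slots, 3 features; feature 0 at time i equals 3 * i.
data = np.arange(20 * n_features, dtype=float).reshape(20, n_features)

window = n_steps + horizon
windows = np.array([data[i:i + window] for i in range(len(data) - window)])
print(windows.shape)  # (number of windows, n_steps + horizon, n_features)

X = windows[:, :n_steps]  # inputs: first n_steps rows of every window

# One target vector of length `horizon` for EVERY time step of every window.
Y = np.empty((len(windows), n_steps, horizon))
for step_ahead in range(1, horizon + 1):
    # Left side: all windows, all time steps, horizon slot step_ahead - 1.
    # Right side: feature 0, shifted forward by step_ahead time steps.
    Y[..., step_ahead - 1] = windows[..., step_ahead:step_ahead + n_steps, 0]

print(X.shape, Y.shape)
```

Read element by element, `Y[w, t, h - 1]` holds feature 0 of the original series at position `w + t + h`, that is, the value h steps after time step t of window w. The second axis of Y therefore indexes time steps, not flattened inputs, which is why it has length 168 in the original code.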

Generally, I’d appreciate every comment and would be quite thankful for your help.


Update

Related questions: Hi all, as I still have problems with those questions, I would like to ask some related ones.

  1. About the creation of the input data: how can I know which structure the input data should have? And how can I then derive something like this code?

for step_ahead in range(1, 24 + 1):
    Y[..., step_ahead - 1] = series_reshaped[..., step_ahead:step_ahead + n_steps, 0]

  2. At the end of the training in the tutorial they use the following code for the prediction:

Y_pred = model6.predict(X_test)

last_list=[]

for i in range (0, len(Y_pred)):
  last_list.append((Y_pred[i][0][23]))

So they take Y_pred[i][0][23] to construct the one-dimensional list of predicted values. Why do they take [0][23] and not, for example, [1][14]? They want to predict 24 hours in advance. Can I just always take Y_pred[i][0][23]?

  3. I still do not understand one of my initial questions: why does the label dataset Y for training have shape [Batch, 168, 24] when return_sequences=True? We use the past 168 values to forecast 24 hours, so we use 168*6 features for forecasting. For each element in the batch (each time slot) we then have an output of 24 hours. So the training labels should have dimension [Batch, 24] and not [Batch, 168, 24]. For every time slot in the batch we need 168 past values. How is it then possible to map 24 hours of predictions to each of the 168 past values?
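
The two indexing choices raised in question 2 can be compared on a stand-in array with the same shape as the model output. The random array below replaces `model6.predict(X_test)` and is purely an assumption for illustration. Given how Y is built in the code above, entry `[i, t, h]` is trained against the value `h + 1` steps after time step `t` of window `i`. This is a reading of the code shown in this post, not a statement about what the tutorial intends.

```python
import numpy as np

batch, n_steps, horizon = 5, 168, 24
rng = np.random.default_rng(0)

# Stand-in for model6.predict(X_test): one 24-value forecast per time step.
Y_pred = rng.normal(size=(batch, n_steps, horizon))

# Y_pred[i][0][23]: the 24-steps-ahead value predicted at the FIRST time step,
# i.e. after the recurrent layer has seen only one input row of window i.
first_step_h24 = Y_pred[:, 0, 23]

# Y_pred[i][-1]: all 24 horizons predicted at the LAST time step,
# i.e. after the whole 168-step window has been processed.
last_step_all = Y_pred[:, -1, :]

print(first_step_h24.shape, last_step_all.shape)
```

Under this reading, `Y_pred[:, -1, :]` is the slice that corresponds to "the next 24 hours after the full window", while the per-time-step targets of shape [Batch, 168, 24] exist so that the loss is computed at every time step during training.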



#StackBounty: #r #time-series #fixed-effects-model #difference-in-difference #multiple-seasonalities Sinusoidal unit-specific time trends

Bounty: 100

Suppose I have a panel dataset with monthly observations over 10 years. I have a simple dummy intervention, where some policy is put in place around the Spring in every year and only affects some people or groups. In other words, the units over time experience multiple shocks. Examples include something like emergency room re-entry or a crime policy affecting districts during warm weather. Note: the dummy is turning ‘on’ and ‘off’ over time and there is a seasonal component observed in the raw trends over time.

Let’s assume I’m observing aggregate hospital admissions over time. To include hospital-specific linear time trends, I would multiply each hospital dummy by a continuous linear time trend variable, where $t = 1, 2, 3, … , T$. To assess the robustness of the findings, it is common to estimate the following equation:

$$
y_{it} = \alpha_{0i} + \beta_i (t \times \alpha_{1i}) + \lambda_t + \delta D_{it} + u_{it},
$$

where $D_{it}$ is a treatment dummy equal to 1 if the facility was treated and is in a post-treatment time period, 0 otherwise. The parameters $\alpha_{0i}$ and $\lambda_t$ denote hospital and month fixed effects, respectively. The interaction of the continuous time trend variable $t$ with each $\alpha_{1i}$ gives each $i$ its own unique linear time trend. This works well and is quite common in the literature with one intervention. But suppose the policy is removed and then reinstated in subsequent years. In other words, the intervention repeats every year around Spring time.

Here is what I am observing in practice. The toy dataset shows visits to a facility over 24 months. Every 12 months I see this inverted U-shape. The shaded regions represent the shocks observed over time.

# Loading the required packages

library(tidyr)
library(dplyr)
library(ggplot2)  # needed for the ggplot() call below

# Creating some fake data

set.seed(1987)

y <- c(8.4, 10.0, 11.8, 12.2, 13.1, 13.3, 12.0, 12.4, 12.0, 10.3, 10.0, 10.5, 9.3, 8.0, 8.1, 10.1 , 11.5, 12.1, 12.5, 12.1, 10.7, 8.8, 8.7, 7.0)
shocks <- c(5, 6, 18, 19)  # on/off shocks

df <- 
  tibble(unit = rep(1:10, each = 24),
         month = rep(1:24, 10),
         y = rep(y, 10) + rnorm(240, 1, 5)
         ) %>%
  mutate(group = ifelse(unit > 5, "Exposed", "Unexposed"),
         intervention = ifelse(unit > 5 & month %in% shocks, 1, 0),
         time = rep(1:10, 24),       # linear time trend
         sine = sin(2*pi*time/12),   # sin()
         cosine = cos(2*pi*time/12)  # cos()
         )

# Producing a fitted trend line for each group
# Shaded regions show transient shocks over time
# Some units become exposed to a policy and others do not

df %>%
  ggplot(., aes(x = factor(month), y = y, color = factor(group))) +
  geom_smooth(aes(group = group)) +
  labs(x = "Month (24 Periods)", y = "Mean Outcome") +
  annotate("rect", xmin = 5, xmax = 7, ymin = -Inf, ymax = Inf, fill = "maroon", alpha = .2) +
  annotate("rect", xmin = 18, xmax = 20, ymin = -Inf, ymax = Inf, fill = "maroon", alpha = .2) +
  scale_color_manual(name = "Group:", values = c("Exposed" = "maroon",
                                "Unexposed" = "royalblue")) +
  theme_classic() +
  theme(
    legend.position = "bottom"
    )

Cyclical Pattern

Questions:

(1) Is it valid to allow each unit to have their own sinusoidal trend, which is allowed to flexibly model the cyclical pattern over time between the two groups?

(2) Is there a better approach? I don’t see how adding a quadratic time trend would help given that the intervention repeats. I would argue that including a higher order polynomial term would be too demanding. Or maybe not?

To be clear, this is more of a robustness check—not the main specification. I don’t think it has been done before, in part because most quasi-experimental evaluations involve units with one treatment history. In applied work, the linear time trend is a way to adjust for the possibility that the treatment group and the control group were on somewhat different growth trajectories before the shock. But with multiple interventions, I wanted to adjust for any differential "cyclical" patterns observed over time, as they seem to repeat every year with the reintroduction of the policy.
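
For concreteness, the unit-specific sinusoidal specification can be written in the same notation as the linear-trend equation above. This is a sketch assuming a single annual harmonic (a period of 12 months); the coefficient names $\gamma_i$ and $\theta_i$ are introduced here for illustration only:

$$
y_{it} = \alpha_{0i} + \gamma_i \sin\left(\frac{2\pi t}{12}\right) + \theta_i \cos\left(\frac{2\pi t}{12}\right) + \lambda_t + \delta D_{it} + u_{it}.
$$

The sine and cosine pair is equivalent to a single wave with unit-specific amplitude and phase, since $\gamma_i \sin(\omega t) + \theta_i \cos(\omega t) = A_i \sin(\omega t + \phi_i)$ with $A_i = \sqrt{\gamma_i^2 + \theta_i^2}$ and $\tan \phi_i = \theta_i / \gamma_i$. Including both terms therefore lets each unit peak at a different point in the year, not just with a different height.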

Please review my R code below. I am somewhat new to including sinusoidal trends. I welcome any insight or criticism of this approach.

# I multiplied each unit by sin() and cos() separately

> summary(mod <- lm(y ~ as.factor(unit)*sine + as.factor(unit)*cosine + as.factor(month) + intervention, data = df))

Call:
lm(formula = y ~ as.factor(unit) * sine + as.factor(unit) * cosine + 
    as.factor(month) + intervention, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-11.4567  -3.1386  -0.0271   3.0985  10.8461 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)               10.8550     2.8408   3.821 0.000181 ***
as.factor(unit)2          -0.6000     1.9757  -0.304 0.761700    
as.factor(unit)3           0.8304     1.7146   0.484 0.628729    
as.factor(unit)4          -1.5308     1.7592  -0.870 0.385329    
as.factor(unit)5           1.4748     1.9535   0.755 0.451217    
as.factor(unit)6           1.9856     1.6350   1.214 0.226108    
as.factor(unit)7           2.5895     2.0160   1.284 0.200584    
as.factor(unit)8           1.2193     1.7436   0.699 0.485217    
as.factor(unit)9           0.5477     1.8026   0.304 0.761602    
as.factor(unit)10          3.3756     1.9880   1.698 0.091192 .  
sine                       2.6086     2.8070   0.929 0.353920    
cosine                    -4.1625     3.3955  -1.226 0.221802    
as.factor(month)2          0.2706     2.7212   0.099 0.920883    
as.factor(month)3          3.8197     3.8195   1.000 0.318591    
as.factor(month)4         -3.7247     4.6494  -0.801 0.424083    
as.factor(month)5         -1.4893     5.5054  -0.271 0.787066    
as.factor(month)6         -1.3394     5.7128  -0.234 0.814891    
as.factor(month)7          2.3822     5.3802   0.443 0.658440    
as.factor(month)8          2.3526     4.8477   0.485 0.628043    
as.factor(month)9          3.2434     3.7984   0.854 0.394272    
as.factor(month)10         3.3325     2.9592   1.126 0.261546    
as.factor(month)11         2.8721     2.3530   1.221 0.223786    
as.factor(month)12         0.3874     2.7212   0.142 0.886934    
as.factor(month)13        -0.7625     3.8195  -0.200 0.841997    
as.factor(month)14        -7.9720     4.6494  -1.715 0.088077 .  
as.factor(month)15        -7.7604     5.4126  -1.434 0.153318    
as.factor(month)16        -2.0334     5.6234  -0.362 0.718065    
as.factor(month)17        -1.1144     5.3802  -0.207 0.836129    
as.factor(month)18         2.9657     4.9512   0.599 0.549912    
as.factor(month)19         1.2505     3.9296   0.318 0.750660    
as.factor(month)20         4.6338     2.9592   1.566 0.119072    
as.factor(month)21         1.9352     2.3530   0.822 0.411883    
as.factor(month)22        -3.5087     2.7212  -1.289 0.198861    
as.factor(month)23        -4.7884     3.8195  -1.254 0.211542    
as.factor(month)24        -6.3957     4.6494  -1.376 0.170601    
intervention              -0.7227     2.0136  -0.359 0.720054    
as.factor(unit)2:sine     -2.5287     5.0387  -0.502 0.616358    
as.factor(unit)3:sine     -0.1849     3.4525  -0.054 0.957344    
as.factor(unit)4:sine     -0.7943     3.5224  -0.226 0.821831    
as.factor(unit)5:sine     -2.1947     4.9591  -0.443 0.658593    
as.factor(unit)6:sine     -1.7032     2.0679  -0.824 0.411209    
as.factor(unit)7:sine     -6.4227     5.0401  -1.274 0.204138    
as.factor(unit)8:sine     -0.3358     3.4440  -0.098 0.922423    
as.factor(unit)9:sine     -4.5655     3.5362  -1.291 0.198271    
as.factor(unit)10:sine    -2.6118     4.9475  -0.528 0.598196    
as.factor(unit)2:cosine    1.4540     6.0715   0.239 0.810997    
as.factor(unit)3:cosine    6.8385     4.2671   1.603 0.110720    
as.factor(unit)4:cosine   -1.1452     4.1782  -0.274 0.784312    
as.factor(unit)5:cosine    7.7818     6.1802   1.259 0.209551    
as.factor(unit)6:cosine    1.2134     2.4830   0.489 0.625636    
as.factor(unit)7:cosine    5.3385     6.0575   0.881 0.379291    
as.factor(unit)8:cosine    3.0046     4.2863   0.701 0.484185    
as.factor(unit)9:cosine    0.2422     4.1670   0.058 0.953707    
as.factor(unit)10:cosine   6.4106     6.1764   1.038 0.300653    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.262 on 186 degrees of freedom
Multiple R-squared:  0.2956,    Adjusted R-squared:  0.0949 
F-statistic: 1.473 on 53 and 186 DF,  p-value: 0.03175

All that matters is the estimate on intervention, at least for reporting purposes. I’m curious to see if effects hold even after the inclusion of such trends. I understand this is a very demanding procedure. To offer some perspective, I’m working with about 60 units and each unit is observed over 120 months.



#StackBounty: #r #regression #time-series #arima #data-imputation ARIMA with external regressors for district heating load time-series …

Bounty: 50

TL;DR ARIMA model does not work as expected.

We have a district heating load series with some external regressors such as temperature and wind speed. Some of the data are missing, and we would like to impute them. We work with the fable framework in R.

> ts_heat_district
# A tsibble: 70,176 x 7 [15m] <UTC>
   datetime            power_district  temp  wind radiation wm_1w wm_4w
   <dttm>                       <dbl> <dbl> <dbl>     <dbl> <dbl> <dbl>
 1 2019-01-01 00:00:00             NA  8     8.18         0  6.71  5.02
 2 2019-01-01 00:15:00             NA  8.02  8.26         0  6.72  5.02
 3 2019-01-01 00:30:00             NA  8.05  8.34         0  6.72  5.02
 4 2019-01-01 00:45:00             NA  8.07  8.43         0  6.73  5.02
 5 2019-01-01 01:00:00             NA  8.1   8.51         0  6.73  5.03
 6 2019-01-01 01:15:00             NA  8.1   8.58         0  6.73  5.03
 7 2019-01-01 01:30:00             NA  8.1   8.66         0  6.74  5.03
 8 2019-01-01 01:45:00             NA  8.1   8.73         0  6.74  5.03
 9 2019-01-01 02:00:00             NA  8.1   8.8          0  6.75  5.04
10 2019-01-01 02:15:00             NA  8.07  8.97         0  6.75  5.04
# ... with 70,166 more rows

enter image description here
enter image description here

I started with a simple linear regression model, which works quite OK. The Fourier term should capture the daily pattern. The yearly pattern is captured by temperature and weighted averages of the temperature (variables wm_1w and wm_4w).
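
To show what the two engineered regressors in this model amount to, here is a small NumPy sketch of daily Fourier terms for 15-minute data and of the heating-degree term `pmax(15 - temp, 0)`. The helper name fourier_terms and the sample temperatures are assumptions; this mirrors the idea only and is not fable's internal implementation.

```python
import numpy as np


def fourier_terms(t, period, K):
    """Columns sin(2*pi*k*t/period), cos(2*pi*k*t/period) for k = 1..K."""
    t = np.asarray(t, dtype=float)
    cols = []
    for k in range(1, K + 1):
        angle = 2.0 * np.pi * k * t / period
        cols.append(np.sin(angle))
        cols.append(np.cos(angle))
    return np.column_stack(cols)


period = 24 * 4                     # 96 quarter-hour steps per day
t = np.arange(3 * period)           # three days of time indices
F = fourier_terms(t, period, K=12)  # 12 harmonics give 24 columns

# Heating-degree regressor: positive only when temperature is below 15.
temp = np.array([8.0, 14.5, 15.0, 20.3])  # hypothetical temperatures
heating_degrees = np.maximum(15.0 - temp, 0.0)

print(F.shape, heating_degrees)
```

Every column of F repeats exactly after 96 rows, so these terms can only describe a daily shape that is identical from day to day; anything that varies between days has to come from the other regressors or from the error model.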

fit_heat_district <- ts_heat_district %>%
  model(
    lm = TSLM(power_district ~ temp + I(pmax(15-temp, 0)) + wind + radiation + 
              wm_1w + wm_4w +
              fourier(period = "1 day", K = 24*4/2)),
    arima_fourier = ARIMA(power_district ~ PDQ(0, 0, 0) +
                            temp + I(pmax(15-temp, 0)) + wind + radiation +
                            wm_1w + wm_4w +
                            fourier(period = "1 day", K = 12)),
    arima_seasonal = ARIMA(power_district ~ PDQ(0, 0, 0, period = 24*4) +
                             temp + I(pmax(15-temp, 0)) + wind + radiation +
                             wm_1w + wm_4w)
  )

The imputation looks as expected:

fit_heat_district %>%
  select("lm") %>%
  fabletools::interpolate(ts_heat_district) %>%
  autoplot() +
  ggtitle("Linear model")

enter image description here

The residuals of the linear model, plotted against the external regressors, seem OK:

ts_heat_district %>%
  mutate(temp_15 = ifelse(temp > 15, 0, 15-temp)) %>%
  left_join(residuals(select(fit_heat_district, lm)), by = "datetime") %>%
  pivot_longer(temp:temp_15,
               names_to = "regressor", values_to = "x") %>%
  ggplot(aes(x = x, y = .resid)) +
  geom_point() +
  facet_wrap(. ~ regressor, scales = "free_x") +
  labs(y = "Residuals", x = "")

enter image description here

There seems to be some autocorrelation left which should be captured by ARIMA.

fit_heat_district %>%
  select(lm) %>%
  gg_tsresiduals()

enter image description here

I tried a seasonal ARIMA model and a non-seasonal ARIMA model with a Fourier term because the period is relatively long (24*4=96).

fit_heat_district <- ts_heat_district %>%
  model(
    lm = TSLM(power_district ~ temp + I(pmax(15-temp, 0)) + wind + radiation + 
              wm_1w + wm_4w +
              fourier(period = "1 day", K = 24*4/2)),
    arima_fourier = ARIMA(power_district ~ PDQ(0, 0, 0) +
                            temp + I(pmax(15-temp, 0)) + wind + radiation +
                            wm_1w + wm_4w +
                            fourier(period = "1 day", K = 12)),
    arima_seasonal = ARIMA(power_district ~ PDQ(0, 0, 0, period = 24*4) +
                             temp + I(pmax(15-temp, 0)) + wind + radiation +
                             wm_1w + wm_4w)
  )

The interpolated data from the ARIMA models don’t look as expected:

fit_heat_district %>%
  select("arima_fourier") %>%
  fabletools::interpolate(ts_heat_district) %>%
  autoplot() +
  ggtitle("arima model with fourier")

enter image description here

fit_heat_district %>%
  select("arima_seasonal") %>%
  fabletools::interpolate(ts_heat_district) %>%
  autoplot() +
  ggtitle("seasonal arima model")

enter image description here

What is wrong with the ARIMA models? What am I missing?

EDIT 2021-04-07

Some additional information:

  • tried versions with a Box-Cox transformation to stabilize the variance (got lambda via Guerrero’s method); at least both AIC and MAE for the linear model improved slightly
  • KPSS test: one round of differencing is necessary
  • for seasonal differencing, the p-value would be 0.0723
  • tried solutions from similar posts on CV, like setting D = 1 and period = 96
  • tried models without external regressors (keep it simple):

ARIMA w/ fourier

ARIMA(box_cox(heat_output, lambda = lambda) ~ pdq(0:5, 1, 0:5) + PDQ(0, 0, 0) + fourier(period = "1 day", K = 12))

<LM w/ ARIMA(5,1,0) errors>

Forecast and imputation values are still a straight line.

ARIMA w/o fourier

ARIMA(box_cox(heat_output, lambda = lambda) ~ pdq(0:5, 1, 0:5) + PDQ(0:2, 1, 0:2, period = 96))

<ARIMA(5,1,0)(0,1,0)[96]>

The interpolation looks strange and the model is very slow:

enter image description here

I am wondering whether ARIMA is just not an appropriate method for this kind of time series.

Any information is very welcome.

