#StackBounty: #time-series #forecasting #prediction-interval Generic Prediction Interval Methods

Bounty: 50

I’m trying to implement a generic method to calculate prediction intervals for univariate forecast methods.

While the literature is quite clear on how such a prediction interval is defined (e.g., here: https://otexts.com/fpp2/prediction-intervals.html), it remains vague on how to calculate the estimate $\hat{\sigma}_h$ of the standard deviation for a given forecast step $h$, i.e., the residual standard deviation.

I have come up with a generic approach whose correctness I would like to verify with the community:

Given any time series and a forecast model, I can estimate the residuals by performing sliding-window validation on the time series: I apply my model to windows of the available data and compute the average error the model makes at each forecast step.

Would this be a good generic estimator for the residual standard deviation for each of the forecast steps?

I would then multiply this estimate by the constant for the desired coverage probability (e.g., 1.28 for an 80% interval), perhaps with some adjustment for seasonality, to calculate the lower and upper bounds of my forecast.
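To make this concrete, below is a minimal sketch of the procedure, where fit_forecast(train, horizon) is a hypothetical placeholder for whatever model is being wrapped:

import numpy as np

def horizon_residual_sd(series, fit_forecast, window, horizon):
    # Rolling-origin (sliding window) evaluation: fit on each window of
    # the series, forecast `horizon` steps, and collect errors per step h.
    errors = [[] for _ in range(horizon)]
    for start in range(len(series) - window - horizon + 1):
        train = series[start:start + window]
        actual = series[start + window:start + window + horizon]
        forecast = fit_forecast(train, horizon)
        for h in range(horizon):
            errors[h].append(actual[h] - forecast[h])
    # the standard deviation of the step-h errors estimates sigma_h
    return np.array([np.std(e, ddof=1) for e in errors])

def interval(point_forecast, sigma_h, z=1.28):
    # z = 1.28 gives the ~80% interval, assuming roughly normal errors
    return point_forecast - z * sigma_h, point_forecast + z * sigma_h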

Would this be a robust and sound approach? If not, how could I better calculate these intervals?


Get this bounty!!!

#StackBounty: #time-series #probability #stochastic-processes How to get this analytical results for probability of wait times

Bounty: 50

I’m working with a continuous-time stochastic process in which a particular event may happen at some time $t$, with an unknown underlying distribution.

One "run" of a simulation of this process results in a series of event times, one for each time the event happened within the run. So the output is just $[t_1, t_2, \ldots, t_n]$.

From this output I’m trying to calculate a metric I’ll call $u$, which is defined as "the probability that, if you choose a random time $t$ within a run and look within the time range $[t, t+L]$ (for a pre-specified $L$), at least one event occurred in that range".

I’ve found some documentation (from an employee long gone from the company) that gives an analytical form for $u$, and I’ve verified that this form aligns very well with experimental data, but I haven’t been able to recreate the derivation that leads to it.

The analytical form makes use of a probability density function of wait times, $f(t)$, where a wait time is simply the time between consecutive events. So the experimental wait times are simply $[t_1, t_2-t_1, t_3-t_2, \ldots, t_n - t_{n-1}]$.

The form I’m given is $u = 1 - \frac{\int_L^{\infty} (t-L)f(t)\,dt}{\int_0^{\infty} t f(t)\,dt}$, where $t$ is the wait time.

It’s clear that $\frac{\int_L^{\infty} (t-L)f(t)\,dt}{\int_0^{\infty} t f(t)\,dt}$ is the complementary probability, i.e., the probability that no event occurs in this random time range of length $L$, but I’m still not clear on how the exact terms are arrived at.

In my attempt to make sense of it I’ve reconstructed it into $u = 1 - \frac{E(t-L \mid t > L)\,P(t > L)}{E(t)}$,

which makes some intuitive sense to me, but I still can’t find a way to start from the original problem and arrive at any of these forms of the analytical solution.
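For what it’s worth, here is a minimal simulation sketch I can use to check the form numerically, assuming exponential wait times (a special case in which the formula reduces to $u = 1 - e^{-\lambda L}$):

import numpy as np

rng = np.random.default_rng(0)
lam, L = 0.5, 2.0

# one long run: exponential wait times accumulate into event times
event_times = np.cumsum(rng.exponential(1 / lam, 200_000))

# empirical u: sample random times t, check for an event in [t, t+L]
t = rng.uniform(0, event_times[-1] - L, 100_000)
idx = np.searchsorted(event_times, t)        # first event at or after t
print("empirical u: ", (event_times[idx] <= t + L).mean())
print("analytical u:", 1 - np.exp(-lam * L))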

Any guidance on this would be greatly appreciated.


Get this bounty!!!

#StackBounty: #r #time-series #clustering #algorithms Detect spans of consecutive values with average over certain limit

Bounty: 50

I have weekly data for the volume of product ordered by each customer. I want to identify the longest span of consecutive weeks such that the average volume over the span is >= 33,000 (approximately; up to 2,000 under would be okay too). There can be multiple distinct spans, and spans must be at least 4 weeks long.

A dummy dataset is given below in r. The expected output for this dataset is spans 17–32 and 45–48, as marked by the green line in the plot below. Span 1–2 does not qualify because it is not at least 4 weeks long.

I need to process thousands of datasets and was wondering if there’s a good algorithm to help with this. I felt hierarchical clustering or DBSCAN might be useful here, but I couldn’t get the right results; a brute-force sketch follows the plot below.

set.seed(1)

df <- data.frame(
  week = 1:52,
  vol = c(rnorm(2, 35000, 1000),
          runif(14, 12000, 20000), 
          rnorm(7, 35000, 1000),
          runif(1, 12000, 20000),
          rnorm(8, 35000, 1000), 
          runif(12, 12000, 20000),
          rnorm(4, 35000, 100),
          runif(4, 12000, 20000)
          )
)

barplot(df$vol, names.arg = df$week)

[Barplot of weekly volumes; the qualifying spans 17–32 and 45–48 are marked with a green line]
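For reference, the brute-force version of the scan I’m after is cheap with prefix sums (n = 52, so an O(n²) scan per dataset is fine even for thousands of datasets). A minimal sketch, in Python for brevity and under one reading of the tolerance rule:

import numpy as np

def qualifying_spans(vol, target=33_000, tol=2_000, min_len=4):
    # prefix sums give any span mean in O(1)
    cum = np.concatenate([[0.0], np.cumsum(vol)])
    n = len(vol)
    spans = []
    for i in range(n):
        best = None
        for j in range(i + min_len - 1, n):
            if (cum[j + 1] - cum[i]) / (j - i + 1) >= target - tol:
                best = (i, j)          # keep the longest qualifying end
        if best is not None:
            spans.append(best)
    # keep only spans not contained in another qualifying span
    return [s for s in spans
            if not any(m != s and m[0] <= s[0] and s[1] <= m[1] for m in spans)]

# spans are 0-based; add 1 to both ends to match the week numbers above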


Get this bounty!!!

#StackBounty: #time-series #causality Causal assumptions when treating time series as a bunch of points

Bounty: 100

Suppose there are two time series, $x_t$ and $y_t$, that capture daily counts of some sort. $x_t$ is believed to have a causal impact on $y_t$. Suppose further that a simple regression is fit to the data, disregarding the time aspect:

$$y_t = \alpha + \beta x_t + \epsilon.$$

There are at least two features that make this case difficult to reason about: the treatment is non-binary, and there is a time aspect.

What assumptions are needed in order to legitimately give $\beta$ a causal interpretation?


Get this bounty!!!

#StackBounty: #python #scikit-learn #time-series any workaround to do forward forecasting for estimating time series in python?

Bounty: 100

I want to make forward forecasts for a monthly time series of air pollution data, e.g., an estimate of the air pollution index 3~6 months ahead. I tried scikit-learn models, and fitting the data to a model works fine. But what I want is a forward-period estimate: what the air pollution index will be, say, 6 months from now. In my current attempt I was able to train the model using scikit-learn, but I don’t know how that forward forecasting can be done in python. What should I do? Can anyone suggest a possible workaround?

my attempt

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error  # accuracy_score is for classification, not regression
from sklearn.linear_model import BayesianRidge

url = "https://gist.githubusercontent.com/jerry-shad/36912907ba8660e11cd27be0d3e30639/raw/424f0891dc46d96cd5f867f3d2697777ac984f68/pollution.csv"

df = pd.read_csv(url, parse_dates=['dates'], index_col='dates')  # date index so the split below works
df.drop(columns=['Unnamed: 0'], inplace=True)

resultsDict = {}
predictionsDict = {}

split_date = '2017-12-01'
df_training = df.loc[df.index <= split_date]
df_test = df.loc[df.index > split_date]

df_tr = df_training.drop(['pollution_index'], axis=1)
df_te = df_test.drop(['pollution_index'], axis=1)

scaler = StandardScaler()
scaler.fit(df_tr)
X_train = scaler.transform(df_tr)
y_train = df_training['pollution_index']
X_test = scaler.transform(df_te)
y_test = df_test['pollution_index']
X_train_df = pd.DataFrame(X_train, columns=df_tr.columns)
X_test_df = pd.DataFrame(X_test, columns=df_te.columns)

reg = BayesianRidge()  # was linear_model.BayesianRidge(), but only the class is imported
reg.fit(X_train, y_train)
yhat = reg.predict(X_test)
resultsDict['BayesianRidge'] = mean_absolute_error(y_test, yhat)

update 2

this is my attempt using an ARMA model:

from statsmodels.tsa.arima_model import ARMA  # ARMA is what's used below, not ARIMA
from sklearn.metrics import mean_absolute_error
from tqdm import tqdm

yhat = list()
for t in tqdm(range(len(df_test['pollution_index']))):
    temp_train = df[:len(df_training) + t]  # expand the training window one step at a time
    model = ARMA(temp_train['pollution_index'], order=(1, 1))
    model_fit = model.fit(disp=False)
    predictions = model_fit.predict(start=len(temp_train), end=len(temp_train), dynamic=False)
    yhat = yhat + [predictions]

yhat = pd.concat(yhat)
resultsDict['ARMA'] = mean_absolute_error(df_test['pollution_index'], yhat.values)  # `evaluate` was undefined

but this can’t help me make a forward forecast for my time series data. What I want is the estimated values of pollution_index 3~6 months ahead of the data. How can I overcome the limitation of my current attempt?
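To be explicit about what I mean by "forward": I imagine something like the following sketch, where the model is fit on the full history and then asked for steps beyond the last observation (6 steps standing in for the 3~6 month horizon):

from statsmodels.tsa.arima_model import ARMA

model_fit = ARMA(df['pollution_index'], order=(1, 1)).fit(disp=False)

# forecast() returns out-of-sample point forecasts, standard errors,
# and confidence intervals for steps beyond the end of the data
forecast, stderr, conf_int = model_fit.forecast(steps=6)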

update: goal

for clarification, I am not asking which model or approach works best; I am trying to figure out how to make a reliable forward forecast for a given time series (the pollution index), and how to correct my current attempt if it is not efficient and not ready for forward-period estimation.

update: desired output

here is a sketch of the desired forecasting plot that I want to make:

[Hand-drawn sketch of the desired forecast plot]


Get this bounty!!!

#StackBounty: #machine-learning #time-series #neural-networks #python #predictive-models Time series prediction completely off using ESN

Bounty: 50

I am attempting to predict closing prices, based on closing prices extracted from OHLC data over a two-month window with 10-minute intervals (roughly 8,600 data points). For this attempt, I am building an echo state network (ESN), following this tutorial.

With the code below, the prediction is fairly worthless: it looks like noise around an arbitrary average and does not even resemble the latest data points in the training data. This is nothing close to what the ESN in the tutorial manages at this point. I have tried to improve the results by manually tweaking the hyperparameters n_reservoir, sparsity, and spectral_radius, but all to no avail. ESNs were briefly touched upon during a 4-week course last spring, but not enough for me to understand where I am at fault.

[Plot of the noisy ESN output described above]

My code:

import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from pyESN import ESN
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv('path')
data = df['close'].to_numpy()

n_reservoir = 500        # number of reservoir neurons
sparsity = 0.2           # proportion of recurrent weights set to zero
spectral_radius = 1.2    # scaling of the recurrent weight matrix
noise = .0005            # noise added to each neuron (regularization)

esn = ESN(n_inputs = 1,
          n_outputs = 1,
          n_reservoir = n_reservoir,
          sparsity=sparsity,
          spectral_radius = spectral_radius,
          noise=noise)

trainlen = 7000      # points used to fit the network each iteration
future = 10          # steps predicted per iteration
futureTotal = 100    # total number of predicted points
pred_tot = np.zeros(futureTotal)

# slide the training window forward by `future` steps each iteration,
# feeding a constant input of ones with the price series as the target
for i in range(0, futureTotal, future):
    pred_training = esn.fit(np.ones(trainlen), data[i:trainlen + i])
    prediction = esn.predict(np.ones(future))
    pred_tot[i:i + future] = prediction[:, 0]

plt.plot(range(0,trainlen+futureTotal),data[0:trainlen+futureTotal],'b',label="Data", alpha=0.3)
plt.plot(range(trainlen,trainlen+futureTotal),pred_tot,'k',  alpha=0.8, label='Free Running ESN')

lo,hi = plt.ylim()
plt.plot([trainlen,trainlen],[lo+np.spacing(1),hi-np.spacing(1)],'k:', linewidth=4)

plt.title(r'Ground Truth and Echo State Network Output')
plt.xlabel(r'Time', labelpad=10)
plt.ylabel(r'Price ($)', labelpad=10)
plt.legend(loc='best')
sns.despine()
plt.show()


Get this bounty!!!

#StackBounty: #time-series #autocorrelation #autoregressive #seasonality Day-of-week effects on regression coefficients in autoregressive model?

Bounty: 50

I have a time series (sampled daily, weekdays only) whose volatility clearly depends on the day of the week. In particular, the standard deviation of the differenced series $\Delta y_t$ is smallest on Mondays and peaks on Thursdays.

I have considered GARCH-style models for the volatility, with the respective dummy variables for day of week. However, I am not interested in the volatility of the errors per se, but rather in how the mean equation is affected by the day of the week. For example, if I fit an AR(1) model to $\Delta y$, I observe that its residuals $\varepsilon_t$ on Wednesdays are correlated with $\Delta y_{t-1}$.

In addition, if I assume $\Delta y_t = \phi \Delta y_{t-1} + \varepsilon_t$ but estimate $\phi$ via OLS regression for each weekday separately, I get the following for each day of the week:

monday: $\phi = 0.68$ (SE = 0.02)
tuesday: $\phi = 0.76$ (SE = 0.04)
wednesday: $\phi = 1.03$ (SE = 0.02)
thursday: $\phi = 0.90$ (SE = 0.02)
friday: $\phi = 0.80$ (SE = 0.018)

Correct me if I’m wrong, but to me these effects cannot be captured by a GARCH model for the errors. In light of the residuals being correlated with $\Delta y_{t-1}$, I have considered a model that looks something like this:
$$
\Delta y_t = \phi \Delta y_{t-1} + \varepsilon_t, \\
\varepsilon_t = \gamma 1_{\lbrace t \text{ is Wed} \rbrace} \Delta y_{t-1} + \epsilon_t,
$$

which can be written as
$$
\Delta y_t = (\phi + \gamma 1_{\lbrace t \text{ is Wed} \rbrace}) \Delta y_{t-1} + \epsilon_t,
$$

but it is unclear to me how to estimate a standard ARIMA-type model in this case.
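That said, the last display is just a linear model with one extra regressor, so the naive route is plain OLS with an interaction term; a minimal sketch, assuming dy is a pandas Series of $\Delta y_t$ indexed by business-day dates:

import pandas as pd
import statsmodels.api as sm

def fit_wed_interaction(dy):
    # OLS for: dy_t = (phi + gamma * 1{t is Wed}) * dy_{t-1} + eps_t
    frame = pd.DataFrame({'dy': dy, 'dy_lag': dy.shift(1)}).dropna()
    frame['wed_x_lag'] = (frame.index.dayofweek == 2) * frame['dy_lag']  # Monday = 0
    return sm.OLS(frame['dy'], frame[['dy_lag', 'wed_x_lag']]).fit()

That recovers $\phi$ and $\gamma$, but it is not an ARIMA-style estimation with the usual error dynamics, which is exactly the part I’m unsure about.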

Are there any known models that deal with this kind of effect? Or am I missing something, and this is routinely modeled in a typical ARIMA + GARCH setup?


Get this bounty!!!


#StackBounty: #regression #time-series #linear Linear regression composition for time series

Bounty: 50

I am working on a time series data research project and have been struggling to find an appropriate approach. I have a response time series $Y$ and several (<10) explanatory time series $X_1, X_2, \ldots$ that are all over the same time frame of the past year.

I want to build a model that uses the $X$’s as features and finds the linear combination ($a_1X_1 + a_2X_2 + \cdots$) with the highest correlation with $Y$, while being mindful of overfitting.

What would be the best approach to this, and what Python tools/kits can I use to help?
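For context, one sketch of the kind of thing I’ve considered (ridge regression to shrink the $a_i$ against overfitting, with time-ordered cross-validation; the data here are synthetic stand-ins):

import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(365, 5))    # stand-in for the stacked X_i series
y = X @ rng.normal(size=5) + rng.normal(size=365)

# TimeSeriesSplit keeps each validation fold strictly after its
# training fold, so the penalty is tuned without look-ahead
model = RidgeCV(alphas=np.logspace(-3, 3, 13), cv=TimeSeriesSplit(n_splits=5))
model.fit(X, y)
print(model.coef_)               # the fitted a_1, ..., a_k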


Get this bounty!!!

#StackBounty: #time-series #multiple-regression Continuous regression with little variation in dependent variable, but much variation i…

Bounty: 50

I have several years’ worth of data for various stores at which we sell products A, B, and C. We’ve sold products A and B for much longer than C; in fact, C is a new product as of this year.

I want to predict/forecast, at various points in the year, how much of each of these products will be sold by the end of the year for each store. For products A and B, since I have such great historical data and there are clear trend and seasonal patterns, using a more traditional forecasting approach (e.g. ARIMA or exponential smoothing) will very likely be my best bet.

The issue is with forecasting sales for product C (the new product). I have only a few months’ worth of historical data – not enough (AFAIK) for a traditional time series approach. So my thought was this: using monthly records for products A and B over the past several years, build a multiple regression model that predicts end-of-year sales for each store-product combination. Based on the limited data I’ve collected so far and domain knowledge, I expect product C to follow a similar distribution and share similar time series properties with products A and B.
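Concretely, the pooled model I have in mind is something like this sketch (column names follow the table below; the month feature and the exact formula are my own guesses at a reasonable setup):

import statsmodels.formula.api as smf

# df: one row per store-product-month, with the columns shown below;
# a `month` column (1-12) is assumed for encoding seasonality
train = df[df['product'].isin(['a', 'b'])]
fit = smf.ols('end_of_year_sales ~ current_sales * C(month)', data=train).fit()

# apply the A/B-trained model to the new product's months to date
new_product = df[df['product'] == 'c'].copy()
new_product['pred_eoy'] = fit.predict(new_product)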

So my dataset looks like this:

[Table: monthly rows at the store-product level, with columns including store, product, current_sales, and end_of_year_sales]

Each row contains a monthly record at the store-product level. Let’s look at store = 'x' and product = 'a'. We see variation in a potential predictor, current_sales, but there’s obviously no variation in the dependent variable, end_of_year_sales.

Is this problematic, from a continuous linear regression standpoint? A scatterplot of end_of_year_sales against current_sales looks like this:

[Scatterplot of end_of_year_sales against current_sales]

There seems to be somewhat of a linear trend here, but I still find it weird that, in this example, any point on the y axis is just one store-product’s end-of-year sales value, varied slightly by each month’s current_sales.

I guess I’m just looking for a sanity check. Is there any inherent issue with modeling these data in such a way? Is there perhaps a better way of approaching this end-of-year sales forecast problem for completely new products? The linear trend seems obvious, but I’m worried I’m missing something: such little variation in the dependent variable seems odd when faced with potentially much more variation in the independent variables. Perhaps this approach is fine only because I have multiple store-products; if I had only one, there would be variation only along the x-axis and none at all along the y (obviously).

I also don’t believe a next-month forecast would be very useful, although I have thought about something like predicting sales six months out until June, and then predicting end-of-year sales from there. But that also seems to be over-complicating things.


Get this bounty!!!