#StackBounty: #time-series #clustering #seasonality #hierarchical-clustering Definition and Taxonomy of Seasonal Time Series

Bounty: 50

I want to

  1. categorize a large number of time series into non-seasonal and seasonal
  2. divide the seasonal ones into a small number of subgroups by type of seasonality

Are there any formal definitions/taxonomies of seasonality out there?

Or is this an "I know it when I see it" kind of phenomenon (to paraphrase Justice Potter Stewart)?

I don’t want to reinvent the wheel here, so I am curious if there is existing wisdom on how to do this well.

Here are a couple of off-the-cuff ideas:

  • A simple concentration-index definition could be the sum of the
    squared shares of the total for each time unit: $$\sum_{t=1}^{T}
    \left(\frac{y_t}{\sum_{t=1}^{T}y_t} \right)^2 $$

    When that sum exceeds some threshold, a series would be considered
    seasonal.

  • A more complicated approach would be to decompose a time series into
    trend, seasonal, cyclical, and idiosyncratic components and calculate
    the fraction of total variation due to the seasonal part. A series
    would be seasonal if that fraction exceeds some threshold (a rough
    sketch of this idea follows after this list).
  • The next step would be to cluster the shares or the seasonal components into groups that are similar.
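Here is a minimal, hedged sketch of the decomposition idea above, using stl() and a "strength of seasonality" ratio; the 0.5 cut-off and the monthly frequency are illustrative assumptions, not part of the original question.

## Seasonal strength: share of (seasonal + remainder) variation explained
## by the seasonal component of an STL decomposition.
seasonal_strength <- function(x) {
  fit  <- stl(x, s.window = "periodic")
  seas <- fit$time.series[, "seasonal"]
  rem  <- fit$time.series[, "remainder"]
  max(0, 1 - var(rem) / var(seas + rem))
}

seasonal_strength(AirPassengers)                     # near 1: strongly seasonal
seasonal_strength(ts(rnorm(144), frequency = 12))    # near 0: not seasonal
## Classify as "seasonal" if the strength exceeds, say, 0.5; then cluster the
## per-series seasonal components (or monthly shares) with hclust() for step 2.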


Get this bounty!!!

#StackBounty: #time-series #stationarity #heteroscedasticity Implications of fitting an ARIMA model with constant variance to a process…

Bounty: 50

Hamilton (page 657) in Time Series Analysis warns that a variance that changes over time has implications for the validity and efficiency of statistical inference about the parameters in an AR(p) model (and by extension in an ARMA or ARIMA model).

[Image: excerpt from Hamilton, Time Series Analysis, p. 657, on the consequences of time-varying variance.]

Here are 3 articles that compare ARCH models with ARIMA models. As far as I can tell, each of these articles fits an ARIMA model (with constant variance) to series that come from processes with clearly non-constant variance, i.e. processes whose variance exhibits volatility clusters.

I will point out that each of these articles also fits an ARCH/GARCH-type model, but I question whether fitting an ARIMA-type model (with constant variance) makes any sense in these cases. Curiously, article 3 even concludes that the ARMA model (with constant variance) does a better job of predicting than the GARCH model.

What are the implications of estimating a process with non-constant variance with an ARIMA model that assumes constant variance, for the validity and efficiency of statistical inference about its parameters? Do those implications differ when the variance changes linearly versus when it exhibits volatility clusters? (See the plot below.)

Routinely you see people concluding that rejecting the null hypothesis in a unit root test (e.g. ADF) amounts to concluding that a series is stationary (rather than merely I(0)), once again gliding over the possibility of non-constant variance and of its implications. Neither of the series pictured below has a unit root, and yet both have clearly non-constant variance, as did the series in each of the 3 cited articles.

[Figure: the two simulated series from the code below, each followed by the ACF and PACF of the ADF test-regression residuals.]

## Requires the urca and forecast packages.
par(mfrow=c(2,3))
########################################################################
## Series 1: innovation standard deviation grows linearly with t
set.seed(400)
y<-rep(NA,100)
for (i in 1:100) {
  y[i]<-rnorm(1,mean=0,sd=i)
}
plot(y,type="p",main="1: variance increases linearly",col="red",lwd=2);abline(h=0)

## ADF test (no drift, no trend) and ACF/PACF of the test-regression residuals
u<-urca::ur.df(y=y, type = "none",lags=12)
summary(u)
forecast::Acf(u@res,lag.max=70,type="correlation",main="ACF",xlab="")
forecast::Acf(u@res,lag.max=70,type="partial",main="PACF",xlab="")
########################################################################
## Series 2: standard deviation first shrinks, then grows nonlinearly
y<-rep(NA,100)
for (i in 1:100) {
  if (i<50) y[i]<-rnorm(1,mean=0,sd=2/i^1.08)
  else y[i]<-rnorm(1,mean=0,sd=i^2/30000)
}
plot(y,type="p",main="2: variance does not increase linearly",col="blue",lwd=2);abline(h=0)

u<-urca::ur.df(y=y, type = "none",lags=12)
summary(u)
forecast::Acf(u@res,lag.max=70,type="correlation",main="ACF",xlab="")
forecast::Acf(u@res,lag.max=70,type="partial",main="PACF",xlab="")
########################################################################
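As a hedged follow-up sketch (not part of the original post): one way to see what a constant-variance ARIMA fit misses on series like the ones above is to fit it anyway and then test the squared residuals for remaining ARCH-type structure with a Ljung-Box test.

library(forecast)

set.seed(400)
y <- rnorm(100, mean = 0, sd = 1:100)          # series 1 again: sd grows linearly

fit <- auto.arima(y)                           # assumes a constant innovation variance
res <- residuals(fit)

Box.test(res,   lag = 12, type = "Ljung-Box")  # leftover autocorrelation in the levels
Box.test(res^2, lag = 12, type = "Ljung-Box")  # autocorrelation in the squares => ARCH-type effects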


Get this bounty!!!

#StackBounty: #time-series #arima #lags Lag operator; particular solution to ARMA as a MA$(\infty)$ process

Bounty: 50

Let’s take an AR(1) model. I am comfortable with the fact that $Ly_t$ means that the lag operator operates on the process $\{y_t\}$ by lagging it by one period. What I am a bit less comfortable with is writing a lag operator by itself, without a process to operate on. In the denominator we have $1-a_1L$. It isn’t immediately clear which process the lag operator is operating on, but upon a moment’s reflection, I guess we realize we can multiply both sides by the denominator and figure out that the lag is operating on $y_t$?

$$y_t=a_0+a_1Ly_t+\epsilon_t$$

$$y_t=\frac{a_0+\epsilon_t}{1-a_1L}$$
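As a hedged aside (standard textbook algebra, not from the original question): when $|a_1|<1$ the denominator is shorthand for a geometric series in the operator $L$,

$$\frac{1}{1-a_1L}=\sum_{j=0}^{\infty}a_1^{j}L^{j},$$

so that

$$y_t=\sum_{j=0}^{\infty}a_1^{j}L^{j}\left(a_0+\epsilon_t\right)=\frac{a_0}{1-a_1}+\sum_{j=0}^{\infty}a_1^{j}\epsilon_{t-j},$$

which is exactly an MA$(\infty)$ representation of the AR(1); note that $L^j$ applied to the constant $a_0$ simply returns $a_0$.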

Now let’s switch to the ARMA(p,q) model.

$$y_t=a_0+\sum_{i=1}^{p}a_iy_{t-i}+\sum_{i=0}^{q}\beta_i\epsilon_{t-i}$$
$$y_t-\sum_{i=1}^{p}a_iy_{t-i}=a_0+\sum_{i=0}^{q}\beta_i\epsilon_{t-i}$$
$$\left(1-\sum_{i=1}^{p}a_iL^{i}\right)y_t=a_0+\sum_{i=0}^{q}\beta_i\epsilon_{t-i}$$

Because I am struggling to wrap my mind around the lag operator, I cannot quite convince myself of what Enders says (page 51): "The important point to recognize is that the expansion (of the equation below) will yield an $MA(\infty)$ process". I am looking for someone to help me see that the expansion of the $y_t$ expression below will yield an $MA(\infty)$ process, and to help me understand what it means for a lag operator to be written by itself, without something to operate on. What is in the denominator of $y_t$? How do I think about it?

$$y_t=\frac{a_0+\sum_{i=0}^{q}\beta_i\epsilon_{t-i}}{1-\sum_{i=1}^{p}a_iL^{i}}$$


Get this bounty!!!

#StackBounty: #r #time-series #hypothesis-testing #statistical-significance #granger-causality How to better use and interpret granger …

Bounty: 50

I have the following code and I want to show the connection of two different factors with a specific one. I want to use grangertest in R and I have the following questions:

  1. how can I interpret the results based on different levels of significance?
  2. how can I interpret non-significant results?
  3. is there a way to visualise the results?
library(lmtest)   # provides grangertest()

my_example <- data.frame(matrix(ncol = 3, nrow = 10))
my_example$X1 <- c(0.8619616, 1.1818621, 0.5530410, 0.6255634,
       0.9971764, 1.3464298, 2.0889985, 1.5303893, 2.9503790,
       2.9244321)
my_example$X2 <- c(-5.7333332, -4.7000000, -7.7000000,
     -2.5000000,  1.5666667,  0.2666667, -2.7000000, -6.2000000,
      0.2333333,  0.5333333)
my_example$X3 <- c(0.2200000, 0.3625000, 0.2100000, 0.3750000,
      0.4966667, 0.4133333, 0.3800000, 0.2133333, 0.3733333,
      0.4400000)

grangertest(X1 ~ X2, order = 2, data = my_example)

grangertest(X1 ~ X3, order = 2, data = my_example)
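As a hedged sketch (assuming grangertest() comes from the lmtest package, as in the code above): the test object is an ANOVA-style table from which the $p$-value can be pulled out and compared to whatever significance level is chosen.

gt12 <- grangertest(X1 ~ X2, order = 2, data = my_example)
gt13 <- grangertest(X1 ~ X3, order = 2, data = my_example)

p12 <- gt12$`Pr(>F)`[2]   # p-value of the F test that lags of X2 add no predictive power
p13 <- gt13$`Pr(>F)`[2]

## Below the chosen level (0.01 / 0.05 / 0.10): reject "X2 does not Granger-cause X1".
## Above it: no evidence that the extra lags help; this is not proof of no effect.
c(X2 = p12, X3 = p13)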


Get this bounty!!!

#StackBounty: #time-series #neural-networks #forecasting #predictive-models #keras what does it mean when keras+tensorflow predicts all…

Bounty: 100

From what I understand, in supervised learning problems there is a dependent variable Y, which I included in my ANN. There is one matching prediction for each sample of Y; the number of predictions should match the number of true values given.

The problem I’m having is that after using model.predict() in Keras, the ANN is giving me the Y dependent variable plus the 10 timesteps of the Y variable that I gave for the predictors (I think).

My training dataset includes 10 timesteps for each variable. I assumed that I could use timesteps to insert lagged versions of each predictor variable.

Basically I don’t understand what these 10 predicted timesteps for the Y variable are. They are not lagged versions of the predicted Y at time t.

The reason I’m asking is that I don’t know if the global score of the model should really include the predicted timesteps of Y. Should I ignore them or include them?

Also, in terms of prediction, which values do I use? Just the ones at time t?

Is Y(t-1) the predicted value at timestep (t-1), based on all the predictors, in the same way that Y is the prediction at time t?
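A hedged sketch (not the poster's code, and using the keras R package rather than Python): one common reason predict() returns a value per timestep is return_sequences = TRUE on the last recurrent layer; with return_sequences = FALSE the model emits a single Y per sample. The layer sizes and n_features below are made-up assumptions.

library(keras)

n_timesteps <- 10
n_features  <- 5    # hypothetical number of predictors

model <- keras_model_sequential() %>%
  layer_lstm(units = 32, input_shape = c(n_timesteps, n_features),
             return_sequences = FALSE) %>%   # one output vector per sample, not per timestep
  layer_dense(units = 1)                     # a single predicted Y at time t

model %>% compile(loss = "mse", optimizer = "adam")
## predict() on input of shape (n_samples, 10, 5) now returns n_samples x 1 values.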


Get this bounty!!!

#StackBounty: #time-series #probability #hypothesis-testing #statistical-significance #missing-data (Sudden deafness ended?) How can I …

Bounty: 50

I think I found a major (conclusion-flipping) statistical error in a paper in an AMA journal. Did I?

If I messed up, I’d like to know how; I hope someone can point me in the direction of my errors. If I messed up, I must have made at least two major errors, as I came to the same conclusion in two independent ways. I communicated with the journal editor and corresponding author.

Here, you can find the paper and the correspondence. I reproduce it below.

To try to make this question fully self-contained, I’ll summarize the issue.

The authors calculate the background rate of sudden sensorineural hearing loss (SSNHL) per year and compare it to the rate of SSNHL over a three-week post-intervention period, and graph "Estimated incidence of SSNHL, per 100 000 per y". Their conclusion is that the data indicate the intervention does not increase the incidence of SSNHL; a substantial and significant reduction is indicated.
They state that "We then estimated the incidence of SSNHL that occurred after vaccination on an annualized basis." But this cannot be what they calculated. It is incompatible with what they report is the data their research yielded.

  1. It’s an error to limit the possible adverse side effect window to 3 weeks post-vaccination (excluding adverse events outside that window) but then spread the remaining adverse events over a year to calculate risk on an annualized basis. It’s unjustifiable. A reasonable start for comparison would be to compare risk over the 3 weeks to the annual (52-week) risk, scaled to a 3-week period (see the small arithmetic sketch below). So the correct finding, based on their research, appears to be no difference in the 3-week risk of SSNHL between groups (0.6-4.4 vs 0.3-4.1, n.s.).
  2. Their conclusion implies that the authors have discovered that the intervention reduced SSNHL by about 94%, which would be a groundbreaking discovery if confirmed, and there’s no plausible mechanism presented for such a miraculous treatment effect; this is more evidence of grave error. It does not pass this basic plausibility test.
  3. As I finished writing this up, I found further concerns, which I’ll put in an answer. I put them in chat (https://chat.stackexchange.com/rooms/18/ten-fold) because they are preliminary.

[end summary]
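For concreteness, here is a small arithmetic sketch of the scaling argument in point 1. The 11 to 77 per 100,000 per year background range is my assumption (it reproduces the 0.6 to 4.4 figures quoted above when scaled to a 3-week window).

annual_background <- c(low = 11, high = 77)   # per 100,000 per year (assumed)
annual_background * 3 / 52                    # roughly 0.6 to 4.4 per 100,000 per 3 weeks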

Again, here, you can find the paper and the correspondence. I reproduce it immediately below.

I wrote:

Sirs:

I write in respect to *.

This study should be withdrawn. It’s an error to limit the possible
adverse side effect window to 3 weeks post-vaccination (excluding
adverse events outside that window) but then spread the remaining
adverse events over a year to calculate risk. It’s unjustifiable. A
reasonable start for comparison would be to compare risk over the 3
weeks to the annual (52-week) risk, scaled to 3 week period. So the
correct finding appears to be that risk in a 3 week period of SSNHL,
whether vaccinated or unvaccinated, is the same (0.6-4.4 vs 0.3-4.1,
n.s.). A closer look at adverse events within shorter periods after
vaccination would be an appropriate topic for further research.
Another way to see this error is to consider whether the original
results pass a basic plausibility test. They do not. If the results
shown in the figure accurately reflected Incidence Range / 100k of
SSNHL between the vaccinated and unvaccinated, then it would suggest
that the authors had discovered that vaccination reduced SSNHL by
about 94%. Which would be a groundbreaking discovery, and a there’s no
plausible mechanism presented for such a miraculous treatment effect,
this is more evidence of grave error.

*Formeister EJ, et. al. JAMA Otolaryngol Head Neck Surg. 2021;147(7):674–676. doi:10.1001/jamaoto.2021.0869

-Matthew

Assuming I have, this won’t be my first time spotting a major error in a peer-reviewed publication. (I think the first was in the article High-fructose corn syrup causes characteristics of obesity in rats: Increase body weight, body fat and triglyceride levels (2010), back in 2010. This HFCS, Bocarsly, Princeton paper was wildly popular in the lay press.)

Yet I received this non-response response (emphasis mine):

Matthew,

Thank you very much for your recent communication about the
paper w published in JAMA Otolaryngology ("Preliminary Analysis of
Association Between COVID-19 Vaccination and Sudden Hearing Loss Using
VAERS"). As a peer reviewed publication this manuscript was vetted in
a process that includes assessment and validation of hypotheses,
methodologies, and conclusions. Readers and scientists can have faith
in the integrity of these robust processes. We look forward to seeing
this important field expand and would encourage all interested
scientists to consider peer reviewed publication of their work in the
field.
We would encourage a thoughtful re-read of the manuscript to
understand the methodology, and additional reading on the topics of
idiopathic sudden sensorineural hearing loss and principles of
epidemiology, for your understanding. Respectfully, Dr. Eric
Formeister, MD, MS on behalf of the authors.


If I messed up, I’d like to know how; I hope someone can point me in the direction of my error(s).


Get this bounty!!!

#StackBounty: #time-series #forecasting #p-value #model-evaluation #diagnostic Result of a diagnostic test of a predictive model lookin…

Bounty: 50

I have created a predictive model that outputs a predictive density. I used 1000 rolling windows to estimate the model and predict one step ahead in each window. I collected the 1000 predictions and compared them to the actual realizations. I used several diagnostic tests, among them Kolmogorov-Smirnov. I saved the $p$-value of the test.

I did the same for multiple time series. Then I looked at all of the $p$-values from the different series. I found that they are 0.440, 0.579, 0.848, 0.476, 0.753, 0.955, 0.919, 0.498, 0.997. At first I was quite happy that they are much larger than 0.010, 0.050 or 0.100 (to use the standard cut-off values). But then a colleague of mine pointed out that the $p$-values should be distributed as $\text{Uniform}[0,1]$ under the null of correct predictive distribution, and so I should perhaps not be so happy.

On the one hand, the colleague must be right; the $p$-values should ideally be uniformly distributed. On the other hand, I have found that my model predicts "better" than the true model normally would; the discrepancy between the predicted density and the realized density is less than one would normally expect between the true density and the realized density. This could be an indication of overfitting if I were evaluating my model in-sample, but the model has been evaluated out of sample. What does this tell me? Should I be concerned with a diagnostic test’s $p$-values being too high?

You could say this is just a small set of $p$-values (just nine of them) so anything could happen, and you might be right. However, suppose I have a larger set of $p$-values that are closer to 1 than uniformly distributed; is that a problem? What does that tell me?
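A quick, hedged way to check the colleague's point in R is to test the reported $p$-values directly against a $\text{Uniform}[0,1]$ reference; with only nine values the test has little power, but the clustering near 1 is visible.

pvals <- c(0.440, 0.579, 0.848, 0.476, 0.753, 0.955, 0.919, 0.498, 0.997)

ks.test(pvals, "punif")   # a small p-value here would indicate the p-values are not uniform
hist(pvals, breaks = seq(0, 1, 0.1), main = "Diagnostic p-values", xlab = "p-value")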


Get this bounty!!!

#StackBounty: #regression #time-series #lstm #rnn #data-preprocessing How do I prepare clinical data for multivariate time series analy…

Bounty: 50

I am trying to predict the progression of disease using certain clinical data (time series data) and covariates (such as age, sex, race etc.). I am aware of the existence of mainstream machine learning and deep learning models for such prediction tasks but since clinical data are longitudinal in nature I want to leverage this and use LSTMs or RNNs (if possible) for predictions.
I have a longitudinal dataset describing disease progression for hundreds of patients, each with multiple visits (~10-20) at different points in time and with some conclusion about the disease at each time step.

My point of confusion is how to prepare this dataset for an LSTM model, since most of the literature I’ve read on this topic shows data preparation only for a single patient. I want to understand how my model will be affected if I

  1. Ignore the "multiple patients" structure and arrange all the data based only on time (date and time of visit).
  2. Arrange the data based on patient ID first and then on the date and time of each visit (a nested arrangement, if I am being clear).

Thank you.
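A hedged sketch of the patient-first arrangement (option 2): pad each patient's visits to a common length and stack them into a (patients x timesteps x features) array, which is the input shape an LSTM expects. The data frame visits, with columns patient_id, visit_date, and the feature columns, is hypothetical and stands in for the real dataset.

prepare_sequences <- function(visits, feature_cols, max_len = 20) {
  ids <- unique(visits$patient_id)
  x <- array(0, dim = c(length(ids), max_len, length(feature_cols)))
  for (k in seq_along(ids)) {
    v <- visits[visits$patient_id == ids[k], , drop = FALSE]
    v <- v[order(v$visit_date), , drop = FALSE]   # time-order within each patient
    v <- tail(v, max_len)                         # keep at most the last max_len visits
    n <- nrow(v)
    # pre-pad with zeros so every patient contributes exactly max_len timesteps
    x[k, (max_len - n + 1):max_len, ] <- as.matrix(v[, feature_cols])
  }
  x
}

Option 1 (pooling all visits by date and ignoring the patient ID) would splice different patients' trajectories into one long sequence, so the recurrent state would carry over between unrelated patients; the per-patient arrangement above avoids that.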


Get this bounty!!!

#StackBounty: #time-series #arima Difference in results between the forecast function in R, and manually calculating the predicted valu…

Bounty: 50

I have a series on which I fitted an ARIMA(4,0,4) model in R, and got the following estimations:

Coefficients:
      ar1     ar2      ar3      ar4     ma1     ma2     ma3     ma4  intercept
  -0.6498  0.0106  -0.7527  -0.8753  0.6727  0.0079  0.7486  0.8924     -1e-04
s.e.   0.0341  0.0274   0.0211   0.0530  0.0283  0.0275  0.0225  0.0497      1e-04

I then used the forecast library to get the next predicted value and got the following result

> forecast(ftfinal.arima, h=1)
 Point Forecast        Lo 80       Hi 80     Lo 95      Hi 95
3606   9.475018e-06 -0.007864678 0.007883628 -0.012033 0.01205195

This forecast result is different than the result I’m getting when I try to manually input the numbers into the ARIMA function, and I know that it’s because there’s something that I’m doing wrong but I don’t really understand what it is.

let the ARIMA(4,0,4) function be:

$$X_t = c + \sum_{i=1}^{p}\phi_i X_{t-i} + \epsilon_t + \sum_{i=1}^{q}\theta_i\epsilon_{t-i}$$

where p and q both equal 4.

and the most recent values of Xt are:

[3601]  1.502706e-03 -7.868107e-03  2.512803e-03  9.639389e-03  3.102150e-03

First of all, for the AR part of the function, is the constant $c$ the same as the "intercept" value that is output by the ARIMA model?
Secondly, is the $\epsilon_t$ series calculated as $X_t$ minus the expectation of the whole series?
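A hedged sketch of the manual calculation (fit and x stand in for ftfinal.arima and the original series): in R's arima()/Arima() output the "intercept" is the process mean $\mu$, not the constant $c$, and the MA part uses the model's fitted residuals rather than deviations of $X_t$ from the sample mean. Writing the one-step forecast in deviations from the mean should reproduce forecast().

library(forecast)

cf  <- coef(fit)
phi <- cf[paste0("ar", 1:4)]
th  <- cf[paste0("ma", 1:4)]
mu  <- cf["intercept"]

n <- length(x)
e <- residuals(fit)

# One-step-ahead forecast for an ARMA(4,4) written in deviations from the mean
xhat <- mu + sum(phi * (x[n:(n - 3)] - mu)) + sum(th * e[n:(n - 3)])
xhat                                # should match forecast(fit, h = 1)$mean[1]

Equivalently, the constant $c$ in the textbook form is $c = \mu\,(1 - \sum_i \phi_i)$, which is why the reported "intercept" of about -1e-04 is not $c$ itself.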


Get this bounty!!!

#StackBounty: #r #time-series #neural-networks #forecasting #keras Forecast when the time series is not sequential?

Bounty: 50

I have multivariate time series data consisting of monthly sales of contraceptives at various delivery sites in a certain country, between January 2016 and June 2019. The data looks as follows:

[Image: a sample of the monthly sales data.]

The task at hand is to predict the average monthly sales (stock_distributed) for July, August and September of 2019 (the month row). However, the data do not form a sequential multivariate time series, and the predicted results should fit in this table:

[Image: the table of explanatory-variable combinations to be filled with the predicted values.]

As you can see the predictions are based on combinations of different explanatory variables. My question is: what is the most appropriate deep learning method that would allow me to predict the monthly sales as combinations of the four explanatory variables?


Get this bounty!!!