#StackBounty: #regression #time-series #lstm #rnn #data-preprocessing How do I prepare clinical data for multivariate time series analy…

Bounty: 50

I am trying to predict the progression of a disease using certain clinical data (time series data) and covariates (such as age, sex, race, etc.). I am aware of mainstream machine learning and deep learning models for such prediction tasks, but since clinical data are longitudinal in nature, I want to leverage this and use LSTMs or RNNs (if possible) for prediction.
I have a longitudinal dataset which describes disease progression for multiple patients (hundreds of patients), each with multiple visits (~10-20) at different points in time, with some conclusion about the disease at each time step. My point of confusion is how to prepare this dataset for an LSTM model, since most of the literature I've read on this topic shows data preparation for only one patient. I want to understand how my model will be affected if I (see the sketch after this list):

  1. Ignore the "multiple patients" structure and arrange all the data based only on time (date and time of visit).
  2. Arrange the data by patient ID first and then by the date and time of visit for each patient (a nested arrangement, if I am being clear).
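
For reference, here is a minimal sketch of option 2 with synthetic data (column names such as patient_id and lab_value are hypothetical stand-ins for the real clinical fields): each patient becomes one time-ordered sequence, and the sequences are padded to a common length with a mask, which yields the (n_patients, max_visits, n_features) batch layout an LSTM expects.

import numpy as np
import pandas as pd

# Synthetic stand-in for the longitudinal table.
rng = np.random.default_rng(0)
rows = []
for pid in range(5):                              # 5 patients
    for visit in range(rng.integers(3, 7)):       # 3-6 visits each
        rows.append({"patient_id": pid,
                     "visit_time": visit,
                     "lab_value": rng.normal(),
                     "age": 50 + pid,              # static covariate, repeated per visit
                     "outcome": int(rng.integers(0, 2))})
df = pd.DataFrame(rows)

# Option 2: one sequence per patient, ordered by visit time, padded to
# the longest sequence so the batch has shape (n_patients, max_visits, n_features).
feature_cols = ["lab_value", "age"]
seqs = [g.sort_values("visit_time")[feature_cols].to_numpy()
        for _, g in df.groupby("patient_id")]

max_len = max(len(s) for s in seqs)
X = np.zeros((len(seqs), max_len, len(feature_cols)))
mask = np.zeros((len(seqs), max_len), dtype=bool)
for i, s in enumerate(seqs):
    X[i, :len(s)] = s
    mask[i, :len(s)] = True    # marks real visits; padding is masked out

print(X.shape, mask.sum(axis=1))   # e.g. (5, 6, 2) and the visit count per patient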

Thank you.


Get this bounty!!!

#StackBounty: #time-series #arima Difference in results between the forecast function in R, and manually calculating the predicted valu…

Bounty: 50

I have a series to which I fitted an ARIMA(4,0,4) model in R, and got the following estimates:

Coefficients:
          ar1     ar2      ar3      ar4     ma1     ma2     ma3     ma4  intercept
      -0.6498  0.0106  -0.7527  -0.8753  0.6727  0.0079  0.7486  0.8924     -1e-04
s.e.   0.0341  0.0274   0.0211   0.0530  0.0283  0.0275  0.0225  0.0497      1e-04

I then used the forecast package to get the next predicted value and got the following result:

> forecast(ftfinal.arima, h=1)
 Point Forecast        Lo 80       Hi 80     Lo 95      Hi 95
3606   9.475018e-06 -0.007864678 0.007883628 -0.012033 0.01205195

This forecast result is different from the result I get when I manually plug the numbers into the ARIMA equation. I know it's because I am doing something wrong, but I don't really understand what it is.

Let the ARIMA(4,0,4) model be:

$$X_t = c + \sum_{i=1}^{p} \varphi_i X_{t-i} + \varepsilon_t + \sum_{i=1}^{q} \theta_i \varepsilon_{t-i}$$

where $p$ and $q$ both equal 4,

and the most recent values of $X_t$ are:

[3601]  1.502706e-03 -7.868107e-03  2.512803e-03  9.639389e-03  3.102150e-03

First of all, for the AR part of the model: is the constant $c$ the same as the "intercept" value that is output by the ARIMA model?
Secondly, is the $\varepsilon_t$ series calculated as $X_t$ minus the expectation of the whole series?
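
For reference, here is a minimal sketch (written in Python so it is self-contained) of how the one-step forecast is usually assembled from these coefficients, under the common convention that R's "intercept" is actually the series mean $\mu$, so the recursion runs on the demeaned series. The residuals below are placeholders; you would take them from the fitted model (in R: tail(residuals(ftfinal.arima), 4)):

import numpy as np

# Coefficients copied from the R output above.
ar = np.array([-0.6498, 0.0106, -0.7527, -0.8753])   # phi_1..phi_4
ma = np.array([0.6727, 0.0079, 0.7486, 0.8924])      # theta_1..theta_4
mu = -1e-04                                          # R's "intercept" = series mean

# X_3601..X_3605, taken from the post.
x = np.array([1.502706e-03, -7.868107e-03, 2.512803e-03,
              9.639389e-03, 3.102150e-03])
eps = np.zeros(4)   # placeholder for the last four residuals, newest first

# One-step forecast:
# X_{t+1} = mu + sum_i phi_i * (X_{t+1-i} - mu) + sum_i theta_i * eps_{t+1-i}
x_dev = x[::-1][:4] - mu     # newest first: X_t, X_{t-1}, X_{t-2}, X_{t-3}
print(mu + ar @ x_dev + ma @ eps)

If this matches forecast()'s point forecast once the real residuals are plugged in, the discrepancy was the intercept convention: the constant is $c = \mu(1 - \sum_i \varphi_i)$, not the reported "intercept" itself.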


Get this bounty!!!

#StackBounty: #r #time-series #neural-networks #forecasting #keras Forecast when the time series is not sequential?

Bounty: 50

I have multivariate time series data consisting of monthly sales of contraceptives at various delivery sites in a certain country between January 2016 and June 2019. The data look as follows:

[figure omitted: sample of the monthly sales data table]

The task at hand is to predict the average monthly sales (stock_distributed) for July, August, and September (row month) of 2019. However, the data are not multivariate time series data in the usual sense (they are not sequential), and the predicted results should fit into this table:

[figure omitted: the prediction table to be filled in]

As you can see, the predictions are based on combinations of different explanatory variables. My question is: what is the most appropriate deep learning method that would allow me to predict the monthly sales as a function of the four explanatory variables?
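
One common framing (an editorial sketch, not from the post; the column names are hypothetical): treat each combination of the explanatory variables as one tabular row, one-hot encode the categorical variables, and fit a feed-forward network as a regressor, then query it for the July-September 2019 combinations. For example, with scikit-learn:

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the real table.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "site_code":    rng.choice(["S1", "S2", "S3"], n),
    "product_code": rng.choice(["P1", "P2"], n),
    "month":        rng.integers(1, 13, n),
    "year":         rng.choice([2016, 2017, 2018, 2019], n),
    "stock_distributed": rng.poisson(50, n).astype(float),
})

pre = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"),
     ["site_code", "product_code", "month", "year"]),
])
model = Pipeline([("pre", pre),
                  ("mlp", MLPRegressor(hidden_layer_sizes=(32, 16),
                                       max_iter=2000, random_state=0))])
model.fit(df.drop(columns="stock_distributed"), df["stock_distributed"])

# Predict July-September 2019 for one site/product combination.
query = pd.DataFrame({"site_code": ["S1"] * 3,
                      "product_code": ["P1"] * 3,
                      "month": [7, 8, 9],
                      "year": [2019] * 3})
print(model.predict(query))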


Get this bounty!!!

#StackBounty: #machine-learning #time-series #neural-networks #perceptron #volatility Volatility forecasting using MLP

Bounty: 100

I am currently working on a project which aims to predict the monthly volatility of the S&P 500 index with the aid of Multilayer Perceptrons (MLPs). Actually, I am trying to reproduce some of the results shown here: https://beta.vu.nl/nl/Images/werkstuk-ladokhin_tcm235-91388.pdf

I am also trying to use the same network architectures the author used, including the number of input nodes, etc.

The above document states that the data were divided as follows:

  1. Training set: December 1978 to October 2000 (263 observations)
  2. Testing set: November 2000 to November 2008 (97 observations)

However, I am unsure what the proper training and testing procedure should be, and that is why I am here.

How would I do it?

  1. Compute the monthly volatility for each month from December 1978 until November 2008.
  2. Having computed those values, I would separate the monthly volatility values into training values and testing values.
  3. Now I would aim to create the target arrays, which carry the correct numerical answer for each month. However, I am not sure how I would do it. Since my network contains one output node, I would probably say my target array looks like target label = [observed_volatility] for each month of the training set (see the sketch after this list). Is that correct?
  4. Having trained the network, I would compare the generated output values against the observed volatility for the corresponding month in the testing set.
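
For reference, a minimal sketch of one common construction, assuming the MLP inputs are the k previous monthly volatilities (an assumption; the paper's exact input set may differ): the target for each month is simply that month's observed volatility, and the split is chronological, mirroring the paper's division.

import numpy as np

rng = np.random.default_rng(0)
monthly_vol = np.abs(rng.normal(0.04, 0.01, 360))   # toy stand-in, Dec 1978 - Nov 2008

k = 12   # number of lagged months fed to the MLP (an assumption)
X = np.array([monthly_vol[i:i + k] for i in range(len(monthly_vol) - k)])
y = monthly_vol[k:]                  # target: the next month's observed volatility

split = 263                          # chronological split as in the paper
X_train, y_train = X[:split], y[:split]
X_test,  y_test  = X[split:], y[split:]
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)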

Am I on the right track? My biggest issue concerns how the target arrays should be constructed and how proper training and testing should take place in this case.

Quick observation: I don't know whether the term "target array" is universal. By it, I mean the array which contains the correct answer for a given input, which in my case is the volatility of a certain month.

Thanks in advance, Lucas


Get this bounty!!!

#StackBounty: #machine-learning #time-series #classification #predictive-models Predicting time-to-failure using multiple failure datas…

Bounty: 50

I have a dataset from a machine covering the past year. The dataset consists of timestamps, various sensor readings, and machine failures. Different sensors have different recording intervals (some record every 5 minutes, some every 30 minutes), so the timestamps differ across sensors (temperature, humidity, vibration, etc.). A sample of the dataset structure (sensors with their respective timestamps) is shown below:

[figure omitted: sample sensor dataset structure]

There are six failure events during the one-year period, each provided with its timestamp. I want to train a machine learning model on these data and predict future failure occurrences in advance using the streaming sensor data. My situation is very close to this and this question. Which ML model will give me a good prediction result? How can I use the failure events as a target in my ML model?
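
One common way to turn the failure timestamps into a supervised target (an editorial sketch with hypothetical sensor names): resample all sensors onto a common grid, then label each row with the time remaining until the next failure, which a regressor can then learn to predict.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
idx5  = pd.date_range("2020-01-01", periods=1000, freq="5min")
idx30 = pd.date_range("2020-01-01", periods=200,  freq="30min")
temp      = pd.Series(rng.normal(70, 2, len(idx5)),      index=idx5)
vibration = pd.Series(rng.normal(0.2, 0.05, len(idx30)), index=idx30)

# Align the differently sampled sensors on a common 30-minute grid.
frame = pd.DataFrame({
    "temp":      temp.resample("30min").mean(),
    "vibration": vibration.resample("30min").mean(),
}).dropna()

failures = pd.to_datetime(["2020-01-02 06:00", "2020-01-04 12:00"])

# Target: minutes until the next failure after each timestamp.
next_idx = np.searchsorted(failures.values, frame.index.values)
keep = next_idx < len(failures)            # drop rows after the last failure
frame = frame.loc[keep].copy()
ttf = failures[next_idx[keep]] - frame.index
frame["minutes_to_failure"] = ttf.total_seconds() / 60.0
print(frame.head())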


Get this bounty!!!

#StackBounty: #machine-learning #time-series #sequence-to-sequence Multi-step forecasts of factory production data using a Seq2Seq Enco…

Bounty: 100

I am attempting to use a Seq2Seq model to make forecasts of factory production data, using an Encoder-Decoder model augmented with attention. I have become a little stuck: the output of the model seems to be constant and has the same sequence length as the input, whereas I would like to be able to specify, say, a forecast 3 months into the future.

The Target
To my understanding, I want to be predicting the production volume of a given material from this factory into the future. So its dimensionality is $1$, and it is of course an integer.

The Encoder
The encoder takes as input a sequence of length $168$, where each element contains the previous $20$ days' data as well as $37$ factory-level features, such as the number of workers.

The Decoder
This is where I get confused and where I am running into issues with my code. Again, to my understanding, the decoder should take the previous time step's production level as input (meaning dimension $1$), as well as the previous hidden and cell state.

Code

import random

import numpy as np
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, p):
        super(EncoderRNN, self).__init__()
        
        self.lstm = nn.LSTM(input_size, hidden_size,
                            num_layers, dropout = p, 
                            bidirectional = True)

        self.fc_hidden = nn.Linear(hidden_size*2, hidden_size) 
        self.fc_cell = nn.Linear(hidden_size*2, hidden_size)

    def forward(self, input):
        print(f"Encoder input shape is {input.shape}")
        
        encoder_states, (hidden, cell_state) = self.lstm(input)

        print(f"Encoder Hidden: {hidden.shape}")
        print(f"Encoder Cell: {cell_state.shape}")

        # Merge the forward and backward directions (assumes num_layers = 1,
        # so hidden has shape (2, N, hidden_size)) and project to hidden_size.
        hidden = self.fc_hidden(torch.cat((hidden[0:1], hidden[1:2]), dim = 2))
        cell = self.fc_cell(torch.cat((cell_state[0:1], cell_state[1:2]), dim = 2))

        print(f"Encoder Hidden: {hidden.shape}")
        print(f"Encoder Cell: {cell.shape}")
        
        return encoder_states, hidden, cell


class Decoder_LSTMwAttention(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, output_size, p):
        super(Decoder_LSTMwAttention, self).__init__()
       
        self.rnn = nn.LSTM(hidden_size*2 + input_size, hidden_size,
                           num_layers)

        self.energy = nn.Linear(hidden_size * 3, 1)
        self.fc = nn.Linear(hidden_size, output_size)
        self.softmax = nn.Softmax(dim=0)
        self.dropout = nn.Dropout(p)
        self.relu = nn.ReLU()  

        self.attention_combine = nn.Linear(hidden_size, hidden_size)


    def forward(self, input, encoder_states, hidden, cell):


        # x arrives as (N,); reshape to (1, N, 1) = (seq_len, batch, input_size).
        input = input.unsqueeze(0).unsqueeze(2)

        input = self.dropout(input)

        sequence_length = encoder_states.shape[0]
        h_reshaped = hidden.repeat(sequence_length, 1, 1)

        concatenated = torch.cat((h_reshaped, encoder_states), dim = 2)
        print(f"Concatenated size: {concatenated.shape}")

        energy = self.relu(self.energy(concatenated))
        attention = self.softmax(energy)   # (seq_len, N, 1); sums to 1 over seq_len

        # Attention-weighted sum over the sequence dimension:
        # (seq_len, N, 1) x (seq_len, N, hidden*2) -> (1, N, hidden*2).
        # Note: no permutes here; permuting both tensors beforehand would
        # make the einsum sum over the batch dimension instead.
        context_vector = torch.einsum("snk,snl->knl", attention, encoder_states)
        
        rnn_input = torch.cat((context_vector, input), dim = 2)

        # nn.LSTM expects the state as a single (hidden, cell) tuple.
        output, (hidden, cell) = self.rnn(rnn_input, (hidden, cell))

        output = self.fc(output).squeeze(0)
        
        return output, hidden, cell

class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder):
        super(Seq2Seq, self).__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, source, target, teacher_force_ratio=0.5):
        batch_size = source.shape[1]
        target_len = target.shape[0]
        #target_vocab_size = len(english.vocab)

        outputs = torch.zeros(target_len, batch_size).to(device)
        encoder_states, hidden, cell = self.encoder(source)

        # First input: the first target value (playing the role of an <SOS> token)
        x = target[0]

        for t in range(1, target_len):
            # At every time step use encoder_states and update hidden, cell
            output, hidden, cell = self.decoder(x, encoder_states, hidden, cell)

            # Store prediction for current time step
            outputs[t] = output.squeeze(1)

            # For a single-node regression decoder the prediction itself is
            # fed back; argmax only makes sense for classification over a
            # vocabulary and would always return 0 here.
            best_guess = output.squeeze(1)

            # With probability teacher_force_ratio we feed the actual next
            # value; otherwise we feed the value the decoder just predicted,
            # so the network also gets used to its own outputs as inputs
            # (with a ratio of 1, test-time inputs could look completely
            # unfamiliar to it).
            x = target[t] if random.random() < teacher_force_ratio else best_guess

        return outputs
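
On the horizon question: in this architecture the decoder loop, not the encoder, sets the output length, so at inference you can roll the decoder forward for any number of steps by feeding each prediction back in. A hedged sketch using the classes above (the function and its arguments are illustrative):

def forecast(model, source, last_value, horizon=3):
    # Roll the decoder forward `horizon` steps, feeding predictions back in.
    model.eval()
    with torch.no_grad():
        encoder_states, hidden, cell = model.encoder(source)
        x = last_value                        # shape (N,): last observed value
        preds = []
        for _ in range(horizon):
            output, hidden, cell = model.decoder(x, encoder_states, hidden, cell)
            x = output.squeeze(1)             # feed the prediction back in
            preds.append(x)
    return torch.stack(preds)                 # (horizon, N)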

Training Routine

def Seq2seq_trainer(model, optimizer, train_input, train_target,
                  test_input, test_target, criterion, num_epochs):

    train_losses = np.zeros(num_epochs)
    validation_losses = np.zeros(num_epochs)

    for it in range(num_epochs):
        # zero the parameter gradients
        optimizer.zero_grad()

        # Forward pass
        outputs = model(train_input, train_target)  
        loss = criterion(outputs, train_target)

        # Back prop
        loss.backward()

        # Clip to avoid exploding gradient issues
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1)

        # Gradient descent step
        optimizer.step()

        # Save losses
        train_losses[it] = loss.item()

        # Validation loss: no gradients, dropout off, and no teacher forcing
        # so the model must feed back its own predictions.
        model.eval()
        with torch.no_grad():
            test_outputs = model(test_input, test_target, teacher_force_ratio=0.0)
            validation_loss = criterion(test_outputs, test_target)
        model.train()
        validation_losses[it] = validation_loss.item()
            
        if (it + 1) % 25 == 0:
            print(f'Epoch {it+1}/{num_epochs}, Train Loss: {loss.item():.4f}, Validation Loss: {validation_loss.item():.4f}')

    return train_losses, validation_losses

Results I get

The issue seems to be that the decoder predicts a constant value each time and does not pick up on the noise in the data.

[figure omitted: predictions vs. true returns]


Get this bounty!!!

#StackBounty: #r #time-series #autocorrelation #predictor #generalized-least-squares Can time be used as predictor in a GLS with tempor…

Bounty: 50

I am trying to do a GLS with a temporal autocorrelation structure. I am using R to do the analysis. My data looks like this:

Group   Time    Work    Category
DI226   5   1.351351351 ctrl
DI226   10  1.351351351 ctrl
DI226   15  1.351351351 ctrl
DI226   20  1.351351351 ctrl
DI226   25  3.378378378 ctrl
DI226   30  4.72972973  ctrl
DI226   35  4.72972973  ctrl
DI226   40  5.405405405 ctrl
DI226   45  8.783783784 ctrl
DI226   50  8.783783784 ctrl
DI226   55  8.783783784 ctrl
DI226   60  11.48648649 ctrl
DI226   65  11.48648649 ctrl
DI226   70  14.18918919 ctrl
DI226   75  5.405405405 ctrl
DI226   80  1.351351351 ctrl
DI226   85  2.027027027 ctrl
DI226   90  2.702702703 ctrl
DI226   95  0.675675676 ctrl
DI226   100 0.675675676 ctrl

Where Group is a grouping variable (I have 10 such groups);
Time is the percentage of total time in bins of 5%;
Work is the percentage of total work done in the 5% bins; and
Category is the predictor variable with two levels: control and experiment.

I first performed an OLS fit and looked at the ACF and PACF plots of the residuals, and it looks like an AR(2) autocorrelation structure would be suitable. I used the arima() function to find the AR(2) coefficients from the OLS residuals. I am using the gls() function from the nlme package.

My model looks like this:

gls(Work ~ Category * Time, correlation = corARMA(p=2, value=ar2, form = ~Time|Group), method="ML")

My questions are:

  1. Is this model correct? Can I use Time as a predictor variable?
  2. If so, can I use Time both as a predictor and to specify the correlation structure? Is putting Time inside the correlation structure necessary?
  3. How do I select the proper AR order for the model? (One common approach is sketched below.)
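
On question 3, a common approach is to fit candidate AR orders to the OLS residuals and compare an information criterion; in R this can be done with arima() or forecast::auto.arima(). Below is a minimal sketch, written in Python with statsmodels purely so it is self-contained, with random numbers standing in for the residuals:

import numpy as np
from statsmodels.tsa.ar_model import AutoReg

rng = np.random.default_rng(0)
resid = rng.normal(size=200)      # stand-in for the OLS residuals
for p in range(1, 6):
    fit = AutoReg(resid, lags=p).fit()
    print(f"AR({p}): AIC = {fit.aic:.2f}")

The order whose AIC (or BIC) is smallest, cross-checked against the ACF/PACF cut-offs, is a reasonable choice for corARMA's p.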


Get this bounty!!!

#StackBounty: #machine-learning #time-series #neural-networks #categorical-data Types of Models Suitable for Problem with Categorical T…

Bounty: 50

I have a situation with a standard input, but I have to output a set of categorical time series labels. I have had a look at the literature, and there is much written about the reverse situation, i.e. variable-length inputs, handled for example with padding. But I was wondering what to do with variable-length outputs. Is this a situation for a one-to-many RNN? Or are there other options?
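
For reference, a minimal sketch of the one-to-many setup (an editorial illustration; all sizes and names are made up): the fixed-size input initializes the LSTM state, and the cell then emits one categorical distribution per step up to a maximum length. In practice you would also train a stop label so the output length can vary per example.

import torch
import torch.nn as nn

class OneToMany(nn.Module):
    def __init__(self, in_dim, hidden, n_labels, max_steps=10):
        super().__init__()
        self.init_h = nn.Linear(in_dim, hidden)   # input -> initial hidden state
        self.cell = nn.LSTMCell(n_labels, hidden)
        self.out = nn.Linear(hidden, n_labels)
        self.max_steps = max_steps
        self.n_labels = n_labels

    def forward(self, x):                          # x: (batch, in_dim)
        h = torch.tanh(self.init_h(x))
        c = torch.zeros_like(h)
        step_in = torch.zeros(x.size(0), self.n_labels)   # stand-in <start> input
        logits = []
        for _ in range(self.max_steps):
            h, c = self.cell(step_in, (h, c))
            step_logits = self.out(h)
            logits.append(step_logits)
            step_in = torch.softmax(step_logits, dim=1)   # feed prediction back
        return torch.stack(logits, dim=1)          # (batch, steps, n_labels)

model = OneToMany(in_dim=16, hidden=32, n_labels=5)
print(model(torch.randn(4, 16)).shape)             # torch.Size([4, 10, 5])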


Get this bounty!!!

#StackBounty: #time-series #hypothesis-testing #statistical-significance What type of statistical test to use on ordered (e.g., time or…

Bounty: 50

I have 2 samples: one using water as the treatment (in black) and one using a test treatment (teal). The x-axis is position on a gene and the y-axis is the coverage, which is basically the number of times that region had a hit on the detector.

Is there a statistical test I can use to say the test treatment has higher or lower coverage than the water treatment? I guess a t-test or Wilcoxon could technically work, but it would not incorporate the position information, which I think is important.

Also, is there a way to incorporate segments? For example, is the (mean/median) coverage on region X to Y significantly different between the 2 samples compared to the other regions?

Are there any tests that come to mind when looking at this type of data? These plots remind me of time series data, so I feel that tests used on time series data would be applicable here.

[figure omitted: coverage by gene position for the water and test treatments]
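
One position-aware option that comes to mind (an editorial sketch, not from the post): a paired permutation test on the mean coverage difference within a region of interest, flipping treatment labels in blocks of adjacent positions so that local autocorrelation along the gene is at least partly respected. The data below are toy numbers:

import numpy as np

rng = np.random.default_rng(0)
pos_n = 1000                                   # positions along the gene
water = rng.poisson(50, pos_n).astype(float)   # toy coverage, water treatment
test  = rng.poisson(55, pos_n).astype(float)   # toy coverage, test treatment

lo, hi = 200, 400                              # region X to Y
diff = test[lo:hi] - water[lo:hi]
obs = diff.mean()

# Flip treatment labels in blocks (block size is an assumption to tune)
# so nearby positions move together.
block, n_perm, hits = 25, 10000, 0
n_blocks = int(np.ceil(diff.size / block))
for _ in range(n_perm):
    signs = np.repeat(rng.choice([-1.0, 1.0], n_blocks), block)[:diff.size]
    if abs((signs * diff).mean()) >= abs(obs):
        hits += 1
print("two-sided permutation p ~", (hits + 1) / (n_perm + 1))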


Get this bounty!!!

#StackBounty: #regression #machine-learning #time-series #predictive-models #random-forest Predict from multiple data

Bounty: 100

Basically, I have to do many computer simulations, which is very time-consuming and requires a lot of computing power.

I want to develop an algorithm/program to which I can provide the actual results from my simulations; I then want to provide some inputs and receive a predicted output based on the actual data I have given it.

EXAMPLE:

The run data LINK – convert to csv

I do 5 simulation runs; for each run, the inputs I initially provide are a fixed velocity and a rotation:

[table omitted: velocity and rotation inputs for the 5 runs]

Based on the velocity and rotation input, my output is force with respect to time:

[figure omitted: force vs. time output for each run]

Now, if I provide velocity = 18500 and rotation = 26°, I should get force vs. time data similar to Run 3 (as the input velocity and rotation are close to Run 3's).

Visually, this can be represented on a graph, where the grey curves are the 5 input runs and the red curve is the predicted run (based on the velocity and rotation I provide).

[figure omitted: the 5 input runs (grey) and the predicted run (red)]

Recently, I have looked into decision trees (random forests) and multiple regression. This could be a response surface modelling problem.

Below, I have done a simple multiple regression in Excel. At time = 0.002, I take all 5 forces, and using Excel's multiple regression function I get the coefficients.

So my predicted value is the coefficients multiplied by the new inputs (in this case I'm copying the inputs of Run 3, just to check the accuracy):

Intercept + velocity * XVariable1 + rotation * XVariable2

Thus:

-25268.67 + 18000*3.184064 + 25*309.50398

[table omitted: Excel multiple regression output and predicted value]

The result is close; however, this method is linear and my data isn't linear.

Is there a way to enhance this and improve the accuracy for non-linear data?
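
One nonlinear option (an editorial sketch with made-up data): pool all runs into rows of (velocity, rotation, time) with force as the response, fit a regressor such as a random forest, and then query it along a time grid for new inputs. Note that tree ensembles interpolate between the 5 runs in a piecewise-constant way, so a Gaussian process or polynomial response surface may give smoother curves:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy stand-in for the 5 runs: (velocity, rotation) inputs and a
# nonlinear force-vs-time response.
runs = [(17000, 20), (17500, 22), (18000, 25), (19000, 28), (20000, 30)]
t = np.linspace(0, 0.01, 50)
X, y = [], []
for v, r in runs:
    force = (v / 1000.0) * np.sin(200 * t) + r * t * 1000   # made-up physics
    for ti, fi in zip(t, force):
        X.append([v, r, ti])
        y.append(fi)

model = RandomForestRegressor(n_estimators=300, random_state=0)
model.fit(np.array(X), np.array(y))

# Predicted force-vs-time curve for velocity = 18500, rotation = 26.
query = np.column_stack([np.full_like(t, 18500),
                         np.full_like(t, 26),
                         t])
print(model.predict(query)[:5])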


Get this bounty!!!