#StackBounty: #python #scikit-learn #time-series How to correctly predict target variables with sklearn regressor in python?

Bounty: 50

I am using AdaBoostRegressor to forecast a target variable in time series data 12 weeks ahead. I also want to see how each individual feature contributes to the forecast. To do so, I removed the seasonality of the time series by taking its log and differencing it, then built the training data for fitting the model. With my current approach, I don't know how to tell which features contribute to a better prediction of the target variable. I think there might be a better way of doing this. How can I improve my current output? Can anyone suggest a way to do this in sklearn?

My attempt:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostRegressor

url = "https://gist.githubusercontent.com/adamFlyn/f71e2e0e66303df23dfc2f37ec98e8c7/raw/ba9e871e90201eb504e30127e99cf6179c3e3b18/tradedf.csv"

df = pd.read_csv(url, parse_dates=['date'])
df.drop(columns=['Unnamed: 0'], inplace=True)

df['log_eyci'] = np.log(df.eyci)  ### Log value
df['log_aus_avg_rain'] = np.log(df['aus_avg_rain'])  ### Log value

for i in range(3):
    df[f'avgRain_lag_{i+1}'] = df['aus_avg_rain'].shift(i+1)   
    df.dropna(inplace=True)
    df[f'log_avgRain_lag_{i+1}'] = np.log(df[f'avgRain_lag_{i+1}'])
    
for i in range(3):
    df[f'eyci_lag_{i+1}'] = df.eyci.shift(i+1)   
    df.dropna(inplace=True)
    df[f'log_eyci_lag_{i+1}'] = np.log(df[f'eyci_lag_{i+1}'])
    df[f'log_difference_{i+1}'] = df.log_eyci - df[f'log_eyci_lag_{i+1}']

X,Y = df[['log_difference_2', 'log_difference_3', 'aus_avg_rain', 'aus_slg_fmCatl']] , df['log_difference_1']
# shuffle=False keeps the time order, so random_state has no effect here
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, shuffle=False)

Fit the model with AdaBoostRegressor:

mdl_adaboost = AdaBoostRegressor(n_estimators=100, learning_rate=0.01)
mdl_adaboost.fit(X_train, Y_train)   # Fit the data
pred = mdl_adaboost.predict(X_test)  # make predictions

To plot the predictions against the real values, I tried the following:

# Plot predicted vs. real eyci, undoing the log difference for the predictions
test_size = X_test.shape[0]
plt.plot(list(range(test_size)), np.exp(df.tail(test_size).log_eyci_lag_1 + pred), label='predicted', color='red')
plt.plot(list(range(test_size)), df.tail(test_size).eyci, label='real', color='blue')
plt.legend(loc='best')
plt.title('Predicted vs Real with log difference values')
plt.show()

The main problem is that I want to see how features such as 'aus_avg_rain' and 'aus_slg_fmCatl' contribute to predicting eyci. From the plot alone, it is hard to see the effect of one or more features on the eyci forecast for 12 months ahead. How can I approach this? Can anyone suggest an idea or a way to get past this? Thanks in advance!
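
One possible way to look at per-feature contribution, offered as a minimal sketch and not a definitive answer: a fitted AdaBoostRegressor exposes `feature_importances_`, and `sklearn.inspection.permutation_importance` measures how much the held-out score drops when one feature is shuffled. The data below is synthetic and hypothetical; only the column names are borrowed from the question, and the target is deliberately constructed to depend on 'log_difference_2', so the numbers say nothing about the real tradedf.csv data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import AdaBoostRegressor
from sklearn.inspection import permutation_importance

# Synthetic stand-in data (assumption): column names mirror the question only
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "log_difference_2": rng.normal(size=n),
    "log_difference_3": rng.normal(size=n),
    "aus_avg_rain": rng.normal(size=n),
    "aus_slg_fmCatl": rng.normal(size=n),
})
# Assumed target, driven by log_difference_2 purely for illustration
y = 2.0 * X["log_difference_2"] + 0.1 * rng.normal(size=n)

# Time-ordered split: first 80% for training, last 20% for testing
split = int(n * 0.8)
X_train, X_test = X.iloc[:split], X.iloc[split:]
y_train, y_test = y.iloc[:split], y.iloc[split:]

model = AdaBoostRegressor(n_estimators=100, learning_rate=0.01, random_state=0)
model.fit(X_train, y_train)

# Impurity-based importances, averaged over the boosted trees
impurity = pd.Series(model.feature_importances_, index=X.columns)

# Permutation importance: mean drop in test R^2 when a feature is shuffled
perm = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
perm_importance = pd.Series(perm.importances_mean, index=X.columns)

print(impurity.sort_values(ascending=False))
print(perm_importance.sort_values(ascending=False))
```

On the real data, the same two calls could be applied to mdl_adaboost with X_test and Y_test. Permutation importance computed on the held-out part is usually the safer of the two to interpret, since impurity-based importances are derived from the training data alone.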

