#StackBounty: #cart #boosting #catboost Please correct my assumption on how regression trees work

Bounty: 50

I’m trying to understand how regression trees work. I’ve been experimenting with catboost and xgboost in Python, and I’m getting results I don’t expect. Can someone please clarify? (Apologies in advance if this is a coding error.)

I’ve generated test data by adding random noise to a hinge function shown in the image below:

I then fitted a catboost regressor, with iterations=1 and depth=1. My understanding is this should split the x values into two leaf nodes, and the prediction is the mean of the y values in each node. My expectation is the model will look like the image below – this has a mean squared error of ~225:

However, catboost splits at ~30, and the predicted value in each leaf doesn’t appear to be the mean of the blue points in that leaf. The mean squared error is ~1240:

My code is:

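# generate data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# imported here because mean_squared_error is used before the catboost section
from sklearn.metrics import mean_squared_error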

x = np.linspace(0,40,1000)
e = np.random.normal(0,5,1000)
y = 10 + np.where(x>18,3*(x-18),0) + e

plt.figure();
plt.scatter(x,y, s=1);
plt.plot(x, 10 + np.where(x>18,3*(x-18),0), color = 'orange', alpha=0.7);

# expected model
y_pred = np.where(x>18,y[x>18].mean(),y[x<=18].mean())

plt.figure();
plt.scatter(x, y, s=1);
plt.plot(x, y_pred, color='orange');

print(f'mse: {mean_squared_error(y, y_pred):.2f}')

# catboost model
from catboost import CatBoostRegressor, Pool
from sklearn.metrics import mean_squared_error

train_pool = Pool(x, y)
estimator = CatBoostRegressor(n_estimators=1, max_depth=1, loss_function='RMSE')
estimator.fit(train_pool)

y_pred = estimator.predict(x)

plt.figure();
plt.scatter(x, y, s=1);
plt.plot(x, y_pred, color='orange');

print(f'mse: {mean_squared_error(y, y_pred):.2f}')
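
One library-independent check on the depth-1 expectation described above (two leaves, each predicting the mean of its y values) is to search every possible threshold by brute force and keep the one with the lowest squared error. The following is a minimal sketch under stated assumptions, not how catboost or xgboost actually choose splits or leaf values. `best_stump` is a hypothetical helper name, and the fixed seed is only there to make the run repeatable.

```python
import numpy as np

def best_stump(x, y):
    """Hypothetical helper: exhaustive search for the single threshold on x
    that minimises squared error when each side predicts its own mean."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    n = len(ys)

    # Prefix sums give the sum and sum of squares on each side of every split.
    csum = np.cumsum(ys)
    csum_sq = np.cumsum(ys ** 2)
    n_left = np.arange(1, n)          # points in the left leaf for each split
    n_right = n - n_left
    left_sum = csum[:-1]
    right_sum = csum[-1] - csum[:-1]

    # Sum of squared errors around the leaf mean: sum(y^2) - (sum y)^2 / count
    left_sse = csum_sq[:-1] - left_sum ** 2 / n_left
    right_sse = (csum_sq[-1] - csum_sq[:-1]) - right_sum ** 2 / n_right
    sse = left_sse + right_sse

    # A threshold cannot separate two identical x values.
    sse[xs[1:] == xs[:-1]] = np.inf

    i = int(np.argmin(sse))
    return {
        "threshold": (xs[i] + xs[i + 1]) / 2,
        "left_mean": left_sum[i] / n_left[i],
        "right_mean": right_sum[i] / n_right[i],
        "mse": sse[i] / n,
    }

# Same data-generating process as in the question, with a fixed seed (assumption).
rng = np.random.default_rng(0)
x = np.linspace(0, 40, 1000)
y = 10 + np.where(x > 18, 3 * (x - 18), 0) + rng.normal(0, 5, 1000)

stump = best_stump(x, y)

# Error of the split placed at the hinge (x = 18), for comparison.
hinge_pred = np.where(x > 18, y[x > 18].mean(), y[x <= 18].mean())
hinge_mse = np.mean((y - hinge_pred) ** 2)

print(stump)
print(f"mse with split at 18: {hinge_mse:.2f}")
```

Comparing the returned `mse` with the error of the split at 18 shows whether the hinge is actually the best single split for this data, separately from whatever a given boosting library does with learning rate or regularisation.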

