REGRESSION ALGORITHM
Trimming branches smartly with cost-complexity pruning

Decision Trees aren’t limited to categorizing data — they’re equally good at predicting numerical values! Classification trees often steal the spotlight, but Decision Tree Regressors (or Regression Trees) are powerful and versatile tools in the world of continuous variable prediction.
While we’ll discuss the mechanics of regression tree construction (which are mostly similar to those of the classification tree), here we’ll also go beyond the pre-pruning methods like “minimal sample leaf” and “max tree depth” introduced in the classifier article. We’ll explore the most common post-pruning method, cost-complexity pruning, which introduces a complexity parameter into the decision tree’s cost function.
A Decision Tree for regression is a model that predicts numerical values using a tree-like structure. It splits data based on key features, starting from a root question and branching out. Each node asks about a feature, dividing data further until reaching leaf nodes with final predictions. To get a result, you follow the path matching your data’s features from root to leaf.
To demonstrate our concepts, we’ll work with our standard dataset. This dataset is used to predict the number of golfers visiting on a given day and includes variables like weather outlook, temperature, humidity, and wind conditions.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Create dataset
dataset_dict = {
'Outlook': ['sunny', 'sunny', 'overcast', 'rain', 'rain', 'rain', 'overcast', 'sunny', 'sunny', 'rain', 'sunny', 'overcast', 'overcast', 'rain', 'sunny', 'overcast', 'rain', 'sunny', 'sunny', 'rain', 'overcast', 'rain', 'sunny', 'overcast', 'sunny', 'overcast', 'rain', 'overcast'],
'Temp.': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0, 72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0, 88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
'Humid.': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0, 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0, 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
'Wind': [False, True, False, False, False, True, True, False, False, False, True, True, False, True, True, False, False, True, False, True, True, False, True, False, False, True, False, False],
'Num_Players': [52, 39, 43, 37, 28, 19, 43, 47, 56, 33, 49, 23, 42, 13, 33, 29, 25, 51, 41, 14, 34, 29, 49, 36, 57, 21, 23, 41]
}
df = pd.DataFrame(dataset_dict)
# One-hot encode 'Outlook' column
df = pd.get_dummies(df, columns=['Outlook'],prefix='',prefix_sep='')
# Convert 'Wind' column to binary
df['Wind'] = df['Wind'].astype(int)
# Split data into features and target, then into training and test sets
X, y = df.drop(columns='Num_Players'), df['Num_Players']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)
The Decision Tree for regression operates by recursively dividing the data based on features that best reduce prediction error. Here’s the general process:
- Begin with the entire dataset at the root node.
- Choose the feature that minimizes a specific error metric (such as mean squared error or variance) to split the data.
- Create child nodes based on the split, where each child represents a subset of the data aligned with the corresponding feature values.
- Repeat steps 2–3 for each child node, continuing to split the data until a stopping condition is reached.
- Assign a final predicted value to each leaf node, typically the average of the target values in that node.
We will explore the regression part in the decision tree algorithm CART (Classification and Regression Trees). It builds binary trees and typically follows these steps:
1. Begin with all training samples in the root node.
2. For each feature in the dataset:
a. Sort the feature values in ascending order.
b. Consider all midpoints between adjacent values as potential split points.
3. For each potential split point:
a. Calculate the mean squared error (MSE) of each resulting child node.
b. Compute the weighted average of those errors.
4. After evaluating all features and split points, select the one with the lowest weighted average MSE (a from-scratch sketch of this search follows the list).
5. Create two child nodes based on the chosen feature and split point:
– Left child: samples with feature value <= split point
– Right child: samples with feature value > split point
6. Recursively repeat steps 2–5 for each child node. (Continue until a stopping criterion is met.)
7. At each leaf node, assign the average target value of the samples in that node as the prediction.
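To make steps 2–4 concrete, here is a minimal from-scratch sketch of the split search, assuming the X_train and y_train created above. The helper find_best_split is purely illustrative; scikit-learn’s CART implementation performs this search internally (and far more efficiently).
import numpy as np

def find_best_split(X, y):
    """Illustrative helper: return the (feature, threshold) pair that minimizes
    the weighted average MSE of the two child nodes."""
    best_feature, best_threshold, best_error = None, None, np.inf
    for feature in X.columns:
        values = np.sort(X[feature].unique())
        # Midpoints between adjacent unique values are the candidate thresholds
        thresholds = (values[:-1] + values[1:]) / 2
        for t in thresholds:
            left, right = y[X[feature] <= t], y[X[feature] > t]
            # Each child's MSE is the variance of its targets around the child mean
            weighted_mse = (len(left) * left.var(ddof=0) + len(right) * right.var(ddof=0)) / len(y)
            if weighted_mse < best_error:
                best_feature, best_threshold, best_error = feature, t, weighted_mse
    return best_feature, best_threshold, best_error

print(find_best_split(X_train, y_train))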
from sklearn.tree import DecisionTreeRegressor, plot_tree
import matplotlib.pyplot as plt

# Train the model
regr = DecisionTreeRegressor(random_state=42)
regr.fit(X_train, y_train)
# Visualize the decision tree
plt.figure(figsize=(26,8))
plot_tree(regr, feature_names=X.columns, filled=True, rounded=True, impurity=False, fontsize=16, precision=2)
plt.tight_layout()
plt.show()
Here’s how a regression tree makes predictions for new data:
1. Start at the top (root) of the tree.
2. At each decision point (node):
– Look at the feature and split value.
– If the data point’s feature value is less than or equal to the split value, go left.
– If it’s larger, go right.
3. Keep moving down the tree until you reach the end (a leaf).
4. The prediction is the average value stored in that leaf (the sketch below traces this walk for one sample).
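The short walk-through below, assuming the regr model and X_test from above, traces one test sample from the root down to a leaf using the fitted tree’s internal arrays; predict() performs exactly this walk for every row.
# Trace one sample through the fitted tree (illustrative; assumes `regr` and `X_test` from above)
sample = X_test.iloc[[0]]
tree_ = regr.tree_
node = 0
while tree_.children_left[node] != -1:  # a node whose children_left is -1 is a leaf
    feature = X_test.columns[tree_.feature[node]]
    if sample[feature].iloc[0] <= tree_.threshold[node]:
        node = tree_.children_left[node]   # smaller or equal -> go left
    else:
        node = tree_.children_right[node]  # larger -> go right
print("Leaf prediction:", tree_.value[node][0][0])
print("predict() result:", regr.predict(sample)[0])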
Pruning Step
After building the tree, the remaining concern is how to make it smaller to prevent overfitting. In general, pruning methods fall into two categories:
Pre-pruning
Pre-pruning, also known as early stopping, involves halting the growth of a decision tree during the training process based on certain predefined criteria. This approach aims to prevent the tree from becoming too complex and overfitting the training data. Common pre-pruning techniques include:
- Maximum depth: Limiting how deep the tree can grow.
- Minimum samples for split: Requiring a minimum number of samples to justify splitting a node.
- Minimum samples per leaf: Ensuring each leaf node has at least a certain number of samples.
- Maximum number of leaf nodes: Restricting the total number of leaf nodes in the tree.
- Minimum impurity decrease: Only allowing splits that decrease impurity by a specified amount.
These methods stop the tree’s growth when the specified conditions are met, effectively “pruning” the tree during its construction phase.
(We discussed these methods before; they work exactly the same way in the regression case.)
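For reference, each of these criteria maps to a constructor parameter of scikit-learn’s DecisionTreeRegressor. The values below are arbitrary examples chosen only to show the mapping, not recommended settings.
from sklearn.tree import DecisionTreeRegressor

# Pre-pruning criteria as constructor parameters (values are arbitrary examples)
pre_pruned = DecisionTreeRegressor(
    max_depth=3,                # maximum depth
    min_samples_split=4,        # minimum samples required to split a node
    min_samples_leaf=2,         # minimum samples per leaf
    max_leaf_nodes=8,           # maximum number of leaf nodes
    min_impurity_decrease=1.0,  # minimum decrease in MSE required for a split
    random_state=42)
pre_pruned.fit(X_train, y_train)
print("Number of leaves:", pre_pruned.get_n_leaves())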
Post-pruning
Post-pruning, on the other hand, allows the decision tree to grow to its full extent and then prunes it back to reduce complexity. This approach first builds a complete tree and then removes or collapses branches that don’t significantly contribute to the model’s performance. One common post-pruning technique is called Cost-Complexity Pruning.
Step 1: Calculate the Impurity for Each Node
For each interim (internal) node, calculate its impurity (MSE in the regression case). We then sort these values from lowest to highest.
# Visualize the decision tree
plt.figure(figsize=(26,8))
plot_tree(regr, feature_names=X.columns, filled=True, rounded=True, impurity=True, fontsize=16, precision=2)
plt.tight_layout()
plt.show()
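Instead of reading the MSE values off the plot, we can also pull them directly from the fitted tree’s arrays. The small sketch below assumes the regr model trained above; it lists each interim (internal) node’s MSE, sorted from lowest to highest.
import numpy as np
import pandas as pd

tree_ = regr.tree_
# Interim nodes are those that still have children (leaves are marked with -1)
is_internal = tree_.children_left != -1
interim_nodes = pd.DataFrame({
    'node_id': np.arange(tree_.node_count)[is_internal],
    'n_samples': tree_.n_node_samples[is_internal],
    'mse': tree_.impurity[is_internal]
}).sort_values('mse')
print(interim_nodes)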
Step 2: Create Subtrees by Trimming The Weakest Link
The goal is to gradually turn the interim nodes into leaves, starting from the node with the lowest MSE (the “weakest link”). This gives us a pruning path: a sequence of progressively smaller subtrees.
Step 3: Calculate Total Leaf Impurities for Each Subtree
For each subtree T, total leaf impurities (R(T)) can be calculated as:
R(T) = (1/N) Σ I(L) * n_L
where:
· L ranges over all leaf nodes
· n_L is the number of samples in leaf L
· N is the total number of samples in the tree
· I(L) is the impurity (MSE) of leaf L
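As a quick check of this formula, the sketch below computes R(T) for the unpruned tree straight from its arrays, assuming regr from above; for a fully grown tree the value is close to zero, which is exactly why the complexity penalty in the next step is needed.
import numpy as np

tree_ = regr.tree_
is_leaf = tree_.children_left == -1
N = tree_.n_node_samples[0]  # samples at the root = total training samples
# R(T): sample-weighted average of the leaves' MSE values
R_T = np.sum(tree_.impurity[is_leaf] * tree_.n_node_samples[is_leaf]) / N
print(f"Total leaf impurity R(T) = {R_T:.3f}")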
Step 4: Compute the Cost Function
To decide when to stop turning interim nodes into leaves, we compute the cost complexity of each subtree T using the following formula:
Cost(T) = R(T) + α * |T|
where:
· R(T) is the total leaf impurities
· |T| is the number of leaf nodes in the subtree
· α is the complexity parameter
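Putting the two formulas together, the sketch below compares Cost(T) for a few subtrees under one fixed α. The helper subtree_cost is hypothetical, and the subtrees are generated simply by refitting with different ccp_alpha values; the subtree with the lowest cost would be the one we keep.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def subtree_cost(model, alpha):
    """Hypothetical helper: Cost(T) = R(T) + alpha * |T| for a fitted tree."""
    t = model.tree_
    is_leaf = t.children_left == -1
    R_T = np.sum(t.impurity[is_leaf] * t.n_node_samples[is_leaf]) / t.n_node_samples[0]
    return R_T + alpha * is_leaf.sum()

alpha = 0.5  # arbitrary complexity parameter, for illustration only
for ccp in [0.0, 0.1, 0.5, 2.0]:  # progressively smaller subtrees
    subtree = DecisionTreeRegressor(random_state=42, ccp_alpha=ccp).fit(X_train, y_train)
    print(f"leaves = {subtree.get_n_leaves():2d}, Cost(T) at alpha={alpha}: {subtree_cost(subtree, alpha):.3f}")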
Step 5: Select the Alpha
The value of α controls which subtree we end up with: the subtree with the lowest cost becomes the final tree.
While we can set α freely, scikit-learn can also report, for each subtree along the pruning path, the smallest value of α that produces it. This is called the effective α.
# Compute the cost-complexity pruning path
tree = DecisionTreeRegressor(random_state=42)
path = tree.cost_complexity_pruning_path(X_train, y_train)
effective_alphas, impurities = path.ccp_alphas, path.impurities

# Function to count leaf nodes
count_leaves = lambda model: sum(model.tree_.children_left[i] == model.tree_.children_right[i] == -1 for i in range(model.tree_.node_count))

# Train trees and count leaves for each effective alpha
leaf_counts = [count_leaves(DecisionTreeRegressor(random_state=42, ccp_alpha=alpha).fit(X_train, y_train)) for alpha in effective_alphas]
# Create DataFrame with analysis results
pruning_analysis = pd.DataFrame({
'total_leaf_impurities': impurities,
'leaf_count': leaf_counts,
'cost_function': [f"{imp:.3f} + {leaves}α" for imp, leaves in zip(impurities, leaf_counts)],
'effective_α': effective_alphas
})
print(pruning_analysis)
Final Remarks
Pre-pruning methods are generally faster and more memory-efficient, as they prevent the tree from growing too large in the first place.
Post-pruning can potentially create more optimal trees, as it considers the entire tree structure before making pruning decisions. However, it can be more computationally expensive.
Both approaches aim to find a balance between model complexity and performance, with the goal of creating a model that generalizes well to unseen data. The choice between pre-pruning and post-pruning (or a combination of both) often depends on the specific dataset, the problem at hand, and of course, computational resources available.
In practice, it’s common to use a combination of these methods, like applying some pre-pruning criteria to prevent excessively large trees, and then using post-pruning for fine-tuning the model’s complexity.
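Before the complete end-to-end example below, here is a minimal sketch of that combination, assuming the training split created earlier; both the max_depth and ccp_alpha values are illustrative placeholders rather than tuned settings.
from sklearn.tree import DecisionTreeRegressor

# Pre-pruning (max_depth) caps growth up front; post-pruning (ccp_alpha)
# then trims weak branches from what remains. Values are placeholders only.
combined = DecisionTreeRegressor(max_depth=4, ccp_alpha=0.1, random_state=42)
combined.fit(X_train, y_train)
print("Leaves after combined pruning:", combined.get_n_leaves())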
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import root_mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Create dataset
dataset_dict = {
'Outlook': ['sunny', 'sunny', 'overcast', 'rain', 'rain', 'rain', 'overcast', 'sunny', 'sunny', 'rain', 'sunny', 'overcast', 'overcast', 'rain', 'sunny', 'overcast', 'rain', 'sunny', 'sunny', 'rain', 'overcast', 'rain', 'sunny', 'overcast', 'sunny', 'overcast', 'rain', 'overcast'],
'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0, 72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0, 88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0, 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0, 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
'Wind': [False, True, False, False, False, True, True, False, False, False, True, True, False, True, True, False, False, True, False, True, True, False, True, False, False, True, False, False],
'Num_Players': [52,39,43,37,28,19,43,47,56,33,49,23,42,13,33,29,25,51,41,14,34,29,49,36,57,21,23,41]
}
df = pd.DataFrame(dataset_dict)
# One-hot encode 'Outlook' column
df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='', dtype=int)
# Convert 'Wind' column to binary
df['Wind'] = df['Wind'].astype(int)
# Split data into features and target, then into training and test sets
X, y = df.drop(columns='Num_Players'), df['Num_Players']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)
# Initialize Decision Tree Regressor
tree = DecisionTreeRegressor(random_state=42)
# Get the cost complexity path, impurities, and effective alpha
path = tree.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
print(ccp_alphas)
print(impurities)
# Train the final tree with the chosen alpha
final_tree = DecisionTreeRegressor(random_state=42, ccp_alpha=0.1)
final_tree.fit(X_train, y_train)
# Make predictions
y_pred = final_tree.predict(X_test)
# Calculate and print RMSE
rmse = root_mean_squared_error(y_test, y_pred)
print(f"RMSE: {rmse:.4f}")