REGRESSION ALGORITHM
Trimming branches smartly with cost-complexity pruning

Decision Trees aren’t limited to categorizing data — they’re equally good at predicting numerical values! Classification trees often steal the spotlight, but Decision Tree Regressors (or Regression Trees) are powerful and versatile tools in the world of continuous variable prediction.
While we’ll discuss the mechanics of regression tree construction (which are mostly similar to those of the classification tree), here we’ll also go beyond the pre-pruning methods like “minimal sample leaf” and “max tree depth” introduced in the classifier article. We’ll explore the most common post-pruning method, cost-complexity pruning, which introduces a complexity parameter into the decision tree’s cost function.
A Decision Tree for regression is a model that predicts numerical values using a tree-like structure. It splits data based on key features, starting from a root question and branching out. Each node asks about a feature, dividing data further until reaching leaf nodes with final predictions. To get a result, you follow the path matching your data’s features from root to leaf.
To demonstrate our concepts, we’ll work with our standard dataset. This dataset is used to predict the number of golfers visiting on a given day and includes variables like weather outlook, temperature, humidity, and wind conditions.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Create dataset
dataset_dict = {
'Outlook': ['sunny', 'sunny', 'overcast', 'rain', 'rain', 'rain', 'overcast', 'sunny', 'sunny', 'rain', 'sunny', 'overcast', 'overcast', 'rain', 'sunny', 'overcast', 'rain', 'sunny', 'sunny', 'rain', 'overcast', 'rain', 'sunny', 'overcast', 'sunny', 'overcast', 'rain', 'overcast'],
'Temp.': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0, 72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0, 88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
'Humid.': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0, 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0, 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
'Wind': [False, True, False, False, False, True, True, False, False, False, True, True, False, True, True, False, False, True, False, True, True, False, True, False, False, True, False, False],
'Num_Players': [52, 39, 43, 37, 28, 19, 43, 47, 56, 33, 49, 23, 42, 13, 33, 29, 25, 51, 41, 14, 34, 29, 49, 36, 57, 21, 23, 41]
}
df = pd.DataFrame(dataset_dict)
# One-hot encode 'Outlook' column
df = pd.get_dummies(df, columns=['Outlook'],prefix='',prefix_sep='')
# Convert 'Wind' column to binary
df['Wind'] = df['Wind'].astype(int)
# Split data into features and target, then into training and test sets
X, y = df.drop(columns='Num_Players'), df['Num_Players']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)
The Decision Tree for regression operates by recursively dividing the data based on features that best reduce prediction error. Here’s the general process:
- Begin with the entire dataset at the root node.
- Choose the feature that minimizes a specific error metric (such as mean squared error or variance) to split the data.
- Create child nodes based on the split, where each child represents a subset of the data aligned with the corresponding feature values.
- Repeat steps 2–3 for each child node, continuing to split the data until a stopping condition is reached.
- Assign a final predicted value to each leaf node, typically the average of the target values in that node.
We will explore the regression part in the decision tree algorithm CART (Classification and Regression Trees). It builds binary trees and typically follows these steps:
1. Begin with all training samples in the root node.
2. For each feature in the dataset:
a. Sort the feature values in ascending order.
b. Consider all midpoints between adjacent values as potential split points.
3. For each potential split point:
a. Calculate the mean squared error (MSE) of each resulting child node.
b. Compute the weighted average of those errors.
4. After evaluating all features and split points, select the one with the lowest weighted average MSE (a from-scratch sketch of this search follows the list).
5. Create two child nodes based on the chosen feature and split point:
– Left child: samples with feature value <= split point
– Right child: samples with feature value > split point
6. Recursively repeat steps 2–5 for each child node. (Continue until a stopping criterion is met.)
7. At each leaf node, assign the average target value of the samples in that node as the prediction.
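To make steps 2–4 concrete, here is a minimal from-scratch sketch of the split search, assuming the X_train and y_train created above. The helper find_best_split is purely illustrative; scikit-learn’s CART implementation performs this search internally (and far more efficiently).
import numpy as np

def find_best_split(X, y):
    """Illustrative helper: return the (feature, threshold) pair that minimizes
    the weighted average MSE of the two child nodes."""
    best_feature, best_threshold, best_error = None, None, np.inf
    for feature in X.columns:
        values = np.sort(X[feature].unique())
        # Midpoints between adjacent unique values are the candidate thresholds
        thresholds = (values[:-1] + values[1:]) / 2
        for t in thresholds:
            left, right = y[X[feature] <= t], y[X[feature] > t]
            # Each child's MSE is the variance of its targets around the child mean
            weighted_mse = (len(left) * left.var(ddof=0) + len(right) * right.var(ddof=0)) / len(y)
            if weighted_mse < best_error:
                best_feature, best_threshold, best_error = feature, t, weighted_mse
    return best_feature, best_threshold, best_error

print(find_best_split(X_train, y_train))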
from sklearn.tree import DecisionTreeRegressor, plot_tree
import matplotlib.pyplot as plt

# Train the model
regr = DecisionTreeRegressor(random_state=42)
regr.fit(X_train, y_train)
# Visualize the decision tree
plt.figure(figsize=(26,8))
plot_tree(regr, feature_names=X.columns, filled=True, rounded=True, impurity=False, fontsize=16, precision=2)
plt.tight_layout()
plt.show()
Here’s how a regression tree makes predictions for new data:
1. Start at the top (root) of the tree.
2. At each decision point (node):
– Look at the feature and split value.
– If the data point’s feature value is less than or equal to the split value, go left.
– If it’s larger, go right.
3. Keep moving down the tree until you reach the end (a leaf).
4. The prediction is the average value stored in that leaf (the sketch below traces this walk for one sample).
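The short walk-through below, assuming the regr model and X_test from above, traces one test sample from the root down to a leaf using the fitted tree’s internal arrays; predict() performs exactly this walk for every row.
# Trace one sample through the fitted tree (illustrative; assumes `regr` and `X_test` from above)
sample = X_test.iloc[[0]]
tree_ = regr.tree_
node = 0
while tree_.children_left[node] != -1:  # a node whose children_left is -1 is a leaf
    feature = X_test.columns[tree_.feature[node]]
    if sample[feature].iloc[0] <= tree_.threshold[node]:
        node = tree_.children_left[node]   # smaller or equal -> go left
    else:
        node = tree_.children_right[node]  # larger -> go right
print("Leaf prediction:", tree_.value[node][0][0])
print("predict() result:", regr.predict(sample)[0])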
Pruning Step
After building the tree, the remaining concern is how to make it smaller to prevent overfitting. In general, pruning methods fall into two categories:
Pre-pruning
Pre-pruning, also known as early stopping, involves halting the growth of a decision tree during the training process based on certain predefined criteria. This approach aims to prevent the tree from becoming too complex and overfitting the training data. Common pre-pruning techniques include:
- Maximum depth: Limiting how deep the tree can grow.
- Minimum samples for split: Requiring a minimum number of samples to justify splitting a node.
- Minimum samples per leaf: Ensuring each leaf node has at least a certain number of samples.
- Maximum number of leaf nodes: Restricting the total number of leaf nodes in the tree.
- Minimum impurity decrease: Only allowing splits that decrease impurity by a specified amount.
These methods stop the tree’s growth when the specified conditions are met, effectively “pruning” the tree during its construction phase.
(We discussed these methods before; they work exactly the same way in the regression case.)
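For reference, each of these criteria maps to a constructor parameter of scikit-learn’s DecisionTreeRegressor. The values below are arbitrary examples chosen only to show the mapping, not recommended settings.
from sklearn.tree import DecisionTreeRegressor

# Pre-pruning criteria as constructor parameters (values are arbitrary examples)
pre_pruned = DecisionTreeRegressor(
    max_depth=3,                # maximum depth
    min_samples_split=4,        # minimum samples required to split a node
    min_samples_leaf=2,         # minimum samples per leaf
    max_leaf_nodes=8,           # maximum number of leaf nodes
    min_impurity_decrease=1.0,  # minimum decrease in MSE required for a split
    random_state=42)
pre_pruned.fit(X_train, y_train)
print("Number of leaves:", pre_pruned.get_n_leaves())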
Post-pruning
Post-pruning, on the other hand, allows the decision tree to grow to its full extent and then prunes it back to reduce complexity. This approach first builds a complete tree and then removes or collapses branches that don’t significantly contribute to the model’s performance. One common post-pruning technique is called Cost-Complexity Pruning.
Step 1: Calculate the Impurity for Each Node
For each interim (internal) node, calculate its impurity (MSE in the regression case). We then sort these values from lowest to highest.
# Visualize the decision tree
plt.figure(figsize=(26,8))
plot_tree(regr, feature_names=X.columns, filled=True, rounded=True, impurity=True, fontsize=16, precision=2)
plt.tight_layout()
plt.show()
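Instead of reading the MSE values off the plot, we can also pull them directly from the fitted tree’s arrays. The small sketch below assumes the regr model trained above; it lists each interim (internal) node’s MSE, sorted from lowest to highest.
import numpy as np
import pandas as pd

tree_ = regr.tree_
# Interim nodes are those that still have children (leaves are marked with -1)
is_internal = tree_.children_left != -1
interim_nodes = pd.DataFrame({
    'node_id': np.arange(tree_.node_count)[is_internal],
    'n_samples': tree_.n_node_samples[is_internal],
    'mse': tree_.impurity[is_internal]
}).sort_values('mse')
print(interim_nodes)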
Step 2: Create Subtrees by Trimming The Weakest Link
The goal is to gradually turn the interim nodes into leaves, starting from the node with the lowest MSE (the “weakest link”). This gives us a pruning path: a sequence of progressively smaller subtrees.
Step 3: Calculate Total Leaf Impurities for Each Subtree
For each subtree T, total leaf impurities (R(T)) can be calculated as:
R(T) = (1/N) Σ I(L) * n_L
where:
· L ranges over all leaf nodes
· n_L is the number of samples in leaf L
· N is the total number of samples in the tree
· I(L) is the impurity (MSE) of leaf L
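As a quick check of this formula, the sketch below computes R(T) for the unpruned tree straight from its arrays, assuming regr from above; for a fully grown tree the value is close to zero, which is exactly why the complexity penalty in the next step is needed.
import numpy as np

tree_ = regr.tree_
is_leaf = tree_.children_left == -1
N = tree_.n_node_samples[0]  # samples at the root = total training samples
# R(T): sample-weighted average of the leaves' MSE values
R_T = np.sum(tree_.impurity[is_leaf] * tree_.n_node_samples[is_leaf]) / N
print(f"Total leaf impurity R(T) = {R_T:.3f}")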
Step 4: Compute the Cost Function
To decide when to stop turning interim nodes into leaves, we compute the cost complexity of each subtree T using the following formula:
Cost(T) = R(T) + α * |T|
where:
· R(T) is the total leaf impurities
· |T| is the number of leaf nodes in the subtree
· α is the complexity parameter
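Putting the two formulas together, the sketch below compares Cost(T) for a few subtrees under one fixed α. The helper subtree_cost is hypothetical, and the subtrees are generated simply by refitting with different ccp_alpha values; the subtree with the lowest cost would be the one we keep.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def subtree_cost(model, alpha):
    """Hypothetical helper: Cost(T) = R(T) + alpha * |T| for a fitted tree."""
    t = model.tree_
    is_leaf = t.children_left == -1
    R_T = np.sum(t.impurity[is_leaf] * t.n_node_samples[is_leaf]) / t.n_node_samples[0]
    return R_T + alpha * is_leaf.sum()

alpha = 0.5  # arbitrary complexity parameter, for illustration only
for ccp in [0.0, 0.1, 0.5, 2.0]:  # progressively smaller subtrees
    subtree = DecisionTreeRegressor(random_state=42, ccp_alpha=ccp).fit(X_train, y_train)
    print(f"leaves = {subtree.get_n_leaves():2d}, Cost(T) at alpha={alpha}: {subtree_cost(subtree, alpha):.3f}")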
Step 5: Select the Alpha
The value of α controls which subtree we end up with: the subtree with the lowest cost becomes the final tree.
While we can set α freely, scikit-learn can also report, for each subtree along the pruning path, the smallest value of α that produces it. This is called the effective α.
# Compute the cost-complexity pruning path
tree = DecisionTreeRegressor(random_state=42)
path = tree.cost_complexity_pruning_path(X_train, y_train)
effective_alphas, impurities = path.ccp_alphas, path.impurities

# Function to count leaf nodes
count_leaves = lambda model: sum(model.tree_.children_left[i] == model.tree_.children_right[i] == -1 for i in range(model.tree_.node_count))

# Train trees and count leaves for each effective alpha
leaf_counts = [count_leaves(DecisionTreeRegressor(random_state=42, ccp_alpha=alpha).fit(X_train, y_train)) for alpha in effective_alphas]
# Create DataFrame with analysis results
pruning_analysis = pd.DataFrame({
'total_leaf_impurities': impurities,
'leaf_count': leaf_counts,
'cost_function': [f"{imp:.3f} + {leaves}α" for imp, leaves in zip(impurities, leaf_counts)],
'effective_α': effective_alphas
})
print(pruning_analysis)
Final Remarks
Pre-pruning methods are generally faster and more memory-efficient, as they prevent the tree from growing too large in the first place.
Post-pruning can potentially create more optimal trees, as it considers the entire tree structure before making pruning decisions. However, it can be more computationally expensive.
Both approaches aim to find a balance between model complexity and performance, with the goal of creating a model that generalizes well to unseen data. The choice between pre-pruning and post-pruning (or a combination of both) often depends on the specific dataset, the problem at hand, and of course, computational resources available.
In practice, it’s common to use a combination of these methods, like applying some pre-pruning criteria to prevent excessively large trees, and then using post-pruning for fine-tuning the model’s complexity.
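Before the complete end-to-end example below, here is a minimal sketch of that combination, assuming the training split created earlier; both the max_depth and ccp_alpha values are illustrative placeholders rather than tuned settings.
from sklearn.tree import DecisionTreeRegressor

# Pre-pruning (max_depth) caps growth up front; post-pruning (ccp_alpha)
# then trims weak branches from what remains. Values are placeholders only.
combined = DecisionTreeRegressor(max_depth=4, ccp_alpha=0.1, random_state=42)
combined.fit(X_train, y_train)
print("Leaves after combined pruning:", combined.get_n_leaves())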
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import root_mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Create dataset
dataset_dict = {
'Outlook': ['sunny', 'sunny', 'overcast', 'rain', 'rain', 'rain', 'overcast', 'sunny', 'sunny', 'rain', 'sunny', 'overcast', 'overcast', 'rain', 'sunny', 'overcast', 'rain', 'sunny', 'sunny', 'rain', 'overcast', 'rain', 'sunny', 'overcast', 'sunny', 'overcast', 'rain', 'overcast'],
'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0, 72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0, 88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0, 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0, 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
'Wind': [False, True, False, False, False, True, True, False, False, False, True, True, False, True, True, False, False, True, False, True, True, False, True, False, False, True, False, False],
'Num_Players': [52,39,43,37,28,19,43,47,56,33,49,23,42,13,33,29,25,51,41,14,34,29,49,36,57,21,23,41]
}
df = pd.DataFrame(dataset_dict)
# One-hot encode 'Outlook' column
df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='', dtype=int)
# Convert 'Wind' column to binary
df['Wind'] = df['Wind'].astype(int)
# Split data into features and target, then into training and test sets
X, y = df.drop(columns='Num_Players'), df['Num_Players']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)
# Initialize Decision Tree Regressor
tree = DecisionTreeRegressor(random_state=42)
# Get the cost complexity path, impurities, and effective alpha
path = tree.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
print(ccp_alphas)
print(impurities)
# Train the final tree with the chosen alpha
final_tree = DecisionTreeRegressor(random_state=42, ccp_alpha=0.1)
final_tree.fit(X_train, y_train)
# Make predictions
y_pred = final_tree.predict(X_test)
# Calculate and print RMSE
rmse = root_mean_squared_error(y_test, y_pred)
print(f"RMSE: {rmse:.4f}")