Decision Tree Regression Example with Python

🍀 A decision tree regression example with Python, for problems where the goal is not classification but predicting a numeric value (such as house prices or temperature).

Below is a simple example that uses DecisionTreeRegressor to predict house prices from the California Housing dataset:

from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

# Load the California Housing dataset
data = fetch_california_housing()
X, y = data.data, data.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create the regression tree and train it
regressor = DecisionTreeRegressor(max_depth=4)
regressor.fit(X_train, y_train)

# Plot the regression tree
plt.figure(figsize=(16, 10))
plot_tree(regressor, feature_names=data.feature_names, filled=True)
plt.title("Decision Tree Regressor - California Housing")
plt.show()

📌 Explanation:

DecisionTreeRegressor predicts a continuous value (here, the house price) instead of a class label.
max_depth=4 caps the depth of the tree, which helps limit overfitting.
plot_tree() lets you see how the model arrives at its predictions from features such as median income (MedInc), average number of rooms (AveRooms), and so on.
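
To make the effect of max_depth concrete, here is a minimal sketch that trains trees of different depths and compares training and test R². It uses small synthetic data (an assumption, chosen so the snippet runs without downloading anything) rather than California Housing, so the exact numbers will differ from the example above.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: used so the sketch runs without a download)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(600, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(0, 0.3, size=600)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

results = {}
for depth in (2, 4, None):  # None places no limit on the depth
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    results[depth] = {
        "depth": tree.get_depth(),
        "leaves": tree.get_n_leaves(),
        "train_r2": tree.score(X_train, y_train),  # score() returns R² for regressors
        "test_r2": tree.score(X_test, y_test),
    }
    print(f"max_depth={depth}: {results[depth]}")
```

The unrestricted tree typically fits the training set almost perfectly while its test score lags behind, which is the gap that a depth limit is meant to narrow.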

📊 You can also measure prediction quality with mean_squared_error or r2_score from sklearn.metrics.
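
As a rough sketch of that evaluation step, the snippet below fits a depth-4 tree and reports both metrics on a held-out test set. The data is synthetic (an assumption, so the snippet runs offline); with the California Housing example you would reuse the existing X_test, y_test, and regressor instead.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: used so the sketch runs without a download)
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(1000, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(0, 0.1, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

regressor = DecisionTreeRegressor(max_depth=4, random_state=0)
regressor.fit(X_train, y_train)

# Evaluate on data the tree has not seen during training
y_pred = regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred)  # lower is better
r2 = r2_score(y_test, y_pred)             # closer to 1 is better

print(f"MSE: {mse:.4f}")
print(f"R^2: {r2:.4f}")
```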

We can experiment with the pruning parameters to find a simpler tree. Do you want your regression tree as compact as a bonsai, or as detailed as… the national tax code? 😄

🌱 Pruning a regression tree (Decision Tree Regressor) reduces overfitting and makes the model easier to interpret, just as it does for classification trees. In scikit-learn, post-pruning is implemented as minimal cost-complexity pruning and is controlled by the ccp_alpha parameter.
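
To see what ccp_alpha does in practice, the sketch below fits the same tree with three alphas taken from the pruning path and prints the number of leaves for each. The data is synthetic (an assumption, so the snippet runs offline), and the choice of which three alphas to show is arbitrary.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: used so the sketch runs without a download)
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(300, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(0, 0.3, size=300)

# Effective alphas at which the fully grown tree loses a subtree
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)
# ccp_alpha must be non-negative, so clip as a precaution against rounding noise
alphas = np.clip(path.ccp_alphas, 0.0, None)

# Pick a few alphas spread across the path, from no pruning to heavy pruning
chosen = alphas[[0, len(alphas) // 2, -2]]

leaf_counts = []
for alpha in chosen:
    tree = DecisionTreeRegressor(ccp_alpha=alpha, random_state=0).fit(X, y)
    leaf_counts.append(tree.get_n_leaves())
    print(f"ccp_alpha={alpha:.5f} -> {tree.get_n_leaves()} leaves")
```

Larger values of ccp_alpha penalize tree size more heavily, so the leaf count shrinks as alpha grows.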

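You can choose ccp_alpha automatically by training one model per candidate alpha and keeping the one with the best result on held-out data (the highest score, or the lowest error if you measure error). The example below compares models on the test set for simplicity; keep in mind that a test set used for model selection no longer gives an unbiased estimate of final performance, so a separate validation set or cross-validation is preferable when you need one.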

Here is an example that picks the best model automatically:

import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load and split the data
data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, random_state=42)

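# Compute the cost-complexity pruning path (this grows the full tree internally)
regressor = DecisionTreeRegressor(random_state=0)
path = regressor.cost_complexity_pruning_path(X_train, y_train)
# Drop the largest alpha, which prunes the tree down to a single node
ccp_alphas = path.ccp_alphas[:-1]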

# Find the model with the lowest MSE on the test set
best_model = None
lowest_mse = float("inf")
best_alpha = None

for alpha in ccp_alphas:
    model = DecisionTreeRegressor(ccp_alpha=alpha, random_state=0)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    
    if mse < lowest_mse:
        lowest_mse = mse
        best_model = model
        best_alpha = alpha

print(f"✅ Best alpha: {best_alpha:.5f} with lowest MSE: {lowest_mse:.4f}")

📌 Ways to customize:

Replace mean_squared_error with r2_score if you would rather select by R²; in that case keep the model with the highest score instead of the lowest.
Plot best_model with plot_tree to see what the tree looks like after pruning with the best alpha.
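
One possible variant of the selection step, sketched here as an alternative and not as the method used in the loop above, is to let GridSearchCV pick ccp_alpha by cross-validated R² on the training data, so the test set is only used once at the end. The synthetic data and the choice to keep every 10th candidate alpha are assumptions made to keep the snippet small and fast.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: used so the sketch runs without a download)
rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(500, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(0, 0.3, size=500)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(
    X_train, y_train
)
# Drop the last alpha (single-node tree), clip at zero as a precaution,
# and keep every 10th candidate so the search stays fast
candidates = np.unique(np.clip(path.ccp_alphas[:-1], 0.0, None))[::10].tolist()

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"ccp_alpha": candidates},
    scoring="r2",  # higher is better, so GridSearchCV maximizes it
    cv=5,
)
search.fit(X_train, y_train)

best_alpha = search.best_params_["ccp_alpha"]
best_model = search.best_estimator_  # refit on the full training set
test_r2 = best_model.score(X_test, y_test)

print(f"Best alpha: {best_alpha:.5f}")
print(f"Cross-validated R^2: {search.best_score_:.4f}")
print(f"Test R^2: {test_r2:.4f}")
```

Because the alpha is chosen without looking at the test set, the final test R² remains a fair estimate of how the pruned tree performs on unseen data.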
