[ML/python] Decision Tree (decision tree, DTC, class_weight, export_text, plot_tree)

fiftyline 2025. 2. 14. 17:23

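A decision tree is a tree-structured model that can be used for both classification and regression.

Starting at the root node, which holds the entire dataset, the data is split along branches until it reaches a leaf node, which outputs the classification or prediction result.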
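
The root-to-leaf traversal can be sketched with a hand-written toy tree. The dictionary layout, the `predict_one` helper, and the thresholds below are illustrative assumptions, not scikit-learn's internal representation.

```python
# A hand-written toy tree: internal nodes test one feature against a threshold,
# leaf nodes hold the prediction. This layout is illustrative only.
toy_tree = {
    "feature": "x8", "threshold": 0.34,
    "left": {   # branch taken when x8 <= 0.34
        "feature": "x2", "threshold": 14.82,
        "left": {"leaf": 0},
        "right": {"leaf": 1},
    },
    "right": {"leaf": 0},  # branch taken when x8 > 0.34
}

def predict_one(node, sample):
    """Walk from the root down to a leaf and return the leaf's prediction."""
    while "leaf" not in node:
        branch = "left" if sample[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["leaf"]

print(predict_one(toy_tree, {"x8": 0.10, "x2": 15.0}))  # x8 <= 0.34, then x2 > 14.82
print(predict_one(toy_tree, {"x8": 0.50, "x2": 15.0}))  # x8 > 0.34
```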

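Characteristics

Decision trees are very easy to interpret. They are therefore widely used for tasks where explainability matters, and also to identify the conditions under which a particular event occurs.

Every split is a binary test of one feature against a threshold, so the result does not depend on feature scale and preprocessing such as scaling is unnecessary.

A decision tree can only cut the data space with axis-aligned boundaries (one feature and one threshold per split); it cannot draw an oblique or curved boundary directly and has to approximate one with many small steps. It is therefore important to define the feature space well, for example by transforming existing features or creating new ones.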
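
A minimal sketch of why feature construction matters, on synthetic data that is an assumption of this example (not the glass6 dataset used later): the true boundary is the diagonal x1 = x2, which a single axis-aligned split cannot follow, while the engineered feature x1 - x2 turns it into one threshold.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_raw = rng.uniform(0, 1, size=(400, 2))          # two features, x1 and x2
y_diag = (X_raw[:, 0] > X_raw[:, 1]).astype(int)  # true boundary is the diagonal x1 = x2

# Depth-1 tree on the raw features: one axis-aligned cut cannot follow the diagonal.
stump_raw = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_raw, y_diag)
acc_raw = stump_raw.score(X_raw, y_diag)

# Add the engineered feature x1 - x2; the diagonal becomes a single threshold near 0.
X_eng = np.column_stack([X_raw, X_raw[:, 0] - X_raw[:, 1]])
stump_eng = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_eng, y_diag)
acc_eng = stump_eng.score(X_eng, y_diag)

print(f"raw features: {acc_raw:.2f}, with x1 - x2: {acc_eng:.2f}")
```

Both scores are training accuracy, which is enough here because the point is what a single split can represent, not how well it generalizes.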

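Split criteria

At each node, candidate splits are evaluated to find the feature and threshold that separate the data best. The Gini index or entropy can be used as the evaluation criterion.

The difference in performance is usually small; Gini is slightly faster to compute because it needs no logarithm, while entropy has a direct information-theoretic reading (information gain).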
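
The two criteria can be sketched in a few lines of plain Python. The class counts below are made-up numbers for illustration, and the helper names are hypothetical; the formulas are the standard Gini impurity and Shannon entropy.

```python
import math

def gini(counts):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Shannon entropy in bits: -sum(p_k * log2(p_k)), skipping empty classes."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def weighted_impurity(impurity, children):
    """Sample-weighted average impurity of the child nodes after a split."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * impurity(child) for child in children)

parent = [50, 50]                 # class counts [n_0, n_1] before the split
split_a = [[40, 10], [10, 40]]    # a fairly clean split
split_b = [[30, 20], [20, 30]]    # a weaker split

for name, split in [("A", split_a), ("B", split_b)]:
    print(name,
          "gini decrease:", round(gini(parent) - weighted_impurity(gini, split), 3),
          "entropy decrease:", round(entropy(parent) - weighted_impurity(entropy, split), 3))
```

Under either criterion the split with the larger impurity decrease is preferred, so both rank split A above split B here.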

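Pruning

The deeper a decision tree grows, the more likely it is to overfit, so pruning is applied.

Pre-pruning limits growth in advance by capping the tree depth (max_depth) or setting the minimum number of samples per leaf node (min_samples_leaf), and post-pruning removes unnecessary nodes after the tree has been grown (ccp_alpha).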
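
A rough comparison of the three settings, assuming a synthetic dataset from `make_classification` and arbitrary parameter values (max_depth=3, min_samples_leaf=10, ccp_alpha=0.01) chosen only for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used here only to make the sketch self-contained.
X_syn, y_syn = make_classification(n_samples=500, n_features=10,
                                   n_informative=4, random_state=0)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(X_syn, y_syn, random_state=0)

full_tree = DecisionTreeClassifier(random_state=0).fit(Xs_train, ys_train)
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=10,
                                    random_state=0).fit(Xs_train, ys_train)
post_pruned = DecisionTreeClassifier(ccp_alpha=0.01,
                                     random_state=0).fit(Xs_train, ys_train)

for name, tree in [("full", full_tree), ("pre-pruned", pre_pruned),
                   ("post-pruned", post_pruned)]:
    print(name, "depth:", tree.get_depth(), "leaves:", tree.get_n_leaves(),
          "test accuracy:", round(tree.score(Xs_test, ys_test), 3))
```

A suitable ccp_alpha is normally searched for rather than fixed by hand; `DecisionTreeClassifier.cost_complexity_pruning_path` returns the candidate alpha values for a given training set.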

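Setting class weights

When the classes are imbalanced, class weights (w_0, w_1) can be used. Let n_0 and n_1 be the numbers of majority-class and minority-class samples in a leaf node: the leaf predicts y=0 if w_0*n_0 > w_1*n_1 and y=1 otherwise. The larger w_1 is, the wider the decision region of the minority class becomes.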
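
The rule can be written out directly. The function name and the leaf counts below are hypothetical and serve only to show how raising w_1 flips a leaf's prediction.

```python
def weighted_leaf_prediction(n0, n1, w0=1.0, w1=1.0):
    """Predict 0 if w0*n0 > w1*n1, otherwise 1 (the rule stated above)."""
    return 0 if w0 * n0 > w1 * n1 else 1

# A leaf holding 9 majority-class samples and 2 minority-class samples.
print(weighted_leaf_prediction(9, 2))            # unweighted: 9 > 2
print(weighted_leaf_prediction(9, 2, w1=5.0))    # weighted:   9 > 10 is false
```

With equal weights the leaf follows the majority; once w_1 is large enough that w_1*n_1 reaches w_0*n_0, the same leaf is assigned to the minority class.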

# Load and split the data
import pandas as pd
from sklearn.model_selection import train_test_split
df = pd.read_csv("../../data/classification/glass6.csv")
X = df.drop('y', axis = 1)
y = df['y']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 50)

# Train and evaluate the baseline model
from sklearn.tree import DecisionTreeClassifier as DTC
from sklearn.metrics import accuracy_score, recall_score
model = DTC(random_state = 50).fit(X_train, y_train)
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)  # recall of class 1 (pos_label defaults to 1)
print(acc, rec)

# Class imbalance ratio (value_counts() sorts counts in descending order)
num_majority_sample = y_train.value_counts().iloc[0]
num_minority_sample = y_train.value_counts().iloc[-1]
class_imbalance_ratio = num_majority_sample/num_minority_sample
print(class_imbalance_ratio)

# Weight the minority class by the imbalance ratio (assumes class 1 is the minority)
model2 = DTC(random_state = 50,
             class_weight = {0:1, 1:class_imbalance_ratio}).fit(X_train, y_train)
y_pred = model2.predict(X_test)
acc = accuracy_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)
print(acc, rec)
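
scikit-learn can also derive such weights itself: `class_weight="balanced"` uses n_samples / (n_classes * np.bincount(y)). The sketch below uses made-up labels (90 majority, 10 minority) to show that the resulting minority-to-majority weight ratio equals the imbalance ratio computed by hand above.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y_toy = np.array([0] * 90 + [1] * 10)   # hypothetical labels: 90 majority, 10 minority
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y_toy)
ratio = weights[1] / weights[0]          # relative weight of the minority class
print(weights, ratio)
```

Passing `class_weight="balanced"` to `DTC(...)` applies the same formula to the training labels, so the manual dictionary is mainly useful when a different ratio is wanted.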

# Give the minority class an even larger weight (x100)
model3 = DTC(random_state = 50,
             class_weight = {0:1, 1:class_imbalance_ratio * 100}).fit(X_train, y_train)
y_pred = model3.predict(X_test)
acc = accuracy_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)
print(acc, rec)

# Model interpretation
# Visualize as text
from sklearn.tree import export_text
r = export_text(model, feature_names=list(X_train.columns))
print(r)
# Visualize as a graph
from sklearn.tree import plot_tree
from matplotlib import pyplot as plt
plt.figure(figsize = (10, 8))
plot_tree(model, feature_names=list(X_train.columns), class_names = ["0", "1"])
plt.show()

# Convert the export_text output into a list of rules
def text_to_rule_list(r):
    node_list = []       # every node as [line index, rule text, depth]
    leaf_node_list = []  # leaf nodes only (lines containing "class")

    for i, node in enumerate(r.split("\n")[:-1]):  # [:-1] drops the trailing empty line
        rule = node.split('- ')[1]      # text after the "|--- " prefix
        indent = node.count(' ' * 3)    # one "|   " block per depth level
        if 'class' in rule:
            leaf_node_list.append([i, rule, indent])

        node_list.append([i, rule, indent])

    prediction_rule_list = []
    for leaf_node in leaf_node_list:
        prediction_rule = []
        idx, decision, indent = leaf_node
        # Walk up from the leaf: for each shallower depth, the nearest preceding
        # non-leaf line at that depth is the ancestor condition.
        for indent_level in range(indent-1, -1, -1):
            for node_idx in range(idx, -1, -1):
                node = node_list[node_idx]
                rule = node[1]
                if node[2] == indent_level and "class" not in node[1]:
                    prediction_rule.append(rule)
                    break
        prediction_rule_list.append([prediction_rule, decision])

    return prediction_rule_list

prediction_rule_list = text_to_rule_list(r)
for prediction_rule, decision in prediction_rule_list:
    print(" & ".join(prediction_rule), decision)
'''
x2 <= 14.82 & x8 <= 0.34 class: 0
x7 <= 9.85 & x2 >  14.82 & x8 <= 0.34 class: 1
x7 >  9.85 & x2 >  14.82 & x8 <= 0.34 class: 0
x2 <= 13.84 & x8 >  0.34 class: 0
x6 <= 1.57 & x2 >  13.84 & x8 >  0.34 class: 1
x6 >  1.57 & x2 >  13.84 & x8 >  0.34 class: 0
'''
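
As an alternative to parsing the printed text, the same rules can be read from the fitted estimator's `tree_` arrays (`children_left`, `children_right`, `feature`, `threshold`, `value`). This is a sketch on synthetic data with a hypothetical helper name, and unlike the function above it lists conditions from the root down to the leaf.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data and feature names, used only to make the sketch self-contained.
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
names = [f"x{i}" for i in range(X_demo.shape[1])]
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_demo, y_demo)

def tree_to_rules(clf, feature_names):
    """Return [(conditions from root to leaf, predicted class), ...] for every leaf."""
    t = clf.tree_
    rules = []

    def walk(node, conditions):
        if t.children_left[node] == -1:  # -1 marks a leaf in tree_.children_left
            predicted = clf.classes_[t.value[node][0].argmax()]
            rules.append((conditions, predicted))
            return
        name, thr = feature_names[t.feature[node]], t.threshold[node]
        walk(t.children_left[node], conditions + [f"{name} <= {thr:.2f}"])
        walk(t.children_right[node], conditions + [f"{name} > {thr:.2f}"])

    walk(0, [])
    return rules

rules = tree_to_rules(clf, names)
for conditions, predicted in rules:
    print(" & ".join(conditions), "class:", predicted)
```

Reading the arrays avoids depending on the exact text layout of export_text, at the cost of relying on the lower-level `tree_` attribute.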

Reference: GIL's LAB, 「파이썬을 활용한 머신러닝 자동화 시스템 구축」 (Building an Automated Machine Learning System with Python), 위키북스 (Wikibooks), 2022