Overview of Decision Trees

Definition of a Decision Tree

A decision tree is a supervised learning model used for both classification and regression tasks. Its structure resembles a tree, consisting of a root node, internal nodes, and leaf nodes. By splitting on feature attributes layer by layer, it partitions the sample space into different subregions and finally outputs a decision based on the class or value stored at a leaf node. It is intuitive and easy to interpret.

Example:

(figure omitted)

Definition of Entropy

Entropy is a core concept in information theory that measures the uncertainty of a random event; it describes how "disordered" or "unknown" the outcomes of a system or variable are. Simply put, the higher the entropy, the more uncertain the outcome; the lower the entropy, the more certain the outcome.

The Role of Entropy in Decision Trees

In decision tree algorithms, entropy is mainly used to evaluate the "purity" of the data. In the ID3 algorithm, by computing the change in entropy of the data subsets before and after a split (i.e., the information gain), the algorithm selects the feature that reduces uncertainty the most as the node, so that the tree's branches separate the classes more effectively and classification accuracy improves. Entropy lets the algorithm quantify how valuable a feature is for partitioning the data and is a key tool for building an effective decision tree.

Entropy Formula

$$H(X) = -\sum_{i=1}^{n} p(x_i)\log_2 p(x_i)$$

The code implementation for computing entropy is as follows:

import math
from collections import Counter

def entropy(labels):
    num_labels = len(labels)
    if num_labels == 0:
        return 0
    # Count the occurrences of each label and accumulate -p * log2(p)
    label_counts = Counter(labels)
    entropy_value = 0
    for count in label_counts.values():
        probability = count / num_labels
        entropy_value -= probability * math.log2(probability)
    return entropy_value
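
For example, a set with two samples of one class and one of another has an entropy of about 0.918 bits, while a pure set has entropy 0:

print(entropy(['yes', 'yes', 'no']))   # ≈ 0.918
print(entropy(['yes', 'yes', 'yes']))  # 0.0, a pure set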

Conditional Entropy

$$H(X|Y) = -\sum_{x,y} p(x,y)\log p(x|y)$$
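
A minimal code sketch for conditional entropy, reusing the entropy function above (so base-2 logs); the input format, two equal-length lists of discrete values X and Y, is an assumption made for illustration:

def conditional_entropy(X, Y):
    # H(X|Y) computed as the weighted average of H(X | Y = y),
    # which is equivalent to the double-sum formula above
    total = len(X)
    groups = {}
    for x, y in zip(X, Y):
        groups.setdefault(y, []).append(x)
    return sum((len(g) / total) * entropy(g) for g in groups.values())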

Information Gain

$$Gain(D, a) = Ent(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|}\,Ent(D^v)$$
The code implementation for computing information gain is as follows:

def information_gain(parent_labels, splits):
    # Information gain = parent entropy minus the weighted entropy of the child subsets
    parent_entropy = entropy(parent_labels)
    total = len(parent_labels)
    return parent_entropy - sum(
        (len(split) / total) * entropy(split)
        for split in splits.values()
    )
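
As a quick check, a split that separates the two classes perfectly recovers the full parent entropy as the gain:

parent = ['yes', 'yes', 'no', 'no']                # parent entropy = 1.0 bit
splits = {'a': ['yes', 'yes'], 'b': ['no', 'no']}  # both child subsets are pure
print(information_gain(parent, splits))            # 1.0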

Gini Index

The CART algorithm uses the Gini index to evaluate candidate splits.

Computing the Gini Index

$$Gini(X) = 1 - \sum_{i=1}^{k} p_i^2$$

The code implementation for computing the Gini index is as follows:

from collections import Counter

def gini(labels):
    # 1 minus the sum of squared class probabilities
    counts = Counter(labels)
    total = len(labels)
    return 1 - sum((count / total) ** 2 for count in counts.values())
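
For example, an evenly mixed two-class set has the maximum two-class Gini value of 0.5, while a pure set has a Gini value of 0:

print(gini(['yes', 'yes', 'no', 'no']))  # 0.5
print(gini(['yes', 'yes', 'yes']))       # 0.0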

Building a Decision Tree

Construction Process

(1) Select the best splitting attribute

Evaluate attributes with criteria such as information gain and choose the one that most improves the purity of the resulting subsets; this determines how the node splits and directly affects classification performance.

(2) Build the subtrees

Partition the dataset into subsets according to the chosen attribute, recursively create child nodes, and repeat the attribute-selection step until a tree structure that classifies the data is formed.

(3) Prune the tree

Apply pre-pruning or post-pruning to remove branches that contribute little, mitigating overfitting and improving the model's generalization ability (a minimal pre-pruning sketch follows this list).

(4) Generate the decision tree

After attribute selection, subtree construction, and pruning, the resulting model outputs a classification or prediction for a sample according to its attribute values.
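
The following is a minimal pre-pruning sketch, not part of the experiment below: it reuses the best_split_id3 helper from the ID3 implementation further down, and the max_depth and min_samples thresholds are illustrative assumptions.

from collections import Counter

def build_pruned(X, y, features, depth=0, max_depth=3, min_samples=2):
    # Pre-pruning: stop growing when the node is pure, no features remain,
    # the depth limit is reached, or the node holds too few samples
    if len(set(y)) == 1:
        return y[0]
    if not features or depth >= max_depth or len(y) < min_samples:
        return Counter(y).most_common(1)[0][0]
    best = best_split_id3(X, y, features)
    tree = {best: {}}
    rest = [f for f in features if f != best]
    for value in set(x[best] for x in X):
        sub_X = [x for x in X if x[best] == value]
        sub_y = [y[i] for i, x in enumerate(X) if x[best] == value]
        tree[best][value] = build_pruned(sub_X, sub_y, rest,
                                         depth + 1, max_depth, min_samples)
    return tree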

Decision Tree Construction Algorithms

There are three common algorithms for building decision trees: ID3, C4.5, and CART.

Differences Among the Three Algorithms

(1) ID3 algorithm

ID3 uses information gain as the criterion for choosing the splitting attribute, selecting the attribute with the largest information gain.

(2) C4.5 algorithm

C4.5 selects the attribute with the highest gain ratio (information gain normalized by the split's intrinsic value), which corrects ID3's bias toward attributes with many distinct values.
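
C4.5 itself is not implemented in this experiment; the following is only a minimal sketch of the gain-ratio criterion, reusing the entropy and information_gain helpers above:

import math

def gain_ratio(parent_labels, splits):
    gain = information_gain(parent_labels, splits)
    total = len(parent_labels)
    # Intrinsic value (split information): entropy of the partition sizes
    iv = -sum((len(s) / total) * math.log2(len(s) / total)
              for s in splits.values() if len(s) > 0)
    return gain / iv if iv > 0 else 0.0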

(3) CART algorithm

CART uses the Gini index to choose the splitting attribute, selecting the attribute with the smallest weighted Gini index.

Decision Tree Construction Implementation

This experiment implements two of these construction methods: ID3 and CART.

(1) ID3 algorithm

import math
from collections import Counter


def read_data(filename):
    # Each line: comma-separated feature values, with the class label in the last field
    with open(filename, 'r') as f:
        data = [line.strip().split(',') for line in f]
    return [sample[:-1] for sample in data], [sample[-1] for sample in data]

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((count / total) * math.log2(count / total) for count in counts.values())

def information_gain(parent_labels, splits):
    parent_entropy = entropy(parent_labels)
    total = len(parent_labels)
    return parent_entropy - sum(
        (len(split) / total) * entropy(split)
        for split in splits.values()
    )

def best_split_id3(X, y, features):
    # Return the index of the feature with the largest information gain
    best_gain = -1
    best_feature = None

    for feature in features:
        # Group the labels by this feature's value
        splits = {}
        for i, val in enumerate([x[feature] for x in X]):
            splits.setdefault(val, []).append(y[i])

        gain = information_gain(y, splits)
        if gain > best_gain:
            best_gain = gain
            best_feature = feature

    return best_feature

def build_id3(X, y, features):
    # The tree is a nested dict: {feature_index: {feature_value: subtree}};
    # leaves are plain class labels
    if len(set(y)) == 1:
        return y[0]

    if not features:
        return Counter(y).most_common(1)[0][0]

    best_feature = best_split_id3(X, y, features)
    tree = {best_feature: {}}

    new_features = [f for f in features if f != best_feature]
    feature_values = set(sample[best_feature] for sample in X)

    for value in feature_values:
        sub_X = [x for x in X if x[best_feature] == value]
        sub_y = [y[i] for i, x in enumerate(X) if x[best_feature] == value]
        tree[best_feature][value] = build_id3(sub_X, sub_y, new_features)

    return tree

X_train, y_train = read_data('dataset.txt')
features = list(range(len(X_train[0])))
id3_tree = build_id3(X_train, y_train, features)
print("ID3决策树结构:", id3_tree)

(2) CART algorithm

from collections import Counter


def read_data(filename):
    with open(filename, 'r') as f:
        data = [line.strip().split(',') for line in f]
    return [sample[:-1] for sample in data], [sample[-1] for sample in data]

def gini(labels):
    counts = Counter(labels)
    total = len(labels)
    return 1 - sum((count / total) ** 2 for count in counts.values())

def gini_for_feature(X, y, feature):
    # Weighted Gini index of the subsets produced by splitting on this feature
    splits = {}
    for i, val in enumerate([x[feature] for x in X]):
        splits.setdefault(val, []).append(y[i])
    return sum((len(split) / len(y)) * gini(split) for split in splits.values())

def best_split_cart(X, y, features):
    # Choose the feature with the smallest weighted Gini index
    best_gini = float('inf')
    best_feature = None
    for feature in features:
        gini_val = gini_for_feature(X, y, feature)
        if gini_val < best_gini:
            best_gini, best_feature = gini_val, feature
    return best_feature

def build_cart(X, y, features):
    # Same nested-dict structure as build_id3; the splitting criterion is the Gini index
    if len(set(y)) == 1:
        return y[0]
    if not features:
        return Counter(y).most_common(1)[0][0]
    best_feature = best_split_cart(X, y, features)
    tree = {best_feature: {}}
    # Remove the used feature before recursing; with multi-way categorical splits,
    # keeping it could select the same feature again and recurse indefinitely
    new_features = [f for f in features if f != best_feature]
    feature_values = set(sample[best_feature] for sample in X)
    for value in feature_values:
        sub_X = [x for x in X if x[best_feature] == value]
        sub_y = [y[i] for i, x in enumerate(X) if x[best_feature] == value]
        tree[best_feature][value] = build_cart(sub_X, sub_y, new_features)
    return tree

X_train, y_train = read_data('dataset.txt')
features = list(range(len(X_train[0])))

print("候选特征基尼指数:")
for i in features:
    print(f"特征{i}: {gini_for_feature(X_train, y_train, i):.3f}")
    
cart_tree = build_cart(X_train, y_train, features)
print("\nCART决策树结构:", cart_tree)

Decision Tree Construction Results

Experimental results: (figure omitted)
In this experiment, decision trees were built with both the ID3 and CART algorithms: ID3 is based on information gain, and CART is based on the Gini index. On this training and test set, the two algorithms produced identical tree structures, both using feature 2 as the root node, and the CART output shows that feature 2 has the smallest Gini index, indicating that feature 2 is optimal under both evaluation criteria.
