I. Introduction
Decision trees show up in pretty much every introductory machine learning book. Their decision process is very close to how we reason in everyday life, so they are easy to understand, and they come with a good dose of information theory built in, which makes them a nice first application of those ideas. Decision trees exist for both regression and classification, but here we will mainly talk about classification, since it is easier to follow.
Although this is a simple algorithm, implementing it is still a bit fiddly, because features can be either discrete or continuous, and the implementation has to take care of both cases.
(Decision nodes on different kinds of features; image from [1])
O - A few points from information theory:
First, have a look here: http://blog.csdn.net/dark_scope/article/details/8459576
Then we add one more concept, called information gain:
□ Information gain (information gain):
g(D,A) = H(D) - H(D|A)
It measures how much feature A reduces the uncertainty about the classification of dataset D.
□ Information gain ratio (information gain ratio):
g_R(D,A) = g(D,A) / H_A(D)
where H_A(D) is the entropy of D with respect to the values of feature A (the split information); dividing by it penalizes features that take many distinct values.
□ Gini index:
Gini(D) = 1 - Σ_k p_k², where p_k is the proportion of samples in D that belong to class k; the smaller the Gini index, the purer the set. A small numerical sketch of these quantities follows below.
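To make the definitions concrete, here is a small sketch (my own illustration, not code from this post) that computes the empirical entropy H(D), the information gain g(D,A) of one discrete feature, and the Gini index on a toy label vector; entropy, info_gain and gini are hypothetical helper names.

from __future__ import division
import numpy as np

def entropy(y):
    # H(D): empirical entropy of the class labels in y
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_gain(x, y):
    # g(D,A) = H(D) - H(D|A) for a single discrete feature column x
    cond = 0.0
    for v in np.unique(x):
        mask = (x == v)
        cond += mask.mean() * entropy(y[mask])
    return entropy(y) - cond

def gini(y):
    # Gini(D) = 1 - sum_k p_k^2
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

# toy example: feature A takes two values, the label is 0/1
A = np.array([0, 0, 1, 1, 1, 1])
y = np.array([0, 1, 1, 1, 1, 1])
print(entropy(y))        # H(D)
print(info_gain(A, y))   # g(D,A)
print(gini(y))           # Gini(D)

For this toy data H(D) ≈ 0.65, g(D,A) ≈ 0.32 (the branch A=1 is pure) and Gini(D) ≈ 0.28.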
II. The algorithms
1. ID3
The ID3 algorithm computes the information gain of every feature, picks the feature with the largest gain as the decision node, and splits the data into subsets according to the values of that feature;
it then recursively builds a decision tree on each subset.
(Figure from [1])
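As a rough sketch of that recursion (again my own illustration, not the repository code), ID3 fits in a few lines, assuming every feature is discrete and reusing the hypothetical info_gain helper from the sketch above:

def id3(X, y, features):
    # all samples share one label: return it as a leaf
    if np.unique(y).size == 1:
        return y[0]
    # no features left to split on: majority vote
    if not features:
        vals, counts = np.unique(y, return_counts=True)
        return vals[np.argmax(counts)]
    # choose the feature with the largest information gain
    best = max(features, key=lambda f: info_gain(X[:, f], y))
    node = {}
    # one branch per value of the chosen feature
    for v in np.unique(X[:, best]):
        mask = (X[:, best] == v)
        node[(best, v)] = id3(X[mask], y[mask], [f for f in features if f != best])
    return node

Calling id3(X, y, list(range(X.shape[1]))) returns a nested dict keyed by (feature, value) pairs, which is one convenient way to store the tree.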
2. C4.5
Compared with ID3, C4.5 simply replaces information gain with the information gain ratio, because information gain has a drawback:
when selecting attributes, information gain favours attributes with many distinct values.
The overall procedure is otherwise much the same as ID3. (Figure from [2])
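A minimal sketch of that correction (an illustration, reusing the hypothetical entropy and info_gain helpers and the toy y from above): the gain is divided by the split information H_A(D), so a feature with very many distinct values, such as a sample ID, gets a large denominator and stops looking artificially attractive.

def gain_ratio(x, y):
    # g(D,A) / H_A(D); H_A(D) is simply the entropy of the feature values themselves
    split_info = entropy(x)
    if split_info == 0:
        return 0.0
    return info_gain(x, y) / split_info

# an ID-like feature separates every sample, so its raw gain equals H(D) ...
ids = np.arange(y.size)
print(info_gain(ids, y))   # about 0.65 here
# ... but its gain ratio is heavily discounted
print(gain_ratio(ids, y))  # about 0.25 here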
3. CART
The overall flow of CART (classification and regression tree) is much like the algorithms above, but a CART tree is binary:
every decision node asks a yes/no question. In other words, even when a feature has several possible values, each split picks just one of them and divides the data into
two parts rather than many. Here we only cover the classification tree, which uses the Gini index:
(Figure from [2])
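Before the full code, this is what "a binary split chosen by the Gini index" boils down to: for every candidate (feature k, value v) pair, compute the weighted Gini index of the two resulting subsets, Gini(D, A) = |D1|/|D| * Gini(D1) + |D2|/|D| * Gini(D2), and keep the pair with the smallest value. A hedged sketch using the hypothetical gini helper from above (it plays the same role as pGini in the code below):

def split_gini(x, y, v):
    # weighted Gini index of splitting on "x == v" (use "x >= v" for a continuous feature)
    mask = (x == v)
    return mask.mean() * gini(y[mask]) + (~mask).mean() * gini(y[~mask])

# exhaustively pick the best (feature, value) pair on a tiny discrete dataset
X = np.array([[0, 1], [0, 0], [1, 1], [1, 0]])
y = np.array([0, 0, 1, 1])
pairs = [(k, v) for k in range(X.shape[1]) for v in np.unique(X[:, k])]
best = min(pairs, key=lambda kv: split_gini(X[:, kv[0]], y, kv[1]))
print(best)   # feature 0 separates the two classes perfectly, so it wins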
III. Code and implementation
Alright, honestly I just wanted to paste some code... It lives at https://github.com/justdark/dml/tree/master/dml/DT
It is purely a toy~~~~~ implementation of the CART algorithm:
from __future__ import division
import numpy as np
import scipy as sp
import pylab as py

def pGini(y):
    # Gini(D) = 1 - sum_k p_k^2 over the class labels in y
    ty = y.reshape(-1,).tolist()
    label = set(ty)
    sum = 0
    num_case = y.shape[0]
    for i in label:
        sum += (np.count_nonzero(y == i) / num_case) ** 2
    return 1 - sum

class DTC:
    def __init__(self, X, y, property=None):
        '''
        this is the class of Decision Tree
        X is a M*N array where M stands for the training case number
            N is the number of features
        y is a M*1 vector
        property is a binary vector of size N
            property[i]==0 means the i-th feature is discrete, otherwise it's continuous
            by default, all features are treated as discrete
        '''
        '''
        One caveat: an ndarray can only hold a single dtype, so if X contains any
        string value, everything gets cast to string and continuous features break.
        So remember: if you have continuous features, DON'T PUT any STRING IN X!
        '''
        self.X = np.array(X)
        self.y = np.array(y)
        self.feature_dict = {}
        # self.labels holds the original label values, self.y their integer codes
        self.labels, self.y = np.unique(y, return_inverse=True)
        self.DT = list()
        if property is None:
            self.property = np.zeros((self.X.shape[1], 1))
        else:
            self.property = property
        # candidate split values for each feature: every distinct value seen in X
        for i in range(self.X.shape[1]):
            self.feature_dict.setdefault(i)
            self.feature_dict[i] = np.unique(self.X[:, i])
        if self.X.shape[0] != self.y.shape[0]:
            print("the shape of X and y is not right")

    def Gini(self, X, y, k, k_v):
        # weighted Gini index of the binary split on feature k at value k_v
        D = y.shape[0]
        if self.property[k] == 0:
            # discrete feature: split into "== k_v" and "!= k_v"
            c1 = (X[X[:, k] == k_v]).shape[0]
            c2 = (X[X[:, k] != k_v]).shape[0]
            return c1 * pGini(y[X[:, k] == k_v]) / D + c2 * pGini(y[X[:, k] != k_v]) / D
        else:
            # continuous feature: split into ">= k_v" and "< k_v"
            c1 = (X[X[:, k] >= k_v]).shape[0]
            c2 = (X[X[:, k] < k_v]).shape[0]
            return c1 * pGini(y[X[:, k] >= k_v]) / D + c2 * pGini(y[X[:, k] < k_v]) / D

    def makeTree(self, X, y):
        min = 10000.0
        m_i, m_j = 0, 0
        if np.unique(y).size <= 1:
            # pure node: return the encoded label index, pred() maps it back to a label
            return y[0]
        # search every (feature, value) pair for the smallest weighted Gini index
        for i in range(self.X.shape[1]):
            for j in self.feature_dict[i]:
                p = self.Gini(X, y, i, j)
                if p < min:
                    min = p
                    m_i, m_j = i, j
        if min == 1:
            return y[0]
        if self.property[m_i] == 0:
            left = self.makeTree(X[X[:, m_i] == m_j], y[X[:, m_i] == m_j])
            right = self.makeTree(X[X[:, m_i] != m_j], y[X[:, m_i] != m_j])
        else:
            left = self.makeTree(X[X[:, m_i] >= m_j], y[X[:, m_i] >= m_j])
            right = self.makeTree(X[X[:, m_i] < m_j], y[X[:, m_i] < m_j])
        # an internal node is [(feature, split value), left subtree, right subtree]
        return [(m_i, m_j), left, right]

    def train(self):
        self.DT = self.makeTree(self.X, self.y)
        print(self.DT)

    def pred(self, X):
        X = np.array(X)
        result = np.zeros((X.shape[0], 1))
        for i in range(X.shape[0]):
            tp = self.DT
            # walk down the nested lists until a leaf (a label index) is reached
            while type(tp) is list:
                a, b = tp[0]
                if self.property[a] == 0:
                    if X[i][a] == b:
                        tp = tp[1]
                    else:
                        tp = tp[2]
                else:
                    if X[i][a] >= b:
                        tp = tp[1]
                    else:
                        tp = tp[2]
            result[i] = self.labels[tp]
        return result
This makeTree really reminds me of segment trees... The variables in the code are mostly explained in the comments and docstrings.
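Concretely, the tree that makeTree returns is just nested Python lists: an internal node is [(feature_index, split_value), left_subtree, right_subtree], and a leaf is a bare label index. As a purely hypothetical illustration (this is not the tree learned from the data below), a prediction walks the structure exactly the way pred does:

# hypothetical tree: split on feature 4 (continuous, threshold 5.0),
# then on feature 0 (discrete, value 1); leaves are indices into self.labels
tree = [(4, 5.0), 1, [(0, 1), 0, 1]]
prop = [0, 0, 0, 0, 1]    # feature 4 is continuous, the rest are discrete

def lookup(tree, sample, prop):
    node = tree
    while type(node) is list:
        k, v = node[0]
        go_left = (sample[k] == v) if prop[k] == 0 else (sample[k] >= v)
        node = node[1] if go_left else node[2]
    return node

print(lookup(tree, [1, 0, 0, 0, 3.5], prop))   # 3.5 < 5.0, then feature 0 == 1, so leaf 0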
Test code:
from __future__ import division
import numpy as np
import scipy as sp
from dml.DT import DTC

X = np.array([
    [0, 0, 0, 0, 8],
    [0, 0, 0, 1, 3.5],
    [0, 1, 0, 1, 3.5],
    [0, 1, 1, 0, 3.5],
    [0, 0, 0, 0, 3.5],
    [1, 0, 0, 0, 3.5],
    [1, 0, 0, 1, 3.5],
    [1, 1, 1, 1, 2],
    [1, 0, 1, 2, 3.5],
    [1, 0, 1, 2, 3.5],
    [2, 0, 1, 2, 3.5],
    [2, 0, 1, 1, 3.5],
    [2, 1, 0, 1, 3.5],
    [2, 1, 0, 2, 3.5],
    [2, 0, 0, 0, 10],
])
y = np.array([
    [1],
    [0],
    [1],
    [1],
    [0],
    [0],
    [0],
    [1],
    [1],
    [1],
    [1],
    [1],
    [1],
    [1],
    [1],
])
prop = np.zeros((5, 1))
prop[4] = 1   # the last feature (index 4) is continuous, all others are discrete
a = DTC(X, y, prop)
a.train()
print(a.pred([[0, 0, 0, 0, 3.0], [2, 1, 0, 1, 2]]))
You can see that it does learn a decision tree.
Drawn out it looks roughly like this; note that the last feature (index 4) is a continuous variable.
Original post: https://blog.csdn.net/dark_scope/article/details/13168827