I. Introduction
Decision trees show up in pretty much every introductory machine learning book. Their decision process is very close to how we reason in everyday life, so they are easy to understand, and they come with a good dose of information theory built in, which makes them a nice first application of those ideas. Decision trees exist for both regression and classification, but here we will mainly talk about classification, since it is easier to follow.
Although this is a simple algorithm, implementing it is still a bit fiddly, because features can be either discrete or continuous, and the implementation has to take care of both cases.
(Decision nodes on different kinds of features; image from [1])
O - A few points from information theory:
First, have a look here: http://blog.csdn.net/dark_scope/article/details/8459576
Then we add one more concept, called information gain:
□ Information gain (information gain):
g(D,A) = H(D) - H(D|A)
It measures how much feature A reduces the uncertainty about the classification of dataset D.
□ Information gain ratio (information gain ratio):
g_R(D,A) = g(D,A) / H_A(D)
where H_A(D) is the entropy of D with respect to the values of feature A (the split information); dividing by it penalizes features that take many distinct values.
□ Gini index:
Gini(D) = 1 - Σ_k p_k², where p_k is the proportion of samples in D that belong to class k; the smaller the Gini index, the purer the set. A small numerical sketch of these quantities follows below.
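To make the definitions concrete, here is a small sketch (my own illustration, not code from this post) that computes the empirical entropy H(D), the information gain g(D,A) of one discrete feature, and the Gini index on a toy label vector; entropy, info_gain and gini are hypothetical helper names.

from __future__ import division
import numpy as np

def entropy(y):
    # H(D): empirical entropy of the class labels in y
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_gain(x, y):
    # g(D,A) = H(D) - H(D|A) for a single discrete feature column x
    cond = 0.0
    for v in np.unique(x):
        mask = (x == v)
        cond += mask.mean() * entropy(y[mask])
    return entropy(y) - cond

def gini(y):
    # Gini(D) = 1 - sum_k p_k^2
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

# toy example: feature A takes two values, the label is 0/1
A = np.array([0, 0, 1, 1, 1, 1])
y = np.array([0, 1, 1, 1, 1, 1])
print(entropy(y))        # H(D)
print(info_gain(A, y))   # g(D,A)
print(gini(y))           # Gini(D)

For this toy data H(D) ≈ 0.65, g(D,A) ≈ 0.32 (the branch A=1 is pure) and Gini(D) ≈ 0.28.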
II. The algorithms
1. ID3
The ID3 algorithm computes the information gain of every feature, picks the feature with the largest gain as the decision node, and splits the data into subsets according to the values of that feature;
it then recursively builds a decision tree on each subset.
(Figure from [1])
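As a rough sketch of that recursion (again my own illustration, not the repository code), ID3 fits in a few lines, assuming every feature is discrete and reusing the hypothetical info_gain helper from the sketch above:

def id3(X, y, features):
    # all samples share one label: return it as a leaf
    if np.unique(y).size == 1:
        return y[0]
    # no features left to split on: majority vote
    if not features:
        vals, counts = np.unique(y, return_counts=True)
        return vals[np.argmax(counts)]
    # choose the feature with the largest information gain
    best = max(features, key=lambda f: info_gain(X[:, f], y))
    node = {}
    # one branch per value of the chosen feature
    for v in np.unique(X[:, best]):
        mask = (X[:, best] == v)
        node[(best, v)] = id3(X[mask], y[mask], [f for f in features if f != best])
    return node

Calling id3(X, y, list(range(X.shape[1]))) returns a nested dict keyed by (feature, value) pairs, which is one convenient way to store the tree.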
2. C4.5
Compared with ID3, C4.5 simply replaces information gain with the information gain ratio, because information gain has a drawback:
when selecting attributes, information gain favours attributes with many distinct values.
The overall procedure is otherwise much the same as ID3. (Figure from [2])
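A minimal sketch of that correction (an illustration, reusing the hypothetical entropy and info_gain helpers and the toy y from above): the gain is divided by the split information H_A(D), so a feature with very many distinct values, such as a sample ID, gets a large denominator and stops looking artificially attractive.

def gain_ratio(x, y):
    # g(D,A) / H_A(D); H_A(D) is simply the entropy of the feature values themselves
    split_info = entropy(x)
    if split_info == 0:
        return 0.0
    return info_gain(x, y) / split_info

# an ID-like feature separates every sample, so its raw gain equals H(D) ...
ids = np.arange(y.size)
print(info_gain(ids, y))   # about 0.65 here
# ... but its gain ratio is heavily discounted
print(gain_ratio(ids, y))  # about 0.25 here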
3. CART
The overall flow of CART (classification and regression tree) is much like the algorithms above, but a CART tree is binary:
every decision node asks a yes/no question. In other words, even when a feature has several possible values, each split picks just one of them and divides the data into
two parts rather than many. Here we only cover the classification tree, which uses the Gini index:
(Figure from [2])
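Before the full code, this is what "a binary split chosen by the Gini index" boils down to: for every candidate (feature k, value v) pair, compute the weighted Gini index of the two resulting subsets, Gini(D, A) = |D1|/|D| * Gini(D1) + |D2|/|D| * Gini(D2), and keep the pair with the smallest value. A hedged sketch using the hypothetical gini helper from above (it plays the same role as pGini in the code below):

def split_gini(x, y, v):
    # weighted Gini index of splitting on "x == v" (use "x >= v" for a continuous feature)
    mask = (x == v)
    return mask.mean() * gini(y[mask]) + (~mask).mean() * gini(y[~mask])

# exhaustively pick the best (feature, value) pair on a tiny discrete dataset
X = np.array([[0, 1], [0, 0], [1, 1], [1, 0]])
y = np.array([0, 0, 1, 1])
pairs = [(k, v) for k in range(X.shape[1]) for v in np.unique(X[:, k])]
best = min(pairs, key=lambda kv: split_gini(X[:, kv[0]], y, kv[1]))
print(best)   # feature 0 separates the two classes perfectly, so it wins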
III. Code and implementation
Alright, honestly I just wanted to paste some code... It lives at https://github.com/justdark/dml/tree/master/dml/DT
It is purely a toy~~~~~ implementation of the CART algorithm:
from __future__ import division
import numpy as np
import scipy as sp
import pylab as py

def pGini(y):
    # Gini(D) = 1 - sum_k p_k^2 over the class labels in y
    ty = y.reshape(-1,).tolist()
    label = set(ty)
    sum = 0
    num_case = y.shape[0]
    for i in label:
        sum += (np.count_nonzero(y == i) / num_case) ** 2
    return 1 - sum

class DTC:
    def __init__(self, X, y, property=None):
        '''
        this is the class of Decision Tree
        X is a M*N array where M stands for the training case number
            N is the number of features
        y is a M*1 vector
        property is a binary vector of size N
            property[i]==0 means the i-th feature is discrete, otherwise it's continuous
            by default, all features are treated as discrete
        '''
        '''
        One caveat: an ndarray can only hold a single dtype, so if X contains any
        string value, everything gets cast to string and continuous features break.
        So remember: if you have continuous features, DON'T PUT any STRING IN X!
        '''
        self.X = np.array(X)
        self.y = np.array(y)
        self.feature_dict = {}
        # self.labels holds the original label values, self.y their integer codes
        self.labels, self.y = np.unique(y, return_inverse=True)
        self.DT = list()
        if property is None:
            self.property = np.zeros((self.X.shape[1], 1))
        else:
            self.property = property
        # candidate split values for each feature: every distinct value seen in X
        for i in range(self.X.shape[1]):
            self.feature_dict.setdefault(i)
            self.feature_dict[i] = np.unique(self.X[:, i])
        if self.X.shape[0] != self.y.shape[0]:
            print("the shape of X and y is not right")

    def Gini(self, X, y, k, k_v):
        # weighted Gini index of the binary split on feature k at value k_v
        D = y.shape[0]
        if self.property[k] == 0:
            # discrete feature: split into "== k_v" and "!= k_v"
            c1 = (X[X[:, k] == k_v]).shape[0]
            c2 = (X[X[:, k] != k_v]).shape[0]
            return c1 * pGini(y[X[:, k] == k_v]) / D + c2 * pGini(y[X[:, k] != k_v]) / D
        else:
            # continuous feature: split into ">= k_v" and "< k_v"
            c1 = (X[X[:, k] >= k_v]).shape[0]
            c2 = (X[X[:, k] < k_v]).shape[0]
            return c1 * pGini(y[X[:, k] >= k_v]) / D + c2 * pGini(y[X[:, k] < k_v]) / D

    def makeTree(self, X, y):
        min = 10000.0
        m_i, m_j = 0, 0
        if np.unique(y).size <= 1:
            # pure node: return the encoded label index, pred() maps it back to a label
            return y[0]
        # search every (feature, value) pair for the smallest weighted Gini index
        for i in range(self.X.shape[1]):
            for j in self.feature_dict[i]:
                p = self.Gini(X, y, i, j)
                if p < min:
                    min = p
                    m_i, m_j = i, j
        if min == 1:
            return y[0]
        if self.property[m_i] == 0:
            left = self.makeTree(X[X[:, m_i] == m_j], y[X[:, m_i] == m_j])
            right = self.makeTree(X[X[:, m_i] != m_j], y[X[:, m_i] != m_j])
        else:
            left = self.makeTree(X[X[:, m_i] >= m_j], y[X[:, m_i] >= m_j])
            right = self.makeTree(X[X[:, m_i] < m_j], y[X[:, m_i] < m_j])
        # an internal node is [(feature, split value), left subtree, right subtree]
        return [(m_i, m_j), left, right]

    def train(self):
        self.DT = self.makeTree(self.X, self.y)
        print(self.DT)

    def pred(self, X):
        X = np.array(X)
        result = np.zeros((X.shape[0], 1))
        for i in range(X.shape[0]):
            tp = self.DT
            # walk down the nested lists until a leaf (a label index) is reached
            while type(tp) is list:
                a, b = tp[0]
                if self.property[a] == 0:
                    if X[i][a] == b:
                        tp = tp[1]
                    else:
                        tp = tp[2]
                else:
                    if X[i][a] >= b:
                        tp = tp[1]
                    else:
                        tp = tp[2]
            result[i] = self.labels[tp]
        return result
This makeTree really reminds me of segment trees... The variables in the code are mostly explained in the comments and docstrings.
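Concretely, the tree that makeTree returns is just nested Python lists: an internal node is [(feature_index, split_value), left_subtree, right_subtree], and a leaf is a bare label index. As a purely hypothetical illustration (this is not the tree learned from the data below), a prediction walks the structure exactly the way pred does:

# hypothetical tree: split on feature 4 (continuous, threshold 5.0),
# then on feature 0 (discrete, value 1); leaves are indices into self.labels
tree = [(4, 5.0), 1, [(0, 1), 0, 1]]
prop = [0, 0, 0, 0, 1]    # feature 4 is continuous, the rest are discrete

def lookup(tree, sample, prop):
    node = tree
    while type(node) is list:
        k, v = node[0]
        go_left = (sample[k] == v) if prop[k] == 0 else (sample[k] >= v)
        node = node[1] if go_left else node[2]
    return node

print(lookup(tree, [1, 0, 0, 0, 3.5], prop))   # 3.5 < 5.0, then feature 0 == 1, so leaf 0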
Test code:
from __future__ import division
import numpy as np
import scipy as sp
from dml.DT import DTC

X = np.array([
    [0, 0, 0, 0, 8],
    [0, 0, 0, 1, 3.5],
    [0, 1, 0, 1, 3.5],
    [0, 1, 1, 0, 3.5],
    [0, 0, 0, 0, 3.5],
    [1, 0, 0, 0, 3.5],
    [1, 0, 0, 1, 3.5],
    [1, 1, 1, 1, 2],
    [1, 0, 1, 2, 3.5],
    [1, 0, 1, 2, 3.5],
    [2, 0, 1, 2, 3.5],
    [2, 0, 1, 1, 3.5],
    [2, 1, 0, 1, 3.5],
    [2, 1, 0, 2, 3.5],
    [2, 0, 0, 0, 10],
])
y = np.array([
    [1],
    [0],
    [1],
    [1],
    [0],
    [0],
    [0],
    [1],
    [1],
    [1],
    [1],
    [1],
    [1],
    [1],
    [1],
])
prop = np.zeros((5, 1))
prop[4] = 1   # the last feature (index 4) is continuous, all others are discrete
a = DTC(X, y, prop)
a.train()
print(a.pred([[0, 0, 0, 0, 3.0], [2, 1, 0, 1, 2]]))
You can see that it does learn a decision tree.
Drawn out it looks roughly like this; note that the last feature (index 4) is a continuous variable.
Original post: https://blog.csdn.net/dark_scope/article/details/13168827