天天看點

基于最小二乘法的一般多元線性回歸的實戰一、說明二、資料項說明三、實戰部分

文章目錄

  • 一、說明
  • 二、資料項說明
  • 三、實戰部分

一、說明

我是在jupyter完成的,然後導出成markdown格式,ipynb檔案導出為markdown的指令如下:

jupyter nbconvert --to markdown  xxx.ipynb
           

源代碼和資料檔案,點選這裡擷取

二、資料項說明

Name		Data Type	Meas.	Description
	----		---------	-----	-----------
	Sex		nominal			M, F, and I (infant)
	Length		continuous	mm	Longest shell measurement
	Diameter	continuous	mm	perpendicular to length
	Height		continuous	mm	with meat in shell
	Whole weight	continuous	grams	whole abalone
	Shucked weight	continuous	grams	weight of meat
	Viscera weight	continuous	grams	gut weight (after bleeding)
	Shell weight	continuous	grams	after being dried
	Rings		integer			+1.5 gives the age in years
           

現在有8個資料字段,前面7個是特征值,最最後一個Rings為預測,具體請查閱檔案内容

三、實戰部分

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
           
Sex Length Diameter Height Whole weight Shucked weight Viscera weight Shell weight Rings
M 0.455 0.365 0.095 0.5140 0.2245 0.1010 0.150 15
1 M 0.350 0.265 0.090 0.2255 0.0995 0.0485 0.070 7
2 F 0.530 0.420 0.135 0.6770 0.2565 0.1415 0.210 9
3 M 0.440 0.365 0.125 0.5160 0.2155 0.1140 0.155 10
4 I 0.330 0.255 0.080 0.2050 0.0895 0.0395 0.055 7
5 I 0.425 0.300 0.095 0.3515 0.1410 0.0775 0.120 8
6 F 0.530 0.415 0.150 0.7775 0.2370 0.1415 0.330 20
7 F 0.545 0.425 0.125 0.7680 0.2940 0.1495 0.260 16
8 M 0.475 0.370 0.125 0.5095 0.2165 0.1125 0.165 9
9 F 0.550 0.440 0.150 0.8945 0.3145 0.1510 0.320 19
# 檢視資料容量 
dataframe01.shape
           
(4177, 9)
           
Index(['Sex', 'Length', 'Diameter', 'Height', 'Whole weight', 'Shucked weight',
       'Viscera weight', 'Shell weight', 'Rings'],
      dtype='object')
           
# 清洗資料
# 替換特征值,将性别中的字元類型轉化為整數
dataframe02 = dataframe01.copy()

dataframe02.Sex[dataframe01['Sex']=='I']=0
dataframe02.Sex[dataframe01['Sex']=='F']=1
dataframe02.Sex[dataframe01['Sex']=='M']=2

           
Sex Length Diameter Height Whole weight Shucked weight Viscera weight Shell weight Rings
2 0.455 0.365 0.095 0.5140 0.2245 0.1010 0.150 15
1 2 0.350 0.265 0.090 0.2255 0.0995 0.0485 0.070 7
2 1 0.530 0.420 0.135 0.6770 0.2565 0.1415 0.210 9
3 2 0.440 0.365 0.125 0.5160 0.2155 0.1140 0.155 10
4 0.330 0.255 0.080 0.2050 0.0895 0.0395 0.055 7
5 0.425 0.300 0.095 0.3515 0.1410 0.0775 0.120 8
6 1 0.530 0.415 0.150 0.7775 0.2370 0.1415 0.330 20
7 1 0.545 0.425 0.125 0.7680 0.2940 0.1495 0.260 16
8 2 0.475 0.370 0.125 0.5095 0.2165 0.1125 0.165 9
9 1 0.550 0.440 0.150 0.8945 0.3145 0.1510 0.320 19
# 導入線性回歸的庫
from sklearn.linear_model import LinearRegression as LR
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
           
data_index
           
['Sex',
 'Length',
 'Diameter',
 'Height',
 'Whole weight',
 'Shucked weight',
 'Viscera weight',
 'Shell weight',
 'Rings']
           
# 擷取特征矩陣X 的index
X_index = data_index[0:-1]
Y_index = data_index[-1]
           
X_index, Y_index
           
(['Sex',
  'Length',
  'Diameter',
  'Height',
  'Whole weight',
  'Shucked weight',
  'Viscera weight',
  'Shell weight'],
 'Rings')
           
Sex Length Diameter Height Whole weight Shucked weight Viscera weight Shell weight
2 0.455 0.365 0.095 0.5140 0.2245 0.1010 0.150
1 2 0.350 0.265 0.090 0.2255 0.0995 0.0485 0.070
2 1 0.530 0.420 0.135 0.6770 0.2565 0.1415 0.210
3 2 0.440 0.365 0.125 0.5160 0.2155 0.1140 0.155
4 0.330 0.255 0.080 0.2050 0.0895 0.0395 0.055
0    15
1     7
2     9
3    10
4     7
Name: Rings, dtype: int64
           
# 劃分訓練集和測試集
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X,Y,test_size=0.2,random_state=420)
           
Sex Length Diameter Height Whole weight Shucked weight Viscera weight Shell weight
2763 0.550 0.425 0.135 0.6560 0.2570 0.1700 0.203
439 2 0.500 0.415 0.165 0.6885 0.2490 0.1380 0.250
1735 2 0.670 0.520 0.165 1.3900 0.7110 0.2865 0.300
751 2 0.485 0.355 0.120 0.5470 0.2150 0.1615 0.140
1626 1 0.570 0.450 0.135 0.7805 0.3345 0.1850 0.210
2763    10
439     13
1735    11
751     10
1626     8
Name: Rings, dtype: int64
           
#恢複索引
for i in [Xtrain, Xtest]:
    i.index = range(i.shape[0])
           
#恢複索引
for i in [Ytrain, Ytest]:
    i.index = range(i.shape[0])
           
Sex Length Diameter Height Whole weight Shucked weight Viscera weight Shell weight
0.550 0.425 0.135 0.6560 0.2570 0.1700 0.203
1 2 0.500 0.415 0.165 0.6885 0.2490 0.1380 0.250
2 2 0.670 0.520 0.165 1.3900 0.7110 0.2865 0.300
3 2 0.485 0.355 0.120 0.5470 0.2150 0.1615 0.140
4 1 0.570 0.450 0.135 0.7805 0.3345 0.1850 0.210
0    10
1    13
2    11
3    10
4     8
Name: Rings, dtype: int64
           
# 先用訓練集訓練(fit)标準化的類,然後用訓練好的類分别轉化(transform)訓練集和測試集

# 開始模組化
reg = LR().fit(Xtrain, Ytrain)
           
4.22923686878166
           
22.656846035572762
           
array([  0.40527178,  -0.88791132,  13.01662939,  10.39250886,
         9.64127293, -20.87747601, -10.50683081,   7.70632772])
           
Xtrain.columns
           
Index(['Sex', 'Length', 'Diameter', 'Height', 'Whole weight', 'Shucked weight',
       'Viscera weight', 'Shell weight'],
      dtype='object')
           
[('Sex', 0.4052717783379893),
 ('Length', -0.8879113179582045),
 ('Diameter', 13.016629389061475),
 ('Height', 10.39250886428478),
 ('Whole weight', 9.64127293101552),
 ('Shucked weight', -20.87747600529615),
 ('Viscera weight', -10.506830809919672),
 ('Shell weight', 7.706327719866024)]
           

Name Data Type Meas. Description

​ Sex nominal M, F, and I (infant)

​ Length continuous mm Longest shell measurement

​ Diameter continuous mm perpendicular to length

​ Height continuous mm with meat in shell

​ Whole weight continuous grams whole abalone

​ Shucked weight continuous grams weight of meat

​ Viscera weight continuous grams gut weight (after bleeding)

​ Shell weight continuous grams after being dried

​ Rings integer +1.5 gives the age in years

# 截距
reg.intercept_
           
2.7888240054011835
           
# 自定義最小二乘法嘗試
def my_least_squares(x_array, y_array):
    '''
    :param x: 清單,表示m*n矩陣
    :param y: 清單,表示m*1矩陣
    :return: coef:list 回歸系數(1*n矩陣)   intercept: float 截距
    '''
    # 矩陣對象化
    arr_x_01 = np.array(x_array)
    arr_y_01 = np.array(y_array)

    # x_array由 m*n矩陣轉化為 m*(n+1)矩陣,其中第n+1列系數全為1
    # 擷取行數
    row_num = arr_x_01.shape[0]

    # 生成常量系數矩陣  m*1矩陣
    arr_b = np.array([[1 for i in range(0, row_num)]])

    # 合并成m*(n+1)矩陣
    arr_x_02 = np.insert(arr_x_01, 0, values=arr_b, axis=1)

    # 矩陣運算
    w = np.linalg.inv(np.matmul(arr_x_02.T, arr_x_02))
    w = np.matmul(w, arr_x_02.T)
    w = np.matmul(w, arr_y_01)
    
    # w為1*(n+1)矩陣
    # print(w)
    result = list(w)
    coef = result.pop(-1)
    intercept = result
    
    return coef, intercept
           
# debug中
my_least_squares(Xtrain,list(Ytrain))
           
# 梯度下降法嘗試
def costFunc(X,Y,theta):
    '''
    代價函數
    '''
    inner = np.power((X*theta.T)-Y,2)
    return np.sum(inner)/(2*len(X))

def gradientDescent(X,Y,theta,alpha,iters):
    '''
    梯度下降
    '''
    temp = np.mat(np.zeros(theta.shape))
    cost = np.zeros(iters)
    thetaNums = int(theta.shape[1])
    print(thetaNums)
    for i in range(iters):
        error = (X*theta.T-Y)
        for j in range(thetaNums):
            derivativeInner = np.multiply(error,X[:,j])
            temp[0,j] = theta[0,j] - (alpha*np.sum(derivativeInner)/len(X))

        theta = temp
        cost[i] = costFunc(X,Y,theta)

    return theta,cost
           

繼續閱讀