
Filling Missing Values with the Mode in pandas: Handling Missing Values in Create ML

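pandas has a rich API for processing data, but users who need to train models with Apple's Create ML and deploy them to iOS or macOS devices have far fewer APIs to work with. In the ideal case, machine-learning samples need little preprocessing, but real samples often contain many missing values, and if those are not handled the model cannot be trained at all.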

Source of the test data used in the examples:

Loading the training data with MLDataTable

import Cocoa
import CreateML

let trainFile = Bundle.main.url(forResource: "train", withExtension: "csv")!
var trainData = try MLDataTable(contentsOf: trainFile)

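Processing the data manually

Getting the distribution of the data

To compute the mode or the median by hand, you need to know how the data is distributed, that is, how many times each value occurs. A simple loop over the column plus a dictionary for counting is enough. A simple example follows, using the LotFrontage column: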

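let TYPE_INT = 0
let TYPE_STRING = 1
let missing = "missing" // key under which missing values are counted

// addItem is called below but is not defined in the original listing;
// this is an assumed minimal version that increments the count for key.
func addItem(data: inout [String: Int], key: String) {
    data[key, default: 0] += 1
}

// Count how many times each value (and the missing marker) occurs in a column.
func valueCounts(data: MLUntypedColumn, type: Int) -> [String: Int] {
    var vc = [String: Int]()
    for i in 0..<data.count {
        if data[i].isValid {
            if type == TYPE_INT {
                addItem(data: &vc, key: String(data[i].intValue!))
            } else if type == TYPE_STRING {
                addItem(data: &vc, key: data[i].stringValue!)
            }
        } else {
            addItem(data: &vc, key: missing)
        }
    }
    return vc
}

let vc = valueCounts(data: trainData["LotFrontage"], type: TYPE_INT)
print(vc)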

The output is as follows:

["32": 5, "30": 6, "68": 19, "61": 8, "118": 2, "84": 9, "50": 57, "24": 19, "110": 6, "59": 13, "49": 4, "45": 3, "96": 8, "51": 15, "85": 40, "21": 23, "56": 5, "95": 7, "74": 15, "98": 8, "78": 25, "75": 53, "79": 17, "100": 16, "46": 1, "104": 3, "86": 10, "missing": 259, "57": 12, "124": 2, "114": 2, "76": 11, "122": 2, "115": 2, "80": 69, "55": 17, "130": 3, "102": 4, "72": 17, "60": 143, "54": 6, "36": 6, "81": 6, "92": 10, "106": 1, "47": 5, "89": 6, "35": 9, "42": 4, "69": 11, "94": 6, "144": 1, "141": 1, "107": 7, "129": 2, "150": 1, "120": 7, "105": 6, "116": 2, "182": 1, "62": 9, "93": 8, "65": 44, "112": 1, "63": 17, "137": 1, "138": 1, "101": 2, "108": 3, "140": 1, "82": 12, "66": 15, "71": 12, "70": 70, "58": 7, "64": 19, "67": 12, "48": 6, "160": 1, "174": 2, "103": 3, "99": 3, "37": 5, "149": 1, "41": 6, "87": 5, "52": 14, "88": 10, "91": 6, "40": 12, "134": 2, "53": 10, "121": 2, "83": 5, "109": 2, "97": 2, "38": 1, "90": 23, "128": 1, "313": 2, "152": 1, "33": 1, "153": 1, "73": 18, "39": 1, "43": 12, "44": 9, "168": 1, "111": 1, "34": 10, "77": 9]
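The same counting idea can be tried without Create ML. The following is a minimal sketch in plain Swift, assuming a column is represented as an array of optionals where nil marks a missing cell; the sample data and the sorting step are illustrative assumptions, not part of the original code.

```swift
// Hypothetical stand-in for a column: nil marks a missing cell.
let column: [Int?] = [60, 70, nil, 60, 80, nil, 60]

// Count how often each value occurs; all missing cells share one key.
func valueCounts(_ values: [Int?], missingKey: String = "missing") -> [String: Int] {
    return values.reduce(into: [String: Int]()) { counts, value in
        let key = value.map { String($0) } ?? missingKey
        counts[key, default: 0] += 1
    }
}

let counts = valueCounts(column)

// Dictionary order is arbitrary, so sort by count (descending),
// then by key, to get a stable, readable listing.
let sorted = counts.sorted { a, b in
    a.value != b.value ? a.value > b.value : a.key < b.key
}

for (key, count) in sorted {
    print(key, count)
}
```

Sorting before printing makes the most frequent values and the size of the missing bucket easy to spot, which the raw dictionary dump does not.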

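Getting the proportion of missing values

When a column has too many missing values, we usually discard the whole column. The value counts computed above already include the number of missing values, so the proportion is a single division: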

print(Double(vc[missing]!) / Double(trainData["LotFrontage"].count))

The output is:

0.1773972602739726
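The ratio can then drive the decision to keep or discard a column. Below is a small sketch in plain Swift, assuming nil marks a missing cell; the sample values are made up, and the 0.7 cut-off is borrowed from the 70% rule mentioned in the complete example later in this article.

```swift
// nil marks a missing cell; the data is made up for illustration.
let lotFrontage: [Int?] = [65, nil, 80, 60, nil, 70, 75, 60, 85, 90]

// Fraction of cells that are missing (0 for an empty column).
func missingRatio<T>(_ values: [T?]) -> Double {
    guard !values.isEmpty else { return 0 }
    let missingCount = values.filter { $0 == nil }.count
    return Double(missingCount) / Double(values.count)
}

let ratio = missingRatio(lotFrontage)

// Assumed cut-off: discard a column when more than 70% of it is missing.
let dropThreshold = 0.7
let shouldDrop = ratio > dropThreshold

print("missing ratio:", ratio, "drop column:", shouldDrop)
```

With roughly 18% missing, as in the LotFrontage output above, a column stays well below such a threshold and is a candidate for filling rather than dropping.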

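Deleting missing values

MLDataTable offers two ways to delete missing values: by row and by column.

Deleting by row

MLDataTable has a dropMissing method. Calling it returns a table without the rows that contain missing values: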

print(trainData.rows.count)

trainData = trainData.dropMissing()

print(trainData.rows.count)

The output is as follows:

1460

0

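After dropping the rows with missing values, not a single row is left, which shows that the missing values in this dataset have to be handled rather than dropped.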
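To see why row-wise deletion can empty a table even when no single column is mostly missing, here is a toy sketch in plain Swift. The three rows and their values are hypothetical; each row has a gap in a different column.

```swift
// Three hypothetical rows; every row has a gap in a different column.
let rows: [[String?]] = [
    ["65", nil, "Pave"],
    [nil, "Grvl", "Pave"],
    ["80", "Grvl", nil],
]

// Row-wise deletion keeps only rows without any missing cell.
let complete = rows.filter { row in row.allSatisfy { $0 != nil } }

// Column-wise view of the same data: count the gaps in each column.
let columnCount = rows[0].count
let missingPerColumn = (0..<columnCount).map { c in
    rows.filter { $0[c] == nil }.count
}

print("complete rows:", complete.count)
print("missing per column:", missingPerColumn)
```

Every column is only one-third missing, yet no row survives, so with many columns it is usually better to fill or drop individual columns.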

Deleting by column

MLDataTable's removeColumn method deletes a column:

print(trainData.columnNames.count)

trainData.removeColumn(named: "LotFrontage")

print(trainData.columnNames.count)

Output:

81

80

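The number of columns went from 81 to 80.

Filling with the mean

Subscripting an MLDataTable with a column name returns an MLUntypedColumn. MLUntypedColumn has a property ints that converts the untyped column to an MLDataColumn<Int>, or yields nil if the conversion fails, and MLDataColumn has a mean method that computes the mean. Numeric columns can therefore be filled with the mean. Taking LotFrontage again, which is numeric, the mean can be obtained and filled in like this: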

print(valueCounts(data: trainData["LotFrontage"], type: TYPE_INT))

let mean = trainData["LotFrontage"].ints?.mean()

print("mean:\(mean)\n")

// 创建一个int类型的DataValue

let LotFrontageMean = MLDataValue.int(Int(mean!))

// 填充缺失值

trainData = trainData.fillMissing(columnNamed: "LotFrontage", with: LotFrontageMean)

print(valueCounts(data: trainData["LotFrontage"], type: TYPE_INT))

The result is as follows:

["64": 19, "150": 1, "57": 12, "160": 1, "24": 19, "118": 2, "130": 3, "114": 2, "152": 1, "174": 2, "80": 69, "43": 12, "313": 2, "63": 17, "153": 1, "68": 19, "41": 6, "46": 1, "98": 8, "40": 12, "120": 7, "106": 1, "30": 6, "75": 53, "82": 12, "103": 3, "61": 8, "121": 2, "34": 10, "39": 1, "182": 1, "38": 1, "21": 23, "111": 1, "52": 14, "73": 18, "112": 1, "74": 15, "77": 9, "44": 9, "85": 40, "51": 15, "137": 1, "105": 6, "missing": 259, "65": 44, "66": 15, "88": 10, "56": 5, "48": 6, "53": 10, "109": 2, "81": 6, "124": 2, "42": 4, "92": 10, "95": 7, "107": 7, "72": 17, "60": 143, "59": 13, "37": 5, "71": 12, "33": 1, "115": 2, "55": 17, "141": 1, "144": 1, "128": 1, "97": 2, "140": 1, "84": 9, "110": 6, "49": 4, "36": 6, "67": 12, "78": 25, "45": 3, "90": 23, "32": 5, "93": 8, "69": 11, "100": 16, "86": 10, "89": 6, "58": 7, "108": 3, "87": 5, "94": 6, "99": 3, "116": 2, "47": 5, "35": 9, "122": 2, "149": 1, "76": 11, "101": 2, "70": 70, "129": 2, "91": 6, "138": 1, "104": 3, "54": 6, "102": 4, "168": 1, "79": 17, "96": 8, "50": 57, "83": 5, "62": 9, "134": 2]

mean: Optional(70.04995836802664)

["64": 19, "150": 1, "57": 12, "160": 1, "24": 19, "118": 2, "130": 3, "114": 2, "152": 1, "174": 2, "80": 69, "43": 12, "313": 2, "63": 17, "153": 1, "68": 19, "41": 6, "46": 1, "98": 8, "40": 12, "120": 7, "106": 1, "30": 6, "75": 53, "82": 12, "103": 3, "61": 8, "121": 2, "34": 10, "39": 1, "182": 1, "38": 1, "21": 23, "111": 1, "52": 14, "73": 18, "112": 1, "74": 15, "77": 9, "44": 9, "85": 40, "51": 15, "137": 1, "105": 6, "65": 44, "66": 15, "88": 10, "56": 5, "48": 6, "53": 10, "109": 2, "81": 6, "124": 2, "42": 4, "92": 10, "95": 7, "107": 7, "72": 17, "60": 143, "59": 13, "37": 5, "71": 12, "33": 1, "115": 2, "55": 17, "141": 1, "144": 1, "128": 1, "97": 2, "140": 1, "84": 9, "110": 6, "49": 4, "36": 6, "67": 12, "78": 25, "45": 3, "90": 23, "32": 5, "93": 8, "69": 11, "100": 16, "86": 10, "89": 6, "58": 7, "108": 3, "87": 5, "94": 6, "99": 3, "116": 2, "47": 5, "35": 9, "122": 2, "149": 1, "76": 11, "101": 2, "70": 329, "129": 2, "91": 6, "138": 1, "104": 3, "54": 6, "102": 4, "168": 1, "79": 17, "96": 8, "50": 57, "83": 5, "62": 9, "134": 2]

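In the second dictionary the missing entry is gone, and the count for 70 went from 70 to 329. The increase of 259 is exactly the earlier number of missing values. Note that Int(mean!) truncates the mean of about 70.05 to 70 before it is filled in.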
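Because the column holds integers, the mean has to be converted before filling, and truncating and rounding can give different results. The following plain-Swift sketch illustrates this, assuming nil marks a missing cell; fillWithMean and its rounding parameter are hypothetical names introduced only for this illustration.

```swift
// Fill missing cells of an integer column with its mean.
// `rounding` picks between truncation (as Int(mean) does) and rounding.
func fillWithMean(_ values: [Int?], rounding: Bool = false) -> [Int?] {
    let present = values.compactMap { $0 }
    // Nothing to average: return the column unchanged.
    guard !present.isEmpty else { return values }
    let mean = Double(present.reduce(0, +)) / Double(present.count)
    let fill = rounding ? Int(mean.rounded()) : Int(mean)
    return values.map { value -> Int? in value ?? fill }
}

// Made-up data: the mean of 60, 70 and 82 is about 70.67.
let sample: [Int?] = [60, nil, 70, 82]
let truncated = fillWithMean(sample)
let rounded = fillWithMean(sample, rounding: true)

print(truncated)
print(rounded)
```

Here truncation fills 70 while rounding fills 71. For the LotFrontage mean of about 70.05 both choices give 70, so the difference only matters when the fractional part is large.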

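Filling with the mode

The mode is the value that occurs most often in a set of data. There may be one mode, several, or none. To keep things simple, only the single-mode case is considered here: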

// Mode of a column, returned as the string key used by valueCounts.
func getMode(data: MLUntypedColumn, type: Int) -> String {
    let vc = valueCounts(data: data, type: type)
    return getMode(data: vc)
}

// Key with the highest count, skipping the missing marker.
// If several keys tie, whichever is visited first wins.
func getMode(data: [String: Int]) -> String {
    var max = 0
    var maxKey = ""
    var flag = false // becomes true once a first candidate is stored
    for (key, value) in data {
        if key == missing {
            continue
        }
        if !flag || max < value {
            max = value
            maxKey = key
            flag = true
        }
    }
    return maxKey
}

print(valueCounts(data: trainData["LotFrontage"], type: TYPE_INT))
let mode = getMode(data: trainData["LotFrontage"], type: TYPE_INT)
print("Mode :\(mode)")
let LotFrontageMode = MLDataValue.int(Int(mode)!)
trainData = trainData.fillMissing(columnNamed: "LotFrontage", with: LotFrontageMode)
print(valueCounts(data: trainData["LotFrontage"], type: TYPE_INT))

The output is as follows:

["134": 2, "57": 12, "41": 6, "53": 10, "83": 5, "91": 6, "141": 1, "86": 10, "153": 1, "100": 16, "45": 3, "54": 6, "104": 3, "81": 6, "64": 19, "37": 5, "30": 6, "60": 143, "149": 1, "80": 69, "168": 1, "160": 1, "77": 9, "78": 25, "82": 12, "111": 1, "52": 14, "69": 11, "46": 1, "missing": 259, "112": 1, "36": 6, "182": 1, "49": 4, "107": 7, "84": 9, "93": 8, "61": 8, "137": 1, "94": 6, "129": 2, "59": 13, "33": 1, "97": 2, "98": 8, "24": 19, "174": 2, "130": 3, "21": 23, "99": 3, "313": 2, "96": 8, "74": 15, "110": 6, "62": 9, "152": 1, "56": 5, "120": 7, "105": 6, "47": 5, "103": 3, "90": 23, "65": 44, "85": 40, "88": 10, "138": 1, "108": 3, "39": 1, "75": 53, "122": 2, "48": 6, "79": 17, "76": 11, "44": 9, "118": 2, "150": 1, "73": 18, "42": 4, "102": 4, "70": 70, "50": 57, "109": 2, "124": 2, "116": 2, "101": 2, "32": 5, "58": 7, "89": 6, "35": 9, "34": 10, "144": 1, "128": 1, "67": 12, "87": 5, "121": 2, "68": 19, "40": 12, "95": 7, "66": 15, "71": 12, "63": 17, "92": 10, "38": 1, "51": 15, "115": 2, "43": 12, "114": 2, "72": 17, "140": 1, "55": 17, "106": 1]

Mode : 60

["134": 2, "57": 12, "41": 6, "53": 10, "83": 5, "91": 6, "141": 1, "86": 10, "153": 1, "100": 16, "45": 3, "54": 6, "104": 3, "81": 6, "64": 19, "37": 5, "30": 6, "60": 402, "149": 1, "80": 69, "168": 1, "160": 1, "77": 9, "78": 25, "82": 12, "111": 1, "52": 14, "69": 11, "46": 1, "112": 1, "36": 6, "182": 1, "49": 4, "107": 7, "84": 9, "93": 8, "61": 8, "137": 1, "94": 6, "129": 2, "59": 13, "33": 1, "97": 2, "98": 8, "24": 19, "174": 2, "130": 3, "21": 23, "99": 3, "313": 2, "96": 8, "74": 15, "110": 6, "62": 9, "152": 1, "56": 5, "120": 7, "105": 6, "47": 5, "103": 3, "90": 23, "65": 44, "85": 40, "88": 10, "138": 1, "108": 3, "39": 1, "75": 53, "122": 2, "48": 6, "79": 17, "76": 11, "44": 9, "118": 2, "150": 1, "73": 18, "42": 4, "102": 4, "70": 70, "50": 57, "109": 2, "124": 2, "116": 2, "101": 2, "32": 5, "58": 7, "89": 6, "35": 9, "34": 10, "144": 1, "128": 1, "67": 12, "87": 5, "121": 2, "68": 19, "40": 12, "95": 7, "66": 15, "71": 12, "63": 17, "92": 10, "38": 1, "51": 15, "115": 2, "43": 12, "114": 2, "72": 17, "140": 1, "55": 17, "106": 1]

After filling, the count for 60 rose from 143 to 402, again an increase equal to the number of missing values (259).
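When several values tie for the highest count, a loop over a Swift dictionary may return a different winner from run to run, because dictionary iteration order is not stable. A possible way to make ties explicit is sketched below in plain Swift; the function name modes(of:ignoring:) is a hypothetical helper, and it takes the same kind of count dictionary as produced earlier.

```swift
// Return every key that shares the highest count, skipping the
// missing marker. The result is sorted so it is deterministic.
func modes(of counts: [String: Int], ignoring missingKey: String = "missing") -> [String] {
    let valid = counts.filter { $0.key != missingKey }
    guard let top = valid.values.max() else { return [] }
    return valid.filter { $0.value == top }.keys.sorted()
}

// Made-up counts with a tie between "60" and "70".
let tied = modes(of: ["60": 3, "70": 3, "80": 1, "missing": 5])
// A column that contains nothing but missing values has no mode.
let none = modes(of: ["missing": 2])

print(tied)
print(none)
```

The caller can then decide how to break a tie, for example by taking the first element, or treat an empty result as a signal that the column cannot be filled this way.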

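A complete example

Completing the house-price prediction takes the following steps:

Fill the missing values in the training data: numeric columns are filled with the mean, string columns are filled with the mode, and columns with more than 70% missing values are discarded.

Choose a suitable model to train; I used linear regression here.

Use the trained model to make predictions.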
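The first step above can be expressed as a per-column decision. The sketch below is plain Swift and does not use Create ML; the Column and Plan types and the plan(for:dropAbove:) function are hypothetical names introduced for illustration, and breaking mode ties alphabetically is an assumption of this sketch.

```swift
// A column is either numeric or text; nil marks a missing cell.
enum Column {
    case numeric([Double?])
    case text([String?])
}

// What to do with a column before training.
enum Plan: Equatable {
    case keep
    case drop
    case fillMean(Double)
    case fillMode(String)
}

func plan(for column: Column, dropAbove threshold: Double = 0.7) -> Plan {
    switch column {
    case .numeric(let values):
        let present = values.compactMap { $0 }
        if present.count == values.count { return .keep }
        let ratio = Double(values.count - present.count) / Double(values.count)
        if ratio > threshold || present.isEmpty { return .drop }
        return .fillMean(present.reduce(0, +) / Double(present.count))
    case .text(let values):
        let present = values.compactMap { $0 }
        if present.count == values.count { return .keep }
        let ratio = Double(values.count - present.count) / Double(values.count)
        if ratio > threshold || present.isEmpty { return .drop }
        var counts = [String: Int]()
        for value in present { counts[value, default: 0] += 1 }
        // Highest count wins; ties go to the alphabetically first key.
        let best = counts.max(by: { a, b in (a.value, b.key) < (b.value, a.key) })!
        return .fillMode(best.key)
    }
}

print(plan(for: .numeric([1, nil, 3])))
print(plan(for: .text(["a", nil, "b", "b"])))
print(plan(for: .text([nil, nil, nil, "x"])))
```

In a Create ML workflow, the resulting plan would map onto the removeColumn and fillMissing calls shown earlier.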

For the complete code, see smartdone/ML_swift on github.com


Kaggle score of a submission made without any parameter tuning