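pandas offers a rich API for processing data, but users who need to train models with Apple's Create ML and deploy them to iOS or macOS devices have far fewer APIs to work with. Ideally, machine-learning samples need little preprocessing, but real samples often contain many missing values, and if those are not handled the model cannot be trained at all.
Source of the test data used in the examples:
Loading the training data with MLDataTable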
import Cocoa
import CreateML
let trainFile = Bundle.main.url(forResource: "train", withExtension: "csv")!
var trainData = try MLDataTable(contentsOf: trainFile)
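Processing the data by hand
Getting the distribution of the data
To compute the mode or the median by hand, you need the distribution of the data, that is, how many times each value occurs. A simple loop over the data with a dictionary for counting is enough. Sample code (using the LotFrontage column as an example):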
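let TYPE_INT = 0
let TYPE_STRING = 1
let missing = "missing" // key under which missing values are counted

// Helper called below. Its definition is not shown in the original post;
// this is a minimal version that increments the count stored under key.
func addItem(data: inout [String: Int], key: String) {
    data[key, default: 0] += 1
}

// Count how often each value (and the missing marker) occurs in a column.
func valueCounts(data: MLUntypedColumn, type: Int) -> [String: Int] {
    var vc = [String: Int]()
    for i in 0..<data.count {
        if data[i].isValid {
            if type == TYPE_INT {
                addItem(data: &vc, key: String(data[i].intValue!))
            } else if type == TYPE_STRING {
                addItem(data: &vc, key: data[i].stringValue!)
            }
        } else {
            addItem(data: &vc, key: missing)
        }
    }
    return vc
}

let vc = valueCounts(data: trainData["LotFrontage"], type: TYPE_INT)
print(vc)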
Output:
["32": 5, "30": 6, "68": 19, "61": 8, "118": 2, "84": 9, "50": 57, "24": 19, "110": 6, "59": 13, "49": 4, "45": 3, "96": 8, "51": 15, "85": 40, "21": 23, "56": 5, "95": 7, "74": 15, "98": 8, "78": 25, "75": 53, "79": 17, "100": 16, "46": 1, "104": 3, "86": 10, "missing": 259, "57": 12, "124": 2, "114": 2, "76": 11, "122": 2, "115": 2, "80": 69, "55": 17, "130": 3, "102": 4, "72": 17, "60": 143, "54": 6, "36": 6, "81": 6, "92": 10, "106": 1, "47": 5, "89": 6, "35": 9, "42": 4, "69": 11, "94": 6, "144": 1, "141": 1, "107": 7, "129": 2, "150": 1, "120": 7, "105": 6, "116": 2, "182": 1, "62": 9, "93": 8, "65": 44, "112": 1, "63": 17, "137": 1, "138": 1, "101": 2, "108": 3, "140": 1, "82": 12, "66": 15, "71": 12, "70": 70, "58": 7, "64": 19, "67": 12, "48": 6, "160": 1, "174": 2, "103": 3, "99": 3, "37": 5, "149": 1, "41": 6, "87": 5, "52": 14, "88": 10, "91": 6, "40": 12, "134": 2, "53": 10, "121": 2, "83": 5, "109": 2, "97": 2, "38": 1, "90": 23, "128": 1, "313": 2, "152": 1, "33": 1, "153": 1, "73": 18, "39": 1, "43": 12, "44": 9, "168": 1, "111": 1, "34": 10, "77": 9]
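The value counts above are enough to derive the median mentioned at the start of this section. The following is a minimal sketch in plain Swift, not code from the original post: the name median(fromCounts:missingKey:) is hypothetical, and it assumes the dictionary keys are the string forms of integers, as produced for an integer column.

```swift
// Hypothetical helper: median of an integer column, computed from a
// value-count dictionary keyed by the value's string form.
func median(fromCounts counts: [String: Int], missingKey: String = "missing") -> Double? {
    // Keep only keys that parse as integers; this also skips the missing bucket.
    let pairs = counts
        .filter { $0.key != missingKey }
        .compactMap { entry -> (value: Int, count: Int)? in
            guard let value = Int(entry.key) else { return nil }
            return (value: value, count: entry.value)
        }
        .sorted { $0.value < $1.value }
    let total = pairs.reduce(0) { $0 + $1.count }
    guard total > 0 else { return nil }

    // Value found at a zero-based position of the sorted, expanded data.
    func value(at position: Int) -> Int {
        var seen = 0
        for pair in pairs {
            seen += pair.count
            if position < seen { return pair.value }
        }
        return pairs[pairs.count - 1].value
    }

    if total % 2 == 1 {
        return Double(value(at: total / 2))
    }
    // Even number of entries: average the two middle values.
    return Double(value(at: total / 2 - 1) + value(at: total / 2)) / 2.0
}

let exampleMedian = median(fromCounts: ["60": 2, "70": 1, "missing": 4]) // 60.0
```

Working from the counts avoids a second pass over the column, and the missing bucket is ignored so it cannot shift the result.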
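Getting the proportion of missing values
When a column has too many missing values, we usually discard the whole column. The distribution computed above also counted the missing values, so the proportion is a single division: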
print(Double(vc[missing]!) / Double(trainData["LotFrontage"].count))
Output:
0.1773972602739726
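The division above force-unwraps vc[missing], which crashes for a column that has no missing values. As a hedged sketch in plain Swift (the names missingRatio and shouldDropColumn are hypothetical, and the 0.7 default only mirrors the 70% cutoff used in the complete example later in this post), the ratio and the drop decision could be computed from the count dictionary like this:

```swift
// Hypothetical helper: fraction of a column's entries that are missing.
// Returns 0 for an empty dictionary or when there is no missing bucket.
func missingRatio(counts: [String: Int], missingKey: String = "missing") -> Double {
    let total = counts.values.reduce(0, +)
    guard total > 0 else { return 0 }
    return Double(counts[missingKey] ?? 0) / Double(total)
}

// Hypothetical rule: drop a column once its missing ratio exceeds the threshold.
func shouldDropColumn(counts: [String: Int], threshold: Double = 0.7) -> Bool {
    return missingRatio(counts: counts) > threshold
}

let ratio = missingRatio(counts: ["60": 3, "missing": 1]) // 0.25
let drop = shouldDropColumn(counts: ["60": 3, "missing": 1]) // false
```

With the LotFrontage counts, a ratio of about 0.18 is well below such a cutoff, so the column would be kept and filled instead.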
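Removing missing values
MLDataTable offers two ways to remove missing values: by row or by column.
Removing by row
MLDataTable has a dropMissing method that removes every row containing a missing value: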
print(trainData.rows.count)
trainData = trainData.dropMissing()
print(trainData.rows.count)
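Output:
1460
0
After dropping the rows with missing values, not a single row is left, so the missing values have to be handled some other way.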
Removing by column
MLDataTable's removeColumn method deletes a column:
print(trainData.columnNames.count)
trainData.removeColumn(named: "LotFrontage")
print(trainData.columnNames.count)
Output:
81
80
The number of columns went from 81 to 80.
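Filling with the mean
Subscripting an MLDataTable by column name returns an MLUntypedColumn. Its ints property converts the untyped column to an MLDataColumn<Int> and is nil if the conversion fails. MLDataColumn has a mean method that computes the average. Numeric columns can be filled with the mean. LotFrontage, used again as the example, is numeric, so its mean can be obtained and filled in as follows: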
print(valueCounts(data: trainData["LotFrontage"], type: TYPE_INT))
let mean = trainData["LotFrontage"].ints?.mean()
print("mean:\(mean)\n")
// Create an Int-typed MLDataValue (Int(mean!) truncates the fractional part)
let LotFrontageMean = MLDataValue.int(Int(mean!))
// Fill the missing values
trainData = trainData.fillMissing(columnNamed: "LotFrontage", with: LotFrontageMean)
print(valueCounts(data: trainData["LotFrontage"], type: TYPE_INT))
Output:
["64": 19, "150": 1, "57": 12, "160": 1, "24": 19, "118": 2, "130": 3, "114": 2, "152": 1, "174": 2, "80": 69, "43": 12, "313": 2, "63": 17, "153": 1, "68": 19, "41": 6, "46": 1, "98": 8, "40": 12, "120": 7, "106": 1, "30": 6, "75": 53, "82": 12, "103": 3, "61": 8, "121": 2, "34": 10, "39": 1, "182": 1, "38": 1, "21": 23, "111": 1, "52": 14, "73": 18, "112": 1, "74": 15, "77": 9, "44": 9, "85": 40, "51": 15, "137": 1, "105": 6, "missing": 259, "65": 44, "66": 15, "88": 10, "56": 5, "48": 6, "53": 10, "109": 2, "81": 6, "124": 2, "42": 4, "92": 10, "95": 7, "107": 7, "72": 17, "60": 143, "59": 13, "37": 5, "71": 12, "33": 1, "115": 2, "55": 17, "141": 1, "144": 1, "128": 1, "97": 2, "140": 1, "84": 9, "110": 6, "49": 4, "36": 6, "67": 12, "78": 25, "45": 3, "90": 23, "32": 5, "93": 8, "69": 11, "100": 16, "86": 10, "89": 6, "58": 7, "108": 3, "87": 5, "94": 6, "99": 3, "116": 2, "47": 5, "35": 9, "122": 2, "149": 1, "76": 11, "101": 2, "70": 70, "129": 2, "91": 6, "138": 1, "104": 3, "54": 6, "102": 4, "168": 1, "79": 17, "96": 8, "50": 57, "83": 5, "62": 9, "134": 2]
mean: Optional(70.04995836802664)
["64": 19, "150": 1, "57": 12, "160": 1, "24": 19, "118": 2, "130": 3, "114": 2, "152": 1, "174": 2, "80": 69, "43": 12, "313": 2, "63": 17, "153": 1, "68": 19, "41": 6, "46": 1, "98": 8, "40": 12, "120": 7, "106": 1, "30": 6, "75": 53, "82": 12, "103": 3, "61": 8, "121": 2, "34": 10, "39": 1, "182": 1, "38": 1, "21": 23, "111": 1, "52": 14, "73": 18, "112": 1, "74": 15, "77": 9, "44": 9, "85": 40, "51": 15, "137": 1, "105": 6, "65": 44, "66": 15, "88": 10, "56": 5, "48": 6, "53": 10, "109": 2, "81": 6, "124": 2, "42": 4, "92": 10, "95": 7, "107": 7, "72": 17, "60": 143, "59": 13, "37": 5, "71": 12, "33": 1, "115": 2, "55": 17, "141": 1, "144": 1, "128": 1, "97": 2, "140": 1, "84": 9, "110": 6, "49": 4, "36": 6, "67": 12, "78": 25, "45": 3, "90": 23, "32": 5, "93": 8, "69": 11, "100": 16, "86": 10, "89": 6, "58": 7, "108": 3, "87": 5, "94": 6, "99": 3, "116": 2, "47": 5, "35": 9, "122": 2, "149": 1, "76": 11, "101": 2, "70": 329, "129": 2, "91": 6, "138": 1, "104": 3, "54": 6, "102": 4, "168": 1, "79": 17, "96": 8, "50": 57, "83": 5, "62": 9, "134": 2]
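The second dictionary no longer contains missing values, and the count for 70 rose from 70 to 329; the increase of 259 is exactly the number of values that were missing.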
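Note that Int(mean!) truncates, which is why a mean of about 70.05 is filled in as 70. The sketch below uses plain Swift arrays of optionals rather than Create ML types to show the same steps and how truncation differs from rounding; the helpers mean(of:) and filled(_:with:) and the sample data are hypothetical and chosen only for illustration.

```swift
// Hypothetical helper: mean of the non-missing entries, or nil if there are none.
func mean(of values: [Int?]) -> Double? {
    let present = values.compactMap { $0 }
    guard !present.isEmpty else { return nil }
    return Double(present.reduce(0, +)) / Double(present.count)
}

// Hypothetical helper: replace every nil with the given fill value.
func filled(_ values: [Int?], with fill: Int) -> [Int] {
    return values.map { $0 ?? fill }
}

let sample: [Int?] = [60, nil, 80, 72, nil]   // made-up data, mean is 212 / 3
if let m = mean(of: sample) {
    let truncated = Int(m)            // 70: the fractional part is dropped
    let rounded = Int(m.rounded())    // 71: rounded to the nearest integer
    print(filled(sample, with: truncated), rounded)
}
```

Whether to truncate or round is a judgment call; for a column of whole numbers the difference is at most 1.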
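Filling with the mode
The mode is the value that occurs most often in a data set. There may be one mode, several, or none; to keep things simple, only the single-mode case is handled here: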
// Mode of a column, returned as the string form of the value.
func getMode(data: MLUntypedColumn, type: Int) -> String {
    let vc = valueCounts(data: data, type: type)
    return getMode(data: vc)
}

// Mode from a value-count dictionary; the missing bucket is skipped.
func getMode(data: [String: Int]) -> String {
    var max: Int = 0
    var maxKey: String = ""
    var flag = false // becomes true once the first real value has been seen
    for (key, value) in data {
        if key == missing {
            continue
        }
        if !flag {
            max = value
            maxKey = key
            flag = true
        } else if max < value {
            max = value
            maxKey = key
        }
    }
    return maxKey
}

print(valueCounts(data: trainData["LotFrontage"], type: TYPE_INT))
let mode = getMode(data: trainData["LotFrontage"], type: TYPE_INT)
print("Mode :\(mode)")
let LotFrontageMode = MLDataValue.int(Int(mode)!)
trainData = trainData.fillMissing(columnNamed: "LotFrontage", with: LotFrontageMode)
print(valueCounts(data: trainData["LotFrontage"], type: TYPE_INT))
Output:
["134": 2, "57": 12, "41": 6, "53": 10, "83": 5, "91": 6, "141": 1, "86": 10, "153": 1, "100": 16, "45": 3, "54": 6, "104": 3, "81": 6, "64": 19, "37": 5, "30": 6, "60": 143, "149": 1, "80": 69, "168": 1, "160": 1, "77": 9, "78": 25, "82": 12, "111": 1, "52": 14, "69": 11, "46": 1, "missing": 259, "112": 1, "36": 6, "182": 1, "49": 4, "107": 7, "84": 9, "93": 8, "61": 8, "137": 1, "94": 6, "129": 2, "59": 13, "33": 1, "97": 2, "98": 8, "24": 19, "174": 2, "130": 3, "21": 23, "99": 3, "313": 2, "96": 8, "74": 15, "110": 6, "62": 9, "152": 1, "56": 5, "120": 7, "105": 6, "47": 5, "103": 3, "90": 23, "65": 44, "85": 40, "88": 10, "138": 1, "108": 3, "39": 1, "75": 53, "122": 2, "48": 6, "79": 17, "76": 11, "44": 9, "118": 2, "150": 1, "73": 18, "42": 4, "102": 4, "70": 70, "50": 57, "109": 2, "124": 2, "116": 2, "101": 2, "32": 5, "58": 7, "89": 6, "35": 9, "34": 10, "144": 1, "128": 1, "67": 12, "87": 5, "121": 2, "68": 19, "40": 12, "95": 7, "66": 15, "71": 12, "63": 17, "92": 10, "38": 1, "51": 15, "115": 2, "43": 12, "114": 2, "72": 17, "140": 1, "55": 17, "106": 1]
Mode : 60
["134": 2, "57": 12, "41": 6, "53": 10, "83": 5, "91": 6, "141": 1, "86": 10, "153": 1, "100": 16, "45": 3, "54": 6, "104": 3, "81": 6, "64": 19, "37": 5, "30": 6, "60": 402, "149": 1, "80": 69, "168": 1, "160": 1, "77": 9, "78": 25, "82": 12, "111": 1, "52": 14, "69": 11, "46": 1, "112": 1, "36": 6, "182": 1, "49": 4, "107": 7, "84": 9, "93": 8, "61": 8, "137": 1, "94": 6, "129": 2, "59": 13, "33": 1, "97": 2, "98": 8, "24": 19, "174": 2, "130": 3, "21": 23, "99": 3, "313": 2, "96": 8, "74": 15, "110": 6, "62": 9, "152": 1, "56": 5, "120": 7, "105": 6, "47": 5, "103": 3, "90": 23, "65": 44, "85": 40, "88": 10, "138": 1, "108": 3, "39": 1, "75": 53, "122": 2, "48": 6, "79": 17, "76": 11, "44": 9, "118": 2, "150": 1, "73": 18, "42": 4, "102": 4, "70": 70, "50": 57, "109": 2, "124": 2, "116": 2, "101": 2, "32": 5, "58": 7, "89": 6, "35": 9, "34": 10, "144": 1, "128": 1, "67": 12, "87": 5, "121": 2, "68": 19, "40": 12, "95": 7, "66": 15, "71": 12, "63": 17, "92": 10, "38": 1, "51": 15, "115": 2, "43": 12, "114": 2, "72": 17, "140": 1, "55": 17, "106": 1]
After filling, the count for 60 likewise grew by the number of missing values, from 143 to 402.
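The code above handles only the single-mode case. When several values tie for the highest count, which one it returns depends on the dictionary's iteration order, and Swift does not guarantee that order to be the same between runs. As a hedged sketch (allModes is a hypothetical name, not part of the original code), every mode can be collected from the count dictionary like this:

```swift
// Hypothetical helper: all keys that reach the highest count.
// Keys are sorted as strings so the result is the same on every run.
func allModes(counts: [String: Int], missingKey: String = "missing") -> [String] {
    let present = counts.filter { $0.key != missingKey }
    guard let highest = present.values.max() else { return [] }
    return present.filter { $0.value == highest }.keys.sorted()
}

let tied = allModes(counts: ["60": 3, "70": 3, "80": 1, "missing": 9]) // ["60", "70"]
```

A caller can then pick one entry by a fixed rule, for example the first, so that repeated runs fill with the same value. The sort is lexicographic on the string keys, not numeric.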
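Complete example
Predicting house prices takes the following steps:
1. Fill the missing values in the training data: numeric columns are filled with the mean and string columns with the mode. Columns with more than 70% missing values are discarded.
2. Choose a suitable model and train it; linear regression is used here.
3. Use the trained model to make predictions.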
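The per-column decision in the first step can be sketched in plain Swift without Create ML. This is an illustration under stated assumptions, not the code from the repository: the Column and Treatment types and the treatment(for:dropThreshold:) function are hypothetical, columns are modeled as arrays of optionals, and ties for the mode are broken by taking the smallest string.

```swift
// Hypothetical model of a column: nil marks a missing entry.
enum Column {
    case numeric([Double?])
    case text([String?])
}

// Hypothetical result of the decision for one column.
enum Treatment: Equatable {
    case keep                 // nothing is missing
    case drop                 // too many values are missing
    case fillNumber(Double)   // fill with the mean
    case fillText(String)     // fill with the mode
}

func treatment(for column: Column, dropThreshold: Double = 0.7) -> Treatment {
    switch column {
    case .numeric(let values):
        let present = values.compactMap { $0 }
        if present.count == values.count { return .keep }
        let ratio = Double(values.count - present.count) / Double(values.count)
        if ratio > dropThreshold || present.isEmpty { return .drop }
        return .fillNumber(present.reduce(0, +) / Double(present.count))
    case .text(let values):
        let present = values.compactMap { $0 }
        if present.count == values.count { return .keep }
        let ratio = Double(values.count - present.count) / Double(values.count)
        if ratio > dropThreshold || present.isEmpty { return .drop }
        var counts = [String: Int]()
        for value in present { counts[value, default: 0] += 1 }
        // Highest count wins; on a tie the smaller string wins.
        let mode = counts.max { a, b in
            a.value != b.value ? a.value < b.value : a.key > b.key
        }!.key
        return .fillText(mode)
    }
}

let plan = treatment(for: .numeric([1, nil, 3])) // .fillNumber(2.0)
```

Separating the decision from the table mutation keeps the rule easy to check on small arrays before it is applied to the real columns.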
For the complete code, see smartdone/ML_swift on github.com.

Kaggle score for a submission made without any parameter tuning: