08 - Dictionary Feature Extraction

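Why feature engineering is needed

Andrew Ng, a leading figure in machine learning, said: "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering."

Note: a saying that circulates widely in industry is that data and features determine the upper limit of machine learning, while models and algorithms only approach that limit.

What is feature engineering

There is no single definition.

Feature engineering is the process of using domain knowledge and technique to process data so that the features work better with machine learning algorithms.

Significance: it directly affects how well machine learning performs.

Feature engineering & data processing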

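sklearn: provides the tools used for feature engineering
pandas: data cleaning and data processing

Introduction to feature engineering

Feature engineering covers:

Feature extraction
Feature preprocessing
Feature dimensionality reduction

Feature extraction (特征抽取, also written 特征提取)

Feature extraction: definition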

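Converting arbitrary data (such as text or images) into numerical features that can be used for machine learning.

Turning data into feature values

Data is turned into feature values so that the computer can work with it.

Dictionary feature extraction (feature discretization)
Text feature extraction
Image feature extraction (covered later, with deep learning)

How is feature extraction done in sklearn?

sklearn.feature_extraction

Dictionary feature extraction

Purpose: convert dictionary data into feature values.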

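Vector: the same concept in mathematics and physics (Chinese uses 向量 in mathematics and 矢量 in physics).

A vector has both magnitude and direction.

So how do we use a vector in a computer?

Matrix: stored as a two-dimensional array.

A matrix is made up of vectors.

Vector: stored as a one-dimensional array.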
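
The mapping from vectors and matrices to arrays can be sketched with NumPy. This is only an illustration: the notes do not say which array library is meant, so NumPy is an assumption, and the numbers are made up to resemble the feature rows produced later in this post.

```python
import numpy as np

# A vector is stored as a one-dimensional array.
vector = np.array([0.0, 1.0, 0.0, 100.0])

# A matrix is stored as a two-dimensional array; each row here is one vector.
matrix = np.array([
    [0.0, 1.0, 0.0, 100.0],
    [1.0, 0.0, 0.0, 60.0],
    [0.0, 0.0, 1.0, 30.0],
])

print(vector.ndim, vector.shape)  # 1 (4,)
print(matrix.ndim, matrix.shape)  # 2 (3, 4)

# The first row of the matrix is itself a vector.
first_row = matrix[0]
print(np.array_equal(first_row, vector))  # True
```

Each sample becomes one row vector, and the whole data set becomes the matrix that a learning algorithm consumes.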

Application

We extract features from the following data:

[{'city': '北京', 'temperature': 100},
 {'city': '上海', 'temperature': 60},
 {'city': '深圳', 'temperature': 30}]

# -*- coding: utf-8 -*-

"""
@Time    : 2021/3/7 14:50
@Author  : yuhui
@FileName: 08_字典特征抽取.py
@Software: PyCharm
"""

from sklearn.feature_extraction import DictVectorizer


def dict_feature_extraction():
    """
    Dictionary feature extraction.
    :return: None
    """
    data = [{'city': '北京', 'temperature': 100},
            {'city': '上海', 'temperature': 60},
            {'city': '深圳', 'temperature': 30}]

    # Instantiate a transformer.
    # sparse=True (the default) returns a SciPy sparse matrix;
    # sparse=False returns a dense NumPy array.
    # transfer = DictVectorizer(sparse=True)
    transfer = DictVectorizer(sparse=False)

    # fit_transform() learns the feature names and converts the data in one step.
    data_new = transfer.fit_transform(data)
    print(data_new)


if __name__ == '__main__':
    dict_feature_extraction()

# -*- coding: utf-8 -*-

"""
@Time    : 2021/3/7 14:50
@Author  : yuhui
@FileName: 08_字典特征抽取.py
@Software: PyCharm
"""

from sklearn.feature_extraction import DictVectorizer


def dict_feature_extraction():
    """
    Dictionary feature extraction.
    :return: None
    """
    data = [{'city': '北京', 'temperature': 100},
            {'city': '上海', 'temperature': 60},
            {'city': '深圳', 'temperature': 30}]

    # Instantiate a transformer (sparse=False returns a dense array).
    # transfer = DictVectorizer(sparse=True)
    transfer = DictVectorizer(sparse=False)

    # fit_transform()
    data_new = transfer.fit_transform(data)
    # print(data_new)

    # Inspect attributes

    # Feature names
    print(transfer.feature_names_)  # option 1: attribute
    # option 2: method; it was called get_feature_names() before
    # scikit-learn 1.0, and that older name was removed in 1.2.
    print(transfer.get_feature_names_out())


if __name__ == '__main__':
    dict_feature_extraction()

"one-hot" encoding
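
To make the idea concrete, here is a small hand-written sketch of one-hot encoding for the city values used in this post. The helper `one_hot` is hypothetical and written only for illustration; it is not how DictVectorizer is implemented internally.

```python
cities = ["北京", "上海", "深圳"]

# Sort the distinct categories so that each one gets a fixed column index.
categories = sorted(set(cities))


def one_hot(value, categories):
    """Return a list with 1.0 at the position of `value` and 0.0 elsewhere."""
    return [1.0 if value == category else 0.0 for category in categories]


encoded = [one_hot(city, categories) for city in cities]
for city, row in zip(cities, encoded):
    print(city, row)
```

Every row contains exactly one 1.0, so no category is treated as numerically larger or smaller than another.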

Use cases

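Any data that can be converted into a list of dictionaries, like the example above, can have its features extracted with dictionary feature extraction.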
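
As a sketch of one such case, assume the data arrives as a pandas table with a categorical column. The table below is hypothetical and simply reuses the values from this post. `DataFrame.to_dict(orient="records")` turns each row into a dictionary, which is the input DictVectorizer expects.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Hypothetical table with one categorical column and one numeric column.
df = pd.DataFrame({
    "city": ["北京", "上海", "深圳"],
    "temperature": [100, 60, 30],
})

# One dictionary per row, e.g. {'city': '北京', 'temperature': 100}.
records = df.to_dict(orient="records")

transfer = DictVectorizer(sparse=False)
features = transfer.fit_transform(records)
print(features.shape)  # (3, 4)
```

The string column is expanded into one column per city, while the numeric column is passed through unchanged.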

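Summary

Dictionary feature extraction

Import the library

from sklearn.feature_extraction import DictVectorizer

Instantiate the object

transfer = DictVectorizer(sparse=False)

Call the method

data_new = transfer.fit_transform(data)

Inspect attributes

# Feature names
print(transfer.feature_names_)  # option 1
print(transfer.get_feature_names_out())  # option 2 (get_feature_names() before scikit-learn 1.0, removed in 1.2)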
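
The steps in this summary can be put together in one short sketch. The second half goes slightly beyond the notes: it assumes you want to reuse the fitted transformer on new samples, and the new samples (including the city 广州) are made up for illustration. DictVectorizer silently ignores features it did not see during fitting.

```python
from sklearn.feature_extraction import DictVectorizer

train = [
    {"city": "北京", "temperature": 100},
    {"city": "上海", "temperature": 60},
    {"city": "深圳", "temperature": 30},
]

# Instantiate, then learn the feature names from the training data.
transfer = DictVectorizer(sparse=False)
train_features = transfer.fit_transform(train)
names = transfer.feature_names_

# Reuse the fitted transformer on new samples (hypothetical data).
new_samples = [
    {"city": "上海", "temperature": 25},
    {"city": "广州", "temperature": 35},  # city not seen during fit
]
new_features = transfer.transform(new_samples)

print(names)
print(new_features)
```

The unseen city produces a row whose city columns are all zero, so only its temperature survives the transformation.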

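Note

sparse=True: fit_transform() returns a SciPy sparse matrix, which stores only the non-zero entries and prints them as (row, column) value pairs.

sparse=False: fit_transform() returns a dense NumPy array in which the zeros are written out.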
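
The difference between the two settings can be checked directly. The sketch below uses the same data as the rest of this post; `scipy.sparse.issparse` and `toarray()` are standard SciPy calls, and the variable names are only illustrative.

```python
from scipy import sparse
from sklearn.feature_extraction import DictVectorizer

data = [
    {"city": "北京", "temperature": 100},
    {"city": "上海", "temperature": 60},
    {"city": "深圳", "temperature": 30},
]

sparse_result = DictVectorizer(sparse=True).fit_transform(data)
dense_result = DictVectorizer(sparse=False).fit_transform(data)

print(sparse.issparse(sparse_result))  # True
print(type(dense_result).__name__)     # ndarray

# Both hold the same values; toarray() expands the sparse matrix.
same_values = bool((sparse_result.toarray() == dense_result).all())
print(same_values)  # True

# Only the non-zero entries are stored in the sparse result.
print(sparse_result.nnz, dense_result.size)  # 6 12
```

With many categories most one-hot entries are zero, which is why the sparse form saves memory.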

First review

# -*- coding: utf-8 -*-

"""
@Time    : 2021/4/8 13:40
@Author  : yuhui
@FileName: 08_字典特征抽取_2.py
@Software: PyCharm
"""
from sklearn import feature_extraction


def dictionary_feature_extraction():
    """Dictionary feature extraction."""
    data = [{'city': '北京', 'temperature': 100},
            {'city': '上海', 'temperature': 60},
            {'city': '深圳', 'temperature': 30}]
    transfer = feature_extraction.DictVectorizer(sparse=True)
    data_new = transfer.fit_transform(data)
    print(data_new)
    print(data)

    # Inspect attributes

    # Feature names
    print(transfer.feature_names_)
    # get_feature_names() matches the recorded output below, but it was
    # removed in scikit-learn 1.2; newer versions use get_feature_names_out().
    print(transfer.get_feature_names())

    # Map the transformed data back to a list of dictionaries
    print(transfer.inverse_transform(data_new))


if __name__ == '__main__':
    dictionary_feature_extraction()

D:\Anaconda3\Installation\envs\math\python.exe D:/Machine_Learning/Machine_Learning_1/code/08_字典特征抽取_2.py
  (0, 1)	1.0
  (0, 3)	100.0
  (1, 0)	1.0
  (1, 3)	60.0
  (2, 2)	1.0
  (2, 3)	30.0
[{'city': '北京', 'temperature': 100}, {'city': '上海', 'temperature': 60}, {'city': '深圳', 'temperature': 30}]
['city=上海', 'city=北京', 'city=深圳', 'temperature']
['city=上海', 'city=北京', 'city=深圳', 'temperature']
[{'city=北京': 1.0, 'temperature': 100.0}, {'city=上海': 1.0, 'temperature': 60.0}, {'city=深圳': 1.0, 'temperature': 30.0}]

Process finished with exit code 0
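
The sparse printout above can be read as (row, column) value triples. As a sketch of how to interpret it, the triples and feature names below are copied by hand from that output and rebuilt into a dense matrix with `scipy.sparse.coo_matrix`; the variable names are only illustrative.

```python
from scipy.sparse import coo_matrix

# (row, column, value) triples copied from the printed sparse output.
rows = [0, 0, 1, 1, 2, 2]
cols = [1, 3, 0, 3, 2, 3]
values = [1.0, 100.0, 1.0, 60.0, 1.0, 30.0]

# Column order copied from the printed feature names.
feature_names = ["city=上海", "city=北京", "city=深圳", "temperature"]

dense = coo_matrix((values, (rows, cols)), shape=(3, len(feature_names))).toarray()
print(dense)

# Read row 0 back as {feature name: value}, skipping the zeros.
row0 = {name: float(v) for name, v in zip(feature_names, dense[0]) if v != 0}
print(row0)
```

Row 0 has its 1.0 in column 1 (city=北京) and 100.0 in column 3 (temperature), which agrees with the first dictionary returned by inverse_transform().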
