Feature Engineering for Machine Learning (Part 1)

2023-07-03 15:09:07

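This blog series collects my notes from studying the book Feature Engineering for Machine Learning; if you spot any mistakes, please point them out. I am working from the English edition published in April 2018 (link: Feature Engineering for Machine Learning). The book has nine chapters, so I tentatively plan to write nine posts. The code that accompanies the book can be found at https://github.com/alicezheng/feature-engineering-book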

(1). The Machine Learning Pipeline

(2). Fancy Tricks with Simple Numbers

(3). Text Data: Flattening, Filtering, and Chunking

(4). The Effects of Feature Scaling: From Bag-of-Words to Tf-idf

(5). Categorical Variables: Counting Eggs in the Age of Robotic Chickens

(6). Dimensionality Reduction: Squashing the Data Pancake with PCA

(7). Nonlinear Featurization via K-Means Model Stacking

(8). Automating the Featurizer: Image Feature Extraction and Deep Learning

(9). Back to the Feature: Building an Academic Paper Recommender

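Let's start with Chapter 1: The Machine Learning Pipeline. A pipeline describes how data moves through a machine learning process from one stage to the next, like items on an assembly line. The Pipeline class in sklearn is one example.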

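from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# X_train and y_train are assumed to be an already-prepared training set.
pipe_lr = Pipeline([
    ('sc', StandardScaler()),                     # standardize each feature
    ('pca', PCA(n_components=2)),                 # keep 2 principal components
    ('clf', LogisticRegression(random_state=1)),  # classify
])
pipe_lr.fit(X_train, y_train)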

This pipeline first standardizes the data, then reduces its dimensionality with PCA, and finally fits a classifier.
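To see the three stages run end to end, here is a minimal self-contained sketch. The synthetic dataset from make_classification, the 75/25 split, and the variable names are my own assumptions for illustration; the original post does not say which data it used.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; the post does not name a dataset).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

pipe_lr = Pipeline([
    ("sc", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("clf", LogisticRegression(random_state=1)),
])

# fit() fits and applies each transformer in order, then fits the classifier
# on the transformed training data.
pipe_lr.fit(X_train, y_train)

# predict() and score() push new data through the same fitted steps.
y_pred = pipe_lr.predict(X_test)
accuracy = pipe_lr.score(X_test, y_test)
print(f"test accuracy: {accuracy:.3f}")
```

Because the scaler and PCA are fitted only on the training split inside the pipeline, the test split is transformed with training statistics, which avoids leaking test information into preprocessing.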

Chapter 1 contents

(1) Data

(2) Tasks

(3) Models

(4) Features

(5) Model Evaluation

Data and tasks are both fairly simple concepts.

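A model describes relationships in the data. For example, a model that predicts stock prices might model a company's historical earnings and its historical stock prices. Raw data, however, is often not numeric, so features are needed to serve as the bridge between the raw data and the model.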

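A feature is a numeric representation of raw data. Feature engineering is the process of coming up with the most suitable features given the data, the model, and the task.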
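As a small illustration of a feature as a numeric representation of raw data, the sketch below turns a raw sentence into a vector of word counts (a bag-of-words representation, which the book covers in Chapter 3). The helper name bag_of_words, the vocabulary, and the sample sentence are hypothetical and chosen only for this example.

```python
from collections import Counter


def bag_of_words(text, vocabulary):
    """Map raw text to a list of counts, one per vocabulary word."""
    # Lowercase and split on whitespace; a deliberately naive tokenizer.
    counts = Counter(text.lower().split())
    # Counter returns 0 for words that never appear in the text.
    return [counts[word] for word in vocabulary]


# Hypothetical vocabulary and raw input, for illustration only.
vocabulary = ["stock", "price", "rose", "fell"]
raw = "Stock price rose and rose again"

features = bag_of_words(raw, vocabulary)
print(features)  # [1, 1, 2, 0]
```

The model never sees the raw string, only the numeric vector, which is why the choice of representation matters for the model and task at hand.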
