Lintcode hosts a dozen or so small Kaggle-style AI projects that are perfect warm-up exercises. Let's get our hands dirty with this spam message classification problem!
(Simple to get working; spoiler: plain scikit-learn ends up beating the neural network!)
Problem URL: https://www.lintcode.com/ai/spam-message-classification
Problem description:
The dataset contains 5,574 English SMS messages, each consisting of a few sentences of varying length. Every message is labeled as spam or not. Train a classifier on the training set that predicts whether a message is spam.
1. Downloading and reading the data
This step was surprisingly fiddly: pd.read_csv kept hitting bugs and would not read the file completely, so in the end I fell back to the stdlib csv module. The reading code below just iterates row by row; nothing fancy. It yields a training set of shape (5572,) and a test set of shape (1115,).
import csv
import numpy as np

def read_data(file):
    # First pass: count the rows so the label array can be sized
    train_data = csv.reader(open(file, encoding="utf-8"))
    lines = 0
    for r in train_data:
        lines += 1
    train_data_label = np.zeros([lines - 1, ])
    train_data_content = []
    # Second pass: collect labels and message contents
    train_data = csv.reader(open(file, encoding="utf-8"))
    i = 0
    for data in train_data:
        if data[0] == "Label" or data[0] == "SmsId":  # skip the header row
            continue
        if data[0] == "ham":
            train_data_label[i] = 0
        if data[0] == "spam":
            train_data_label[i] = 1
        train_data_content.append(data[1])
        i += 1
    print(train_data_label.shape, len(train_data_content))
    return train_data_label, train_data_content
# Load the data
train_y, train_data_content = read_data("./垃圾短信分類data/train.csv")
_, test_data_content = read_data("./垃圾短信分類data/test.csv")
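Instead of counting rows in a first pass just to size the label array, the same file can be read in a single pass with plain Python lists. A minimal sketch using only the stdlib (the header/column layout is assumed from the code above; the sample data here is made up):

```python
import csv
import io

def read_rows(f):
    """Single-pass CSV read: first column is 'ham'/'spam', second is the text."""
    reader = csv.reader(f)
    next(reader)  # skip the header row
    labels, contents = [], []
    for row in reader:
        labels.append(1 if row[0] == "spam" else 0)
        contents.append(row[1])
    return labels, contents

# Toy demonstration with an in-memory file
sample = "Label,SMS\nham,See you at five\nspam,WINNER!! Claim your prize now\n"
labels, contents = read_rows(io.StringIO(sample))
```

The label list can be handed to `np.array(labels)` afterwards, which avoids opening the file twice.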
2. Cleaning the data
A first round of cleaning: lowercase every word, remove all non-English characters, expand common English contractions, and reduce every word to its stem.
import re
import nltk
from nltk.tokenize import word_tokenize  # requires nltk.download('punkt')

def clean_text(comment_text):
    comment_list = []
    s = nltk.stem.snowball.EnglishStemmer()  # English stemmer
    for text in comment_text:
        # Lowercase all words
        text = text.lower()
        # Remove everything except letters and apostrophes
        text = re.sub(r"[^a-z']", " ", text)
        # Expand common contractions (order matters: specific patterns
        # must run before generic ones like "n't" and "'s")
        text = re.sub(r"what's", "what is ", text)
        text = re.sub(r"can't", "can not ", text)
        text = re.sub(r"cannot", "can not ", text)
        text = re.sub(r"ain't", " are not ", text)
        text = re.sub(r"aren't", " are not ", text)
        text = re.sub(r"couldn't", " could not ", text)
        text = re.sub(r"didn't", " did not ", text)
        text = re.sub(r"doesn't", " does not ", text)
        text = re.sub(r"don't", " do not ", text)
        text = re.sub(r"hadn't", " had not ", text)
        text = re.sub(r"hasn't", " has not ", text)
        text = re.sub(r"n't", " not ", text)
        text = re.sub(r"'ve", " have ", text)
        text = re.sub(r"'m", " am ", text)
        text = re.sub(r"'re", " are ", text)
        text = re.sub(r"'ll", " will ", text)
        text = re.sub(r"'d", " would ", text)
        text = re.sub(r"'s", " ", text)
        # Reduce each word to its stem
        new_text = ""
        for word in word_tokenize(text):
            new_text = new_text + " " + s.stem(word)
        # Put the cleaned message back
        comment_list.append(new_text)
    return comment_list
train_data_content = clean_text(train_data_content)
test_data_content = clean_text(test_data_content)
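The chain of re.sub calls above is really an ordered table of (pattern, replacement) rules, which is easier to extend and to keep in the right order. A minimal stdlib-only sketch of that idea (rule list abbreviated; the full set would mirror the function above):

```python
import re

# Order matters: "what's" must fire before the generic "'s" rule,
# and "can't" before the generic "n't" rule.
CONTRACTIONS = [
    (r"what's", "what is "),
    (r"can't", "can not "),
    (r"n't", " not "),
    (r"'s", " "),
    (r"'re", " are "),
    (r"'ll", " will "),
]

def expand(text):
    text = text.lower()
    for pattern, repl in CONTRACTIONS:
        text = re.sub(pattern, repl, text)
    # Collapse the extra spaces the replacements leave behind
    return re.sub(r"\s+", " ", text).strip()
```

For example, `expand("What's up? You can't stop, they're coming")` yields `"what is up? you can not stop, they are coming"`.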
3. Computing TF-IDF
Here TF-IDF turns each message into a vector over the 5,000 most frequent terms, so every message becomes a 5,000-dimensional vector. The code speaks for itself:
from sklearn.feature_extraction.text import TfidfVectorizer

# Compute TF-IDF features over the combined corpus
all_comment_list = list(train_data_content) + list(test_data_content)
text_vector = TfidfVectorizer(sublinear_tf=True, strip_accents='unicode',
                              token_pattern=r'\w{1,}', max_features=5000,
                              ngram_range=(1, 1), analyzer='word')
text_vector.fit(all_comment_list)
train_x = text_vector.transform(train_data_content)
test_x = text_vector.transform(test_data_content)
train_x = train_x.toarray()
test_x = test_x.toarray()
print(train_x.shape, test_x.shape, type(train_x))  # (5572, 5000) (1115, 5000) <class 'numpy.ndarray'>
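For intuition, TF-IDF weighs how often a term appears in a document against how many documents contain it: tf(t, d) * log(N / df(t)). A bare-bones sketch of that core idea on a toy corpus (note this is not sklearn's exact formula, which also adds smoothing, the sublinear_tf option, and L2 normalization):

```python
import math
from collections import Counter

def tfidf(docs):
    """Raw tf-idf: term count in a doc times log(N / doc frequency)."""
    n = len(docs)
    df = Counter()  # in how many docs each term appears
    for doc in docs:
        df.update(set(doc))
    return [
        {t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
        for doc in docs
    ]

docs = [["free", "prize", "now"], ["see", "you", "now"], ["free", "call"]]
weights = tfidf(docs)
```

In this toy corpus "prize" appears in only one of the three documents, so it gets a higher weight than "now", which appears in two; that is exactly why rare, spammy words end up dominating the feature vector.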
4. Modeling and prediction
I started with a neural network and the results were terrible; it looked like vanishing gradients, and no amount of training helped. Fed up, I went back to basics with sklearn's LogisticRegression, and after a small tweak to C it hit 100% accuracy. Can you believe it?
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Build the model
clf = LogisticRegression(C=100.0)
clf.fit(train_x, train_y)
train_scores = clf.score(train_x, train_y)
print(train_scores)
test_y = clf.predict_proba(test_x)
# Predict the answers
print(test_y.shape)
answer = pd.read_csv("./垃圾短信分類data/sampleSubmission.csv")
for i in range(test_y.shape[0]):
    prob_ham = test_y[i, 0]  # column 0 is P(ham), since ham was encoded as 0
    if prob_ham < 0.5:
        answer.loc[i, "Label"] = "spam"
    else:
        answer.loc[i, "Label"] = "ham"
answer.to_csv("./垃圾短信分類data/submission.csv", index=False)  # don't save the index column
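One detail worth remembering: predict_proba returns one column per class in the order of clf.classes_, so with ham encoded as 0, column 0 is P(ham). The thresholding loop above can be sketched as a tiny helper (the label strings are the ones the submission file expects):

```python
def to_label(prob_ham, threshold=0.5):
    """Map the probability of the 'ham' class (column 0) to a label string."""
    return "ham" if prob_ham >= threshold else "spam"
```

Picking column 1 by mistake would silently flip every prediction, which is why checking clf.classes_ first is cheap insurance.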
Final result: a score of 1.000 (I was stunned; 100% accuracy really does happen.....)
Rank: 3/108 (my best finish ever)
The code has been published; see GitHub.