
Working through LintCode's AI problems: spam SMS classification

LintCode hosts a dozen or so small Kaggle-style projects, which make excellent hands-on practice for getting started with deep learning. Let's dive into this spam SMS classification problem!

(The plan was to use the Keras framework throughout, since it is easy to pick up, although as section 4 explains, the final model ends up being scikit-learn's LogisticRegression.)

Problem URL: https://www.lintcode.com/ai/spam-message-classification

Problem description:

The problem provides a dataset of 5,574 English SMS messages, each consisting of a few sentences of varying length. Every message is labeled as spam or not spam. Train a classifier on the training set and use it to predict whether each message is spam.

1. Downloading and reading the data

This step was surprisingly fiddly: pd.read_csv kept running into bugs and would not read the full file, so in the end I fell back on the csv module. The reading code below simply goes through the file line by line, and yields a training set of shape (5572,) and a test set of shape (1115,).

import csv
import numpy as np


def read_data(file):
    # First pass: count the rows (including the header row).
    train_data = csv.reader(open(file, encoding="utf-8"))
    lines = 0
    for r in train_data:
        lines += 1
    # One label per data row; the header row is excluded.
    train_data_label = np.zeros([lines - 1, ])
    train_data_content = []
    # Second pass: collect labels and message texts.
    train_data = csv.reader(open(file, encoding="utf-8"))
    i = 0
    for data in train_data:
        # Skip the header row ("Label,..." in train.csv, "SmsId,..." in test.csv).
        if data[0] == "Label" or data[0] == "SmsId":
            continue
        if data[0] == "ham":
            train_data_label[i] = 0
        if data[0] == "spam":
            train_data_label[i] = 1
        train_data_content.append(data[1])
        i += 1
    print(train_data_label.shape, len(train_data_content))
    return train_data_label, train_data_content


# Load the training and test data
train_y,train_data_content = read_data("./垃圾短信分類data/train.csv")
_,test_data_content = read_data("./垃圾短信分類data/test.csv")
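
Before moving on, it's worth a quick look at the class balance (a small sanity check I'd add here; I'm not quoting the exact counts, since I haven't verified them against the files):

import numpy as np

# How many ham (0) vs. spam (1) messages are in the training set?
labels, counts = np.unique(train_y, return_counts=True)
print(dict(zip(labels, counts)))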
           

2. Cleaning the data

A first pass of cleaning: convert every word to lowercase, delete all characters that are not English letters, expand common English contractions, and reduce every word to its stem.

import re
import nltk
from nltk.tokenize import word_tokenize  # needs the punkt data: nltk.download('punkt')


def clean_text(comment_text):
    comment_list = []
    # English stemmer, created once and reused for every message
    s = nltk.stem.snowball.EnglishStemmer()
    for text in comment_text:
        # Convert everything to lowercase
        text = text.lower()
        # Keep only lowercase letters and apostrophes
        text = re.sub(r"[^a-z']", " ", text)
        # Expand common contractions (specific forms first, generic suffixes last)
        text = re.sub(r"what's", "what is ", text)
        text = re.sub(r"can't", "can not ", text)
        text = re.sub(r"cannot", "can not ", text)
        text = re.sub(r"ain't", " are not ", text)
        text = re.sub(r"aren't", " are not ", text)
        text = re.sub(r"couldn't", " could not ", text)
        text = re.sub(r"didn't", " did not ", text)
        text = re.sub(r"doesn't", " does not ", text)
        text = re.sub(r"don't", " do not ", text)
        text = re.sub(r"hadn't", " had not ", text)
        text = re.sub(r"hasn't", " has not ", text)
        text = re.sub(r"n't", " not ", text)
        text = re.sub(r"\'s", " ", text)
        text = re.sub(r"\'ve", " have ", text)
        text = re.sub(r"\'m", " am ", text)
        text = re.sub(r"\'re", " are ", text)
        text = re.sub(r"\'d", " would ", text)
        text = re.sub(r"\'ll", " will ", text)
        # Reduce every token to its stem
        new_text = ""
        for word in word_tokenize(text):
            new_text = new_text + " " + s.stem(word)
        # Collect the cleaned message
        comment_list.append(new_text)
    return comment_list

train_data_content = clean_text(train_data_content)
test_data_content = clean_text(test_data_content)
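
To get a feel for what clean_text actually does, you can push a single message through it. The sample below is invented for illustration and is not taken from the dataset:

# A made-up example message, just to see the cleaning in action
sample = ["You've WON a FREE prize!!! Can't wait? Call 08001234567 now."]
print(clean_text(sample)[0])
# Digits and punctuation are stripped, contractions expanded, and every word stemmed.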
           

3. Computing TF-IDF features

Here TF-IDF is used to turn each message into a vector over the 5,000 most common terms, so every message becomes a 5,000-dimensional vector. I won't go into the details; see the code.

from sklearn.feature_extraction.text import TfidfVectorizer

# Compute TF-IDF features for the data
all_comment_list = list(train_data_content) + list(test_data_content)
text_vector = TfidfVectorizer(sublinear_tf=True, strip_accents='unicode', token_pattern=r'\w{1,}',
                              max_features=5000, ngram_range=(1, 1), analyzer='word')
# Fit the vocabulary on train + test, then transform each split
text_vector.fit(all_comment_list)
train_x = text_vector.transform(train_data_content)
test_x = text_vector.transform(test_data_content)
train_x = train_x.toarray()
test_x = test_x.toarray()
print(train_x.shape, test_x.shape, type(train_x))  # (5572, 5000) (1115, 5000) <class 'numpy.ndarray'>
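
For reference, with sublinear_tf=True scikit-learn weights a term t in message d roughly as (1 + log tf(t, d)) * (log((1 + n) / (1 + df(t))) + 1) and then L2-normalizes each row (this is the standard smoothed formulation as I understand it; check the sklearn docs for the exact definition). You can also peek at the learned vocabulary and its IDF weights:

# Look at a few of the 5000 vocabulary terms and their IDF weights
terms = text_vector.get_feature_names_out()  # on older sklearn versions: get_feature_names()
for term, idf in list(zip(terms, text_vector.idf_))[:10]:
    print(term, idf)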
           

4. Modeling and prediction

I started with a neural network, and the results were terrible; it looked like vanishing gradients, and no amount of training helped. In the end I gave up and went back to basics with sklearn's LogisticRegression. After slightly tuning C it reached 100% accuracy. Would you believe it?

import pandas as pd
from sklearn.linear_model import LogisticRegression

# Build the model
clf = LogisticRegression(C=100.0)
clf.fit(train_x, train_y)
train_scores = clf.score(train_x, train_y)
print(train_scores)
test_y = clf.predict_proba(test_x)

# Predict the answers
print(test_y.shape)
answer = pd.read_csv(open("./垃圾短信分類data/sampleSubmission.csv"))
for i in range(test_y.shape[0]):
    pred = test_y[i, 0]  # probability of class 0 (ham)
    if pred < 0.5:
        answer.loc[i, "Label"] = "spam"
    else:
        answer.loc[i, "Label"] = "ham"
answer.to_csv("./垃圾短信分類data/submission.csv", index=False)  # don't save the index column
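
A perfect score always makes me a little suspicious, so here is a small sketch (not part of the submitted solution) of how the choice of C could be double-checked with cross-validation on the training set:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation accuracy for a few candidate values of C
for c in [1.0, 10.0, 100.0]:
    scores = cross_val_score(LogisticRegression(C=c, max_iter=1000), train_x, train_y, cv=5)
    print(c, scores.mean())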
           

Final result: a score of 1.000 (I was stunned; 100% accuracy really does happen...)

Ranked 3rd out of 108 (my best result so far).

The code has been published; see GitHub.
