
Extracting keywords with tf-idf

A brief introduction to tf-idf:

tf: term frequency. How often a given term appears within a single document.

idf: inverse document frequency. A measure of how rare a term is across the whole document collection; the fewer documents contain the term, the higher its idf.

The tf formula:

$$tf_{ij} = \frac{n_{ij}}{\sum_k n_{kj}}$$

Meaning: tf is the number of times term i occurs in document j, divided by the total number of term occurrences in document j.

The idf formula:

$$idf_i = \log\frac{D}{1 + \sum D_{t_i}}$$

Meaning: idf is the logarithm of the total number of documents D divided by (the number of documents containing term i, plus 1); the +1 avoids division by zero for terms that appear in no document.
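
The tf-idf weight itself is simply the product of the two quantities, so a term scores highly when it occurs often in one document but in few documents overall:

$$tfidf_{ij} = tf_{ij} \times idf_i$$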

The steps for extracting keywords: word segmentation, stopword removal, then scoring terms with tf-idf.

Word segmentation:

import jieba.posseg as pseg

def fenci(file):
    """Segment every line of the input file into a list of words."""
    seg_list = []
    with open(file, "r", encoding='utf-8') as f:
        for line in f:
            # lines are tab-separated; the text to segment sits in the second column
            line = line.split('\t')[1]
            # pseg.cut yields (word, flag) pairs; keep only the word strings
            seg = [pair.word for pair in pseg.cut(line)]
            seg_list.append(seg)
    return seg_list
           
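As a usage sketch (the file name and the label-TAB-text line layout are my assumptions about the input, matching the split('\t')[1] above):

# hypothetical input file: every line looks like "label<TAB>sentence"
seg_list = fenci("corpus.txt")
print(seg_list[0])  # the list of segmented words for the first line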

Removing stopwords:

def remove_stopwords(seg_list, stopwords_file):
    """Drop stopwords from each segmented document and join it into one space-separated string."""
    results = []
    # load the stopword list into a set for fast membership tests
    with open(stopwords_file, "r", encoding='utf-8') as f:
        stopwords = {line.rstrip() for line in f}
    for seg in seg_list:
        result = [word for word in seg if word not in stopwords]
        # CountVectorizer expects whitespace-separated tokens
        results.append(" ".join(result))
    return results
           

tf-idf with sklearn:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

def tfidf(words_list):
    """Compute the tf-idf matrix and the vocabulary for a list of documents."""
    vectorizer = CountVectorizer()
    # raw term counts: one row per document, one column per term
    word_frequence = vectorizer.fit_transform(words_list)
    words = vectorizer.get_feature_names_out()
    # reweight the counts by inverse document frequency
    transformer = TfidfTransformer()
    tfidf_matrix = transformer.fit_transform(word_frequence)
    return tfidf_matrix, words
           
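As a side note, sklearn's TfidfVectorizer folds the two steps above into one; a minimal sketch of the equivalent call on the same space-joined words_list:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(words_list)  # same tf-idf weights as the two-step version
words = vectorizer.get_feature_names_out()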

Sort the values of the vector produced by tf-idf

def sort_coo(coo_matrix):
    """Sort the (column_index, score) pairs of a COO matrix by score, descending."""
    tuples = zip(coo_matrix.col, coo_matrix.data)
    return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)
           
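For context on what sort_coo consumes: a sparse matrix in COO format exposes parallel row, col and data arrays, one triple per non-zero entry. A tiny illustration with made-up numbers:

import numpy as np
from scipy.sparse import coo_matrix

m = coo_matrix(np.array([[0.0, 0.58, 0.81],
                         [0.71, 0.0, 0.71]]))
print(m.col)        # column (word) index of every non-zero entry
print(m.data)       # the matching tf-idf scores
print(sort_coo(m))  # (word_index, score) pairs, highest score first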

Extract the top n keywords from the sorted result

def extract_topn_from_vector(feature_names, sorted_items, topn=10):
    """Return the top-n terms and their tf-idf scores as a {term: score} dict."""
    sorted_items = sorted_items[:topn]
    score_vals = []
    feature_vals = []
    for idx, score in sorted_items:
        score_vals.append(round(score, 3))
        feature_vals.append(feature_names[idx])

    # build a {feature: score} mapping from the two parallel lists
    results = {}
    for idx in range(len(feature_vals)):
        results[feature_vals[idx]] = score_vals[idx]

    return results
           

Putting everything together to get the result

seg_list = fenci("***.txt")
word_list = remove_stopwords(seg_list, "../stopwords.txt")
# keep the matrix under its own name so it does not shadow the tfidf() function
tfidf_matrix, words = tfidf(word_list)
print(tfidf_matrix.tocoo())  # optional: inspect the raw sparse (row, col, score) entries
sorted_items = sort_coo(tfidf_matrix.tocoo())
results = extract_topn_from_vector(words, sorted_items, 100)
           
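To inspect the extracted keywords, the returned dict can simply be iterated:

for word, score in results.items():
    print(word, score)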

I previously had no idea how to pull the top n keywords out of the tf-idf result; only after some googling did I find this article: https://www.freecodecamp.org/news/how-to-extract-keywords-from-text-with-tf-idf-and-pythons-scikit-learn-b2a0f3d7e667/. I copied its sorting code pretty much verbatim, haha. Leaving this here as a note to myself.
