
Extracting keywords with tf-idf

A brief introduction to tf-idf:

tf: term frequency. How often a given term appears within a single document.

idf: inverse document frequency. A measure of how rare a term is across the whole document collection; the fewer documents contain the term, the higher its idf.

The tf formula:

$$tf_{ij} = \frac{n_{ij}}{\sum_k n_{kj}}$$

Meaning: tf is the number of times term i occurs in document j, divided by the total number of term occurrences in document j.

The idf formula:

$$idf_i = \log\frac{D}{1 + \sum D_{t_i}}$$

Meaning: idf is the logarithm of the total number of documents D divided by (the number of documents containing term i, plus 1); the +1 avoids division by zero for terms that appear in no document.
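
The tf-idf weight itself is simply the product of the two quantities, so a term scores highly when it occurs often in one document but in few documents overall:

$$tfidf_{ij} = tf_{ij} \times idf_i$$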

The steps for extracting keywords: word segmentation, stopword removal, then scoring terms with tf-idf.

Word segmentation:

import jieba.posseg as pseg

def fenci(file):
    """Segment every line of the input file into a list of words."""
    seg_list = []
    with open(file, "r", encoding='utf-8') as f:
        for line in f:
            # lines are tab-separated; the text to segment sits in the second column
            line = line.split('\t')[1]
            # pseg.cut yields (word, flag) pairs; keep only the word strings
            seg = [pair.word for pair in pseg.cut(line)]
            seg_list.append(seg)
    return seg_list
           
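As a usage sketch (the file name and the label-TAB-text line layout are my assumptions about the input, matching the split('\t')[1] above):

# hypothetical input file: every line looks like "label<TAB>sentence"
seg_list = fenci("corpus.txt")
print(seg_list[0])  # the list of segmented words for the first line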

Removing stopwords:

def remove_stopwords(seg_list, stopwords_file):
    """Drop stopwords from each segmented document and join it into one space-separated string."""
    results = []
    # load the stopword list into a set for fast membership tests
    with open(stopwords_file, "r", encoding='utf-8') as f:
        stopwords = {line.rstrip() for line in f}
    for seg in seg_list:
        result = [word for word in seg if word not in stopwords]
        # CountVectorizer expects whitespace-separated tokens
        results.append(" ".join(result))
    return results
           

tf-idf with sklearn:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

def tfidf(words_list):
    """Compute the tf-idf matrix and the vocabulary for a list of documents."""
    vectorizer = CountVectorizer()
    # raw term counts: one row per document, one column per term
    word_frequence = vectorizer.fit_transform(words_list)
    words = vectorizer.get_feature_names_out()
    # reweight the counts by inverse document frequency
    transformer = TfidfTransformer()
    tfidf_matrix = transformer.fit_transform(word_frequence)
    return tfidf_matrix, words
           
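As a side note, sklearn's TfidfVectorizer folds the two steps above into one; a minimal sketch of the equivalent call on the same space-joined words_list:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(words_list)  # same tf-idf weights as the two-step version
words = vectorizer.get_feature_names_out()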

Sort the values of the vector produced by tf-idf

def sort_coo(coo_matrix):
    """Sort the (column_index, score) pairs of a COO matrix by score, descending."""
    tuples = zip(coo_matrix.col, coo_matrix.data)
    return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)
           
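For context on what sort_coo consumes: a sparse matrix in COO format exposes parallel row, col and data arrays, one triple per non-zero entry. A tiny illustration with made-up numbers:

import numpy as np
from scipy.sparse import coo_matrix

m = coo_matrix(np.array([[0.0, 0.58, 0.81],
                         [0.71, 0.0, 0.71]]))
print(m.col)        # column (word) index of every non-zero entry
print(m.data)       # the matching tf-idf scores
print(sort_coo(m))  # (word_index, score) pairs, highest score first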

Extract the top n keywords from the sorted result

def extract_topn_from_vector(feature_names, sorted_items, topn=10):
    """Return the top-n terms and their tf-idf scores as a {term: score} dict."""
    sorted_items = sorted_items[:topn]
    score_vals = []
    feature_vals = []
    for idx, score in sorted_items:
        score_vals.append(round(score, 3))
        feature_vals.append(feature_names[idx])

    # build a {feature: score} mapping from the two parallel lists
    results = {}
    for idx in range(len(feature_vals)):
        results[feature_vals[idx]] = score_vals[idx]

    return results
           

Putting everything together to get the result

seg_list = fenci("***.txt")
word_list = remove_stopwords(seg_list, "../stopwords.txt")
# keep the matrix under its own name so it does not shadow the tfidf() function
tfidf_matrix, words = tfidf(word_list)
print(tfidf_matrix.tocoo())  # optional: inspect the raw sparse (row, col, score) entries
sorted_items = sort_coo(tfidf_matrix.tocoo())
results = extract_topn_from_vector(words, sorted_items, 100)
           
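To inspect the extracted keywords, the returned dict can simply be iterated:

for word, score in results.items():
    print(word, score)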

I previously had no idea how to pull the top n keywords out of the tf-idf result; only after some googling did I find this article: https://www.freecodecamp.org/news/how-to-extract-keywords-from-text-with-tf-idf-and-pythons-scikit-learn-b2a0f3d7e667/. I copied its sorting code pretty much verbatim, haha. Leaving this here as a note to myself.
