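A brief introduction to tf-idf:
tf: term frequency. How often a given term occurs within a single document.
idf: inverse document frequency. A measure of how rare a term is across the whole corpus: the more documents contain the term, the lower its idf.
tf formula: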
$$tf_{ij} = \frac{n_{ij}}{\sum_k n_{kj}}$$
Meaning: tf is the number of times term i occurs in document j, divided by the total number of occurrences of all terms in document j.
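As a quick illustration of the tf formula, here is a minimal sketch that computes tf for every term of one already tokenized document. The function name `term_frequency` and the toy document are made up for this example.

```python
from collections import Counter

def term_frequency(tokens):
    """tf for each term: its count divided by the total token count."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}

# Hypothetical toy document, already split into words.
doc = ["apple", "banana", "apple", "cherry"]
tf = term_frequency(doc)
print(tf["apple"])   # 2 of the 4 tokens are "apple"
```

All tf values of one document add up to 1, which makes documents of different lengths comparable.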
idf formula:
$$idf_{i} = \log\frac{D}{1+\sum D_{t_i}}$$
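Meaning: idf is the number of documents D divided by (the number of documents containing term i, plus 1), and then the logarithm is taken. The 1 in the denominator avoids a division by zero for a term that appears in no document.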
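The idf formula can be sketched the same way. This is a hand-rolled version of the formula above, assuming the natural logarithm; `inverse_document_frequency` and the toy corpus are hypothetical. Note that scikit-learn's `TfidfTransformer` uses a slightly different, smoothed idf by default, so its numbers will not match this sketch exactly.

```python
import math

def inverse_document_frequency(term, docs):
    """idf = log(D / (1 + number of documents containing the term))."""
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / (1 + containing))

# Hypothetical toy corpus: each document is a list of words.
docs = [
    ["apple", "banana"],
    ["apple", "cherry"],
    ["banana", "cherry"],
    ["apple", "durian"],
]
rare = inverse_document_frequency("durian", docs)    # in 1 of 4 documents
common = inverse_document_frequency("apple", docs)   # in 3 of 4 documents
print(rare, common)
```

The rarer term gets the larger idf. With this variant a term that occurs in nearly every document can end up with an idf of zero or below, which is one reason library implementations adjust the formula.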
Steps for extracting keywords: word segmentation, stop-word removal, then tf-idf to pick out the keywords.
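Word segmentation:
import jieba.posseg as pseg

def fenci(file):
    """Segment the text column of each line of a tab-separated file."""
    seg_list = []
    with open(file, "r", encoding='utf-8') as f:
        for line in f:
            # The text is expected in the second tab-separated field.
            line = line.rstrip('\n').split('\t')[1]
            # pseg.cut yields (word, flag) pairs; keep only the word strings
            # so they can be compared with stop words and joined later.
            seg = [pair.word for pair in pseg.cut(line)]
            seg_list.append(seg)
    return seg_list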
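Stop-word removal:
def remove_stopwords(seg_list, stopwords_file):
    """Drop stop words and join each document's words with spaces."""
    # One stop word per line; a set gives fast membership tests.
    with open(stopwords_file, encoding='utf-8') as f:
        stopwords = {line.strip() for line in f}
    results = []
    for seg in seg_list:
        result = [word for word in seg if word not in stopwords]
        # One space-separated string per document, ready for vectorizing.
        results.append(" ".join(result))
    return results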
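tf-idf with sklearn:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

def tfidf(words_list):
    """Return the tf-idf matrix (documents x terms) and the vocabulary."""
    # Count term occurrences per document. Note that the default
    # token_pattern drops tokens shorter than two characters.
    vectorizer = CountVectorizer()
    word_frequence = vectorizer.fit_transform(words_list)
    # get_feature_names() was removed in scikit-learn 1.2;
    # get_feature_names_out() is available from 1.0 onwards.
    words = vectorizer.get_feature_names_out()
    transformer = TfidfTransformer()
    tfidf_matrix = transformer.fit_transform(word_frequence)
    return tfidf_matrix, words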
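To see roughly what the transformer does with the counts, here is a pure-Python sketch of the weighting that the scikit-learn documentation describes as the default (`smooth_idf=True`, `norm='l2'`): raw counts times `ln((1 + n) / (1 + df)) + 1`, with each row then scaled to unit length. It is an illustration under those assumptions, not sklearn's actual code, and `tfidf_rows` is a made-up name.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """tf-idf per document with smoothed idf and L2-normalised rows."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for doc in docs:
        weights = {t: n * idf[t] for t, n in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows

# Hypothetical toy corpus: each document is a list of words.
docs = [["apple", "banana", "apple"], ["banana", "cherry"]]
rows = tfidf_rows(docs)
print(rows[1])
```

In the second document "cherry" scores higher than "banana" because "banana" also occurs in the first document.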
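Sort the entries of the tf-idf matrix:
def sort_coo(coo_matrix):
    """Return (column index, score) pairs, highest score first."""
    tuples = zip(coo_matrix.col, coo_matrix.data)
    # Sort by score, breaking ties by column index, in descending order.
    return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)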
Extract the top n keywords from the sorted result:
def extract_topn_from_vector(feature_names, sorted_items, topn=10):
    """Map the topn highest-scoring feature names to their scores."""
    sorted_items = sorted_items[:topn]
    results = {}
    for idx, score in sorted_items:
        feature = feature_names[idx]
        # Items arrive highest score first, so keep the first score seen
        # when the same word shows up for more than one document.
        if feature not in results:
            results[feature] = round(score, 3)
    return results
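The sorting and top-n steps can be tried on toy data without building a real matrix. The sketch below uses a small stand-in object that only carries the `col` and `data` attributes read from a COO matrix; `FakeCoo`, `top_keywords`, and the scores are invented for illustration.

```python
from collections import namedtuple

# Stand-in for a scipy COO matrix: only the two attributes used here.
FakeCoo = namedtuple("FakeCoo", ["col", "data"])

def top_keywords(coo, feature_names, topn=2):
    """Sort (column, score) pairs by score and name the topn of them."""
    pairs = sorted(zip(coo.col, coo.data), key=lambda x: (x[1], x[0]), reverse=True)
    return [(feature_names[idx], round(score, 3)) for idx, score in pairs[:topn]]

feature_names = ["apple", "banana", "cherry"]
coo = FakeCoo(col=[0, 1, 2], data=[0.2, 0.9, 0.5])
print(top_keywords(coo, feature_names))
# prints [('banana', 0.9), ('cherry', 0.5)]
```

Returning a list of pairs instead of a dict keeps the ranking order explicit and does not silently merge a word that appears more than once.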
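Putting it all together:
seg_list = fenci("***.txt")
word_list = remove_stopwords(seg_list, "../stopwords.txt")
# Use a different name for the result so the tfidf function is not overwritten.
tfidf_matrix, words = tfidf(word_list)
print(tfidf_matrix.tocoo())
sorted_items = sort_coo(tfidf_matrix.tocoo())
results = extract_topn_from_vector(words, sorted_items, 100)
print(results)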
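For a while I didn't know how to pull the top n keywords out of the tf-idf result. After some googling I found this article: https://www.freecodecamp.org/news/how-to-extract-keywords-from-text-with-tf-idf-and-pythons-scikit-learn-b2a0f3d7e667/. I copied the sorting code from it more or less verbatim, haha. Writing it down here for the record.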