This article comes from my personal blog: Python word segmentation, computing per-document TF-IDF values, and sorting the results.
What the program does: it first reads a set of documents and segments each one with jieba, saving the segmented text to files; it then uses sklearn to compute the tf-idf value of every term in each document, and finally sorts all the results into one large file.
Dependencies:
sklearn
jieba
Note: this program was adapted from a fellow developer's program.
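Before the full script, here is a minimal sketch of the same pipeline on a small in-memory corpus (the two sample sentences and the variable names are made up purely for illustration). It shows the core idea: jieba tokens are joined with spaces so that sklearn's CountVectorizer can split them back into terms, and TfidfTransformer then turns the raw counts into tf-idf weights.

# Minimal sketch (illustrative only): jieba segmentation + sklearn tf-idf
import jieba
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = [u"我愛自然語言處理", u"自然語言處理很有趣"]    # stand-ins for documents read from disk
corpus = [" ".join(jieba.cut(d)) for d in docs]         # space-separated tokens, one string per document
counts = CountVectorizer().fit_transform(corpus)        # term-count matrix
tfidf = TfidfTransformer().fit_transform(counts)        # tf-idf weight matrix
print(tfidf.toarray())

The full script below does the same thing, but reads the documents from disk, saves the segmented text, and writes one weight file per document.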
# -*- coding: utf-8 -*-
"""
@author: jiangfuqiang
"""
import os
import re
import sys
import time
import jieba
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

# Python 2 only: make utf-8 the default encoding so the Chinese tokens can be written to files
reload(sys)
sys.setdefaultencoding('utf-8')
def getFileList(path):
    # Return the names of all files under path, skipping hidden files
    filelist = []
    for f in os.listdir(path):
        if not f.startswith('.'):
            filelist.append(f)
    return filelist, path
def fenci(filename, path, segPath):
    # Read one document
    with open(path + "/" + filename, 'r') as f:
        file_list = f.read()

    # Directory for saving the segmentation results
    if not os.path.exists(segPath):
        os.mkdir(segPath)

    # Segment the document with jieba (full mode)
    seg_list = jieba.cut(file_list, cut_all=True)

    # Drop empty tokens, newlines, bracket/equals tokens and tokens containing
    # ASCII letters, digits or underscores
    result = []
    for seg in seg_list:
        seg = ''.join(seg.split())
        r = re.search(r'\w+', seg)
        if seg != '' and seg != '\n' and seg != '\r\n' and seg != '=' and \
           seg != '[' and seg != ']' and seg != '(' and seg != ')' and not r:
            result.append(seg)

    # Join the tokens with spaces and save them locally
    with open(segPath + "/" + filename + "-seg.txt", "w") as f:
        f.write(' '.join(result))
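# Note (not part of the original script): cut_all=True is jieba's "full mode", which
# lists every word it can find and therefore yields overlapping tokens; calling
# jieba.cut(file_list) without cut_all uses the default accurate mode, which usually
# gives cleaner tokens for TF-IDF weighting.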
# Read the segmented documents and compute the TF-IDF weights
def Tfidf(filelist, sFilePath, path):
    corpus = []
    for ff in filelist:
        fname = path + "/" + ff
        with open(fname + "-seg.txt", 'r') as f:
            corpus.append(f.read())

    vectorizer = CountVectorizer()
    transformer = TfidfTransformer()
    tfidf = transformer.fit_transform(vectorizer.fit_transform(corpus))
    word = vectorizer.get_feature_names()   # the vocabulary shared by all documents
    weight = tfidf.toarray()                # weight[i][j]: tf-idf of word j in document i

    if not os.path.exists(sFilePath):
        os.mkdir(sFilePath)

    # Write one "term weight" file per document
    for i in range(len(weight)):
        outname = sFilePath + "/" + str(i).zfill(5) + ".txt"
        print u'----------writing all the tf-idf in the', i, u'file into', outname
        with open(outname, 'w') as f:
            for j in range(len(word)):
                f.write(word[j] + " " + str(weight[i][j]) + "\n")
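# Note (not part of the original script): newer scikit-learn versions also provide
# TfidfVectorizer, which combines the CountVectorizer and TfidfTransformer steps:
#     from sklearn.feature_extraction.text import TfidfVectorizer
#     tfidf = TfidfVectorizer().fit_transform(corpus)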
if __name__ == "__main__":
    # Directory for the tf-idf results
    sFilePath = "/home/lifeix/soft/allfile/tfidffile" + str(time.time())
    # Directory for the segmentation results
    segPath = '/home/lifeix/soft/allfile/segfile'
    (allfile, path) = getFileList('/home/lifeix/soft/allkeyword')
    for ff in allfile:
        print "Using jieba on " + ff
        fenci(ff, path, segPath)
    Tfidf(allfile, sFilePath, segPath)
    # Merge all the per-document files and sort them by the tf-idf column, largest first
    os.system("sort -nrk 2 " + sFilePath + "/*.txt > " + sFilePath + "/sorted.txt")
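The last line shells out to sort, which concatenates every per-document weight file and sorts the lines numerically by the second column (the tf-idf weight) in descending order, writing the result to sorted.txt. If the sort utility is not available, a pure-Python equivalent could look like the sketch below (sort_weights is a hypothetical helper name, not part of the original program):

import glob

def sort_weights(sFilePath):
    # Collect every "term weight" pair from the per-document files
    rows = []
    for fname in glob.glob(sFilePath + "/*.txt"):
        if fname.endswith("sorted.txt"):
            continue
        with open(fname) as f:
            for line in f:
                parts = line.split()
                if len(parts) == 2:
                    rows.append((parts[0], float(parts[1])))
    # Largest tf-idf weight first, like "sort -nrk 2"
    rows.sort(key=lambda item: item[1], reverse=True)
    with open(sFilePath + "/sorted.txt", "w") as out:
        for word, weight in rows:
            out.write("%s %s\n" % (word, weight))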