
Text Classification (1): Word Segmentation, Stop-Word Removal, and Noun Extraction

Exams are over, so here is a write-up. It has been a while, so I can't remember everything I referenced.

Word Segmentation

I just used jieba directly. The assignment required keeping nouns, so I only kept nouns (P.S. other parts of speech can be kept as well).
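For reference, a minimal sketch of what jieba.posseg produces: pseg.cut yields (word, flag) pairs, and the flags filtered for below (n, ns, nt, nz, nx) are the noun-like tags the script keeps. The sample sentence is only an illustration.

import jieba.posseg as pseg

noun_flags = {'n', 'ns', 'nt', 'nz', 'nx'}  # noun-like part-of-speech tags
for word, flag in pseg.cut('小明硕士毕业于中国科学院计算所'):
    if flag in noun_flags:
        print(word, flag)  # only noun-tagged words are printed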

import os
import jieba.posseg as pseg
import myIO


def load_stopwords():
    # Load the stop-word list (one word per line); returning a set
    # keeps the membership tests below fast
    with open('C:/lyr/DM/stop_words_ch.txt') as f:
        return set(line.strip() for line in f)


def cut_words(label, file_list, file_path, cut_dir):
    # Segment every file in one category directory
    print('Run task (%s)...' % os.getpid())
    for file_name in file_list:
        fullpath = file_path + file_name
        content = myIO.readfile(fullpath)
        # Drop line breaks and spaces before segmentation (content is raw bytes)
        content = content.replace(b'\r\n', b'').strip()
        content = content.replace(b' ', b'').strip()
        content_seg = pseg.cut(content)
        _write_noun(file_name, content_seg, cut_dir)
 
 
def _write_noun(file_name, content_seg, cut_words_path):
    # Could also try collecting the words into a set and pickling the Python object
    fullpath = cut_words_path + file_name
    stop_words = load_stopwords()
    result_seg = ''
    noun = ['n', 'ns', 'nt', 'nz', 'nx']  # part-of-speech tags to keep
    for word, flag in content_seg:
        if word in stop_words:
            continue
        if flag in noun:
            result_seg = result_seg + word + ' '
    myIO.savefile(fullpath, result_seg.encode('utf-8'))

def gen_save_words(source_path, cut_path):
    path_list = os.listdir(source_path)
    for mydir in path_list:
        print(mydir)
        file_path = source_path + mydir + '/'
        cut_dir = cut_path + mydir + '/'
        if not os.path.exists(cut_dir):
            os.makedirs(cut_dir)
        file_list = os.listdir(file_path)
        # Segment the files in this category
        cut_words(mydir, file_list, file_path, cut_dir)


source_path = 'C:/lyr/DM/trainData/'
cut_path = 'C:/lyr/DM/train_cut/'
gen_save_words(source_path, cut_path)
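myIO is my own helper module and isn't shown in this post. A minimal stand-in, assuming readfile returns a file's raw bytes and savefile writes bytes back to disk (which matches how they are called above), could look like the sketch below; the module and function names come from the script, but this body is only a guess.

# myIO.py -- hypothetical stand-in for the helper module used above
def readfile(path):
    # Return the raw bytes of a file
    with open(path, 'rb') as f:
        return f.read()

def savefile(path, content):
    # Write raw bytes to a file
    with open(path, 'wb') as f:
        f.write(content)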
           
