Part-of-Speech Tagging
Part-of-speech tagging (POS tagging), also called grammatical tagging or word-category disambiguation, is a text-processing technique in corpus linguistics that labels each word in a corpus with its part of speech according to its meaning and context (source: Baidu Baike).
What POS tagging is good for
1. Disambiguation: the same word can play different grammatical roles with different meanings, e.g. love:
"I love the way she sings that song."
"Where there is great love, there are always miracles."
2. Strengthening word-based features: a machine learning model can extract information from many aspects of a word, but a word that already carries a POS tag provides a more precise feature. For example:
Sentence: 'Love thy neighbor as thyself. We all love to talk about ourselves.'
Word frequencies with POS tags:
{'love/VB': 1,
'thy/JJ': 1,
'neighbor/NN': 1,
'as/IN': 1,
'thyself/NN': 1,
'we/PRP': 1,
'all/DT': 1,
'love/VBP': 1,
'to/TO': 1,
'talk/VB': 1,
'about/IN': 1,
'ourselves/PRP': 1}
Without POS tags, the two occurrences of "love" are treated as the same word, with a frequency of 2:
{'love': 2,
'thy': 1,
'neighbor': 1,
'as': 1,
'thyself': 1,
'we': 1,
'all': 1,
'to': 1,
'talk': 1,
'about': 1,
'ourselves': 1}
3. Normalization and lemmatization: POS tagging is one of the foundational steps for lemmatization, helping reduce each word to its base form.
4. Effective stop-word removal: POS tags make it easy to remove stop words, for example by dropping determiners and prepositions wholesale.
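A small sketch of point 3: WordNet-based lemmatizers (such as NLTK's WordNetLemmatizer) expect a coarse POS letter ('n', 'v', 'a', 'r') rather than a Penn Treebank tag like VBP, so a mapping function is the usual glue between the tagger and the lemmatizer. The helper name penn_to_wordnet below is our own, not an NLTK API; only the tag prefixes come from the Penn tag set.

```python
# Map a Penn Treebank tag (as returned by nltk.pos_tag) to the coarse
# POS letter that WordNet-based lemmatizers expect.
# Note: penn_to_wordnet is a hypothetical helper name, not an NLTK API.
def penn_to_wordnet(tag):
    if tag.startswith('J'):
        return 'a'   # adjective
    if tag.startswith('V'):
        return 'v'   # verb
    if tag.startswith('R'):
        return 'r'   # adverb
    return 'n'       # default to noun

print(penn_to_wordnet('VBP'))  # 'v', so "loves" lemmatizes to "love"
```

With this mapping, lemmatizer.lemmatize(word, penn_to_wordnet(tag)) reduces "loves/VBZ" to "love" instead of leaving it untouched.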
Applications of POS tagging
1. Preprocessing for syntactic parsing
2. Preprocessing for lexical acquisition
3. Preprocessing for information extraction
Steps for Chinese POS tagging:
- Read the text
- Read the stop-word list
- Tokenize and remove stop words
- POS-tag the tokens
- Count word frequencies
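The steps above can be sketched end to end with the standard library alone; tokenization is stubbed here with a pre-split token list so the sketch stays library-free (with real Chinese text you would call jieba.lcut instead). The function name count_words is our own.

```python
from collections import Counter

# Library-free sketch of the pipeline: tokens in, stop words out,
# frequencies counted. Tokenization itself is assumed already done
# (e.g. by jieba for Chinese).
def count_words(tokens, stopwords):
    kept = [t for t in tokens if t not in stopwords]  # remove stop words
    return Counter(kept)                              # count frequencies

tokens = ['我', '爱', '自然', '语言', '处理', '的', '自然']
freq = count_words(tokens, stopwords={'我', '的'})
print(freq.most_common(1))  # [('自然', 2)]
```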
# Chinese text: POS tagging and word-frequency counting
import re
import jieba
import jieba.posseg as pseg
import pandas as pd

class WordsCounter:
    def __init__(self, filepath, path):
        self.filepath = filepath  # path to the text file
        self.path = path          # path to the stop-word list

    # Read the text
    def get_text(self):
        text = ''
        with open(self.filepath, 'r', encoding='utf-8') as file:
            for line in file:
                if line != '\n':
                    text += line.strip()
        return text

    # Read the stop words
    def get_stopwords(self):
        with open(self.path, 'r', encoding='utf-8') as file:
            return [w.strip() for w in file]

    # Tokenize and remove stop words
    def words_token(self):
        # jieba handles some person and place names poorly,
        # but we can teach it these words, e.g.:
        for name in ('沙瑞金', '易学习', '王大路', '欧阳菁', '高育良', '李达康',
                     '侯亮平', '赵东来', '京州市', '毛娅', '陈海', '丁义珍',
                     '赵德汉', '祁同伟', '陆亦可', '陈岩石', '郑西坡', '陈清泉',
                     '蔡成功', '孙连城', '侦察处', '高小琴'):
            jieba.suggest_freq(name, True)
        item = re.sub('[A-Za-z0-9]', '', self.get_text())
        text_split = list(jieba.cut(item, cut_all=False))
        stopwords = set(self.get_stopwords())
        text_split_del_stopwords = [w for w in text_split if w not in stopwords]
        return ' '.join(text_split_del_stopwords)

    # POS-tag the tokens
    def words_pos(self):
        words = pseg.cut(self.words_token())
        # drop the whitespace tokens introduced by joining with spaces
        return [w for w in words if w.word.strip()]

    # Count word frequencies
    def words_frequence(self):
        words_dict = {}
        for item in self.words_pos():
            i = str(item)  # a jieba pair prints as 'word/flag'
            words_dict.setdefault(i, 0)
            words_dict[i] += 1
        # sort by frequency, descending
        return sorted(words_dict.items(), key=lambda x: x[1], reverse=True)

    # Return the result as a DataFrame
    def output_result(self):
        result = ['{}/{}'.format(k, v) for k, v in self.words_frequence()]
        df = pd.DataFrame(result)
        df['词'] = df[0].apply(lambda x: x.split('/')[0])
        df['词性'] = df[0].apply(lambda x: x.split('/')[1])
        df['个数'] = df[0].apply(lambda x: x.split('/')[2])
        return df.drop([0], axis=1)

r = WordsCounter('人民的名义节选.txt', 'stopword.txt')
print(r.output_result())
The ten most frequent words:
![](https://img.laitimes.com/img/9ZDMuAjOiMmIsIjOiQnIsIyZuBnL5EDN5IjMyATMwEDMxkTMwIzLc52YucWbp5GZzNmLn9Gbi1yZtl2Lc9CX6MHc0RHaiojIsJye.png)
POS tagging in English
from nltk.tokenize import word_tokenize
from nltk import pos_tag
import nltk
import re

text = 'Love thy neighbor as thyself. We all love to talk about ourselves.'
text_re = re.sub("[^a-zA-Z]", " ", text.lower())
text_token = word_tokenize(text_re)
text_token_pos = pos_tag(text_token)

text_str = []
for item in text_token_pos:
    # ('love', 'VB') -> 'love/VB'
    text_str.append(nltk.tag.util.tuple2str(item))

words_dict = {}
for item in text_str:
    words_dict.setdefault(item, 0)
    words_dict[item] += 1
words_dict
{'love/VB': 1,
'thy/JJ': 1,
'neighbor/NN': 1,
'as/IN': 1,
'thyself/NN': 1,
'we/PRP': 1,
'all/DT': 1,
'love/VBP': 1,
'to/TO': 1,
'talk/VB': 1,
'about/IN': 1,
'ourselves/PRP': 1}
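The setdefault counting loop can also be written with collections.Counter. As a sketch on a few hand-written 'word/tag' strings: the tag keeps the two uses of love distinct, while counting the bare words merges them, which is exactly the disambiguation benefit described earlier.

```python
from collections import Counter

# Counting pre-tagged 'word/tag' strings versus bare words.
tagged = ['love/VB', 'thy/JJ', 'love/VBP']

print(Counter(tagged)['love/VB'])                        # 1 (the verb base form)
print(Counter(t.split('/')[0] for t in tagged)['love'])  # 2 (tags discarded)
```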