Part-of-Speech Tagging
Part-of-speech tagging (POS tagging), also called grammatical tagging or word-category disambiguation, is a text-processing technique in corpus linguistics that labels each word in a corpus with its part of speech according to its meaning and context (source: Baidu Baike).
What POS tagging is good for
1. Disambiguation: the same word can play different grammatical roles with different meanings, e.g. love:
"I love the way she sings that song."
"Where there is great love, there are always miracles."
2. Strengthening word-based features: a machine learning model can extract information from many aspects of a word, but a word that already carries a POS tag provides a more precise feature. For example:
Sentence: 'Love thy neighbor as thyself. We all love to talk about ourselves.'
Word frequencies with POS tags:
{'love/VB': 1,
'thy/JJ': 1,
'neighbor/NN': 1,
'as/IN': 1,
'thyself/NN': 1,
'we/PRP': 1,
'all/DT': 1,
'love/VBP': 1,
'to/TO': 1,
'talk/VB': 1,
'about/IN': 1,
'ourselves/PRP': 1}
Without POS tags, the two occurrences of "love" are treated as the same word, with a frequency of 2:
{'love': 2,
'thy': 1,
'neighbor': 1,
'as': 1,
'thyself': 1,
'we': 1,
'all': 1,
'to': 1,
'talk': 1,
'about': 1,
'ourselves': 1}
3. Normalization and lemmatization: POS tagging is one of the foundational steps for lemmatization, helping reduce each word to its base form.
4. Effective stop-word removal: POS tags make it easy to remove stop words, for example by dropping determiners and prepositions wholesale.
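A small sketch of point 3: WordNet-based lemmatizers (such as NLTK's WordNetLemmatizer) expect a coarse POS letter ('n', 'v', 'a', 'r') rather than a Penn Treebank tag like VBP, so a mapping function is the usual glue between the tagger and the lemmatizer. The helper name penn_to_wordnet below is our own, not an NLTK API; only the tag prefixes come from the Penn tag set.

```python
# Map a Penn Treebank tag (as returned by nltk.pos_tag) to the coarse
# POS letter that WordNet-based lemmatizers expect.
# Note: penn_to_wordnet is a hypothetical helper name, not an NLTK API.
def penn_to_wordnet(tag):
    if tag.startswith('J'):
        return 'a'   # adjective
    if tag.startswith('V'):
        return 'v'   # verb
    if tag.startswith('R'):
        return 'r'   # adverb
    return 'n'       # default to noun

print(penn_to_wordnet('VBP'))  # 'v', so "loves" lemmatizes to "love"
```

With this mapping, lemmatizer.lemmatize(word, penn_to_wordnet(tag)) reduces "loves/VBZ" to "love" instead of leaving it untouched.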
Applications of POS tagging
1. Preprocessing for syntactic parsing
2. Preprocessing for lexical acquisition
3. Preprocessing for information extraction
Steps for Chinese POS tagging:
- Read the text
- Read the stop-word list
- Tokenize and remove stop words
- POS-tag the tokens
- Count word frequencies
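The steps above can be sketched end to end with the standard library alone; tokenization is stubbed here with a pre-split token list so the sketch stays library-free (with real Chinese text you would call jieba.lcut instead). The function name count_words is our own.

```python
from collections import Counter

# Library-free sketch of the pipeline: tokens in, stop words out,
# frequencies counted. Tokenization itself is assumed already done
# (e.g. by jieba for Chinese).
def count_words(tokens, stopwords):
    kept = [t for t in tokens if t not in stopwords]  # remove stop words
    return Counter(kept)                              # count frequencies

tokens = ['我', '爱', '自然', '语言', '处理', '的', '自然']
freq = count_words(tokens, stopwords={'我', '的'})
print(freq.most_common(1))  # [('自然', 2)]
```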
# Chinese text: POS tagging and word-frequency counting
import re
import jieba
import jieba.posseg as pseg
import pandas as pd

class WordsCounter:
    def __init__(self, filepath, path):
        self.filepath = filepath  # path to the text file
        self.path = path          # path to the stop-word list

    # Read the text
    def get_text(self):
        text = ''
        with open(self.filepath, 'r', encoding='utf-8') as file:
            for line in file:
                if line != '\n':
                    text += line.strip()
        return text

    # Read the stop words
    def get_stopwords(self):
        with open(self.path, 'r', encoding='utf-8') as file:
            return [w.strip() for w in file]

    # Tokenize and remove stop words
    def words_token(self):
        # jieba handles some person and place names poorly,
        # but we can teach it these words, e.g.:
        for name in ('沙瑞金', '易学习', '王大路', '欧阳菁', '高育良', '李达康',
                     '侯亮平', '赵东来', '京州市', '毛娅', '陈海', '丁义珍',
                     '赵德汉', '祁同伟', '陆亦可', '陈岩石', '郑西坡', '陈清泉',
                     '蔡成功', '孙连城', '侦察处', '高小琴'):
            jieba.suggest_freq(name, True)
        item = re.sub('[A-Za-z0-9]', '', self.get_text())
        text_split = list(jieba.cut(item, cut_all=False))
        stopwords = set(self.get_stopwords())
        text_split_del_stopwords = [w for w in text_split if w not in stopwords]
        return ' '.join(text_split_del_stopwords)

    # POS-tag the tokens
    def words_pos(self):
        words = pseg.cut(self.words_token())
        # drop the whitespace tokens introduced by joining with spaces
        return [w for w in words if w.word.strip()]

    # Count word frequencies
    def words_frequence(self):
        words_dict = {}
        for item in self.words_pos():
            i = str(item)  # a jieba pair prints as 'word/flag'
            words_dict.setdefault(i, 0)
            words_dict[i] += 1
        # sort by frequency, descending
        return sorted(words_dict.items(), key=lambda x: x[1], reverse=True)

    # Return the result as a DataFrame
    def output_result(self):
        result = ['{}/{}'.format(k, v) for k, v in self.words_frequence()]
        df = pd.DataFrame(result)
        df['词'] = df[0].apply(lambda x: x.split('/')[0])
        df['词性'] = df[0].apply(lambda x: x.split('/')[1])
        df['个数'] = df[0].apply(lambda x: x.split('/')[2])
        return df.drop([0], axis=1)

r = WordsCounter('人民的名义节选.txt', 'stopword.txt')
print(r.output_result())
The ten most frequent words:
![](https://img.laitimes.com/img/9ZDMuAjOiMmIsIjOiQnIsIyZuBnL5EDN5IjMyATMwEDMxkTMwIzLc52YucWbp5GZzNmLn9Gbi1yZtl2Lc9CX6MHc0RHaiojIsJye.png)
POS tagging in English
from nltk.tokenize import word_tokenize
from nltk import pos_tag
import nltk
import re

text = 'Love thy neighbor as thyself. We all love to talk about ourselves.'
text_re = re.sub("[^a-zA-Z]", " ", text.lower())
text_token = word_tokenize(text_re)
text_token_pos = pos_tag(text_token)

text_str = []
for item in text_token_pos:
    # ('love', 'VB') -> 'love/VB'
    text_str.append(nltk.tag.util.tuple2str(item))

words_dict = {}
for item in text_str:
    words_dict.setdefault(item, 0)
    words_dict[item] += 1
words_dict
{'love/VB': 1,
'thy/JJ': 1,
'neighbor/NN': 1,
'as/IN': 1,
'thyself/NN': 1,
'we/PRP': 1,
'all/DT': 1,
'love/VBP': 1,
'to/TO': 1,
'talk/VB': 1,
'about/IN': 1,
'ourselves/PRP': 1}
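The setdefault counting loop can also be written with collections.Counter. As a sketch on a few hand-written 'word/tag' strings: the tag keeps the two uses of love distinct, while counting the bare words merges them, which is exactly the disambiguation benefit described earlier.

```python
from collections import Counter

# Counting pre-tagged 'word/tag' strings versus bare words.
tagged = ['love/VB', 'thy/JJ', 'love/VBP']

print(Counter(tagged)['love/VB'])                        # 1 (the verb base form)
print(Counter(t.split('/')[0] for t in tagged)['love'])  # 2 (tags discarded)
```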