Forcing custom tokenization when using Stanford NLP

2023-05-10 12:58:24

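This article applies to the following situation:

1. You use Stanford NLP for more than word segmentation, for example for constituency parsing or dependency parsing;

2. You are not satisfied with the default segmentation and want to enforce a custom dictionary.

I. Basic usage of Stanford NLP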

// build pipeline
StanfordCoreNLP pipeline = new StanfordCoreNLP(
	PropertiesUtils.asProperties(
		"annotators", "tokenize,ssplit,pos,lemma,parse,natlog",
		"ssplit.isOneSentence", "true",
		"parse.model", "edu/stanford/nlp/models/srparser/englishSR.ser.gz",
		"tokenize.language", "en"));

// read some text in the text variable
String text = ... // Add your text here!
Annotation document = new Annotation(text);

// run all Annotators on this text
pipeline.annotate(document);

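See the official documentation: https://stanfordnlp.github.io/CoreNLP/api.html

II. Adding a custom dictionary

Set the property (the entries are separated by an ASCII comma, not a full-width one):

segment.serDictionary = edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz,yourDictionaryFile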

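The custom dictionary file contains one word per line.

However, after the custom dictionary is added, the program does not segment strictly according to it; the dictionary only serves as a hint during segmentation.

Stanford NLP does not provide a way to force a particular segmentation.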
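
As a minimal sketch of how the property above could be built in code, the following uses only java.util.Properties. The class name SegmenterDictionaryConfig and the helper withCustomDictionary are hypothetical names chosen for illustration; the default dictionary path and the placeholder yourDictionaryFile are taken from the text above.

```java
import java.util.Properties;

public class SegmenterDictionaryConfig {
    // Default dictionary path, copied from the property shown in the text.
    static final String DEFAULT_DICT =
        "edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz";

    // Hypothetical helper: lists the default dictionary first, then the custom one.
    static Properties withCustomDictionary(String customDictPath) {
        Properties props = new Properties();
        // The two paths are joined with an ASCII comma, not a full-width one.
        props.setProperty("segment.serDictionary", DEFAULT_DICT + "," + customDictPath);
        return props;
    }

    public static void main(String[] args) {
        // "yourDictionaryFile" is the placeholder name used in the text.
        Properties props = withCustomDictionary("yourDictionaryFile");
        System.out.println(props.getProperty("segment.serDictionary"));
    }
}
```

The resulting Properties object would still need the usual annotator and model settings before being passed to a pipeline.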

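III. Forcing custom tokenization

3.1 How annotate() works

This method performs every step defined in the configuration (for example tokenize, ssplit, pos, lemma, parse).

Internally, it calls annotate() on the annotator for each step, one after another.

All results are stored in the Annotation object as key-value pairs.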
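
The idea can be illustrated with a toy model. This is a simplified sketch, not CoreNLP's actual implementation: ToyPipeline, ToyAnnotator, TOKENIZE and COUNT are hypothetical names, and the shared document is a plain Map keyed by strings rather than by classes.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ToyPipeline {
    // Hypothetical stand-in for an annotator: reads and writes a shared document map.
    interface ToyAnnotator {
        void annotate(Map<String, Object> doc);
    }

    // Splits the "text" entry on spaces and stores the pieces under "tokens".
    static final ToyAnnotator TOKENIZE = doc ->
        doc.put("tokens", Arrays.asList(((String) doc.get("text")).split(" ")));

    // Depends on the previous step: reads "tokens" and stores how many there are.
    static final ToyAnnotator COUNT = doc ->
        doc.put("count", ((List<?>) doc.get("tokens")).size());

    // Calls each annotator in the configured order on the same document.
    static void annotate(List<ToyAnnotator> annotators, Map<String, Object> doc) {
        for (ToyAnnotator annotator : annotators) {
            annotator.annotate(doc);
        }
    }

    // Builds a document from raw text and runs both toy annotators on it.
    static Map<String, Object> run(String text) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("text", text);
        annotate(Arrays.asList(TOKENIZE, COUNT), doc);
        return doc;
    }

    public static void main(String[] args) {
        System.out.println(run("a b c"));
    }
}
```

Because every step only reads and writes the shared document, a result can be modified between two steps, which is the opening that the approach in this article relies on.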

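3.2 Calling the annotators manually, one by one

The idea is to call the required annotators by hand and to modify the tokenizer's result right after tokenizerAnnotator has run.

The difficulties are:

The modified result must be valid, otherwise the annotators that run afterwards cannot interpret it;
Finding the correct Annotator classes;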
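
What "valid" means is not spelled out by the library in one place, so the following is only a sketch under stated assumptions: it treats a token list as consistent when character offsets are non-empty and do not overlap, and when each token's index fields match its position in the list (0-based, with the end index one past the begin index). TokenCheck and Tok are hypothetical names; Tok is a plain stand-in, not CoreLabel.

```java
import java.util.Arrays;
import java.util.List;

public class TokenCheck {
    // Hypothetical stand-in for the fields of a token that later steps rely on.
    static final class Tok {
        final int begin, end;           // character offsets in the text
        final int tokenBegin, tokenEnd; // index of this token and of the next one

        Tok(int begin, int end, int tokenBegin, int tokenEnd) {
            this.begin = begin;
            this.end = end;
            this.tokenBegin = tokenBegin;
            this.tokenEnd = tokenEnd;
        }
    }

    // Assumed consistency rules: offsets increase without overlap, indices match list order.
    static boolean isConsistent(List<Tok> toks) {
        int prevEnd = 0;
        for (int i = 0; i < toks.size(); i++) {
            Tok t = toks.get(i);
            if (t.begin < prevEnd || t.end <= t.begin) return false;
            if (t.tokenBegin != i || t.tokenEnd != i + 1) return false;
            prevEnd = t.end;
        }
        return true;
    }

    public static void main(String[] args) {
        List<Tok> toks = Arrays.asList(new Tok(0, 2, 0, 1), new Tok(2, 5, 1, 2));
        System.out.println(isConsistent(toks));
    }
}
```

Running a check like this after editing the token list can catch stale indices before the later annotators see them.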

The following code can be used in place of annotate():

Properties properties = ...
TokenizerAnnotator tokenizerAnnotator = new TokenizerAnnotator(properties);
tokenizerAnnotator.annotate(annotation);

// Insert the forced-tokenization changes to annotation here

properties = ...
WordsToSentencesAnnotator sentencesAnnotator = new WordsToSentencesAnnotator(properties);
sentencesAnnotator.annotate(annotation);

String taggerPath = "edu/stanford/nlp/models/pos-tagger/chinese-distsim/chinese-distsim.tagger";
MaxentTagger tagger = new MaxentTagger(taggerPath);
POSTaggerAnnotator taggerAnnotator = new POSTaggerAnnotator(tagger);
taggerAnnotator.annotate(annotation);

// XXXAnnotator
// ......

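Official documentation on annotators: https://stanfordnlp.github.io/CoreNLP/annotators.html

3.3 Manually modifying the tokenization result stored in the Annotation

First, you need to understand the structure of the Annotation object. It is essentially a Map<Class,Object>; the details are not covered here.

The result of each annotator is one key-value pair in the Annotation.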

// Get the tokenization result, which is what we will modify
List<CoreLabel> tokens = annotation.get(edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation.class);

Next, you need to understand the CoreLabel class, which is also a Map<Class,Object>.

// e.g. replace the token at position i with a new CoreLabel
CoreLabel newLabel = CoreLabel.wordFromString("...");
newLabel.setBeginPosition(startIdx); // start offset of the new token in text
newLabel.setEndPosition(endIdx); // end offset of the new token in text
newLabel.set(edu.stanford.nlp.ling.CoreAnnotations.TokenBeginAnnotation.class, i); // index of the new token
newLabel.set(edu.stanford.nlp.ling.CoreAnnotations.TokenEndAnnotation.class, i + 1); // index of the token after it
newLabel.set(edu.stanford.nlp.ling.CoreAnnotations.IsNewlineAnnotation.class, false);

// Replace in place so the new token stays at position i
// (remove(i) followed by add(...) would append it to the end of the list)
tokens.set(i, newLabel);

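After tokenizerAnnotator has run, a CoreLabel object should carry the following attributes:

the token's start position within the whole sentence
its end position
its position in the List, i.e. which CoreLabel it is
the index of the CoreLabel that follows it
isNewline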
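
The forced-tokenization step itself is left open above. One possible approach, sketched here under stated assumptions, is a greedy longest match that joins adjacent default tokens whenever their concatenation is a dictionary word, then renumbers the result. ForcedMerge and Tok are hypothetical names, Tok is a plain stand-in for CoreLabel, and the sketch only merges whole tokens; it does not split a token whose boundaries disagree with the dictionary.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ForcedMerge {
    // Hypothetical stand-in for a token: surface form, character offsets, list index.
    static final class Tok {
        final String word;
        final int begin, end;
        int index;

        Tok(String word, int begin, int end) {
            this.word = word;
            this.begin = begin;
            this.end = end;
        }
    }

    // Greedy longest match: from each position, extend over adjacent tokens and
    // keep the longest concatenation that appears in the dictionary.
    static List<Tok> merge(List<Tok> toks, Set<String> dict) {
        List<Tok> out = new ArrayList<>();
        int i = 0;
        while (i < toks.size()) {
            int bestEnd = i;
            String bestWord = toks.get(i).word;
            StringBuilder sb = new StringBuilder(bestWord);
            for (int j = i + 1; j < toks.size(); j++) {
                // Only join tokens that touch each other in the text.
                if (toks.get(j).begin != toks.get(j - 1).end) break;
                sb.append(toks.get(j).word);
                if (dict.contains(sb.toString())) {
                    bestEnd = j;
                    bestWord = sb.toString();
                }
            }
            Tok merged = new Tok(bestWord, toks.get(i).begin, toks.get(bestEnd).end);
            merged.index = out.size(); // renumber so indices match the new list
            out.add(merged);
            i = bestEnd + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        List<Tok> toks = Arrays.asList(
            new Tok("ice", 0, 3), new Tok("cream", 3, 8), new Tok("shop", 8, 12));
        Set<String> dict = new HashSet<>(Arrays.asList("icecream"));
        for (Tok t : merge(toks, dict)) {
            System.out.println(t.word + " [" + t.begin + "," + t.end + ") #" + t.index);
        }
    }
}
```

With real CoreNLP objects, each merged Tok would correspond to a CoreLabel populated with the attributes listed above, and the new list would replace the value stored under TokensAnnotation.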
