
Using the Stanford CoreNLP Java API for English and Chinese

       Stanford NLP is a rather powerful natural language processing toolkit. Many of its models are trained with deep learning methods, and its accuracy is a big improvement over many earlier tools; in recent years quite a few open source projects have adopted some of its methods as well.

       I recently picked this toolkit up again for some semantic analysis work, but found that Chinese-language material on it is scarce and getting started is difficult, so I am writing up my own usage notes in the hope that they will help anyone who needs them.

       This article focuses on how to call the Stanford NLP API from a Java project.

1. Environment Setup

       Eclipse or IDEA, JDK 1.8, and Apache Maven. (Note: CoreNLP 3.5 and later require Java 8 to run; if you do not want to run on Java 8, use an earlier version.)

       Create a new Maven project and add the following to its pom file:

<properties>
    <corenlp.version>3.6.0</corenlp.version>
</properties>

<dependencies>
    <dependency>
        <groupId>edu.stanford.nlp</groupId>
        <artifactId>stanford-corenlp</artifactId>
        <version>${corenlp.version}</version>
    </dependency>

    <dependency>
        <groupId>edu.stanford.nlp</groupId>
        <artifactId>stanford-corenlp</artifactId>
        <version>${corenlp.version}</version>
        <classifier>models</classifier>
    </dependency>

    <dependency>
        <groupId>edu.stanford.nlp</groupId>
        <artifactId>stanford-corenlp</artifactId>
        <version>${corenlp.version}</version>
        <classifier>models-chinese</classifier>
    </dependency>
</dependencies>

        The three dependencies are the CoreNLP algorithm jar, the English models jar, and the Chinese models jar. Because Maven's default repositories are hosted overseas and the Stanford NLP model files are large, downloads put heavy demands on the network; on a slow connection they easily time out and fail. The fix is to find a domestic mirror that carries the Stanford NLP artifacts and edit the mirror entry in Maven's settings.xml.
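As an illustration, the mirror entry in ~/.m2/settings.xml looks like the sketch below. The Aliyun public repository is used here only as an example; substitute whichever mirror actually carries the CoreNLP model jars.

<settings>
  <mirrors>
    <!-- Example only: the Aliyun public mirror; use any mirror that
         actually carries the stanford-corenlp model artifacts. -->
    <mirror>
      <id>aliyun</id>
      <mirrorOf>central</mirrorOf>
      <name>Aliyun public mirror</name>
      <url>https://maven.aliyun.com/repository/public</url>
    </mirror>
  </mirrors>
</settings>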

2. Processing English Text

         For English, the official site already provides sample code; here I have merely pulled it together into one class. The code is as follows:

package edu.zju.cst.krselee.examples.english;

import edu.stanford.nlp.dcoref.CorefChain;
import edu.stanford.nlp.dcoref.CorefCoreAnnotations;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.TreeCoreAnnotations;
import edu.stanford.nlp.util.CoreMap;

import java.util.List;
import java.util.Map;
import java.util.Properties;

/**
 * Created by KrseLee on 2016/11/5.
 */
public class StanfordEnglishNlpExample {

    public static void main(String[] args) {

        StanfordEnglishNlpExample example = new StanfordEnglishNlpExample();

        example.runAllAnnotators();

    }

    public void runAllAnnotators(){
        // creates a StanfordCoreNLP object, with POS tagging, lemmatization, NER, parsing, and coreference resolution
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        // read some text in the text variable
        String text = "this is a simple text"; // Add your text here!

        // create an empty Annotation just with the given text
        Annotation document = new Annotation(text);

        // run all Annotators on this text
        pipeline.annotate(document);

        parserOutput(document);
    }

    public void parserOutput(Annotation document){
        // these are all the sentences in this document
        // a CoreMap is essentially a Map that uses class objects as keys and has values with custom types
        List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);

        for(CoreMap sentence: sentences) {
            // traversing the words in the current sentence
            // a CoreLabel is a CoreMap with additional token-specific methods
            for (CoreLabel token: sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                // this is the text of the token
                String word = token.get(CoreAnnotations.TextAnnotation.class);
                // this is the POS tag of the token
                String pos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);
                // this is the NER label of the token
                String ne = token.get(CoreAnnotations.NamedEntityTagAnnotation.class);

                System.out.println(word + "\t" + pos + "\t" + ne);
            }

            // this is the parse tree of the current sentence
            Tree tree = sentence.get(TreeCoreAnnotations.TreeAnnotation.class);
            System.out.println("语法树:");
            System.out.println(tree.toString());

            // this is the Stanford dependency graph of the current sentence
            SemanticGraph dependencies = sentence.get(SemanticGraphCoreAnnotations.CollapsedCCProcessedDependenciesAnnotation.class);
            System.out.println("依存句法:");
            System.out.println(dependencies.toString());
        }

        // This is the coreference link graph
        // Each chain stores a set of mentions that link to each other,
        // along with a method for getting the most representative mention
        // Both sentence and token offsets start at 1!
        Map<Integer, CorefChain> graph =
                document.get(CorefCoreAnnotations.CorefChainAnnotation.class);
    }
}
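The coreference chains retrieved at the end of parserOutput are never printed. As a minimal sketch of how one might inspect them (using only the dcoref CorefChain API the example already imports), the following could be appended right after the graph lookup:

        // Minimal sketch: print each chain's representative mention followed by
        // every mention in the chain, in textual order (dcoref CorefChain API).
        for (Map.Entry<Integer, CorefChain> entry : graph.entrySet()) {
            CorefChain chain = entry.getValue();
            System.out.println(chain.getRepresentativeMention() + " <- "
                    + chain.getMentionsInTextualOrder());
        }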

     It is worth noting that Stanford NLP uses a pipeline design: the user sets parameters through a single configuration interface, and everything after that is fully encapsulated, which makes the toolkit very convenient to use. All results are stored in a single Annotation object and retrieved from it as needed. The paper "The Stanford CoreNLP Natural Language Processing Toolkit" (http://nlp.stanford.edu/pubs/StanfordCoreNlp2014.pdf) describes the pipeline approach in detail, so I will not repeat it here.

One point worth mentioning about the annotators parameter: later annotators generally depend on earlier ones (intuitively, POS tagging depends on tokenization, parsing depends on POS tagging, and so on), so they must be listed in dependency order.
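For example, a lighter pipeline that stops at POS tagging only needs the annotators up to pos, listed in order. A minimal sketch, using the same properties convention as the full example above:

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

import java.util.Properties;

public class PosOnlyExample {

    public static void main(String[] args) {
        // "pos" depends on "tokenize" and "ssplit", so all three are listed, in order.
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation document = new Annotation("Stanford NLP ships as a pipeline.");
        pipeline.annotate(document);

        // Print each token with its POS tag.
        for (CoreLabel token : document.get(CoreAnnotations.TokensAnnotation.class)) {
            System.out.println(token.word() + "\t" + token.tag());
        }
    }
}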

3. Processing Chinese Text

The Chinese models jar ships with a default configuration file, StanfordCoreNLP-chinese.properties, whose contents are as follows:
# Pipeline options - lemma is no-op for Chinese but currently needed because coref demands it (bad old requirements system)
annotators = segment, ssplit, pos, lemma, ner, parse, mention, coref

# segment
customAnnotatorClass.segment = edu.stanford.nlp.pipeline.ChineseSegmenterAnnotator

segment.model = edu/stanford/nlp/models/segmenter/chinese/ctb.gz
segment.sighanCorporaDict = edu/stanford/nlp/models/segmenter/chinese
segment.serDictionary = edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz
segment.sighanPostProcessing = true

# sentence split
ssplit.boundaryTokenRegex = [.]|[!?]+|[。]|[!?]+

# pos
pos.model = edu/stanford/nlp/models/pos-tagger/chinese-distsim/chinese-distsim.tagger

# ner
ner.model = edu/stanford/nlp/models/ner/chinese.misc.distsim.crf.ser.gz
ner.applyNumericClassifiers = false
ner.useSUTime = false

# parse
parse.model = edu/stanford/nlp/models/lexparser/chineseFactored.ser.gz

# coref
coref.sieves = ChineseHeadMatch, ExactStringMatch, PreciseConstructs, StrictHeadMatch1, StrictHeadMatch2, StrictHeadMatch3, StrictHeadMatch4, PronounMatch
coref.input.type = raw
coref.postprocessing = true
coref.calculateFeatureImportance = false
coref.useConstituencyTree = true
coref.useSemantics = false
coref.md.type = RULE
coref.mode = hybrid
coref.path.word2vec =
coref.language = zh
coref.print.md.log = false
coref.defaultPronounAgreement = true
coref.zh.dict = edu/stanford/nlp/models/dcoref/zh-attributes.txt.gz
This file mainly specifies the steps of the pipeline and the locations of the corresponding model files. In practice you may not need all the steps, or you may want to use different models, so you can write a custom configuration file and load it from your code, as sketched below.
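A minimal sketch of loading such a custom configuration (the file name my-chinese.properties and its location are hypothetical; the constructor also accepts a Properties object you fill yourself):

import edu.stanford.nlp.pipeline.StanfordCoreNLP;

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

public class CustomConfigExample {

    public static void main(String[] args) throws IOException {
        // 1) Load by name from the classpath, as the example below does.
        //    ("my-chinese.properties" is a hypothetical file name.)
        StanfordCoreNLP fromClasspath = new StanfordCoreNLP("my-chinese.properties");

        // 2) Or fill a Properties object yourself and pass it in.
        Properties props = new Properties();
        try (InputStream in = new FileInputStream("src/main/resources/my-chinese.properties")) {
            props.load(in);
        }
        StanfordCoreNLP fromProps = new StanfordCoreNLP(props);
    }
}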
The main Java program is as follows:
package edu.zju.cst.krselee.examples.chinese;

import edu.stanford.nlp.dcoref.CorefChain;
import edu.stanford.nlp.dcoref.CorefCoreAnnotations;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.TreeCoreAnnotations;
import edu.stanford.nlp.util.CoreMap;

import java.util.List;
import java.util.Map;
import java.util.Properties;

/**
 * Created by KrseLee on 2016/11/4.
 */
public class StanfordChineseNlpExample {


    public static void main(String[] args) {

        StanfordChineseNlpExample example = new StanfordChineseNlpExample();

        example.runChineseAnnotators();

    }

    public void runChineseAnnotators(){

        String text = "克林顿说,华盛顿将逐步落实对韩国的经济援助。"
                + "金大中对克林顿的讲话报以掌声:克林顿总统在会谈中重申,他坚定地支持韩国摆脱经济危机。";
        Annotation document = new Annotation(text);
        StanfordCoreNLP corenlp = new StanfordCoreNLP("StanfordCoreNLP-chinese.properties");
        corenlp.annotate(document);
        parserOutput(document);
    }

    public void parserOutput(Annotation document){
        // these are all the sentences in this document
        // a CoreMap is essentially a Map that uses class objects as keys and has values with custom types
        List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);

        for(CoreMap sentence: sentences) {
            // traversing the words in the current sentence
            // a CoreLabel is a CoreMap with additional token-specific methods
            for (CoreLabel token: sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                // this is the text of the token
                String word = token.get(CoreAnnotations.TextAnnotation.class);
                // this is the POS tag of the token
                String pos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);
                // this is the NER label of the token
                String ne = token.get(CoreAnnotations.NamedEntityTagAnnotation.class);

                System.out.println(word+"\t"+pos+"\t"+ne);
            }

            // this is the parse tree of the current sentence
            Tree tree = sentence.get(TreeCoreAnnotations.TreeAnnotation.class);
            System.out.println("语法树:");
            System.out.println(tree.toString());

            // this is the Stanford dependency graph of the current sentence
            SemanticGraph dependencies = sentence.get(SemanticGraphCoreAnnotations.CollapsedCCProcessedDependenciesAnnotation.class);
            System.out.println("依存句法:");
            System.out.println(dependencies.toString());
        }

        // This is the coreference link graph
        // Each chain stores a set of mentions that link to each other,
        // along with a method for getting the most representative mention
        // Both sentence and token offsets start at 1!
        Map<Integer, CorefChain> graph =
                document.get(CorefCoreAnnotations.CorefChainAnnotation.class);
    }
}

References:

[1] http://stanfordnlp.github.io/CoreNLP/index.html

[2] https://blog.sectong.com/blog/corenlp_segment.html
