
[Natural Language Processing] A Usage Guide to the Natural Language Processing Library spaCy

Author: Pemba duck

spaCy is an open-source natural language processing library written in Python. Built on recent research in natural language processing, spaCy provides a set of efficient and easy-to-use tools for tasks such as text preprocessing, text parsing, named entity recognition, part-of-speech tagging, syntactic analysis, and text classification. The official repository of spaCy is spaCy-github. This article is based mainly on the documentation of the official website, which is at spaCy.

1 Background introduction and spaCy installation

1.1 Introduction to Natural Language Processing

Natural Language Processing (NLP) is a field that studies the interaction between human language and computers, aiming to enable computers to understand, parse, generate, and process human language. NLP combines knowledge from computer science, artificial intelligence, and linguistics to process and analyze textual data through a variety of algorithms and techniques. In recent years, with the development of deep learning, neural network models have made major breakthroughs in NLP. Among them, recurrent neural networks (RNNs), long short-term memory networks (LSTMs), and Transformer models all play key roles. These models bring better performance to NLP tasks and have promoted the development and application of the field.

The main knowledge structure of NLP is shown in the figure below (image from the Internet).

[Figure: the main knowledge structure of NLP]

NLP has a wide range of applications, covering multiple fields such as machine translation, information extraction, text classification, sentiment analysis, automatic summarization, question answering systems, speech recognition, and speech synthesis. The following techniques and methods are commonly used in NLP:

  • Tokenization: Splitting continuous text into meaningful words or tokens; this is the foundation of many NLP tasks.
  • Part-of-speech tagging: Assign a corresponding part of speech to each word in the text, such as noun, verb, adjective, etc.
  • Syntactic analysis: Analyze the grammatical structure of a sentence and identify phrases, modifiers, and dependencies in the sentence.
  • Semantic analysis: Understand the meaning and semantic relationships of texts, including named entity recognition, semantic role labeling, and semantic parsing.
  • Machine translation: Automatically translate text from one language into another.
  • Text classification: Classify text into predefined categories, such as spam classification, sentiment classification, and so on.
  • Information extraction: Extract specific information from structured and unstructured text, such as entity relation extraction, event extraction, etc.
  • Question answering: Provide accurate answers or relevant information by understanding and answering questions raised by users.
  • Sentiment analysis: Identify and analyze emotional tendencies in text, such as positive, negative, or neutral sentiment.
  • Text generation: Use NLP techniques to generate natural language text, such as automatic summarization, dialogue systems, and machine writing.

Among the many natural language processing libraries, spaCy supports more than 73 languages and provides trained pipelines for 25 of them. The library offers a set of easy-to-use models and function interfaces, including tokenization, part-of-speech tagging, and other functionality. Users can also use frameworks such as PyTorch and TensorFlow to create custom models in spaCy to meet specific needs. For the language models supported by spaCy, see spaCy-models.

In fact, there are other open-source natural language processing libraries, such as HuggingFace's Transformers, PaddleNLP, and NLTK, that are more specialized or perform better than spaCy on certain tasks. However, for simple applications spaCy is a better fit thanks to its ease of use, comprehensive features, and large number of pretrained models for multiple languages. In addition, with the breakthrough success of large language models, represented by GPT-3, in natural language processing, traditional NLP libraries are often inferior to large language models in terms of accuracy. However, large language models require huge inference resources, so in scenarios where the highest accuracy is not required, lightweight NLP libraries such as spaCy remain a good choice.

1.2 spaCy Installation

spaCy is installed as a core package together with separate language model packages. One of spaCy's design goals is modularity and customizability: users install only the modules and language data they need, which keeps the installation small and reduces resource usage. To use a pretrained model, you install the corresponding model package through pip; a model package contains the trained weights and other necessary files, which are stored in a specific location at install time rather than as loose files. If you want to train models or run on a GPU, you need to select the corresponding installation options. Installing the core package together with language model packages simplifies configuration: users do not need to download and configure language data separately or manually specify which language model to use, which reduces the workload and the chance of errors during installation. However, the customizability is limited, so spaCy is well suited to simple use cases with modest accuracy requirements, while larger natural language processing libraries are more appropriate for heavyweight engineering applications.

To make this easy, spaCy provides a configurable installation command selection page; its address is spaCy-usage. The following figure shows the installation configuration used in this article, which is the simplest CPU-only inference mode.

[Figure: spaCy installation configuration options]
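
For reference, with the configuration shown above (CPU-only inference, plus the Chinese and English models used later in this article), the generated commands are roughly as follows; treat this as a sketch, since the exact commands depend on the options selected on that page.

pip install -U pip setuptools wheel
pip install -U spacy
python -m spacy download zh_core_web_sm
python -m spacy download en_core_web_sm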

After spaCy is installed, run the following code to determine whether spaCy and its corresponding language model are successfully installed.

# Suppress warnings in a Jupyter notebook environment
import warnings
warnings.filterwarnings("ignore")
import spacy
spacy.__version__
'3.6.0'
import spacy

# Load the installed Chinese model
nlp = spacy.load('zh_core_web_sm')

# Run some simple NLP tasks
doc = nlp("早上好!")
for token in doc:
    # token.text is the raw text of the token, token.pos_ is its part of speech,
    # and token.dep_ is its syntactic dependency relation to other tokens
    print(token.text, token.pos_, token.dep_)
早上 NOUN nmod:tmod
好 VERB ROOT
! PUNCT punct

2 spaCy Quick Start

This section and its images are primarily based on spaCy 101. The main functional modules provided by spaCy are listed in the table below.

Name | Description
Tokenization | Split text into words, punctuation marks, and so on.
Part-of-speech (POS) Tagging | Assign a part of speech, such as verb or noun, to each token.
Dependency Parsing | Assign syntactic dependency labels describing the relations between individual tokens, such as subject or object.
Lemmatization | Assign the base form of each word. For example, the base form of "was" is "be" and the base form of "rats" is "rat".
Sentence Boundary Detection (SBD) | Find and segment individual sentences.
Named Entity Recognition (NER) | Label named "real-world" objects, such as persons, companies, or places.
Entity Linking (EL) | Disambiguate text entities against unique identifiers in a knowledge base.
Similarity | Compare the similarity between words, text spans, and documents.
Text Classification | Assign categories or labels to a whole document or parts of a document.
Rule-based Matching | Find sequences of tokens based on their text and linguistic annotations, similar to regular expressions.
Training | Update and improve the predictions of statistical models.
Serialization | Save objects to files or byte strings.

2.1 Tokenization

During processing, spaCy first tokenizes the text, that is, splits it into tokens such as words and punctuation marks. This is achieved by applying rules specific to each language. A token is the smallest unit of natural language text; each token represents an atomic element of the text, usually a word or a punctuation mark.

import spacy

nlp = spacy.load("zh_core_web_sm")
# Process the text with the pipeline in one call
doc = nlp("南京长江大桥是金陵四十景之一!")
# Iterate over each token in the doc
for token in doc:
    print(token.text)           
南京
长江
大桥
是
金陵
四十
景
之一
!           
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    print(token.text)           
Apple
is
looking
at
buying
U.K.
startup
for
$
1
billion           

For Chinese tokenization, proper nouns are sometimes split apart; for example, 南京长江大桥 (Nanjing Yangtze River Bridge) is split into 南京, 长江, and 大桥. We can add a custom dictionary to solve this problem, but note that custom dictionary entries are only supported by certain language models.

import spacy

nlp = spacy.load("zh_core_web_sm")
# Add custom vocabulary entries
nlp.tokenizer.pkuseg_update_user_dict(["南京长江大桥","金陵四十景"])

doc = nlp("南京长江大桥是金陵四十景之一!")
for token in doc:
    print(token.text)           
南京长江大桥
是
金陵四十景
之一
!           

2.2 Part-of-speech tagging and dependency parsing

After tokenization, spaCy tags the part of speech of each word in the sentence and determines the grammatical relations between the words. Note that the exact set of labels depends on the model used. The sample code looks like this:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# token.text: the original form of the word.
# token.lemma_: the base form (lemma) of the word. For example, the lemma of "running" is "run".
# token.pos_: the coarse-grained part-of-speech tag, such as noun, verb, adjective, etc.
# token.tag_: the fine-grained part-of-speech tag, providing more grammatical information.
# token.dep_: the dependency role of the word in the sentence, such as subject or object.
# token.shape_: the shape of the word, e.g. capitalization and whether it contains punctuation.
# token.is_alpha: a boolean indicating whether the token consists entirely of letters.
# token.is_stop: a boolean indicating whether the token is a stop word (such as "the" or "is",
#   words that are very common in English but usually carry little information).
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)           
Apple Apple PROPN NNP nsubj Xxxxx True False
is be AUX VBZ aux xx True True
looking look VERB VBG ROOT xxxx True False
at at ADP IN prep xx True True
buying buy VERB VBG pcomp xxxx True False
U.K. U.K. PROPN NNP dobj X.X. False False
startup startup NOUN NN advcl xxxx True False
for for ADP IN prep xxx True True
$ $ SYM $ quantmod $ False False
1 1 NUM CD compound d False False
billion billion NUM CD pobj xxxx True False           

The code above uses pos_, which gives common coarse-grained part-of-speech tags. The fine-grained tags supported by tag_ and their explanations can be listed as follows:

# Get all fine-grained part-of-speech tags (tag_) and their descriptions
tag_descriptions = {tag: spacy.explain(tag) for tag in nlp.get_pipe('tagger').labels}
# Print the tags and their descriptions
# print("Part-of-speech tags (TAG) and their descriptions:")
# for tag, description in tag_descriptions.items():
#     print(f"{tag}: {description}")

The dependency labels supported by dep_ and their explanations can be listed as follows:

# Get all dependency labels and their descriptions
par_descriptions = {par: spacy.explain(par) for par in nlp.get_pipe('parser').labels}
# print("Dependency labels and their descriptions:")
# for par, description in par_descriptions.items():
#     print(f"{par}: {description}")

2.3 Named Entity Recognition

Named Entity Recognition (NER) is a basic task in natural language processing with a wide range of applications. NER refers to identifying entities with specific meaning or strong reference in text, typically including person names, place names, organization names, dates and times, proper nouns, etc. spaCy is used as follows:

import spacy

nlp = spacy.load("zh_core_web_sm")
# Add custom vocabulary entries
nlp.tokenizer.pkuseg_update_user_dict(["东方明珠"])

# Custom vocabulary entries may not be picked up by entity recognition.
doc = nlp("东方明珠是一座位于中国上海市的标志性建筑,建造于1991年,是一座高度为468米的电视塔。")
for ent in doc.ents:
# Entity text, start character offset, end character offset, entity label
    print(ent.text, ent.start_char, ent.end_char, ent.label_)           
中国上海市 9 14 GPE
1991年 24 29 DATE
468米 36 40 QUANTITY           
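
As noted in the comment above, words added to the tokenizer's user dictionary are not necessarily recognized as entities. If you need 东方明珠 itself to appear as an entity, one option is spaCy's rule-based entity_ruler component added before the statistical NER; a minimal sketch (the FAC label chosen here is an assumption for illustration):

import spacy

nlp = spacy.load("zh_core_web_sm")
# Add a rule-based entity ruler before the statistical NER component,
# so that its matches take precedence over the model's predictions
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([{"label": "FAC", "pattern": "东方明珠"}])

doc = nlp("东方明珠是一座位于中国上海市的标志性建筑,建造于1991年,是一座高度为468米的电视塔。")
for ent in doc.ents:
    print(ent.text, ent.label_)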

If you want to know all the entity labels and what they mean, you can execute the following code:

# Get the named entity labels and their meanings
entity_labels = nlp.get_pipe('ner').labels

# Print all named entity labels and their meanings
for label in entity_labels:
    print("{}: {}".format(label, spacy.explain(label)))
CARDINAL: Numerals that do not fall under another type
DATE: Absolute or relative dates or periods
EVENT: Named hurricanes, battles, wars, sports events, etc.
FAC: Buildings, airports, highways, bridges, etc.
GPE: Countries, cities, states
LANGUAGE: Any named language
LAW: Named documents made into laws.
LOC: Non-GPE locations, mountain ranges, bodies of water
MONEY: Monetary values, including unit
NORP: Nationalities or religious or political groups
ORDINAL: "first", "second", etc.
ORG: Companies, agencies, institutions, etc.
PERCENT: Percentage, including "%"
PERSON: People, including fictional
PRODUCT: Objects, vehicles, foods, etc. (not services)
QUANTITY: Measurements, as of weight or distance
TIME: Times smaller than a day
WORK_OF_ART: Titles of books, songs, etc.           

2.4 Word vectors and similarity

Word vectors are an important representation in natural language processing: they map words to real-valued vectors. This representation captures semantic relationships between words and turns semantic information into numerical form that a computer can process.

Traditional natural language processing methods often represent text as discrete symbols, such as one-hot encodings or bag-of-words models. However, this approach ignores the semantic similarity between words, and the dimensionality is very high, causing sparsity problems. In contrast, word vectors map each word into a continuous vector space, which better captures the semantic relationships between words and reduces the dimensionality of the feature space, making text processing more efficient. The similarity of word vectors can be measured by computing the distance or angle between two vectors.
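
A toy illustration of this difference (plain NumPy, not spaCy code): one-hot vectors of two related words are orthogonal, so their cosine similarity is 0 and no semantic closeness is captured, while dense vectors can place related words close together. The vectors below are made up for the example.

import numpy as np

vocab = ["cat", "dog", "car"]
one_hot_cat = np.array([1, 0, 0])
one_hot_dog = np.array([0, 1, 0])
print(np.dot(one_hot_cat, one_hot_dog))  # 0: one-hot encoding sees no similarity at all

# Hypothetical dense vectors where "cat" and "dog" are close
dense_cat = np.array([0.8, 0.1])
dense_dog = np.array([0.75, 0.2])
cos = np.dot(dense_cat, dense_dog) / (np.linalg.norm(dense_cat) * np.linalg.norm(dense_dog))
print(round(cos, 3))  # close to 1: the dense vectors capture the relatedness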

The code for extracting the word vector of each token in a sentence is as follows:

import spacy

# Load the Chinese model "zh_core_web_sm"
nlp = spacy.load("zh_core_web_sm")

# Tokenize and tag the given text
tokens = nlp("东方明珠是一座位于中国上海市的标志性建筑!")

# Iterate over each token
for token in tokens:
    # Print the token text, whether it has a vector, the vector norm,
    # and whether it is out-of-vocabulary (not in the word vector table)
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)
东方 True 11.572288 True
明珠 True 10.620552 True
是 True 12.337883 True
一 True 12.998204 True
座位 True 10.186406 True
于 True 13.540245 True
中国 True 12.459145 True
上海市 True 12.004954 True
的 True 12.90457 True
标志性 True 13.601862 True
建筑 True 10.46621 True
! True 12.811246 True           

If you want to get the vector of an entire sentence (or of a single word), the code is as follows:

# This vector is not normalized
tokens.vector.shape           
(96,)           

spaCy provides a similarity function to calculate the similarity of two text vectors. The sample code is as follows:

import spacy

# A small model is used here for convenience; a larger model is recommended
nlp = spacy.load("zh_core_web_sm")
doc1 = nlp("东方明珠是一座位于中国上海市的标志性建筑")
doc2 = nlp("南京长江大桥是金陵四十景之一!")

# Compute the similarity of the two texts
print(doc1, "<->", doc2, doc1.similarity(doc2))
东方明珠是一座位于中国上海市的标志性建筑 <-> 南京长江大桥是金陵四十景之一! 0.5743045135827821           

In the code above, the similarity calculation uses cosine similarity by default. Cosine similarity ranges from -1 to 1, with higher values indicating more similar vectors. In general, the accuracy of word vector similarity with spaCy's general-purpose models can be quite low; you can try a dedicated model or train one yourself, and spaCy recommends sense2vec for computing similarity.

In addition, if you only use spaCy to extract text vectors, you can compute the text similarity manually with NumPy, as follows:

import numpy as np
import spacy
nlp = spacy.load("zh_core_web_sm")  
doc1 = nlp("东方明珠是一座位于中国上海市的标志性建筑")
doc2 = nlp("南京长江大桥是金陵四十景之一!")
# Get the document vectors of doc1 and doc2
vec1 = doc1.vector
vec2 = doc2.vector

# Compute the similarity score with NumPy; np.linalg.norm(vec1) equals doc1.vector_norm
similarity_score = np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

print(doc1, "<->", doc2,similarity_score)           
东方明珠是一座位于中国上海市的标志性建筑 <-> 南京长江大桥是金陵四十景之一! 0.5743046           

3 spaCy architecture

3.1 The spaCy processing pipeline

When nlp is called on a text, spaCy first tokenizes the text to produce a Doc object. The Doc is then processed in several steps. A trained pipeline usually includes components such as a part-of-speech tagger, a dependency parser, and an entity recognizer. These components are independent of each other; each pipeline component returns the processed Doc, which is then passed to the next component. The resulting Doc object is a sequence of all the words and punctuation marks, each represented as a Token object. Each Token object carries information such as the text of the word itself, its part-of-speech tag, and its lemma. The following image illustrates the text-processing flow in spaCy.

[Figure: the spaCy processing pipeline]

As shown in the diagram above, the components in a model's pipeline depend mainly on how the model was structured and trained. The tokenizer is a special component that is independent of the others, because the other components all operate on the Doc that the tokenizer produces. The main built-in components are listed below; their use has been described in the previous chapter.

Name | Component | Creates | Description
tokenizer | Tokenizer | Doc | Split text into tokens.
tagger | Tagger | Token.tag | Assign part-of-speech tags.
parser | DependencyParser | Token.head, Token.dep, Doc.sents, Doc.noun_chunks | Assign dependency labels.
ner | EntityRecognizer | Doc.ents, Token.ent_iob, Token.ent_type | Detect and label named entities.
lemmatizer | Lemmatizer | Token.lemma | Assign the base forms of words.
textcat | TextCategorizer | Doc.cats | Assign document labels.
custom | Custom components | Doc._.xxx, Token._.xxx, Span._.xxx | Assign custom attributes, methods, or properties.

The text-processing components supported by a spaCy model can be viewed as follows:

import spacy

# Load the Chinese model "zh_core_web_sm"
nlp = spacy.load("zh_core_web_sm")
# View the supported components
nlp.pipe_names
['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'ner']           

You can control which components are loaded or enabled, which speeds up execution, as shown in the following code:

# Load the pipeline without the named entity recognizer (NER)
nlp = spacy.load("zh_core_web_sm", exclude=["ner"])
# View the supported components
nlp.pipe_names
['tok2vec', 'tagger', 'parser', 'attribute_ruler']
# Enable only the tagger component
nlp = spacy.load("zh_core_web_sm", enable=["tagger"])
nlp.pipe_names
['tagger']
# Load the part-of-speech tagger and dependency parser, but keep them disabled
nlp = spacy.load("zh_core_web_sm", disable=["tagger", "parser"])
# Disable another component
nlp.disable_pipe("ner")
nlp.pipe_names
['tok2vec', 'attribute_ruler']           
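
The "custom" row in the component table above refers to user-defined pipeline components. A minimal sketch of such a component (the name stats_logger is made up for this example) registers a plain function and appends it to the pipeline:

import spacy
from spacy.language import Language

# Register a stateless custom component under a (hypothetical) name
@Language.component("stats_logger")
def stats_logger(doc):
    print(f"The document contains {len(doc)} tokens")
    return doc

nlp = spacy.load("zh_core_web_sm")
# Append the custom component to the end of the pipeline
nlp.add_pipe("stats_logger", last=True)
print(nlp.pipe_names)
doc = nlp("南京长江大桥是金陵四十景之一!")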

3.2 spaCy core data structures

The central data structures in spaCy are the Language class, the Vocab, and the Doc object. The Language class is used to process text and turn it into Doc objects; it is usually stored in a variable named nlp. The Doc object owns the token sequence and all of its annotations. The Vocab centralizes strings, word vectors, and lexical attributes so that they can be shared across documents. These main classes and objects are described below:

[Figure: the spaCy architecture and its main classes]

The commonly used modules are described as follows:

Doc

Doc is an important object in spaCy. It represents a text document and contains all the information in the text, such as words, punctuation, parts of speech, and dependencies. A text can be processed with spaCy's Language object to obtain a Doc object.
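
Besides token-level attributes, a Doc also exposes document-level views such as sentences and noun chunks (both produced by the parser component listed in the pipeline table above). A minimal sketch:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion. The deal was not announced.")
# Sentences detected by the pipeline
print([sent.text for sent in doc.sents])
# Base noun phrases ("noun chunks") in the document
print([chunk.text for chunk in doc.noun_chunks])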

DocBin

DocBin is a data structure for efficiently serializing and deserializing Doc objects, so that they can be saved and loaded across processes. It is used as follows:

# Import the required libraries
import spacy
from spacy.tokens import DocBin

# Load the English pretrained model
nlp = spacy.load("en_core_web_sm")

# Define the texts to process
texts = ["This is sentence 1.", "And this is sentence 2."]

# Turn each text into a Doc object with nlp and collect them in a list
docs = [nlp(text) for text in texts]

# Create a new DocBin object to hold the document data, with user data storage enabled
docbin = DocBin(store_user_data=True)

# Add each Doc object to the DocBin
for doc in docs:
    docbin.add(doc)

# Save the DocBin to a file
with open("documents.spacy", "wb") as f:
    f.write(docbin.to_bytes())

# Load the DocBin from the file
with open("documents.spacy", "rb") as f:
    bytes_data = f.read()

# Restore the DocBin object from the byte data
loaded_docbin = DocBin().from_bytes(bytes_data)

# Use nlp.vocab to get the vocabulary and retrieve all loaded documents from the DocBin
loaded_docs = list(loaded_docbin.get_docs(nlp.vocab))

# Print the loaded documents
loaded_docs
[This is sentence 1., And this is sentence 2.]           

Example

An Example object is used to train a spaCy model; it contains an input text (a Doc) and its corresponding gold annotations. For training spaCy models, see: spaCy-training.
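
A minimal sketch of building an Example is shown below; the entity character offsets are illustrative gold annotations written by hand for this sentence.

import spacy
from spacy.training import Example

nlp = spacy.load("en_core_web_sm")
# The "predicted" Doc: tokenization only, without trained annotations
doc = nlp.make_doc("Apple is looking at buying U.K. startup for $1 billion")
# Reference (gold) annotations given as character-offset entity spans
annotations = {"entities": [(0, 5, "ORG"), (27, 31, "GPE"), (44, 54, "MONEY")]}
example = Example.from_dict(doc, annotations)
print(example.reference.ents)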

Language

Language is one of the core objects of spaCy. It is responsible for tasks such as text preprocessing, part-of-speech tagging, and syntactic analysis. You can use spacy.load() to load a specific language model and obtain the corresponding Language object.
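
For example, a minimal sketch contrasting a blank Language object with a loaded trained pipeline (both are instances of the Language class):

import spacy
from spacy.language import Language

# A blank pipeline contains only a tokenizer, no trained components
nlp_blank = spacy.blank("en")
# A trained pipeline comes with components such as tagger, parser, and ner
nlp = spacy.load("en_core_web_sm")
print(isinstance(nlp_blank, Language), isinstance(nlp, Language))
print(nlp_blank.pipe_names, nlp.pipe_names)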

Lexeme

Lexeme is the representation of a word in the vocabulary. It contains context-independent information about the word, such as its text, whether it is alphabetic or a stop word, its frequency, and so on. The sample code is as follows:

nlp = spacy.load("en_core_web_sm")

# 定义一个单词
word = "hello"

# 获取单词对应的词元(lexeme)
lexeme = nlp.vocab[word]

# 打印词元的文本内容、是否为字母(alphabetical)
# 是否为停用词(stopword)\是否为字母(is_alpha),是否为数字(is_digit),是否为标题(is_title),语言(lang_)
print(lexeme.text, lexeme.is_alpha, lexeme.is_stop, lexeme.is_alpha, lexeme.is_digit, lexeme.is_title, lexeme.lang_)           
hello True False True False False en           

In fact, wherever possible, spaCy stores data in a vocabulary, the Vocab, that is shared by multiple documents. To save memory, spaCy also encodes all strings to hash values.

[Figure: the shared Vocab stores strings as hash values]

As shown below, the hash value of "coffee" is 3197928453018144401 in both models. Note, however, that this is specific to spaCy; other natural language processing libraries do not necessarily work this way.

import spacy

nlp = spacy.load("zh_core_web_sm")
doc = nlp("I love coffee")
print(doc.vocab.strings["coffee"])  # 3197928453018144401
print(doc.vocab.strings[3197928453018144401])  # 'coffee'           
3197928453018144401
coffee           
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I love coffee")
print(doc.vocab.strings["coffee"])  # 3197928453018144401
print(doc.vocab.strings[3197928453018144401])  # 'coffee'           
3197928453018144401
coffee           

Span

A Span is a continuous piece of text that can consist of one or more tokens. It is often used to mark entities or phrases.

nlp = spacy.load("zh_core_web_sm")
text = "东方明珠是一座位于中国上海市的标志性建筑!"
doc = nlp(text)

# 从doc中选择了第0个和第1个词元(token)组成的片段。
# 注意,spaCy中的词元是文本的基本单元,可能是单词、标点符号或其它词汇单位。
# 这里东方、明珠是前两个词
span = doc[0:2]  
print(span.text)           
东方明珠           
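
Spans can also be created from character offsets and carry a label, which is convenient for marking custom entities. A minimal sketch continuing with the doc from the example above (the FAC label is an assumption for illustration):

# Create a labeled Span from character offsets 0–4 (东方明珠);
# char_span returns None if the offsets do not align with token boundaries
labeled_span = doc.char_span(0, 4, label="FAC")
print(labeled_span.text, labeled_span.label_)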

4 See also

  • spaCy-github
  • spaCy
  • spaCy-models
  • Transformers
  • PaddleNLP
  • NLTK
  • spaCy-usage
  • sense2vec
  • spaCy-training
