[NOTE] Indexing MongoDB with ES

What is Full Text Search?

Omitted here (refer to ‘Information Retrieval (IR)’).

Why don’t we use MongoDB to search full text?

MongoDB supports full text search.

For example:

Step 1: Open the mongo shell and switch to a database called fulltext (MongoDB creates it on first use):

$ mongo
> use fulltext

Step 2: add a collection called articles:

> db.createCollection('articles')

Step 3: insert some articles, each with two fields, title and content:

> db.articles.insert({title:'xxxx',content:'xxx'})

Step 4: Now that we have documents, we need to index them using a MongoDB text index. Create a text index on both the title and content fields of the articles collection:

> db.articles.createIndex({title:'text',content:'text'})

Finally, with the index created, it’s time to search for something:

> db.articles.find({$text:{$search:'chinese'}})

It seems to work fine, so why not use it? Try searching for the partial word ‘chi’:

> db.articles.find({$text:{$search:'chi'}})

The result is empty! One of the biggest limitations of MongoDB is that a text index cannot do partial matching: it only matches whole (stemmed) terms, so ‘chi’ does not find ‘chinese’.
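
The empty result can be pictured with a toy term index in Python. This is only a simplified sketch, not MongoDB’s actual implementation (which also applies stemming and stop words); the two sample articles and the function name build_term_index are made up for illustration.

```python
from collections import defaultdict

def build_term_index(docs):
    """Map each whole lowercased word to the ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

# Hypothetical article titles, not taken from the original note.
docs = {1: "Chinese food guide", 2: "Italian food guide"}
index = build_term_index(docs)

# A whole term is a key of the index, so the lookup succeeds.
print(sorted(index.get("chinese", set())))  # [1]
# A prefix is not a key of the index, so nothing matches.
print(sorted(index.get("chi", set())))      # []
```

Because only whole words are stored as keys, a prefix lookup has nothing to hit unless the prefixes themselves are indexed.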

How to use Elasticsearch?

An indexing engine such as Elasticsearch (ES) not only provides partial matching but can also satisfy other indexing requirements; the example above is just a simple case. So how do we use Elasticsearch for full text search over MongoDB data?

ES Installation: ES is built on Java, so make sure Java is installed and the JAVA_HOME variable is set.

Once you have installed ES, this is the overall process we’ll follow:

Create the index for our documents.
Import our MongoDB collection into ES with a tool called mongo-connector.
Migrate the index created by mongo-connector in ES to the index we created in the first step.
Try out our new index and see how documents are indexed continuously while we keep mongo-connector running.

Let’s go through each step in detail.

How to create an ES index?

What is Analysis Chain?

We have to define an analysis chain: the pipeline that every document we insert into the index passes through in order to be indexed.

An analysis chain is formed by analysers. In short, an analyser is composed of three functions:

A character filter: Cleans up the string before it is tokenized.
A tokenizer: Splits the string (e.g. by spaces) into terms.
A token filter: Modifies the terms to suit the purpose of the index.
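
The three stages can be sketched as plain Python functions. This is a toy illustration under simple assumptions (tags stripped with a regular expression, whitespace splitting, lowercasing); real ES analysers are considerably more elaborate, and the function names are invented here.

```python
import re

def char_filter(text):
    """Character filter: replace HTML-like tags with spaces before tokenizing."""
    return re.sub(r"<[^>]+>", " ", text)

def tokenizer(text):
    """Tokenizer: split the cleaned string into terms on whitespace."""
    return text.split()

def token_filter(tokens):
    """Token filter: lowercase every term."""
    return [token.lower() for token in tokens]

def analyze(text):
    """Run the three stages in order, as an analyser would."""
    return token_filter(tokenizer(char_filter(text)))

print(analyze("<b>Chinese</b> Food Guide"))  # ['chinese', 'food', 'guide']
```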

Use one of the analysers: Edge N-grams

ES provides different analysers, along with building blocks for custom ones. One of these building blocks is the edge_ngram token filter.

Explanation:

N-Gram wikipedia: An n-gram is a contiguous sequence of n items from a given sequence of text or speech.

For example, word ‘blueberry’:

The 1-gram or unigram will be :

[b, l, u, e, b, e, r, r, y]

The 2-gram will be :

[bl, lu, ue, eb, be, er, rr, ry]

etc.

Edge N-grams: Edge n-grams are anchored to the beginning of the word.

For example, word ‘blueberry’:

The edge N-grams will be :

[b, bl, blu, blue, blueb, bluebe, blueber, blueberr, blueberry]
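
Both definitions fit in a few lines of Python. This is a minimal sketch for checking the lists by hand; the helper names ngrams and edge_ngrams are made up and are not part of ES.

```python
def ngrams(word, n):
    """All contiguous substrings of length n, left to right."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def edge_ngrams(word, min_gram=1, max_gram=None):
    """Prefixes of the word, from min_gram up to max_gram characters."""
    upper = len(word) if max_gram is None else min(max_gram, len(word))
    return [word[:n] for n in range(min_gram, upper + 1)]

print(ngrams("blueberry", 1))
# ['b', 'l', 'u', 'e', 'b', 'e', 'r', 'r', 'y']
print(ngrams("blueberry", 2))
# ['bl', 'lu', 'ue', 'eb', 'be', 'er', 'rr', 'ry']
print(edge_ngrams("blueberry"))
# ['b', 'bl', 'blu', 'blue', 'blueb', 'bluebe', 'blueber', 'blueberr', 'blueberry']
```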

So we can create the filter named autocomplete_filter:

{
    "filter": {
        "autocomplete_filter": {
            "type":     "edge_ngram",
            "min_gram": 3,
            "max_gram": 20
        }
    }
}

I used 3 as the minimum because, for very big databases, unigrams would slow searches down a lot, since a great many documents would match.

Now we need to define our custom analyser, named autocomplete:

{
    "analyzer": {
        "autocomplete": {
            "type":      "custom",
            "tokenizer": "standard",
            "filter": [
                "lowercase",
                "autocomplete_filter" 
            ]
        }
    }
}

Here we set two filtering steps: lowercase and autocomplete_filter.
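
To see roughly which terms this analyser would put in the index, here is a Python approximation. It assumes the standard tokenizer can be imitated by taking runs of word characters, which is a simplification, and the function name autocomplete_terms is invented for the sketch.

```python
import re

MIN_GRAM, MAX_GRAM = 3, 20  # same values as autocomplete_filter

def autocomplete_terms(text):
    """Approximate the index terms produced by the autocomplete analyser."""
    terms = []
    # Rough stand-in for the "standard" tokenizer: runs of word characters.
    for word in re.findall(r"\w+", text):
        word = word.lower()  # the "lowercase" filter
        upper = min(MAX_GRAM, len(word))
        # The edge n-gram step: prefixes of MIN_GRAM..upper characters.
        terms.extend(word[:n] for n in range(MIN_GRAM, upper + 1))
    return terms

print(autocomplete_terms("Chinese Tea"))
# ['chi', 'chin', 'chine', 'chines', 'chinese', 'tea']
```

Note that in this sketch a word shorter than min_gram yields no terms at all, which is one more consequence of choosing 3 as the minimum.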

Use REST-API

We have defined the filter and the analyser, so now let’s create the index with a curl command:

$ curl \
    -H 'Content-Type: application/json' \
    -X PUT http://localhost:9200/fulltext_opt \
    -d \
    "{ \
        \"settings\": { \
            \"number_of_shards\": 1, \
            \"analysis\": { \
                \"filter\": { \
                    \"autocomplete_filter\": { \
                        \"type\":     \"edge_ngram\", \
                        \"min_gram\": 3, \
                        \"max_gram\": 20 \
                    } \
                }, \
                \"analyzer\": { \
                    \"autocomplete\": { \
                        \"type\":      \"custom\", \
                        \"tokenizer\": \"standard\", \
                        \"filter\": [ \
                            \"lowercase\", \
                            \"autocomplete_filter\" \
                        ] \
                    } \
                } \
            } \
        } \
    }"

The {acknowledged: true} response means our index was created successfully. The fulltext_opt segment of the URL tells ES to create a new index with that name.
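
If the escaped quotes in the curl command become hard to read, the same request body can be built as a Python dictionary and serialized with json. The following is only a sketch: the helper names build_index_settings and create_index are invented, and create_index is defined but not called, since it assumes an ES node listening on localhost:9200.

```python
import json
import urllib.request

def build_index_settings(min_gram=3, max_gram=20):
    """Return the same settings as the curl command, as a Python dict."""
    return {
        "settings": {
            "number_of_shards": 1,
            "analysis": {
                "filter": {
                    "autocomplete_filter": {
                        "type": "edge_ngram",
                        "min_gram": min_gram,
                        "max_gram": max_gram,
                    }
                },
                "analyzer": {
                    "autocomplete": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "autocomplete_filter"],
                    }
                },
            },
        }
    }

def create_index(base_url="http://localhost:9200", name="fulltext_opt"):
    """PUT the settings to ES; only call this with an ES node running."""
    body = json.dumps(build_index_settings()).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/{name}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)

# Printing the body needs no server and shows the JSON that would be sent.
print(json.dumps(build_index_settings(), indent=2))
```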

The last thing we have to do in our index fulltext_opt is to create the mappings. We’ll create a mapping called articles and define the properties title and content on it:

$ curl \
    -H 'Content-Type: application/json' \
    -X PUT http://localhost:9200/fulltext_opt/_mapping/articles \
    -d \
    "{ \
        \"articles\": { \
            \"properties\": { \
                \"title\": { \
                    \"type\":     \"text\", \
                    \"analyzer\": \"autocomplete\" \
                }, \
                \"content\": { \
                    \"type\":    \"text\" \
                } \
            } \
        } \
    }"

The {acknowledged: true} response means the mappings were added.

That’s all! Our index in ES has been created!

How to import documents from MongoDB into ES?

There’s a tool called mongo-connector, which is what we need!

Step 1: Install mongo-connector with the Python package manager pip. You’ll also need elastic2-doc-manager, which provides the support for copying data from MongoDB into Elasticsearch 2.X or 5.X (6.X is not supported at present).

$ pip install mongo-connector 
$ pip install elastic2-doc-manager

Step 2: Start the MongoDB server as a replica set, because mongo-connector follows the replica set’s oplog. This step is the same as for Solr, since both Solr and ES rely on mongo-connector to stay in sync with MongoDB. (For more details about Solr, see Indexing MongoDB Data in Apache Solr.)

$ mongod --replSet rs0
$ mongo
> rs.initiate()
> exit

Step 3: Go into your ES installation directory, start ES, and create an index such as fulltext_opt (as shown earlier):

$ ./bin/elasticsearch
$ curl .../fulltext_opt ...

Step 4: It’s time to run the mongo-connector ( for es 2.X or 5.X ).

$ mongo-connector -m 127.0.0.1:27017 -t 127.0.0.1:9200 -d elastic2_doc_manager

Now, you can see the two indices, fulltext from MongoDB and fulltext_opt that we created, using the command:

$ curl 'localhost:9200/_cat/indices?v'

Step 5: We now have two indices: fulltext, created by mongo-connector, which is not optimized and holds our two documents, and fulltext_opt, which is optimized but empty. All we have to do now is copy the documents from one index to the other.

There is a great tool for this purpose called elasticdump which makes this task extremely easy.

$ npm install -g elasticdump
$ elasticdump \
    --input=http://localhost:9200/fulltext \
    --output=http://localhost:9200/fulltext_opt

You can run the indices query again and check docs.count for fulltext_opt, which has changed from 0 to 2:

$ curl 'localhost:9200/_cat/indices?v'

That’s it: the new index fulltext_opt now holds the documents copied from the fulltext index that mongo-connector built from MongoDB.

Finally, try out our new index using the search command:

$ curl \
    -H 'Content-Type: application/json' \
    'localhost:9200/fulltext_opt/articles/_search?pretty' \
    -d "{ \"query\": { \"match\": { \"title\": { \"query\": \"chi\", \"analyzer\": \"standard\" } } } }"

Got our document back!
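
Why does the query set "analyzer": "standard"? A plausible reading is that the query text should stay a whole term and be compared with the edge n-grams stored at index time, instead of being expanded into n-grams itself. The Python sketch below imitates that behaviour under simplifying assumptions (whitespace tokenizing, a match on any shared term); the function names and the sample title are made up.

```python
def index_terms(text, min_gram=3, max_gram=20):
    """Terms stored at index time: lowercased edge n-grams of every word."""
    terms = set()
    for word in text.lower().split():
        for n in range(min_gram, min(max_gram, len(word)) + 1):
            terms.add(word[:n])
    return terms

def standard_query_terms(query):
    """Terms used at search time: whole lowercased words, no n-grams."""
    return set(query.lower().split())

def matches(title, query):
    """A document matches if it shares at least one term with the query."""
    return bool(index_terms(title) & standard_query_terms(query))

title = "Chinese cooking"  # hypothetical article title
print(matches(title, "chi"))    # True: 'chi' is a stored prefix of 'chinese'
print(matches(title, "china"))  # False: 'china' is not a prefix of 'chinese'

# If the query were also split into edge n-grams, 'china' would match via 'chi'.
print(bool(index_terms(title) & index_terms("china")))  # True
```

The last line shows the looser matching that analysing the query with the n-gram filter would produce, which is usually not what an autocomplete search wants.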

What about the incremental indexing?

So far we have moved all the documents in MongoDB to the fulltext_opt index. But there is a problem: if we keep mongo-connector running, every new document inserted into MongoDB will be indexed into fulltext in ES, not into the optimized fulltext_opt (unless we run elasticdump again).

The way to solve this problem is to configure mongo-connector a bit more. It has many configuration options, which are listed in its documentation.

mongo-connector can also be configured via a JSON file. Here I’ll just use the command line.

$ mongo-connector -m 127.0.0.1:27017 -t 127.0.0.1:9200 -d elastic2_doc_manager -n fulltext.articles -g fulltext_opt.articles

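In the command, namespaces.include (-n on the command line) selects the MongoDB namespace to replicate, and namespaces.mapping (-g on the command line) maps it to the target namespace in ES, so documents from fulltext.articles go straight into fulltext_opt.articles.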
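
For the JSON-file route, the same options could be generated with a short Python script. Treat this as a sketch under assumptions: the key names (mainAddress, namespaces.include, namespaces.mapping, docManagers, docManager, targetURL) follow the mongo-connector configuration format as I understand it for the versions discussed here and may differ in other versions; the helper name build_connector_config is invented.

```python
import json

def build_connector_config(source_ns, target_ns):
    """Build a config dict equivalent to the -m, -t, -d, -n and -g flags."""
    return {
        "mainAddress": "127.0.0.1:27017",          # -m
        "namespaces": {
            "include": [source_ns],                # -n
            "mapping": {source_ns: target_ns},     # -g
        },
        "docManagers": [
            {
                "docManager": "elastic2_doc_manager",  # -d
                "targetURL": "127.0.0.1:9200",         # -t
            }
        ],
    }

config = build_connector_config("fulltext.articles", "fulltext_opt.articles")
print(json.dumps(config, indent=2))
```

The printed JSON could be saved to a file (say config.json, a name chosen here) and passed to mongo-connector with its -c option; check the option names against the documentation of your installed version first.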

Conclusion

What did we learn in this note?

How to use MongoDB to search full-text?
Why ES?
What are analysers?
How to create an ES index?
How to import MongoDB documents into ES?
How to implement the incremental indexing?
