
Vector Search with Milvus (Part 1): Reverse Image Search

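Installing Milvus

About Milvus

Milvus is an open-source platform that aims to be an integrated solution for vector retrieval, in the same way that Elasticsearch is an integrated solution for text search. See the official documentation for details: https://www.milvus.io/cn/docs/v0.11.0/overview.md

Installation

Installation is straightforward and the steps are clear, although the image download is slow.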

>> docker pull milvusdb/milvus:0.11.0-cpu-d101620-4c44c0
0.11.0-cpu-d101620-4c44c0: Pulling from milvusdb/milvus
75f829a71a1c: Pull complete
e654e509dcd3: Pull complete
482d74c614ad: Pull complete
85d20808a7e5: Pull complete
2f8820d4255e: Pull complete
Digest: sha256:6a5dc00b26dc18be5e5bfddc8cfb36370188e4c951e62ffafa30fbe3f4b1ad60
Status: Downloaded newer image for milvusdb/milvus:0.11.0-cpu-d101620-4c44c0
docker.io/milvusdb/milvus:0.11.0-cpu-d101620-4c44c0
           

Startup

docker run -d --name milvus_cpu_0.11.0 \
	-p 19530:19530 \
	-p 19121:19121 \
	-v /home/$yourname/milvus/db:/var/lib/milvus/db \
	-v /home/$yourname/milvus/conf:/var/lib/milvus/conf \
	-v /home/$yourname/milvus/logs:/var/lib/milvus/logs \
	-v /home/$yourname/milvus/wal:/var/lib/milvus/wal \
	milvusdb/milvus:0.11.0-cpu-d101620-4c44c0
           

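One small tip: leave out -d on the first run so the container stays in the foreground and you can see any errors. Once it starts cleanly, run it again with -d.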
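Once the container is up, a quick reachability check can confirm that the published port accepts connections. This is a minimal sketch using only the Python standard library; it tests the TCP port and nothing else, and the host address in the example is an assumption (use the machine where Docker runs).

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # 19530 is the port published by the docker run command above;
    # 127.0.0.1 is an assumed host for a local Docker installation.
    print("Milvus port reachable:", port_is_open("127.0.0.1", 19530))
```

A True result only means something is listening on the port; the `server_status` call in the client code is still the real health check.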

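The container reads its configuration from milvus.yaml in the mounted conf directory. If you cannot download milvus.yaml, try a proxy or VPN. I have also put a copy here: https://download.csdn.net/download/iterate7/13081889

Installing the admin UI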

>> docker pull milvusdb/milvus-em:v0.5.0
>> docker run -d -p 3000:80 -e API_URL=http://192.168.13.218:3000 milvusdb/milvus-em:v0.5.0
           

You can then inspect Milvus in the web UI.


Experimenting with vector search

This section covers the basic CRUD operations. The code is the most direct explanation.

import numpy as np
import random
from milvus import Milvus
from milvus import Status

_HOST = '192.168.xx.xx'
_PORT = 19530

# Connect to Milvus Server
milvus = Milvus(_HOST, _PORT)

# Close client instance
# milvus.close()

# Returns the status of the Milvus server.
server_status = milvus.server_status(timeout=4)
print(server_status)


# Vector parameters
_DIM = 8  # dimension of vector

_INDEX_FILE_SIZE = 32  # max file size of stored index

# the demo name.
collection_name = 'example_collection_'
partition_tag = 'demo_tag_'
segment_name= ''

# 10 vectors with 8 dimension, per element is float32 type, vectors should be a 2-D array
vectors = [[random.random() for _ in range(_DIM)] for _ in range(10)]
ids = [i for i in range(10)]

print(vectors)

# Returns the version of the client.
client_version= milvus.client_version()
print(client_version)

# Returns the version of the Milvus server.
server_version = milvus.server_version(timeout=10)
print(server_version)

print("has collection:",milvus.has_collection(collection_name=collection_name, timeout=10))


from milvus import DataType
# Information needed to create a collection. Default index_file_size=1024 and metric_type=MetricType.L2
collection_param = {
    "fields": [
        #  Milvus doesn't support string type now, but we are considering supporting it soon.
        #  {"name": "title", "type": DataType.STRING},
        {"name": "duration", "type": DataType.INT32, "params": {"unit": "minute"}},
        {"name": "release_year", "type": DataType.INT32},
        {"name": "embedding", "type": DataType.FLOAT_VECTOR, "params": {"dim": 8}},
    ],
    "segment_row_limit": 4096,
    "auto_id": False
}

# ------
# Basic create collection:
#     Create the collection on the first run, then create a partition tagged "American",
#     meaning the films we insert are American.
# ------
if collection_name not in milvus.list_collections():
    milvus.create_collection(collection_name, collection_param)
    milvus.create_partition(collection_name, "American")


# ------
# Basic create collection:
#     You can check the collection info and partitions we've created by `get_collection_info` and
#     `list_partitions`
# ------
print("--------get collection info--------")
collection = milvus.get_collection_info(collection_name)
print(collection)
partitions = milvus.list_partitions(collection_name)
print("\n----------list partitions----------")
print(partitions)

# ------
# Basic insert entities:
#     We have three films of The_Lord_of_the_Rings series here with their id, duration, release_year
#     and fake embeddings to be inserted. They are listed below to give you an overview of the structure.
# ------
The_Lord_of_the_Rings = [
    {
        "title": "The_Fellowship_of_the_Ring",
        "id": 1,
        "duration": 208,
        "release_year": 2001,
        "embedding": [random.random() for _ in range(8)]
    },
    {
        "title": "The_Two_Towers",
        "id": 2,
        "duration": 226,
        "release_year": 2002,
        "embedding": [random.random() for _ in range(8)]
    },
    {
        "title": "The_Return_of_the_King",
        "id": 3,
        "duration": 252,
        "release_year": 2003,
        "embedding": [random.random() for _ in range(8)]
    }
]

# ------
# Basic insert entities:
#     To insert these films into Milvus, we have to group values from the same field together like below.
#     Then these grouped data are used to create `hybrid_entities`.
# ------
ids = [k.get("id") for k in The_Lord_of_the_Rings]
durations = [k.get("duration") for k in The_Lord_of_the_Rings]
release_years = [k.get("release_year") for k in The_Lord_of_the_Rings]
embeddings = [k.get("embedding") for k in The_Lord_of_the_Rings]

hybrid_entities = [
    # Milvus doesn't support string type yet, so we cannot insert "title".
    {"name": "duration", "values": durations, "type": DataType.INT32},
    {"name": "release_year", "values": release_years, "type": DataType.INT32},
    {"name": "embedding", "values": embeddings, "type": DataType.FLOAT_VECTOR},
]

# ------
# Basic insert entities:
#     We insert the `hybrid_entities` into our collection, into partition `American`, with ids we provide.
#     If succeed, ids we provide will be returned.
# ------
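# NOTE: this loop re-inserts the same three films 2000 times under the same ids,
# so the counts and search results may not match the three-film description in
# the comments below. Use a single insert call to reproduce that description.
for _ in range(2000):
    ids = milvus.insert(collection_name, hybrid_entities, ids, partition_tag="American")
    print("\n----------insert----------")
    print("Films are inserted and the ids are: {}".format(ids))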


# ------
# Basic insert entities:
#     After insert entities into collection, we need to flush collection to make sure its on disk,
#     so that we are able to retrieve it.
# ------
before_flush_counts = milvus.count_entities(collection_name)
milvus.flush([collection_name])
after_flush_counts = milvus.count_entities(collection_name)
print("\n----------flush----------")
print("There are {} films in collection `{}` before flush".format(before_flush_counts, collection_name))
print("There are {} films in collection `{}` after flush".format(after_flush_counts, collection_name))

# ------
# Basic insert entities:
#     We can get the detail of collection statistics info by `get_collection_stats`
# ------
info = milvus.get_collection_stats(collection_name)
print("\n----------get collection stats----------")
print(info)

# ------
# Basic search entities:
#     Now that we have 3 films inserted into our collection, it's time to obtain them.
#     We can get films by ids, if milvus can't find entity for a given id, `None` will be returned.
#     In the case we provide below, we will only get 1 film with id=1 and the other is `None`
# ------
films = milvus.get_entity_by_id(collection_name, ids=[1, 200])
print("\n----------get entity by id = 1, id = 200----------")
for film in films:
    if film is not None:
        print(" > id: {},\n > duration: {}m,\n > release_years: {},\n > embedding: {}"
              .format(film.id, film.duration, film.release_year, film.embedding))

# ------
# Basic hybrid search entities:
#      Getting films by id is not enough, we are going to get films based on vector similarities.
#      Let's say we have a film with its `embedding` and we want to find `top3` films that are most similar
#      with it by L2 distance.
#      Other than vector similarities, we also want to obtain films that:
#        `released year` term in 2002 or 2003,
#        `duration` larger than 250 minutes.
#
#      Milvus provides Query DSL(Domain Specific Language) to support structured data filtering in queries.
#      For now milvus supports TermQuery and RangeQuery, they are structured as below.
#      For more information about the meaning and other options about "must" and "bool",
#      please refer to DSL chapter of our pymilvus documentation
#      (https://pymilvus.readthedocs.io/en/latest/).
# ------
query_embedding = [random.random() for _ in range(8)]
query_hybrid = {
    "bool": {
        "must": [
            {
                "term": {"release_year": [2002, 2003]}
            },
            {
                # "GT" for greater than
                "range": {"duration": {"GT": 250}}
            },
            {
                "vector": {
                    "embedding": {"topk": 3, "query": [query_embedding], "metric_type": "L2"}
                }
            }
        ]
    }
}

# ------
# Basic hybrid search entities:
#     And we want to get all the fields back in results, so fields = ["duration", "release_year", "embedding"].
#     If searching successfully, results will be returned.
#     `results` have `nq`(number of queries) separate results, since we only query for 1 film, The length of
#     `results` is 1.
#     We ask for top 3 in-return, but our condition is too strict while the database is too small, so we can
#     only get 1 film, which means length of `entities` in below is also 1.
#
#     Now we've gotten the results, and known it's a 1 x 1 structure, how can we get ids, distances and fields?
#     It's very simple, for every `topk_film`, it has three properties: `id, distance and entity`.
#     All fields are stored in `entity`, so you can finally obtain these data as below:
#     And the result should be film with id = 3.
# ------
results = milvus.search(collection_name, query_hybrid, fields=["duration", "release_year", "embedding"])
print("\n----------search----------")
for entities in results:
    for topk_film in entities:
        current_entity = topk_film.entity
        print("- id: {}".format(topk_film.id))
        print("- distance: {}".format(topk_film.distance))

        print("- release_year: {}".format(current_entity.release_year))
        print("- duration: {}".format(current_entity.duration))
        print("- embedding: {}".format(current_entity.embedding))

# ------
# Basic delete:
#     Now let's see how to delete things in Milvus.
#     You can simply delete entities by their ids.
# ------
# milvus.delete_entity_by_id(collection_name, ids=[1, 2])
# milvus.flush()  # flush is important
# result = milvus.get_entity_by_id(collection_name, ids=[1, 2])
#
# counts_delete = sum([1 for entity in result if entity is not None])
# counts_in_collection = milvus.count_entities(collection_name)
# print("\n----------delete id = 1, id = 2----------")
# print("Get {} entities by id 1, 2".format(counts_delete))
# print("There are {} entities after delete films with 1, 2".format(counts_in_collection))
#
# # ------
# # Basic delete:
# #     You can drop partitions we create, and drop the collection we create.
# # ------
# milvus.drop_partition(collection_name, partition_tag='American')
# if collection_name in milvus.list_collections():
#     milvus.drop_collection(collection_name)

# ------
# Summary:
#     Now we've gone through all basic communications pymilvus can do with Milvus server, hope it's helpful!
# ------
#https://github.com/milvus-io/pymilvus/tree/0.3.0#insert-entities-in-a-collection
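To make the hybrid query above concrete, here is a minimal pure-Python sketch of what such a search computes conceptually: apply the scalar filters first, then rank the remaining vectors by distance. It is an illustration only, not how Milvus implements the query; the function names are hypothetical, and squared L2 distance is used because it produces the same ranking as L2.

```python
import random


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def brute_force_hybrid_search(films, query, topk, years, min_duration):
    """Filter films by scalar fields, then rank the survivors by distance."""
    # Scalar filtering: mirrors the "term" and "range" (GT) clauses of the DSL.
    candidates = [
        f for f in films
        if f["release_year"] in years and f["duration"] > min_duration
    ]
    # Vector ranking: smallest distance first, keep at most topk hits.
    scored = sorted((squared_l2(f["embedding"], query), f["id"]) for f in candidates)
    return [(film_id, dist) for dist, film_id in scored[:topk]]


rng = random.Random(0)
films = [
    {"id": 1, "duration": 208, "release_year": 2001, "embedding": [rng.random() for _ in range(8)]},
    {"id": 2, "duration": 226, "release_year": 2002, "embedding": [rng.random() for _ in range(8)]},
    {"id": 3, "duration": 252, "release_year": 2003, "embedding": [rng.random() for _ in range(8)]},
]
query = [rng.random() for _ in range(8)]
hits = brute_force_hybrid_search(films, query, topk=3, years={2002, 2003}, min_duration=250)
print(hits)
```

With these three films, only the one with id 3 passes both filters, so a single hit comes back even though topk is 3.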
           

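Building an index

The code above only inserts data into the collection. For fast search over a large collection you still need to build an index.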

from milvus import Milvus

ivf_param = {"index_type": "IVF_FLAT", "metric_type": "L2", "params": {"nlist": 4096}}
# the demo name.
collection_name = 'example_collection_'
partition_tag = 'demo_tag_'
segment_name= ''
_HOST = '192.168.xx.xx'
_PORT = 19530

# Connect to Milvus Server
client = Milvus(_HOST, _PORT)
client.create_index(collection_name, "embedding", ivf_param)
           

With the index in place, search is much faster.
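The idea behind an IVF-style index can be sketched in a few lines of pure Python: group the vectors into `nlist` cells around centroids, then at query time scan only the `nprobe` cells closest to the query. This is a toy illustration under simple assumptions (a few rounds of k-means, no optimizations), not the Milvus implementation, and all function names here are hypothetical.

```python
import random


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(centroids, v):
    """Index of the centroid closest to v."""
    return min(range(len(centroids)), key=lambda c: squared_l2(centroids[c], v))


def assign(vectors, centroids):
    """Put each vector index into the cell of its nearest centroid."""
    cells = [[] for _ in centroids]
    for i, v in enumerate(vectors):
        cells[nearest(centroids, v)].append(i)
    return cells


def build_ivf(vectors, nlist, iters=5, seed=0):
    """Cluster vectors into nlist cells with a few rounds of k-means."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    centroids = [list(v) for v in rng.sample(vectors, nlist)]
    for _ in range(iters):
        cells = assign(vectors, centroids)
        for c, members in enumerate(cells):
            if members:  # keep the old centroid if a cell went empty
                centroids[c] = [
                    sum(vectors[i][d] for i in members) / len(members)
                    for d in range(dim)
                ]
    # Final assignment so the cells match the final centroids.
    return centroids, assign(vectors, centroids)


def ivf_search(vectors, centroids, cells, query, topk, nprobe):
    """Scan only the nprobe cells whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: squared_l2(centroids[c], query))
    candidates = [i for c in order[:nprobe] for i in cells[c]]
    return sorted(candidates, key=lambda i: squared_l2(vectors[i], query))[:topk]


rng = random.Random(42)
data = [[rng.random() for _ in range(8)] for _ in range(500)]
centroids, cells = build_ivf(data, nlist=16)
query = [rng.random() for _ in range(8)]

exact = sorted(range(len(data)), key=lambda i: squared_l2(data[i], query))[:3]
approx = ivf_search(data, centroids, cells, query, topk=3, nprobe=4)
full = ivf_search(data, centroids, cells, query, topk=3, nprobe=16)
print("exact:", exact, "nprobe=4:", approx, "nprobe=16:", full)
```

Probing every cell reproduces the exact result; probing fewer cells scans less data at the cost of possibly missing a true neighbor. That trade-off is why the speedup depends on the values chosen for `nlist` and `nprobe`.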

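Searching by image

The tutorial at https://tutorials.milvus.io/how-to-do-reverse-image-search-with-milvus/index.html is a rough reference. In essence, each image is converted to a vector, and Milvus indexes and searches those vectors.

After a search, map the returned ids back to their images for display.

The logic is simple.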
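The flow just described (vectorize, store under an id, search, map ids back to images) can be sketched without any external service. In this toy version the feature extractor is a hypothetical intensity histogram standing in for a real image model, images are small grayscale pixel grids, and plain dictionaries stand in for the Milvus collection; the file names are made up for illustration.

```python
def toy_embedding(pixels, bins=4):
    """Stand-in for a real feature extractor: a normalized intensity histogram."""
    hist = [0] * bins
    for row in pixels:
        for p in row:
            hist[min(p * bins // 256, bins - 1)] += 1
    total = sum(hist)
    return [h / total for h in hist]


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical image library: path -> grayscale pixels in the range 0..255.
images = {
    "dark.jpg": [[10, 20], [30, 40]],
    "bright.jpg": [[250, 240], [230, 220]],
    "mixed.jpg": [[10, 250], [120, 130]],
}

# Indexing: give each image an integer id, store its vector under that id
# (with Milvus this would be an insert), and remember which path the id means.
id_to_path = {}
id_to_vector = {}
for image_id, (path, pixels) in enumerate(images.items()):
    id_to_path[image_id] = path
    id_to_vector[image_id] = toy_embedding(pixels)


def search_by_image(pixels, topk=1):
    """Vectorize the query image, rank stored ids by distance, return paths."""
    q = toy_embedding(pixels)
    ranked = sorted(id_to_vector, key=lambda i: squared_l2(id_to_vector[i], q))
    return [id_to_path[i] for i in ranked[:topk]]


print(search_by_image([[15, 25], [35, 45]]))
```

The id-to-path mapping is the part the application has to keep itself, since the vector store only returns ids and distances.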


Vectorizing images with pic-search-webserver

docker run \
-v /Users/xx/milvus/data/VOCdevkit/VOC2012/JPEGImages:/tmp/pic1 \
-p 35000:5000 -e "DATA_PATH=/tmp/images-data" \
-e "MILVUS_HOST=192.168.xx.xx" milvusbootcamp/pic-search-webserver:0.7.0
           

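This command starts a service that converts images to vectors. A later article will look at this part in detail.

It assumes you have put some images in the JPEGImages folder and pulled the image beforehand: docker pull milvusbootcamp/pic-search-webserver:0.7.0

Using pic-search-webclient for the web UI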

>> docker pull milvusbootcamp/pic-search-webclient:0.1.0
>> docker run --name zilliz_search_images_demo_web  --rm -p 8001:80 \
-e API_URL=http://0.0.0.0:35000 \
milvusbootcamp/pic-search-webclient:0.1.0
           

Once it is running, you can try it out in the web interface.


Summary

  1. Hands-on installation and basic vector operations
  2. Image vectorization
  3. Vector indexing and search, deployed with Docker

References

  1. https://tutorials.milvus.io/how-to-do-reverse-image-search-with-milvus/index.html
  2. https://github.com/milvus-io/pymilvus/tree/0.3.0#insert-entities-in-a-collection
  3. https://zilliz.blog.csdn.net/article/details/103884272?utm_medium=distribute.pc_relevant_t0.none-task-blog-OPENSEARCH-1.channel_param&depth_1-utm_source=distribute.pc_relevant_t0.none-task-blog-OPENSEARCH-1.channel_param
