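Installing Milvus
About Milvus
Milvus is an open-source platform whose goal is to be an integrated platform for vector retrieval, much as Elasticsearch is an integrated platform for search. For details, see the official documentation: https://www.milvus.io/cn/docs/v0.11.0/overview.md
Installation
Installation is straightforward and the steps are clear, although the image download is rather slow.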
>> docker pull milvusdb/milvus:0.11.0-cpu-d101620-4c44c0
0.11.0-cpu-d101620-4c44c0: Pulling from milvusdb/milvus
75f829a71a1c: Pull complete
e654e509dcd3: Pull complete
482d74c614ad: Pull complete
85d20808a7e5: Pull complete
2f8820d4255e: Pull complete
Digest: sha256:6a5dc00b26dc18be5e5bfddc8cfb36370188e4c951e62ffafa30fbe3f4b1ad60
Status: Downloaded newer image for milvusdb/milvus:0.11.0-cpu-d101620-4c44c0
docker.io/milvusdb/milvus:0.11.0-cpu-d101620-4c44c0
Starting the server
docker run -d --name milvus_cpu_0.11.0 \
-p 19530:19530 \
-p 19121:19121 \
-v /home/$yourname/milvus/db:/var/lib/milvus/db \
-v /home/$yourname/milvus/conf:/var/lib/milvus/conf \
-v /home/$yourname/milvus/logs:/var/lib/milvus/logs \
-v /home/$yourname/milvus/wal:/var/lib/milvus/wal \
milvusdb/milvus:0.11.0-cpu-d101620-4c44c0
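One small tip: do not pass -d on the first start. Run the container in the foreground, watch for errors, and switch to -d only once it starts cleanly. That way you can see exactly what happens during startup.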
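Once the container is up, a quick way to confirm it is actually listening is to probe the two ports published by the docker run command. The following is a minimal sketch using only the Python standard library; the host `127.0.0.1` and the helper name `port_is_open` are assumptions for illustration, so substitute the address of your own Docker host.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # 19530 and 19121 are the ports mapped with -p in the docker run command.
    # The host below is an assumption; use your Docker host's address.
    for port in (19530, 19121):
        state = "open" if port_is_open("127.0.0.1", port) else "closed"
        print(f"port {port}: {state}")
```

An open port only shows that something accepts connections there. The `server_status` call in the Python client later in this post is the more meaningful health check.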
Also, if milvus.yaml cannot be downloaded directly, try going through a proxy. I have put a copy here as well: https://download.csdn.net/download/iterate7/13081889
Installing the admin UI
>> docker pull milvusdb/milvus-em:v0.5.0
>> docker run -d -p 3000:80 -e API_URL=http://192.168.13.218:3000 milvusdb/milvus-em:v0.5.0
You can then inspect Milvus in the web interface.
![](https://img.laitimes.com/img/_0nNw4CM6IyYiwiM6ICdiwiIyVGduV2YfNWawNCM38FdsYkRGZkRG9lcvx2bjxiNx8VZ6l2csMTVHRGaKhlWwwmMMBjVtJWd0ckW65UbM5WOHJWa5kHT20ESjBjUIF2X0hXZ0xCMx81dvRWYoNHLrdEZwZ1Rh5WNXp1bwNjW1ZUba9VZwlHdssmch1mclRXY39CXldWYtlWPzNXZj9mcw1ycz9WL49zZuBnLwMzN3EDOxgTMxATMxAjMwIzLc52YucWbp5GZzNmLn9Gbi1yZtl2Lc9CX6MHc0RHaiojIsJye.png)
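Experimenting with vector search
The main operations are insert, delete, update, and query. Reading the code is the most direct way to understand them.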
import numpy as np
import random
from milvus import Milvus
from milvus import Status
_HOST = '192.168.xx.xx'
_PORT = 19530
# Connect to Milvus Server
milvus = Milvus(_HOST, _PORT)
# Close client instance
# milvus.close()
# Returns the status of the Milvus server.
server_status = milvus.server_status(timeout=4)
print(server_status)
# Vector parameters
_DIM = 8 # dimension of vector
_INDEX_FILE_SIZE = 32 # max file size of stored index
# the demo name.
collection_name = 'example_collection_'
partition_tag = 'demo_tag_'
segment_name= ''
# 10 vectors with 8 dimension, per element is float32 type, vectors should be a 2-D array
vectors = [[random.random() for _ in range(_DIM)] for _ in range(10)]
ids = [i for i in range(10)]
print(vectors)
# Returns the version of the client.
client_version= milvus.client_version()
print(client_version)
# Returns the version of the Milvus server.
server_version = milvus.server_version(timeout=10)
print(server_version)
print("has collection:",milvus.has_collection(collection_name=collection_name, timeout=10))
from milvus import DataType
# Information needed to create a collection. Default index_file_size=1024 and metric_type=MetricType.L2
collection_param = {
    "fields": [
        # Milvus doesn't support string type now, but we are considering supporting it soon.
        # {"name": "title", "type": DataType.STRING},
        {"name": "duration", "type": DataType.INT32, "params": {"unit": "minute"}},
        {"name": "release_year", "type": DataType.INT32},
        {"name": "embedding", "type": DataType.FLOAT_VECTOR, "params": {"dim": 8}},
    ],
    "segment_row_limit": 4096,
    "auto_id": False
}
# ------
# Basic create collection:
# After creating the collection, we create a partition tagged "American", meaning the films we
# are about to insert are from America.
# Uncomment the two calls below on the first run; the later steps assume that the collection
# and the partition already exist.
# ------
# milvus.create_collection(collection_name, collection_param)
# milvus.create_partition(collection_name, "American")
# ------
# Basic create collection:
# You can check the collection info and partitions we've created by `get_collection_info` and
# `list_partitions`
# ------
print("--------get collection info--------")
collection = milvus.get_collection_info(collection_name)
print(collection)
partitions = milvus.list_partitions(collection_name)
print("\n----------list partitions----------")
print(partitions)
# ------
# Basic insert entities:
# We have three films of The_Lord_of_the_Rings series here with their id, duration, release_year
# and fake embeddings to be inserted. They are listed below to give you an overview of the structure.
# ------
The_Lord_of_the_Rings = [
    {
        "title": "The_Fellowship_of_the_Ring",
        "id": 1,
        "duration": 208,
        "release_year": 2001,
        "embedding": [random.random() for _ in range(8)]
    },
    {
        "title": "The_Two_Towers",
        "id": 2,
        "duration": 226,
        "release_year": 2002,
        "embedding": [random.random() for _ in range(8)]
    },
    {
        "title": "The_Return_of_the_King",
        "id": 3,
        "duration": 252,
        "release_year": 2003,
        "embedding": [random.random() for _ in range(8)]
    }
]
# ------
# Basic insert entities:
# To insert these films into Milvus, we have to group values from the same field together like below.
# Then these grouped data are used to create `hybrid_entities`.
# ------
ids = [k.get("id") for k in The_Lord_of_the_Rings]
durations = [k.get("duration") for k in The_Lord_of_the_Rings]
release_years = [k.get("release_year") for k in The_Lord_of_the_Rings]
embeddings = [k.get("embedding") for k in The_Lord_of_the_Rings]
hybrid_entities = [
    # Milvus doesn't support string type yet, so we cannot insert "title".
    {"name": "duration", "values": durations, "type": DataType.INT32},
    {"name": "release_year", "values": release_years, "type": DataType.INT32},
    {"name": "embedding", "values": embeddings, "type": DataType.FLOAT_VECTOR},
]
# ------
# Basic insert entities:
# We insert the `hybrid_entities` into our collection, into partition `American`, with ids we provide.
# If succeed, ids we provide will be returned.
# ------
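# Note: this loop repeats the same insert, with the same three ids, 2000 times.
# Remove the loop to insert the three films exactly once.
for _ in range(2000):
    ids = milvus.insert(collection_name, hybrid_entities, ids, partition_tag="American")
print("\n----------insert----------")
print("Films are inserted and the ids are: {}".format(ids))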
# ------
# Basic insert entities:
# After inserting entities into the collection, we need to flush it to make sure the data is on disk,
# so that we are able to retrieve it.
# ------
before_flush_counts = milvus.count_entities(collection_name)
milvus.flush([collection_name])
after_flush_counts = milvus.count_entities(collection_name)
print("\n----------flush----------")
print("There are {} films in collection `{}` before flush".format(before_flush_counts, collection_name))
print("There are {} films in collection `{}` after flush".format(after_flush_counts, collection_name))
# ------
# Basic insert entities:
# We can get the detail of collection statistics info by `get_collection_stats`
# ------
info = milvus.get_collection_stats(collection_name)
print("\n----------get collection stats----------")
print(info)
# ------
# Basic search entities:
# Now that we have 3 films inserted into our collection, it's time to obtain them.
# We can get films by ids, if milvus can't find entity for a given id, `None` will be returned.
# In the case we provide below, we will only get 1 film with id=1 and the other is `None`
# ------
films = milvus.get_entity_by_id(collection_name, ids=[1, 200])
print("\n----------get entity by id = 1, id = 200----------")
for film in films:
    if film is not None:
        print(" > id: {},\n > duration: {}m,\n > release_years: {},\n > embedding: {}"
              .format(film.id, film.duration, film.release_year, film.embedding))
# ------
# Basic hybrid search entities:
# Getting films by id is not enough, we are going to get films based on vector similarities.
# Let's say we have a film with its `embedding` and we want to find `top3` films that are most similar
# with it by L2 distance.
# Other than vector similarities, we also want to obtain films that:
# `released year` term in 2002 or 2003,
# `duration` larger than 250 minutes.
#
# Milvus provides Query DSL(Domain Specific Language) to support structured data filtering in queries.
# For now milvus supports TermQuery and RangeQuery, they are structured as below.
# For more information about the meaning and other options about "must" and "bool",
# please refer to DSL chapter of our pymilvus documentation
# (https://pymilvus.readthedocs.io/en/latest/).
# ------
query_embedding = [random.random() for _ in range(8)]
query_hybrid = {
    "bool": {
        "must": [
            {
                "term": {"release_year": [2002, 2003]}
            },
            {
                # "GT" for greater than
                "range": {"duration": {"GT": 250}}
            },
            {
                "vector": {
                    "embedding": {"topk": 3, "query": [query_embedding], "metric_type": "L2"}
                }
            }
        ]
    }
}
# ------
# Basic hybrid search entities:
# And we want to get all the fields back in results, so fields = ["duration", "release_year", "embedding"].
# If searching successfully, results will be returned.
# `results` has `nq` (number of queries) separate results; since we only query for 1 film, the length of
# `results` is 1.
# We ask for the top 3 in return, but our condition is too strict while the database is too small, so we can
# only get 1 film, which means the length of `entities` below is also 1.
#
# Now we've gotten the results, and known it's a 1 x 1 structure, how can we get ids, distances and fields?
# It's very simple, for every `topk_film`, it has three properties: `id, distance and entity`.
# All fields are stored in `entity`, so you can finally obtain these data as below:
# And the result should be film with id = 3.
# ------
results = milvus.search(collection_name, query_hybrid, fields=["duration", "release_year", "embedding"])
print("\n----------search----------")
for entities in results:
    for topk_film in entities:
        current_entity = topk_film.entity
        print("- id: {}".format(topk_film.id))
        print("- distance: {}".format(topk_film.distance))
        print("- release_year: {}".format(current_entity.release_year))
        print("- duration: {}".format(current_entity.duration))
        print("- embedding: {}".format(current_entity.embedding))
# ------
# Basic delete:
# Now let's see how to delete things in Milvus.
# You can simply delete entities by their ids.
# ------
# milvus.delete_entity_by_id(collection_name, ids=[1, 2])
# milvus.flush() # flush is important
# result = milvus.get_entity_by_id(collection_name, ids=[1, 2])
#
# counts_delete = sum([1 for entity in result if entity is not None])
# counts_in_collection = milvus.count_entities(collection_name)
# print("\n----------delete id = 1, id = 2----------")
# print("Get {} entities by id 1, 2".format(counts_delete))
# print("There are {} entities after delete films with 1, 2".format(counts_in_collection))
#
# # ------
# # Basic delete:
# # You can drop partitions we create, and drop the collection we create.
# # ------
# milvus.drop_partition(collection_name, partition_tag='American')
# if collection_name in milvus.list_collections():
#     milvus.drop_collection(collection_name)
# ------
# Summary:
# Now we've gone through all the basic communications pymilvus can do with the Milvus server, hope it's helpful!
# ------
#https://github.com/milvus-io/pymilvus/tree/0.3.0#insert-entities-in-a-collection
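To make the hybrid query easier to reason about, here is a brute-force sketch of the same idea in plain NumPy: apply the structured filters first (the "term" and "range" clauses), then rank the remaining vectors by distance. It is only an illustration of the semantics under my own assumptions, not how Milvus executes the query; the function name `hybrid_search` is hypothetical, and it ranks by squared L2 distance, which gives the same ordering as L2.

```python
import numpy as np


def hybrid_search(films, query, years, min_duration, topk=3):
    """Filter films by year and duration, then rank the survivors by squared L2."""
    # Structured filters, mirroring the "term" and "range" clauses of the query.
    candidates = [
        f for f in films
        if f["release_year"] in years and f["duration"] > min_duration
    ]
    if not candidates:
        return []
    vectors = np.asarray([f["embedding"] for f in candidates], dtype=np.float32)
    query = np.asarray(query, dtype=np.float32)
    # Squared L2 distance from the query to every remaining vector.
    distances = np.sum((vectors - query) ** 2, axis=1)
    order = np.argsort(distances)[:topk]
    return [(candidates[i]["id"], float(distances[i])) for i in order]


rng = np.random.default_rng(0)
films = [
    {"id": 1, "duration": 208, "release_year": 2001, "embedding": rng.random(8)},
    {"id": 2, "duration": 226, "release_year": 2002, "embedding": rng.random(8)},
    {"id": 3, "duration": 252, "release_year": 2003, "embedding": rng.random(8)},
]
hits = hybrid_search(films, rng.random(8), years={2002, 2003}, min_duration=250)
print(hits)
```

With the three films above, only id 3 satisfies both filters, which matches the result the script expects from Milvus.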
Building an index
The steps above only insert data into the collection. Real search still needs an index.
ivf_param = {"index_type": "IVF_FLAT", "metric_type": "L2", "params": {"nlist": 4096}}
# the demo name.
collection_name = 'example_collection_'
partition_tag = 'demo_tag_'
segment_name= ''
_HOST = '192.168.xx.xx'
_PORT = 19530
# Connect to Milvus Server
client = Milvus(_HOST, _PORT)
client.create_index(collection_name, "embedding", ivf_param)
With the index in place, search becomes very fast.
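Why does an IVF-style index speed things up? Roughly, the vectors are clustered into `nlist` buckets, and a query scans only the few buckets whose centroids are closest to it instead of the whole collection. Below is a toy NumPy sketch of that idea. It is not Milvus's implementation: the function names, the handful of k-means rounds, and the `nprobe` argument (the number of buckets scanned) are my own assumptions for illustration.

```python
import numpy as np


def nearest_centroid(vectors, centroids):
    """Index of the closest centroid for each vector (squared L2)."""
    d = ((vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)


def build_ivf(vectors, nlist, iters=10, seed=0):
    """Cluster vectors into nlist buckets with a few rounds of k-means."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), size=nlist, replace=False)].copy()
    for _ in range(iters):
        assign = nearest_centroid(vectors, centroids)
        for c in range(nlist):
            members = vectors[assign == c]
            if len(members):  # keep the old centroid if a bucket is empty
                centroids[c] = members.mean(axis=0)
    assign = nearest_centroid(vectors, centroids)
    buckets = [np.flatnonzero(assign == c) for c in range(nlist)]
    return centroids, buckets


def ivf_search(query, vectors, centroids, buckets, nprobe, topk):
    """Scan only the nprobe buckets whose centroids are closest to the query."""
    centroid_dist = ((centroids - query) ** 2).sum(axis=1)
    probe = np.argsort(centroid_dist)[:nprobe]
    candidate_ids = np.concatenate([buckets[c] for c in probe])
    dist = ((vectors[candidate_ids] - query) ** 2).sum(axis=1)
    order = np.argsort(dist)[:topk]
    return candidate_ids[order], dist[order]


rng = np.random.default_rng(1)
data = rng.random((2000, 8)).astype(np.float32)
centroids, buckets = build_ivf(data, nlist=16)
query = data[42]
ids, dist = ivf_search(query, data, centroids, buckets, nprobe=4, topk=3)
print(ids, dist)
```

Scanning fewer buckets is faster but can miss true neighbours that fall in an unscanned bucket; scanning all of them degenerates to the brute-force result.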
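Reverse image search
The tutorial at https://tutorials.milvus.io/how-to-do-reverse-image-search-with-milvus/index.html serves as a rough reference. In essence, the images are converted to vectors, and Milvus indexes and searches those vectors.
After a search, the returned ids are mapped back to their images for display.
The logic is very simple.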
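The flow can be sketched in a few lines: embed each image, store the vector under an id, keep an id-to-path table, and after a search translate the returned ids back into image paths. The sketch below is self-contained and makes several assumptions: `toy_embedding` is a grayscale histogram standing in for a real image model, `ImageIndex` is a hypothetical in-memory stand-in for Milvus, and the "images" are random pixel arrays rather than real files.

```python
import numpy as np


def toy_embedding(pixels, bins=8):
    """Stand-in for a real image model: a normalized grayscale histogram."""
    hist, _ = np.histogram(pixels, bins=bins, range=(0, 256))
    return hist / hist.sum()


class ImageIndex:
    """Keeps vectors for search and an id -> path table for display."""

    def __init__(self):
        self.vectors = []
        self.id_to_path = {}

    def add(self, path, pixels):
        image_id = len(self.vectors)
        self.vectors.append(toy_embedding(pixels))
        self.id_to_path[image_id] = path
        return image_id

    def search(self, pixels, topk=1):
        query = toy_embedding(pixels)
        dist = ((np.asarray(self.vectors) - query) ** 2).sum(axis=1)
        # Map the ids of the nearest vectors back to image paths.
        return [self.id_to_path[int(i)] for i in np.argsort(dist)[:topk]]


rng = np.random.default_rng(0)
dark = rng.integers(0, 64, size=(32, 32))       # hypothetical dark image
bright = rng.integers(192, 256, size=(32, 32))  # hypothetical bright image
index = ImageIndex()
index.add("dark.jpg", dark)
index.add("bright.jpg", bright)
print(index.search(rng.integers(0, 64, size=(32, 32))))
```

In the real setup the embedding comes from a neural network and the vectors live in Milvus, but the id-to-path lookup is still something the application has to keep on its side.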
Vectorizing images with pic-search-webserver
docker run \
-v /Users/xx/milvus/data/VOCdevkit/VOC2012/JPEGImages:/tmp/pic1 \
-p 35000:5000 -e "DATA_PATH=/tmp/images-data" \
-e "MILVUS_HOST=192.168.xx.xx" milvusbootcamp/pic-search-webserver:0.7.0
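This command starts a service that vectorizes the images. A later chapter will analyze this part in detail.
As a prerequisite, put some images in the JPEGImages folder, and pull the image beforehand: docker pull milvusbootcamp/pic-search-webserver:0.7.0
Interacting through the web page with pic-search-webclient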
>> docker pull milvusbootcamp/pic-search-webclient:0.1.0
>> docker run --name zilliz_search_images_demo_web --rm -p 8001:80 \
-e API_URL=http://0.0.0.0:35000 \
milvusbootcamp/pic-search-webclient:0.1.0
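Once it is running, you can use the web interface.
Summary
- Hands-on installation steps and some basic vector operations
- Vectorizing images
- Vector indexing and search, plus Docker deployment
References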
- https://tutorials.milvus.io/how-to-do-reverse-image-search-with-milvus/index.html
- https://github.com/milvus-io/pymilvus/tree/0.3.0#insert-entities-in-a-collection
- https://zilliz.blog.csdn.net/article/details/103884272?utm_medium=distribute.pc_relevant_t0.none-task-blog-OPENSEARCH-1.channel_param&depth_1-utm_source=distribute.pc_relevant_t0.none-task-blog-OPENSEARCH-1.channel_param