
Embedding Vector Recall in the Jike Recommender System: An Engineering Practice


Dynamic Square is a place to discover fun circles and interesting friends. Making sure every user sees the content they are interested in is the goal the Jike recommendation team has been working toward.

Last year, we shared a technical article on using Spark MLlib in online services, which solved the online serving problem for the ranking model. Over the past year, the Jike recommender system has completed a technical upgrade from Spark+XGBoost to TF+DNN. With the ranking model deepened, we began to explore applications of deep learning at the recall layer.


The goal of the recall layer of a recommender system is to roughly screen out, from a million-scale recommendation pool, the few hundred items most likely to interest the current user based on the user's profile and past consumption history, and to hand the results to the ranking model for finer sorting. Unlike ranking, the candidate set the recall layer faces is usually huge, so it is infeasible to compare every item in the candidate set against the current user one by one and pick out the ones the user is most likely to like. Instead, we generally index the content in the candidate pool in some way according to the needs of different businesses, so that a batch of content the user may like can be picked out along a particular dimension. Different indexing methods correspond to different recall strategies. Multiple recall channels run in parallel; together they collect content the user may be interested in from different angles to complete one recall pass. Different business scenarios have their own recall strategies, and different content types have their own indexing methods.
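
To make the multi-channel idea concrete, here is a minimal sketch of parallel recall channels being merged and de-duplicated before ranking. The channel names and returned ids are purely illustrative, not taken from the actual system:

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative recall channels; in practice each queries its own index.
def recall_by_tags(user):     # items matching the user's interest tags
    return ["post_1", "post_2", "post_3"]

def recall_by_follows(user):  # items from circles the user follows
    return ["post_2", "post_4"]

def recall_by_hot(user):      # globally popular items as a fallback
    return ["post_5", "post_1"]

CHANNELS = [recall_by_tags, recall_by_follows, recall_by_hot]

def multi_channel_recall(user):
    # Run all channels in parallel, then merge and de-duplicate,
    # preserving the order in which items first appear.
    with ThreadPoolExecutor(max_workers=len(CHANNELS)) as pool:
        results = pool.map(lambda channel: channel(user), CHANNELS)
    seen, merged = set(), []
    for items in results:
        for item in items:
            if item not in seen:
                seen.add(item)
                merged.append(item)
    return merged

print(multi_channel_recall({"user_id": 42}))
```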

Among these recall strategies, there is a special class known as embedding-based vector recall. As a strategy that balances accuracy and coverage, it is widely used across business scenarios. In vector recall, indexing is done on vectors: after training an embedding model and obtaining vectors for items and users, you can use the item vectors as index keys and the user vector as the query, and quickly find a batch of best-matching items from the massive candidate pool through an approximate nearest neighbor (ANN) algorithm. Because embeddings are so general, vector recall can be attempted for almost any business scenario and content type.
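
The post does not name a specific ANN library, but as an illustration of the lookup just described, here is a sketch using Faiss with random stand-in vectors:

```python
import numpy as np
import faiss  # pip install faiss-cpu

dim, num_items = 64, 100_000

# Stand-in item vectors; in a real system these come from the embedding model.
item_vecs = np.random.rand(num_items, dim).astype("float32")
faiss.normalize_L2(item_vecs)  # normalize so inner product = cosine similarity

# An IVF index clusters the items so each query only scans a few clusters,
# trading a little accuracy for a large speedup (the "approximate" in ANN).
quantizer = faiss.IndexFlatIP(dim)
index = faiss.IndexIVFFlat(quantizer, dim, 256, faiss.METRIC_INNER_PRODUCT)
index.train(item_vecs)
index.add(item_vecs)
index.nprobe = 16  # number of clusters scanned per query

# The user vector acts as the query against the item-vector index.
user_vec = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(user_vec)
scores, item_ids = index.search(user_vec, 100)  # top-100 candidate items
```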

At Jike, we had already built a number of recall strategies for the Dynamic Square recommendation stream, which to some extent met the need of helping Jike users discover interesting content. However, as the community's interest circles grew richer, user preferences became more diverse. To make the recommender system serve personalization better, we decided to start experimenting with an embedding-based recall strategy.

## Model Structure

There are many mature embedding models in the industry: from the classic Matrix Factorization, to item2vec and node2vec inspired by word2vec, to YouTubeDNN based on user click sequences, and the graph-based GCN and GraphSAGE. Different models differ in complexity and application scenario.

Considering the real-time requirements of Jike's dynamic-stream recommendation, we took inspiration from DSSM and chose a simple DNN two-tower model, trained in a supervised manner with click-through rate as the optimization target.

[Figure: structure of the two-tower model]

The two towers receive user-side and content-side features respectively and output a user vector and a content vector. Inside a single tower we chose the simplest multi-layer fully connected structure: the first layer is a feature embedding layer, which maps each raw input feature to a vector representation and concatenates all the feature vectors; the last layer is the output embedding layer, which produces the vector representation of the user side or the content side. After obtaining the two vectors, a simple distance function computes the distance between them, the final output is a scalar in [0, 1], and a cross-entropy loss is computed against the user's click feedback. Considering the real-time requirements of online recall, the dimension of the final output embedding is set to 64.
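
Here is a minimal Keras sketch of such a two-tower model. The feature fields, vocabulary sizes, and hidden-layer width are illustrative assumptions; only the 64-dim output matches the post, and the post does not specify the exact distance function, for which cosine similarity plus a sigmoid is one common choice:

```python
import tensorflow as tf
from tensorflow.keras import layers

EMB_DIM = 64  # output embedding dimension, as stated above

def build_tower(vocab_sizes, name):
    # First layer embeds each sparse id feature and concats the vectors;
    # the last layer projects down to the output embedding.
    inputs, embedded = [], []
    for feat, vocab in vocab_sizes.items():
        inp = layers.Input(shape=(1,), dtype="int64", name=feat)
        inputs.append(inp)
        embedded.append(layers.Flatten()(layers.Embedding(vocab, 16)(inp)))
    x = layers.Concatenate()(embedded)
    x = layers.Dense(128, activation="relu")(x)
    return tf.keras.Model(inputs, layers.Dense(EMB_DIM)(x), name=name)

# The real feature sets are not described in the post; these are stand-ins.
user_tower = build_tower({"user_id": 100_000, "age_bucket": 10}, "user_model")
item_tower = build_tower({"item_id": 1_000_000, "topic_id": 10_000}, "item_model")

# Cosine similarity squashed through a sigmoid yields the scalar in [0, 1].
cos = layers.Dot(axes=1, normalize=True)([user_tower.output, item_tower.output])
prob = layers.Activation("sigmoid")(cos)

# Cross-entropy against the user's click feedback, as described above.
model = tf.keras.Model(user_tower.inputs + item_tower.inputs, prob)
model.compile(optimizer="adam", loss="binary_crossentropy")
```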

## Model Deployment

With a trained model in hand, we need to deploy it as a real-time service for the online recommender system. In practice, the content-side vectors and the user-side vectors are computed separately, so at deployment time the two towers are split into two small models: the content model takes only content features as input and outputs a content vector, while the user model takes user-side features and outputs the user vector.
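
Continuing the sketch above, the split is straightforward because each tower is already a stand-alone Keras model; exporting each as a SavedModel completes it:

```python
# Each tower exports independently; the trailing "1" is the model version
# directory layout that TensorFlow Serving expects.
user_tower.save("export/user_model/1")  # user features -> 64-dim user vector
item_tower.save("export/item_model/1")  # item features -> 64-dim item vector
```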

With TensorFlow Serving, you can easily deploy model files directly as online services.
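
As a sketch, querying the deployed user model through TensorFlow Serving's REST API might look like this; the model name and feature payload follow the hypothetical export above:

```python
import requests

# TensorFlow Serving's REST predict endpoint: /v1/models/<name>:predict
resp = requests.post(
    "http://localhost:8501/v1/models/user_model:predict",
    json={"instances": [{"user_id": [42], "age_bucket": [3]}]},
)
user_vector = resp.json()["predictions"][0]  # the 64-dim user embedding
```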

## Indexing and Recall Architecture

Once the model is deployed online, an embedding-based recall strategy needs to be integrated into the existing recommender system architecture.

[Figure: indexing and recall architecture]

Our current indexing and recall process is based on Elasticsearch, and integrating the new vector recall channel into Elasticsearch is the most time-saving and convenient way.

The entire indexing and recall process is divided into two parts. In the dynamic indexing phase, we consume MongoDB's operation log (oplog) to implement near-real-time indexing and feature updates. When the index service receives an indexing request, it obtains or computes near-real-time features through the feature service, the prediction service then calls TF Serving to compute the post's embedding vector, and the vector is finally written into Elasticsearch together with the other fields. In the recommendation recall phase, the user's real-time features are computed by calling the feature service, the user's embedding vector is computed by calling TF Serving, and a batch of posts the user is most likely to like is then retrieved from ES, completing one embedding recall.
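
A sketch of the indexing half of that flow follows. The post says the MongoDB operation log is consumed; here a change stream (the supported way to tail the oplog) stands in for that, and all endpoint, index, and field names are illustrative:

```python
import requests
from elasticsearch import Elasticsearch
from pymongo import MongoClient

es = Elasticsearch("http://localhost:9200")
posts = MongoClient("mongodb://localhost:27017").feed.posts

# Change streams are the supported way to consume MongoDB's oplog.
for change in posts.watch([{"$match": {"operationType": "insert"}}]):
    doc = change["fullDocument"]
    # 1. Fetch or compute near-real-time features via the feature service.
    feats = requests.get(f"http://feature-service/posts/{doc['_id']}").json()
    # 2. Have TF Serving compute the post's embedding vector.
    vec = requests.post(
        "http://localhost:8501/v1/models/item_model:predict",
        json={"instances": [feats]},
    ).json()["predictions"][0]
    # 3. Write the vector into Elasticsearch alongside the other fields.
    es.index(index="posts", id=str(doc["_id"]),
             body={"topic_id": doc.get("topic_id"), "embedding": vec})
```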

Elasticsearch 7.x natively supports dense vector indexing and queries, but since our current Elasticsearch version is 6.7 and lacks native vector indexing, we use the vector indexing plug-in provided by Alibaba Cloud to handle vector indexing and approximate queries. The P95 latency of vector retrieval basically meets the latency requirements of the recommendation service.
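
The Alibaba Cloud plug-in's query DSL is not shown in the post; for illustration, the equivalent lookup on Elasticsearch 7.x's native dense_vector type looks like the sketch below. Note that script_score is an exact scan over the candidates, not ANN; approximate kNN in stock Elasticsearch only arrived later, in 8.x:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
user_vector = [0.1] * 64  # in practice, from the TF Serving call shown earlier

# Mapping: store each post's 64-dim embedding as a dense_vector field.
es.indices.put_mapping(index="posts", body={
    "properties": {"embedding": {"type": "dense_vector", "dims": 64}}
})

# Recall query: score candidates by cosine similarity to the user vector.
query = {
    "size": 100,
    "query": {
        "script_score": {
            "query": {"match_all": {}},
            "script": {
                # cosineSimilarity returns [-1, 1]; +1.0 keeps scores positive
                "source": "cosineSimilarity(params.user_vec, 'embedding') + 1.0",
                "params": {"user_vec": user_vector},
            },
        }
    },
}
top_hits = es.search(index="posts", body=query)["hits"]["hits"]
```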

## Results and Iteration Directions

After the first version of the embedding-based recall strategy was launched, the overall interaction rate of Jike's Dynamic Square increased by 33.75%, the largest improvement we have achieved from a single launch. The embedding-based recall strategy not only has the highest interaction rate among all recall strategies, but also accounts for the largest share of distribution.

The first version of the model is just a small first attempt at applying deep learning to the recall layer; there are still many problems, and both the model structure and the recommendation architecture leave plenty of room for optimization. On the architecture side, an urgent problem is how to automatically handle the vector version synchronization caused by embedding model updates. A fundamental property of embedding models is that the semantics of each dimension are unknown and unstable, and each training run produces a different vector space, which means vectors from different versions cannot be compared with each other. This creates real obstacles for iterative optimization of the model, and continuously rolling out model updates and switching between versions remains a hard problem. On the model side, the two-tower model has its advantages but also many shortcomings, such as not modeling the sequence of user behaviors. In the future, we plan to explore more possibilities in both model structure and training methods.

At Jike, we continue to research and apply mature, cutting-edge machine learning algorithms to a real-world recommender system, and we also focus on building a flexible and agile deployment process that allows models to iterate quickly. All of this is to help you discover more interesting circles and meet more interesting friends. The Jike recommendation team welcomes excellent engineers to join us and build this town together.

----


Author: Chengzu Ou, algorithm engineer at Jike

Source: WeChat public account "Jike technical team"

Source: https://mp.weixin.qq.com/s/X9lJhI7v2M5P_FLu14Ilmw
