1. Background
Multi-scenario modeling is algorithm optimization work that is tightly coupled to the business; at its core, it is about how to handle the differences and commonalities across multiple business scenarios.
1. Introduction to cloud music recommendation scenarios
Cloud Music's core scenarios include Daily Recommendation, a list of 30 songs refreshed every day; streaming-style recommendation, which is updated in real time and recommends songs to users directly; and playlist recommendation, including homepage playlists and MGC playlists.
2. Differentiation of the main recommendation scenario
These scenarios have different characteristics and serve users' different personalized recommendation needs. For a long time, each scenario was continuously optimized by a dedicated owner. The advantage of this approach is that the model can be tailored to the scenario's business characteristics, fully exploit the model's potential, and improve the experience of loyal users in that scenario.
There are also drawbacks: first, long-term dedicated optimization can lead to large divergence in technology stacks; second, it slows the pace of technology sharing and co-construction.
3. More and more new scenarios
The challenge we face is not only taking on more and more new scenarios to meet more users' personalized music demands and bring incremental gains, but also thinking about how to take on and continuously optimize these new scenarios well.
4. Modeling Objectives
The multi-scenario work has two main goals. The first is to serve all recommendation scenarios with one model and achieve better results: jointly model the consumption data users generate in any scenario and accurately model the underlying interests users truly care about. The second is, by serving all scenarios with one model, to effectively reduce machine and labor costs, improve R&D efficiency, and push technology co-construction to a higher level.
5. Modeling difficulties
The first difficulty is the double-seesaw problem, which consists of a multi-task seesaw and a multi-scenario seesaw; balancing the two is even harder when they are coupled. The other difficulty can be described with the Western story of David and Goliath: how can a shared large model beat the small model dedicated to each scenario? This is elaborated later.
2. Overall framework
1. System framework
Although this talk mainly discusses multi-scenario modeling at the algorithm level, just as importantly we have built a unified system framework from the data layer through the scenario layer up to the top-level tasks, replacing the previously scattered, non-unified, and non-standard technologies and the corresponding data, and built a unified model architecture on top of it.
2. Model architecture
This architecture is not very different from existing multi-scenario modeling architectures in the industry, but it incorporates the business characteristics unique to music scenarios and our own thinking. For example, given the strong tendency in music recommendation for old songs to be consumed repeatedly, we invested heavily in long-term multi-interest representations, combined them with real-time signals, and update them dynamically. At the same time, we hope to lift the music domain knowledge accumulated behind the business to a higher level so that it serves the multiple business scenarios above it.
3. Overall Overview
The overall architecture can be summarized in three phrases: bottom-up, seeking common ground while preserving differences, and discarding the false while keeping the true. Seeking common ground while preserving differences is the most important point of this sharing: multi-scenario work is largely about solidifying and consolidating the truly valuable common parts at the lowest cost, while keeping the differing parts in a fast and agile way; organically combining the two completes the construction of a unified multi-scenario large model.
3. Key modules
1. Unified model modeling
After studying much of the existing multi-scenario modeling work in the industry, we completed the design of the overall architecture. Important areas are marked with colored blocks for easier understanding. For example, the blue and yellow blocks at the bottom layer follow a separation of public and private domains: the public-domain part extracts the features and representations shared across scenarios, while the yellow part is more scenario-specific. The diagram also shows multiple purple towers in parallel, each representing knowledge unique to one scenario. On top of them sits the common MMoE architecture, which assists multi-task learning for the different objectives across scenarios.
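As a rough illustration of this layered design, here is a minimal sketch in PyTorch; the module names, dimensions, number of experts, and gating scheme are all assumptions for illustration, not the production model.

```python
import torch
import torch.nn as nn

class MultiScenarioModel(nn.Module):
    """Sketch: shared (public-domain) bottom, per-scene private towers,
    and an MMoE layer on top for multi-task learning. Illustrative only."""
    def __init__(self, pub_dim, priv_dims, n_experts=4, n_tasks=2, hidden=64):
        super().__init__()
        # Public domain: features shared by all scenarios.
        self.public_net = nn.Sequential(nn.Linear(pub_dim, hidden), nn.ReLU())
        # Private domain: one tower per scenario, each with its own feature space.
        self.private_nets = nn.ModuleList(
            [nn.Sequential(nn.Linear(d, hidden), nn.ReLU()) for d in priv_dims]
        )
        # MMoE: shared experts, one gate per task.
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU()) for _ in range(n_experts)]
        )
        self.gates = nn.ModuleList([nn.Linear(2 * hidden, n_experts) for _ in range(n_tasks)])
        self.heads = nn.ModuleList([nn.Linear(hidden, 1) for _ in range(n_tasks)])

    def forward(self, pub_x, priv_x, scene_id):
        shared = self.public_net(pub_x)
        private = self.private_nets[scene_id](priv_x)
        h = torch.cat([shared, private], dim=-1)
        expert_out = torch.stack([e(h) for e in self.experts], dim=1)   # [B, E, H]
        outs = []
        for gate, head in zip(self.gates, self.heads):
            w = torch.softmax(gate(h), dim=-1).unsqueeze(-1)            # [B, E, 1]
            outs.append(head((w * expert_out).sum(dim=1)))              # [B, 1] per task
        return outs
```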
2. Public-domain network design
The public domain expresses the common characteristics of business scenarios, users' shared interests, or the stable parts of their long-term and short-term interests. So how is this convergence done? It mainly tests how an algorithm engineer extracts the greatest common divisor from scattered business characteristics and differing logic. There are four key points to share: first, a common input structure; second, the greatest common divisor of features; third, sharing and co-construction; fourth, being lightweight and efficient. Sharing and co-construction and lightweight efficiency rest more on team culture, and are enforced so that the large model is served well. At the algorithm level, the emphasis is on retaining the core shared features: for example, finding and keeping the necessary core features through ablation experiments and cutting whatever can be cut, so that the burden is reduced as much as possible before the large model is built.
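As an illustration of the "greatest common divisor of features" point, here is a hypothetical sketch of a leave-one-group-out ablation loop; the feature-group names and the train_and_eval helper are assumptions for illustration, not the actual pipeline.

```python
# Hypothetical leave-one-group-out ablation used to decide which
# public-domain feature groups are worth keeping.
feature_groups = ["user_profile", "long_term_interest", "context", "device"]  # illustrative names

def train_and_eval(drop_groups):
    """Placeholder: retrain the public-domain model without `drop_groups`
    and return the offline metric (e.g. AUC). Implementation omitted."""
    ...

baseline = train_and_eval(drop_groups=[])
for group in feature_groups:
    score = train_and_eval(drop_groups=[group])
    # Groups whose removal barely hurts the metric are candidates for pruning,
    # keeping the shared input structure as light as possible.
    print(f"{group}: metric drop = {baseline - score:.4f}")
```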
3. Multi-scenario effect analysis
Based on this thinking, we ran several iterations and verified the effectiveness of the public-domain design with timely AB tests and analysis. Users are divided into 11 levels from 0 to 10 according to their activity, with level 0 the least active and level 10 the most active. As the figure shows, the horizontal axis gives the lift for each group, and the lift for levels 0-3 is significantly larger. The point of this data is to show that with a good public-domain design, user characteristics can be effectively expressed and consolidated, so that low-activity users benefit more.
4. Private-domain network design
The public domain is relatively basic, while the private domain is more complex. The core of the private domain is to retain each scenario's most distinctive and valuable features, emphasizing parameter isolation and gradient isolation: the towers do not interfere with each other, and their input features are completely different. These features come from feature mining unique to each scenario, such as the cover feature of one business scenario, or the signals a streaming scenario generates from real-time user feedback.
5. Scenario-private network (SEN)
To address the large differences among private scenarios, the general logic of the scenario-private network (SEN) is designed to improve reuse and overall iteration efficiency when onboarding new scenarios and merging old ones. In "seeking common ground while preserving differences", the converging part is mainly the public-domain network design, while preserving differences is the private-domain network design. This shows up mainly in three ways: first, the private features of private scenarios are not aligned; second, some private features are important but prone to overfitting; third, there is distribution drift. We borrowed some Transformer-style designs and combined them to address these problems.
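A minimal sketch of how one scenario-private (SEN) tower could realize the parameter and gradient isolation described above, assuming PyTorch; the layer sizes, the dropout, and the detach-based isolation are illustrative choices, not the exact design.

```python
import torch
import torch.nn as nn

class ScenePrivateTower(nn.Module):
    """Sketch of one SEN tower: it consumes only that scenario's private features
    plus a detached copy of the shared representation, so gradients from this
    scenario do not flow back into the public-domain network."""
    def __init__(self, priv_dim, shared_dim, hidden=64, dropout=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(priv_dim + shared_dim, hidden),
            nn.ReLU(),
            nn.Dropout(dropout),   # private features are important but easy to overfit
            nn.Linear(hidden, hidden),
        )

    def forward(self, priv_x, shared_repr):
        # detach() realizes the gradient isolation between private and public domains.
        return self.net(torch.cat([priv_x, shared_repr.detach()], dim=-1))
```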
6. Cross-scenario multi-task modeling
Next comes the multi-task seesaw in the double-seesaw problem; the diagram below lists several core scenarios and the tasks they face.
Based on the scenario-specific tasks, we designed the task-mask logic. For a scenario that has the task, the gradient is kept; for a scenario that does not have the task, the gradient is stopped, to avoid affecting the learning of the corresponding SEN network. This keeps scenarios, tasks, and the tasks specific to each scenario as isolated from one another in gradients as possible.
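A minimal sketch of the task-mask idea under stated assumptions: samples from all scenarios are mixed in one batch, and a task's loss is only back-propagated for samples from scenarios that actually define that task. The function signature and the binary losses are illustrative.

```python
import torch
import torch.nn.functional as F

def masked_multitask_loss(logits, labels, scene_ids, scene_has_task):
    """logits/labels: dict task_name -> [B] float tensors (labels in {0, 1});
    scene_ids: [B] long tensor; scene_has_task: dict task_name -> set of scene ids.
    Per-sample losses of a task are zeroed for scenarios that do not have it,
    so their gradients never reach the corresponding SEN network."""
    total = 0.0
    for task, y_hat in logits.items():
        mask = torch.tensor(
            [int(s.item()) in scene_has_task[task] for s in scene_ids],
            dtype=y_hat.dtype, device=y_hat.device,
        )
        per_sample = F.binary_cross_entropy_with_logits(y_hat, labels[task], reduction="none")
        total = total + (per_sample * mask).sum() / mask.sum().clamp(min=1.0)
    return total
```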
7. Lightweight design of the model
In the music recommendation scenario the user behavior sequence, especially the long-term behavior sequence, is very important. Introducing LSTM plus session segmentation to extract users' long-term interest features brought a significant improvement, but this feature and its network structure consume a large share of the model's compute. While iterating on the unified multi-scenario, multi-task large model, we found a relatively lighter alternative: a hierarchical attention network.
From the data comparison, hierarchical attention shows a slight negative trend on one core metric, but overall it is a worthwhile trade-off: sacrificing a small local gain for a larger global gain later, and improving overall iteration efficiency.
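A minimal sketch of hierarchical attention as described here and in the Q&A (the long sequence is split into chunks and attended within each chunk, then the short-term representation acts as the target query over the chunk summaries), in PyTorch; the chunking, head count, and pooling are assumptions.

```python
import torch
import torch.nn as nn

class HierarchicalAttention(nn.Module):
    """Sketch: intra-chunk attention over the long behavior sequence, then the
    short-term representation queries the chunk summaries (target attention)."""
    def __init__(self, dim, chunk_size=64):
        super().__init__()
        # dim must be divisible by num_heads
        self.chunk_size = chunk_size
        self.intra = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.inter = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, long_seq, short_repr):
        # long_seq: [B, L, D] with L a multiple of chunk_size; short_repr: [B, D]
        B, L, D = long_seq.shape
        chunks = long_seq.view(B * (L // self.chunk_size), self.chunk_size, D)
        intra_out, _ = self.intra(chunks, chunks, chunks)         # within-chunk attention
        summaries = intra_out.mean(dim=1).view(B, L // self.chunk_size, D)
        query = short_repr.unsqueeze(1)                           # short sequence as target
        out, _ = self.inter(query, summaries, summaries)          # across-chunk attention
        return out.squeeze(1)                                     # [B, D] long-term interest
```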
4. Application effect
After the model went live, the red-heart rate of the core recommendation scenarios rose by more than 10%, the core metrics of many small scenarios rose by more than 15%, and next-day retention rose by a relative 1%. The model was also applied to other NetEase Group businesses, where new-customer conversion rose by an absolute 0.2% and next-day retention by an absolute 0.2%. After launch, the model also replaced the previously fragmented and inconsistent technology stacks, improving overall efficiency and saving resources.
5. Prospects
Building on the unification of the overall model, we hope to further increase the model's capacity and sophistication and serve more NetEase Cloud Music business lines, not limited to music recommendation but also podcasts, live streaming, and so on, so as to maximize the model's impact across forms of collaboration.
6. Q&A
Q1: Can new domains and tasks be added to the model?
A1: Yes. In the private-domain SEN design, one network corresponds to one scenario, so a new scenario only needs to add its own tower to the private network, which is quite flexible.
Q2: Does "5 iterations" mean 5 iterations a week?
A2: No. It refers to the number of complete offline training runs used to try a new direction; improved offline training efficiency raises the iteration speed.
Q3: If the unified model is large, will there be conflicts or reduced efficiency if multiple people iterate on the model?
A3: The current practice is to split the iteration directions in advance, with different engineers owning directions that overlap as little as possible; overlap is further reduced because they are responsible for different businesses. Although interference is small, there are still problems: for example, when someone's gain is built on top of a colleague's earlier improvement, their personal benefit may be smaller than if they had run the AB test alone. We therefore emphasize incremental AB tests with plenty of diff comparisons. That is the current practice, and we are still thinking about how to improve collaboration efficiency.
Q4: What do you think about whether private-domain features need to be split out?
A4: Private-domain features have low reuse; most are unique and important to their own scenario, so we recommend not mixing them into the public-domain side, as mixing has worked poorly in practice.
Q5: Is it possible to reuse samples in multiple scenarios?
A5: Yes, samples from all scenarios are trained together; if offline training efficiency is not optimized, training time increases significantly.
Q6: Does hierarchical attention first fuse the long sequence and then fuse it with the short sequence?
A6: Yes. Attention is applied within the long sequence first, and then attention is applied against the short sequence.
Q7: Why do public-domain features improve results for inactive users?
A7: There are many scenarios in the music app, but a user typically uses only one or two of them. Multi-scenario modeling covers samples from the whole recommendation domain and changes the user's underlying representation, which helps cold-start for similar users instead of fitting only the characteristics of a single scenario.
Q8: If a new scenario is added as a private-domain increment, after the private domain is trained and all scenarios are integrated in the public domain, is the new scenario trained fully from scratch or incrementally?
A8: Direct incremental training may be unstable because the new samples are distributed differently.
Q9: Is the full-sample update a full cold start or a warm start?
A9: Full updates.
Q10: When a private-domain tower is added and the public-domain upper layers have to be fine-tuned, how does that affect the evaluation of other tasks?
A10: A new private-domain tower is usually evaluated directly with an AB test, and the whole recommendation domain is evaluated in one experiment. We have not observed negative effects of new private-domain towers on other scenarios, because the additions are mostly small scenarios, and the core scenarios are usually not the ones being newly added.
Q11: Why does mixing private-domain features work badly?
A11: Because the differences are large; it is like feeding pig feed to cattle, which makes them sick.
Q12: Are there any differences in the definitions of positive and negative samples in different scenarios?
A12: Yes, there are differences; the sample distributions differ greatly, and we observed that the sample volume of some scenarios can negatively affect other scenarios.
Q13: Is there a big difference in magnitude between different scenarios?
A13: If the gap is large, sampling loses some information, but the part that can be modeled jointly and benefits the whole is retained.
Q14: What considerations went into the loss?
A14: The multi-task layer is handled mainly by the task-mask design, and the multi-scenario layer ensures sample isolation through each scenario's independent sub-tower.
Q15: How does this framework handle the loss design for different sample types such as listwise, pairwise, and pointwise?
A15: It is currently mainly a pointwise framework; listwise logic is more of a secondary partial-order correction on top of pointwise, used for the partial-order expression of the multi-task layer.
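One plausible reading of "partial-order correction on top of pointwise" is a small pairwise term added to the pointwise loss; the sketch below is an assumption about the form, not the speaker's exact formulation.

```python
import torch
import torch.nn.functional as F

def pointwise_with_pairwise_correction(scores, labels, alpha=0.1):
    """scores, labels: [B] tensors for items shown in the same list.
    The main term is pointwise BCE; a small pairwise logistic term (assumed form)
    nudges the model toward the correct partial order between positives and negatives."""
    point = F.binary_cross_entropy_with_logits(scores, labels)
    pos = scores[labels > 0.5]
    neg = scores[labels <= 0.5]
    if pos.numel() == 0 or neg.numel() == 0:
        return point
    diff = pos.unsqueeze(1) - neg.unsqueeze(0)      # all positive/negative score pairs
    pair = F.softplus(-diff).mean()                 # pairwise logistic loss
    return point + alpha * pair
```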
Q16: After cross-scenario modeling, will the recommendations of each scenario converge?
A16: We have not observed this so far; the underlying recall differs across scenarios, and the parameter distributions of the sub-towers activated in different scenarios are very different.
Q17: Will the sample feedback delay be different for different scenarios?
A17: Yes, real-time scenarios feed back faster and the daily scenario slower; the samples are then aligned through unified batch processing.
Q18: Do the upper layers and output towers need to be designed separately for different scenarios?
A18: It depends on the business. If a task in a scenario is very important to the business and its impact on other scenarios is controllable, a separate output tower can be designed; there is no unified methodology.
Q19: With both songs and videos, can long-term modeling be done?
A19: At present multi-scenario modeling is mainly done within the song domain; because songs and videos differ greatly, a cross-domain modeling scheme would be more complex.
Q20: Does hierarchical attention only handle short sequences?
A20: No; long and short sequences are handled together, with the short sequence as the target and the long sequence as the sequence being attended over.
Q21: In a unified experiment, if two private-domain scenarios of similar size move in opposite directions, how do you evaluate it?
A21: Look at the overall total first, then at the rise and fall of the individual private-domain scenarios; the total matters most.
Q22: How does the task mask resolve the coupling between scenarios and tasks?
A22: Suppose scenario A has task A and scenario B does not. When building samples, samples from scenario A and scenario B are mixed, and the gradient of task A is masked when training on scenario B's samples.
Q23: How long are long sequences and short sequences?
A23: The long-term sequence has tens of thousands of items, and the medium- and long-term interest sequences have hundreds.
Q24: Could you explain hierarchical attention again?
A24: The short sequence serves as the target of target attention, and the long sequence serves as the sequence being attended over.
Q25: If scenarios differ greatly in scale, is information lost when sampling? How is it balanced?
A25: Some information is lost; it depends on how much the lost part affects the whole, and the part that can be modeled jointly and is beneficial is retained.
Q26: What is the idea behind the scenario-specific design of the upper-layer network?
A26: It is influenced by prior work such as STAR and follows the same design idea: a split into public-domain and private-domain experts. The private domain is the scenario-private network SEN, with a dedicated network per scenario, while the public-domain design expresses common interests at the user level rather than the scenario level.
Q27: In different scenarios the conversion-rate distribution of a task differs, and using the same output tower may cause problems. Is it necessary to separate the output for each task in each scenario?
A27: The bias is absorbed by the independent towers of the private-domain scenarios, which keeps COPC stable in small scenarios; gradient isolation ensures COPC accuracy, and the gradients flowing back into the public domain are based on users' common interests and are not affected by scenario-level conversion rates.