
Some people turn to Xiaohongshu for shopping recommendations ("planting grass"); others turn to it for the latest AI technology trends.

Yuyang, reporting from Aofei Temple

Qubit | Official account QbitAI

Xiaohongshu (Little Red Book) has changed.

You might still think of it as all "makeup" and "outfits," but these days, much of what people say about Xiaohongshu on social media is rather surprising: some turn to it for shopping recommendations, others for the latest AI technology trends.

It has taken on a bit of a "search engine" flavor.

What happened?

Digging into the data, we found that last year Xiaohongshu's tech and digital content grew 500% year-on-year, sports-event content grew 1,140% year-on-year, and the DAU of food content even surpassed that of beauty at one point.

On Xiaohongshu's home page, the drop-down menu now carries more than 30 category tags. Cooking tutorials, home guides, outdoor camping, travel tips, exam prep, even entrepreneurship: the breadth of content has long outgrown the beauty vertical the platform originally settled into.

An even more interesting figure: Xiaohongshu previously disclosed that up to 30% of users go straight to search after opening the app.


In other words, the continuing generalization of UGC keeps pushing past the boundaries of Xiaohongshu's community content map, and the user behavior that follows is quite different from the platform's established image.

Seen from the outside, the changes at Xiaohongshu are far from trivial. Seen from the inside, from a technical point of view, the challenges are multiplying too.

Content generalization and high-frequency search, combined with a mix of modalities such as images, text, and video, place higher demands on search and recommendation optimization.

On top of that, internet users' expectations for content quality keep rising, and the need for platforms and their machines to better grasp user psychology keeps growing.

So how should a platform handle this increasingly complex search and recommendation machinery?

Multimodal challenges for content communities

As one of the few content communities with large volumes of both image-text posts and short videos, the keyword Xiaohongshu offers is: multimodal learning.

"Multimodal" refers to different forms of information, such as text, images, and sound.

Multimodal learning, in turn, means building a unified model that combines these different types of information.

Put simply, once an AI can integrate different forms of information, such as images and words, it can go a step further in "understanding" a subject.
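As a rough illustration of that idea (not Xiaohongshu's actual models), the sketch below fuses toy text features and toy image features into one joint vector. The "encoders" here are crude stand-ins for real networks; only the late-fusion structure is the point:

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products act as cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def encode_text(text):
    """Toy text encoder: crude character statistics (stand-in for a language model)."""
    return l2_normalize([len(text), text.count(" ") + 1.0, sum(map(ord, text)) % 97])

def encode_image(pixels):
    """Toy image encoder: simple intensity statistics (stand-in for a CNN/ViT)."""
    mean = sum(pixels) / len(pixels)
    var = sum((p - mean) ** 2 for p in pixels) / len(pixels)
    return l2_normalize([mean, var, float(max(pixels))])

def fuse(text_vec, image_vec):
    """Late fusion: concatenate per-modality embeddings into one joint vector."""
    return l2_normalize(text_vec + image_vec)

# One "note": a caption plus a (fake) 4-pixel image.
note_vec = fuse(encode_text("camping by the lake"),
                encode_image([10, 200, 30, 90]))
```

The joint vector can then feed downstream tasks (retrieval, ranking, classification) that no single modality could support alone.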

It also enables effects like the following:

Ask an AI to paint from the prompt "angels in the sky, Unreal Engine effects," and it will produce a matching image.

If AI painting from text merely leaves people impressed, though, what practical significance does multimodal technology hold for internet products?

Just recently, a public AI class held by Xiaohongshu's technical team shared their explorations in multimodal algorithms, offering a glimpse of the chemistry between "multimodal learning," currently in full swing in academia, and a content community with massive amounts of UGC.

Multimodal search

Let's start with search.

When a user opens a Xiaohongshu search results page, the app also recommends related search terms:

In the past, these suggested queries were plain text.

After multimodal techniques were applied, each suggestion now carries an extra layer: an attractive, relevant "basemap" image. In other words, the AI automatically selects images that match the query term and displays them in the search results interface.
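Conceptually, picking such an image can be framed as nearest-neighbour matching in a shared embedding space. The sketch below uses tiny hand-made vectors purely for illustration; `pick_basemap`, the threshold, and the embeddings are all assumptions, not Xiaohongshu's implementation:

```python
def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

def pick_basemap(query_vec, candidates, threshold=0.5):
    """Return the id of the candidate image closest to the query embedding,
    or None if no candidate clears the relevance threshold."""
    best_id, best_score = None, threshold
    for image_id, image_vec in candidates.items():
        score = cosine(query_vec, image_vec)
        if score > best_score:
            best_id, best_score = image_id, score
    return best_id

# Hypothetical 3-d embeddings; a real system would use model outputs.
query = [0.9, 0.1, 0.0]               # e.g. the embedded query text
images = {"img_a": [0.8, 0.2, 0.1],   # visually on-topic
          "img_b": [0.0, 0.1, 0.9]}   # off-topic
choice = pick_basemap(query, images)  # "img_a"
```

The threshold matters: showing no basemap is better than showing an irrelevant one.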


Don't be fooled by how simple the change looks. Tang Shen, head of Xiaohongshu's multimodal algorithm group, revealed that after the feature was added, UVCTR (unique-visitor click-through rate) and PVCTR (page-view click-through rate) rose two- to threefold.
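For reference, the two metrics count different things: UVCTR counts people, PVCTR counts impressions. A minimal sketch with made-up numbers, assuming the standard definitions of these metrics:

```python
def uvctr(clicking_visitors, total_visitors):
    """UVCTR: share of unique visitors who clicked at least once."""
    return clicking_visitors / total_visitors

def pvctr(total_clicks, page_views):
    """PVCTR: clicks per page view (one visitor can contribute several)."""
    return total_clicks / page_views

# Illustrative numbers only, not Xiaohongshu's figures.
before = uvctr(120, 10_000)   # 0.012, i.e. 1.2%
after = uvctr(300, 10_000)    # 0.03, a 2.5x lift of the magnitude described
```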

Beyond that, another key embodiment of multimodal technology in search is searching by image.

Image search for specific items such as products, plants, and flowers is nothing new. But what if a user wants to search for a certain mood, a certain overall style?

That poses a new challenge for AI: object detection and recognition in complex scenes.


△ Searching for meme images

To solve this, the Xiaohongshu technical team built offline indexing and online retrieval around three core modules:

A front (preprocessing) module

Large-scale feature retrieval

A sorting (ranking) module


In the front module, the team developed a range of multimodal labels covering dimensions such as object detection, subject recognition, product attributes, and human-body attributes.

In the feature-retrieval module, the team used multi-task learning based on a Norm Classifier to fix the problem of recall results falling into inconsistent categories.

In the sorting module, the team fused in NLP signals such as OCR text and brand words extracted from titles, which significantly improved retrieval accuracy.
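Pieced together, the three modules form a label → recall → rerank pipeline. The toy sketch below mimics that flow under heavy simplifications (brute-force search instead of an ANN index, word overlap instead of real OCR fusion); none of the names or rules come from Xiaohongshu:

```python
def detect_tags(green_ratio):
    """Front-module stand-in: coarse content tags from one toy image feature."""
    return ["plant"] if green_ratio > 0.5 else ["other"]

def ann_search(query_vec, notes, top_k=2):
    """Feature-retrieval stand-in: brute-force nearest neighbours by squared
    distance (a production system would use an approximate index)."""
    def dist(note):
        return sum((a - b) ** 2 for a, b in zip(query_vec, note["vec"]))
    return sorted(notes, key=dist)[:top_k]

def rerank(candidates, query_text):
    """Sorting stand-in: boost candidates whose OCR/title words overlap the
    query, mimicking the fusion of text signals described above."""
    query_words = set(query_text.split())
    return sorted(candidates,
                  key=lambda n: len(query_words & set(n["ocr"].split())),
                  reverse=True)

notes = [
    {"id": "note1", "vec": [0.10, 0.20], "ocr": "succulent plant care"},
    {"id": "note2", "vec": [0.12, 0.22], "ocr": "camping gear list"},
    {"id": "note3", "vec": [0.90, 0.90], "ocr": "lipstick swatch"},
]
shortlist = ann_search([0.1, 0.2], notes)   # recalls note1 and note2
ranked = rerank(shortlist, "plant care")    # note1 rises to the top
```

The division of labour is typical: a cheap recall stage narrows millions of candidates, and a richer (here, text-aware) stage reorders the survivors.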

Content quality evaluation system

If the changes to search are the easiest to spot, the application of multimodal technology to content quality evaluation shapes Xiaohongshu's overall look and feel at a deeper level.

Starting around July and August last year, building on a purely classification-based multimodal system that tagged notes with category labels, the Xiaohongshu technical team turned its attention to establishing a note content quality evaluation system.

In other words, teaching the AI to judge which notes are more "useful" and more aesthetically valuable.

To this end, the team singled out two core atomic capabilities:

A cover-image aesthetic quality model

A multimodal note-quality scoring model


The basemap images behind the search suggestions mentioned earlier are built on exactly these capabilities. Beyond that, this content quality evaluation system also enables the structuring of different note types, such as image-text posts and videos, and the deduplication of search results pages.
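A minimal sketch of how two such sub-scores and a dedup pass might combine. The weighting, the similarity function, and the threshold are illustrative assumptions, not the platform's actual system:

```python
def note_quality(aesthetic_score, multimodal_score, w_aesthetic=0.4):
    """Blend the cover-aesthetics and multimodal note-quality sub-scores
    into one ranking signal; the weighting is an illustrative guess."""
    return w_aesthetic * aesthetic_score + (1 - w_aesthetic) * multimodal_score

def title_sim(a, b):
    """Toy pairwise similarity: Jaccard overlap of title word sets."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)

def dedup(results, sim, threshold=0.7):
    """Keep a result only if it is not too similar to any already-kept one."""
    kept = []
    for note in results:
        if all(sim(note, other) < threshold for other in kept):
            kept.append(note)
    return kept

results = ["easy camping recipes", "easy camping recipes ideas", "city guide"]
unique = dedup(results, title_sim)   # the near-duplicate second title is dropped
```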

To sum up, the biggest effect of applying multimodal technology across Xiaohongshu's business scenarios is this: high-quality content becomes easier for the people who need it to find, and the overall look and aesthetic quality of what users see improves.

For a UGC-driven community, that makes the positive feedback loop between users and content creators easier to sustain, which clearly benefits the community atmosphere as a whole.

It is also key to why the content of its notes, and the makeup of its user base, keep growing more diverse.

Why did Xiaohongshu change?

As noted above, the refinement of Xiaohongshu's look and feel is closely tied to new technology trends across the internet industry.

Image-text and short-video content are now mainstream on social media, and a traditional single-modality approach clearly struggles to fully describe information where text, images, and sound intersect.

Fusing feature information across multiple modalities has gradually become a new challenge in many practical applications, especially in fields with high demands on content understanding, such as search and recommendation.

Xiaohongshu itself already has both the key conditions and the urgent needs, in terms of scenarios and business alike.

First, in terms of scenarios: the content published on Xiaohongshu consists mainly of image-text posts and videos, so the platform naturally holds massive multimodal data.

Moreover, behind that multimodal data lies a wealth of user feedback data.

Second, a fast-growing business constantly runs into corner cases. User posts not only span many categories, such as food, beauty, home, and tech products, but may also arrive as image-only notes without text, image-plus-music notes, short videos without titles, and so on.

These new challenges and distinctive multimodal application scenarios also leave ample room for multimodal technology to land.

From meeting internal business needs to external technology output

In fact, Xiaohongshu's internal technology accumulation began earlier, in response to changing user needs. It has now moved into a new stage: from meeting business needs internally to exporting technology externally.

For example, this year the Xiaohongshu technical team had two papers accepted at CVPR, covering video retrieval and video content understanding.

In recent days, Xiaohongshu has also run an "AI open class," joined by doctoral supervisors from Shanghai Jiao Tong University, Beihang University, and ShanghaiTech University, which drew considerable attention from academia.

This live-streamed series, titled "REDtech is Coming," focuses on the latest multimodal trends in academia and industry.

In the first session, held on April 20, Liu Wei, professor and doctoral supervisor at Beihang University; Gao Shenghua, associate professor and doctoral supervisor at the School of Information Science and Technology, ShanghaiTech University; Xie Weidi, associate professor and doctoral supervisor at the School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University; and Tang Shen, head of Xiaohongshu's multimodal algorithm group, gave technical talks on multimodal content understanding.

Beyond the details of Xiaohongshu's multimodal practice described above, the talks were full of substance, covering topics such as "AI + music," "cross-modal image content understanding and video generation," and "self-supervised learning in multimodal content understanding: techniques and applications."

On the state of multimodal research, the speakers also shared plenty of sharp observations.

Xie Weidi said:

"Each modality contains different invariances and co-occurrences. For example, when a text mentions 'guitar,' it may correspond to thousands of visually different guitars. When we hear a dog barking, we most likely also see a dog.

Therefore, making sensible use of the characteristics of data in different modalities for collaborative training enables more efficient representation learning and better generalization to downstream inference tasks."

"Weakly correlated datasets still pose correlation problems; there is no such thing as a 'weak correlation problem.' If you are doing machine learning, it must go from input to output, and what sits in between is learning some functions."

"The misalignment between modalities cannot amount to mere weak correlation; there must be strong correlation, otherwise the network cannot learn. Of course, we now want to attempt causality, but most of what we regard as causality is in fact determined by correlation."

Of course, beyond content understanding, the boom in multimodal learning research also brings AI content creation, that is, multimodal human-computer interaction, including digital-human technology.

For example, an AI text-to-image tool called "Dream by WOMBO" recently held the No. 1 spot in the App Store's graphics and design category for days on end.


And that is another multimodal direction Xiaohongshu is exploring.
