Author: tensorchen

This article describes how the Tencent Docs AI assistant was explored and put into practice, from two angles: the application-side technical architecture and the capabilities provided by large AI models. Tencent Docs has integrated AI deeply into every document category to improve users' efficiency in work and daily life. With Tencent Docs AI, an idea can be quickly turned into detailed content and carried across different types of documents. When faced with complex information, Tencent Docs AI can also analyze and process it, helping you extract the valuable parts from large volumes of material and absorb them into your own understanding.

Chapter 1: The Challenges of Large Models for Efficiency Tools

With the release and popularity of ChatGPT, the world's attention has turned to large language models. The emergence of key capabilities such as strong language understanding and generation, contextual memory, learning from corrections, and chain-of-thought reasoning marks a technological inflection point for "AIGC". Developers everywhere, holding this powerful Thor's hammer, are eager to hammer every nail again, so early in the large model boom a saying went around: "Every app in the world can be rebuilt with a large model." The claim does not stand up to scrutiny, but large models can indeed bring striking efficiency gains in some fields, especially efficiency tools.

● Technically: The technology in the field of text generation is relatively mature

Text generation was the first application of large models, and it is also the area where the technology has developed fastest and is most mature.

Tencent Docs AI assistant technology practice

● Users: Attention is extremely high

User attention is an angle of analysis that is easy to overlook. However powerful a new technology or concept is, it ultimately has to become a product that serves users. A technology or product that nobody wants is not really a good one.

Baidu's keyword search index shows that since ChatGPT appeared it has reached a very broad audience with high acceptance and interest. Its search index peaked at 850,000, enough to call it this year's "Spring Festival Gala of the Internet". Comparing with historical data makes the scale more concrete:

The previous hot concept, the metaverse, peaked at a search index of only 100,000, less than 1/8 of ChatGPT's peak.

On Chinese New Year's Eve 2022, the search index for the keyword "Spring Festival Gala" was 1.5 million, so ChatGPT's peak attention has already reached more than half of the Gala's.

● The law of development: tools are always the first to change

History does not repeat, but it rhymes. In every past wave of technological change, tools were the first to be transformed, and each era has had its own generation of productivity tools.

Data on competing products worldwide and in China confirm this: users readily accept the combination of document tools and AI, and demand is strong, making this a key area where large models are being put into production after the boom.

Among the top 100 AI products worldwide, 12 are competing document tools; among the top 100 AI products in China, 26 are.

This is a new opportunity but also a new challenge. Reinventing traditional efficiency tools does not happen overnight: educating users, building product capabilities, differentiating from competitors, and finding a commercialization model are all new challenges. This article focuses on how AI technology is implemented on the product side and on the model side; the other topics are not expanded here and are left for follow-up articles.

Chapter 2: Document AI Technical Thinking and Architecture

From a technical perspective, this chapter introduces the architecture actually used in the Tencent Docs AI project, along with our technical thinking on implementing AI applications.

2.1 AI application technology thinking

Our thinking in practice can be summarized as:

1. What is difficult for people is also difficult for AI

2. If a program can do it, don't make AI do it

Here is an analogy that may not fit exactly:

People fishing: a person thinks and decides to use a fishing net (a tool) to catch fish. Ordinary people do not know how to make a fishing net; someone would have to teach them the skill, which is time-consuming and laborious, yields poor results, and pays off slowly.

In this analogy, AI plays the role of the person, and tools play the role of the fishing net.

In an actual document case, AI helps a user beautify a PPT: AI understands that the user wants the PPT beautified, and AI decides to use the PPT beautification tool to do it. AI does not do the beautification itself. For AI to beautify a PPT directly, someone would have to teach it the skill (training the model on massive amounts of high-quality PPT beautification data), which is time-consuming and laborious, yields poor results, and pays off slowly.

"Adjust the font of the entire PPT to Song" task

AI: Understands from the conversation that the user intends to adjust the font, and which font is wanted

Tool: The document's PPT font adjustment tool performs the actual change

"Create a PowerPoint about the history of the Ming Dynasty" task

AI: Understands from the conversation that the user intends to create a PPT and that the topic is the history of the Ming Dynasty

AI: Generates an outline and detailed text content for the topic

Tool (image search): Searches for images based on the outline, so that the PPT is illustrated

Tool (PPT template): Generates the complete PPT from the outline, text, and images plus a template
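The division of labor in the font task can be sketched as follows. This is a minimal illustration, not the Tencent Docs implementation: `recognize_intent` stands in for the large model call using simple keyword rules, and `Intent`, `set_font`, `TOOLS`, and `handle` are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Intent:
    """Structured result the model is assumed to return for a request."""
    tool: str
    args: dict


def recognize_intent(utterance: str) -> Intent:
    # Stand-in for the LLM: a real system would prompt the model to return
    # the tool name and its arguments. Keyword rules keep the sketch runnable.
    text = utterance.lower()
    if "font" in text:
        font = "Song" if "song" in text else "default"
        return Intent(tool="set_font", args={"font": font})
    raise ValueError(f"unrecognized request: {utterance!r}")


def set_font(slides: List[dict], font: str) -> List[dict]:
    # Plain engineering code performs the edit; no model is involved here.
    return [{**slide, "font": font} for slide in slides]


TOOLS: Dict[str, Callable[..., List[dict]]] = {"set_font": set_font}


def handle(utterance: str, slides: List[dict]) -> List[dict]:
    intent = recognize_intent(utterance)              # AI: understand the request
    return TOOLS[intent.tool](slides, **intent.args)  # tool: execute it


deck = [{"title": "Intro", "font": "Arial"}, {"title": "Plan", "font": "Arial"}]
updated = handle("Adjust the font of the entire PPT to Song", deck)
```

The model only has to produce a small structured intent; the deterministic tool guarantees that every slide is updated the same way.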

Tencent Docs covers many categories: Word, Excel, PPT, PDF, Form, Mind Map, Flowchart, SmartSheet, SmartCanvas, and a whiteboard category that is under development.

Each category is an output-oriented product form in which content and form are layered together (Word documents need formatting, and everyone has to learn how to beautify a PPT). The core, however, is expressing the content.

Therefore, when implementing AI applications in Tencent Docs, our technical approach is usually to apply AI to content-related problems and engineering to form or style problems.

2.2 Document AI technical architecture

● AICopilot: Provides the AI sidebar conversation entry service. It is mainly responsible for intent recognition and tool dispatch, intent persistence, flexible processing, caching logic, session archiving, and other conversation capabilities.

● AIServer: Provides the floating assistant capabilities specific to each category.

● AIAgent: Positioned as the AI agent. It mainly provides the collection of capabilities and tools for each document category, exposed as the interfaces that the upper-layer services actually invoke after intent recognition.

● AIEngine: The AI engine service for documents. It abstracts and encapsulates AI-related capabilities behind a unified abstract definition (mainly text-to-text, text-to-image, TTS, ASR, OCR, Embedding, and similar capabilities), shields the differences between different implementations of those capabilities, and lays the foundation for documents to switch seamlessly between them.

● AIOperation: Grayscale release policies, privacy authorization (flexible), and other operations work related to document AI.

● AIExtension: The AI extension service. It contains, and is where we plan, the other supporting capabilities needed when implementing AI applications, such as text search, image search, and a Python execution engine.
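The unified abstraction described for AIEngine can be sketched roughly as below. It assumes a single text-to-text capability; `TextGenerator`, `ModelA`, `ModelB`, and the `register`/`switch` methods are hypothetical names for illustration, and the two providers are stubs rather than real model clients.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional


class TextGenerator(ABC):
    """Unified text-to-text interface that every provider must implement."""

    @abstractmethod
    def generate(self, prompt: str) -> str:
        ...


class ModelA(TextGenerator):
    def generate(self, prompt: str) -> str:
        return f"[A] {prompt}"  # stub: a real provider would call its model API


class ModelB(TextGenerator):
    def generate(self, prompt: str) -> str:
        return f"[B] {prompt}"  # stub for a second, different provider


class AIEngine:
    """Routes every call through whichever provider is currently active."""

    def __init__(self) -> None:
        self._providers: Dict[str, TextGenerator] = {}
        self._active: Optional[str] = None

    def register(self, name: str, provider: TextGenerator) -> None:
        self._providers[name] = provider
        if self._active is None:  # the first registered provider is the default
            self._active = name

    def switch(self, name: str) -> None:
        if name not in self._providers:
            raise KeyError(f"unknown provider: {name}")
        self._active = name

    def generate(self, prompt: str) -> str:
        if self._active is None:
            raise RuntimeError("no provider registered")
        return self._providers[self._active].generate(prompt)


engine = AIEngine()
engine.register("model_a", ModelA())
engine.register("model_b", ModelB())
first = engine.generate("summarize this document")
engine.switch("model_b")  # callers do not change when the provider does
second = engine.generate("summarize this document")
```

Because callers depend only on the `generate` signature, swapping the underlying model is a configuration change rather than a code change in each document category.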

2.3 Document AI middle platform architecture

The idea of a document AI middle platform started from the fact that Tencent Docs has 10 categories: we wanted to empower the different categories with a middle-platform solution, and that is how it was built and rolled out. The need goes beyond the Tencent Docs product itself. Across the department's overall product matrix, the basic capabilities of Docs xAI also have to be delivered to different products through the middle platform.

The document AI middle platform is decoupled from specific models and from specific product applications. It forms a Docs xAI solution that offers an overall solution for the document AI field and can empower different AI application products.

2.4 Zhongshuge AI application framework

While bringing document AI applications and the middle platform into production, we also abstracted the AI technology and its surrounding ecosystem of capabilities into an AI application framework, positioned as an application framework for putting AI applications into production. Vision: AI For Everyone, lowering the technical threshold for AI applications and improving the efficiency of AI application development.

Philosophy:

1. Standardization: This mainly carries forward the first two deliverables of the Oteam, the AI application standards and the AI application specifications, which are ultimately delivered to business developers through the standardized AI application framework.

2. Visualization: Applications built on large language models often interact with the model and call external tools many times. Visualizing this process helps with debugging, troubleshooting, and operational analysis.

The framework will provide a UI platform with a visual view of the LLM application flow (including latency analysis, token consumption, and so on).

The framework will also provide LLM observability, reporting monitoring metrics, distributed traces, and logs based on the OpenTelemetry standard.

3. Multilingual framework: Implementations in multiple programming languages will be provided to suit different business scenarios and technology stacks.

The framework is friendly to people who are not AI specialists: it abstracts modules and capabilities from the user's perspective and supports AI application development in multiple languages, so that developers can focus on implementing AI product capabilities and optimizing their results.
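To make the visualization point concrete, here is a minimal sketch of recording each step of an LLM application flow with its latency and token consumption. It uses only the standard library as a stand-in; it is not the OpenTelemetry API, and `Trace`, `Step`, and the token numbers are hypothetical.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Step:
    name: str
    seconds: float = 0.0
    tokens: int = 0


@dataclass
class Trace:
    steps: List[Step] = field(default_factory=list)

    @contextmanager
    def step(self, name: str) -> Iterator[Step]:
        record = Step(name)
        start = time.perf_counter()
        try:
            yield record
        finally:
            # Record the elapsed time even if the step raised an exception.
            record.seconds = time.perf_counter() - start
            self.steps.append(record)

    def total_tokens(self) -> int:
        return sum(step.tokens for step in self.steps)


trace = Trace()
with trace.step("intent_recognition") as current:
    current.tokens = 120  # in practice, read from the model response
with trace.step("tool_call:image_search"):
    pass                  # an external tool call consumes no tokens
with trace.step("answer_generation") as current:
    current.tokens = 480

report = [(step.name, step.tokens) for step in trace.steps]
```

A UI can render `trace.steps` as a timeline, and the same records could be exported as spans by an OpenTelemetry-based reporter.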

Chapter 3: Technical Practice of Document AI Application Side

3.1 Q&A scenario application

One of the core capabilities of document products is communicating information, and AI Q&A over large volumes of information is one of the key scenarios for AI. It covers questions about content in Word, PPT, Sheet, mind map, collection table, knowledge base, and other document scenarios.

The key to engineering AI applications for documents is building a basic solution for document Q&A, and the key to that is getting the large model to understand the domain knowledge (the content of a particular document).

There are generally two solutions:

● Solution 1: Write the domain knowledge into the model's weights through fine-tuning (FT), or overlay it dynamically on the weights through LoRA.

● Solution 2: Pass the domain knowledge to the model at request time through the context.

A user's documents are that user's own information and mainly serve that user. We cannot train a model for each user. User data also changes frequently, so the model would have to be retrained after every change, which is not feasible. In addition, for user privacy reasons, we cannot use user data for training. Clearly, Solution 1 is not viable.

So what is actually implemented in Tencent Docs is Solution 2: passing the domain knowledge to the model in real time through the context.

This technique is called RAG (Retrieval-Augmented Generation). It combines retrieval and recall from a specific knowledge base with large model generation, and is used for complex knowledge-intensive tasks such as knowledge Q&A.

The overall solution is completed in series by the following modules:

● Document loading: Define a unified Document data model. The framework provides default loaders for typical data sources, and the business side can also implement the interface to add the data sources it needs.

● Document sharding: Large language models limit the context size, so large amounts of data need to be split into chunks.

● Document Embedding: The Embedding step vectorizes each chunk of text to give a better semantic representation.

● Document vector storage: Use a vector database to store the document vectors.

● Document recall: Recall the most relevant document chunks based on the question entered by the user.

● Q&A: Provide the recalled chunks plus the user's question to the large language model for knowledge Q&A.
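The modules above can be chained into a toy pipeline like the one below. It is a sketch under strong simplifications: the "embedding" is a bag-of-words count rather than a model-produced vector, the vector store is an in-memory list, and the final model call is omitted. All function and class names are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def split(text: str, max_words: int = 12) -> List[str]:
    """Document sharding: pack whole sentences into chunks of limited size."""
    chunks, current = [], []
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        if current and len(" ".join(current + [sentence]).split()) > max_words:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system calls an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[term] * b[term] for term in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class VectorStore:
    """In-memory stand-in for a vector database."""

    def __init__(self) -> None:
        self.items: List[Tuple[Counter, str]] = []

    def add(self, chunks: List[str]) -> None:
        self.items.extend((embed(chunk), chunk) for chunk in chunks)

    def recall(self, question: str, k: int = 1) -> List[str]:
        query = embed(question)
        ranked = sorted(self.items, key=lambda item: cosine(query, item[0]), reverse=True)
        return [chunk for _, chunk in ranked[:k]]


def build_prompt(question: str, context: List[str]) -> str:
    joined = "\n".join(f"- {chunk}" for chunk in context)
    return f"Answer using only the context below.\nContext:\n{joined}\nQuestion: {question}"


TEXT = (
    "The launch review is scheduled for Friday. Marketing owns the launch checklist. "
    "Travel expenses are reimbursed within ten days. Receipts must be attached to every claim."
)
chunks = split(TEXT)
store = VectorStore()
store.add(chunks)
question = "When are travel expenses reimbursed?"
context = store.recall(question, k=1)
prompt = build_prompt(question, context)  # this prompt would be sent to the LLM
```

Only the recalled chunk reaches the model, which is how the domain knowledge enters through the context without any retraining.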

To handle the following two kinds of scenarios, further upgrades to the original architecture are planned.

1. Q&A over metadata, plus summary and non-summary questions

2. Q&A involving multimodal documents

3.2 Intent Recognition Applications

To deliver value in the application, user intent has to be translated into specific actions.

Challenge 1: Hundreds of command scenarios

Challenge 2: Intents and task flows are not mutually exclusive, and a single flow may chain multiple tools

The following is an example of actual user use:

Across the different kinds of user input, the key to implementing AI functions lies in intent recognition and task orchestration:

● Index each task uniquely by PromptID

● Standardize capabilities by wrapping them as tools

● Orchestrate tasks "as code" (following GitLab's approach, using yml to orchestrate hundreds of task scenarios)
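The three points above can be sketched as a small declarative orchestrator. To stay dependency-free, the task definitions are written as a Python dict rather than loaded from yml files, and the PromptIDs, tool names, and tool bodies are all hypothetical stand-ins.

```python
from typing import Any, Callable, Dict, List

Context = Dict[str, Any]


def generate_outline(ctx: Context) -> Context:
    # Stand-in for the LLM step that writes an outline for the topic.
    ctx["outline"] = [f"{ctx['topic']}: section {i}" for i in (1, 2)]
    return ctx


def search_images(ctx: Context) -> Context:
    # Stand-in for the image search tool: one image per outline entry.
    ctx["images"] = [f"image-{i}.png" for i, _ in enumerate(ctx["outline"], start=1)]
    return ctx


def render_ppt(ctx: Context) -> Context:
    # Stand-in for the template tool that assembles the final slides.
    ctx["slides"] = list(zip(ctx["outline"], ctx["images"]))
    return ctx


# Every capability is exposed as a tool with the same signature.
TOOLS: Dict[str, Callable[[Context], Context]] = {
    "generate_outline": generate_outline,
    "search_images": search_images,
    "render_ppt": render_ppt,
}

# Task definitions indexed by PromptID. In practice these would live in
# yml files; a dict with the same shape keeps the sketch self-contained.
WORKFLOWS: Dict[str, List[str]] = {
    "ppt.create": ["generate_outline", "search_images", "render_ppt"],
    "ppt.outline": ["generate_outline"],
}


def run(prompt_id: str, ctx: Context) -> Context:
    for tool_name in WORKFLOWS[prompt_id]:
        ctx = TOOLS[tool_name](ctx)
    return ctx


result = run("ppt.create", {"topic": "Ming Dynasty history"})
outline_only = run("ppt.outline", {"topic": "Ming Dynasty history"})
```

Adding a new command scenario then means adding one entry to the workflow definitions rather than writing new dispatch code.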

The bigger challenge is multi-intent recognition. A user may ask to adjust the font and the font size at the same time. With the solution above, a single function call cannot handle this: function call parameters are limited, and it is impossible to predict every user behavior.

There are roughly two possible solutions:

Solution 1: Multi-round Function Call

Solution 2: Generate code

In the end, we plan to adopt code generation, mainly because the multi-round function call approach cannot solve the problem of task ordering, whereas generated code can.
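A minimal sketch of the code generation approach: the model is assumed to return a short script that calls registered tools in order, and the application validates and executes it. The `generated` string is a hand-written stand-in for model output, the tool names are hypothetical, and the whitelist check here is an illustration rather than a complete sandbox.

```python
import ast
from typing import Callable, Dict, List

doc = {"font": "Arial", "size": 11}
log: List[str] = []


def set_font(name: str) -> None:
    doc["font"] = name
    log.append("set_font")


def set_font_size(points: int) -> None:
    doc["size"] = points
    log.append("set_font_size")


TOOLS: Dict[str, Callable[..., None]] = {
    "set_font": set_font,
    "set_font_size": set_font_size,
}


def run_generated(script: str, tools: Dict[str, Callable[..., None]]) -> None:
    """Run a script made only of calls to registered tools with literal arguments."""
    for node in ast.parse(script).body:
        call = node.value if isinstance(node, ast.Expr) else None
        if not (isinstance(call, ast.Call)
                and isinstance(call.func, ast.Name)
                and call.func.id in tools):
            raise ValueError("only calls to registered tools are allowed")
        if any(keyword.arg is None for keyword in call.keywords):
            raise ValueError("**kwargs unpacking is not allowed")
        args = [ast.literal_eval(arg) for arg in call.args]
        kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}
        tools[call.func.id](*args, **kwargs)  # statements run in script order


# Stand-in for what the model might return for
# "change the font to Song and the size to 14".
generated = 'set_font("Song")\nset_font_size(14)\n'
run_generated(generated, TOOLS)
```

The script itself carries the order of operations, so two intents in one request become two ordered statements instead of two separately negotiated function calls.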

3.3 Application Scenarios of Tables

The biggest challenge in the table scenario is the size of the table content: given the context capacity of current large models, only a limited number of cells can be supported. The core strategy for very large tables is to change what the AI returns, from the result itself to the method for computing the result (that is, code).
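A rough sketch of that strategy: only the column names and a few sample rows go into the prompt, the model returns a function, and the function runs locally over the full table. `fake_model` is a stub standing in for the large model, and the restricted `exec` scope is for illustration only; it is not a real sandbox.

```python
from typing import Dict, List

Row = Dict[str, int]


def build_prompt(question: str, rows: List[Row], sample: int = 3) -> str:
    # The prompt size depends on the sample, not on the size of the table.
    columns = list(rows[0].keys())
    return (
        f"Columns: {columns}\n"
        f"Sample rows: {rows[:sample]}\n"
        f"Question: {question}\n"
        "Return a Python function solve(rows)."
    )


def fake_model(prompt: str) -> str:
    # Stand-in for the LLM: returns the method (code), not the result.
    return "def solve(rows):\n    return sum(row['amount'] for row in rows)\n"


def answer(question: str, rows: List[Row]) -> int:
    code = fake_model(build_prompt(question, rows))
    scope: dict = {"__builtins__": {"sum": sum}}  # expose only what is needed
    exec(code, scope)
    return scope["solve"](rows)  # the full table is processed locally


table = [{"id": i, "amount": i % 10} for i in range(100_000)]
prompt = build_prompt("What is the total amount?", table)
total = answer("What is the total amount?", table)
```

The 100,000 rows never enter the model context; only the generated method touches them.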

Chapter 4: Document AI Model-Side Technical Practice

4.1 Content creation scenario model

Use data augmentation methods to strengthen weak capabilities

For creative writing ability, we use methods such as Self-Instruct and Evol-Instruct: starting from seed instructions, we construct similar instructions and augment the data through complexity evolution and generalization. This can be organized into a fairly standard process:

● Collect seed instructions: Gather new requirements and manually write simple seed instructions;

● Instruction diversification: Following Self-Instruct and the in-breadth evolution of Evol-Instruct, diversify the seed instructions to cover more domains, topics, forms, and so on;

● Instruction complexity: Following the in-depth evolution operations of Evol-Instruct (e.g., adding constraints, adding reference examples, concretizing), add constraints to the seed instructions to make them more complex, with 3-10 constraints per instruction;

● Instruction generalization: Paraphrase the evolved instructions to further enrich the wording and forms, rewriting each instruction in 3-5 ways;

● Result collection: Annotate the evolved instructions above and collect their results;

● Result cleaning: Use self-refine, manual inspection, and similar methods to bring the accuracy of the collected results close to 100%.

Use contrastive samples to improve the stability of instruction understanding

For tasks where small differences are hard for the model to tell apart, such as missed constraints, negative constraints, and numerical requirements, we can construct contrastive samples and add them to SFT or reinforcement learning. Such samples can be learned directly in the SFT stage, or built into pair data for preference learning.

● Local comparison: When an instruction has many constraints, it is hard for the model to honor all of them, and some tend to be missed. By removing the constraints from the instruction one at a time while leaving everything else unchanged, we add local comparison samples, so that for each constraint the model has seen the response both with and without that constraint in the instruction.

● Negative comparison: For negative constraints, construct comparison samples by removing the negative condition, or by negating it

Write an email about making an appointment with one of our beauty and skincare therapists in advance to enjoy professional facial treatment and personalized skincare recommendations. The email must contain basic parts such as the subject, recipient, sender, and body of the email. In the email, state that the recipient needs to complete the appointment confirmation and schedule the skincare therapist within 48 hours of the appointment, and remind the recipient to reply to the appointment information by phone or email. Do not include "Best wishes".

● Number comparison: Change the numbers in the instruction's numerical requirements to construct comparison samples

Write a short essay on future urban planning, highlighting the importance of sustainable development and green mobility. At the same time, explore how to effectively use existing resources to reduce the impact on the environment. Be sure to include at least three innovative planning strategies and provide examples or case studies in the text.

Write a short essay on future urban planning, highlighting the importance of sustainable development and green mobility. At the same time, explore how to effectively use existing resources to reduce the impact on the environment. Be sure to include at least six innovative planning strategies and provide examples or case studies in the text.
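The local and number comparisons can be generated mechanically once an instruction is stored as a task plus a list of constraints. The sketch below assumes that representation; `local_contrast` and `number_contrast` are hypothetical helper names, and producing the paired responses for each variant is left out.

```python
import re
from typing import Dict, List


def local_contrast(task: str, constraints: List[str]) -> List[str]:
    """One variant per constraint, each with exactly that constraint removed."""
    variants = []
    for index in range(len(constraints)):
        kept = constraints[:index] + constraints[index + 1:]
        variants.append(" ".join([task] + kept))
    return variants


def number_contrast(instruction: str, mapping: Dict[str, str]) -> str:
    """Swap the number words in the instruction according to the mapping."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, mapping)) + r")\b")
    return pattern.sub(lambda match: mapping[match.group(1)], instruction)


task = "Write a short essay on future urban planning."
constraints = [
    "Highlight the importance of sustainable development.",
    "Include at least three innovative planning strategies.",
    "Provide examples in the text.",
]

variants = local_contrast(task, constraints)
six_version = number_contrast(constraints[1], {"three": "six"})
```

Each variant differs from the full instruction by a single constraint, so the paired responses show the model exactly what that one constraint changes.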

4.2 Tabular Scene Model

Formula generation

Beyond recognizing basic formula requests ("find the sum of column A"), formula generation also understands professional terms from common domains. For example, when the user asks for the product with the highest working capital turnover ratio, the Hunyuan model draws on its knowledge that [working capital turnover ratio = sales / average working capital] and then calculates the ratio for each product.

In addition, the technical solution combines Chain of Thought (CoT) with Program of Thought (PoT) to address the unstable results caused by nested formulas.

Chain of Thought (CoT) is regarded as one of the most pioneering and influential prompt engineering techniques; it can improve the performance of large language models on reasoning and decision-making tasks.

CoT makes the model break its reasoning into intermediate steps. This resembles human cognition: a complex problem is broken down into smaller, more manageable pieces.

Program of Thought (PoT) is a distinct LLM reasoning method. Instead of only generating a natural language answer, the model writes an executable program that runs on an interpreter such as Python and produces the actual result.

PoT gives a clearer, more expressive, and better-grounded derivation of the answer, which improves accuracy and makes the answer easier to understand.
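For the working capital turnover question, a PoT-style output might look like the program below: the reasoning steps appear as comments (the CoT part) and the arithmetic is left to the interpreter. The table values and column names are made up for illustration, and the program is hand-written to show the shape of such output, not captured from a model.

```python
# Question: "Which product has the highest working capital turnover ratio?"
table = [
    {"product": "A", "sales": 1200.0, "avg_working_capital": 400.0},
    {"product": "B", "sales": 900.0, "avg_working_capital": 150.0},
    {"product": "C", "sales": 2000.0, "avg_working_capital": 1000.0},
]


def working_capital_turnover(row: dict) -> float:
    # Step 1 (domain knowledge): ratio = sales / average working capital
    return row["sales"] / row["avg_working_capital"]


# Step 2: compute the ratio for every product.
turnover = {row["product"]: working_capital_turnover(row) for row in table}

# Step 3: pick the product with the largest ratio.
best = max(turnover, key=turnover.get)
```

Splitting the nested computation into named steps avoids a single deeply nested formula, which is where unstable results tend to come from.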

Chart generation

The core of chart generation consists of 6 modules. Three of them, rejection, step-by-step rewriting, and code generation, are inference modules based on large models, and the models behind them have all been fine-tuned.

Specifically:

● The rejection model judges how relevant the user's question is to the table and rejects questions that are unrelated to the table or are not drawing requests

● The step-by-step rewriting model breaks the drawing task into multiple executable steps, depending on the table and the question

● The code generation model generates Python table visualization code from the drawing steps
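The three model-based modules can be chained as in the sketch below. Each function is a rule-based stub standing in for a fine-tuned model, the other three modules are not covered, and the generated matplotlib snippet assumes a hypothetical DataFrame named `df`; it is returned as text and not executed here.

```python
from typing import List, Optional


def should_reject(question: str, columns: List[str]) -> bool:
    # Stand-in for the rejection model: accept only drawing requests
    # that mention at least one column of the table.
    text = question.lower()
    wants_chart = any(word in text for word in ("chart", "plot", "draw"))
    mentions_column = any(column.lower() in text for column in columns)
    return not (wants_chart and mentions_column)


def rewrite_steps(question: str, columns: List[str]) -> List[str]:
    # Stand-in for the step-by-step rewriting model.
    mentioned = [c for c in columns if c.lower() in question.lower()]
    return [f"x={mentioned[0]}", f"y={mentioned[-1]}", "type=bar"]


def generate_code(steps: List[str]) -> str:
    # Stand-in for the code generation model: emits visualization code.
    params = dict(step.split("=") for step in steps)
    return (
        "import matplotlib.pyplot as plt\n"
        f"plt.{params['type']}(df['{params['x']}'], df['{params['y']}'])\n"
        "plt.savefig('chart.png')\n"
    )


def chart_pipeline(question: str, columns: List[str]) -> Optional[str]:
    if should_reject(question, columns):
        return None  # unrelated or non-drawing question
    return generate_code(rewrite_steps(question, columns))


columns = ["month", "revenue"]
rejected = chart_pipeline("What is the weather today?", columns)
code = chart_pipeline("Draw a bar chart of revenue by month", columns)
```

Rejecting early keeps unrelated questions away from the more expensive rewriting and code generation stages.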

Chapter 5: Summary

Drawing on the experience of implementing AI in Tencent Docs, here are some lessons for developing AI assistants:

What is difficult for people is also difficult for AI
If a program can do it, don't make AI do it
Apply AI to solve content-related problems, and apply engineering to solve form or style problems