
Original | How multimodal large models can help enterprises in digital transformation

Author: Tsinghua Management Review

Lead

One important capability of multimodal large models is that they can bridge the cognitive gap between business and technology, align the understanding of business staff and technical staff, and thereby lower the difficulty of digital transformation. This article focuses on how multimodal large models can help enterprises achieve digital transformation.

Text / Li Gaoyong, Liu Lu

Artificial intelligence (AI) large models are AI models that use deep learning to train on large-scale data and learn the mapping between inputs and outputs. If the training data is text, such a model is a large language model (LLM). ChatGPT, released by OpenAI in 2022, is a large language model that can accurately understand natural language and produce fluent, coherent text. With the rapid growth of computing power and intensifying competition among model vendors, training data has expanded to other types of data, or modalities, such as images and video, and models have learned to reason jointly across heterogeneous modalities; large models have thus evolved from single-modal to multimodal. In March 2023, OpenAI's GPT-4 upgraded ChatGPT into a multimodal large model. Google's Gemini is also a multimodal large model, able to process five types of data: text, images, audio, video, and code.

The advent of large language models has raised expectations for using AI in tasks such as conversation and question answering, and practical applications have indeed shown great potential to transform how people live and work. For example, a dialogue system built on a large language model can interact with users in natural language, understand their intent, and generate meaningful responses; such systems are widely used in customer service, intelligent assistants, and other fields. Large language models are also used in search engines to improve the accuracy and relevance of results, making it easier for users to find the information they need. Microsoft, for instance, has integrated ChatGPT into its search engine, blending the traditional link-centric search model with the new AI model. The new search engine better answers the complex, open-ended questions that traditional search engines handle poorly and presents consolidated results in an easier-to-understand form. When users search for a travel destination, for example, a search engine with an integrated large language model directly produces easy-to-follow travel tips, greatly improving the search experience.

Compared with single-modal large models such as large language models, multimodal large models can process different types of data, including atypical modalities such as 3D vision data, depth-sensor data, and LiDAR data in autonomous driving. This gives the AI more complete input, helps it understand the external environment better and more generally, and lets it reason jointly over multimodal data to adapt and respond to the environment, producing a more realistic and fluid human-computer interaction experience. Multimodal large models therefore open richer and deeper applications for AI.

At present, multimodal large models are applied mainly in medical diagnosis and behavior recognition. In medical diagnosis, they can combine imaging data (including CT and X-ray images), clinical data (physiological indicators from various medical instruments), and biochemical data to jointly infer a patient's physiological state and assist doctors in diagnosis. In human behavior recognition, they can understand human intentions more clearly and infer the purpose of behaviors more accurately by recognizing speech and body movements. In security inspection, a multimodal large model developed by a domestic company can recognize gestures and faces at the same time, enabling an intelligent electronic checkpoint; models can also identify emotions from the visual modality (facial expressions) and the audio modality (tone and pitch), that is, by "reading faces and tones."
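A minimal late-fusion sketch of the "reading faces and tones" idea: features from a visual encoder and an audio encoder are concatenated and classified jointly. All embeddings and weights below are random placeholders, not a trained model; a real system would obtain features from pretrained vision and speech encoders.

```python
# Toy late-fusion emotion classifier. Embeddings and weights are random
# placeholders standing in for pretrained encoder outputs and a trained head.
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

EMOTIONS = ["neutral", "happy", "angry", "sad"]

face_embedding = rng.normal(size=128)   # visual modality (facial expression)
audio_embedding = rng.normal(size=64)   # audio modality (tone and pitch)

# Late fusion: concatenate per-modality features, then classify jointly.
fused = np.concatenate([face_embedding, audio_embedding])

W = rng.normal(size=(len(EMOTIONS), fused.size)) * 0.01  # untrained weights
b = np.zeros(len(EMOTIONS))

probs = softmax(W @ fused + b)
print(dict(zip(EMOTIONS, probs.round(3))))
```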

One important capability of multimodal large models is bridging the cognitive gap between business and technology, aligning the understanding of business and technical staff, and lowering the difficulty of digital transformation. Combined with no-code and low-code technologies, multimodal large models can popularize technology, close the cognitive gap between business and technology, enable business-led digital transformation, and, together with AI agents, even automate parts of the transformation. This article focuses on how multimodal large models can help enterprises achieve digital transformation.

Why digital transformations fail

Implementing digital transformation, that is, using digital technology to thoroughly reshape the original business model, operating model, and production/service model to achieve a comprehensive upgrade of capabilities, is a primary strategic choice for enterprises seeking to adapt to a turbulent environment and gain new competitive advantages. However, digital transformation differs fundamentally from other kinds of organizational change: it is both disruptive and exogenous. Disruptive means that digital transformation is a thorough remaking of the enterprise. This is already well understood, so when implementing digital transformation most enterprises adopt a "top-leader responsibility system," with senior leadership directly involved, to reduce the negative effects of disruption.

Digital transformation is also exogenous: its driving force is not changes in the business itself but the rapid rise of digital technologies that are very different from the enterprise's own business, highly specialized, and costly to learn. The essence of digital transformation is to integrate business and digital technology into a new model of production and operation, yet there are large cognitive differences between business people and technical people owing to differences in job responsibilities, personal experience, and background. Business people struggle to understand technical knowledge, and technical people struggle to understand business knowledge, which hinders the convergence of business and digital technology. In a November 2022 Nash Squared survey of more than 2,100 technology leaders, 54% said the perception gap between business and technology had become a significant barrier to change.

Today the cognitive gap between business and technology is bridged mainly through training and learning. But business staff's time and energy are consumed by daily work, and cognitive inertia is also at play; for technical staff, as digital transformation deepens, increasingly personalized business needs become ever harder to understand, and simple training and traditional requirements analysis grow less effective. How to bridge the cognitive gap between technology and business has therefore become a problem that must be solved in digital transformation. The characteristics and advantages of multimodal large models offer a new approach to this problem.

How to bridge the cognitive gap in digital transformation

The basic capability of a large model is to understand information and reason from it to a sound conclusion, so a large model can act as a "translator" between different departments and become a bridge for knowledge transfer and sharing. In the context of digital transformation, large models can "translate" business knowledge into technical knowledge, or technical knowledge into business knowledge, thereby narrowing, spanning, or even eliminating the cognitive differences between technology and business and accelerating their integration.
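As a concrete illustration of the "translator" role, here is a minimal sketch that asks a large model to rewrite a business requirement as a technical specification, or the reverse. It assumes an OpenAI-compatible chat API; the model name, prompts, and function are illustrative, not a recommendation of any particular vendor or wording.

```python
# Minimal "business <-> technical" translator sketch, assuming an
# OpenAI-compatible chat API (pip install openai; OPENAI_API_KEY in env).
from openai import OpenAI

client = OpenAI()

def translate_requirement(text: str, direction: str = "to_technical") -> str:
    # Choose which vocabulary to translate into.
    instruction = (
        "Rewrite this business requirement as a precise technical specification "
        "(data inputs, processing steps, outputs) a developer can implement."
        if direction == "to_technical"
        else "Explain this technical design in plain business language, focusing "
             "on what it enables and what it costs, with no jargon."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model name
        messages=[
            {"role": "system", "content": instruction},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content

print(translate_requirement(
    "Finance wants month-end reconciliation to finish before 9am "
    "instead of taking two days."
))
```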

First, a multimodal large model can develop personalized learning plans for staff in different departments, improving the effect of training and narrowing the cognitive gap between business and technology. For example, Thomson Reuters launched a GenAI education program, an important component of which is using multimodal large models to produce internal learning materials tailored to each trainee's individual characteristics and job responsibilities. In this program, the large model develops different learning content for the same topic, for example including content on vector embeddings in the materials for system developers and content on layout adjustment in the materials for development engineers, so that technical staff gain a deeper understanding of business needs. This personalized approach to learning promotes mutual understanding between technology and business, serving as a "lubricant" of transformation.

Multimodal large models also provide a better human-computer interface and supply tools for scenario-based learning and training. They can convert textual knowledge into pictures and videos that better match people's cognitive habits, and can decompose complex scenarios to make learning easier. For example, the IT services company Ensono must analyze and understand a client's workflows when providing transformation services, which traditionally is time-consuming, laborious, and prone to misunderstanding. The company introduced a multimodal large model service to decompose and visualize clients' business processes, helping technical staff understand the business.

Second, multimodal large models help business and technology bridge cognitive differences directly and promote the implementation of digital transformation. Compared with single-modal models such as large language models, multimodal models can learn in greater depth and mine deeper knowledge, pushing enterprise digital transformation to a deeper level. For example, multimodal large models can process sensor data, which reflects industrial mechanisms and process principles that traditional requirements analysis struggles to capture.

Traditional discriminative AI classifies and decides by analyzing the relationship between input data and output labels; it does not model the data-generating process, so its prediction process is opaque and its results hard to interpret, which reduces trust in AI and hinders its use in organizational decision-making. Large models are generative AI: users can question their output, even though the internal analysis remains invisible. Combined with long- and short-term memory, large models can reproduce the decision-making process and give reasonable explanations of their output, improving credibility and helping digitalize non-programmed decision-making.
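A short sketch of what "questioning the output" can look like in practice: the conversation history serves as the model's short-term memory, so a follow-up prompt can ask it to surface the assumptions behind an earlier recommendation. The API usage assumes an OpenAI-compatible client; the prompts and model name are illustrative.

```python
# Questioning a model's own output, with the message history as memory.
from openai import OpenAI

client = OpenAI()
history = [
    {"role": "system", "content": "You are a decision-support assistant. "
     "Give a recommendation, then answer follow-ups about your reasoning."},
    {"role": "user", "content": "Given falling repeat-purchase rates, should "
     "we prioritize a loyalty program or price cuts this quarter?"},
]

first = client.chat.completions.create(model="gpt-4o", messages=history)
history.append({"role": "assistant", "content": first.choices[0].message.content})

# The follow-up interrogates the earlier output, making the reasoning auditable.
history.append({"role": "user", "content": "Which assumptions drove that "
                "recommendation, and what evidence would overturn it?"})
explanation = client.chat.completions.create(model="gpt-4o", messages=history)
print(explanation.choices[0].message.content)
```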

Third, multimodal large models can eliminate the cognitive differences between technology and business altogether, enabling digital transformation led entirely by the business side. Multimodal large models are moving toward standardization and modularization, realizing Model-as-a-Service (MaaS). With AI agents, digital transformation can be automated without the intervention of technical staff. For example, Microsoft has integrated its AI assistant Copilot into Windows 11 and Microsoft 365, helping users who are not proficient in system operation achieve professional-level human-computer interaction and system operation without technical support.

Large models based on generative AI can generate new data samples and learn from them, achieving self-learning and self-adaptation: they can learn from their own experience and improve their performance accordingly. Traditional IT systems, by contrast, have rigid architectures and strong specialization; once such a system is deployed, the user's business processes are locked in, and when the business changes the system is hard to adjust in step, producing "IT lock-in." The self-evolving ability of large models, combined with low-code and no-code technology, lets them independently learn and jointly handle the logistics, capital flows, information flows, responsibility flows, risk flows, and so on in a changed business process and derive a more reasonable new process. Business staff then adapt the no-code and low-code platforms to the new process; large models can even adjust the system directly, enabling digital transformation without the intervention of technical staff.

How multimodal large models contribute to digital transformation

Multimodal large models are still in the early stages of development, but they have already shown great potential. Starting from large models' ability to eliminate the cognitive differences between business and technology, and drawing on current leading applications, the following analyzes how multimodal large models can help digital transformation in R&D, production control, customer service, and product innovation.

R&D

Enabled by digital technology, R&D has evolved from experimental verification to modeling and simulation: large datasets accumulated in production and operations are processed by simulation software, models are built in batches according to specific rules, and "digital twins" are used to trial-produce 3D models to verify feasibility. Today's large datasets must still be cleaned and customized for the simulation software, which severely limits the depth and scope of digital R&D. Multimodal large models can bring consumer-side data (such as online reviews of similar products) directly into the R&D process and jointly process consumer-side and production-side R&D data to develop products that better meet market demand.

The development of a beverage in which one of this article's authors participated demonstrates the potential of multimodal large models in R&D. The mainstream approach to beverage development today is "ingredientomics": after analyzing the flavor substances in the raw materials, their ratios are varied to obtain the product that best matches the market's taste preferences. This process faces two problems. First, the product requires fermentation, and the fermentation process is not fully controllable, so flavor substances cannot be directly and precisely controlled and must be measured and regulated indirectly by other means. Second, the final taste is judged by professional tasters based on subjective impressions, which cannot be directly quantified. To address both problems, the R&D team used a multimodal large model to analyze production process data and tried to establish the link between flavor substances and the production environment, while using the model's natural language understanding to quantify tasting results and consumers' evaluations of the product's taste, ultimately seeking the best formula and the best production process. Although progress has not been fast, the project shows the great possibilities of multimodal large models in R&D.
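The project's two-step logic can be sketched with toy data: fit a map from controllable process parameters to a measured flavor compound, then search process settings against a taste-score function that, in the real workflow, would come from an LLM quantifying tasters' and consumers' free-text comments. Every number and the scoring stub below are invented for illustration.

```python
# Toy two-step R&D sketch: process settings -> flavor compound -> taste score.
import numpy as np

rng = np.random.default_rng(7)

# Step 1: synthetic process data. Columns: fermentation temperature (C),
# duration (days), sugar content (%). Target: flavor compound level (mg/L).
X = rng.uniform([18, 5, 8], [32, 20, 16], size=(200, 3))
true_w = np.array([0.4, 0.9, 1.5])
compound = X @ true_w + rng.normal(0, 1.0, size=200)

# Fit a linear map from process settings to the measured compound.
Xb = np.column_stack([X, np.ones(len(X))])
w, *_ = np.linalg.lstsq(Xb, compound, rcond=None)

# Step 2 (stub): taste score as a function of the compound level. In the real
# workflow an LLM would turn free-text tasting notes into this numeric score.
def taste_score(c):
    return -((c - 30.0) ** 2) / 50.0 + 10.0  # hypothetical optimum near 30 mg/L

# Search candidate settings for the best predicted taste.
candidates = rng.uniform([18, 5, 8], [32, 20, 16], size=(5000, 3))
pred = np.column_stack([candidates, np.ones(len(candidates))]) @ w
best = candidates[np.argmax(taste_score(pred))]
print("best settings (temp C, days, sugar %):", best.round(2))
```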

Production control

Production control is the most complex part of manufacturing and the core of its digital transformation. In this field, however, data is buried deep, data types are diverse, data correlations are wide-ranging, and breakpoints in the data are numerous, so digital transformation service providers, and even the operators and business staff of manufacturing enterprises themselves, struggle to discover the needs of transformation.

Technologies such as multimodal large models and the Internet of Things can create real-time, ubiquitous connections within manufacturing enterprises and across upstream and downstream industries, bridging breakpoints in enterprise data flows and helping data move efficiently. At the same time, multimodal large models can reason jointly over the data, adjust unreasonable business processes, improve collaboration efficiency, and help manufacturing move toward intelligent collaborative production. For example, in April 2023 Siemens and Microsoft announced a collaboration to develop a code-generation tool for PLCs (programmable logic controllers) that generates code automatically from business scenarios, automating operations and control and advancing automation technology based on generative AI.
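In the same spirit, though this is emphatically not Siemens' or Microsoft's actual tool, scenario-to-PLC-code generation can be sketched as a single call to a generative model asking for IEC 61131-3 Structured Text from a plain-language control scenario. The API, model name, and scenario are illustrative; any generated code would be reviewed by an automation engineer before deployment.

```python
# Sketch: plain-language control scenario -> IEC 61131-3 Structured Text,
# via an OpenAI-compatible chat API. Illustrative only, not a vendor tool.
from openai import OpenAI

client = OpenAI()

scenario = (
    "When the conveyor start button is pressed and the safety gate is closed, "
    "run the conveyor motor; stop it if the emergency stop is pressed or the "
    "gate opens. Add a 5-second startup warning horn before the motor starts."
)

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model name
    messages=[
        {"role": "system", "content": "Generate IEC 61131-3 Structured Text "
         "for the described control scenario. Output code only, with comments."},
        {"role": "user", "content": scenario},
    ],
)
print(response.choices[0].message.content)  # to be reviewed before deployment
```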

Customer service

Customer service is the most mature scenario for commercializing large models. Customer service is in fact programmed decision-making: most customer requests have procedural solutions. But traditional AI struggles to understand customer intent in natural language and cannot give smooth, natural answers, which limits the automation and intelligence of customer service.

The ability of large language models to understand and produce natural language solves these long-standing problems and advances customer service into the intelligent era. A large language model can accurately understand a customer's natural language and determine the user's intent, then select a solution according to preset rules and present the user with a clear, understandable answer in natural language. With long- and short-term memory, the large model can also maintain the context of a multi-turn conversation, track the dialogue state, and generate dialogue strategies, resolving customer demands in a way that better matches human behavior patterns.
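The pattern described here, understand intent, pick a preset solution, and keep multi-turn context, can be sketched in a few lines. The keyword matcher below stands in for a large language model's intent understanding; the rules, replies, and order number are invented for illustration.

```python
# Toy customer-service loop: intent detection + preset rules + dialogue state.
RULES = {
    "refund": "I can start a refund. Could you confirm the order number?",
    "delivery": "Let me check the shipment status for you.",
    "unknown": "Could you tell me a bit more about the issue?",
}

def detect_intent(utterance: str) -> str:
    # Stand-in for an LLM's intent understanding.
    text = utterance.lower()
    if "refund" in text or "money back" in text:
        return "refund"
    if "delivery" in text or "where is my order" in text:
        return "delivery"
    return "unknown"

history = []  # multi-turn context: (speaker, text, intent)

def respond(utterance: str) -> str:
    intent = detect_intent(utterance)
    # Context tracking: a follow-up inside an open refund conversation stays
    # in the refund flow instead of falling back to "unknown".
    if intent == "unknown" and any(i == "refund" for _, _, i in history):
        intent = "refund"
        reply = "Thanks, I have the order number and will process the refund."
    else:
        reply = RULES[intent]
    history.append(("user", utterance, intent))
    history.append(("bot", reply, intent))
    return reply

print(respond("I want my money back"))  # -> asks for the order number
print(respond("It's order 10042"))      # context keeps this in the refund flow
```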

Product innovation

So far, the impact of digital transformation on product innovation has been to turn physical products into services: wearables and smart-home products at the individual level, and remote maintenance and data services for industrial equipment at the enterprise level. The human-computer interaction capability of large models further advances the intelligence of end products.

End products that integrate multimodal large models can better infer the intentions of users or operators, improving the interaction between products and users and opening a new direction for the servitization of products. For example, integrating multimodal large models into humanoid robots, which through training generate control strategies for the situation at hand and manipulate the robot to "use human tools to realize human capabilities," has become the key to opening the era of "embodied intelligence."

It must be acknowledged that AI large models, especially multimodal ones, are still in the early stages of development and remain controversial. The road from hypothesis to verification, from theory to practice, and from pilot to adoption is long, and people from all walks of life will need to work together to advance the revolutionary changes these models promise for the economy and society, including enterprise management.

About the Author | Li Gaoyong: Associate Professor, Shandong University of Finance and Economics;

Liu Lu: Ph.D. candidate, Renmin University of China.

Editor-in-Charge | Liu Yongxuan ([email protected])
