
Development status and prospect of big data knowledge engineering丨Engineering Science in China

Author: Journal of the Chinese Academy of Engineering

This article is selected from China Engineering Science, the journal of the Chinese Academy of Engineering, Issue 2, 2023.

Authors: Zheng Qinghua, Liu Huan, Gong Tieliang, Zhang Lingling, Liu Jun

Source: Development status and prospect of big data knowledge engineering [J]. China Engineering Science, 2023, 25(2): 208-220.

Editor's note

Knowledge engineering studies the machine representation and computation of human knowledge and is an important branch of artificial intelligence. Big data knowledge engineering is the "infrastructure" of artificial intelligence: it mines fragmented knowledge from multi-source big data and fuses it into a knowledge base / knowledge graph that humans can understand, machines can represent, and reasoning can operate over, addressing the common need of many industries and fields to move from informatization to intelligence when solving practical engineering problems.

Issue 2, 2023 of China Engineering Science, a journal of the Chinese Academy of Engineering, published the article "Development Status and Prospect of Big Data Knowledge Engineering" by the research team of Professor Zheng Qinghua at Xi'an Jiaotong University. The paper expounds the background and conceptual connotation of big data knowledge engineering and proposes a research framework of "data-to-knowledge transformation, knowledge systematization, and knowledge-based reasoning". It reviews key technologies of big data knowledge engineering, such as knowledge acquisition and fusion, knowledge representation, and knowledge reasoning, as well as engineering applications in typical scenarios such as smart education, tax risk management and control, and smart healthcare. It summarizes the challenges facing big data knowledge engineering and identifies future research directions, including knowledge acquisition from complex big data, knowledge + data hybrid learning, and brain-inspired knowledge encoding and memory. The paper suggests: guide the cross-integration of multiple disciplines, set up major and key R&D projects, and promote basic theoretical and technical research on big data knowledge engineering; strengthen exchanges and cooperation between enterprises and research institutions, turn cutting-edge research results into application demonstrations, and establish an industry standard system for big data knowledge engineering; and, guided by major application needs, explore school-enterprise collaborative education models and accelerate the application of big data knowledge engineering technology in important industries.


I. Preface

After more than 40 years of development, China's informatization construction has accumulated massive data in education, government affairs, finance, medical care, and other fields. How to further transform these data into domain knowledge, feed that knowledge back into the development of each field, and solve practical engineering problems has gradually become a common demand across fields. Knowledge engineering studies the machine representation and computation of human knowledge and is an important branch of artificial intelligence; it aims to input human or expert knowledge into computers and establish reasoning mechanisms so that machines can also possess knowledge, compute and reason with it, and solve practical problems accordingly. The development of knowledge engineering in China has gone through the stage of traditional knowledge engineering, represented by expert systems, and the stage of modern knowledge engineering, represented by mainstream deep learning technology. Both have significantly promoted the development of various fields, but both still have limitations in solving practical engineering problems. For example, after rapid development in the 1970s and 1980s, traditional knowledge engineering entered a "cold winter" in the 1990s, mainly because knowledge acquisition relied primarily on domain experts and therefore faced high labor costs, the limits of expert experience, and an inability to handle dynamic, complex engineering problems. In the modern knowledge engineering stage, deep learning models (especially large-scale pre-trained models) have made significant progress in natural language processing and computer vision, but such data-driven models face challenges such as heavy data dependence and excessive computing power and energy consumption; they struggle with the high-order, multi-hop reasoning tasks found in practical engineering problems and have difficulty meeting the interpretability requirements of critical fields such as medicine and information security.

Big data knowledge engineering can mine fragmented knowledge from multi-source big data and fuse it into a knowledge base / knowledge graph that humans can understand, machines can represent, and reasoning can operate over, significantly alleviating the limitations of the above technologies and providing support for solving practical engineering problems. Unlike traditional knowledge engineering, the knowledge acquisition process of big data knowledge engineering is machine-driven with manual assistance, which effectively alleviates the "knowledge acquisition" bottleneck of traditional knowledge engineering. At the same time, the symbolic knowledge produced by big data knowledge engineering helps compensate for the limitations of existing deep learning; integrating the two is expected to realize "symbolic + neural" reasoning, which can simultaneously handle the reasoning tasks of the intuitive system (System 1) and the logical analysis system (System 2) that commonly arise in practical engineering problems.

To promote the further development of big data knowledge engineering, this paper reviews its development status, summarizes the challenges and future research directions in the field, and puts forward countermeasures and suggestions for the high-quality development of big data knowledge engineering technology and industry in China, so as to support its application and serve China's economic and social development.

II. Development status of big data knowledge engineering

(1) Overview of big data knowledge engineering

The data-information-knowledge-wisdom (DIKW) model depicts the hierarchical relationship and value-adding process from data and information to knowledge and wisdom, from bottom to top, and is widely used in the field of knowledge management. Accordingly, this paper proposes a research framework for big data knowledge engineering (see Figure 1), which includes three stages: data-to-knowledge transformation, knowledge systematization, and knowledge-based reasoning.


Figure 1 Research framework of big data knowledge engineering

Data-to-knowledge transformation aims to add value to data. First, fragmented knowledge that can be used for problem solving, including text fragments, images, and logical rules, is mined from massive multi-source big data. Second, through redundancy removal and disambiguation, the fragmented knowledge undergoes a change from quantity to quality. Finally, representation learning is used to map fragmented knowledge of different modalities into a common low-dimensional dense space, which supports cross-modal interoperation in subsequent inference computations. Compared with the input data, fragmented knowledge is not only smaller in scale but also transformed from low-quality to trusted and from unstructured to structured, thereby improving the value density of the data.

Knowledge systematization is the process of fusing cross-domain fragmented knowledge into a knowledge system oriented to actual engineering problems, achieving knowledge value-adding. First, semantic relationships such as causal and precedence relations between pieces of fragmented knowledge are mined. For example, in the computer science domain, the fragmented knowledge related to "linear list" and that related to "stack" have a precedence (prerequisite) relationship: the former must be learned before the latter. Second, through the nonlinear fusion of fragmented knowledge and semantic relationships, new knowledge that differs from the existing fragments is generated, realizing "the whole is greater than the sum of its parts".
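
To make the precedence relation concrete, here is a minimal sketch in Python (with invented topic names extending the "linear list" / "stack" example above) that stores prerequisite relations as directed edges and derives one feasible learning order by topological sorting; it illustrates the idea only and is not the algorithm used in the paper.

```python
from collections import defaultdict, deque

# Hypothetical prerequisite ("must learn before") relations among knowledge topics;
# topic names are illustrative only.
prerequisites = [
    ("linear list", "stack"),
    ("linear list", "queue"),
    ("stack", "expression evaluation"),
    ("queue", "breadth-first search"),
]

def learning_order(edges):
    """Return one valid study order (topological sort via Kahn's algorithm)."""
    graph, indegree, nodes = defaultdict(list), defaultdict(int), set()
    for pre, post in edges:
        graph[pre].append(post)
        indegree[post] += 1
        nodes.update((pre, post))
    ready = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while ready:
        topic = ready.popleft()
        order.append(topic)
        for nxt in graph[topic]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    return order

print(learning_order(prerequisites))
# e.g. ['linear list', 'queue', 'stack', 'breadth-first search', 'expression evaluation']
```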

Knowledge-based reasoning is the process of finding the reasoning path required to solve an engineering problem on the basis of the knowledge system produced by knowledge fusion. Traditional symbolic systems are good at deterministic reasoning, readily capture explicit knowledge, and are composable and interpretable, but they suffer from combinatorial explosion and have limitations in uncertain reasoning and in representing tacit knowledge. Machine inference models based on deep learning have strong representation and learning capabilities, but their generalization ability is limited and most are black-box models with interpretability problems. It is therefore difficult to meet complex practical inference needs by relying only on traditional symbolic systems or deep learning models; symbolic inference and deep learning need to be fused. In addition, the reasoning process involves many optimization goals, including accuracy, timeliness, and interpretability, each of which can be decomposed into multiple sub-goals, so machine reasoning in practical engineering problems is a multi-step, multi-objective combinatorial optimization problem.

(2) Key technologies of big data knowledge engineering

Based on this research framework, this paper presents the technology system of big data knowledge engineering (see Figure 2). The system includes core technologies such as knowledge acquisition and fusion, knowledge representation, and knowledge reasoning. Specifically, knowledge acquisition and fusion includes knowledge graph construction, logical formula extraction, and knowledge fusion based on the knowledge forest; knowledge representation includes symbolic representation and distributed representation; and knowledge reasoning includes knowledge retrieval reasoning, automatic question answering reasoning, memory-based reasoning, and explainable reasoning. Among them, knowledge acquisition and fusion technologies and knowledge representation technologies address data-to-knowledge transformation and knowledge systematization, while knowledge reasoning technologies address knowledge-based reasoning.


Figure 2 Technical system of big data knowledge engineering

1. Knowledge acquisition and integration

Knowledge acquisition is the process of extracting knowledge from one or more data sources and forming a knowledge base; it is the premise and foundation of subsequent knowledge representation and knowledge reasoning. Knowledge graphs and logical formulas are the two mainstream forms of knowledge base organization.

(1) The knowledge graph was originally proposed by Google as a technology for optimizing search engines by describing real-world concepts and their interrelationships. A knowledge graph represents knowledge using Resource Description Framework (RDF) triples and property graphs in order to identify, discover, and infer complex relationships between things and concepts from data. Knowledge graph construction involves entity extraction, relation extraction, event extraction, and so on. Entity extraction detects named entities in text and classifies them into predefined categories such as person, organization, place, and time; relation extraction identifies relationships between two or more entities from the textual context, such as "born in", "capital of", and "spouse of"; event extraction identifies information about events in text and presents it in structured form. In addition, a high-quality knowledge graph requires further steps such as entity fusion and relational reasoning, and it can then serve fields such as knowledge question answering, language understanding, and decision analysis.
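
As a minimal illustration of storing extraction results as triples (not the actual pipeline described above), the Python sketch below hand-writes the output an entity and relation extractor might produce for one invented sentence, adds it to a triple set, and runs a toy lookup over it.

```python
sentence = "Marie Curie was born in Warsaw."

# Hand-written stand-ins for the output a real extractor might produce.
entities = [("Marie Curie", "Person"), ("Warsaw", "Place")]
relations = [("Marie Curie", "born_in", "Warsaw")]

knowledge_graph = set()

# Relation triples: subject - predicate - object.
for subj, pred, obj in relations:
    knowledge_graph.add((subj, pred, obj))

# Entity types stored as triples too, so type constraints can support later reasoning.
for name, etype in entities:
    knowledge_graph.add((name, "instance_of", etype))

# Toy query: who was born in Warsaw?
print([s for (s, p, o) in knowledge_graph if p == "born_in" and o == "Warsaw"])
# ['Marie Curie']
```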

(2) A logical formula is a formal expression that describes logical relationships among objective things through predicates, quantifiers, operators, and arguments. Logical formulas include propositional, first-order, and higher-order logic formulas. Logical formula extraction aims to summarize large amounts of knowledge by modifying and extending logical expressions. Compared with a knowledge graph, logical formulas are a higher-level summary and induction of knowledge and offer better interpretability. Extracting more general first-order logic formulas on top of a knowledge graph is a research hotspot. For example, from a statistical perspective, a candidate set of first-order logic formulas is first generated, and formulas that meet the requirements are then filtered according to specific evaluation functions. In addition, the confidence and structural information of first-order logic formulas can be learned simultaneously by constructing a differentiable network model, giving the extracted formulas good generalization ability.
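
A minimal sketch of the "statistical" route mentioned above: over an invented toy triple set, a single candidate first-order rule is scored with support and confidence, the kind of evaluation function used to filter candidate formulas. The rule, facts, and numbers are illustrative only.

```python
# Toy triple store; all facts are invented for illustration.
triples = {
    ("Alice", "born_in", "Paris"), ("Paris", "city_of", "France"),
    ("Alice", "citizen_of", "France"),
    ("Bob", "born_in", "Lyon"), ("Lyon", "city_of", "France"),
    ("Carol", "born_in", "Rome"), ("Rome", "city_of", "Italy"),
    ("Carol", "citizen_of", "Italy"),
}

def score_rule(triples):
    """Score the candidate rule:
       born_in(x, y) AND city_of(y, z)  =>  citizen_of(x, z)
       support    = number of (x, z) pairs where body and head both hold
       confidence = support / number of (x, z) pairs where the body holds"""
    body_pairs = {
        (x, z)
        for (x, r1, y) in triples if r1 == "born_in"
        for (y2, r2, z) in triples if r2 == "city_of" and y2 == y
    }
    support = sum((x, "citizen_of", z) in triples for (x, z) in body_pairs)
    confidence = support / len(body_pairs) if body_pairs else 0.0
    return support, confidence

print(score_rule(triples))
# (2, 0.666...): Bob's citizenship is predicted by the rule but not yet observed.
```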

On the basis of knowledge acquisition, the fundamental problem of knowledge fusion is how to combine data from different sources, remove redundant knowledge, and achieve an optimal integration of knowledge. To address this problem, the research team at Xi'an Jiaotong University proposed an innovative knowledge fusion model, the knowledge forest. As shown in Figure 3, the knowledge forest combines "faceted aggregation" with "navigated learning" to form a knowledge hierarchy composed of topic facet trees (the tree structures on the right) and learning dependencies (the paths in the forest on the left). Constructing a knowledge forest involves three steps: topic facet tree generation, fragmented knowledge assembly, and cognitive relationship mining. Topic facet tree generation mines the knowledge topics with rich content in a domain and their finer-grained facet structure: topics and facet sets are first generated through a joint topic-facet learning algorithm, and the faceted hierarchy of each topic is then mined based on motif structures. Fragmented knowledge assembly learns the mapping between fragmented knowledge such as text and images and the topic facet trees, forming instantiated topic facet trees that present learners with more comprehensive topic content combining text and pictures. This construction process can be realized with natural language processing, computer vision, cross-media mining, and other technologies. A learning dependency expresses that the prerequisite knowledge of a topic must be mastered before the topic itself is learned; such relationships can be mined by analyzing the distribution and semantic characteristics of knowledge topics and the locality and asymmetry of cognitive relationships. The knowledge forest is an innovative form of knowledge base that can support reasoning tasks such as knowledge retrieval, intelligent question answering, and question generation, and it has application prospects in education, taxation, healthcare, and other fields.
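
The two layers named above can be pictured with a small data-structure sketch: hypothetical topics each carry a topic facet tree onto which fragmented knowledge is assembled, and learning dependencies link the topics. This is an illustrative sketch of the structure only, not the published construction algorithms.

```python
from dataclasses import dataclass, field

@dataclass
class FacetNode:
    """One facet of a knowledge topic; fragmented knowledge (text snippets,
    image URIs, ...) is assembled onto the facet it describes."""
    name: str
    fragments: list = field(default_factory=list)
    children: list = field(default_factory=list)

@dataclass
class Topic:
    name: str
    facet_tree: FacetNode

# Hypothetical topics and facets, for illustration only.
stack = Topic("stack", FacetNode("stack", children=[
    FacetNode("definition", fragments=["A stack is a last-in-first-out list ..."]),
    FacetNode("operations", children=[FacetNode("push"), FacetNode("pop")]),
    FacetNode("applications", fragments=["expression evaluation", "call stacks"]),
]))
linear_list = Topic("linear list", FacetNode("linear list", children=[
    FacetNode("definition"), FacetNode("storage structures"),
]))

# Learning dependencies between topics form the "paths in the forest":
# here, "linear list" must be mastered before "stack".
learning_dependencies = [("linear list", "stack")]
```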


Figure 3 Knowledge forest

Overall, current knowledge acquisition and fusion methods have achieved remarkable results, but most are closed-domain methods with pre-defined sets of knowledge types, which makes it difficult to meet the need for continuously deriving and updating new knowledge in practical applications. How to achieve open-domain knowledge acquisition and fusion therefore remains a challenge for future research.

2. Knowledge representation

Traditional knowledge representation methods based on symbolic logic, including production rules, Horn logic, and script theory, can represent explicit, discrete knowledge. However, their computation and reasoning capabilities are weak, and they have difficulty mining the semantic relationships between complex knowledge entities. In contrast, distributed knowledge representation transforms knowledge into vector forms that are convenient for computers to store and compute, which better supports subsequent complex reasoning and is key to building efficient artificial intelligence systems.

Distributed knowledge representation has progressed from shallow to deep representation. In the 20th century, researchers mainly focused on shallow representation methods, including principal component analysis, linear discriminant analysis, manifold learning, and multilayer perceptrons. At the beginning of the 21st century, greedy layer-wise pre-training and parameter fine-tuning of neural networks set off a boom in deep knowledge representation. Compared with shallow representation, deep representation methods have significantly more hidden layers and parameters, so they can learn the hidden regularities of big data more accurately and thus depict the semantic and structural characteristics of knowledge more precisely. In recent years, improvements in computer hardware have further promoted the development of knowledge representation methods based on deep networks.

Distributed knowledge representation is mainly divided into two categories: knowledge graph representation and logical rule representation.

(1) Representation learning for knowledge graphs aims to embed the entities and relations of a knowledge graph into a continuous low-dimensional vector space and is mainly divided into transductive learning and inductive learning. Transductive learning mines the feature information of entities and relations within the knowledge graph and uses it to complete hidden links. Represented by TransE and RESCAL, transductive methods embed the known entities, relations, or entire triples of the knowledge graph and design a reasonable scoring function to measure the plausibility of triple embeddings. Inductive learning mainly extracts latent representations of entities and relations outside the current knowledge graph, which requires models with stronger generalization ability. Taking the GraIL method as an example, entity-independent topological information of the local subgraph around a triple is mined for latent feature extraction; this is the main approach to inductive learning on knowledge graphs.
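
For concreteness, the sketch below shows the TransE idea with NumPy: each entity and relation is a vector, and the plausibility of a triple (h, r, t) is scored by the distance ||h + r - t||, where lower is more plausible. The embeddings here are random stand-ins for trained ones, and the entity names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 50

# Toy embedding tables for entities and relations (randomly initialized here;
# in practice they are trained so that h + r is close to t for observed triples).
entities = {name: rng.normal(size=dim) for name in ["Paris", "France", "Rome", "Italy"]}
relations = {"capital_of": rng.normal(size=dim)}

def transe_score(h, r, t, norm=1):
    """TransE plausibility score: smaller distance ||h + r - t|| means a more
    plausible triple. norm=1 or 2 selects the L1 or L2 distance."""
    return np.linalg.norm(entities[h] + relations[r] - entities[t], ord=norm)

# After training, the observed triple should score better (lower) than a corrupted one.
print(transe_score("Paris", "capital_of", "France"))
print(transe_score("Paris", "capital_of", "Italy"))
```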

(2) Logical rule representation aims to map discrete, symbolic logical formulas into a low-dimensional continuous space and is one of the bridges between symbolism and connectionism. Although some information may be lost when representing logical formulas, input samples generally contain noise, so embedding formulas in a low-dimensional space can filter out part of that noise, improve the generalization ability of the model, and effectively reduce the storage and computation costs of logical formulas. Logical formula representation learning first converts logical formulas into corresponding syntactic structures and then embeds them with neural network models. According to the syntactic structures and representation networks used, research on logical rule representation learning can be divided into sequence-based, tree-based, and graph-based methods. The sequence-based method treats a logical rule as a simple symbol sequence, which is then embedded by a neural network; the tree-based method converts logical rules into tree structures with syntactic parsing tools and then embeds them; the graph-based method often uses graph convolutional neural networks to enhance the information interaction between the nodes of a logical rule in order to capture deeper structural information.
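
A minimal sketch of the sequence-based route, assuming PyTorch is available: an invented first-order formula is treated as a plain token sequence, mapped to indices, and encoded with an embedding layer plus an LSTM whose final hidden state serves as the formula's distributed representation.

```python
import torch
import torch.nn as nn

# A toy first-order formula treated as a plain symbol sequence (sequence-based route).
formula = ["forall", "x", "(", "born_in", "(", "x", ",", "y", ")",
           "->", "person", "(", "x", ")", ")"]

vocab = {tok: i for i, tok in enumerate(sorted(set(formula)))}
ids = torch.tensor([[vocab[tok] for tok in formula]])      # shape (batch=1, seq_len)

embed = nn.Embedding(num_embeddings=len(vocab), embedding_dim=32)
encoder = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)

outputs, (h_n, c_n) = encoder(embed(ids))
formula_vector = h_n[-1]        # (1, 64): a distributed representation of the formula
print(formula_vector.shape)
```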

In recent years, deep learning has made major breakthroughs in deep representation learning of knowledge, but challenges such as high training cost, weak interpretability, and difficult dynamic evolution remain and require more in-depth research in the future.

3. Knowledge reasoning

Knowledge reasoning is the process of inferring new knowledge or identifying erroneous knowledge on the basis of existing knowledge. In big data knowledge engineering, knowledge reasoning takes the results of knowledge representation learning as input and, using technologies such as computer vision, natural language processing, and cross-modal learning, outputs reasoning results. Typical knowledge reasoning techniques include knowledge retrieval reasoning, automatic question answering reasoning, memory-based reasoning, and explainable reasoning.

Knowledge retrieval reasoning is the process of retrieving knowledge from a knowledge base on the basis of knowledge organization. Given a set of queries, knowledge retrieval technology needs to parse and understand the questions and then complete logical operations such as querying, reasoning, and comparison over the knowledge base. Knowledge retrieval methods grew out of information retrieval and have gone through the stages of information retrieval, retrieval over specific knowledge bases, and knowledge graph retrieval. As knowledge bases continue to grow, future knowledge retrieval will face problems such as highly complex knowledge graph schemas, highly complex retrieval algorithms, and weak generalization.
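
A minimal sketch of knowledge graph retrieval, assuming the rdflib library is available: a few invented triples are loaded into a graph, and a one-hop SPARQL join answers the query. The namespace, facts, and query are illustrative only.

```python
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/")   # illustrative namespace
g = Graph()
g.add((EX.Paris, EX.capital_of, EX.France))
g.add((EX.Rome, EX.capital_of, EX.Italy))
g.add((EX.France, EX.member_of, EX.EU))

# "Which cities are capitals of EU member states?" is a one-hop join over the graph.
query = """
PREFIX ex: <http://example.org/>
SELECT ?city WHERE {
    ?city    ex:capital_of ?country .
    ?country ex:member_of  ex:EU .
}
"""
for row in g.query(query):
    print(row.city)      # http://example.org/Paris
```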

Automatic question answering queries and reasons over existing resources according to users' natural language questions and finally returns accurate answers. According to how resources in the reasoning space are organized, automatic question answering can be divided into natural language question answering, cross-modal question answering, and visual question answering. For example, textbook question answering is an intelligent question answering task for smart education and a cross-modal question answering reasoning task in the education field. As the dual problem of automatic question answering, question generation can provide necessary or additional data for question answering systems; the two can be organically combined and promote each other.

A key prerequisite for a model to have reasoning ability is the ability to remember. Compared with other inference models, memory-based reasoning models retain more information that can be used in subsequent reasoning tasks. Memory-based reasoning models have developed through the stages of the long short-term memory (LSTM) network, the neural Turing machine, memory networks, and the differentiable neural computer (DNC). The DNC uses an external storage matrix as the "memory" of the neural network and an LSTM variant as the "controller"; it has powerful memory management capabilities, can selectively write and read memories, and allows memory contents to be modified repeatedly. As a result, the DNC comes somewhat closer to the capabilities of the human brain.
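
The content-based addressing at the heart of such external-memory models can be sketched in a few lines of NumPy: a query key is compared with every memory row by cosine similarity, the similarities are turned into soft read weights, and the read vector is the weighted sum of memory rows. Sizes and data are invented, and real DNC addressing also includes write heads and temporal links that are omitted here.

```python
import numpy as np

def cosine(a, b, eps=1e-8):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps)

def content_read(memory, key, strength=5.0):
    """Content-based addressing: similarity -> softmax weights -> weighted read."""
    sims = np.array([cosine(row, key) for row in memory])
    weights = np.exp(strength * sims)
    weights /= weights.sum()
    return weights @ memory, weights

rng = np.random.default_rng(1)
memory = rng.normal(size=(8, 16))              # 8 memory slots, 16 dimensions each
key = memory[3] + 0.1 * rng.normal(size=16)    # a noisy query close to slot 3

read_vector, weights = content_read(memory, key)
print(weights.round(2))                        # most of the weight falls on slot 3
```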

The high complexity and black-box nature of deep learning models prevent a model from explaining its inference results. According to how explanations are generated, inference models are generally divided into ante-hoc and post-hoc explanation models. Recently, to make the reasoning process controllable and open to intervention, researchers proposed the symbolic hierarchical interpretable reasoning model (SHiL), which fuses ante-hoc and post-hoc explainable reasoning. The core idea of SHiL is "hierarchical stepwise control + symbolic knowledge drive": based on the theory of mesoscience, a complex data system with multi-level, multi-scale dynamic spatiotemporal associations is divided into several meso-regions to form a hierarchical structure, and, according to the functional and state characteristics of each meso-region, a symbolic control mechanism embedding physical or sociological knowledge (such as common sense and rules) is constructed. The SHiL model is understandable, programmable, and open to intervention, realizing knowledge-driven data computation and reasoning.

Recently, knowledge reasoning has entered a stage that integrates symbolism and connectionism, that is, combining the logical reasoning ability of rule-based symbolic methods with the self-learning ability of deep learning to build more powerful knowledge reasoning models.

(3) Engineering applications of big data knowledge engineering

1. Smart education

Smart education aims to use modern information technology to change the traditional education model and promote educational reform and development. Educational big data refers to the data generated throughout educational activities, collected according to educational needs, used for educational development, and capable of creating potential value. The big-data-driven teaching paradigm has the advantages of efficiency, intelligence, and industrialization. In terms of educational resources, technologies such as knowledge graphs can aggregate high-quality resources from multiple regions and in multiple forms and then characterize and deeply analyze them, providing resource support for teachers' teaching and students' independent learning. In terms of "teaching", educational big data can be used to generate teaching plans and simulate teachers' decision-making, greatly reducing teachers' workload and enabling rapid, large-scale "replication" of high-quality teachers. In terms of "learning", analyzing students' interests, abilities, learning status, and knowledge mastery and precisely planning their learning paths and learning resources makes it possible to teach students according to their aptitude.

In recent years, the research team at Xi'an Jiaotong University has applied knowledge forest theory to online education and developed a knowledge forest navigated learning system, which addresses the structured, systematic description of scattered, cluttered, fragmented knowledge, optimizes the organization of massive online teaching resources, and improves the efficiency of online learning and the quality of lesson preparation. The acquisition, learning, and lesson preparation of the knowledge point "gravity" is briefly introduced below as an example.

(1) Instead of using a general search engine to look for learning materials on the Internet, learners search for learning resources under the guidance of the knowledge forest navigated learning system. When the knowledge point "gravity" is located, the system presents the knowledge system related to "gravity", so that learners "see both the trees and the forest": they can easily obtain the knowledge of the specific knowledge point and, at the macro level, the knowledge points related to it.

(2) The knowledge forest provides personalized guided path recommendations. In online education, the knowledge forest provides students with a series of guidance functions. For example, it can generate a learning path that meets the learning goal of "gravity" and matches the student's cognitive ability, avoiding aimless learning without goals or clues (the so-called "learning trek" problem), and it can answer students' questions related to course knowledge.

The knowledge forest navigated learning system has been applied in higher continuing education and in international education and training, verifying the application value of big data knowledge engineering in education. In higher continuing education, the "MOOC China" learning platform, developed on the basis of knowledge forest construction and navigated learning technologies, has promoted the expansion and strengthening of China's MOOC platforms and seized the commanding heights of global MOOC intelligent guidance technology. In international education and training, the International Knowledge Centre for Engineering Sciences and Technology (IKCEST) has used these technologies to establish a special training system for Silk Road engineering science and technology development, serving Russia, Thailand, Kyrgyzstan, Uzbekistan, and other "Belt and Road" countries and training more than 40,000 international students from more than 100 countries as well as personnel of foreign-related enterprises in China.

2. Tax risk control

Smart taxation aims to promote the deep integration of new achievements in modern information technology with tax work, make tax payment services more convenient and inclusive, further improve the quality and efficiency of tax collection and administration, and make tax law enforcement more standardized and transparent, with the ultimate goal of comprehensively improving tax service, supervision, and governance capabilities. Tax scenarios involve data such as policies and regulations, statements, invoices, budgets, and settlements; how to effectively use such massive, low-quality, disordered fragmented information and realize automated decision-making is an important challenge for smart tax governance. Big data knowledge engineering methods can, on the one hand, automatically acquire the regulatory, economic, and industry knowledge contained in massive tax data and, on the other hand, reason over and apply the refined knowledge to solve key problems in the tax field such as intelligent decision support and explainable tax supervision.

From the perspective of tax services, big data knowledge engineering can achieve accurate two-way matching between tax policies and taxpayers, coping with the challenges brought by real-time changes in tax policy texts and in taxpayers' operating conditions. First, multiple types of rules and conditions (including industry attributes, taxpayer attributes, tax information, and tax-related constraints) are extracted from tax policy texts, and knowledge fusion technology is used to merge duplicate rules and remove invalid ones, forming a rule knowledge base. Then, the relevant knowledge is encoded as rules and a decision table is constructed. Finally, according to actual business needs, a rule computation engine automatically applies the rules to the taxpayer data obtained, so that tax amounts are calculated and declarations are filled in automatically, minimizing taxpayers' time and psychological costs and ensuring that they fully enjoy the applicable tax policies.
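
A minimal sketch of the rule knowledge base plus rule computation engine described above, with all taxpayer attributes, thresholds, and rates invented for illustration: rules are ordered from specific to general, and the engine applies the first rule whose condition matches the taxpayer record.

```python
# All attributes, thresholds, and rates below are hypothetical, for illustration only.
rule_base = [
    {   # small-business exemption (hypothetical rule)
        "name": "small_scale_exemption",
        "condition": lambda t: t["taxpayer_type"] == "small_scale" and t["monthly_sales"] <= 100_000,
        "action": lambda t: {"vat_due": 0.0},
    },
    {   # default rate for small-scale taxpayers (hypothetical rule)
        "name": "small_scale_default",
        "condition": lambda t: t["taxpayer_type"] == "small_scale",
        "action": lambda t: {"vat_due": round(t["monthly_sales"] * 0.03, 2)},
    },
    {   # general taxpayers (hypothetical rule)
        "name": "general_default",
        "condition": lambda t: t["taxpayer_type"] == "general",
        "action": lambda t: {"vat_due": round(t["monthly_sales"] * 0.13, 2)},
    },
]

def compute_declaration(taxpayer):
    """Apply the first matching rule (rules are ordered from specific to general)."""
    for rule in rule_base:
        if rule["condition"](taxpayer):
            return {"applied_rule": rule["name"], **rule["action"](taxpayer)}
    return {"applied_rule": None}

print(compute_declaration({"taxpayer_type": "small_scale", "monthly_sales": 80_000}))
# {'applied_rule': 'small_scale_exemption', 'vat_due': 0.0}
```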

From the perspective of tax supervision, big data knowledge engineering methods can extract fragmented knowledge from enterprises' capital flows, invoice flows, contract flows, and logistics, and combine it with knowledge of the finance and taxation industry to build a finance and taxation knowledge base for tax departments. Then, using knowledge representation and symbolic knowledge reasoning technologies, risk clues are dynamically integrated according to temporal, dependency, causal, and other relationships, and reasoning paths and evidence chains are generated, improving the interpretability of audit conclusions about tax-related violations. In this way, potentially non-compliant enterprises can be discovered proactively, helping tax authorities control enterprises' tax-crime risks, reduce the financial losses caused by tax evasion, promote precise supervision and accurate law enforcement, and avoid disturbing honest taxpayers. In addition, for the enterprises involved, not only the identification results but also the relevant evidence chains can be provided, ensuring credibility and enforceability.

3. Smart healthcare

Smart healthcare is a comprehensive service model that takes residents' health and medical data as its core and integrates emerging technologies such as the Internet of Things, cloud computing, and artificial intelligence. Since the 13th Five-Year Plan period, with the rapid development of medical informatization, including the construction of clinical systems centered on electronic medical records, medical insurance cost-control systems, improved "Internet +" medical information systems, and regional health informatization with medical alliances as the carrier, massive medical data has been accumulated. Extracting information from these data for effective management, analysis, and application is the basis of medical knowledge retrieval, clinical diagnosis, medical quality management, and intelligent analysis and processing of electronic health records, and building a medical knowledge graph is a key means of achieving these goals.

The Chinese medical knowledge graph CMeKG is built from large-scale medical text data in a human-machine collaborative manner. Its construction refers to authoritative international medical standards such as the International Classification of Diseases (ICD), the Anatomical Therapeutic Chemical (ATC) classification system, the Systematized Nomenclature of Medicine (SNOMED), and Medical Subject Headings (MeSH), as well as large-scale clinical guidelines, industry diagnosis and treatment standards, and medical encyclopedia knowledge. CMeKG 1.0 (January 2019) includes structured knowledge descriptions of more than 6,000 diseases, more than 10,000 drugs (Western medicines, Chinese patent medicines, and Chinese herbal medicines), and more than 1,200 diagnostic and treatment technologies and devices, covering more than 30 common relationship types such as clinical symptoms, site of onset, drug treatment, surgical treatment, differential diagnosis, imaging examination, drug composition, indications, dosage, shelf life, and contraindications, with more than 1 million conceptual relationship instances and attribute triples describing medical knowledge. CMeKG 2.0 (September 2019) fuses knowledge from multi-source heterogeneous medical resources, adds symptom-related knowledge, and describes pediatric diseases in detail; the expanded version contains structured knowledge descriptions of more than 10,000 diseases, 20,000 drugs, 10,000 symptoms, and 3,000 diagnostic and treatment technologies, with 1.56 million medical knowledge triples.

Medical information retrieval based on a medical knowledge graph can improve retrieval accuracy and overcome the shortcomings of traditional medical search in response speed and storage consumption. For example, the Chinese medicine language system, combined with "knowledge card" embedding and "knowledge map" display, can visualize conceptual knowledge in the field of Chinese medicine, making it convenient for users to query and browse specific concepts. Well-known dedicated medical information search engines abroad include WebMD, Healthline, and Google Health; for search requests about specific diseases and symptoms, Google Health can provide data on more than 400 health conditions and give corresponding symptom descriptions.

Based on a medical knowledge graph, combined with the patient's symptoms and laboratory information, a clinical decision support system (CDSS) can automatically generate diagnostic reports and treatment plans and can check and supplement the diagnosis and treatment plans given by doctors, reducing or even avoiding misdiagnosis. Representative CDSS developers in China include China Medical and Mindray Medical, while representative systems abroad include DiagnosisOne, DXplain, and Micromedex. Applying knowledge graphs to CDSSs has become a research hotspot, but it still faces challenges such as incomplete general medical knowledge graphs, low confidence in medical decisions, and the lack of interpretability of predictions obtained with artificial intelligence methods.

III. Challenges and future research directions of big data knowledge engineering technology

With the rapid development of artificial intelligence, the Internet of Things, cloud computing, and blockchain technologies, massive data recording human production and life has been generated in various fields. Based on these data, how to mine knowledge of patterns and laws and realize the transformation from data to knowledge and from knowledge to decision-making is the core problem of fourth-paradigm scientific research. Recently, inspired by AlphaFold, researchers proposed the idea of a "prototype of the fifth paradigm of scientific research", pointing out that domain knowledge (including human prior/expert knowledge) needs to be integrated into the design of algorithms and models to better solve domain problems. Accordingly, this paper analyzes the challenges faced by big data knowledge engineering in knowledge acquisition, knowledge representation, and knowledge reasoning, and discusses potential future research directions for addressing these challenges.

(1) Knowledge acquisition

Traditional knowledge acquisition techniques focus on mining latent knowledge from massive text data and have significant limitations in modal diversity and in the types of knowledge covered. In the future, obtaining more informative visual knowledge and highly implicit common sense knowledge will be the development direction of knowledge acquisition technology. These two types of knowledge and their potential research directions are introduced below.

1. Visual knowledge acquisition

Visual knowledge is a new framework that is expected to improve cross-media knowledge expression and further promote the development of artificial intelligence. Cognitive psychology shows that visual memory is distinct from verbal memory: humans can fold, rotate, scan, and compare visual memories in the brain as needed. Cognitive psychologists call this type of memory "mental imagery"; in the field of artificial intelligence it is called visual knowledge. Visual knowledge has the following characteristics:

(1) It can express the spatial shape, size, spatial relationships, color, and texture of objects;

(2) It can express the movement, speed, and temporal relationships of objects;

(3) It can perform spatiotemporal transformation, operation, and reasoning on objects, including shape transformation, action transformation, speed transformation, scene transformation, various spatiotemporal analogies and associations, and prediction based on spatiotemporal reasoning results. Effectively processing and rationally using visual knowledge has therefore become an important issue in information exchange between humans and between humans and machines.

Visual knowledge has a variety of expression forms and can be divided into static visual knowledge and dynamic visual knowledge according to whether the expression is discrete or continuous. Static visual knowledge, also known as visual common sense, refers to static visual facts that can be collected from real-world scenes together with the predictable information or inferences that social agents make on the basis of those facts. It is extremely difficult for computers to learn visual common sense: on the one hand, the breadth of visual common sense is enormous and computers lack the prior knowledge that humans accumulate as common sense; on the other hand, beyond low-level recognition tasks on visual elements, computers need a deeper understanding of the contextual information implied in images. Dynamic visual knowledge (dynamic visual narrative) is composed of a set of static visual knowledge organized as a sequence by temporal or spatial relationships. The spatial relationship is expressed as a scene structure that describes orientation relationships between objects, such as above/below, left/right, and front/back, as well as distance, inside/outside, and size relationships; the temporal relationship is expressed as a dynamic structure that describes the growth, displacement, action, change, competition, and collaboration of objects.

In addition, in recent years academia has begun to pay attention to schematic diagrams, a form of high-level static visual knowledge. A schematic diagram is a visual representation composed of graphical elements, often used to express the internal rules or logical information of a specific knowledge topic or concept in a professional field. Schematics are widely distributed in MOOC websites, open knowledge bases, technical forums, and other knowledge sources. The analysis and understanding of such special images is the foundation of knowledge-intensive tasks such as knowledge base construction and intelligent question answering, and is also an important part of cross-media intelligence. In terms of low-level visual features, the color, texture, and background information of schematic diagrams is far less rich than that of natural images, and this feature sparsity causes problems such as overfitting and difficult convergence during model training. In terms of high-level semantics, schematic diagrams exhibit a phenomenon of "same shape, different meaning" not found in natural images. Taking Figure 4 as an example, the schematic diagrams of the "solar system" and the "atom" are similar in shape but completely different in meaning, which exposes schematic understanding to an even more serious semantic gap.


Figure 4 Examples of "same shape, different meaning" schematic diagrams

Visual knowledge theory can not only promote research on cross-media expression but also support and enhance research and applications in broader artificial intelligence fields such as intelligent creation and logical reasoning. At present, many studies have not formally introduced the concept of visual knowledge, and visual knowledge still has certain limitations in structured representation, operation and reasoning, and reconstruction and generation.

2. Acquisition of common sense knowledge

Common sense knowledge refers to the effective consensus people have reached about the connections between different things in the real world; it covers a large amount of human experience and is widely accepted without explanation or argumentation. Common sense knowledge allows computers, like humans, to be familiar with as many facts as possible and to make reasoning decisions, and it plays a huge role in machine question answering, conversational emotion recognition, story ending generation, and other tasks.

Common sense knowledge has the following three characteristics.

(1) Conceptual: the vast majority of common sense knowledge is conceptual, representing the common characteristics of a class of things rather than the unique characteristics of an individual entity.

(2) General: the concepts embodied in common sense knowledge are widely accepted and universal. For example, "human breathing requires oxygen" is common sense, while "the composition of cell membranes requires cholesterol" is known only to experts in specific fields and is more specialized, so it is not common sense.

(3) Implicit: common sense knowledge is universally shared and is therefore often omitted from people's spoken or written communication. Common sense knowledge takes very diverse forms. For example, ConceptNet and ATOMIC are typical common sense knowledge graphs, which represent common sense as relational triples and organize these triples into a network structure. Lexical databases represented by WordNet and Roget's Thesaurus are knowledge sources manually compiled by experts according to certain rules and requirements, and they also constitute common sense knowledge. Pre-trained language models such as BERT are also considered an expression of common sense knowledge; they are usually trained on large corpora and can effectively capture syntactic features, semantic information, and factual knowledge. In natural language processing research, such common sense knowledge can serve as background semantics and significantly enrich contextual semantic information. In computer vision research, common sense knowledge can improve the performance of downstream tasks such as navigation, manipulation, and recognition, moving closer to true artificial intelligence.

Computers' insufficient grasp of common sense knowledge is still an important bottleneck in the development of artificial intelligence. Common sense knowledge is diverse, including but not limited to intuitive, psychological, visual, and emotional forms, and spanning text, image, speech, and other modalities. How to link and integrate event, concept, and relationship elements in cross-language, cross-modal, multi-source data to obtain rich common sense knowledge and its representations will therefore be an important research direction. In addition, although current large-scale common sense bases contain some human emotional states, implicit semantics, and possible behaviors, they rarely capture the social interaction patterns humans adopt in daily life, such as how to respond to others empathetically. How to use the rich, dynamic dialogue resources on the Internet to build a social common sense knowledge base that better supports downstream tasks such as machine dialogue, question answering, and chat is another important research direction.

(2) Knowledge representation

Driven by massive labeled data and powerful computing, the performance of existing knowledge engineering technologies in many fields and tasks has approached or even surpassed that of humans. However, knowledge representation techniques still face practical challenges such as high model complexity and poor interpretability. Specifically, first, deep representation and inference models have complex structures and huge numbers of parameters and are extremely difficult to train; for example, the text representation model GPT-3 contains more than 170 billion parameters and was trained on 45 TB of data. Second, most deep representation models are black boxes whose internal mechanisms and outputs are difficult to understand, so corresponding optimization schemes cannot be made explicit.

In contrast, humans are born with the ability to encode and memorize knowledge, which relies on the complex structure and mechanisms of the human brain. The brain can autonomously represent knowledge, generalize from learning, reason about knowledge, and perform multiple unrelated tasks in parallel; moreover, compared with the huge computational cost of knowledge engineering techniques, the brain maintains relatively high efficiency at low energy consumption. The human brain thus remains the only truly intelligent system at present, and learning from its complex mechanisms to build more powerful and versatile knowledge representation models is highly promising. The latest progress on knowledge representation and sequence memory processing in the brain is introduced below as a reference for the next development direction of big data knowledge representation technology.

How knowledge is represented in the brain has always been at the frontier of scientific research. Cognitive neuroscience shows that both spatial location information and abstract knowledge are stored in the hippocampus in the form of cognitive maps. To explore the brain's knowledge coding mechanism in complex activities, such as tasks involving both changes in spatial position and abstract cognitive variables, researchers constructed the neural activity space of the dorsal hippocampal CA1 region of mice performing cognitive decision-making tasks. The experiments show that neurons encode spatial location information and abstract cognitive variables simultaneously and interdependently. In addition, by using a neural manifold space to reduce the dimensionality of population neuronal activity recorded while mice moved in a virtual scene, it was found that hippocampal population activity exhibits strong geometric structure in its representation of spatial location information and abstract cognitive variables, and that the geometry of these representations is specific to the particular task. Finally, the study found that neurons rich in abstract cognitive information enable organisms to make predictions and judgments. This research reveals that the representation of complex knowledge in the brain has distinct geometric features; therefore, when designing new knowledge representation models, manifold learning methods can be used to judge and evaluate the structure of the knowledge represented in low-dimensional space, so as to improve the model's knowledge representation ability.

The human brain processes sequence information all the time: language communication, action execution, and episodic memory all essentially involve the representation of temporal information, so sequence memory is a basic cognitive function of the brain. To explore how sequence memory is encoded, in a recent study researchers used in vivo two-photon calcium imaging to record the activity of thousands of neurons in the lateral prefrontal cortex (the area responsible for working memory) of macaques. The results show that the information at each ordinal position corresponds to a two-dimensional subspace in the high-dimensional calcium imaging data; within each subspace, the position of each point corresponds to the actual hexagonal ring structure seen by the macaque, and the subspaces corresponding to the three different ordinal positions are nearly orthogonal to one another, that is, each item in the sequence has its own independent storage space in the brain. The researchers also found that the radius of the hexagonal ring structure in the subspaces of later ordinal positions is smaller than that of earlier ones, which matches the behavioral performance of sequence memory: the more content there is to remember, the more likely later items are to be recalled incorrectly. This sequential working memory study reveals the coding mechanism by which neurons store sequence memory, corresponding to a representation method that embeds the structural information of different ordinal subspaces into a high-dimensional vector space, and it will provide an important reference for brain-inspired knowledge encoding and memory.

(3) Knowledge reasoning

With the development of deep learning, knowledge reasoning models have become increasingly complex and are widely used in many fields. Practice shows that such complex models have surpassed the human level in inference speed, accuracy, and stability, but they still face challenges. In particular, it is difficult for users to intuitively understand a model's parameters, structure, and characteristics, and they cannot accurately grasp the basis of the model's reasoning and decision-making. This has prompted academia and industry to explore new frameworks for knowledge reasoning. In recent years, counterfactual reasoning and explainable reasoning models have attracted growing attention and have become the next development direction of big data knowledge reasoning technology. These two kinds of inference models and their future research directions are introduced below.

1. Counterfactual reasoning

Counterfactual reasoning, also known as counterfactual thinking, refers to the mental activity of negating and re-characterizing facts that have already occurred in order to construct hypothetical possibilities. The ability to reason counterfactually is one of the important manifestations of human intelligence, and in the current wave of artificial intelligence research, researchers recognize that human-like causal inference and counterfactual reasoning are a hallmark of moving from weak to strong artificial intelligence. Causality has three levels, from bottom to top: association, intervention, and counterfactuals. Counterfactuals sit at the very top of the "ladder of causation", as shown in Figure 5.


Figure 5 The ladder of causation

Counterfactual reasoning must be performed on observational data, and researchers have designed a variety of counterfactual reasoning frameworks, the most famous of which are the potential outcome framework (POF) and the structural causal model (SCM). The POF draws on the concepts of randomized controlled trials and potential outcomes in statistics to construct an analytical framework for causal inference; its core idea is "no causation without assumptions", that is, if reality does not satisfy the basic assumptions, conclusions about potential outcomes do not hold. The three basic assumptions commonly used in the POF are the stable unit treatment value assumption, the ignorability assumption, and the positivity assumption; based on them, researchers have designed causal inference methods such as matching, inverse probability weighting, and stratification. The SCM explores counterfactual causality by constructing causal diagrams and structural equations. Under this framework, causal inference relies on three basic path structures in directed acyclic graphs, namely chains, forks, and colliders; the three structures transmit information in different ways, and every causal diagram can be decomposed into combinations of them. The SCM parameterizes the causal relationships between variables and uses structural equation models for inference.
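
A minimal sketch of the inverse probability weighting method mentioned above, assuming scikit-learn is available: on simulated observational data with one confounder, a logistic regression estimates the propensity score, and the reweighted difference of outcomes recovers the average treatment effect, whereas the naive group difference is biased. The data and the true effect (2.0) are simulated, not taken from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000

# Simulated observational data: one confounder x affects both treatment and outcome.
x = rng.normal(size=n)
treat = rng.binomial(1, 1 / (1 + np.exp(-x)))          # treated more often when x is high
y = 2.0 * treat + 1.5 * x + rng.normal(size=n)         # true treatment effect = 2.0

# Naive comparison is biased because of the confounder.
print("naive difference:", round(y[treat == 1].mean() - y[treat == 0].mean(), 2))

# Inverse probability weighting: estimate propensity e(x) = P(T=1 | x), then reweight.
e = LogisticRegression().fit(x.reshape(-1, 1), treat).predict_proba(x.reshape(-1, 1))[:, 1]
ate = np.mean(treat * y / e) - np.mean((1 - treat) * y / (1 - e))
print("IPW estimate of the average treatment effect:", round(ate, 2))   # close to 2.0
```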

Against the background of the interweaving of causal inference and big data knowledge engineering, counterfactual reasoning has also developed rapidly in this field and has achieved success in tasks such as visual question answering (using counterfactual reasoning to eliminate language bias) and duplicate question identification (using counterfactual reasoning in place of traditional statistical analysis). Nevertheless, a general theoretical system for counterfactual reasoning has not yet been established; how to effectively integrate real data, clarify evaluation indicators and objectives, and design scalable reasoning models over multimodal data urgently needs to be solved.

2. Explainable reasoning

In recent years, interpretable reasoning has become a research hotspot in academia and industry. There is still no consensus on the definition of interpretability; a widely recognized definition is that interpretability is the ability to provide explanations to humans in terms humans can understand. In low-risk situations (such as movie recommendation), people may not care why a model made a particular judgment, but in high-risk situations (such as autonomous driving and drug recommendation), in addition to producing highly accurate predictions, the model must explain how it arrived at the current prediction. This requirement for high model reliability further increases the need for interpretability research.

According to how explanations are generated, inference models can be roughly divided into two categories: ante-hoc explanation and post-hoc explanation. The former mainly refers to explanations provided by the model architecture itself without the help of additional explanation methods, while the latter refers to explaining inference results using explanation methods that do not rely on the model itself. If a method can explain a black-box model, then it can:

(1) approximate the model's inference process using transparent models (such as decision trees, rule lists, and linear models), as in the surrogate-model sketch following this list;

(2) make and explain the model's predictions for specific instances;

(3) understand specific properties inside the model (such as the role of a neuron of a deep neural network in a decision). It is worth noting that post-hoc explanation methods can also be applied to ante-hoc interpretable models.
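
A minimal sketch of item (1), assuming scikit-learn is available: a random forest plays the role of the black box, and a shallow decision tree is fitted to the black box's own predictions as a global surrogate whose rules can be printed; "fidelity" measures how closely the surrogate reproduces the black-box decisions. The dataset and models are illustrative choices, not from the paper.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)

# "Black box" to be explained.
black_box = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Global surrogate: a shallow, transparent tree fitted to the black box's predictions.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X, black_box.predict(X))

# Fidelity: how often the surrogate reproduces the black box's decisions.
fidelity = accuracy_score(black_box.predict(X), surrogate.predict(X))
print(f"surrogate fidelity: {fidelity:.2f}")
print(export_text(surrogate, feature_names=list(load_breast_cancer().feature_names)))
```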

Although current explainable reasoning models show good potential in fields related to people's livelihood such as healthcare (e.g., clinical decision support systems), finance (e.g., detection of tax evasion and tax fraud), and transportation (e.g., automatic perception, control, and decision-making), overall research is still in its infancy and faces many challenges. For example, the performance of some inference models is inadequate; some well-performing inference models are strongly tied to a particular domain and scale poorly; and it is unclear how to judge the advantages and disadvantages of different interpretability methods on the same task or scenario. Breakthroughs on these questions will drive the rapid development of interpretable reasoning.

IV. Suggestions for the development of big data knowledge engineering in China

(1) Multidisciplinary cross-integration to promote theoretical and technical research in big data knowledge engineering

Multidisciplinary integration is an important source of scientific and technological innovation and theoretical creation and can promote the high-quality development of big data knowledge engineering technology in China. First, build special zones for frontier interdisciplinary research on big data knowledge engineering and set up major and key R&D projects in this area, and use the construction of related joint laboratories as a starting point to promote the deep integration of computer science, artificial intelligence, and other disciplines. Second, provide strong institutional and mechanism guarantees for interdisciplinary integration: carry out top-level planning for interdisciplinarity, rationalize the mechanisms and systems for awarding interdisciplinary degrees, establish interdisciplinary service platforms, and explore evaluation methods for emerging interdisciplinary fields.

(2) Establish an industry standard system for big data knowledge engineering

Establishing standards such as terminology and application guidelines for big data knowledge engineering is an important indicator of the industry's level of technological development and a driving force for innovation. First, by strengthening communication and deepening cooperation, integrate and make full use of the advantageous resources of enterprises and research institutions related to big data knowledge engineering at home and abroad, focusing on breakthroughs in knowledge acquisition, fusion, representation, and reasoning technologies. Second, promote relevant cutting-edge research results, form application demonstrations, create industry application benchmarks, and select market-recognized general standards and norms, so as to promote the continuous development and improvement of the industry's technical standard system.

(3) Use demand as traction to promote the engineering application of big data knowledge engineering in various industries

Taking basic theoretical and technical research on big data knowledge engineering and the formulation of industry standards as an opportunity, and oriented to market demand, build an "industry-university-research" collaborative development mechanism along the path of "basic research, technological innovation, industrialization". First, at the level of universities and research institutions, give full play to each institution's characteristics, gather its advantageous disciplines, and explore school-enterprise collaborative education models that meet the needs of the times and the market; at the same time, invest resources in big data knowledge engineering and its application technologies, formulate and improve corresponding talent training programs, strengthen the cultivation of application-oriented talent in the process of technology promotion, and pay attention to cultivating students' innovation potential. Second, at the enterprise level, closely follow market demand, deepen market research and make proactive plans, aim at internationally leading development goals, adhere to application-led R&D, prospectively plan key directions for innovative research in the interdisciplinary field of big data knowledge engineering, and drive the deepening and expansion of the entire industrial chain through demonstration effects.

Note: The content of this article has been slightly adjusted; if necessary, please refer to the original text.


Note: The paper reflects the progress of research results and does not represent the views of China Engineering Science magazine.
