Opening archives to the public for inquiry and use is now the general trend for archives at all levels, both to meet the public's need for archival information and to maximize the value of the holdings. The "14th Five-Year Plan for the Development of National Archives" proposes to "strive to promote archives work towards governance according to law, openness and modernization". The Archives Law, as revised in 2020, requires: "The archives held by archives at or above the county level shall be opened to the public once 25 years have passed from the date of formation."
Article 7 of Decree No. 19 of the State Archives Administration, "Measures for the Opening of the Archives of the State Archives", states: "The archives of the State Archives that have completed 25 years from the date of formation shall be opened to the public in a timely manner if the open review finds no need to restrict their use. Archives on economics, education, science and technology, and culture may be opened to the public in advance after an open review." These regulations and policies point the way for the open use of archives. However, because open review standards and procedures are inconsistent and unspecific, and open review personnel are in short supply, the open review of archives has progressed slowly.
Part 1
AI technology supports the open review of archives
With the rapid development of AI technology, using AI to speed up the open review of archives has become practical. Earlier applications relied mainly on sensitive word filtering, traditional natural language processing (NLP), and similar techniques. These suffered from weak transfer learning, narrow adaptability, and weak semantic analysis, and so could do little to reduce compliance and privacy protection risks.
In the open review of archives, custom-trained models have the following advantages over traditional natural language processing techniques:
- Contextual understanding: Custom-trained models better understand contextual relationships in text, rather than just predicting the next state from the current state. They can handle complex contexts, especially in long documents or large volumes of text.
- Semantic understanding: Custom-trained models capture the semantic information of words more accurately and can understand and analyze texts at a deep level.
- Generalization ability: Custom-trained models generalize well and can adapt to various document types and domains.
- End-to-end learning: Custom-trained models support end-to-end learning without much preprocessing or manual feature extraction, which lets them adapt to complex tasks and simplifies the process.
- Transfer learning: Custom-trained models have strong transfer learning capabilities and can be quickly deployed and adapted to specific tasks.
Part 2
Archival open review model architecture
In addition to the custom-trained model, an AI file open review system needs to integrate intelligent OCR recognition, official seal detection, official seal OCR recognition, image recognition, image comparison, semantic recognition, and natural language processing. The resulting system supports customized review rules, carries out open review work intelligently, and visually displays the review process and results, as shown in the following figure:
Part 3
Implementation path for the open review of archives
Step 1: Document preprocessing
· SM file screening
Since SM (classified) files are not open to the public, the documents proposed for opening should first be screened for SM content, and any SM files detected should be removed from the open review queue.
The SM file intelligent screening subsystem, built on AI technologies such as neural networks, natural language processing, and deep learning, automatically analyzes unstructured electronic documents, identifies secret-level markings in the text, and efficiently screens out SM files.
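The screening step can be illustrated with a minimal sketch. It uses plain keyword matching as a stand-in for the neural model described above, and the marking vocabulary, function names, and sample documents are all hypothetical; a real subsystem would rely on the markings defined by the applicable secrecy regulations and on a trained model.

```python
import re

# Hypothetical marking vocabulary, ranked by severity (assumption, not from the source).
SECRET_LEVELS = {"top secret": 3, "secret": 2, "confidential": 1}
_MARKING = re.compile(r"\b(top secret|secret|confidential)\b", re.IGNORECASE)


def detect_secret_level(text):
    """Return the highest secret-level marking found in the text, or None."""
    found = [m.group(1).lower() for m in _MARKING.finditer(text)]
    if not found:
        return None
    return max(found, key=SECRET_LEVELS.__getitem__)


def screen_queue(documents):
    """Split {doc_id: text} into (review_queue, removed).

    Documents carrying a marking are removed from the open review queue;
    `removed` maps their ids to the detected level.
    """
    review_queue, removed = [], {}
    for doc_id, text in documents.items():
        level = detect_secret_level(text)
        if level is None:
            review_queue.append(doc_id)
        else:
            removed[doc_id] = level
    return review_queue, removed


docs = {
    "a": "Meeting notes on school enrollment.",
    "b": "TOP SECRET: deployment plan",
    "c": "Marked Confidential by the issuing unit.",
}
queue, removed = screen_queue(docs)
print(queue)    # ['a']
print(removed)  # {'b': 'top secret', 'c': 'confidential'}
```

Keyword matching of this kind only catches explicit markings, which is one reason the text calls for a trained model rather than simple filtering.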
· AI-OCR recognition
With AI-OCR recognition technology, text recognition accuracy on scanned digital copies can reach 99%, and accuracy on horizontally written handwriting can reach 95%. The text produced by OCR is then processed with natural language processing and large language model technology into structured data, laying the data foundation for file open review.
· Official seal detection
Using deep learning and computer vision, official seals are detected automatically through a pipeline of document/image preprocessing, an object detection model, candidate region generation, official seal classification, and post-processing.
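To make the candidate-region and post-processing stages concrete, here is a small sketch in pure Python. It deliberately swaps the trained object detection model for a simple red-color threshold, on the assumption that the seals are stamped in red ink; the thresholds, function names, and the tiny synthetic image are illustrative only.

```python
def red_mask(image, min_red=150, max_other=100):
    """Mark pixels that look like red ink.

    `image` is a list of rows, each row a list of (r, g, b) tuples.
    The thresholds are illustrative assumptions.
    """
    return [
        [r >= min_red and g <= max_other and b <= max_other for (r, g, b) in row]
        for row in image
    ]


def candidate_box(mask):
    """Return the bounding box (top, left, bottom, right) of all marked pixels, or None."""
    coords = [(y, x) for y, row in enumerate(mask) for x, hit in enumerate(row) if hit]
    if not coords:
        return None
    ys = [y for y, _ in coords]
    xs = [x for _, x in coords]
    return (min(ys), min(xs), max(ys), max(xs))


def looks_like_seal(box, min_side=2, max_aspect=1.5):
    """Post-processing: keep boxes that are large enough and roughly square."""
    top, left, bottom, right = box
    height, width = bottom - top + 1, right - left + 1
    return min(height, width) >= min_side and max(height, width) / min(height, width) <= max_aspect


# Synthetic 6x8 white page with a 3x3 red patch standing in for a seal.
WHITE, RED = (255, 255, 255), (200, 30, 30)
page = [[WHITE] * 8 for _ in range(6)]
for y in range(1, 4):
    for x in range(2, 5):
        page[y][x] = RED

box = candidate_box(red_mask(page))
print(box)                   # (1, 2, 3, 4)
print(looks_like_seal(box))  # True
```

A color threshold cannot separate a seal from other red content such as red letterheads, which is where the object detection and classification models named in the text come in.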
· OCR recognition of official seals
Building on official seal detection, OCR recognition and pre-trained image models are used to detect and recognize the text inside the official seal. Unlike ordinary OCR recognition, official seal OCR is designed specifically for the characters inside a seal (generally the name of a unit or a person), so its recognition model needs to be trained separately.
· Image recognition and comparison
Computer vision processing and pre-trained image models are used to detect and identify digital copies of archives, which helps improve the accuracy of AI-OCR recognition, official seal detection, and official seal OCR recognition.
· AI document classification and identification
The AI document classification and identification subsystem can automatically classify documents, realize the classification and recognition of text and image content, and assist the open review system to quickly determine the document type and match the review rules, further improving the efficiency of open review.
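The idea of determining a document type and then matching review rules can be sketched as follows. The keyword-count classifier is only a stand-in for the AI classification subsystem, and the document types, keywords, and rules (including the 18-digit identifier pattern) are hypothetical examples rather than rules from the source.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    name: str
    triggered: Callable[[str], bool]  # returns True when the text hits the rule


# Hypothetical document types and their indicative keywords.
TYPE_KEYWORDS = {
    "personnel": ("appointment", "dismissal", "salary"),
    "education": ("enrollment", "curriculum", "school"),
}

# Hypothetical type-specific rules plus rules applied to every document.
RULES_BY_TYPE = {
    "personnel": [
        Rule("personal identifier present",
             lambda t: re.search(r"\b\d{18}\b", t) is not None),
    ],
    "education": [],
}
DEFAULT_RULES = [
    Rule("secret-level marking present", lambda t: "confidential" in t.lower()),
]


def classify(text):
    """Pick the type whose keywords occur most often; 'unknown' if none occur."""
    lowered = text.lower()
    scores = {
        doc_type: sum(lowered.count(keyword) for keyword in keywords)
        for doc_type, keywords in TYPE_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def review(text):
    """Classify the document, apply the matching rules, and suggest a result."""
    doc_type = classify(text)
    rules = RULES_BY_TYPE.get(doc_type, []) + DEFAULT_RULES
    hits = [rule.name for rule in rules if rule.triggered(text)]
    return {"type": doc_type, "hits": hits,
            "suggestion": "controlled" if hits else "open"}


print(review("Notice on salary adjustment for staff 110101199001011234"))
print(review("School enrollment plan for the autumn term"))
```

Keeping the rules in a table keyed by document type means new rules can be configured without touching the review loop.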
Step 2: Build a rulebase
In accordance with relevant regulations and policies, the known audit rules are briefly summarized as follows:
Step 3: Determine the technical implementation method according to the rule base
Technical implementation methods can be divided into the following six categories:
Step 4: Configure rules and carry out the review
Based on the above three steps, the process of open review is shown in the following figure:
Step 5: Optimize the open review model
The maturity of the AI file open review system depends on the maturity of the open review model. The model needs to be optimized continuously using the feedback from manual review during the open review process, with particular attention to "false negative" samples, so as to keep improving the accuracy of open review. Here "false negative" is used in the confusion-matrix sense, with "controlled" treated as the positive class: the model predicts "open" but manual review decides "controlled".
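The feedback loop can be sketched in a few lines. Following the definition in the text, a false negative is a document the model predicted as open but manual review marked as controlled; the function names and the dictionary-based data layout are assumptions made for illustration.

```python
def collect_false_negatives(model_results, manual_results):
    """Return ids of documents predicted 'open' but manually judged 'controlled'.

    Both arguments map doc_id -> 'open' or 'controlled'. Documents that have
    not yet been manually reviewed are skipped.
    """
    false_negatives = []
    for doc_id, predicted in model_results.items():
        confirmed = manual_results.get(doc_id)  # None if not reviewed yet
        if predicted == "open" and confirmed == "controlled":
            false_negatives.append(doc_id)
    return false_negatives


def build_retraining_batch(false_negatives, texts):
    """Pair each false-negative text with its corrected label for targeted training."""
    return [(texts[doc_id], "controlled") for doc_id in false_negatives if doc_id in texts]


model_results = {"d1": "open", "d2": "open", "d3": "controlled", "d4": "open"}
manual_results = {"d1": "open", "d2": "controlled", "d3": "controlled"}
texts = {"d2": "Minutes naming individual petitioners"}

missed = collect_false_negatives(model_results, manual_results)
print(missed)                                 # ['d2']
print(build_retraining_batch(missed, texts))  # [('Minutes naming individual petitioners', 'controlled')]
```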
Because of the particular nature of file open review, it is difficult during software development to obtain large sample sets or corpora labeled with the two classes "open" and "controlled", and the following methods are generally adopted to improve accuracy:
The above are some technical means of improving the accuracy of the open review model when samples are limited.
Step 6: Implement the deployment
· Offline deployment
An AI file open review all-in-one machine can be purchased or leased to carry out open review work offline. If the project is carried out on a lease, the storage medium (hard disk) in the all-in-one machine is retained by the organization that leased it.
· System integration and deployment
The AI file open review system provides an interface for integration with the archive management system or other systems, so that open review can be carried out through online interface calls.
Part 4
Proven in practice
Verified in one project with 200,000 sample files: in the first round, the precision of the AI file open review system reached 100% and the accuracy reached 99%; after targeted training on the "false negative" samples, the accuracy in the second round reached 100%.
Note: Precision = TP/(TP+FP), Accuracy = (TP+TN)/(TP+TN+FP+FN), where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives.
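The two formulas in the note translate directly into code. The sketch below uses made-up counts, not the project's data, and assumes "controlled" is the positive class; it shows how a precision of 100% can coexist with an accuracy below 100% when the only errors are false negatives.

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP); 0.0 if nothing was predicted positive."""
    predicted_positive = tp + fp
    return tp / predicted_positive if predicted_positive else 0.0


def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN); 0.0 for an empty sample."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


# Illustrative counts only: no false positives, ten false negatives.
tp, tn, fp, fn = 90, 900, 0, 10
print(precision(tp, fp))         # 1.0
print(accuracy(tp, tn, fp, fn))  # 0.99
```

With no false positives, every remaining error is a false negative, so retraining on exactly those samples is what moves accuracy upward.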
Of course, the AI file open review model is an industry-specific trained model, not a general-purpose one. Its accuracy depends on the training sample data, which limits its applicability across scenarios: good results in one project at one organization do not guarantee that the model will meet the needs of another project at another organization. Software developers should therefore get as close to front-line practice as possible and, through joint development or cooperation with archival institutions, accumulate multi-project and multi-scenario experience to keep improving the accuracy of the model.