
Google Launches Frontier Safety Framework: Assessing Severe Risks of AI Models, with 4 Areas Most Likely to Be Affected

Author: Zhidongxi

Compile | ZeR0

Edit | Desert Shadow

Zhidongxi reported on May 18 that Google DeepMind launched its Frontier Safety Framework for AI last night and released an accompanying technical report.


The Frontier Safety Framework is a set of protocols for identifying and mitigating potential risks during the development of AI models. It aims to proactively identify future AI capabilities that could cause severe harm and to put mechanisms in place to detect and mitigate them.

Google DeepMind plans to fully implement this initial framework by early 2025. The framework complements Google's alignment research by focusing on severe risks posed by powerful capabilities at the model level, such as exceptional agency or sophisticated cyber capabilities.

It is worth noting that, in the technical report, the main mitigation on the security side is protecting model weights, where the security concern appears to overlap heavily with protecting trade secrets.

1. Three key components: identify harmful capability thresholds, evaluate models regularly to detect them, and apply mitigations

The first version of the framework, released today, builds on Google's research on evaluating critical capabilities in frontier models and follows the emerging approach of responsible capability scaling.

The framework has 3 key components:


1. Identifying capability thresholds at which a model could cause severe harm. Google DeepMind examined the paths through which a model could cause severe harm in high-risk domains, then determined the minimum level of capability a model must have to play a role in causing that harm. These thresholds, called Critical Capability Levels (CCLs), guide Google DeepMind's evaluation and mitigation approach.

2. Evaluating frontier models periodically to detect when they reach these Critical Capability Levels. Google DeepMind will develop suites of model evaluations, called "early warning evaluations", that alert when a model approaches a CCL, and will run them frequently enough that researchers are notified before the threshold is reached.

3. Applying a mitigation plan when a model passes the early warning evaluations. The plan takes into account the overall balance of benefits and risks, as well as the intended deployment context. The mitigations focus primarily on security (preventing the exfiltration of model weights) and deployment (preventing misuse of critical capabilities). A schematic sketch of this evaluate-then-mitigate loop follows.
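
The following is a minimal sketch, in Python, of how the three components could fit together as an evaluate-then-mitigate loop. It is purely illustrative: the class and function names (CriticalCapabilityLevel, run_early_warning_evaluations, apply_mitigation_plan), the scoring scheme, and the threshold mechanics are assumptions made for this example, not part of any Google DeepMind tooling.

    # Illustrative sketch only: these names are hypothetical and not part of any
    # Google DeepMind API. The loop mirrors the three components described above.
    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class CriticalCapabilityLevel:
        domain: str             # e.g. "autonomy", "biosecurity", "cybersecurity", "ML R&D"
        description: str        # the minimum capability at which severe harm becomes possible
        early_warning_eval: Callable[[object], float]  # returns a capability score in [0, 1]
        alert_threshold: float  # score at which the early warning fires, below the CCL itself

    def run_early_warning_evaluations(model, ccls: List[CriticalCapabilityLevel]) -> List[str]:
        """Evaluate the model against every CCL and return the domains that need mitigation."""
        triggered = []
        for ccl in ccls:
            if ccl.early_warning_eval(model) >= ccl.alert_threshold:
                triggered.append(ccl.domain)
        return triggered

    def apply_mitigation_plan(domains: List[str]) -> None:
        """Placeholder: escalate security and deployment mitigations for the flagged domains."""
        for domain in domains:
            print(f"Early warning in {domain}: raise security and deployment mitigation levels")

In practice, which mitigations to apply and how strongly would depend on the deployment context, as the next section describes.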

2. Two types of mitigation measures manage critical capabilities, and four areas are most likely to pose serious risks

The Frontier Safety Framework proposes two types of mitigations: security mitigations, which prevent model weights from being exfiltrated, and deployment mitigations, which manage access to critical capabilities in deployed models and restrict their expression.

For each type of mitigation, Google DeepMind has defined several levels, allowing the strength of the measures to be adjusted to the severity of the risk.

A table in the technical report describes the security mitigation levels that can be applied to model weights to prevent their exfiltration.


According to the technical report, leaked model weights could allow the removal of any safeguards trained into or deployed with the model and, as a result, give anyone, including bad actors, access to any critical capabilities the model has.

Higher security mitigation levels protect model weights more effectively and manage critical capabilities more tightly, but they can also slow innovation and reduce how widely those capabilities are accessible.

A second table describes the deployment mitigation levels used to manage access to critical capabilities in deployed models and restrict their expression.


Misuse of a critical capability can be harder or easier to distinguish from beneficial use, and the overall risk of misuse varies with the deployment context. The mitigation options listed are therefore illustrative and need to be adjusted for different use cases and risks.
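
As a rough illustration of how these two dimensions might be represented, the sketch below encodes hypothetical security and deployment mitigation levels as simple lookup tables. The level numbers, names, and controls are invented for this example; the actual levels are those defined in the technical report's tables.

    # Hypothetical encoding of the two mitigation dimensions; the levels and
    # controls below are invented for illustration, not taken from the report.
    SECURITY_MITIGATIONS = {
        # higher level = stronger protection of model weights against exfiltration
        0: "standard industry-grade access controls on model weights",
        1: "restricted internal access plus hardened storage and monitoring",
        2: "need-to-know access, hardware-backed protections, insider-threat controls",
    }

    DEPLOYMENT_MITIGATIONS = {
        # higher level = tighter management of how a critical capability can be used
        0: "safety fine-tuning and misuse filtering at release",
        1: "capability-specific restrictions and a staged, monitored rollout",
        2: "no external deployment of the capability until risks are addressed",
    }

    def select_mitigations(weight_leak_risk: int, misuse_risk: int) -> tuple:
        """Pick one level per dimension; in practice this depends on the deployment context."""
        return SECURITY_MITIGATIONS[weight_leak_risk], DEPLOYMENT_MITIGATIONS[misuse_risk]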

Google DeepMind's initial research suggests that the capabilities of future foundation models are most likely to pose severe risks in four areas: autonomy, biosecurity, cybersecurity, and machine learning R&D.

For autonomy, cybersecurity, and biosecurity, the main goal is to assess the degree to which threat actors could use a model with advanced capabilities to carry out harmful activities with severe consequences.

For machine learning R&D, the focus is on whether a model with such capabilities could enable the proliferation of models with other critical capabilities, or allow AI capabilities to escalate rapidly and unmanageably.

The technical report details an initial set of CCLs identified through a preliminary analysis of the autonomy, biosecurity, cybersecurity, and machine learning R&D risk domains.


Google DeepMind expects these CCLs to evolve with further research, and to add CCLs at higher levels or in other risk domains.

Conclusion: Adhere to AI principles and regularly review and improve the framework

The research behind the framework is still in its infancy and is progressing rapidly. Google DeepMind has invested heavily in its Frontier Safety team, which coordinates the cross-functional work behind the framework and is responsible for advancing the science of frontier risk assessment and refining the framework as knowledge improves.

The team has developed a suite of evaluations to assess the risks of critical capabilities, with particular emphasis on autonomous large language model agents, and has piloted them on Google's state-of-the-art models.

A recent paper describing these evaluations also explores mechanisms that could form part of a future "early warning system". These include technical approaches for assessing how close a model is to succeeding at a task it currently cannot do, as well as forecasts of future capabilities by a team of expert forecasters.
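
As a toy illustration of that "distance to success" idea, one could break a task the model cannot yet complete into ordered milestones, measure how far the model typically gets, and watch that fraction trend upward across model generations. The sketch below is an assumption-laden simplification written for this article, not DeepMind's actual method; the milestone functions and the naive extrapolation rule are invented.

    # Toy sketch of a "distance to success" early warning signal; not DeepMind's method.
    from typing import Callable, List, Sequence

    def milestone_progress(model, milestones: Sequence[Callable[[object], bool]]) -> float:
        """Fraction of ordered task milestones the model clears before its first failure."""
        completed = 0
        for milestone in milestones:
            if not milestone(model):
                break
            completed += 1
        return completed / len(milestones)

    def early_warning(progress_history: List[float], warn_at: float = 0.8) -> bool:
        """Flag the task if naive extrapolation of recent progress crosses the warning level."""
        if len(progress_history) < 2:
            return False
        projected = progress_history[-1] + (progress_history[-1] - progress_history[-2])
        return projected >= warn_at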

In keeping with Google's AI Principles, Google DeepMind will regularly review and improve the Frontier Safety Framework, progressively deepen its understanding of risk domains, CCLs, and deployment contexts, and continue to calibrate the specific mitigations for each CCL.

Google DeepMind hopes to work with industry, academia, and government to develop and refine the framework and to agree on standards and best practices for evaluating the safety of future generations of AI models.

Source: Google DeepMind
