
Mainstream large language model threat detection capability evaluation: GPT-4 is inferior to open source models

Author: GoUpSec

How well do the most mainstream generative AI large language models on the market, such as ChatGPT, Google Bard, and LLaMA-based open source models, perform at threat detection, classification, and prediction?

Cybersecurity is the largest market for AI applications, and threat detection/classification is one of the hottest security use cases for AI: generative large language models can identify and analyze potential security threats faster and at greater scale than human security analysts. So how do the mainstream large language models on the market actually perform at threat detection and classification?

Recently, while developing an AI threat analysis framework designed to substantially improve the accuracy of threat identification and to improve itself over time (the framework uses a voting system and confidence scores to combine threat detection results from multiple large models into a security analysis ensemble that is more robust and reliable than any single large language model), Skyhawk evaluated the threat detection and prediction capabilities of the mainstream large language models.

Threat detection accuracy: GPT-4 is no match for an open source model

Skyhawk's large language model benchmarking methodology is as follows:

Models: ChatGPT, Google Bard, and LLaMA 2-based open source models

Test content:

  • Task: Determine the maliciousness level of aggregated event sequences extracted from cloud logs, such as AWS logs.
  • Evaluation metrics: Precision, recall, and F1 score.
  • Thresholds: An optimal threshold converts each model's scores into classification outcomes.
  • Leaderboard: Assessment results and rankings are available on the Skyhawk website.
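The evaluation loop described above can be sketched in a few lines: each model emits a maliciousness score per event sequence, a threshold converts scores into malicious/benign verdicts, and precision/recall/F1 are computed against human labels. All names and numbers below are illustrative, not Skyhawk's actual data or code.

```python
# Sketch of the benchmark's metric computation (hypothetical data).

def precision_recall_f1(y_true, y_pred):
    # Standard binary-classification metrics over 0/1 labels.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def best_threshold(y_true, scores, candidates):
    # The "optimal threshold" step: pick the cutoff that maximizes F1
    # on the human-labeled sample.
    return max(candidates, key=lambda th: precision_recall_f1(
        y_true, [1 if s >= th else 0 for s in scores])[2])

labels = [1, 0, 1, 1, 0, 0, 1, 0]                   # human-assigned ground truth
scores = [0.9, 0.2, 0.7, 0.4, 0.6, 0.1, 0.8, 0.3]   # one model's maliciousness scores
th = best_threshold(labels, scores, [i / 10 for i in range(1, 10)])
preds = [1 if s >= th else 0 for s in scores]
print(th, precision_recall_f1(labels, preds))
```

In a real benchmark the threshold would be tuned per model, so that each model is compared at its own best operating point rather than at an arbitrary fixed cutoff.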

The test results (below) show that GPT-4 has the highest F1 score, but, surprisingly, it is less accurate than the open source Llama-2-70B-LoRA-assemble-v2. The much-hyped Google Bard ranks second to last in accuracy, lagging far behind even GPT-3.5-turbo:

[Figure: Skyhawk's benchmark results for large language model threat detection]

The evaluation is based on a representative sample of 200 human-labeled sequences. Source: Skyhawk

The significance of ranking large language models by threat detection capability

Benchmarking the threat detection and classification capabilities of mainstream large language models is the foundation of AI-enhanced network security. When developing a "federated learning" or "ensemble learning" threat analysis framework composed of multiple large language models, researchers need to quantitatively measure and rank each model's threat detection aptitude and potential. Those rankings determine the weight each model carries in the threat classification and scoring process, which in turn optimizes the classification pipeline and improves accuracy and efficiency.
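One simple way to turn benchmark rankings into ensemble weights, as described above, is to normalize each model's benchmark score so that stronger models count more in the vote. The model names and F1 values below are hypothetical placeholders, not the published leaderboard figures.

```python
# Hypothetical per-model F1 scores from a benchmark leaderboard.
benchmark_f1 = {
    "gpt-4": 0.91,
    "llama-2-70b-lora": 0.88,
    "gpt-3.5-turbo": 0.80,
    "bard": 0.72,
}

def normalized_weights(scores):
    # Convert raw benchmark scores into voting weights that sum to 1,
    # so each model's influence is proportional to its measured skill.
    total = sum(scores.values())
    return {name: s / total for name, s in scores.items()}

weights = normalized_weights(benchmark_f1)
print(weights)
```

More elaborate schemes (e.g., softmax over scores, or per-threat-category weights) follow the same pattern; the key point from the article is that the weights come from quantitative benchmarking rather than intuition.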

Benefits of a multi-model ensemble learning framework

The digital landscape is constantly evolving, and the complexity of cloud security keeps increasing. In this dynamic environment, pinpointing and assessing the risks associated with cloud events is increasingly challenging, especially when the evidence sits on the borderline between malicious and benign. Traditional manual threat detection and machine learning methods often cannot provide the detection accuracy and insight needed.

To address this challenge, many cybersecurity vendors use large language models as virtual security analysts that score the maliciousness of each set of event information. This approach relies on "ensemble learning" (as opposed to federated learning): multiple large language models work together, using a voting system and confidence scores derived from the disagreement among their results to form a more powerful multi-model security analysis framework (figure below):

[Figure: multi-model ensemble learning framework for security analysis]
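The voting-with-confidence step described above can be sketched as follows. Each model returns a malicious/benign verdict plus a confidence, and the ensemble takes a confidence-weighted vote; the function name, threshold, and votes are illustrative assumptions, not the vendor's actual implementation.

```python
# Sketch of confidence-weighted ensemble voting (hypothetical values).

def ensemble_verdict(votes, threshold=0.5):
    # votes: list of (is_malicious: bool, confidence in [0, 1]) per model.
    total = sum(conf for _, conf in votes)
    if total == 0:
        return False, 0.0
    # Fraction of total confidence mass behind the "malicious" verdict.
    malicious_mass = sum(conf for is_mal, conf in votes if is_mal)
    score = malicious_mass / total
    return score >= threshold, score

# Two models flag the event sequence as malicious with high confidence;
# one disagrees with lower confidence, so the ensemble flags it.
votes = [(True, 0.9), (True, 0.6), (False, 0.4)]
verdict, score = ensemble_verdict(votes)
print(verdict, score)
```

In practice the confidences would themselves be weighted by each model's benchmark ranking, which is why the quantitative evaluation above matters for the ensemble design.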

Compared to existing ML ensemble frameworks such as bagging and boosting, the new framework has several advantages, including:

  • Improved generalization: Ability to learn and adapt from the findings and errors of "primary" models, resulting in better prediction accuracy, especially in complex or noisy datasets.
  • Model interpretability: Provides a more precise and understandable representation of the decision-making process.
  • Robustness: Greater resilience to outliers and adversarial attacks, with less overfitting and better data quality management.
  • Efficiency: May offer computational benefits with large datasets or in resource-constrained environments; not having to run every model individually (unlike stacking and bagging) is an advantage.
  • Flexibility: Ability to effectively integrate various "primary" models as well as human-driven insights to meet different problem types.
  • Incremental learning: Promote continuous adaptation and refinement based on the distribution of data over time.
  • Reduce bias: Take a multifaceted approach to reduce prediction bias and ensure more balanced and fair results.

The researchers point out that the multi-model ensemble framework not only outperforms any single large language model, but also significantly improves on a simple ensemble baseline that aggregates results by mean and variance.
