
Mainstream large language model threat detection capability evaluation: GPT-4 is inferior to open source models

Author: GoUpSec

How well do the most mainstream generative AI large language models on the market, such as ChatGPT, Google Bard, and LLaMA-based open source models, perform at threat detection, classification, and prediction?

Cybersecurity is the largest market for AI applications, and threat detection/classification is one of the hottest security use cases for AI: generative large language models can identify and analyze potential security threats faster and at greater scale than human security analysts. So how do the mainstream large language models on the market actually perform at threat detection and classification?

Recently, while developing an AI threat analysis framework designed to substantially improve the accuracy of threat identification and to improve itself over time (the framework uses a voting system and confidence scores to combine threat detection results from multiple large models into a security analysis ensemble that is more robust and reliable than any single large language model), Skyhawk evaluated the threat detection and prediction capabilities of the mainstream large language models.

Threat detection accuracy: GPT-4 is no match for an open source model

Skyhawk's large language model benchmarking methodology is as follows:

Models: ChatGPT, Google Bard, and LLaMA 2-based open source models

Test content:

  • Task: Determine the maliciousness level of aggregated event sequences extracted from cloud logs, such as AWS logs.
  • Evaluation metrics: Precision, recall, and F1 score.
  • Thresholds: An optimal threshold converts each model's scores into classification outcomes.
  • Leaderboard: Assessment results and rankings are available on the Skyhawk website.
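The evaluation loop described above can be sketched in a few lines: each model emits a maliciousness score per event sequence, a threshold converts scores into malicious/benign verdicts, and precision/recall/F1 are computed against human labels. All names and numbers below are illustrative, not Skyhawk's actual data or code.

```python
# Sketch of the benchmark's metric computation (hypothetical data).

def precision_recall_f1(y_true, y_pred):
    # Standard binary-classification metrics over 0/1 labels.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def best_threshold(y_true, scores, candidates):
    # The "optimal threshold" step: pick the cutoff that maximizes F1
    # on the human-labeled sample.
    return max(candidates, key=lambda th: precision_recall_f1(
        y_true, [1 if s >= th else 0 for s in scores])[2])

labels = [1, 0, 1, 1, 0, 0, 1, 0]                   # human-assigned ground truth
scores = [0.9, 0.2, 0.7, 0.4, 0.6, 0.1, 0.8, 0.3]   # one model's maliciousness scores
th = best_threshold(labels, scores, [i / 10 for i in range(1, 10)])
preds = [1 if s >= th else 0 for s in scores]
print(th, precision_recall_f1(labels, preds))
```

In a real benchmark the threshold would be tuned per model, so that each model is compared at its own best operating point rather than at an arbitrary fixed cutoff.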

The test results (below) show that GPT-4 has the highest F1 score, but, surprisingly, it is less accurate than the open source Llama-2-70B-LoRA-assemble-v2. The much-hyped Google Bard ranks second to last in accuracy, lagging far behind even GPT-3.5-turbo:

[Figure: Skyhawk's benchmark results for large language model threat detection]

The evaluation is based on a representative sample of 200 human-labeled sequences. Source: Skyhawk

The significance of ranking large language models by threat detection capability

Benchmarking the threat detection and classification capabilities of mainstream large language models is the foundation of AI-enhanced network security. When developing a "federated learning" or "ensemble learning" threat analysis framework composed of multiple large language models, researchers need to quantitatively measure and rank each model's threat detection aptitude and potential. Those rankings determine the weight each model carries in the threat classification and scoring process, which in turn optimizes the classification pipeline and improves accuracy and efficiency.
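One simple way to turn benchmark rankings into ensemble weights, as described above, is to normalize each model's benchmark score so that stronger models count more in the vote. The model names and F1 values below are hypothetical placeholders, not the published leaderboard figures.

```python
# Hypothetical per-model F1 scores from a benchmark leaderboard.
benchmark_f1 = {
    "gpt-4": 0.91,
    "llama-2-70b-lora": 0.88,
    "gpt-3.5-turbo": 0.80,
    "bard": 0.72,
}

def normalized_weights(scores):
    # Convert raw benchmark scores into voting weights that sum to 1,
    # so each model's influence is proportional to its measured skill.
    total = sum(scores.values())
    return {name: s / total for name, s in scores.items()}

weights = normalized_weights(benchmark_f1)
print(weights)
```

More elaborate schemes (e.g., softmax over scores, or per-threat-category weights) follow the same pattern; the key point from the article is that the weights come from quantitative benchmarking rather than intuition.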

Benefits of a multi-model ensemble learning framework

The digital landscape is constantly evolving, and the complexity of cloud security keeps increasing. In this dynamic environment, pinpointing and assessing the risks associated with cloud events is increasingly challenging, especially when the evidence sits on the borderline between malicious and benign. Traditional manual threat detection and machine learning methods often cannot provide the detection accuracy and insight needed.

To address this challenge, many cybersecurity vendors use large language models as virtual security analysts that score the maliciousness of each set of event information. This approach relies on "ensemble learning" (as opposed to federated learning): multiple large language models work together, using a voting system and confidence scores derived from the disagreement among their results to form a more powerful multi-model security analysis framework (figure below):

[Figure: multi-model ensemble learning framework for security analysis]
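The voting-with-confidence step described above can be sketched as follows. Each model returns a malicious/benign verdict plus a confidence, and the ensemble takes a confidence-weighted vote; the function name, threshold, and votes are illustrative assumptions, not the vendor's actual implementation.

```python
# Sketch of confidence-weighted ensemble voting (hypothetical values).

def ensemble_verdict(votes, threshold=0.5):
    # votes: list of (is_malicious: bool, confidence in [0, 1]) per model.
    total = sum(conf for _, conf in votes)
    if total == 0:
        return False, 0.0
    # Fraction of total confidence mass behind the "malicious" verdict.
    malicious_mass = sum(conf for is_mal, conf in votes if is_mal)
    score = malicious_mass / total
    return score >= threshold, score

# Two models flag the event sequence as malicious with high confidence;
# one disagrees with lower confidence, so the ensemble flags it.
votes = [(True, 0.9), (True, 0.6), (False, 0.4)]
verdict, score = ensemble_verdict(votes)
print(verdict, score)
```

In practice the confidences would themselves be weighted by each model's benchmark ranking, which is why the quantitative evaluation above matters for the ensemble design.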

Compared to existing ML ensemble frameworks such as bagging and boosting, the new framework has several advantages, including:

  • Improved generalization: Ability to learn and adapt from the findings and errors of "primary" models, resulting in better prediction accuracy, especially in complex or noisy datasets.
  • Model interpretability: Provides a more precise and understandable representation of the decision-making process.
  • Robustness: Greater resilience to outliers and adversarial attacks, with less overfitting and better data quality management.
  • Efficiency: May offer computational benefits with large datasets or in resource-constrained environments; not having to run every model individually (unlike stacking and bagging) is an advantage.
  • Flexibility: Ability to effectively integrate various "primary" models as well as human-driven insights to meet different problem types.
  • Incremental learning: Promote continuous adaptation and refinement based on the distribution of data over time.
  • Reduce bias: Take a multifaceted approach to reduce prediction bias and ensure more balanced and fair results.

The researchers point out that the multi-model ensemble framework not only outperforms any single large language model, but also significantly improves on a simple ensemble baseline that aggregates results by mean and variance.
