
Hugging Face Releases Open Medical-LLM, a Medical Task Evaluation Benchmark

Author: Webmaster's Home (ChinaZ.com)

Highlights:

⭐️ Hugging Face has released a new medical task assessment benchmark designed to test the performance of generative AI models on health-related tasks.

⭐️ The Open Medical-LLM benchmark combines existing test sets covering multiple areas of medicine, such as anatomy, pharmacology, genetics, and clinical practice.

⭐️ Some medical experts have cautioned against over-reliance on Open Medical-LLM, pointing to the large gap between answering medical questions and actual clinical practice, and stressing that benchmark results are no substitute for real-world testing.

Webmaster's Home (ChinaZ.com) April 19 News: Hugging Face has released a new benchmark called Open Medical-LLM, which aims to evaluate the performance of generative AI models on health-related tasks.


The benchmark was created by Hugging Face in collaboration with researchers from the nonprofit Open Life Science AI and the University of Edinburgh's Natural Language Processing Group. The goal of Open Medical-LLM is to provide a standardized way to evaluate the performance of generative AI models across a range of medical tasks.


Rather than being built from scratch, Open Medical-LLM stitches together existing test sets (e.g., MedQA, PubMedQA, and MedMCQA) covering multiple medical fields such as anatomy, pharmacology, genetics, and clinical practice. The benchmark contains both multiple-choice and open-ended questions that require medical reasoning and understanding, drawing on U.S. and Indian medical licensing exams as well as college biology question banks.
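To make that setup concrete, the sketch below shows one common way such multiple-choice sets are scored: compute the model's log-likelihood of each answer option given the question, and pick the highest-scoring option. This is a minimal illustration, not the leaderboard's own pipeline; the MedMCQA dataset ID (`openlifescienceai/medmcqa`) and its field names (`opa`–`opd`, `cop`) are assumptions about the Hub copy, and `gpt2` is just a stand-in model.

```python
# Minimal sketch: score a causal LM on MedMCQA-style multiple-choice questions
# by comparing the log-likelihood of each answer option. Dataset ID and field
# names are assumptions; the actual leaderboard runs a fuller evaluation pipeline.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; swap in any causal LM from the Hub
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# Assumed dataset ID; assumed fields: question, opa..opd (options), cop (correct index)
ds = load_dataset("openlifescienceai/medmcqa", split="validation[:50]")

def option_logprob(question: str, option: str) -> float:
    """Sum of token log-probs of `option`, conditioned on the question prompt."""
    prompt = f"Question: {question}\nAnswer:"
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + " " + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Position i predicts token i + 1, so shift by one when indexing.
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    option_ids = full_ids[0, prompt_len:]
    rows = torch.arange(prompt_len - 1, full_ids.shape[1] - 1)
    return logprobs[rows, option_ids].sum().item()

correct = 0
for ex in ds:
    options = [ex["opa"], ex["opb"], ex["opc"], ex["opd"]]
    scores = [option_logprob(ex["question"], opt) for opt in options]
    correct += int(scores.index(max(scores)) == ex["cop"])
print(f"accuracy on {len(ds)} questions: {correct / len(ds):.3f}")
```

Swapping in a stronger model and more examples gives a rough first read on medical QA accuracy, but as the criticism below makes clear, such a number says little about clinical usefulness.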

While Hugging Face presents the benchmark as a "sound assessment" of generative AI models for the medical community, some medical experts have warned on social media against reading too much into Open Medical-LLM, pointing to the large gap between answering medical questions and actual clinical practice. They emphasize that benchmark results are no substitute for careful testing under real-world conditions.


In response, Clémentine Fourrier, a research scientist at Hugging Face, said on social media that such rankings should only be used as a first approximation when exploring a specific use case, and that a deeper testing phase is needed to probe a model's limitations and relevance under real-world conditions. She added that medical models should never be used by patients on their own, but should instead be developed as support tools for doctors.

While benchmarks such as Open Medical-LLM are instructive, the leaderboard results also show how poorly models can perform on basic health questions. And no benchmark, Open Medical-LLM included, is a substitute for well-thought-out real-world testing. Google, for example, trialed an AI tool for diabetic retinopathy screening in Thailand's healthcare system; despite high accuracy in the lab, the tool performed poorly in the field, frustrating patients and nurses with inconsistent results and a poor fit with actual clinical practice.

To date, none of the 139 AI-powered medical devices approved by the U.S. Food and Drug Administration uses generative AI. Testing how a generative AI tool's lab performance translates to real-world conditions in hospitals and outpatient clinics, and how its results trend over time, remains challenging.

Official blog: https://huggingface.co/blog/leaderboard-medicalllm
