Introduction
Large language models (LLMs) have demonstrated remarkable capabilities across many domains, and healthcare is among the most promising and challenging areas of application. The potential of LLMs to assist in medical diagnosis, patient management, and clinical documentation has spurred significant interest and investment. However, as these models move from theoretical promise to real-world application, a critical question emerges: can large language models truly understand real medical records?
To address this question, a team from Harvard Medical School and Brigham and Women’s Hospital (BWH) YLab, in collaboration with multiple institutions including the University of Illinois at Urbana-Champaign (UIUC), MIT, Stanford, and the Mayo Clinic, has introduced BRIDGE, the first large-scale, multilingual benchmark designed specifically for evaluating large language models on real clinical texts. The benchmark represents a significant step toward assessing the practical utility of LLMs in healthcare settings.
The BRIDGE Benchmark: A New Frontier in Medical LLM Evaluation
What is BRIDGE?
BRIDGE, or the Benchmark for Real-world Integrated Diagnostic Generative Evaluation, is a comprehensive evaluation framework comprising 87 real-world electronic health record (EHR) tasks across nine languages. It assesses 65 of the most advanced large language models, making it one of the most extensive evaluations of LLM performance in medical applications to date.
Why BRIDGE Matters
The significance of BRIDGE lies in its focus on real clinical texts, which differ substantially from the structured, standardized questions found in medical licensing exams. Real-world clinical texts are riddled with abbreviations, clinical jargon, patient colloquialisms, misspellings, and a mix of templated and free-form inputs. These characteristics create a high-noise, low-structure environment that challenges even the most advanced LLMs.
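To make this concrete, here is a toy illustration of the kind of shorthand that appears in real notes, along with a naive normalization pass. The note and the abbreviation dictionary are invented for demonstration and are not drawn from BRIDGE's data.

```python
# Toy illustration (not drawn from BRIDGE's data): a synthetic note full of
# clinical shorthand, plus a naive normalization pass. The note and the
# abbreviation dictionary are invented for demonstration.
import re

ABBREVIATIONS = {
    "pt": "patient",
    "hx": "history",
    "sob": "shortness of breath",
    "htn": "hypertension",
    "bid": "twice daily",
}

note = "Pt c/o SOB x3 days. Hx of HTN. Started lisinopril 10mg BID."

def expand_abbreviations(text: str) -> str:
    """Replace known shorthand tokens with long forms, case-insensitively."""
    pattern = r"\b(" + "|".join(ABBREVIATIONS) + r")\b"
    return re.sub(
        pattern,
        lambda m: ABBREVIATIONS[m.group(0).lower()],
        text,
        flags=re.IGNORECASE,
    )

print(expand_abbreviations(note))
# -> "patient c/o shortness of breath x3 days. history of hypertension. ..."
# Note the residue ("c/o", "x3 days") that a fixed dictionary cannot resolve.
```

Even this simple pass leaves unresolved shorthand behind, which is precisely the high-noise, low-structure environment the benchmark targets.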
The Need for Real-World Evaluation
While LLMs like GPT-4 and Med-PaLM-1/2 have achieved expert-level scores on the United States Medical Licensing Examination (USMLE), these successes do not directly translate to clinical proficiency. The structured nature of exam questions contrasts sharply with the unpredictable and varied nature of real medical records. This disparity highlights the necessity for benchmarks like BRIDGE that evaluate models in the context of actual clinical tasks.
Constructing BRIDGE: A Multifaceted Approach
Task Design
The creation of BRIDGE involved meticulous design and curation of tasks that reflect the complexities of real-world medical records. The 87 tasks encompass a wide range of clinical activities, including the following (a sketch of one possible task representation appears after the list):
- Diagnostic reasoning
- Treatment planning
- Summarization of patient histories
- Identification of clinical abnormalities
- Translation of medical jargon into patient-friendly language
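As a minimal sketch, a single task instance might be represented as below. The field names and example values are illustrative assumptions, not BRIDGE's published schema.

```python
# Minimal sketch of one possible task-instance representation. Field names
# and example values are illustrative assumptions, not BRIDGE's actual schema.
from dataclasses import dataclass

@dataclass
class ClinicalTask:
    task_id: str      # hypothetical identifier, e.g. "summarization-en-042"
    task_type: str    # one of the clinical activities listed above
    language: str     # ISO 639-1 code for the source text
    input_text: str   # de-identified clinical note or excerpt
    reference: str    # gold-standard answer used for scoring

example = ClinicalTask(
    task_id="summarization-en-042",
    task_type="summarization",
    language="en",
    input_text="Admission note: 67yo M presents w/ chest pain, hx of HTN ...",
    reference="67-year-old male admitted for evaluation of chest pain.",
)
```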
Multilingual Diversity
One of the unique features of BRIDGE is its coverage of nine languages: English, Spanish, Chinese, Hindi, Arabic, Russian, Portuguese, French, and German. This multilingual scope makes the benchmark applicable across diverse healthcare settings, reflecting the global nature of medical practice.
Model Assessment
BRIDGE evaluates 65 state-of-the-art large language models, including the following (a generic evaluation loop is sketched after the list):
- GPT-4
- Med-PaLM-1/2
- BERT-based models
- RoBERTa
- XLNet
- T5
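The sketch below shows what a generic evaluation loop over such tasks could look like. Here `query_model` is a placeholder for whatever inference API a given model exposes, not a real client library, and the task objects follow the hypothetical `ClinicalTask` sketch above.

```python
# Hedged sketch of a generic evaluation loop. `query_model` is a placeholder
# for whatever inference API a given model exposes (hosted endpoint, local
# weights, etc.); it is not a real library call. `tasks` holds objects like
# the ClinicalTask sketch above.
from typing import Callable

def evaluate_model(
    query_model: Callable[[str], str],  # prompt -> model output (placeholder)
    tasks: list,
) -> dict:
    """Run one model over every task and collect its raw predictions."""
    predictions = {}
    for task in tasks:
        prompt = f"[{task.task_type} | {task.language}]\n{task.input_text}"
        predictions[task.task_id] = query_model(prompt)
    return predictions

# Usage with a trivial stand-in model:
preds = evaluate_model(lambda prompt: "(model output)", [example])
```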
The evaluation criteria include the following (textbook definitions of the core classification metrics are shown after the list):
- Accuracy
- Precision
- Recall
- F1 Score
- Language Understanding
- Clinical Relevance
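The classification metrics above have standard textbook definitions; the snippet below computes them for a binary task. This is the conventional formulation, not the benchmark's actual scoring code, which this article does not reproduce.

```python
# Textbook definitions of precision, recall, and F1 for a binary task. This
# is the standard formulation, not the benchmark's actual scoring code.
def precision_recall_f1(y_true: list, y_pred: list) -> tuple:
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return precision, recall, f1

# Example: one false negative among three true positives.
print(precision_recall_f1([1, 0, 1, 1], [1, 0, 0, 1]))  # -> (1.0, 0.666..., 0.8)
```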
Key Findings and Implications
Performance Disparities
The results from BRIDGE reveal significant performance disparities among different models. While some models exhibit strong performance in specific tasks or languages, none demonstrate comprehensive proficiency across all tasks and languages. This variability underscores the need for continued model refinement and specialization for clinical applications.
Language-Specific Challenges
BRIDGE highlights unique challenges posed by different languages. For instance, models trained primarily on English data struggle with clinical texts in languages like Hindi and Arabic, which have distinct linguistic features and medical terminologies. This finding emphasizes the importance of multilingual training data and model adaptability.
Real-World Complexity
The benchmark underscores the complexity of real-world clinical texts. Models that perform well on structured exam questions often falter when confronted with the noise and variability of EHRs. This gap highlights the necessity of evaluating, and ultimately training, models on authentic clinical data rather than exam-style questions alone.