Optimizing Large Language Models for Scalable Clinical Operations
A recent study published in NPJ Digital Medicine explores how large language models (LLMs) can optimize clinical workflows, shedding new light on the economic and computational challenges of scaling these technologies across health systems. While LLMs have demonstrated the ability to extract and summarize data from electronic health records (EHRs), this study investigates how the models handle increasing task complexity, examining both their performance under high data loads and strategies for reducing costs at the enterprise level.
The researchers evaluated 10 LLMs, including high-capacity models such as GPT-4-turbo-128k and Llama-3-70B, by testing their ability to respond to more than 300,000 queries derived from real-world patient data. These tasks involved extracting factual, numerical, and temporal information from EHR notes while managing a growing number of simultaneous questions per prompt. Findings revealed that performance deteriorated as prompt sizes and question complexity increased; however, models with larger context windows demonstrated greater resilience. GPT-4-turbo-128k and Llama-3-70B maintained high accuracy across most tasks, effectively handling up to 50 concurrent questions. Smaller models showed declining accuracy and more frequent formatting errors, particularly under heavy computational burdens.
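To make this kind of stress test concrete, the sketch below bundles a batch of questions about a single note into one prompt and requests one answer per line. The OpenAI-style client, model name, and prompt wording are illustrative assumptions, not the authors' exact protocol.

from openai import OpenAI  # assumes the OpenAI Python SDK; any chat-completion API works similarly

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_batch(note: str, questions: list[str], model: str = "gpt-4-turbo") -> str:
    # Bundle one clinical note with a numbered batch of questions in a single
    # prompt, asking for one answer per line so responses can be parsed apart.
    numbered = "\n".join(f"{i + 1}. {q}" for i, q in enumerate(questions))
    prompt = (
        "Clinical note:\n"
        f"{note}\n\n"
        "Answer each question on its own line, prefixed by its number:\n"
        f"{numbered}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

Scaling the test then amounts to growing the question batch (the study went up to 50 questions per prompt) and measuring both answer accuracy and formatting failures in the parsed output.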
In addition to performance metrics, the study examined economic considerations. Using a "concatenation" strategy, in which multiple questions and notes are bundled into a single prompt, the researchers demonstrated significant cost savings: because the note text is transmitted once per bundle rather than repeated for every question, this approach reduced application programming interface (API) costs by up to 17-fold compared with querying notes and questions individually. For models such as GPT-4-turbo-128k, such savings are particularly relevant for large health systems that process millions of clinical documents annually. The findings suggest that, for population-scale tasks such as generating hospital resource reports or summarizing patient cases for shift handoffs, asynchronous query strategies could balance accuracy with affordability.
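The arithmetic behind the savings can be sketched with placeholder numbers. API billing is dominated by input tokens, so sending a long note once per bundle instead of once per question shrinks the bill roughly in proportion to the batch size. The token counts and price below are illustrative assumptions, not figures from the study.

# Illustrative cost comparison: individual prompts repeat the note for every
# question, while a concatenated prompt sends the note once. All numbers are
# hypothetical placeholders, not figures from the study.

NOTE_TOKENS = 3000         # assumed length of one EHR note, in tokens
QUESTION_TOKENS = 30       # assumed length of one question
PRICE_PER_1K_INPUT = 0.01  # assumed price in dollars per 1,000 input tokens

def individual_cost(n_questions: int) -> float:
    # Each question goes in its own prompt, so the note is billed n times.
    tokens = n_questions * (NOTE_TOKENS + QUESTION_TOKENS)
    return tokens / 1000 * PRICE_PER_1K_INPUT

def concatenated_cost(n_questions: int) -> float:
    # All questions share a single copy of the note.
    tokens = NOTE_TOKENS + n_questions * QUESTION_TOKENS
    return tokens / 1000 * PRICE_PER_1K_INPUT

for n in (1, 10, 50):
    ind, cat = individual_cost(n), concatenated_cost(n)
    print(f"{n:>2} questions: ${ind:.3f} individually vs ${cat:.3f} concatenated "
          f"({ind / cat:.1f}-fold saving)")

With these placeholder numbers the saving grows with the batch size; the study's reported figure of up to 17-fold reflects its actual note lengths, question mixes, and pricing.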
The researchers emphasize that further exploration of LLM integration is needed. While smaller models may provide a cost-effective option for narrower tasks, larger models are better equipped to handle the complexities of clinical data extraction and analysis. Strategies such as prompt optimization, improved attention mechanisms, and pre-processing techniques could further enhance model performance and scalability. The study advocates for real-world trials to evaluate how LLMs perform in live clinical environments, where factors such as computational time and operational efficiency are critical.
“In conclusion, our study demonstrates the value of this concatenation strategy for LLMs in real-world clinical settings and provides evidence for effective usage under different model complexities and burdens,” the researchers stated. “Future research should further explore different combinations of stressors and question types. As LLMs are continually refined, expanded, and newly developed, more model types should be included in this assessment.”
Reference
Klang E, Apakama D, Abbott EE, et al. A strategy for cost-effective large language model use at health system-scale. NPJ Digit Med. 2024;7(1):320. doi:10.1038/s41746-024-01315-1