The University of Pittsburgh has developed QUEST, a software tool that supports the human evaluation of large language models (LLMs) in healthcare. As generative AI becomes more integrated into healthcare, rigorous human assessment is needed to ensure that these models are safe and effective. QUEST organizes evaluation into three main phases: Planning, Implementation and Adjudication, and Scoring and Review. It guides users through goal setting, task identification, stakeholder consideration, and the definition of success criteria, with the aim of improving the reliability and effectiveness of LLMs used in healthcare.
Description
QUEST provides a structured framework for the human evaluation of LLMs in healthcare, organized into three phases: Planning, Implementation and Adjudication, and Scoring and Review. It helps users define evaluation goals, tasks, stakeholders, and success criteria, and offers tools for data collection, statistical analysis, and report generation. QUEST supports both open-source and proprietary LLMs. Dedicated training modules and expert panel reviews help keep evaluations consistent and standardized.
Applications
• Evaluation of large language models in healthcare
• AI safety and efficacy assessments
• Human evaluation frameworks
• Healthcare AI research
Advantages
QUEST offers a comprehensive, structured framework tailored specifically to the evaluation of LLMs in healthcare. Its standout features include healthcare-specific evaluation dimensions, a cyclical adjudication process, flexible LLM integration, evaluator training and standardization, advanced statistical tools, and automated reporting. Together, these features promote consistent, standardized evaluations, improve the reliability of evaluation outcomes, and turn complex evaluation data into detailed, accessible reports.
Invention Readiness
The QUEST software is currently at the prototype stage and is ready for further development and commercialization.
IP Status
Copyright