Using Large Language Models to Generate Educational Materials on Childhood Glaucoma
Published in: | American journal of ophthalmology 2024-09, Vol.265, p.28-38 |
---|---|
Main Authors: | , , , , , , , |
Format: | Article |
Language: | English |
Summary: | To evaluate the quality, readability, and accuracy of large language model (LLM)–generated patient education materials (PEMs) on childhood glaucoma, and their ability to improve the readability of existing online information.
Cross-sectional comparative study.
We evaluated the responses of ChatGPT-3.5, ChatGPT-4, and Bard to 3 separate prompts requesting that they write PEMs on “childhood glaucoma.” Prompt A required that PEMs be “easily understandable by the average American.” Prompt B required that PEMs be written “at a 6th-grade level using Simple Measure of Gobbledygook (SMOG) readability formula.” We then compared the responses’ quality (DISCERN questionnaire, Patient Education Materials Assessment Tool [PEMAT]), readability (SMOG, Flesch–Kincaid Grade Level [FKGL]), and accuracy (Likert Misinformation scale). To assess the improvement in readability of existing online information, Prompt C requested that each LLM rewrite 20 resources from a Google search for the keyword “childhood glaucoma” to the American Medical Association–recommended “6th-grade level.” Rewrites were compared on key metrics such as readability, complex words (≥3 syllables), and sentence count.
All 3 LLMs generated PEMs that were of high quality, understandability, and accuracy (DISCERN ≥4, ≥70% PEMAT understandability, Misinformation score = 1). Prompt B responses were more readable than Prompt A responses for all 3 LLMs (P ≤ .001). ChatGPT-4 generated the most readable PEMs compared to ChatGPT-3.5 and Bard (P ≤ .001). Although Prompt C responses showed a consistent reduction in mean SMOG and FKGL scores, only ChatGPT-4 achieved the specified 6th-grade reading level (4.8 ± 0.8 and 3.7 ± 1.9, respectively).
LLMs can serve as strong supplemental tools in generating high-quality, accurate, and novel PEMs, and improving the readability of existing PEMs on childhood glaucoma. |
---|---|
ISSN: | 0002-9394 1879-1891 |
DOI: | 10.1016/j.ajo.2024.04.004 |
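
The summary compares texts on SMOG, FKGL, complex words (≥3 syllables), and sentence count. As a rough illustration of how such metrics can be computed, the following Python sketch applies the published SMOG and Flesch–Kincaid Grade Level formulas. It is not the study's own tooling: the function names `count_syllables` and `readability` are hypothetical, and the syllable counter is a simple vowel-group heuristic (an assumption) that will miscount some words, so scores may differ from those of validated readability calculators.

```python
import math
import re


def count_syllables(word: str) -> int:
    """Approximate syllables by counting vowel groups (heuristic, an assumption)."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing 'e' as silent, except in endings such as "-le" or "-ee".
    if word.endswith("e") and not word.endswith(("le", "ee")) and count > 1:
        count -= 1
    return max(count, 1)


def readability(text: str) -> dict:
    """Return sentence count, complex-word count, SMOG, and FKGL for a text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")

    syllables = [count_syllables(w) for w in words]
    complex_words = sum(1 for n in syllables if n >= 3)  # words with >= 3 syllables
    n_sent, n_words = len(sentences), len(words)

    # SMOG grade: polysyllable count normalized to a 30-sentence sample.
    smog = 1.0430 * math.sqrt(complex_words * 30 / n_sent) + 3.1291
    # Flesch-Kincaid Grade Level: words per sentence and syllables per word.
    fkgl = 0.39 * n_words / n_sent + 11.8 * sum(syllables) / n_words - 15.59

    return {
        "sentence_count": n_sent,
        "complex_words": complex_words,
        "smog": smog,
        "fkgl": fkgl,
    }


if __name__ == "__main__":
    sample = "Glaucoma can damage the eye. Early treatment helps protect vision."
    for name, value in readability(sample).items():
        print(name, round(value, 2))
```

SMOG was designed for samples of 30 sentences, so values for very short passages such as the sample above should be read as indicative only.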