
Our study presents a head-to-head comparison of LLM-chatbot performance in addressing CVD prevention questions in two predominant languages, English and Chinese. Notably, the LLM-chatbots exhibited significant disparities in performance across languages, generally performing better with English prompts than with Chinese prompts. ChatGPT-4.0 outperformed ChatGPT-3.5 and Bard for English prompts, while ChatGPT-3.5 outperformed ChatGPT-4.0 and ERNIE for Chinese prompts. When evaluated for temporal improvement and self-checking capabilities, ChatGPT-4.0 and ERNIE exhibited substantial improvements in rectifying initially suboptimal responses with their updated iterations for English and Chinese prompts, respectively. Our findings highlight the promising capabilities of LLM-chatbots in addressing inquiries related to CVD prevention and their potential for future advancement in this field.
This study has important implications for CVD prevention. Individuals with health concerns have become increasingly engaged consumers of publicly available health information24, a trend that began with the advent of the digital age and has accelerated with the recent growth of telehealth programs and the augmentation of internet search functions25. The traditional model of patients seeking information from their primary care providers has historically been shown to enhance knowledge and understanding of cardiovascular risk factors, healthy behaviors, and preventive measures26. However, sole reliance on primary care practitioners to improve the population’s cardiovascular literacy poses inherent limitations, such as geographical and resource disparities and time constraints27,28,29, especially given the poor access to health care among underserved populations30. LLM-chatbots offer promising potential for delivering accurate knowledge and information to bridge these gaps. In this regard, our study provides valuable evidence on the utility of appropriate LLM-chatbots in promoting health literacy for CVD prevention.
Moreover, in the context of responding to CVD prevention queries, this study represents the first investigation comparing the performance of chatbots when prompted in Chinese17,18,19, a language used by ~20% of the global population. Baidu’s ERNIE, tailored to Chinese linguistic nuances31, displayed inherent strengths in distinct CVD prevention areas when compared with ChatGPT-3.5 and ChatGPT-4.0. Notably, ChatGPT often misidentified specific drug brand names in Chinese, mistakenly linking ‘诺欣妥’ (nuoxintuo, the Chinese trade name for sacubitril-valsartan) to unrelated drugs such as Norspan and Norinyl, suggesting a possible over-reliance on transliteration techniques for Chinese drug names. Although ERNIE excelled at drug name recognition, its overall competency across domains still fell short of the ChatGPT models for Chinese queries. While ERNIE was developed with the goal of improving access to health information32, especially in Chinese-speaking regions, our findings indicate that it did not surpass ChatGPT in performance. Contrary to our initial postulation, the Chinese-specific LLM performed worse than generic LLMs such as ChatGPT-4.0, suggesting that, despite being tailored to the Chinese language, current Chinese-specific LLMs may not have been trained as broadly as generic, English-dominant LLMs33. The observed performance disparities likely stem from the quality and availability of training datasets, a distinction particularly evident given the varying quality of guideline-based CVD prevention resources across the two languages. In addition, our findings demonstrated variability in response lengths across the LLMs. While longer responses may suggest a more comprehensive understanding of the query topic, this increased verbosity did not consistently translate into higher accuracy. For instance, among Chinese responses, a mean response length of 299 words was associated with an accuracy of 84%, whereas a mean length of 405 words corresponded to an only slightly higher accuracy of 85.3%. Nevertheless, the impact of response length on perceived accuracy warrants further evaluation.
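To illustrate why a length-accuracy gap of this size is unlikely to be meaningful, the snippet below runs a two-proportion z-test on the reported accuracy rates; the per-model question count (n = 75) is an assumption for illustration, not a figure from this study.

```python
# Minimal sketch: is the accuracy gap between the terser and more
# verbose model statistically detectable? The question count below
# (n = 75 per model) is a hypothetical assumption, not the study's figure.
from statsmodels.stats.proportion import proportions_ztest

n_questions = 75  # assumed number of questions per model
correct = [round(0.840 * n_questions),  # ~84.0% accuracy, mean 299 words
           round(0.853 * n_questions)]  # ~85.3% accuracy, mean 405 words

stat, p_value = proportions_ztest(count=correct, nobs=[n_questions] * 2)
print(f"z = {stat:.2f}, p = {p_value:.2f}")
# At this sample size the difference is far from significant (p >> 0.05),
# consistent with verbosity not conferring an accuracy advantage.
```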
Our assessment covered both the chatbots’ initial factual accuracy and their adeptness at refining suboptimal responses over time. Recent updates to ChatGPT-3.5, ChatGPT-4.0, and Bard have shown marked improvements, transitioning their responses from “inappropriate” or “borderline” to “appropriate”. Our findings are consistent with those of Johnson et al.34, who reported a significant improvement in accuracy scores over a 2-week period between evaluations. Collectively, these observations exemplify the rapidly advancing nature of LLMs and their considerable potential moving forward. Additionally, we examined the chatbots’ self-awareness of correctness by instructing them to review their own responses. Interestingly, ChatGPT-3.5, even in its updated form, correctly identified the accuracy of only 1 of 6 of its own responses, indicating that, even when explicitly prompted, LLM-chatbots may continue to relay inaccurate information. Moreover, gaps in the ability to improve over time relate not only to the availability and quality of training data but also to the availability and quality of ongoing interaction and feedback data. Thus, the LLM-chatbots most likely to demonstrate substantial performance gains over time are those that attract both sustained technical development and the most user feedback. Consequently, we are likely to see not only improvement in LLM-chatbot performance over time but also a widening gap between high and low performers, which will necessitate ongoing comparative studies of this type to characterize the magnitude, nature, and temporal trends of these gaps. Regarding Chinese prompts, ERNIE displayed significant improvements in refining suboptimal responses, effectively addressing 11 of 12 cases. Furthermore, ERNIE demonstrated a notable capacity for self-assessment, correctly judging the accuracy of its responses in 11 of 12 cases. It should be noted that ChatGPT has undergone more than ten updates35, and Baidu ERNIE has likewise received substantial updates36; the observed disparities in temporal improvement are therefore plausibly attributable to differences in the magnitude and pace of the updates each model received. Capitalizing on the promising ability of chatbots to self-check accuracy may require adjustments in user interaction patterns or enhancements to the chatbots’ built-in algorithmic checks37,38, especially for medical queries.
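As an illustration of the self-check procedure, the sketch below issues a follow-up prompt asking a model to grade its own earlier answer. This is a hedged example using the OpenAI Python SDK (v1.x); the prompt wording, rubric, and model name are assumptions, not the protocol used in this study.

```python
# Hedged sketch of a self-check prompt (illustrative only; the exact
# wording, rubric, and model name are assumptions, not this study's protocol).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def self_check(question: str, prior_answer: str, model: str = "gpt-4") -> str:
    """Ask the model to re-examine its own earlier answer."""
    messages = [
        {"role": "user", "content": question},
        {"role": "assistant", "content": prior_answer},
        {"role": "user", "content": (
            "Review your previous answer for factual accuracy against "
            "current cardiovascular prevention guidelines. State whether "
            "it is appropriate, borderline, or inappropriate, and correct "
            "any errors."
        )},
    ]
    response = client.chat.completions.create(model=model, messages=messages)
    return response.choices[0].message.content
```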
Several noteworthy strengths of our study contribute to the value and reliability of its findings. First, whereas many studies have focused primarily on ChatGPT-3.5, we expanded our scope to include ChatGPT-3.5, ChatGPT-4.0, Google Bard, and Baidu ERNIE (Chinese), providing a broader understanding of chatbot capabilities in CVD-related patient interactions. Second, our study involved systematic masking, randomization, and a wash-out period between grading sets. Each assessment was conducted by three experienced cardiologists, with a consensus approach guiding the establishment of the ground truth; these measures ensured the study’s robustness. Third, our team’s multilingual expertise allowed us to compare chatbot performance in both English and Chinese, offering a unique angle on AI-driven medical communication across major languages. Additionally, beyond assessing factual accuracy, we scrutinized response evolution and introduced a procedure to prompt self-assessment, highlighting potential avenues for improving AI responses in medical contexts. There are also limitations that merit further consideration. First, although we generated the questions with a guideline-based approach, they represent only a small subset of possible CVD prevention questions. Although we compared the LLMs’ responses under consistent conditions so that the impact of stochasticity was uniform across the selected LLMs, stochastic variation in responses cannot be eliminated completely. Thus, the generalizability of our findings to the entire spectrum of CVD prevention questions may be limited. Second, although we tested the models’ temporal improvement, most responses were generated between 24 April and 9 May 2023. As LLM-chatbots evolve at an unprecedented pace, continued research is needed to accommodate updated LLM iterations and other emerging LLMs such as Meta’s LLaMA and Anthropic’s Claude. Third, to reduce bias from language proficiency, the English and Chinese responses were assessed by independent panels of cardiologists, which may have led to differing guideline interpretations across regions; for example, Entresto is approved for the treatment of hypertension in China and Japan but not in the United States or Singapore. Thus, any direct comparison of the chatbots’ performance on English versus Chinese prompts should be interpreted with caution. Fourth, our findings indicate comparable performance between the chatbots, suggesting a smaller effect size than anticipated. This implies that our initial effect size estimate during the study design phase, set at 0.05, may have been optimistic, resulting in a potential underestimation of the required sample size.
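To make the sample size consideration concrete, the sketch below shows how sharply the required sample grows as the anticipated effect shrinks. It assumes a two-sided, two-group comparison at α = 0.05 with 80% power; this is a rough illustration, not a reproduction of the study’s actual power analysis.

```python
# Rough illustration (not this study's actual power analysis): required
# sample size per group for a two-sided, two-group comparison at
# alpha = 0.05 and 80% power, across a range of anticipated effect sizes.
from statsmodels.stats.power import NormalIndPower

analysis = NormalIndPower()
for effect_size in (0.5, 0.2, 0.05):
    n = analysis.solve_power(effect_size=effect_size, alpha=0.05, power=0.8,
                             alternative="two-sided")
    print(f"effect size {effect_size:>4}: ~{n:.0f} subjects per group")
# An anticipated effect of 0.05 demands thousands of observations per
# group; if the true effect is smaller still, a study sized for 0.05
# will be underpowered.
```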
In conclusion, ChatGPT-4.0 excels in responding to English-language queries related to CVD prevention, with a high accuracy rate of 97.3%. In contrast, all LLM-chatbots demonstrated moderate performance on Chinese-language queries, with accuracy rates ranging from 84% to 88%. Given the increasing accessibility of LLM-chatbots, they offer promising avenues for enhancing health literacy, particularly among underserved communities. Continuous comparative evaluations are crucial to further characterize the quality and limitations of the medical information disseminated by these chatbots across common languages.