A study presented at the European Respiratory Society (ERS) Congress in Vienna found that the chatbot ChatGPT outperformed trainee doctors in assessing complex cases of respiratory disease in children, such as cystic fibrosis, asthma, and chest infections. The research also showed that Google’s chatbot Bard performed better than the trainees in certain areas, while Microsoft’s Bing chatbot matched them in overall performance.

The findings suggest that large language models (LLMs) like these could help trainee doctors, nurses, and general practitioners triage patients more efficiently, potentially easing the strain on healthcare systems. The study was led by Dr. Manjith Narayanan, a consultant in pediatric pulmonology at the Royal Hospital for Children and Young People, Edinburgh, and a senior lecturer at the University of Edinburgh.

Dr. Narayanan and his team designed clinical scenarios based on real-life pediatric respiratory cases, covering common but challenging conditions such as cystic fibrosis, asthma, and sleep-disordered breathing. These cases had no clear diagnosis and no widely accepted treatment guidelines, which added to the difficulty of the task.

Ten trainee doctors, each with less than four months of clinical pediatric experience, were given an hour to solve each case; they were allowed to use the internet but not any chatbots. Their 200- to 400-word responses were then compared with those generated by ChatGPT (version 3.5), Bard, and Bing.

Six pediatric respiratory experts evaluated the responses on five criteria: correctness, comprehensiveness, usefulness, plausibility, and coherence. They also judged whether each response appeared human-generated and gave it an overall score out of 9.

  • ChatGPT: Achieved an average score of 7 out of 9, surpassing the trainee doctors. Experts found its responses to be more human-like than those from other chatbots.
  • Bard: Scored 6 out of 9, with particularly strong marks for coherence, although it performed similarly to the trainees in other aspects.
  • Bing: Matched the trainee doctors with a score of 4 out of 9, with experts consistently identifying its responses as non-human.

Dr. Narayanan emphasized that the study moves beyond testing LLMs’ ability to recall facts, focusing instead on how they can be applied to real-world clinical situations. Although they were not tested in direct patient-care roles, AI tools like ChatGPT could assist triage nurses, trainee doctors, and primary care physicians, who are often the first to assess patients.

Notably, there were no significant instances of “hallucinations” (a term for when an LLM generates false or irrelevant information) across the three chatbots. Even so, Dr. Narayanan stressed the need for caution, particularly the need for safeguards to prevent such errors in clinical applications.

The researchers plan to expand their study by testing LLMs against more experienced doctors and exploring newer, more advanced versions of these models.

Professor Hilary Pinnock, Chair of the ERS Education Council and a specialist in primary care respiratory medicine at the University of Edinburgh, described the study as both encouraging and slightly concerning. She stressed the importance of ensuring that AI tools like ChatGPT do not produce errors based on inaccurate or biased data. While AI holds promise for improving healthcare efficiency and outcomes, Pinnock emphasized that it must be tested extensively before being integrated into routine clinical practice.

This groundbreaking study signals a potential shift toward AI-supported healthcare, highlighting both the opportunities and challenges of incorporating these technologies into everyday medical practice.

By Impact Lab