Hypothesis / aims of study
This study aims to evaluate and compare the accuracy and completeness of responses generated by three AI chatbots (ChatGPT, Gemini, and DeepSeek) when prompted with patient-oriented questions about urinary tract infections (UTIs) in women. Responses were judged against evidence-based clinical guidelines and publications.
Study design, materials and methods
A cross-sectional design was employed. The researchers developed five standardized, patient-focused questions on UTI management based on recent evidence and authoritative guidelines. Each question was submitted individually to ChatGPT, Gemini, and DeepSeek in a private browser session. Two medical professionals independently rated each AI-generated response for accuracy (1–3 scale: Correct, Partially Correct, or Incorrect) and completeness (1–2 scale: Incomplete or Complete), comparing each response against the American Urological Association (AUA) guidelines. Inter-rater agreement, the percentage of responses on which both evaluators assigned the same rating, was calculated to assess consistency between raters.
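For illustration only, a minimal Python sketch of such a percent-agreement calculation, assuming one rating per response from each of two raters; the function name and all rating values are hypothetical and not taken from the study:

```python
# Minimal sketch: percent agreement between two raters.
# All rating values are hypothetical (1 = Correct, 2 = Partially Correct, 3 = Incorrect).

def percent_agreement(rater_a, rater_b):
    """Percentage of items on which both raters assigned the same rating."""
    assert len(rater_a) == len(rater_b)
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return 100.0 * matches / len(rater_a)

# Hypothetical accuracy ratings for one model's five responses.
rater_a = [1, 1, 2, 1, 1]
rater_b = [1, 1, 2, 1, 2]  # the raters disagree on the fifth response

print(f"Accuracy agreement: {percent_agreement(rater_a, rater_b):.1f}%")  # -> 80.0%
```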
Interpretation of results
Inter-rater agreement was high across all models. Overall agreement for accuracy was 86.7%, while completeness ratings had 100% agreement. DeepSeek demonstrated the highest consistency, with 100% agreement between evaluators on both accuracy and completeness. ChatGPT and Gemini each showed 80% agreement for accuracy but maintained full agreement for completeness.
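As an arithmetic check, and assuming the five questions per model described in the methods: 80% accuracy agreement corresponds to concordance on four of five responses per model, so the overall figure follows as (5 + 4 + 4) / 15 = 13/15 ≈ 86.7%.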