Can AI Guide Us? Comparing ChatGPT-4 and Deepseek in the Management of Postprostatectomy Incontinence

Pinto V1, Gaspar C1, Nascimento L1, Ataides R1, Alves P1, Pereira M2, Macedo Filho M3, Bessa Junior J4, Gomes C1

Research Type

Pure and Applied Science / Translational

Abstract Category

Urotechnology

Best in Category Prize: Urotechnology
Abstract 47
Urology 2 - Male Stress Urinary Incontinence
Scientific Podium Short Oral Session 4
Thursday 18th September 2025
12:15 - 12:22
Parallel Hall 2
Stress Urinary Incontinence, Male, Outcomes Research Methods
1. University of São Paulo School of Medicine, 2. Hospital do Servidor Público Estadual, 3. UNDB University Center, 4. State University of Feira de Santana

Abstract

Hypothesis / aims of study
Recent advances in artificial intelligence (AI) have led to increasingly sophisticated large language models, such as ChatGPT-4 and Deepseek, which are gaining popularity among healthcare professionals as potential clinical decision support tools. We aimed to compare the accuracy and clinical relevance of recommendations provided by ChatGPT-4 and Deepseek regarding the assessment and management of postprostatectomy urinary incontinence (PPUI).
Study design, materials and methods
A total of 20 questions were prepared by urologists with expertise in PPUI. All questions had uncontroversial answers based on the Incontinence after Prostate Treatment: AUA/SUFU Guideline. Ten were conceptual questions and ten were based on clinical cases, designed to evaluate the models’ ability to apply knowledge and critical thinking. All questions were submitted in English, anonymously (without IP identification), and separately to ChatGPT-4o and Deepseek. Each model was prompted to be specific and to limit its answers to 200 words for greater objectivity, and was not prompted to follow any specific guideline. Each question was entered as a separate, independent prompt using the “New Chat” function. AI-generated answers were independently analyzed by the experts who prepared the questions. The accuracy of each response was graded as (A) correct (1 point), (B) partially correct (0.5 points), or (C) incorrect (0 points).
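As a minimal sketch of how this point-based grading maps to the reported accuracies, the computation can be expressed as follows (illustrative only; the grade lists are hypothetical and not the study’s actual data):

    # Illustrative sketch of the grading scheme described above
    # (hypothetical grades; not the study's actual data).
    POINTS = {"A": 1.0, "B": 0.5, "C": 0.0}  # correct / partially correct / incorrect

    def accuracy(grades):
        """Percent accuracy: total points divided by number of questions."""
        return 100 * sum(POINTS[g] for g in grades) / len(grades)

    # Example grade lists consistent with the totals reported in Results:
    chatgpt_conceptual = ["A"] * 9 + ["C"]              # 9 of 10 points -> 90%
    deepseek_clinical  = ["A"] * 6 + ["B"] + ["C"] * 3  # 6.5 of 10 points -> 65%

    print(accuracy(chatgpt_conceptual))  # 90.0
    print(accuracy(deepseek_clinical))   # 65.0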
Results
ChatGPT achieved a global accuracy of 95% (19 of 20 points), with 90% accuracy on conceptual questions (9 correct answers) and 100% on clinical cases. Deepseek reached a global accuracy of 72.5% (14.5 of 20 points), with 80% accuracy on conceptual questions (8 correct answers) and 65% on clinical cases (6.5 points). Deepseek gave more partially correct answers and incorrect interpretations on questions addressing treatment options, complications, and special clinical situations. The Table shows examples of performance differences between the two AI models across various domains.
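A worked check of these figures, using the point totals stated above (N is the number of questions):

\[
\text{accuracy} = \frac{\text{total points}}{N} \times 100\%, \qquad
\text{ChatGPT: } \frac{9 + 10}{20} = 95\%, \qquad
\text{Deepseek: } \frac{8 + 6.5}{20} = 72.5\%
\]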
Interpretation of results
ChatGPT answered correctly across both question types, whereas Deepseek’s errors clustered in the clinical case questions, particularly those addressing treatment options, complications, and special clinical situations. This pattern suggests that ChatGPT applied guideline-based knowledge more consistently when questions required clinical judgment rather than factual recall, while Deepseek was more prone to partially correct or incorrect recommendations in complex scenarios.
Concluding message
Both AI tools demonstrated potential to support clinical reasoning in the management of PPUI. However, ChatGPT outperformed Deepseek in both accuracy and consistency, especially in complex clinical scenarios. Despite these promising results, careful human validation remains essential before AI-generated recommendations are incorporated into clinical practice.
Figure 1
Figure 2
Disclosures
Funding: None. Clinical Trial: No. Subjects: None.