Improving Uroflowmetry Interpretation: Effects of Standardization Sessions on Interobserver Agreement and AI Model Consistency

Plata M1, Azuero G1, Daza F1, Gutierrez A1, Garcia V1, Rojas-Rivillas M1, Gomez S2, Rodriguez T2, Cañar S2, Avila P2, Gonzalez S2, Florez S2, Giraldo L2

Research Type

Clinical

Abstract Category

Urodynamics

Abstract 551
Open Discussion ePosters
Scientific Open Discussion Session 105
Friday 19th September 2025
12:40 - 12:45 (ePoster Station 3)
Exhibition
Urodynamics Equipment, Urodynamics Techniques, Voiding Dysfunction
1. Hospital Universitario Fundacion Santa Fe de Bogota, 2. Universidad de los Andes

Abstract

Hypothesis / aims of study
Urodynamic studies (UDS) are essential for evaluating lower urinary tract dysfunction and guiding clinical decision-making (1). Despite efforts to promote best practices, standardization remains limited, and interobserver variability persists, even among experienced urologists (2). In a previous study, we assessed concordance levels in UDS interpretation among specialists, identifying key parameters with significant disagreement. This study aims to evaluate the impact of standardization sessions on interobserver agreement in uroflowmetry interpretation and to determine how improved concordance among specialists enhances both the predictive performance and consistency of AI models.
Study design, materials and methods
Weekly standardization sessions were held in which key urodynamic parameters were reviewed and discussed until consensus was reached. For uroflowmetry, normality was determined using the Liverpool nomogram, with values above the 25th percentile for men and above the 10th percentile for women considered normal. Curve morphology was classified according to the 2024 ICS standards (3). Studies with voided volumes above 500 mL or below 150 mL were considered equivocal.
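The decision rule described above can be sketched in a few lines. This is only an illustration: it assumes the nomogram percentile has already been looked up and is passed in as a number, that the voided-volume check takes precedence over the percentile check, and that values at or below the cut-off are labelled "abnormal". None of these details are stated in the abstract, and the function and constant names are hypothetical.

```python
from typing import Literal

Sex = Literal["male", "female"]

# Percentile cut-offs stated in the methods (Liverpool nomogram).
NORMAL_PERCENTILE = {"male": 25.0, "female": 10.0}
# Voided-volume window; studies outside it were considered equivocal.
MIN_VOLUME_ML = 150.0
MAX_VOLUME_ML = 500.0


def classify_uroflow(sex: Sex, percentile: float, voided_volume_ml: float) -> str:
    """Return 'equivocal', 'normal', or 'abnormal' for one uroflowmetry study.

    `percentile` is the nomogram percentile, assumed to be computed elsewhere.
    """
    # Assumption: the volume check is applied first.
    if voided_volume_ml < MIN_VOLUME_ML or voided_volume_ml > MAX_VOLUME_ML:
        return "equivocal"
    # Values strictly above the sex-specific cut-off count as normal.
    if percentile > NORMAL_PERCENTILE[sex]:
        return "normal"
    # Assumption: everything else is labelled abnormal.
    return "abnormal"


if __name__ == "__main__":
    print(classify_uroflow("male", 30.0, 300.0))
    print(classify_uroflow("female", 50.0, 120.0))
```

Encoding the thresholds as named constants makes the consensus criteria explicit, which is the kind of shared definition the sessions aimed to establish.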

To evaluate the impact of these sessions, we focused on uroflowmetry interpretation, which previously demonstrated moderate agreement (κ=0.5) for both normality and curve morphology. The same set of 50 UDS previously analyzed was re-evaluated, with three independent urologists reinterpreting only the uroflowmetry component. Interobserver agreement was measured using Fleiss’ kappa (κ). Additionally, intraobserver variability was assessed by comparing each urologist’s initial and post-standardization interpretations. Statistical significance was set at a p-value of < 0.05. Decision-tree-based AI models were trained for each urologist using pre- and post-standardization data, with F1-scores and feature importance calculated.
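For readers unfamiliar with the statistic, the following is a minimal sketch of how Fleiss' kappa can be computed for a fixed number of raters per study. It is a generic textbook implementation, not the authors' analysis code; the function name and the toy labels in the usage example are illustrative assumptions.

```python
from collections import Counter
from typing import Hashable, Sequence


def fleiss_kappa(ratings: Sequence[Sequence[Hashable]]) -> float:
    """Fleiss' kappa for subjects that are each rated by the same number of raters.

    ratings[i] holds the category labels assigned to subject i.
    """
    n_subjects = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2:
        raise ValueError("at least two ratings per subject are required")
    if any(len(row) != n_raters for row in ratings):
        raise ValueError("every subject needs the same number of ratings")

    category_totals: Counter = Counter()
    agreement_sum = 0.0
    for row in ratings:
        counts = Counter(row)
        category_totals.update(counts)
        # Proportion of agreeing rater pairs for this subject.
        agreement_sum += sum(c * (c - 1) for c in counts.values()) / (
            n_raters * (n_raters - 1)
        )

    p_observed = agreement_sum / n_subjects
    total = n_subjects * n_raters
    # Chance agreement from the overall share of each category.
    p_expected = sum((c / total) ** 2 for c in category_totals.values())
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_observed - p_expected) / (1.0 - p_expected)


if __name__ == "__main__":
    # Toy data: three raters, four studies (labels are made up).
    toy = [
        ["normal", "normal", "normal"],
        ["abnormal", "abnormal", "abnormal"],
        ["normal", "normal", "abnormal"],
        ["abnormal", "abnormal", "normal"],
    ]
    print(round(fleiss_kappa(toy), 3))
```

With three urologists and 50 tracings, `ratings` would have 50 rows of three labels each. Confidence intervals, as reported in the results, would need an additional step such as bootstrapping, which is not shown here.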
Results
All 50 uroflowmetry tracings were successfully interpreted by the three urologists. Interobserver agreement was moderate to substantial, with a kappa of 0.772 (0.668 - 0.872) for normality and 0.513 (0.372 - 0.648) for curve morphology. Intraobserver agreement varied widely among examiners, with kappa values ranging from 0.172 (0.015 - 0.328) to 0.671 (0.499 - 0.843) (Figure 1). Notably, standardization enhanced AI model performance and consistency: the mean F1-score increased from 0.62 to 0.67, and the models trained on each urologist's interpretations became more similar, with key predictive features aligning across specialists after standardization, reflecting the influence of improved interobserver agreement. Figure 2 illustrates the decision-tree model trained before and after standardization.
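As a reminder of what the F1-score measures, here is a small sketch of the binary F1 calculation and of averaging it across per-urologist models. The abstract does not state whether a binary, macro, or weighted F1 was used, so the binary form is an assumption, and all labels and names below are made up for illustration.

```python
from typing import Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = 2 * tp + fp + fn
    # Convention: return 0.0 when there are no positive labels or predictions.
    return 2 * tp / denom if denom else 0.0


def mean_f1(scores: Sequence[float]) -> float:
    """Average F1 across models, e.g. one model per urologist."""
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical labels: 1 = abnormal, 0 = normal.
    y_true = [1, 1, 0, 0, 1]
    y_pred = [1, 0, 0, 1, 1]
    print(round(f1_score(y_true, y_pred), 3))
```

In this reading, each urologist's labels serve as the ground truth for that urologist's model, so a higher mean F1 after standardization indicates that the labels became easier for a decision tree to reproduce.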
Interpretation of results
The results of this study suggest that standardization sessions improved interobserver agreement in the interpretation of uroflowmetry. Specifically, the kappa for defining normality rose to 0.772 (substantial agreement), a notable improvement over our previous study, in which agreement was moderate (κ=0.5). The kappa for curve morphology remained moderate (0.513), only slightly above its baseline (κ=0.5).
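The verbal labels used for kappa in this abstract (moderate, substantial) are consistent with the widely cited Landis and Koch bands. The sketch below maps a kappa value to those bands; that the authors used exactly this scale is an assumption, and the function name is hypothetical.

```python
def kappa_label(kappa: float) -> str:
    """Map a kappa value to its Landis and Koch (1977) descriptive band."""
    if kappa > 1.0:
        raise ValueError("kappa cannot exceed 1")
    if kappa < 0.0:
        return "poor"
    # Upper bounds of each band, checked in ascending order.
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
        (1.00, "almost perfect"),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label
    raise ValueError("unreachable for kappa in [0, 1]")


if __name__ == "__main__":
    for value in (0.772, 0.513, 0.172, 0.671):
        print(value, kappa_label(value))
```

Under this scale the reported normality kappa of 0.772 is substantial, the curve-morphology kappa of 0.513 is moderate, and the intraobserver values of 0.172 and 0.671 span slight to substantial.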

Intraobserver agreement varied widely, suggesting that individual urologists changed how they interpreted uroflowmetry between the initial and post-standardization readings. This change may reflect the impact of the standardization sessions, which helped align definitions and diagnostic criteria. Importantly, enhanced interobserver agreement was accompanied by improved AI model performance and consistency, with models showing greater uniformity across urologists. This suggests that standardization harmonizes human interpretation while also improving the predictive accuracy and alignment of AI models.
Concluding message
These findings highlight the critical role of structured standardization sessions in reducing interobserver variability and enhancing both the performance and consistency of AI models. This synergistic effect not only improves diagnostic agreement among urologists but also aligns AI models across specialists, boosting their predictive accuracy and uniformity. This dual benefit underscores the potential of integrating human expertise with AI to advance the reliability and standardization of urodynamic study interpretation.
Figure 1 Intraobserver variability
Figure 2 Decision trees generated before and after the standardization process, based on training data from one of the specialists.
References
  1. Bodmer NS, Wirth C, Birkhäuser V, et al. Randomised controlled trials assessing the clinical value of urodynamic studies: A systematic review and meta-analysis. Eur Urol Open Sci. 2022;44:131-141.
  2. Dudley AG, Taylor AS, Tanaka ST. Reliability and reproducibility of pediatric urodynamic studies. Curr Bladder Dysfunct Rep. 2017;12(3):233-240.
  3. Oelke M, Heesakkers J, Doumouchtsis SK, et al. ICS Standards 2024. Continence. 2024. Available from: https://www.ics.org/Publications/ICS%20Standards%202024.pdf
Disclosures
Funding: None. Clinical Trial: No. Subjects: Human. Ethics Committee: Corporate Research Ethics Committee, Hospital Universitario Fundacion Santa Fe de Bogota. Helsinki: Yes. Informed Consent: Yes.