Content validity of AI-generated stuttering assessment and intervention programs based on expert review: A comparative analysis across age groups and language versions


Koçak A. N., Arslan M. B.

Journal of Fluency Disorders, vol.87, 2026 (SCI-Expanded, SSCI, Scopus)

  • Publication Type: Article
  • Volume: 87
  • Publication Date: 2026
  • DOI Number: 10.1016/j.jfludis.2025.106186
  • Journal Name: Journal of Fluency Disorders
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Social Sciences Citation Index (SSCI), Scopus, BIOSIS, CINAHL, Communication Abstracts, Educational Research Abstracts (ERA), EMBASE, Linguistic Bibliography, MLA - Modern Language Association Database, PsycINFO
  • Keywords: Artificial intelligence, Content validity, Stuttering
  • Anadolu University Affiliated: No

Abstract

Purpose: This study aimed to evaluate the content validity and inter-rater reliability of stuttering assessment and intervention programs generated by artificial intelligence (GPT-4) in both Turkish and English for preschool, school-age, and adult populations. It also examined whether linguistic or cultural differences affected expert evaluations.

Methods: Twelve AI-generated programs (six in Turkish, six in English) were reviewed by twelve certified speech-language pathologists specializing in fluency disorders. Each item was rated using a 5-point Likert scale. Descriptive statistics, Cronbach's Alpha, and Intraclass Correlation Coefficients (ICC) were calculated to assess consistency and reliability.

Results: The majority of items were rated as appropriate or highly appropriate (M = 4.6–4.9). The overall reliability among raters was poor (ICC = 0.45), while single-rater reliability was higher (ICC = 0.65). Only a small number of items were flagged for revision, typically involving emotional or contextual components. Experts noted that English versions tended to be more detailed and literature-consistent, whereas certain Turkish terms required clearer cultural adaptation.

Conclusion: GPT-4 can produce clinically relevant and linguistically accurate stuttering materials when paired with expert review. However, human validation remains essential to refine affective and culture-specific elements. These findings support the integration of AI-assisted tools in multilingual clinical content development.
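The Methods section names Cronbach's Alpha among the reliability statistics. As a minimal illustration of that computation (not the study's actual data or analysis code), the sketch below applies the standard formula to a small hypothetical matrix of 5-point Likert ratings, with rows as program items and columns as raters:

```python
# Minimal sketch of Cronbach's alpha for inter-rater consistency.
# The ratings below are hypothetical 5-point Likert scores (rows = items,
# columns = raters); the study itself used twelve raters, not four.
from statistics import pvariance

ratings = [
    [5, 4, 5, 5],
    [4, 4, 5, 4],
    [5, 5, 4, 5],
    [3, 4, 4, 3],
    [5, 5, 5, 4],
]

def cronbach_alpha(matrix):
    """alpha = k/(k-1) * (1 - sum of per-rater variances / variance of item totals),
    where k is the number of raters (columns)."""
    k = len(matrix[0])                                   # number of raters
    rater_vars = sum(pvariance(col) for col in zip(*matrix))
    totals = [sum(row) for row in matrix]                # total score per item
    return k / (k - 1) * (1 - rater_vars / pvariance(totals))

print(round(cronbach_alpha(ratings), 3))  # → 0.75
```

Values near 1 indicate high internal consistency among raters; ICC, the study's other reliability index, additionally distinguishes single-rater from average-rater agreement.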