Journal of Fluency Disorders, vol. 87, 2026 (SCI-Expanded, SSCI, Scopus)
Purpose: This study evaluated the content validity and inter-rater reliability of stuttering assessment and intervention programs generated by artificial intelligence (GPT-4) in Turkish and English for preschool, school-age, and adult populations. It also examined whether linguistic or cultural differences affected expert evaluations.

Methods: Twelve AI-generated programs (six in Turkish, six in English) were reviewed by twelve certified speech-language pathologists specializing in fluency disorders. Each item was rated on a 5-point Likert scale. Descriptive statistics, Cronbach's alpha, and intraclass correlation coefficients (ICC) were calculated to assess internal consistency and inter-rater reliability.

Results: Most items were rated as appropriate or highly appropriate (M = 4.6–4.9). Overall reliability among raters was poor (ICC = 0.45), while single-rater reliability was higher (ICC = 0.65). Only a small number of items were flagged for revision, typically those involving emotional or contextual components. Experts noted that the English versions tended to be more detailed and more consistent with the literature, whereas certain Turkish terms required clearer cultural adaptation.

Conclusion: GPT-4 can produce clinically relevant and linguistically accurate stuttering materials when paired with expert review; however, human validation remains essential to refine affective and culture-specific elements. These findings support the integration of AI-assisted tools into multilingual clinical content development.
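The two reliability statistics named in the abstract can be sketched as follows. This is a minimal illustrative computation on a hypothetical items-by-raters matrix of Likert ratings; it is not the study's data or analysis code, and the ICC form shown (ICC(2,1), two-way random effects, absolute agreement, single rater) is one common choice, not necessarily the one the authors used.

```python
import numpy as np

def cronbach_alpha(X):
    """Cronbach's alpha for an n_items x k_raters rating matrix."""
    X = np.asarray(X, dtype=float)
    k = X.shape[1]
    rater_var = X.var(axis=0, ddof=1).sum()   # sum of per-rater variances
    total_var = X.sum(axis=1).var(ddof=1)     # variance of item totals
    return k / (k - 1) * (1 - rater_var / total_var)

def icc2_1(X):
    """ICC(2,1): two-way random effects, absolute agreement, single rater."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    grand = X.mean()
    ss_rows = k * ((X.mean(axis=1) - grand) ** 2).sum()   # between items
    ss_cols = n * ((X.mean(axis=0) - grand) ** 2).sum()   # between raters
    ss_err = ((X - grand) ** 2).sum() - ss_rows - ss_cols # residual
    ms_r = ss_rows / (n - 1)
    ms_c = ss_cols / (k - 1)
    ms_e = ss_err / ((n - 1) * (k - 1))
    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n)

# Hypothetical 5-point Likert ratings: 5 items x 3 raters (made-up values)
ratings = [[4, 4, 5],
           [5, 5, 5],
           [3, 4, 3],
           [2, 2, 3],
           [5, 4, 5]]
print(cronbach_alpha(ratings), icc2_1(ratings))
```

With raters who largely agree, as in this toy matrix, both coefficients land well above the 0.45 overall ICC reported in the study.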