Statistical structure of printed Turkish, English, German, French, Russian and Spanish

Shamilov A., Yolacan S.

WSEAS Transactions on Mathematics, vol.5, no.6, pp.756-762, 2006 (Scopus)

  • Publication Type: Article
  • Volume: 5 Issue: 6
  • Publication Date: 2006
  • Journal Name: WSEAS Transactions on Mathematics
  • Journal Indexes: Scopus
  • Page Numbers: pp.756-762
  • Keywords: ANOVA, Coding theory, Entropy of language, Optimal language, Regression analysis, Semantic content, Shannon's measure
  • Anadolu University Affiliated: Yes


The statistical properties of language, the basic tool for communication, have frequently been used in the development of computer science, for example in the construction of efficient binary codes. Language itself may also be regarded as a code for certain conceptual entities. From this point of view, this study examines the statistical structures of printed Turkish, English, German, French, Russian and Spanish on the basis of the probability distribution of letters for the same semantic content. The optimal language in the sense of coding theory is then determined by using Shannon's measure of entropy. During the analysis, we encountered some known difficulties in the evaluation of Shannon's measure. To overcome these difficulties, we established that regression analysis is a convenient method. A regression equation is therefore given for generalizing the entropy estimates, together with related interpretations. The main result of the paper is that the slope of the simple linear regression model gives an approximate value for the entropy of the languages.
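As an illustration of the quantities mentioned in the abstract, the following Python sketch computes Shannon's measure from the letter distribution of a text and fits a simple least-squares slope. It is not the authors' procedure: the abstract does not state which variables enter the regression, so regressing the entropy of n-letter blocks on the block length n is an assumption made here for illustration only, and the function names letter_entropy, block_entropy and ols_slope are hypothetical.

```python
import math
from collections import Counter


def _letters(text):
    """Keep alphabetic characters only, lower-cased (an illustrative choice)."""
    return [ch for ch in text.lower() if ch.isalpha()]


def _entropy(counts):
    """Shannon entropy in bits, H = -sum(p * log2(p)), from a Counter."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def letter_entropy(text):
    """Entropy in bits per letter of the single-letter distribution."""
    return _entropy(Counter(_letters(text)))


def block_entropy(text, n):
    """Entropy in bits of overlapping n-letter blocks (assumed setup)."""
    letters = _letters(text)
    blocks = ["".join(letters[i:i + n]) for i in range(len(letters) - n + 1)]
    return _entropy(Counter(blocks))


def ols_slope(xs, ys):
    """Slope of the simple linear regression of ys on xs (least squares)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


if __name__ == "__main__":
    sample = "the quick brown fox jumps over the lazy dog " * 20
    print("bits per letter:", letter_entropy(sample))
    # Assumption: regress block entropy on block length n = 1, 2, 3.
    ns = [1, 2, 3]
    hs = [block_entropy(sample, n) for n in ns]
    print("regression slope:", ols_slope(ns, hs))
```

Estimates of this kind are biased for short samples, because rare letters and blocks are under-represented in the observed frequencies; this is one of the known difficulties in evaluating Shannon's measure from finite text.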