Invited Lecture 1 (9th, September) Kumiko Tanaka (The University of Tokyo)
Long Memory in Natural Language
In this talk, I present the frontier of quantifying the long memory underlying natural language. Long memory is a property common to complex systems, and natural language exhibits it as well. In language, long memory appears as clustering of word occurrences in a word sequence and is caused mainly by context shifts. Thus far, two techniques have been used to quantify long memory: long-range correlation analysis and fluctuation analysis. Both produce power laws, which suggest that the clustering phenomena underlying natural language sequences are self-similar. After explaining the state-of-the-art quantification methods, their outputs, and the understanding gained from them, I argue for the further significance of long memory with respect to machine learning, as studied in the field of computational linguistics.
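As a rough illustration of the fluctuation analysis mentioned above, the following Python sketch counts a target word in windows of increasing size and fits the slope of the log standard deviation against the log window size. The toy sequence, the function names, and the window sizes are assumptions made for this illustration and are not taken from the talk. For a shuffled sequence the slope should lie near 0.5, while clustering pushes it higher.

```python
import math
import random
from statistics import pstdev


def toy_sequence(n_blocks=40, block_len=500, p_on=0.2, seed=0):
    """Hypothetical data: the word "w" occurs only in every other block,
    a crude stand-in for clustering caused by context shifts."""
    rng = random.Random(seed)
    tokens = []
    for b in range(n_blocks):
        p = p_on if b % 2 == 0 else 0.0
        tokens.extend("w" if rng.random() < p else "_" for _ in range(block_len))
    return tokens


def window_counts(tokens, target, size):
    """Count occurrences of `target` in consecutive non-overlapping windows."""
    n_windows = len(tokens) // size
    return [
        sum(1 for tok in tokens[i * size:(i + 1) * size] if tok == target)
        for i in range(n_windows)
    ]


def fluctuation_exponent(tokens, target, sizes):
    """Least-squares slope of log(std of window counts) against log(window size)."""
    xs, ys = [], []
    for size in sizes:
        sd = pstdev(window_counts(tokens, target, size))
        if sd > 0:  # skip window sizes with no variation (log undefined)
            xs.append(math.log(size))
            ys.append(math.log(sd))
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )


sizes = [10, 20, 50, 100, 200]
clustered = toy_sequence()
shuffled = clustered[:]
random.Random(1).shuffle(shuffled)  # shuffling destroys the clustering

print("clustered:", round(fluctuation_exponent(clustered, "w", sizes), 2))
print("shuffled: ", round(fluctuation_exponent(shuffled, "w", sizes), 2))
```

Comparing a sequence against its shuffled version is a simple way to check that a large exponent reflects word order and not the overall word frequency.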
Invited Lecture 2 (10th, September) Yuichiro Kobayashi (Nihon University)
Automated essay/speech scoring: A stylometric approach to language assessment
Automated scoring, in which computer technology evaluates and scores written or spoken content (Shermis and Burstein, 2003), aims to sort a large body of data into a small number of discrete proficiency levels. Objectively measurable features are used as explanatory variables to predict scores, which are defined as criterion variables. Automated scoring is both a form of language assessment and an application of stylometric methods, like those used in authorship attribution or chronological stylistic analysis. This talk will outline the basic concepts and technologies for automated essay and speech scoring. In particular, some useful feature sets for automated grading will be discussed in the context of statistical prediction of learners' proficiency levels. Furthermore, the results of automated English speech grading will be reported in order to show the possibilities and limits of statistical language evaluation in educational settings. For the speech grading, I used two corpora of spoken utterances by Japanese learners of English, the NICT JLE Corpus (Izumi, Uchimoto, and Isahara, 2004) and the Longitudinal Corpus of Spoken English (Abe and Kondo, 2019), both of which are coded into nine oral proficiency levels. The nine levels, which were manually assessed by professional raters and pertain to such aspects of examinees' speech as vocabulary, grammar, pronunciation, and fluency, were used as criterion variables, and the 67 linguistic features analyzed in Biber (1988) were used as explanatory variables. The random forest algorithm (Breiman, 2001) was employed to predict oral proficiency.
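To make the idea of objectively measurable features concrete, here is a minimal Python sketch that turns a transcript into a small feature dictionary. The four features, the tokenizer, and the name `extract_features` are assumptions chosen for illustration. They are only loosely inspired by the kind of counts analyzed in Biber (1988) and are not the 67-feature set used in the talk.

```python
import re

# Hypothetical word list for one illustrative feature.
FIRST_PERSON = {
    "i", "me", "my", "mine", "myself",
    "we", "us", "our", "ours", "ourselves",
}


def extract_features(text):
    """Return a few length-normalized features for one transcript."""
    # Lowercase word tokens, keeping an internal apostrophe (e.g. "don't").
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    n = len(tokens)
    if n == 0:
        raise ValueError("text contains no word tokens")
    return {
        "type_token_ratio": len(set(tokens)) / n,
        "mean_word_length": sum(len(t.replace("'", "")) for t in tokens) / n,
        # Rates per 1,000 tokens, so texts of different lengths are comparable.
        "first_person_per_1000": 1000 * sum(t in FIRST_PERSON for t in tokens) / n,
        # Rough proxy: any token with an apostrophe (also catches possessives).
        "contractions_per_1000": 1000 * sum("'" in t for t in tokens) / n,
    }


features = extract_features("I think we don't know my answer")
for name, value in features.items():
    print(f"{name}: {value:.2f}")
```

A table with one such feature vector per speaker, paired with the rater-assigned levels, is the kind of input a random forest classifier (for example scikit-learn's `RandomForestClassifier`) could be trained on.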
Abe, M., & Kondo, Y. (2019). Constructing a longitudinal learner corpus to track L2 spoken English. Journal of Modern Languages, 29, 23-44.
Biber, D. (1988). Variation across Speech and Writing. Cambridge University Press.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.
Izumi, E., Uchimoto, K., & Isahara, H. (2004). A Speaking Corpus of 1,200 Japanese Learners of English. ALC Press.
Shermis, M. D., & Burstein, J. C. (Eds.) (2003). Automated essay scoring: A cross-disciplinary perspective. Routledge.