Introduction
Pronunciation teaching is an important stage in language learning. Advances in automatic speech recognition (ASR) technology (in models, algorithms, portability, etc.) have positively impacted research in several fields (for more details, see Baker et al., 2009). In particular, these advances have promoted the development of computer-assisted pronunciation learning systems (Lee et al., 2017; Sefara et al., 2017) and have enabled automatic pronunciation assessment (Chen & Li, 2016). A few languages, such as English, have benefited from these technological advances, but others, such as Arabic, remain poorly resourced. This work aims to alert teachers to pupils with specific difficulties so that these can be corrected early; it also aims to give pupils additional time and a more comfortable environment for practicing their pronunciation.
Overall, research on Computer-Assisted Pronunciation Teaching (CAPT) systems follows two directions: the first is error detection (Hu et al., 2015; Lee et al., 2016), while the second is pronunciation assessment and scoring (Cheng et al., 2014; Neumeyer et al., 2000). In both directions, pronunciation teaching finds applications in fields such as language learning (Landini et al., 2017; Reeder et al., 2015) and speech therapy (Necibi et al., 2013).
A typical ASR-based CAPT system involves several stages, depicted in Figure 1. When the learner pronounces a given word (or another unit of speech), acoustic observations are extracted from the incoming signal and represented as a sequence of acoustic vectors. The system then force-aligns this representation with the model of the correct (e.g. native-like) pronunciation. Hidden Markov Models (HMMs) are the most widely used models for representing speech.
A Hidden Markov Model is a collection of states connected by transitions, each labeled with a probability. An HMM begins in a designated initial state. At each discrete time step, a transition is taken into a new state, where an output symbol is generated (Rabiner, 1989). “When an HMM is applied to speech recognition, the states are interpreted as acoustic models, indicating what sounds are likely to be heard during their corresponding segment of speech” (Bahi & Benati, 2009).
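The generative process just described can be illustrated with a minimal sketch. The state names, transition probabilities, and emission tables below are hypothetical toy values chosen for illustration; they are not taken from any of the cited systems, and real speech HMMs emit continuous acoustic vectors rather than discrete symbols.

```python
import random

# Hypothetical toy HMM: all names and probabilities are illustrative assumptions.
INITIAL_STATE = "s0"
TRANSITIONS = {
    "s0": {"s0": 0.6, "s1": 0.4},
    "s1": {"s1": 0.7, "s2": 0.3},
    "s2": {"s2": 1.0},
}
EMISSIONS = {
    "s0": {"a": 0.8, "b": 0.2},
    "s1": {"a": 0.3, "b": 0.7},
    "s2": {"c": 1.0},
}


def draw(distribution, rng):
    """Sample one outcome from a {outcome: probability} table."""
    outcomes = list(distribution)
    weights = [distribution[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def generate(n_steps, seed=0):
    """Run the HMM for n_steps and return (visited states, emitted symbols)."""
    rng = random.Random(seed)
    state = INITIAL_STATE  # the HMM begins in a designated initial state
    path, symbols = [], []
    for _ in range(n_steps):
        state = draw(TRANSITIONS[state], rng)        # take a transition into a new state
        symbols.append(draw(EMISSIONS[state], rng))  # that state generates an output symbol
        path.append(state)
    return path, symbols


if __name__ == "__main__":
    states, outputs = generate(8)
    print(states)
    print(outputs)
```

Only the emitted symbols would be visible to an observer; the state sequence stays hidden, which is what a recognizer has to infer.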
In the force-alignment stage, the speech recognizer computes the probability p(W|O), where O is the observation, represented by the features extracted from the incoming signal, and W is the model of the word to be pronounced. In the CAPT context, the acoustic representation of the incoming speech is force-aligned to one or more HMMs that model the expected pronunciation. The system outputs a score representing how close the incoming speech is to the correct pronunciation.
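As a rough illustration of forced alignment, the sketch below runs a Viterbi search over a left-to-right chain of states and returns the best-path log score together with the state assigned to each frame. It rests on several assumptions that are not in the text: the per-frame log-likelihoods `log_emit` are taken as already computed from the acoustic vectors, every state may only loop on itself or advance to the next state, and the function name `force_align` is hypothetical. The quantity returned is the best-path log-likelihood of the observations under the word model; turning it into the posterior p(W|O) or into a pronunciation score would need a further normalization step that is not shown.

```python
import math

NEG_INF = float("-inf")


def force_align(log_emit, log_stay, log_next):
    """Viterbi alignment of T frames to a left-to-right chain of K states.

    log_emit[t][k]: log-likelihood of frame t under state k (assumed precomputed)
    log_stay[k]:    log-probability of the self-loop on state k
    log_next[k]:    log-probability of the move from state k to state k + 1
    Returns (best-path log score, list with one state index per frame).
    """
    T, K = len(log_emit), len(log_emit[0])
    if T < K:
        raise ValueError("need at least one frame per state")
    score = [[NEG_INF] * K for _ in range(T)]
    back = [[0] * K for _ in range(T)]
    score[0][0] = log_emit[0][0]  # the path is forced to start in the first state
    for t in range(1, T):
        for k in range(K):
            stay = score[t - 1][k] + log_stay[k]
            move = score[t - 1][k - 1] + log_next[k - 1] if k > 0 else NEG_INF
            if move > stay:
                best, back[t][k] = move, k - 1
            else:
                best, back[t][k] = stay, k
            score[t][k] = best + log_emit[t][k]
    # The path is forced to end in the last state; trace it back frame by frame.
    path = [K - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return score[T - 1][K - 1], path


if __name__ == "__main__":
    # Toy example: 4 frames, 2 states, made-up likelihoods.
    emit = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]
    log_emit = [[math.log(p) for p in row] for row in emit]
    half = math.log(0.5)
    total, alignment = force_align(log_emit, [half, half], [half])
    print(alignment)        # state index assigned to each frame
    print(total / len(emit))  # per-frame average, a common length normalization
```

Dividing the log score by the number of frames, as in the usage lines, is one common way to make scores comparable across utterances of different lengths; it is shown here as an option, not as the scoring rule used by the system described in this article.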
Figure 1. Several stages in ASR-based CAPT
O’Brien et al. (2019) pointed out that “Early studies primarily dealt with pronunciation assessment and showed a relatively strong correlation between human pronunciation ratings and machine scores,” and the experiments of Cucchiarini et al. (2000) concluded that “raters who did not receive any instructions on the use of the rating scales may differ from each other in the absolute values of the scores assigned.” Moreover, Luo et al. (2016) observed that “Many teachers remain skeptical about the fairness of automatic scores given by machines even with the most advanced scoring methods.”
Because fuzzy logic is known to promote soft boundaries between classes, its use should foster greater consensus among experts while still alerting teachers to learners' possible difficulties. To overcome the limitations related to disparities in experts' ratings and to make teachers more comfortable with automatic scores, we developed a Fuzzy logic-based System for Pronunciation Assessment (FuSPA). The research question is:
Can automatic scores provided by FuSPA resemble those provided by human experts?