Perceptual Category Mining in Human Second Language Speech Perception
Yizhou Lan, Department of Chinese, Translation and Linguistics, City University of Hong Kong, Kowloon, Hong Kong, ylylan2-c@my.cityu.edu.hk
Will X. Y. Li, Department of Electronic Engineering, City University of Hong Kong, Kowloon, Hong Kong, xiangyuli4-c@my.cityu.edu.hk
Abstract—This paper clarifies how the Perceptual Assimilation Model (PAM) predicts patterns of human categorical perception of second language (L2) speech signals. The initial stage of categorizing acoustic stimuli in the L2 signal often involves assimilation to L1 categories. Whether an L2 sound is assimilated to an L1 category is decided by the perceptual distance between the categories. This study, with evidence from Cantonese learners’ perception of the English tr- signal, proposes that a careful mining process is necessary to find the L1 category to which the L2 sound is actually assimilated. The chosen L1 category should be tested for perceptual similarity to the L2 category in fine-grained phonetic environments, rather than selected only from possible L1 phonological mappings inferred from observed production errors. Two experiments with different candidate L1 assimilating categories were conducted using the AX identification and ABX discrimination paradigms. Target trVC syllables were paired both with twVC, a candidate representing phonological closeness as established through contrastive analysis, and with chVC, a perceptually similar candidate that is less often mentioned in the literature. Results from both identification and discrimination show that accuracy for the tw-/tr- contrast is almost at ceiling, whereas that for the ch-/tr- contrast is significantly lower. The results suggest that perceptual distance is better represented by perceptual similarity, and that such a screening process should be applied as a pre-examination procedure instead of choosing the L1 assimilator on the basis of phonological similarity.
not function directly because L1 and L2 categories may differ. Lacking readily usable categories, L2 learners “borrow” L1 categories, a process called “equivalence classification”, to increase perceptual efficiency [3], and this “laziness” of equating L1 and L2 categories is referred to as perceptual assimilation [5]. Even where L1 and L2 categories are labeled as the same phoneme (the smallest meaning-distinguishing unit in speech) in their respective languages, L1 and L2 speakers’ perception may still show subtle mismatches. The Perceptual Assimilation Model (PAM, [5] [6]) describes how L2 speech sounds, perceived as categories of discrete constellations of articulatory gestures, can be assimilated to L1 ones. The assimilation process can either facilitate or hinder communication. Depending on the perceptual distance between L1 and L2 categories, a candidate L1 sound may be perceptually identical, similar, or distinct to the target L2 sound to be perceived or learned. If the L2 sound is perceived as phonologically identical to the candidate L1 sound, it maps onto the L1 category and becomes indistinguishable from it. For two given L2 sounds and their respective perceptual distances to a given L1 sound, PAM proposes six possible assimilation types and predicts the L2 perceiver’s ability to discriminate the two L2 sounds in each situation. The types and predictions are as follows:
Index Terms—human speech perception, category assimilation, perceptual distance, L2 speech processing.
TABLE I. ASSIMILATION TYPES AND THEIR PREDICTIONS IN PAM
I. INTRODUCTION
Second language (L2) speech processing offers a valuable setting for studying the capability of human speech perception per se. Speech, as part of the human cognitive mechanism, is usually perceived by recognizing higher-level categories, much as shapes or colors are recognized [1]. One does not have to attend to every acoustic detail to comprehend a speech sound [2]. Instead, previous studies have shown that economical extraction of specific acoustic cues [2] [3] or articulatory cues [4] [5] in top-down processing helps humans recognize fast and variable speech sound tokens efficiently and accurately. However, in L2 speech, though the incoming speech signals are linguistic sounds, the higher-level knowledge of categories, which helps us process our native language (L1) automatically, might
978-1-4799-1027-4/13/$31.00 ©2013 IEEE
Assimilation type | Predicted discrimination rate
Two-category | Excellent
Uncategorizable–categorizable | Very good
Both uncategorizable | Fair to good
Category-goodness | Moderate to very good
Single-category (SC) | Very poor
Non-assimilatable | Very good to excellent
Since a major purpose of research on L2 speech perception is to identify the phonemes that are difficult to learn in a specific L2, “perceptual distance” has become especially important for researchers testing whether a non-discriminable assimilation has taken place [7]. The single-category (SC) type in Table I has been extensively discussed to this end. When two L2 sounds are assimilated to the same L1 category, L2 learners, who fail to discriminate the subtle acoustic differences and face two possible categories to access, will pick one candidate category at random in a forced-choice discrimination task. Taking
According to previous observations, Cantonese learners of English often delete the /r/ in C-[r] initial clusters or substitute it with [w], as in producing [pɪnt] for print or [tʃaɪ] for try [10] [11] [12]. Specifically, the alveolar cluster tr- is more problematic: most studies describe its /r/ as deleted or substituted by [w], like /r/ in other clusters. Moreover, it has been found that the Cantonese production of /tr/ is phonetically [tʃ], a realization less often mentioned in the literature because /tʃ/ is phonologically more distant from /tr/. Phonologically, /tr/ is a consonant cluster composed of a plosive and an approximant (as is /tw/), whereas /tʃ/ is a single affricate with no approximant feature at all. Therefore, one possible way of choosing the SC assimilating candidate for /r/ is to examine the phonological closeness of Cantonese and English. /w/ is considered phonologically a free variant of /r/ in various studies [10] [13]. Accordingly, many studies have pinned down /w/ as the assimilating category (e.g., [11] [12]). However, /tʃ/, which also occurs in Cantonese (as a variant of /ts/), might be another candidate to which Cantonese speakers assimilate /tr/. Despite its phonological distance from /tr/, participants in our pilot study reported perceptual confusion between /tr/ and /tʃ/. It is therefore doubtful whether researchers can rely on phonological information or production data to prescribe the assimilating candidate. Rather, as PAM also claims, it is perceptual distance that determines categorical perception, not contrasts between L1 and L2 phonology. This suggests that the assimilating candidate has to be tested. The current study directly compares the assimilation patterns of tr- to tw- and of tr- to ch- to see which candidate is more suitably categorized as the SC type.
respectively tested the tw-/tr- and ch-/tr- contrasts, each with two tasks: AX identification and ABX discrimination. Participants were three adults working as administrative staff at City University of Hong Kong (2 females, 1 male, mean age=27.5). All were locally raised native Cantonese speakers who use English as their working language. None had prior exposure to foreign languages other than English. All participants were right-handed, with no reported hearing or motor-control deficits and no prior musical training. As controls, two native monolingual English speakers (1 female, 1 male, mean age=26.5) from California, U.S. also participated and went through the same procedure. The perception tests were carried out in the Phonetics Lab, City University of Hong Kong. Stimuli for both tests were pseudo-words in isolation, recorded by another American English speaker from California. Stimuli for the first experiment were designed as minimal pairs of trVC and twVC (e.g., trep-twep) and those for the second experiment as minimal pairs of trVC and chVC (e.g., trook-chook). Stimuli covered five vowel contexts: /i/, /ɛ,æ/, /u/, /ʌ/, and /ɔ/. Each word was produced three times by the native English speaker, and the most clearly pronounced utterance was selected as the stimulus. An equal number of fillers was added to the test tokens. For the perception test, each word in the stimulus list was repeated 10 times and the order was randomized. In total, 1000 tokens were tested (5 participants × 2 experiments × 2 tasks × 5 vowels × 10 repetitions). To provide supporting evidence for the perception results, production samples of real-word stimuli were also recorded in the lab from both the Cantonese and the native English participants. Both experiments involved identification and discrimination of the sounds in the minimal pairs.
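The token arithmetic of this design can be checked with a short sketch. The factor values come from the protocol above; the function and variable names are illustrative assumptions, and fillers are not counted.

```python
from math import prod

# Design factors as stated in the protocol (fillers are not counted here).
N_CANTONESE = 3
N_ENGLISH = 2
N_EXPERIMENTS = 2
N_TASKS = 2
N_VOWELS = 5
N_REPETITIONS = 10


def tokens(participants: int, experiments: int = N_EXPERIMENTS) -> int:
    """Number of test tokens for a group of participants (hypothetical helper)."""
    return prod([participants, experiments, N_TASKS, N_VOWELS, N_REPETITIONS])


# All five participants across both experiments.
total_tokens = tokens(N_CANTONESE + N_ENGLISH)
# Per-group counts within a single experiment.
cantonese_per_experiment = tokens(N_CANTONESE, experiments=1)
english_per_experiment = tokens(N_ENGLISH, experiments=1)

print(total_tokens, cantonese_per_experiment, english_per_experiment)
```

Under these assumptions, the per-experiment counts for each group correspond to the N values quoted with the accuracy rates in the results.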
In the identification task, two words from a minimal pair (e.g., cheet/treet or twook/trook) were played in sequence, and the participants were asked to circle either “same” or “different” on an answer sheet. In the discrimination task, three consecutive words (e.g., treek/tweek/tweek) were played, where the third word was identical to either the first or the second one. The participants were asked to circle the matching word on the answer sheet. The inter-stimulus interval (ISI) was set at 250 milliseconds for both tasks. Variation in vowel quality across stimulus tokens could affect the perceptual reliability of the stimuli and introduce an extra variable into perception. To control for this, the original vowel portion of each stimulus was replaced with an identical vowel excised from a single token, so that vowel quality remained consistent across the tasks. For instance, the [i] in one clear production of “treek” was used for all tokens with the /i/ vowel in both experiments.
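As an illustration of the ABX discrimination task described above, the following minimal sketch builds a randomized trial list and scores a set of responses. It is not the authors' presentation script: the function names, the fixed random seed, and the way the presentation order is randomized are assumptions, and fillers and the 250 ms ISI are not modeled.

```python
import random


def make_abx_trials(pairs, repetitions, seed=0):
    """Build a randomized ABX trial list.

    Each trial is (A, B, X, answer), where X repeats either A or B and
    `answer` is 1 or 2, the position of the word that X matches.
    """
    rng = random.Random(seed)
    trials = []
    for first, second in pairs:
        for _ in range(repetitions):
            a, b = rng.sample([first, second], 2)  # random presentation order
            answer = rng.choice([1, 2])            # which word X repeats
            x = a if answer == 1 else b
            trials.append((a, b, x, answer))
    rng.shuffle(trials)
    return trials


def accuracy(trials, responses):
    """Proportion of responses that name the position X actually matches."""
    correct = sum(resp == trial[3] for trial, resp in zip(trials, responses))
    return correct / len(trials)


# Pseudo-word pairs taken from the examples in the text.
pairs = [("treek", "tweek"), ("trook", "twook")]
trials = make_abx_trials(pairs, repetitions=10)

perfect = [t[3] for t in trials]   # a listener who always answers correctly
always_first = [1] * len(trials)   # a listener who always circles the first word
print(len(trials), accuracy(trials, perfect), accuracy(trials, always_first))
```

A listener who cannot hear the contrast and guesses would be expected to score around 50% in this two-alternative design, which is the reference point for reading the discrimination accuracies.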
III. EXPERIMENTAL PROTOCOL
IV. RESULTS
Two experiments were conducted to examine the perceptual assimilation patterns of two candidate L1 assimilating categories, namely tw- and ch-, to see which candidate was more prone to being perceptually confused with tr- by experienced Cantonese learners of English. Two experiments
A. Experiment 1: tw- and tr- contrast
In both the identification and discrimination tasks, the consonant distinction between tw- and tr- was perceived with 100% accuracy by native English
Mandarin learners’ acquisition of English /r/ as an example, the sound /r/ is often categorized as /z/, and English /z/ is also categorized as Mandarin /z/, forming an SC scenario where /r/ cannot be distinguished from /z/ in English [8]. However, others claim, on examining the Mandarin /ᶎ/ phoneme, that /r/ is categorized as /ᶎ/ [9]. The assimilation results these studies predict thus conflict depending on the candidate chosen for testing. Mining the L1 assimilating category for testing therefore directly affects the results and efficiency of experiments. The present study calls for heightened attention to a robust and procedural way of mining the candidate category. The category should be mined through pilot tests, which measure category distance for each learner, rather than prescribed from general phonological knowledge.
II. TWO CANDIDATE METHODS
speakers (N=200, sd=0), whereas 94.3% of the contrast was correctly perceived by Cantonese speakers (N=300, sd=.232). Results of the two tasks did not differ much from each other: Cantonese speakers scored 94.67% in identification and 94% in discrimination. A t-test showed that this difference is not statistically significant [t=.248, df=298, p=.619]. Vowel context was not a significant predictor of accuracy in either identification or discrimination. Across the five vowel contexts, Cantonese speakers’ accuracy was /i/, 98%; /ɛ,æ/, 92%; /ʌ/, 95%; /ɔ/, 95%; and /u/, 92%. An ANOVA showed that the difference in accuracy by vowel context was not statistically significant [F(4, 295)=.869, p=.483]. Individual differences among participants were not significant either [F(2, 297)=1.182, p=.308].
B. Experiment 2: ch- and tr- contrast
Native English speakers also attained 100% accuracy in the second experiment, whereas 73% of the contrast was correctly perceived by Cantonese speakers (N=300, sd=0.446). Because the difference between tasks was statistically significant [t=3.150, df=298, p<0.05], the results of Experiment 2 are reported task by task. Accuracy was moderately higher in the identification task (81.3%) than in the discrimination task (66%). In the identification task, accuracy was substantially related to vowel context. In the contexts of /i/, /ɛ,æ/, /ʌ/ and /ɔ/, the consonants were distinguished with relatively high accuracy (100%, 86.67%, 83.33%, and 80%, respectively), but in the context of /u/, identification accuracy was markedly lower (56.67%). An ANOVA on the identification task shows that the effect of vowel on accuracy is significant [F(4, 145)=5.387, p<0.001]. Tukey’s post-hoc tests reveal that only /u/ differs significantly from /i/ and /ɛ,æ/ [for /i/, m=-0.433, sd=0.097, p<0.001; for /ɛ,æ/, m=0.300, sd=0.097, p<.05].
In the discrimination task, accuracy patterned with vowel context much as in the identification task: stimuli with all vowels except /u/ were more likely to be correctly discriminated (/i/: 86.67%; /ɛ,æ/: 73.33%; /ʌ/: 66.67%; /ɔ/: 63.33%), whereas accuracy for words with /u/ fell sharply to 40%. Statistics again show a significant effect of vowel [F(4, 145)=4.396, p<0.05]. Post-hoc test results were similar to those for identification: only the accuracy for /u/ differs significantly from that for /i/ and /ɛ,æ/ [for /i/, m=-0.467, sd=0.097, p<0.001; for /ɛ,æ/, m=-0.333, sd=0.097, p<0.05]. Individual differences in accuracy among the three participants were not significant [F(2, 297)=1.833, p=0.162]. A detailed breakdown of the two experiments is shown in Fig. 1. Overall, the difference in Cantonese speakers’ performance between the two experiments (94.3% for tw- and 73% for ch-) is significant [t=-7.462, df=598, p<0.0001].
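The between-experiment comparison can be approximated from the reported summary statistics alone. The sketch below assumes an independent-samples Student's t-test with pooled variance (the paper does not name the variant; the reported df=598 is consistent with this assumption) and uses the rounded means and standard deviations given in the text, so it is a plausibility check rather than a reproduction of the authors' analysis. The function name is hypothetical.

```python
from math import sqrt


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Student's two-sample t statistic from summary statistics.

    Assumes independent samples and equal variances (pooled estimate).
    Returns the t statistic and its degrees of freedom.
    """
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    se = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se, df


# Reported values: Experiment 2 (ch-/tr-) versus Experiment 1 (tw-/tr-),
# Cantonese listeners only, accuracy expressed as a proportion.
t_stat, df = pooled_t(0.73, 0.446, 300, 0.943, 0.232, 300)
print(round(t_stat, 2), df)
```

With these rounded inputs the statistic comes out close to, though not identical with, the reported t=-7.462, which would be expected if the published percentages were rounded before printing.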
Fig. 1. Perceptual results for Cantonese participant 2’s accuracy rates by vowel categories in each task in each experiment
V. DISCUSSIONS
The results, especially the significant difference in accuracy between Experiments 1 and 2, indicate that /tr/ is more likely to be assimilated to the candidate /tʃ/ than to /tw/: accuracy in both identification and discrimination is significantly lower for the /tʃ/-/tr/ contrast in Experiment 2, indicating that Cantonese speakers confuse /tʃ/ with /tr/ in processing the L2 speech signal.
A. Mining the candidate category through perceptual distance
PAM posits that only perceptual distance can reveal the correct candidate for categorical assimilation [5]. This means that perceptual distance should be measured directly with identification and/or discrimination paradigms at the category-mining stage of a study, rather than inferred from the proximity of L1 and L2 phonology, because human cognition of sound categories is far more complex and context-dependent than the entries of contrastive phonology. In some extreme cases, category assimilation may operate across phoneme boundaries: a sequence of sounds may be assimilated to a single sound, as in the current study, or vice versa, as in Japanese (English /r/ is assimilated to [ɰɾ], cf. [14]). Yet in empirical studies, the important step of finding the right category for testing the SC type of assimilation is very likely to be skipped. We therefore emphasize that the candidate assimilating category should be tested first rather than prescribed. Although /w/ is the category to which /r/ is predicted to assimilate, applying it to the tr- cluster might not work well, because in the /tr/ environment the category of /r/ may have shifted. According to the results, /r/ in tr- has moved elsewhere in the human perceptual landscape of speech: in the /tr/ environment, /r/ is no longer confused with /w/; instead, /tr/ as a whole is confused with /tʃ/.
B. Effect of vowel
The results also show that, even within the same category, the phonetic environment of the L2 sound may alter the perceptual distance between L1 and L2. In this study, the /u/ environment shortened the distance between /tr/ and /tʃ/ to the point of non-discriminability, as shown by the below-chance accuracy rate. However, in the /i/ environment, the distance is much larger, as reflected in the much higher accuracy rate. This further confirms that the phonetic environment is crucial to mining the confusable category.
C. Evidence from production data
We also conducted a production test of real words with trVC, chVC and twVC structure pronounced by the Cantonese speakers. The recordings were analyzed as spectrograms in Praat. In productions of truth (see Fig. 2), an F3 slope, a distinctive acoustic feature of /r/ [15], was entirely absent from the Cantonese speaker’s production but clear at the beginning of voicing in the native English speaker’s production (picture not shown due to page limit), showing that tr- may be poorly distinguished by Cantonese learners of English. In the production of choose, F3 is similar to that in truth, indicating that the two sounds may be very close in the participants’ perception in the /u/ context.
The production results serve as side evidence, yet important evidence, of how Cantonese L2 speakers of English realize the distance between perceptual categories in the clarity of their production. Here, the consonant portion of truth is similar to that of choose but not to that of Twain. The perceptual distance from /tw/ to /tr/ is apparently greater, given the clarity with which the two categories are differentiated in production.
D. Limitations
This study is a first step toward raising researchers’ awareness of category mining, and it has a number of limitations. First, the participants were confined to advanced learners. It is possible that naïve Cantonese listeners of English would still fail to detect the difference between tr- and tw- and undergo categorical assimilation. Second, the proposed method of category mining in principle involves exhausting the possible confusable sounds in L1 that could be perceptually similar to the L2 target sound. It would be excessively inefficient to exhaust all possibilities. Future studies on how to optimize the mining process are highly desirable.
ACKNOWLEDGMENT
The writers thank Bin Li for her suggestions on an earlier draft of the manuscript. The writers are also deeply indebted to Sunyoung Oh for her guidance in decoding speech perception models.
REFERENCES
Fig. 2. Spectrograms of Cantonese participant 2’s production of “truth” (above) and “choose” (below). No F3 change is visible for the Cantonese participant in either case.
However, in Twain, the Cantonese speaker produces a full, clear /w/ with a low F2 approaching F1, which is characteristic of /w/ [16] and different from the consonant portion of either truth or choose (see Fig. 3). This shows that /tw/ is an independent category free from the influence of /tr/.
Fig. 3. Spectrogram of Cantonese participant 3’s production of “Twain”. A clear [w] quality is seen in the proximity of F1 and F2 and the glide into the following diphthong.
[1] B. Berlin, & P. Kay, Basic color terms: their universality and evolution, CA: University of California Press, 1991.
[2] J. B. Pierrehumbert, “Phonetic diversity, statistical learning, and acquisition of phonology”, Language and Speech, 46(2-3), 2003, pp. 115–154.
[3] P. K. Kuhl, “A new view of language acquisition”, Colloquium of PNAS, 2000, pp. 11850–11857.
[4] C. P. Browman, & L. Goldstein, “Tiers in articulatory phonology, with some implications for casual speech”, Haskins Laboratories Status Report on Speech Research, 1987, pp. 1–30.
[5] C. T. Best, “A direct-realist view of cross-language speech perception”, in W. Strange, Ed., Speech perception and linguistic experience: Issues in cross-language research, 1995, pp. 171–204.
[6] C. T. Best, G. W. McRoberts, & E. Goodall, “Discrimination of non-native consonant contrasts varying in perceptual assimilation to the listener’s native phonological system”, Journal of the Acoustical Society of America, 109(2), 2001, pp. 775–794.
[7] T. M. Derwing, & M. J. Munro, “Second language accent and pronunciation teaching: A research-based approach”, TESOL Quarterly, 39(3), 2005, pp. 379–397.
[8] H. Wang, “English as a lingua franca: Mutual intelligibility of Chinese, Dutch and American speakers of English”, LOT dissertation series 147, LOT, Utrecht, 2007.
[9] S. Lee, “A comparison of cluster realizations in first and second language”, The Journal of Studies in Language, 19(2), 2003, pp. 341–357.
[10] T. N. Hung, “Language in contact: Hong Kong English phonology and the influence of Cantonese”, in A. Kirkpatrick, Ed., Englishes in Asia: Communication, Identity, Power and Education, Melbourne: Language Australia, 2002, pp. 191–200.
[11] Y. Chan, “Strategies used by Cantonese speakers in pronouncing English initial consonant clusters: Insights into the interlanguage phonology of Cantonese ESL learners in Hong Kong”, IRAL proceedings, 44, 2006a, pp. 331–355.
[12] Y. Chan, “Cantonese ESL learners’ pronunciation of English final consonants”, Language, Culture and Curriculum, 19(3), 2006b, pp. 296–312.
[13] H. Ling, Production of English /r/ and /w/ by Cantonese L1 speakers in Hong Kong, Unpublished M.A. Thesis, 2010.
[14] K. Aoyama, J. E. Flege, & S. G. Guion, “Perceived phonetic dissimilarity and L2 speech learning: The case of Japanese /r/ and English /l/ and /r/”, Journal of Phonetics, 32, 2004, pp. 233–250.
[15] C. T. Best, & W. Strange, “Effects of phonological and phonetic factors on cross language perception of approximants”, Haskins Laboratories Status Report on Speech Research, SR-109/110, 1992.
[16] R. D. Kent, & C. Read, Acoustic analysis of speech, Albany: Delmar Learning, 2002.