FN Thomson Reuters Web of Science™ VR 1.0 PT J AU Vogel, AP Fletcher, J Maruff, P AF Vogel, Adam P. Fletcher, Janet Maruff, Paul TI The impact of task automaticity on speech in noise SO SPEECH COMMUNICATION LA English DT Article DE Lombard effect; Automaticity; Task selection; Speech; Cognitive load; Timing ID FUNDAMENTAL-FREQUENCY; SENSORIMOTOR CONTROL; ACOUSTIC ANALYSIS; INTERFERENCE; UNCERTAINTY; VARIABILITY; RELIABILITY; IMPAIRMENT; PRINCIPLES; DISEASE AB In the control of skeleto-motor movement, it is well established that the less complex, or more automatic a motor task is, the less variability and uncertainty there is in its performance. It was hypothesized that a similar relationship exists for integrated cognitive-motor tasks such as speech where the uncertainty with which actions are initiated may increase when the feedback loop is interrupted or dampened. To investigate this, the Lombard effect was exploited to explore the acoustic impact of background noise on speech during tasks increasing in automaticity. Fifteen healthy adults produced five speech tasks bearing different levels of automaticity (e.g., counting, reading, unprepared monologue) during habitual and altered auditory feedback conditions (Lombard effect). Data suggest that speech tasks relatively free of meaning or phonetic complexity are influenced to a lesser degree by a compromised auditory feedback than more complex paradigms (e.g., contemporaneous speech) on measures of timing. These findings inform understanding of the relative contribution speech task selection plays in measures of speech. Data also aid in understanding the relationship between task automaticity and altered speech production in neurological conditions where dual impairments of movement and cognition are observed (e.g., Huntington's disease, progressive aphasia). (C) 2014 Elsevier B.V. All rights reserved. C1 [Vogel, Adam P.] Univ Melbourne, Speech Neurosci Unit, Melbourne, Vic, Australia. [Fletcher, Janet] Univ Melbourne, Sch Languages & Linguist, Melbourne, Vic, Australia. [Maruff, Paul] Univ Melbourne, Howard Florey Inst Neurosci & Mental Hlth, Melbourne, Vic, Australia. RP Vogel, AP (reprint author), 550 Swanston St, Melbourne, Vic 3010, Australia. EM vogela@unimelb.edu.au FU National Health and Medical Research Council - Australia [1012302] FX APV was supported by a National Health and Medical Research Council - Australia, Early Career Fellowship (#1012302). 
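The Vogel et al. record above reports that background noise (the Lombard effect) alters timing measures differently across speech tasks. As a rough illustration of the kind of timing measure such analyses rely on, the sketch below computes percentage pause time and mean pause duration from a mono recording using frame-level energy; the frame length and silence threshold are illustrative assumptions, not the study's actual analysis settings.

```python
# Illustrative sketch only (not the study's pipeline): a simple energy-based
# timing measure of the kind used when comparing speech tasks across
# habitual and Lombard (noise) conditions. Frame size and threshold are
# assumptions chosen for illustration.
import numpy as np

def pause_statistics(signal, fs, frame_ms=20, silence_db=-35.0):
    """Return (percent_pause, mean_pause_seconds) for a mono waveform."""
    frame_len = int(fs * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = np.asarray(signal[:n_frames * frame_len], dtype=float).reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1)) + 1e-12
    level_db = 20.0 * np.log10(rms / rms.max())   # frame level relative to the peak frame
    silent = level_db < silence_db                # frame-wise pause decision

    percent_pause = 100.0 * silent.mean()

    # Group consecutive silent frames into pauses to estimate mean pause duration.
    pause_lengths, run = [], 0
    for is_silent in silent:
        if is_silent:
            run += 1
        elif run:
            pause_lengths.append(run)
            run = 0
    if run:
        pause_lengths.append(run)
    mean_pause_s = np.mean(pause_lengths) * frame_ms / 1000.0 if pause_lengths else 0.0
    return percent_pause, mean_pause_s
```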
CR Bays PM, 2007, J PHYSIOL-LONDON, V578, P387, DOI 10.1113/jphysiol.2006.120121 Bays PM, 2005, CURR BIOL, V15, P1125, DOI 10.1016/j.cub.2005.05.023 Blais C., 2010, Q J EXP PSYCHOL, V65, P268, DOI [10.1080/17470211003775234, DOI 10.1080/17470211003775234] Boersma P., 2001, GLOT INT, V5, P341 Boril H., 2008, THESIS CZECH TECHNIC BROWN W S JR, 1972, Journal of Auditory Research, V12, P157 Castellanos A, 1996, SPEECH COMMUN, V20, P23, DOI 10.1016/S0167-6393(96)00042-8 DREHER JJ, 1957, J ACOUST SOC AM, V29, P1320, DOI 10.1121/1.1908780 Dunlap WP, 1996, PSYCHOL METHODS, V1, P170, DOI 10.1037//1082-989X.1.2.170 Fredrickson A, 2008, HUM PSYCHOPHARM CLIN, V23, P425, DOI 10.1002/hup.942 Gadesmann M, 2008, INT J LANG COMM DIS, V43, P41, DOI 10.1080/13682820701234444 Gilabert R., 2006, MULTILINGUAL MATTERS, P44 Hanley TD, 1949, J SPEECH HEAR DISORD, V14, P363 Hayhoe M, 2005, TRENDS COGN SCI, V9, P188, DOI 10.1016/j.tics.2005.02.009 Houde JF, 1998, SCIENCE, V279, P1213, DOI 10.1126/science.279.5354.1213 Junqua JC, 1996, SPEECH COMMUN, V20, P13, DOI 10.1016/S0167-6393(96)00041-6 Laures JS, 2003, J COMMUN DISORD, V36, P449, DOI 10.1016/S0021-9924(03)00032-7 Lee GS, 2007, EAR HEARING, V28, P343, DOI 10.1097/AUD.0b013e318047936f LETOWSKI T, 1993, EAR HEARING, V14, P332 Lombard E., 1911, MALADIES OREILLE LAR, V27, P101 Lu YY, 2009, J ACOUST SOC AM, V126, P1495, DOI 10.1121/1.3179668 MACLEOD CM, 1988, J EXP PSYCHOL LEARN, V14, P126, DOI 10.1037/0278-7393.14.1.126 Mazzoni D., 2012, AUDACITY Mendoza E, 1998, J VOICE, V12, P263, DOI 10.1016/S0892-1997(98)80017-9 Patel R, 2008, J SPEECH LANG HEAR R, V51, P209, DOI 10.1044/1092-4388(2008/016) Pittman AL, 2001, J SPEECH LANG HEAR R, V44, P487, DOI 10.1044/1092-4388(2001/038) RIVERS C, 1985, Journal of Auditory Research, V25, P37 Robinson P, 2001, APPL LINGUIST, V22, P27, DOI 10.1093/applin/22.1.27 Robinson P., 2005, IRAL-INT REV APPL LI, V43, P1, DOI DOI 10.1515/IRA1.2005.43.1.1 Scherer K.R., 2002, ICSLP 2002 DENV CO U, P2017 SIEGEL GM, 1992, J SPEECH HEAR RES, V35, P1358 Stout JC, 2011, NEUROPSYCHOLOGY, V25, P1, DOI 10.1037/a0020937 Summers W V, 1988, J Acoust Soc Am, V84, P917, DOI 10.1121/1.396660 TARTTER VC, 1993, J ACOUST SOC AM, V94, P2437, DOI 10.1121/1.408234 Tremblay S, 2003, NATURE, V423, P866, DOI 10.1038/nature01710 Van Riper C, 1963, SPEECH CORRECTION, V4 van Beers RJ, 2002, PHILOS T R SOC B, V357, P1137, DOI 10.1098/rstb.2002.1101 Vogel AP, 2010, MOVEMENT DISORD, V25, P1753, DOI 10.1002/mds.23103 Vogel AP, 2012, NEUROPSYCHOLOGIA, V50, P3273, DOI 10.1016/j.neuropsychologia.2012.09.011 Vogel AP, 2009, BEHAV RES METHODS, V41, P318, DOI 10.3758/BRM.41.2.318 Vogel AP, 2011, J VOICE, V25, P137, DOI 10.1016/j.jvoice.2009.09.003 Wassink AB, 2007, J PHONETICS, V35, P363, DOI 10.1016/j.wocn.2006.07.002 Watson PJ, 2006, J SPEECH LANG HEAR R, V49, P636, DOI 10.1044/1092-4388(2006/046) Wolpert DM, 2011, NAT REV NEUROSCI, V12, P739, DOI 10.1038/nrn3112 YATES AJ, 1963, PSYCHOL BULL, V60, P213, DOI 10.1037/h0044155 NR 45 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD NOV-DEC PY 2014 VL 65 BP 1 EP 8 DI 10.1016/j.specom.2014.05.002 PG 8 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA AP2KS UT WOS:000341901700001 ER PT J AU Schwerin, B Paliwal, K AF Schwerin, Belinda Paliwal, Kuldip TI An improved speech transmission index for intelligibility prediction SO SPEECH COMMUNICATION LA English DT Article DE Speech transmission index; Modulation transfer function; Speech enhancement; Objective evaluation; Speech intelligibility; Short-time modulation spectrum ID MODEL AB The speech transmission index (STI) is a well known measure of intelligibility, most suited to the evaluation of speech intelligibility in rooms, with stimuli subjected to additive noise and reverberance. However, STI and its many variations do not effectively represent the intelligibility of stimuli containing non-linear distortions such as those resulting from processing by enhancement algorithms. In this paper, we revisit the STI approach and propose a variation which processes the modulation envelope in short-time segments, requiring only an assumption of quasi-stationarity (rather than the stationarity assumption of STI) of the modulation signal. Results presented in this work show that the proposed approach improves the measure's correlation with subjective intelligibility scores compared to traditional STI for stimuli corrupted by a range of noise types and subjected to different enhancement approaches. The approach is also shown to have higher correlation than other coherence, correlation and distance measures tested, but is unsuited to the evaluation of stimuli heavily distorted with (for example) masking based processing, where an alternative approach such as STOI is recommended. (C) 2014 Elsevier B.V. All rights reserved. C1 [Schwerin, Belinda; Paliwal, Kuldip] Griffith Univ, Griffith Sch Engn, Signal Proc Lab, Nathan, Qld 4111, Australia. RP Schwerin, B (reprint author), Griffith Univ, Griffith Sch Engn, Signal Proc Lab, Nathan, Qld 4111, Australia. EM belsch71@gmail.com CR [Anonymous], 2001, ITU T REC, P862 ANSI, 1997, S351997 ANSI Balakrishnan N., 1992, HDB LOGISTIC DISTRIB Boldt J., 2009, P EUSIPCO, P1849 CARTER GC, 1983, IEEE T AUDIO ELECTRO, V21, P337, DOI DOI 10.1109/TAU.1973.1162496 Christiansen C, 2010, SPEECH COMMUN, V52, P678, DOI 10.1016/j.specom.2010.03.004 DRULLMAN R, 1994, J ACOUST SOC AM, V95, P2670, DOI 10.1121/1.409836 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 Erkelens JS, 2007, IEEE T AUDIO SPEECH, V15, P1741, DOI 10.1109/TASL.2007.899233 FRENCH NR, 1947, J ACOUST SOC AM, V19, P90, DOI 10.1121/1.1916407 Goldsworthy RL, 2004, J ACOUST SOC AM, V116, P3679, DOI 10.1121/1.1804628 Holube I, 1996, J ACOUST SOC AM, V100, P1703, DOI 10.1121/1.417354 HOUTGAST T, 1985, J ACOUST SOC AM, V77, P1069, DOI 10.1121/1.392224 Hu Y, 2007, J ACOUST SOC AM, V122, P1777, DOI 10.1121/1.2766778 Kates JM, 2005, J ACOUST SOC AM, V117, P2224, DOI 10.1121/1.1862575 Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst Ma JF, 2009, J ACOUST SOC AM, V125, P3387, DOI 10.1121/1.3097493 Ma JF, 2011, SPEECH COMMUN, V53, P340, DOI 10.1016/j.specom.2010.10.005 Paliwal K, 2011, SPEECH COMMUN, V53, P327, DOI 10.1016/j.specom.2010.10.004 Payton KL, 1999, J ACOUST SOC AM, V106, P3637, DOI 10.1121/1.428216 Pearce D., 2000, P ICSLP, V4, P29 Quackenbush S. R., 1988, OBJECTIVE MEASURES S Rothauser E.
H., 1969, IEEE T AUDIO ELECTRO, V17, P225, DOI DOI 10.1109/TAU.1969.1162058 STEENEKEN HJM, 1980, J ACOUST SOC AM, V67, P318, DOI 10.1121/1.384464 Taal C., 2010, P ITG FACHT SPRACHK Taal CH, 2011, IEEE T AUDIO SPEECH, V19, P2125, DOI 10.1109/TASL.2011.2114881 Tribolet J. M., 1978, Proceedings of the 1978 IEEE International Conference on Acoustics, Speech and Signal Processing NR 27 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV-DEC PY 2014 VL 65 BP 9 EP 19 DI 10.1016/j.specom.2014.05.003 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA AP2KS UT WOS:000341901700002 ER PT J AU Moniz, H Batista, F Mata, AI Trancoso, I AF Moniz, Helena Batista, Fernando Mata, Ana Isabel Trancoso, Isabel TI Speaking style effects in the production of disfluencies SO SPEECH COMMUNICATION LA English DT Article DE Prosody; Disfluencies; Lectures; Dialogues; Speaking styles ID SPONTANEOUS SPEECH; REPAIR; UM AB This work explores speaking style effects in the production of disfluencies. University lectures and map-task dialogues are analyzed in order to evaluate whether the prosodic strategies used when uttering disfluencies vary across speaking styles. Our results show that the distribution of disfluency types is not arbitrary across lectures and dialogues. Moreover, although there is a statistically significant cross-style strategy of prosodic contrast marking (pitch and energy increases) between the region to repair and the repair of fluency, this strategy is displayed differently depending on the specific speech task. The overall patterns observed in the lectures, with regularities ascribed to speaker and disfluency types, do not hold with the same strength for the dialogues, due to underlying specificities of the communicative purposes. The tempo patterns found for both speech tasks also confirm their distinct behaviour, evidencing the more dynamic tempo characteristics of dialogues. In university lectures, prosodic cues are given to the listener both for the units inside disfluent regions and between these and the adjacent contexts. This suggests a stronger prosodic contrast marking of disfluency-fluency repair when compared to dialogues, as if teachers were monitoring the different regions (the introduction to a disfluency, the disfluency itself and the beginning of the repair), demarcating them in very contrastive ways. (C) 2014 Elsevier B.V. All rights reserved. C1 [Moniz, Helena; Batista, Fernando; Trancoso, Isabel] L2F INESCID, Lisbon, Portugal. [Moniz, Helena; Mata, Ana Isabel] Univ Lisbon, FLUL CLUL, P-1699 Lisbon, Portugal. [Batista, Fernando] Inst Univ Lisboa, ISCTE IUL, Lisbon, Portugal. [Trancoso, Isabel] Univ Lisbon, Inst Super Tecn, P-1699 Lisbon, Portugal. RP Moniz, H (reprint author), L2F INESCID, Lisbon, Portugal.
EM helenam@12f.inesc-id.pt; fmmb@12f.inesc-id.pt; aim@fl.ul.pt; isabel.trancoso@12f.inesc-id.pt RI Batista, Fernando/C-8355-2009 OI Batista, Fernando/0000-0002-1075-0177 FU FCT - Fundacao para a Ciencia e Tecnologia [FCT/SFRH/BD/44671/2008, SFRH/BPD/95849/2013, PEst-OE/EEI/LA0021/2013, PTDC/CLE-LIN/120017/2010]; European Project EU-IST FP7 project SpeDial [611396]; ISCTE-IUL, Instituto Universitario de Lisboa FX This work was supported by national funds through FCT - Fundacao para a Ciencia e Tecnologia, under Ph.D Grant FCT/SFRH/BD/44671/2008 and Post-doc fellow researcher Grant SFRH/BPD/95849/2013, projects PEst-OE/EEI/LA0021/2013 and PTDC/CLE-LIN/120017/2010, by European Project EU-IST FP7 project SpeDial under Contract 611396, and by ISCTE-IUL, Instituto Universitario de Lisboa. CR Allwood J., 1990, NORD J LINGUIST, P3 Amaral R., 2008, INT 2008 BRISB AUSTR Arnold JE, 2003, J PSYCHOLINGUIST RES, V32, P25, DOI 10.1023/A:1021980931292 Barry W, 1995, ICPHS 1995 STOCKH SW Batista F, 2011, THESIS I SUPERIOR TE Batista F., 2012, J SPEECH SCI, P115 Batista F, 2012, IEEE T AUDIO SPEECH, V20, P474, DOI 10.1109/TASL.2011.2159594 Benus S., 2012, 3 IEEE C COGN INF KO Biber D., 1988, VARIATION SPEECH WRI Blaauw E., 1995, PERCEPTUAL CLASSIFIC Brennan SE, 2001, J MEM LANG, V44, P274, DOI 10.1006/jmla.2000.2753 Caseiro D., 2002, PMLA ISCA TUT RES WO Clark HH, 2002, COGNITION, V84, P73, DOI 10.1016/S0010-0277(02)00017-3 Cole J., 2005, DISS 2005 AIX EN PRO Conrad S., 2009, REGISTER GENRE STYLE Cucchiarini C, 2002, J ACOUST SOC AM, V111, P2862, DOI 10.1121/1.1471894 Eklund R., 2004, THESIS U LINKOPINK Erard M., 2007, SLIPS STUMBLES VERBA Eskenazi M., 1993, EUR 1993 BERL GERM Fox-Tree J.E., 1995, J MEM LANG, P709 Gravano A., 2011, INT 2011 FLOR IT Grojean F., 1980, CROSS LINGUISTIC ASS, P144 Heike A., 1981, LANG SPEECH, P147 Hindle D., 1983, ACL, P123 Hirschberg J., 2000, THEORY EXPT STUDIES, P335 Koehn P., 2005, 10 MACH TRANSL SUMM Levelt W. J. M., 1989, SPEAKING Levelt W. J. M., 1983, J SEMANT, V2, P205 LEVELT WJM, 1983, COGNITION, V14, P41, DOI 10.1016/0010-0277(83)90026-4 Liu Y, 2006, COMPUT SPEECH LANG, V20, P468, DOI 10.1016/j.csl.2005.06.002 Mata A.I., 2010, SPEECH PROSODY Mata A.I., 1999, THESIS U LISBON Moniz H, 2006, THESIS U LISBON NAKATANI CH, 1994, J ACOUST SOC AM, V95, P1603, DOI 10.1121/1.408547 Neto J, 2008, INT CONF ACOUST SPEE, P1561, DOI 10.1109/ICASSP.2008.4517921 O'Connell DC, 2008, COGN LANG A SER PSYC, P3, DOI 10.1007/978-0-387-77632-3_1 Pellegrini T., 2012, GSCP 2012 BRAZ Plauche M., 1999, ICPHS 1999 S FRANC U Ranganath R, 2013, COMPUT SPEECH LANG, V27, P89, DOI 10.1016/j.csl.2012.01.005 Ribeiro R, 2011, J ARTIF INTELL RES, V42, P275 Rose R., 1998, THESIS U BIRMINGHAM Savova G., 2003, DISS 2003 GOT SWED Savova G., 2003, INT 2003 GEN SWITZ Schuller B, 2013, COMPUT SPEECH LANG, V27, P4, DOI 10.1016/j.csl.2012.02.005 Shriberg E., 2001, J INT PHON ASSOC, V31, P153 Shriberg E, 1999, INT C PHON SCI SAN F, P612 Shriberg E. E., 1994, THESIS U CALIFORNIA Sjolander K., 1998, ICSLP 1998 SYDN AUST, P3217 Swerts M, 1998, J PRAGMATICS, V30, P485, DOI 10.1016/S0378-2166(98)00014-9 Trancoso I., 2008, LREC 2008 LANG RES E Trancoso I., 1998, PROPOR98 PORT AL BRA Vaissiere J, 2005, BLACKW HBK LINGUIST, P236, DOI 10.1002/9780470757024.ch10 Viana M.C., 1998, WORKSH LING COMP LIS NR 53 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD NOV-DEC PY 2014 VL 65 BP 20 EP 35 DI 10.1016/j.specom.2014.05.004 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA AP2KS UT WOS:000341901700003 ER PT J AU Rigoulot, S Pell, MD AF Rigoulot, Simon Pell, Marc D. TI Emotion in the voice influences the way we scan emotional faces SO SPEECH COMMUNICATION LA English DT Article DE Speech; Prosody; Face; Eye-tracking; Emotion; Cross-modal ID SPOKEN-WORD RECOGNITION; FACIAL EXPRESSIONS; EYE-MOVEMENTS; CULTURAL-DIFFERENCES; NEURAL EVIDENCE; SPEECH PROSODY; TIME-COURSE; PERCEPTION; INFORMATION; ATTENTION AB Previous eye-tracking studies have found that listening to emotionally-inflected utterances guides visual behavior towards an emotionally congruent face (e.g., Rigoulot and Pell, 2012). Here, we investigated in more detail whether emotional speech prosody influences how participants scan and fixate specific features of an emotional face that is congruent or incongruent with the prosody. Twenty-one participants viewed individual faces expressing fear, sadness, disgust, or happiness while listening to an emotionally-inflected pseudoutterance spoken in a congruent or incongruent prosody. Participants judged whether the emotional meaning of the face and voice were the same or different (match/mismatch). Results confirm that there were significant effects of prosody congruency on eye movements when participants scanned a face, although these varied by emotion type; a matching prosody promoted more frequent looks to the upper part of fear and sad facial expressions, whereas visual attention to upper and lower regions of happy (and to some extent disgust) faces was more evenly distributed. These data suggest ways that vocal emotion cues guide how humans process facial expressions in a way that could facilitate recognition of salient visual cues, to arrive at a holistic impression of intended meanings during interpersonal events. (C) 2014 Elsevier B.V. All rights reserved. C1 [Rigoulot, Simon; Pell, Marc D.] McGill Univ, Fac Med, Sch Commun Sci & Disorders, Montreal, PQ H3G 1A8, Canada. McGill Ctr Res Brain Language & Mus, Montreal, PQ, Canada. RP Rigoulot, S (reprint author), McGill Univ, Fac Med, Sch Commun Sci & Disorders, 1266 Ave Pins Ouest, Montreal, PQ H3G 1A8, Canada. EM simon.rigoulot@mail.mcgill.ca FU Natural Sciences and Engineering Research Council of Canada FX We are grateful to Catherine Knowles and Hope Valeriote for running the experiment. This research was funded by a Discovery Grant from the Natural Sciences and Engineering Research Council of Canada (to MDP). 
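The Rigoulot and Pell record above rests on summarising eye-tracking fixations by facial region (upper versus lower face) per prosody-congruency condition. The sketch below shows a minimal version of that kind of area-of-interest summary; the fixation record format and the simple half-face split are assumptions for illustration, not the study's actual AOI definitions or analysis software.

```python
# Minimal sketch of an area-of-interest (AOI) summary of the kind such an
# eye-tracking analysis builds on. The Fixation structure and the simple
# upper/lower split of the face are hypothetical, not the study's AOIs.
from dataclasses import dataclass
from typing import Iterable

@dataclass
class Fixation:
    y: float          # vertical gaze position in pixels (screen y grows downward)
    duration: float   # fixation duration in seconds

def upper_face_proportion(fixations: Iterable[Fixation],
                          face_top_y: float, face_bottom_y: float) -> float:
    """Share of on-face fixation time spent on the upper half of the face."""
    midline = (face_top_y + face_bottom_y) / 2.0
    on_face = [f for f in fixations if face_top_y <= f.y <= face_bottom_y]
    total = sum(f.duration for f in on_face)
    if total == 0.0:
        return 0.0
    upper = sum(f.duration for f in on_face if f.y <= midline)
    return upper / total
```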
CR Adolphs R, 2005, NATURE, V433, P68, DOI 10.1038/nature03086 Banse R, 1996, J PERS SOC PSYCHOL, V70, P614, DOI 10.1037/0022-3514.70.3.614 BASSILI JN, 1979, J PERS SOC PSYCHOL, V37, P2049, DOI 10.1037//0022-3514.37.11.2049 Bate S, 2009, NEUROPSYCHOLOGY, V23, P658, DOI 10.1037/a0014518 Bayle DJ, 2009, PLOS ONE, V4, DOI 10.1371/journal.pone.0008207 Beaudry O, 2014, COGNITION EMOTION, V28, P416, DOI 10.1080/02699931.2013.833500 Becker MW, 2009, Q J EXP PSYCHOL, V62, P1257, DOI 10.1080/17470210902725753 Blais C, 2012, NEUROPSYCHOLOGIA, V50, P2830, DOI 10.1016/j.neuropsychologia.2012.08.010 Brosch T, 2009, J COGNITIVE NEUROSCI, V21, P1670, DOI 10.1162/jocn.2009.21110 BRUCE V, 1992, PHILOS T ROY SOC B, V335, P121, DOI 10.1098/rstb.1992.0015 Calder AJ, 2005, NAT REV NEUROSCI, V6, P641, DOI 10.1038/nrn1724 Calder AJ, 2000, J EXP PSYCHOL HUMAN, V26, P527, DOI 10.1037/0096-1523.26.2.527 Calvo MG, 2009, COGNITION EMOTION, V23, P782, DOI 10.1080/02699930802151654 Calvo MG, 2005, J EXP PSYCHOL HUMAN, V31, P502, DOI 10.1037/0096-1523.31.3.502 Calvo MG, 2013, MOTIV EMOTION, V37, P202, DOI 10.1007/s11031-012-9298-1 Calvo MG, 2011, VISION RES, V51, P1751, DOI 10.1016/j.visres.2011.06.001 Calvo MG, 2008, J EXP PSYCHOL GEN, V137, P471, DOI 10.1037/a0012771 Campanella S, 2007, TRENDS COGN SCI, V11, P535, DOI 10.1016/j.tics.2007.10.001 Charash M, 2002, J ANXIETY DISORD, V16, P529, DOI 10.1016/S0887-6185(02)00171-8 Cisler JM, 2009, COGNITION EMOTION, V23, P675, DOI 10.1080/02699930802051599 Collignon O, 2008, BRAIN RES, V1242, P126, DOI 10.1016/j.brainres.2008.04.023 COOPER RM, 1974, COGNITIVE PSYCHOL, V6, P84, DOI 10.1016/0010-0285(74)90005-X Cvejic E, 2010, SPEECH COMMUN, V52, P555, DOI 10.1016/j.specom.2010.02.006 Dahan D, 2001, COGNITIVE PSYCHOL, V42, P317, DOI 10.1006/cogp.2001.0750 Darwin Charles, 1998, EXPRESSION EMOTIONS, V3rd de Gelder B, 2000, COGNITION EMOTION, V14, P289 Dolan RJ, 2001, P NATL ACAD SCI USA, V98, P10006, DOI 10.1073/pnas.171288598 Eisenbarth H, 2011, EMOTION, V11, P860, DOI 10.1037/a0022758 Ekman P., 2002, FACIAL ACTION CODING Ekman P., 1990, J PERS SOC PSYCHOL, V58, P343, DOI [DOI 10.1037/0022-3514.58.2.342, 10.1037/0022-3514.58.2.342] Ekman P., 1976, PICTURES FACIAL AFFE Gordon MS, 2011, Q J EXP PSYCHOL, V64, P730, DOI 10.1080/17470218.2010.516835 Gosselin F, 2001, VISION RES, V41, P2261, DOI 10.1016/S0042-6989(01)00097-9 Green MJ, 2003, COGNITION EMOTION, V17, P779, DOI 10.1080/02699930302282 Hall JK, 2010, COGNITION EMOTION, V24, P629, DOI 10.1080/02699930902906882 Huettig F, 2005, COGNITION, V96, pB23, DOI 10.1016/j.cognition.2004.10.003 Hunnius S, 2011, COGNITION EMOTION, V25, P193, DOI 10.1080/15298861003771189 Jack RE, 2009, CURR BIOL, V19, P1543, DOI 10.1016/j.cub.2009.07.051 Jaywant A, 2012, SPEECH COMMUN, V54, P1, DOI 10.1016/j.specom.2011.05.011 Juslin PN, 2003, PSYCHOL BULL, V129, P770, DOI 10.1037/0033-2909.129.5.770 Malcolm GL, 2008, J VISION, V8, DOI 10.1167/8.8.2 MATSUMOTO D, 1989, J NONVERBAL BEHAV, V13, P171, DOI 10.1007/BF00987048 Messinger DS, 2012, EMOTION, V12, P430, DOI 10.1037/a0026498 Neath KN, 2014, COGNITION EMOTION, V28, P115, DOI 10.1080/02699931.2013.812557 Niedenthal PM, 2007, SCIENCE, V316, P1002, DOI 10.1126/science.1136930 Palermo R, 2007, NEUROPSYCHOLOGIA, V45, P75, DOI 10.1016/j.neuropsychologia.2006.04.025 Paulmann S, 2011, MOTIV EMOTION, V35, P192, DOI 10.1007/s11031-011-9206-0 Paulmann S, 2012, SPEECH COMMUN, V54, P92, DOI 10.1016/j.specom.2011.07.004 Paulmann S, 2010, COGN AFFECT BEHAV NE, V10, P230, DOI 10.3758/CABN.10.2.230 Pell MD, 2009, J 
PHONETICS, V37, P417, DOI 10.1016/j.wocn.2009.07.005 Pell MD, 2005, J NONVERBAL BEHAV, V29, P193, DOI 10.1007/s10919-005-7720-z Pell M.D., 2011, PLOS ONE, V6 Pell MD, 2011, COGNITION EMOTION, V25, P834, DOI 10.1080/02699931.2010.516915 Pell MD, 1999, BRAIN LANG, V69, P161, DOI 10.1006/brln.1999.2065 Pell MD, 1999, CORTEX, V35, P455, DOI 10.1016/S0010-9452(08)70813-X Pell MD, 1997, BRAIN LANG, V57, P195, DOI 10.1006/brln.1997.1736 Pourtois G, 2005, CORTEX, V41, P49, DOI 10.1016/S0010-9452(08)70177-1 Rigoulot S., BRAIN RES IN PRESS Rigoulot S, 2012, PLOS ONE, V7, DOI 10.1371/journal.pone.0030740 Rigoulot S, 2012, NEUROPSYCHOLOGIA, V50, P2887, DOI 10.1016/j.neuropsychologia.2012.08.015 Rigoulot S, 2011, NEUROPSYCHOLOGIA, V49, P2013, DOI 10.1016/j.neuropsychologia.2011.03.031 Rousselet GA, 2005, VIS COGN, V12, P852, DOI 10.1080/13506280444000553 SCHERER KR, 1991, MOTIV EMOTION, V15, P123, DOI 10.1007/BF00995674 Schyns PG, 2002, PSYCHOL SCI, V13, P402, DOI 10.1111/1467-9280.00472 Stins JF, 2011, EXP BRAIN RES, V212, P603, DOI 10.1007/s00221-011-2767-z Swerts M, 2008, J PHONETICS, V36, P219, DOI 10.1016/j.wocn.2007.05.001 Tanaka A, 2010, PSYCHOL SCI, V21, P1259, DOI 10.1177/0956797610380698 TANENHAUS MK, 1995, SCIENCE, V268, P1632, DOI 10.1126/science.7777863 Thompson LA, 2009, BRAIN COGNITION, V69, P108, DOI 10.1016/j.bandc.2008.06.002 Tottenham N, 2009, PSYCHIAT RES, V168, P242, DOI 10.1016/j.psychres.2008.05.006 VASEY MW, 1987, PSYCHOPHYSIOLOGY, V24, P479, DOI 10.1111/j.1469-8986.1987.tb00324.x Vassallo S, 2009, J VISION, V9, DOI 10.1167/9.3.11 Wong B, 2005, NEUROPSYCHOLOGY, V19, P739, DOI 10.1037/0894-4105.19.6.739 Yarbus A. L., 1967, EYE MOVEMENTS VISION Yee E, 2009, PSYCHON B REV, V16, P869, DOI 10.3758/PBR.16.5.869 Yee E, 2006, J EXP PSYCHOL LEARN, V32, P1, DOI 10.1037/0278-7393.32.1.1 Yuki M, 2007, J EXP SOC PSYCHOL, V43, P303, DOI 10.1016/j.jesp.2006.02.004 NR 77 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV-DEC PY 2014 VL 65 BP 36 EP 49 DI 10.1016/j.specom.2014.05.006 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA AP2KS UT WOS:000341901700004 ER PT J AU Skantze, G Hjalmarsson, A Oertel, C AF Skantze, Gabriel Hjalmarsson, Anna Oertel, Catharine TI Turn-taking, feedback and joint attention in situated human-robot interaction SO SPEECH COMMUNICATION LA English DT Article DE Turn-taking; Feedback; Joint attention; Prosody; Gaze; Uncertainty ID GAZE; CONVERSATIONS; BACKCHANNELS; ADDRESSEES; FEATURES; SPEAKING; DIALOG; TASK AB In this paper, we present a study where a robot instructs a human on how to draw a route on a map. The human and robot are seated face-to-face with the map placed on the table between them. The user's and the robot's gaze can thus serve several simultaneous functions: as cues to joint attention, turn-taking, level of understanding and task progression. We have compared this face-to-face setting with a setting where the robot employs a random gaze behaviour, as well as a voice-only setting where the robot is hidden behind a paper board. In addition to this, we have also manipulated turn-taking cues such as completeness and filled pauses in the robot's speech. 
By analysing the participants' subjective rating, task completion, verbal responses, gaze behaviour, and drawing activity, we show that the users indeed benefit from the robot's gaze when talking about landmarks, and that the robot's verbal and gaze behaviour has a strong effect on the users' turn-taking behaviour. We also present an analysis of the users' gaze and lexical and prosodic realisation of feedback after the robot instructions, and show that these cues reveal whether the user has yet executed the previous instruction, as well as the user's level of uncertainty. (C) 2014 Elsevier B.V. All rights reserved. C1 [Skantze, Gabriel; Hjalmarsson, Anna; Oertel, Catharine] KTH Royal Inst Technol, Dept Speech Mus & Hearing, Stockholm, Sweden. RP Skantze, G (reprint author), KTH Royal Inst Technol, Dept Speech Mus & Hearing, Stockholm, Sweden. EM gabriel@speech.kth.se FU Swedish research council (VR) [2011-6237, 2011-6152]; GetHomeSafe (EU 7th Framework STREP) [288667] FX Gabriel Skantze is supported by the Swedish research council (VR) project Incremental processing in multimodal conversational systems (2011-6237). Anna Hjalmarsson is supported by the Swedish Research Council (VR) project Classifying and deploying pauses for flow control in conversational systems (2011-6152). Catharine Oertel is supported by GetHomeSafe (EU 7th Framework STREP 288667). CR Al Moubayed S., 2013, INT J HUMANOID ROB, V10 Allen J.F., 1997, DRAFT DAMSL DI UNPUB Allopenna PD, 1998, J MEM LANG, V38, P419, DOI 10.1006/jmla.1997.2558 Allwood J., 1992, Journal of Semantics, V9, DOI 10.1093/jos/9.1.1 Anderson A., LANG SPEECH, V34, P351 Baron-Cohen S., 1995, JOINT ATTENTION ITS, P41 Bavelas JB, 2002, J COMMUN, V52, P566, DOI 10.1093/joc/52.3.566 Beskow J., 2000, P ICSLP 2000 BEIJ CH, V4, P464 Boersma P., 2001, GLOT INT, V5, P341 Bohus D., 2010, P ICMI10 BEIJ CHIN Boucher J.D., 2012, FRONT NEUROROBOTICS, V6 Boye J, 2007, P 8 SIGDIAL WORKSH D Boye J., 2012, IWSDS2012 INT WORKSH BOYLE EA, 1994, LANG SPEECH, V37, P1 Buschmeier H, 2011, P 11 INT C INT VIRT, P169, DOI 10.1007/978-3-642-23974-8_19 Buschmeier H., 2012, P 13 ANN M SPEC INT, P295 Cathcart N., 2003, 10 C EUR CHAPT ASS C Clark H. H., 1981, ELEMENTS DISCOURSE U, P10 Clark H. H., 1996, USING LANGUAGE Clark HH, 2004, J MEM LANG, V50, P62, DOI 10.1016/j.jml.2003.08.004 DUNCAN S, 1972, J PERS SOC PSYCHOL, V23, P283, DOI 10.1037/h0033031 Edlund J, 2009, LANG SPEECH, V52, P351, DOI 10.1177/0023830909103179 Forbes-Riley K, 2011, SPEECH COMMUN, V53, P1115, DOI 10.1016/j.specom.2011.02.006 Gravano A, 2011, COMPUT SPEECH LANG, V25, P601, DOI 10.1016/j.csl.2010.10.003 Grosz B. 
J., 1986, Computational Linguistics, V12 Hall M., 2009, SIGKDD EXPLORATIONS, V11, P1, DOI DOI 10.1145/1656274.1656278 Heldner M, 2010, J PHONETICS, V38, P555, DOI 10.1016/j.wocn.2010.08.002 Hjalmarsson A, 2011, SPEECH COMMUN, V53, P23, DOI 10.1016/j.specom.2010.08.003 Hjalmarsson A., 2012, P IVA 2012 WORKSH RE Huang L., 2011, INTELLIGENT VIRTUAL, P68 Iwase T., 1998, P ICSLP SYDN AUSTR, P1203 Johansson M., 2013, INT C SOC ROB ICSR 2 Katzenmaier M., 2004, P INT C MULT INT ICM KENDON A, 1967, ACTA PSYCHOL, V26, P22, DOI 10.1016/0001-6918(67)90005-4 Kennington C., 2013, P SIGDIAL 2013 C MET, P173 Koiso H, 1998, LANG SPEECH, V41, P295 Lai C., 2010, P INT MAK JAP LANDIS JR, 1977, BIOMETRICS, V33, P159, DOI 10.2307/2529310 Liscombe J., 2006, P INT 2006 PITTSB PA Meena R., 2013, 14 ANN M SPEC INT GR, P375 Morency LP, 2010, AUTON AGENT MULTI-AG, V20, P70, DOI 10.1007/s10458-009-9092-y Mutlu B, 2006, P 6 IEEE RAS INT C H, P518 Nakano Yukiko I., 2003, P 41 ANN M ASS COMP, V1, P553, DOI 10.3115/1075096.1075166 Neiberg D., 2012, INT WORKSH FEEDB BEH Oertel C., 2012, P INT 2012 PORTL OR Okumura Y, 2013, J EXP CHILD PSYCHOL, V116, P86, DOI 10.1016/j.jecp.2013.02.007 Pon-Barry H, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P74 Randolph J. J., 2005, JOENS U LEARN INSTR Reidsma D, 2011, J MULTIMODAL USER IN, V4, P97, DOI 10.1007/s12193-011-0060-x SACKS H, 1974, LANGUAGE, V50, P696, DOI 10.2307/412243 Schegloff Emanuel A., 1982, ANAL DISCOURSE TEXT, P71 Schlangen D, 2011, DIALOGUE DISCOURSE, V2, P83 SCHOBER MF, 1989, COGNITIVE PSYCHOL, V21, P211, DOI 10.1016/0010-0285(89)90008-X Skantze G., 2013, P INT Skantze G., 2009, P SIGDIAL 2009 LOND Skantze G., 2012, P ICMI SANT MON CA Skantze G., 2012, P INT WORKSH FEEDB B Skantze G., 2013, 14 ANN M SPEC INT GR Skantze G., 2009, P 12 C EUR CHAPT ASS Skantze G, 2013, COMPUT SPEECH LANG, V27, P243, DOI 10.1016/j.csl.2012.05.004 Staudte M, 2011, COGNITION, V120, P268, DOI 10.1016/j.cognition.2011.05.005 Stocksmeier T., 2007, P INT 2007 Velichkovsky B. M., 1995, PRAGMAT COGN, V3, P199, DOI 10.1075/pc.3.2.02vel Vertegaal R., 2001, P ACM C HUM FACT COM Wallers A, 2006, LECT NOTES ARTIF INT, V4021, P183 Ward N., 2004, P INT C ISCA SPEC IN, P325 Ward N, 2003, INT J HUM-COMPUT ST, V59, P603, DOI 10.1016/S1071-5819(03)00085-5 Yngve Victor, 1970, 6 REG M CHIC LING SO, P567 NR 68 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV-DEC PY 2014 VL 65 BP 50 EP 66 DI 10.1016/j.specom.2014.05.005 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA AP2KS UT WOS:000341901700005 ER PT J AU Yuan, JH Liberman, M AF Yuan, Jiahong Liberman, Mark TI F-0 declination in English and Mandarin Broadcast News Speech SO SPEECH COMMUNICATION LA English DT Article DE Declination; F-0; Regression; Convex-hull ID FUNDAMENTAL-FREQUENCY; PERCEIVED PROMINENCE; SENTENCE INTONATION; ACCENTED SYLLABLES; MAXIMUM SPEED; PITCH; PERCEPTION; DOWNTREND; PATTERNS; CONTOURS AB This study investigates F-0 declination in broadcast news speech in English and Mandarin Chinese. The results demonstrate a strong relationship between utterance length and declination slope. Shorter utterances have steeper declination, even after excluding the initial rising and final lowering effects. 
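The Yuan and Liberman record above relates F0 declination slope to utterance length. The sketch below shows one plain way to estimate a declination slope, assuming a frame-level F0 track: least-squares regression of F0 (converted to semitones) on time over voiced frames. The paper's separate top-line/baseline and convex-hull treatments are not reproduced here, and the frame step and reference frequency are illustrative assumptions.

```python
# Minimal sketch: estimate an utterance's overall F0 declination slope by
# least-squares regression of F0 (in semitones) on time, using voiced frames
# only. This ignores the paper's top-line/baseline and convex-hull analyses;
# frame step and reference frequency are illustrative assumptions.
import numpy as np

def declination_slope(f0_hz, frame_step_s=0.01, ref_hz=100.0):
    """Slope in semitones per second; f0_hz is a frame-level track with 0 = unvoiced."""
    f0_hz = np.asarray(f0_hz, dtype=float)
    t = np.arange(len(f0_hz)) * frame_step_s
    voiced = f0_hz > 0
    if voiced.sum() < 2:
        return float("nan")
    st = 12.0 * np.log2(f0_hz[voiced] / ref_hz)        # Hz -> semitones re ref_hz
    slope, _intercept = np.polyfit(t[voiced], st, 1)   # negative slope = declination
    return slope
```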
Initial F-0 tends to be higher when the utterance is longer, whereas the low bound of final F-0 is independent of the utterance length. Both top line and baseline show declination. The top line and baseline have different patterns in Mandarin Chinese, whereas in English their patterns are similar. Mandarin Chinese has more and steeper declination than English, as well as wider pitch range and more F-0 fluctuations. Our results suggest that F-0 declination is linguistically controlled, not just a by-product of the physics and physiology of talking. (C) 2014 Elsevier B.V. All rights reserved. C1 [Yuan, Jiahong; Liberman, Mark] Univ Penn, Linguist Data Consortium, Philadelphia, PA 19104 USA. RP Yuan, JH (reprint author), Univ Penn, Linguist Data Consortium, Philadelphia, PA 19104 USA. EM jiahong.yuan@gmail.com FU NSF [0964556] FX An earlier version of this paper was presented at Inter-speech 2010. This work was supported in part by NSF Grant 0964556. CR 't Hart J., 1979, IPO ANN PROGR REPORT, V14, P61 ATKINSON JE, 1978, J ACOUST SOC AM, V63, P211, DOI 10.1121/1.381716 BAER T, 1979, J ACOUST SOC AM, V65, P1271, DOI 10.1121/1.382795 Breckenridge J., 1977, DECLINATION PHONOLOG COHEN A, 1982, PHONETICA, V39, P254 COLLIER R, 1975, J ACOUST SOC AM, V58, P249, DOI 10.1121/1.380654 Cooper W. E., 1981, FUNDAMENTAL FREQUENC EADY SJ, 1982, LANG SPEECH, V25, P29 Fujisaka H., 1984, Journal of the Acoustical Society of Japan (E), V5 GARDING E, 1979, PHONETICA, V36, P207 Gelfer C.E., 1983, VOCAL FOLD PHYSL BIO, P113 Han S., 2011, PLOS ONE, V6, P1 Heuven van V.J., 2004, SPEECH LANGUAGES STU, P83 Hirose H., 2010, HDB PHONETIC SCI, P130, DOI DOI 10.1002/9781444317251.CH4 Hollien H., 1983, VOCAL FOLD PHYSL, P361 HOLMES VM, 1984, LANG SPEECH, V27, P115 Honda K, 1999, LANG SPEECH, V42, P401 Keating P, 2012, J ACOUST SOC AM, V132, P1050, DOI 10.1121/1.4730893 Laan GPM, 1997, SPEECH COMMUN, V22, P43, DOI 10.1016/S0167-6393(97)00012-5 Ladd D. 
R., 1984, PHONOLOGY YB, V1, P53, DOI DOI 10.1017/S0952675700000294 LADD DR, 1993, LANG SPEECH, V36, P435 Ladefoged P., 1967, 3 AREAS EXPT PHONETI Ladefoged P., 2009, PRELIMINARY STUDIES Liberman Mark, 1984, LANGUAGE SOUND STRUC, P157 Lieberman P, 1966, THESIS LIEBERMAN P, 1985, J ACOUST SOC AM, V77, P649, DOI 10.1121/1.391883 Maeda S, 1976, THESIS MIT MERMELSTEIN P, 1975, J ACOUST SOC AM, V58, P880, DOI 10.1121/1.380738 Ni JF, 2006, SPEECH COMMUN, V48, P989, DOI 10.1016/j.specom.2006.01.002 Nooteboom S.G., 1975, STRUCTURE PROCESS SP, P124 Nooteboom S.G, 1995, PRODUCING SPEECH CON, P3 O'Shaughnessy, 1976, THESIS MIT OHALA JJ, 1990, NATO ADV SCI I D-BEH, V55, P23 OHALA Johni, 1978, TONE LINGUISTIC SURV, P5 PIERREHUMBERT J, 1979, J ACOUST SOC AM, V66, P363, DOI 10.1121/1.383670 Pierrehumbert J, 1980, THESIS MIT Prieto P., 2006, P SPEECH PROS 2006, V2006, P803 Prieto P, 1996, J PHONETICS, V24, P445, DOI 10.1006/jpho.1996.0024 Rialland A., 2001, P S CROSS LING STUD, P301 Shih CL, 2000, TEXT SPEECH LANG TEC, V15, P243 Sorensen J.M., 1980, PERCEPTION PRODUCTIO, P399 Sternberg S., 1980, PERCEPTION PRODUCTIO, P507 Stevens K.N., 2000, ACOUSTIC PHONETICS, P55 STRIK H, 1995, J PHONETICS, V23, P203, DOI 10.1016/S0095-4470(95)80043-3 SUNDBERG J, 1979, J PHONETICS, V7, P71 Swerts M, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P1501 Talkin D., 1996, GET F0 ONLINE DOCUME TERKEN J, 1994, J ACOUST SOC AM, V95, P3662, DOI 10.1121/1.409936 TERKEN J, 1991, J ACOUST SOC AM, V89, P1768, DOI 10.1121/1.401019 THORSEN NG, 1980, J ACOUST SOC AM, V67, P1014, DOI 10.1121/1.384069 TITZE IR, 1988, J ACOUST SOC AM, V83, P1536, DOI 10.1121/1.395910 Tondering J., 2011, P ICPHS, VXVII, P2010 UMEDA N, 1982, J PHONETICS, V10, P279 Vaissiere J, 2005, BLACKW HBK LINGUIST, P236, DOI 10.1002/9780470757024.ch10 Vaissiere Jacqueline, 1983, PROSODY MODELS MEASU, P53 Whalen DH, 1997, PHONETICA, V54, P138 Xu Y, 1999, J PHONETICS, V27, P55, DOI 10.1006/jpho.1999.0086 Xu Y, 2002, J ACOUST SOC AM, V111, P1399, DOI 10.1121/1.1445789 Yuan J., 2002, P SPEECH PROS 2002, V2002, P711 Yuan J., 2004, THESIS CORNELL U NR 60 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV-DEC PY 2014 VL 65 BP 67 EP 74 DI 10.1016/j.specom.2014.06.001 PG 8 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA AP2KS UT WOS:000341901700006 ER PT J AU Kates, JM Arehart, KH AF Kates, James M. Arehart, Kathryn H. TI The Hearing-Aid Speech Perception Index (HASPI) SO SPEECH COMMUNICATION LA English DT Article DE Speech intelligibility; Intelligibility index; Auditory model; Hearing loss; Hearing aids ID AUDITORY-NERVE RESPONSES; SPECTRAL-SHAPE-FEATURES; INTELLIGIBILITY PREDICTION; IMPAIRED LISTENERS; FINE-STRUCTURE; SINUSOIDAL REPRESENTATION; ARTICULATION INDEX; TRANSMISSION INDEX; VOCODED SPEECH; WORKING-MEMORY AB This paper presents a new index for predicting speech intelligibility for normal-hearing and hearing-impaired listeners. The Hearing-Aid Speech Perception Index (HASPI) is based on a model of the auditory periphery that incorporates changes due to hearing loss. The index compares the envelope and temporal fine structure outputs of the auditory model for a reference signal to the outputs of the model for the signal under test. 
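The HASPI record above compares auditory-model envelope and fine-structure outputs for a reference signal and a test signal. The sketch below illustrates only the bare comparison idea, band-wise correlation of temporal envelopes between a clean reference and a degraded signal; HASPI's auditory-periphery model, hearing-loss parameters and cepstral/fine-structure terms are not reproduced, the band edges are arbitrary assumptions, and a sampling rate of at least roughly 12 kHz is assumed.

```python
# Minimal sketch of the comparison idea behind envelope-based intelligibility
# indices: correlate the temporal envelopes of a clean reference and a degraded
# test signal in a few bands. HASPI itself uses a full auditory-periphery model
# (including hearing-loss parameters) plus cepstral and fine-structure terms;
# none of that is reproduced here, and the band edges are arbitrary.
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def band_envelope(x, fs, lo, hi):
    sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
    return np.abs(hilbert(sosfiltfilt(sos, x)))

def mean_envelope_correlation(ref, test, fs,
                              bands=((100, 400), (400, 1000), (1000, 2500), (2500, 5000))):
    """Average Pearson correlation of band envelopes (a crude intelligibility proxy)."""
    scores = []
    for lo, hi in bands:
        e_ref = band_envelope(ref, fs, lo, hi)
        e_test = band_envelope(test, fs, lo, hi)
        n = min(len(e_ref), len(e_test))
        scores.append(np.corrcoef(e_ref[:n], e_test[:n])[0, 1])
    return float(np.mean(scores))
```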
The auditory model for the reference signal is set for normal hearing, while the model for the test signal incorporates the peripheral hearing loss. The new index is compared to indices based on measuring the coherence between the reference and test signals and based on measuring the envelope correlation between the two signals. HASPI is found to give accurate intelligibility predictions for a wide range of signal degradations including speech degraded by noise and nonlinear distortion, speech processed using frequency compression, noisy speech processed through a noise-suppression algorithm, and speech where the high frequencies are replaced by the output of a noise vocoder. The coherence and envelope metrics used for comparison give poor performance for at least one of these test conditions. (C) 2014 Elsevier B.V. All rights reserved. C1 [Kates, James M.; Arehart, Kathryn H.] Univ Colorado, Dept Speech Language & Hearing Sci, Boulder, CO 80309 USA. RP Kates, JM (reprint author), Univ Colorado, Dept Speech Language & Hearing Sci, Boulder, CO 80309 USA. EM James.Kates@colorado.edu; Kathryn.Arehart@colorado.edu FU GN ReSound; NIH [R01 DC60014] FX The authors thank Dr. Rosalinda Baca for providing the statistical analysis used in this paper. Author JMK was supported by a grant from GN ReSound. Author KHA was supported by a NIH Grant (R01 DC60014) and by the grant from GN ReSound. CR Aguilera Munoz C.M., 1999, ELECT CIRCUITS SYSTE, V2, P741 Anderson M.C., 2010, THESIS U COLORADO ANSI, S351997 ANSI Arehart K.H., 2013, P M AC POMA JUN 2 7, V19 Arehart KH, 2013, EAR HEARING, V34, P251, DOI 10.1097/AUD.0b013e318271aa5e Bruce IC, 2003, J ACOUST SOC AM, V113, P369, DOI 10.1121/1.1519544 BYRNE D, 1986, EAR HEARING, V7, P257 CARTER GC, 1983, IEEE T AUDIO ELECTRO, V21, P337, DOI DOI 10.1109/TAU.1973.1162496 Chen F., 2013, P 35 ANN INT C IEEE, P4199 Chen F, 2011, EAR HEARING, V32, P331, DOI 10.1097/AUD.0b013e3181ff3515 Ching TYC, 1998, J ACOUST SOC AM, V103, P1128, DOI 10.1121/1.421224 Christiansen C, 2010, SPEECH COMMUN, V52, P678, DOI 10.1016/j.specom.2010.03.004 Cooke M., 1991, THESIS U SHEFFIELD Cooper NP, 1997, J NEUROPHYSIOL, V78, P261 Cosentino S., 2012, P 11 INT C INF SCI S, P666 Dau T, 1996, J ACOUST SOC AM, V99, P3615, DOI 10.1121/1.414959 Dudley H, 1939, J ACOUST SOC AM, V11, P169, DOI 10.1121/1.1916020 Elhilali M, 2003, SPEECH COMMUN, V41, P331, DOI 10.1016/S0167-6393(02)00134-6 Fogerty D, 2011, J ACOUST SOC AM, V129, P977, DOI 10.1121/1.3531954 Glista D, 2009, INT J AUDIOL, V48, P632, DOI 10.1080/14992020902971349 Goldsworthy RL, 2004, J ACOUST SOC AM, V116, P3679, DOI 10.1121/1.1804628 Gomez AM, 2012, SPEECH COMMUN, V54, P503, DOI 10.1016/j.specom.2011.11.001 GORGA MP, 1981, J ACOUST SOC AM, V70, P1310, DOI 10.1121/1.387145 Greenberg S, 2004, IEICE T INF SYST, VE87D, P1059 HARRIS DM, 1979, J NEUROPHYSIOL, V42, P1083 Hicks ML, 1999, J ACOUST SOC AM, V105, P326, DOI 10.1121/1.424526 Hines A, 2010, SPEECH COMMUN, V52, P736, DOI 10.1016/j.specom.2010.04.006 HOHMANN V, 1995, J ACOUST SOC AM, V97, P1191, DOI 10.1121/1.413092 Holube I, 1996, J ACOUST SOC AM, V100, P1703, DOI 10.1121/1.417354 Hopkins K, 2011, J ACOUST SOC AM, V130, P334, DOI 10.1121/1.3585848 Hopkins K, 2008, J ACOUST SOC AM, V123, P1140, DOI 10.1121/1.2824018 HOUTGAST T, 1971, ACUSTICA, V25, P355 HUMES LE, 1986, J SPEECH HEAR RES, V29, P447 Imai S., 1983, P ICASSP, V8, P93 Immerseel LV, 2003, ACOUST RES LETT ONL, V4, P59 KATES JM, 1991, IEEE T SIGNAL PROCES, V39, P2573, DOI 10.1109/78.107409 KATES JM, 1992, J ACOUST SOC AM, V91, 
P2236, DOI 10.1121/1.403657 Kates JM, 2010, J AUDIO ENG SOC, V58, P363 Kates JM, 2005, J ACOUST SOC AM, V117, P2224, DOI 10.1121/1.1862575 Kates J.M., 2013, P M AC POMA JUN 2 7, V19 Kates J.M., 2008, DIGITAL HEARING AIDS, P1 Kiessling J., 1993, J SPEECH LANG PAT S1, V1, P39 Kjems U, 2009, J ACOUST SOC AM, V126, P1415, DOI 10.1121/1.3179673 Li N, 2008, J ACOUST SOC AM, V123, P1673, DOI 10.1121/1.2832617 Ludvigsen C, 1990, Acta Otolaryngol Suppl, V469, P190 MCAULAY RJ, 1986, IEEE T ACOUST SPEECH, V34, P744, DOI 10.1109/TASSP.1986.1164910 McDermott HJ, 2011, PLOS ONE, V6, DOI 10.1371/journal.pone.0022358 MOORE BCJ, 1983, J ACOUST SOC AM, V74, P750, DOI 10.1121/1.389861 Moore BCJ, 2004, HEARING RES, V188, P70, DOI 10.1016/S0378-5955(03)00347-2 Moore BCJ, 1999, J ACOUST SOC AM, V106, P2761, DOI 10.1121/1.428133 Ng EHN, 2013, INT J AUDIOL, V52, P433, DOI 10.3109/14992027.2013.776181 NILSSON M, 1994, J ACOUST SOC AM, V95, P1085, DOI 10.1121/1.408469 NOSSAIR ZB, 1991, J ACOUST SOC AM, V89, P2978, DOI 10.1121/1.400735 PATTERSON RD, 1995, J ACOUST SOC AM, V98, P1890, DOI 10.1121/1.414456 PAVLOVIC CV, 1986, J ACOUST SOC AM, V80, P50, DOI 10.1121/1.394082 Payton K., 2008, P AC 2008 PAR, P634 PAYTON KL, 1994, J ACOUST SOC AM, V95, P1581, DOI 10.1121/1.408545 Plack CJ, 2000, J ACOUST SOC AM, V107, P501, DOI 10.1121/1.428318 QUATIERI TF, 1986, IEEE T ACOUST SPEECH, V34, P1449, DOI 10.1109/TASSP.1986.1164985 Rosenthal S., 1969, IEEE T AUDIO ELECTRO, V17, P227 SACHS MB, 1974, J ACOUST SOC AM, V56, P1835, DOI 10.1121/1.1903521 SHANNON RV, 1995, SCIENCE, V270, P303, DOI 10.1126/science.270.5234.303 SHAW JC, 1981, J MED ENG TECHNOL, V5, P279, DOI 10.3109/03091908109009362 Simpson A, 2005, INT J AUDIOL, V44, P281, DOI 10.1080/14992020500060636 Slaney M., 1993, 35 APPL COMP LIB Smith ZM, 2002, NATURE, V416, P87, DOI 10.1038/416087a Souza P, 2009, J ACOUST SOC AM, V126, P792, DOI 10.1121/1.3158835 Souza PE, 2013, J SPEECH LANG HEAR R, V56, P1349, DOI 10.1044/1092-4388(2013/12-0151) STEENEKEN HJM, 1980, J ACOUST SOC AM, V67, P318, DOI 10.1121/1.384464 STEIGER JH, 1980, PSYCHOL BULL, V87, P245, DOI 10.1037//0033-2909.87.2.245 Stone MA, 2008, J ACOUST SOC AM, V124, P2272, DOI 10.1121/1.2968678 Suzuki Y, 2004, J ACOUST SOC AM, V116, P918, DOI 10.1121/1.1763601 Taal CH, 2011, J ACOUST SOC AM, V130, P3013, DOI 10.1121/1.3641373 Taal CH, 2011, IEEE T AUDIO SPEECH, V19, P2125, DOI 10.1109/TASL.2011.2114881 Wang DL, 2008, J ACOUST SOC AM, V124, P2303, DOI 10.1121/1.2967865 WILLIAMS EJ, 1959, J ROY STAT SOC B, V21, P396 Wojtczak M, 2012, J ACOUST SOC AM, V131, P363, DOI 10.1121/1.3665995 YATES GK, 1990, HEARING RES, V45, P203, DOI 10.1016/0378-5955(90)90121-5 ZAHORIAN SA, 1993, J ACOUST SOC AM, V94, P1966, DOI 10.1121/1.407520 ZAHORIAN SA, 1981, J ACOUST SOC AM, V69, P832, DOI 10.1121/1.385539 Zhang XD, 2001, J ACOUST SOC AM, V109, P648, DOI 10.1121/1.1336503 Zilany MSA, 2006, J ACOUST SOC AM, V120, P1446, DOI 10.1121/1.2225512 NR 82 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD NOV-DEC PY 2014 VL 65 BP 75 EP 93 DI 10.1016/j.specom.2014.06.002 PG 19 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA AP2KS UT WOS:000341901700007 ER PT J AU Salah-Eddine, C Merouane, B AF Salah-Eddine, Cheraitia Merouane, Bouzid TI Robust coding of wideband speech immittance spectral frequencies SO SPEECH COMMUNICATION LA English DT Article DE SSVQ quantization; Source-channel coding; Robust speech coding; ISF parameters; Wideband speech coder ID CODED VECTOR QUANTIZATION; LSF-PARAMETERS; NOISY CHANNELS; LPC PARAMETERS; QUANTIZERS AB In this paper, we propose a reduced-complexity stochastic joint source-channel coding system developed for efficient and robust coding of wideband speech ISF (Immittance Spectral Frequency) parameters. Initially, the aim of this encoding system was to achieve a transparent quantization of ISF parameters for ideal transmission over a noiseless channel. It was designed based on the switched split vector quantization (SSVQ) technique and called the "ISF-SSVQ coder". After that, we turned to improving the robustness of the ISF-SSVQ coder for transmissions over a noisy channel. To implicitly protect our ISF coders, we developed a stochastic joint source-channel coding system based on a reduced-complexity version of the channel-optimized SSVQ technique. Simulation results show that our new encoding system, called the ISF-SCOSSVQ coder, provides good implicit protection of the ISF parameters. The ISF-SCOSSVQ coder was further used to encode the ISF parameters of the Adaptive Multi-rate Wideband (AMR-WB, ITU-T G.722.2) speech coder operating over a noisy channel. We show that the proposed ISF-SCOSSVQ coder contributes significantly to improving AMR-WB performance by ensuring robust coding of its ISF parameters. (C) 2014 Elsevier B.V. All rights reserved. C1 [Salah-Eddine, Cheraitia; Merouane, Bouzid] USTHB, Elect Fac, Speech Commun & Signal Proc Lab, Algiers 16111, Algeria. RP Salah-Eddine, C (reprint author), USTHB, Elect Fac, Speech Commun & Signal Proc Lab, POB 32, Algiers 16111, Algeria. EM cher.salah@yahoo.fr; mbouzid@usthb.dz CR Azami S.B.Z., 1996, P CNES WORKSH DAT CO Bessette B, 2002, IEEE T SPEECH AUDI P, V10, P620, DOI 10.1109/TSA.2002.804299 BISTRITZ Y, 1989, IEEE T INFORM THEORY, V35, P675, DOI 10.1109/18.30994 Bistritz Y., 1993, P IEEE INT C AC SPEE, V2, P9 Biundo G., 2002, P 3 COST 276 WORKSH, P114 Bouzid M, 2005, SIGNAL PROCESS, V85, P1675, DOI 10.1016/j.sigpro.2005.03.009 Bouzid M., 2012, P 11 ED INT C INF SC, P1045 Bouzid M, 2007, ANN TELECOMMUN, V62, P426 Chen JH, 1996, INT CONF ACOUST SPEE, P275 Chiang D.M., 1997, IEEE T CIRCUITS SYST, V7, P604 Cordoba J.L.P., 2005, P INTERSPEECH 2005 L, P2745 CUPERMAN V, 1985, IEEE T COMMUN, V33, P685, DOI 10.1109/TCOM.1985.1096372 DARPA TIMIT, 1993, AC PHON CONT SPEECH Duhamel P., 1997, P C GRETSI GREN FRAN, P699 FARVARDIN N, 1990, IEEE T INFORM THEORY, V36, P799, DOI 10.1109/18.53739 FARVARDIN N, 1991, IEEE T INFORM THEORY, V37, P155, DOI 10.1109/18.61130 Gersho A., 1992, VECTOR QUANTIZATION Guibe G, 2001, EUR T TELECOMMUN, V12, P535, DOI 10.1002/ett.4460120609 HAMMING RW, 1950, AT&T TECH J, V29, P147 Hussain Y., 1992, P IEEE INT C AC SPEE, V2, P133 Itakura F., 1975, J ACOUST SOC AM, V57, P535 Katsavounidis I, 1994, IEEE SIGNAL PROC LET, V1, P144, DOI 10.1109/97.329844 Kleijn W.
B., 1995, SPEECH CODING SYNTHE Knagenhjelm P., 1993, THESIS CHALMERS U TE Kovesi B., 1997, P GRETSI 97 GREN FRA, P1065 Krishnan V, 2004, IEEE T SPEECH AUDI P, V12, P1, DOI 10.1109/TSA.2003.819945 LINDE Y, 1980, IEEE T COMMUN, V28, P84, DOI 10.1109/TCOM.1980.1094577 McLoughlin IV, 2008, SIGNAL PROCESS, V88, P448, DOI 10.1016/j.sigpro.2007.09.003 MILLER D, 1994, IEEE T COMMUN, V42, P347, DOI 10.1109/TCOMM.1994.577056 Moreira Jorge C., 2006, ESSENTIALS ERROR CON Paliwal KK, 1993, IEEE T SPEECH AUDI P, V1, P3, DOI 10.1109/89.221363 POTTER LC, 1995, IEEE T COMMUN, V43, P804, DOI 10.1109/26.380112 Rabiner L.R., 1978, DIGITAL PROCESSING S So S., 2004, P INT C SPOK LANG PR So S, 2007, DIGIT SIGNAL PROCESS, V17, P138, DOI 10.1016/j.dsp.2005.08.005 Xiang Y., 2010, IEEE T INFORM THEORY, V56, P5769 ZEGER K, 1990, IEEE T COMMUN, V38, P2147, DOI 10.1109/26.64657 NR 37 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV-DEC PY 2014 VL 65 BP 94 EP 108 DI 10.1016/j.specom.2014.07.001 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA AP2KS UT WOS:000341901700008 ER PT J AU Patel, R Kember, H Natale, S AF Patel, Rupal Kember, Heather Natale, Sara TI Feasibility of augmenting text with visual prosodic cues to enhance oral reading SO SPEECH COMMUNICATION LA English DT Article DE Oral reading; Prosody; Reading software; Fluency; Children ID VOCAL FUNDAMENTAL-FREQUENCY; COMPREHENSION; CHILDREN; FLUENCY; INTONATION; SPEECH; READERS; CRIES; LANGUAGE; STRESS AB Reading fluency has traditionally focused on speed and accuracy yet recent reports suggest that expressive oral reading is an important component that has been largely overlooked. The current study assessed the impact of augmenting text with visual prosodic cues to improve expressive reading in beginning readers. Customized reading software was developed to present text augmented with prosodic cues to convey changes in pitch, duration and/or intensity. Prosodic modulation was derived from the recordings of a fluent adult model and rendered as a set of visual cues that could be presented in isolation or in combination. To establish baseline measures, eight children aged 7-8 first read a five-chapter story in standard text format. In the subsequent three sessions, participants were trained to use each augmented text cue with the guidance of an auditory model. They also had the opportunity to practice reading aloud in each cue condition. At the post-training session, participants re-recorded the baseline story with each chapter read in one of the different cue conditions (standard, pitch, duration, intensity and combination). Post-training and baseline recordings were acoustically analyzed to assess changes in reading expressivity. Despite large individual differences in how each participant implemented the prosodic cues, as a group, there were notable improvements in marking pitch accents and elongating word duration to convey linguistic contrasts. In fact, even after only three training sessions, participants appeared to have generalized implementation of pitch and word duration cues when reading standard text at post-training. In contrast, while participants manipulated pause duration when provided with explicit visual cues, they did not transfer these cues to standard text at post-training. 
These findings suggest that beginning readers could benefit from explicit visual prosodic cues and that even limited exposure may be sufficient to learn and generalize skills. Further discussion focuses on the implications of this work on struggling readers and second language learners. (C) 2014 Elsevier B.V. All rights reserved. C1 [Patel, Rupal; Kember, Heather; Natale, Sara] Northeastern Univ, Dept Speech Language Pathol & Audiol, Boston, MA 02115 USA. [Patel, Rupal] Northeastern Univ, Coll Comp & Informat Sci, Boston, MA 02115 USA. RP Patel, R (reprint author), Northeastern Univ, Dept Speech Language Pathol & Audiol, 360 Huntington Ave,Room 204 FR, Boston, MA 02115 USA. EM r.patel@neu.edu FU National Science Foundation [HCC-0915527] FX There are a number of individuals who have made significant contributions this work. We are indebted to Isabel Meirelles for her collaboration on designing the visual renderings used here, to Sheelah Sweeny for her guidance and work on developing the stories and comprehension questions, and to William Furr for his dedication to implementing a robust and user-friendly software system. We also thank our participants and their families for their time and commitment to this multi-week study. Last but not least, this material is based upon work supported by the National Science Foundation under Grant No. HCC-0915527. CR Adams M. J., 1990, BEGINNING READ THINK ALLINGTON RL, 1983, READ TEACH, V36, P556 ANDERSONHSIEH J, 1992, LANG LEARN, V42, P529, DOI 10.1111/j.1467-1770.1992.tb01043.x Beaver J. M., 2006, DEV READING ASSESSME Blevins W., 2001, BUILDING FLUENCY LES Boersma P., 2014, PRAAT DOING PHONETIC Bolinger D., 1989, INTONATION ITS USES Carlson K, 2009, LANGUAGE LINGUISTICS, V3, P1188, DOI DOI 10.1111/J.1749-818X.2009.00150.X) Christensen R., 2002, PLANE ANSWERS COMPLE Cooper W. E., 1981, FUNDAMENTAL FREQUENC COOPER WE, 1985, J ACOUST SOC AM, V77, P2142, DOI 10.1121/1.392372 Cromer W, 1970, J Educ Psychol, V61, P471, DOI 10.1037/h0030288 CRUTTENDEN A, 1985, J CHILD LANG, V12, P643 Crystal D, 1978, COMMUN COGNITION, P257 Crystal D, 1986, LANGUAGE ACQUISITION Cutler A, 1997, LANG SPEECH, V40, P141 Daane M. C., 2005, 2006469 NCES US DEP Dowhower S. L., 1991, THEOR PRACT, V30, P165, DOI DOI 10.1080/00405849109543497 EADY SJ, 1986, J ACOUST SOC AM, V80, P402, DOI 10.1121/1.394091 Eason SH, 2013, SCI STUD READ, V17, P199, DOI 10.1080/10888438.2011.652722 FRY DB, 1955, J ACOUST SOC AM, V27, P765, DOI 10.1121/1.1908022 Gilbert HR, 1996, INT J PEDIATR OTORHI, V34, P237, DOI 10.1016/0165-5876(95)01273-7 Grigos MI, 2007, J SPEECH LANG HEAR R, V50, P119, DOI 10.1044/1092-4388(2007/010) Jenkins JR, 2003, J EDUC PSYCHOL, V95, P719, DOI 10.1037/0022-0663.95.4.719 Lehiste I., 1970, SUPRASEGMENTALS LeVasseur VM, 2006, APPL PSYCHOLINGUIST, V27, P423, DOI 10.1017/S0142716406060346 Lind K, 2002, INT J PEDIATR OTORHI, V64, P97, DOI 10.1016/S0165-5876(02)00024-1 Local J., 1980, SOCIOLINGUISTIC VARI Miller J, 2006, J EDUC PSYCHOL, V98, P839, DOI 10.1037/0022-0663.98.4.839 Morgan J.L., 1995, SIGNAL SYNTAX BOOTST Neddenriep CE, 2011, PSYCHOL SCHOOLS, V48, P14, DOI 10.1002/pits.20542 OSHEA LJ, 1983, READ RES QUART, V18, P458, DOI 10.2307/747380 Patel R., 2011, ACM CHI C HUM FACT C, P3203 Patel R, 2006, SPEECH COMMUN, V48, P1308, DOI 10.1016/j.specom.2006.06.007 Patel R, 2011, SPEECH COMMUN, V53, P431, DOI 10.1016/j.specom.2010.11.007 Protopapas A, 1997, J ACOUST SOC AM, V102, P3723, DOI 10.1121/1.420403 Schreiber P. 
A., 1987, COMPREHENDING ORAL W, P243 ShattuckHufnagel S, 1996, J PSYCHOLINGUIST RES, V25, P193, DOI 10.1007/BF01708572 SNOW D, 1994, J SPEECH HEAR RES, V37, P831 Snow D, 1998, J SPEECH LANG HEAR R, V41, P1158 Stathopoulos ET, 1997, J SPEECH LANG HEAR R, V40, P595 Therrien WJ, 2004, REM SPEC EDUC, V25, P252, DOI 10.1177/07419325040250040801 TINGLEY BM, 1975, CHILD DEV, V46, P186 Walker L.L., 2005, BEHAV, V14, P21 Wells B, 2004, J CHILD LANG, V31, P749, DOI 10.1017/S030500090400652X Wermke K, 2002, MED ENG PHYS, V24, P501, DOI 10.1016/S1350-4533(02)00061-9 Xu Y, 1999, J PHONETICS, V27, P55, DOI 10.1006/jpho.1999.0086 Xu Y., 2011, J SPEECH SCI, V1, P85 NR 48 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV-DEC PY 2014 VL 65 BP 109 EP 118 DI 10.1016/j.specom.2014.07.002 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA AP2KS UT WOS:000341901700009 ER PT J AU Schoenenberg, K Raake, A Egger, S Schatz, R AF Schoenenberg, Katrin Raake, Alexander Egger, Sebastian Schatz, Raimund TI On interaction behaviour in telephone conversations under transmission delay SO SPEECH COMMUNICATION LA English DT Article DE VoIP; Delay; Interactivity; Conversation analysis; Interaction rhythm; Conversational quality ID ON-OFF PATTERNS; SPEECH QUALITY AB This work analyses the interaction behaviour of two interlocutors communicating over telephone connections affected by echo-free delay, for conversation tasks yielding different speed and structure. Based on a series of conversation tests, it is shown that transmission delay in a telephone circuit does not only result in a longer time until information is exchanged between the interlocutors, but also alters various characteristics of the conversational course. It was observed that with increasing transmission delay, the realities perceived by the interlocutors increasingly diverge. As a measure of utterance pace, a new conversation surface structure metric, the so-called utterance rhythm (URY), is introduced. Using surface-structure analysis of conversations from different conversation tests, it is shown that peoples' utterance rhythm stays rather constant in close-to-natural conversations, but is considerably affected for scenarios requiring fast interaction and a clear answering structure. At the same time, the quality of the connection is perceived less critically in close-to-natural than in tasks requiring fast interaction, that is, interactive tasks leading to a delay-dependant utterance rhythm. Hence, the conclusion can be drawn that the degree of necessary adaption of the utterance rhythm to a certain delay condition co-determines the extent to which transmission delay impacts the perceived integral quality of a call. (C) 2014 Elsevier B.V. All rights reserved. C1 [Schoenenberg, Katrin; Raake, Alexander] Tech Univ Berlin, T Labs, D-10587 Berlin, Germany. [Egger, Sebastian; Schatz, Raimund] Telecommun Res Ctr Vienna FTW, A-1220 Vienna, Austria. RP Schoenenberg, K (reprint author), Tech Univ Berlin, T Labs, Ernst Reuter Pl 7, D-10587 Berlin, Germany. 
EM katrin.schoenenberg@telekom.de; alexander.raake@telekom.de; egger@ftw.at; raimund.schatz@ftw.at CR BRADY PT, 1971, AT&T TECH J, V50, P115 BRADY PT, 1968, AT&T TECH J, V47, P73 BRADY PT, 1965, AT&T TECH J, V44, P1 Egger S., 2012, P INT C COMM ICC, P1320 Egger S, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 1-2, P1321 Glass L, 2001, NATURE, V410, P277, DOI 10.1038/35065745 Gueguin M, 2008, EURASIP J ADV SIG PR, DOI 10.1155/2008/185248 Hammer F., 2004, P INT C SPOK LANG PR, P1741 Hoeldtke K., 2011, P INT C COMM ICC, P1 KITAWAKI N, 1991, IEEE J SEL AREA COMM, V9, P586, DOI 10.1109/49.81952 KRAUSS RM, 1967, J ACOUST SOC AM, V41, P286, DOI 10.1121/1.1910338 Lakaniemi A., 2001, INT C COMM ICC, V3, P748, DOI 10.1109/ICC.2001.937339 Luengo I., 2010, P LREC C VALL MALT M, P1539 Moller S, 2011, IEEE SIGNAL PROC MAG, V28, P18, DOI 10.1109/MSP.2011.942469 Moller S., 2000, ASSESSMENT PREDICTIO Raake A., 2006, SPEECH QUALITY VOIP Richards D.L., 1962, P INT C SAT COMM, P955 RIESZ RR, 1963, AT&T TECH J, V42, P2919 Sat B., 2007, P IEEE INT S MULT TA, P3 Schoenenberg K, 2014, INT J HUM-COMPUT ST, V72, P477, DOI 10.1016/j.ijhcs.2014.02.004 Thomsen G, 2000, IEEE SPECTRUM, V37, P52, DOI 10.1109/6.842135 Trevarthen C., 1993, PERCEIVED SELF ECOLO, P121 Wah Benjamin W, 2009, Journal of Multimedia, V4 NR 23 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP-OCT PY 2014 VL 63-64 BP 1 EP 14 DI 10.1016/j.specom.2014.04.005 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA AJ7NY UT WOS:000337884600001 ER PT J AU De Armas, W Mamun, KA Chau, T AF De Armas, Winston Mamun, Khondaker A. Chau, Tom TI Vocal frequency estimation and voicing state prediction with surface EMG pattern recognition SO SPEECH COMMUNICATION LA English DT Article DE Fundamental frequency; EMG; Electrolarynx; Pitch modulation; Hands free; Voicing state ID ELECTROMYOGRAPHIC ACTIVITY; SPEECH; REHABILITATION; CLASSIFICATION; LARYNGECTOMY; MACHINE; SIGNAL; STRAP; HAND AB The majority of laryngectomees use the electrolarynx as their primary mode of verbal communication after total laryngectomy surgery. However, the archetypal electrolarynx suffers from a monotonous tone and the inconvenience of requiring manual control. This paper presents the potential of pattern recognition to support electrolarynx use by predicting fundamental frequency (F0) and voicing state (VS) from surface EMG of the infrahyoid and suprahyoid muscles, as well as from a respiratory trace. In this study, surface EMG signals from the infrahyoid and suprahyoid muscle groups and respiratory trace were collected from 10 able-bodied, adult males (18- 60 years old). Participants performed three kinds of vocal tasks tones, legatos and phrases. Signal features were extracted from the EMG and respiratory trace, and a Support Vector Machine (SVM) classifier with radial basis function kernels was employed to predict F0 and voicing state. An average root mean squared error of 2.81 +/- 0.6 semitones was achieved for the estimation of vocal frequency in the range of 90-360 Hz. An average cross-validation (CV) accuracy of 78.05 +/- 6.3% was achieved for the prediction of voicing state from EMG and 65.24 +/- 7.8% from the respiratory trace. 
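The De Armas et al. record above predicts F0 and voicing state from EMG features with RBF-kernel support vector machines, reporting F0 error in semitones and voicing accuracy from cross-validation. The sketch below shows a generic pipeline of that shape using scikit-learn; the feature matrix, hyperparameters and data split are placeholders rather than the study's setup.

```python
# Generic sketch of the modelling setup described above, not the study's code.
# X is assumed to be a precomputed per-frame feature matrix (EMG / respiration
# features), f0_hz a per-frame F0 target (> 0, voiced frames only), and voiced
# a binary voicing label; hyperparameters are illustrative.
import numpy as np
from sklearn.svm import SVR, SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score, train_test_split

def semitones(f_hz, ref_hz=90.0):
    return 12.0 * np.log2(np.asarray(f_hz, dtype=float) / ref_hz)

def fit_and_evaluate(X, f0_hz, voiced):
    # F0 regression in the semitone domain, so the error is reported in semitones.
    X_tr, X_te, y_tr, y_te = train_test_split(X, semitones(f0_hz),
                                              test_size=0.2, random_state=0)
    f0_model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, gamma="scale"))
    f0_model.fit(X_tr, y_tr)
    rmse_semitones = float(np.sqrt(np.mean((f0_model.predict(X_te) - y_te) ** 2)))

    # Voicing-state prediction, scored by cross-validated accuracy.
    vs_model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
    vs_accuracy = float(cross_val_score(vs_model, X, voiced, cv=5).mean())
    return rmse_semitones, vs_accuracy
```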
The proposed method has the advantage of being non-invasive compared with studies that relied on intramuscular electrodes (invasive), while still maintaining an accuracy above chance. Pattern classification of neck-muscle surface EMG has merit in the prediction of fundamental frequency and voicing state during vocalization, encouraging further study of automatic pitch modulation for electrolarynges and silent speech interfaces. (C) 2014 Elsevier B.V. All rights reserved. C1 [De Armas, Winston; Mamun, Khondaker A.; Chau, Tom] Univ Toronto, Inst Biomat & Biomed Engn, Toronto, ON M5S 3G9, Canada. [Mamun, Khondaker A.; Chau, Tom] Holland Bloorview Kids Rehabil Hosp, Bloorview Res Inst, Toronto, ON M4G 1R8, Canada. RP Chau, T (reprint author), Univ Toronto, Inst Biomat & Biomed Engn, Rosebrugh Bldg,164 Coll St,Room 407, Toronto, ON M5S 3G9, Canada. EM winston.dearmas@mail.utoronto.ca; k.mamun@utoronto.ca; tom.chau@utoronto.ca CR ATKINSON JE, 1978, J ACOUST SOC AM, V63, P211, DOI 10.1121/1.381716 Bishop C.M., 2006, PATTERN RECOGN, P325 Castellini C, 2009, BIOL CYBERN, V100, P35, DOI 10.1007/s00422-008-0278-1 Chen JJ, 2005, SAR QSAR ENVIRON RES, V16, P517, DOI 10.1080/10659360500468468 Daubechies I., 1992, 10 LECT WAVELETS Goldstein EA, 2004, IEEE T BIO-MED ENG, V51, P325, DOI 10.1109/TBME.2003.820373 GRAY S, 1976, ARCH PHYS MED REHAB, V57, P140 HART J, 1981, Journal of the Acoustical Society of America, V69, P811 Hastie T, 2004, J MACH LEARN RES, V5, P1391 Hillman R E, 1998, Ann Otol Rhinol Laryngol Suppl, V172, P1 Honda K, 1999, LANG SPEECH, V42, P401 Hsu C.W., 2010, PRACTICAL GUIDE SUPP Huang H.-P., 1999, 1999 IEEE INT C ROB, V3, P2392, DOI [10.1109/ROBOT.1999.770463, DOI 10.1109/ROBOT.1999.770463] HUDGINS B, 1993, IEEE T BIO-MED ENG, V40, P82, DOI 10.1109/10.204774 Johner C., 2012, ADV AFFECTIVE PLEASU Khokhar ZO, 2010, BIOMED ENG ONLINE, V9, DOI 10.1186/1475-925X-9-41 Kubert HL, 2009, J COMMUN DISORD, V42, P211, DOI 10.1016/j.jcomdis.2008.12.002 Lee KS, 2008, IEEE T BIO-MED ENG, V55, P930, DOI 10.1109/TBME.2008.915658 Ma K., 1999, IMPROVEMENT ELECTROL Meltzner G., 2005, CONT CONSIDERATIONS Meltzner GS, 2005, J SPEECH LANG HEAR R, V48, P766, DOI 10.1044/1092-4388(2005/053) Mendenhall WM, 2002, J CLIN ONCOL, V20, P2500, DOI 10.1200/JCO.2002.07.047 Merletti R, 2001, P ANN INT IEEE EMBS, V23, P1119 Muller-putz G.R., 2008, INT J BIOELECTROMAGN Nakamura K, 2011, INT CONF ACOUST SPEE, P573 Nolan F., 2003, P P 15 INT C PHON SC Ohala J., 1969, AUT 1969 M AC SOC JA, P359 Park W, 2010, IEEE INT CONF ROBOT, P205 Reaz MBI, 2006, BIOL PROCED ONLINE, V8, P11, DOI 10.1251/bpo115 Roubeau B, 1997, ACTA OTO-LARYNGOL, V117, P459, DOI 10.3109/00016489709113421 Saikachi Y, 2009, J SPEECH LANG HEAR R, V52, P1360, DOI 10.1044/1092-4388(2009/08-0167) Scott R.N, 1984, INTRO MYOELECTRIC PR SHIPP T, 1979, J ACOUST SOC AM, V66, P678, DOI 10.1121/1.383694 Slim Y, 2010, IRBM, V31, P209, DOI 10.1016/j.irbm.2010.05.002 St-Amant Y., 1996, BIOENG C 1996 P 1996, P93 Stepp CE, 2009, IEEE T NEUR SYS REH, V17, P146, DOI 10.1109/TNSRE.2009.2017805 Sundberg J., 1973, STL QPSR, V14, P39 Talkin D., 1995, SPEECH CODING SYNTHE Uemi N., 1994, 3 IEEE INT WORKSH RO, P198 Vapnik V, 1998, NONLINEAR MODELING, P55 von Tscharner V, 2011, J ELECTROMYOGR KINES, V21, P683, DOI 10.1016/j.jelekin.2011.03.004 Watson PJ, 2009, AM J SPEECH-LANG PAT, V18, P162, DOI 10.1044/1058-0360(2008/08-0025) Yang DP, 2009, 2009 IEEE-RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS, P516, DOI 10.1109/IROS.2009.5354544 Zhao JD, 2005, IEEE INT CONF 
ROBOT, P4482 NR 44 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP-OCT PY 2014 VL 63-64 BP 15 EP 26 DI 10.1016/j.specom.2014.04.004 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA AJ7NY UT WOS:000337884600002 ER PT J AU Xia, XJ Ling, ZH Jiang, Y Dai, LR AF Xia, Xian-Jun Ling, Zhen-Hua Jiang, Yuan Dai, Li-Rong TI HMM-based unit selection speech synthesis using log likelihood ratios derived from perceptual data SO SPEECH COMMUNICATION LA English DT Article DE Speech synthesis; Unit selection; Hidden Markov model; Log likelihood ratio; Perceptual data ID HIDDEN MARKOV-MODELS; SYNTHESIS SYSTEM AB This paper presents a hidden Markov model (HMM) based unit selection speech synthesis method using log likelihood ratios (LLR) derived from perceptual data. The perceptual data is collected by judging the naturalness of each synthetic prosodic word manually. Two acoustic models which represent the natural speech and the unnatural synthetic speech are trained respectively. At synthesis time, the LLRs are derived from the estimated acoustic models and integrated into the unit selection criterion as target cost functions. The experimental results show that our proposed method can synthesize more natural speech than the conventional method using likelihood functions. Due to the inadequacy of the acoustic model estimated for the unnatural synthetic speech, utilizing the LLR-based target cost functions to rescore the pre-selection results or the N-best sequences can achieve better performance than substituting them for the original target cost functions directly. (C) 2014 Elsevier B.V. All rights reserved. C1 [Xia, Xian-Jun; Ling, Zhen-Hua; Dai, Li-Rong] Univ Sci & Technol China, Natl Engn Lab Speech & Language Informat Proc, Hefei 230027, Anhui, Peoples R China. [Jiang, Yuan] iFLYTEK Res, Hefei 230088, Anhui, Peoples R China. RP Ling, ZH (reprint author), Univ Sci & Technol China, 96 JinZhai Rd, Hefei, Anhui, Peoples R China. EM xxjpjj@mail.ustc.edu.cn; zhling@ustc.edu.cn; yuanjiang@iflytek.com; lrdai@ustc.edu.cn FU Fundamental Research Funds for the Central Universities [WK2100060005]; National Nature Science Foundation of China [61273032] FX This work is partially funded by the Fundamental Research Funds for the Central Universities (Grant No. WK2100060005) and the National Nature Science Foundation of China (Grant No. 61273032). We also gratefully acknowledge the research division of iFLYTEK Co. Ltd., Hefei, China for providing the speech database and collecting the perceptual data. CR Hirai T., 2007, 6 ISCA SPEECH SYNTH, P81 Hirai T., 2004, 5 ISCA SPEECH SYNTH, P37 Hunt AJ, 1996, INT CONF ACOUST SPEE, P373, DOI 10.1109/ICASSP.1996.541110 Kawahara H, 1999, SPEECH COMMUN, V27, P187, DOI 10.1016/S0167-6393(98)00085-5 Ling Z. 
H., 2007, BLIZZ CHALL WORKSH Ling Z.-H., 2008, BLIZZ CHALL WORKSH Ling Z.-H., 2010, ISCSLP, P144 Ling ZH, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P2034 Ling ZH, 2007, INT CONF ACOUST SPEE, P1245 Ling Z.-H., 2008, J PR NI, V21, P280 Ling ZH, 2008, INT CONF ACOUST SPEE, P3949 Lu H, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 1-2, P162 Lu H, 2011, INT CONF ACOUST SPEE, P5352 Oura K., 2009, INTERSPEECH, P1759 Qian Y, 2013, IEEE T AUDIO SPEECH, V21, P280, DOI 10.1109/TASL.2012.2221460 Qian Y., 2002, SPEECH PROSODY, P591 Sagisaka Y., 1988, ICASSP 88: 1988 International Conference on Acoustics, Speech, and Signal Processing (Cat. No.88CH2561-9), DOI 10.1109/ICASSP.1988.196677 Shinoda K., 2000, Journal of the Acoustical Society of Japan (E), V21, DOI 10.1250/ast.21.79 Song Yang, 2013, Journal of Tsinghua University (Science and Technology), V53 Strom V, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 1-2, P150 Syrdal A., 2005, INTERSPEECH 2005 LIS, P2813 Toda T., 2004, ICASSP, P657 Tokuda K, 2013, P IEEE, V101, P1234, DOI 10.1109/JPROC.2013.2251852 TOKUDA K, 1999, ACOUST SPEECH SIG PR, P229 Wang RH, 2009, CHINESE SCI BULL, V54, P1963, DOI 10.1007/s11434-009-0267-3 Wei S, 2009, SPEECH COMMUN, V51, P896, DOI 10.1016/j.specom.2009.03.004 Xia XJ, 2012, 2012 8TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING, P160 Yoshida A, 2008, INT CONF ACOUST SPEE, P4617 Yoshimura T., 1999, EUROSPEECH, P2347 Zen H, 2009, SPEECH COMMUN, V51, P1039, DOI 10.1016/j.specom.2009.04.004 NR 30 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP-OCT PY 2014 VL 63-64 BP 27 EP 37 DI 10.1016/j.specom.2014.04.002 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA AJ7NY UT WOS:000337884600003 ER PT J AU White, L AF White, Laurence TI Communicative function and prosodic form in speech timing SO SPEECH COMMUNICATION LA English DT Article DE Speech timing; Prosody; Rhythm; Prosodic structure; Speech perception ID WORD SEGMENTATION; AMERICAN ENGLISH; FUNDAMENTAL-FREQUENCY; VOWEL DURATION; TURN-TAKING; PERCEPTION; STRESS; PHRASE; BOUNDARY; DOMAIN AB Listeners can use variation in speech segment duration to interpret the structure of spoken utterances, but there is no systematic description of how speakers manipulate timing for communicative ends. Here I propose a functional approach to prosodic speech timing, with particular reference to English. The disparate findings regarding the production of timing effects are evaluated against the functional requirement that communicative durational variation should be perceivable and interpretable by the listener. In the resulting framework, prosodic structure is held to influence speech timing directly only at the heads and edges of prosodic domains, through large, consistent lengthening effects. As each such effect has a characteristic locus within its domain, speech timing cues are potentially disambiguated for the listener, even in the absence of other information. Diffuse timing effects in particular, quasi-rhythmical compensatory processes implying a relationship between structure and timing throughout the utterance are found to be weak and inconsistently observed. 
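Because the framework above locates durational cues in large, consistent lengthening at the heads and edges of prosodic domains, the relevant empirical quantity is a lengthening ratio between edge and non-edge units. A minimal sketch of how such a ratio can be computed, using invented syllable durations rather than any data from the paper:

```python
import numpy as np

def lengthening_ratio(durations_ms, is_domain_final):
    """Mean duration of domain-final units divided by mean duration of
    non-final units: a simple way to quantify localized edge lengthening
    (illustrative only; not a measure defined in the paper)."""
    d = np.asarray(durations_ms, dtype=float)
    final = np.asarray(is_domain_final, dtype=bool)
    return d[final].mean() / d[~final].mean()

# Invented syllable durations (ms) with phrase-final positions flagged.
durs  = [140, 155, 150, 230, 145, 160, 150, 250]
final = [0,   0,   0,   1,   0,   0,   0,   1]
print(round(lengthening_ratio(durs, final), 2))  # > 1 indicates final lengthening
```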
Furthermore, it is argued that articulatory and perceptual constraints make shortening processes less useful as structural cues, and they must be regarded as peripheral, at best, in a parsimonious and functionally-informed account. (C) 2014 Elsevier B.V. All rights reserved. C1 Univ Plymouth, Sch Psychol, Plymouth PL4 8AA, Devon, England. RP White, L (reprint author), Univ Plymouth, Sch Psychol, Plymouth PL4 8AA, Devon, England. EM laurence.white@plymouth.ac.uk CR Abercrombie D, 1967, ELEMENTS GEN PHONETI Albin DD, 1996, INFANT BEHAV DEV, V19, P401, DOI 10.1016/S0163-6383(96)90002-8 Arvaniti A., 2013, LAB PHONOLOGY, V4, P7, DOI 10.1515/lp-2013-0002 Arvaniti A, 2009, PHONETICA, V66, P46, DOI 10.1159/000208930 BEACH CM, 1991, J MEM LANG, V30, P644, DOI 10.1016/0749-596X(91)90030-N Beckman M. E., 1990, PAPERS LABORATORY PH, P152 Beckman M. E., 1992, SPEECH PERCEPTION PR, P457 Beckman M.E., 1992, SPEECH PERCEPTION PR, P356 BERKOVITS R, 1994, LANG SPEECH, V37, P237 Bolinger Dwight L., 1965, FORMS ENGLISH ACCENT Brown W, 1911, PSYCHOL REV, V18, P336, DOI 10.1037/h0074259 Bye Patrik, 1997, ESTONIAN PROSODY PAP, P36 Byrd D, 2003, J PHONETICS, V31, P149, DOI 10.1016/S0095-4470(02)00085-2 Byrd D, 2005, J ACOUST SOC AM, V118, P3860, DOI 10.1121/1.2130950 Byrd D, 2008, J INT PHON ASSOC, V38, P187, DOI 10.1017/S0025100308003460 Cambier-Langeveld T., 2000, THESIS U AMSTERDAM CAMPBELL WN, 1991, J PHONETICS, V19, P37 Cho TH, 2001, J PHONETICS, V29, P155, DOI 10.1006/jpho.2001.0131 Cho TH, 2007, J PHONETICS, V35, P210, DOI 10.1016/j.wocn.2006.03.003 Chomsky N., 1968, SOUND PATTERN ENGLIS Christophe A, 1996, LINGUIST REV, V13, P383, DOI 10.1515/tlir.1996.13.3-4.383 Classe A, 1939, RHYTHM ENGLISH PROSE Couper-Kuhlen E., 1993, ENGLISH SPEECH RHYTH Couper-Kuhlen E., 1986, INTRO ENGLISH PROSOD Cumming R, 2011, J PHONETICS, V39, P375, DOI 10.1016/j.wocn.2011.01.004 Cummins F, 1999, J ACOUST SOC AM, V105, P476, DOI 10.1121/1.424576 Cummins F, 2009, J PHONETICS, V37, P16, DOI 10.1016/j.wocn.2008.08.003 Cummins F, 2012, FRONT PSYCHOL, V3, DOI 10.3389/fpsyg.2012.00364 Cummins F, 2011, FRONT HUM NEUROSCI, V5, DOI 10.3389/fnhum.2011.00170 Cummins F, 1998, J PHONETICS, V26, P145, DOI 10.1006/jpho.1998.0070 Cutler A., 1990, PAPERS LAB PHONOLOGY, P208 DAUER RM, 1983, J PHONETICS, V11, P51 Davis MH, 2002, J EXP PSYCHOL HUMAN, V28, P218, DOI 10.1037//0096-1523.28.1.218 Dilley LC, 2010, PSYCHOL SCI, V21, P1664, DOI 10.1177/0956797610384743 Dimitrova S, 2012, J PHONETICS, V40, P403, DOI 10.1016/j.wocn.2012.02.008 EDWARDS J, 1991, J ACOUST SOC AM, V89, P369, DOI 10.1121/1.400674 Fletcher J., 2010, HDB PHONETIC SCI, P523 Fougeron C, 1997, J ACOUST SOC AM, V101, P3728, DOI 10.1121/1.418332 FOURAKIS M, 1988, LANG SPEECH, V31, P283 FOWLER CA, 1981, PHONETICA, V38, P35 Fowler C.A., 1990, PAPERS LAB PHONOLOGY, P201 Frota S., 2007, SEGMENTAL PROSODIC I, P131 FRY DB, 1955, J ACOUST SOC AM, V27, P765, DOI 10.1121/1.1908022 Gaitenby J.H., 1965, SR2 HASK LAB GOW DW, 1995, J EXP PSYCHOL HUMAN, V21, P344, DOI 10.1037//0096-1523.21.2.344 Gussenhoven C., 2002, P SPEECH PROS AIX EN JONES MR, 1989, PSYCHOL REV, V96, P459, DOI 10.1037//0033-295X.96.3.459 Keating P., 2003, PAPERS LAB PHONOLOGY, VVI, P145 Kim E, 2013, ATTEN PERCEPT PSYCHO, V75, P1547, DOI 10.3758/s13414-013-0490-5 Kim H., 2005, INT 2005 P 9 EUR C S, P2365 KLATT D, 1974, J SPEECH HEAR RES, V17, P51 KLATT DH, 1976, J ACOUST SOC AM, V59, P1208, DOI 10.1121/1.380986 Klatt D.H., 1975, J PHONETICS, V3, P129 Klatt D.H., 1975, STRUCTURE PROCESS SP, P69 Knight S., 2013, THESIS U 
CAMBRIDGE Kochanski G, 2005, J ACOUST SOC AM, V118, P1038, DOI 10.1121/1.1923349 Kohler K.J., 2003, P 15 INT C PHON SCI, P7 Ladd D. R., 1996, INTONATIONAL PHONOLO Lee H, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P1193 LEHISTE I, 1973, J ACOUST SOC AM, V54, P1228, DOI 10.1121/1.1914379 Lehiste I., 1975, PHONOLOGICA 1972, P115 Lehiste I., 1977, J PHONETICS, V5, P253 LEHISTE I, 1972, J ACOUST SOC AM, V51, P2018, DOI 10.1121/1.1913062 Lindblom B, 1996, J ACOUST SOC AM, V99, P1683, DOI 10.1121/1.414691 LINDBLOM B, 1990, NATO ADV SCI I D-BEH, V55, P403 MARSLENWILSON WD, 1992, Q J EXP PSYCHOL-A, V45, P73 Mattys SL, 2000, PERCEPT PSYCHOPHYS, V62, P253, DOI 10.3758/BF03205547 Mattys SL, 2005, J EXP PSYCHOL GEN, V134, P477, DOI 10.1037/0096-3445.134.4.477 MORTON J, 1976, PSYCHOL REV, V83, P405, DOI 10.1037//0033-295X.83.5.405 NAKATANI LH, 1981, PHONETICA, V38, P84 Nespor M., 1986, PROSODIC PHONOLOGY Nolan F., P ROY SOC B Nolan F, 2009, PHONETICA, V66, P64, DOI 10.1159/000208931 O'Dell M.L., 1999, P 14 ICPHS SAN FRANC, P1075 O'Dell M.L., 2009, NORD PROS P 10 C HEL, P179 Oller D. K., 1973, J ACOUST SOC AM, V54, P1235 Ortega-Llebaria M., 2007, SEGMENTAL PROSODIC I, P155 FANT G, 1991, J PHONETICS, V19, P351 PISONI DB, 1976, J ACOUST SOC AM, V59, pS39, DOI 10.1121/1.2002669 POINTON GE, 1980, J PHONETICS, V8, P293 Port RF, 2003, J PHONETICS, V31, P599, DOI 10.1016/j.wocn.2003.08.001 PORT RF, 1981, J ACOUST SOC AM, V69, P262, DOI 10.1121/1.385347 PRICE PJ, 1991, J ACOUST SOC AM, V90, P2956, DOI 10.1121/1.401770 Prieto P, 2012, SPEECH COMMUN, V54, P681, DOI 10.1016/j.specom.2011.12.001 QUENE H, 1992, J PHONETICS, V20, P331 RAKERD B, 1987, PHONETICA, V44, P147 RAPHAEL LJ, 1972, J ACOUST SOC AM, V51, P1296, DOI 10.1121/1.1912974 Reinisch E, 2011, J EXP PSYCHOL HUMAN, V37, P978, DOI 10.1037/a0021923 Remijsen B, 2008, J PHONETICS, V36, P318, DOI 10.1016/j.wocn.2007.09.002 Roach P., 1982, LINGUISTIC CONTROVER, P73 Saffran JR, 1996, J MEM LANG, V35, P606, DOI 10.1006/jmla.1996.0032 Salverda AP, 2003, COGNITION, V90, P51, DOI 10.1016/S0010-0277(03)00139-2 SCOTT DR, 1982, J ACOUST SOC AM, V71, P996, DOI 10.1121/1.387581 SCOTT DR, 1985, J PHONETICS, V13, P155 Scott SK, 2009, NAT REV NEUROSCI, V10, P295, DOI 10.1038/nrn2603 Selkirk E, 1996, SIGNAL TO SYNTAX: BOOTSTRAPPING FROM SPEECH TO GRAMMAR IN EARLY ACQUISITION, P187 ShattuckHufnagel S, 1996, J PSYCHOLINGUIST RES, V25, P193, DOI 10.1007/BF01708572 Shen Y., 1962, STUDIES LINGUISTICS Sluijter A.M.C, 1995, THESIS U LEIDEN SNOW D, 1994, J SPEECH HEAR RES, V37, P831 Stivers T, 2009, P NATL ACAD SCI USA, V106, P10587, DOI 10.1073/pnas.0903616106 Suomi K, 2007, J PHONETICS, V35, P40, DOI 10.1016/j.wocn.2005.12.001 Suomi K, 2009, J PHONETICS, V37, P397, DOI 10.1016/j.wocn.2009.07.003 Suomi K, 2013, J PHONETICS, V41, P1, DOI 10.1016/j.wocn.2012.09.001 Tabain M, 2003, J ACOUST SOC AM, V113, P2834, DOI 10.1121/1.1564013 Tagliapietra L, 2010, J MEM LANG, V63, P306, DOI 10.1016/j.jml.2010.05.001 Turk AE, 1997, J PHONETICS, V25, P25, DOI 10.1006/jpho.1996.0032 Turk AE, 2007, J PHONETICS, V35, P445, DOI 10.1016/j.wocn.2006.12.001 Turk AE, 2000, J PHONETICS, V28, P397, DOI 10.1006/jpho.2000.0123 Turk AE, 1999, J PHONETICS, V27, P171, DOI 10.1006/jpho.1999.0093 van Santen J.P.H, 1997, COMPUTING PROSODY CO, P225 VANLANCKER D, 1988, J PHONETICS, V16, P339 VANSANTEN JPH, 1992, SPEECH COMMUN, V11, P513, DOI 10.1016/0167-6393(92)90027-5 White L., 2014, P SPEECH PROS DUBL White L., 2007, CURRENT ISSUES LINGU, P237 White L, 
2012, J MEM LANG, V66, P665, DOI 10.1016/j.jml.2011.12.010 White L, 2010, J PHONETICS, V38, P459, DOI 10.1016/j.wocn.2010.05.002 White L, 2007, J PHONETICS, V35, P501, DOI 10.1016/j.wocn.2007.02.003 White L., LANG LEARN IN PRESS White L., 2009, PHONETICS PHONOLOGY, P137 White Laurence, 2002, THESIS U EDINBURGH WIGHTMAN CW, 1992, J ACOUST SOC AM, V91, P1707, DOI 10.1121/1.402450 Wilson M, 2005, PSYCHON B REV, V12, P957, DOI 10.3758/BF03206432 Xu Y, 2010, J PHONETICS, V38, P329, DOI 10.1016/j.wocn.2010.04.003 Xu Y, 2005, J PHONETICS, V33, P159, DOI 10.1016/j.wocn.2004.11.001 Xu Y., 2006, P SPEECH PROS DRESD NR 126 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP-OCT PY 2014 VL 63-64 BP 38 EP 54 DI 10.1016/j.specom.2014.04.003 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA AJ7NY UT WOS:000337884600004 ER PT J AU Ning, LH Shih, C Loucks, TM AF Ning, Li-Hsin Shih, Chilin Loucks, Torrey M. TI Mandarin tone learning in L2 adults: A test of perceptual and sensorimotor contributions SO SPEECH COMMUNICATION LA English DT Article DE Pitch shift; Internal model; Tone discrimination; L2 learning aptitude; Language experience classification ID PITCH FEEDBACK PERTURBATIONS; SHIFTED AUDITORY-FEEDBACK; VOICE F-0 RESPONSES; SPEECH PRODUCTION; LANGUAGE EXPERIENCE; VOCAL RESPONSES; SUSTAINED VOCALIZATION; INTERNAL-MODELS; NATIVE-LANGUAGE; LEXICAL TONE AB Adult second language learners (L2) of Mandarin have to acquire both new perceptual categories for discriminating and identifying lexical pitch variation and new sensorimotor skills to produce rapid tone changes. Perceptual learning was investigated using two perceptual tasks, musical tone discrimination and linguistic tone discrimination, which were administered to 10 naive adults (native speakers of English with no tonal language exposure), 10 L2 adults, and 9 Mandarin-speaking adults. Changes in sensorimotor skills were examined with a pitch-shift paradigm that examines rapid responses to unexpected pitch perturbations in auditory feedback. Discrimination of musical tones was correlated significantly with discrimination of Mandarin tones, with the clearest advantage (better performance) among Mandarin speakers and some advantage among L2 learners. Group differences were found in the fundamental frequency (F0) contours of responses to pitch-shift stimuli. The F0 contours of Mandarin speakers were least affected quantitatively by the amplitude and direction of pitch perturbations, suggesting more stable internal tone models, while the F0 contours of naive speakers and L2 learners were significantly altered by the perturbations. Discriminant analysis suggests that pitch-shift responses and tone discrimination predict class membership for the three groups. Discrimination of variations in tone appears to change early in L2 learning, possibly reflecting a process whereby new pitch representations are internalized. These findings indicate that tone discrimination and internal models for audio vocal control are sensitive to language experience. (C) 2014 Elsevier B.V. All rights reserved. C1 [Ning, Li-Hsin; Shih, Chilin] Univ Illinois, Dept Linguist, Urbana, IL 61801 USA. [Shih, Chilin] Univ Illinois, Dept East Asian Languages & Cultures, Urbana, IL 61801 USA. [Loucks, Torrey M.] Univ Illinois, Dept Speech & Hearing Sci, Champaign, IL 61820 USA. 
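The discriminant analysis mentioned in the abstract above can be illustrated with a linear discriminant classifier over two of the predictors it names, tone-discrimination accuracy and the magnitude of the pitch-shift response. The data below are synthetic and the two-feature choice is a simplification; only the classification step is sketched.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)

# Two illustrative predictors per participant (synthetic, not the study's data):
# tone-discrimination accuracy and pitch-shift response magnitude (cents).
def simulate(n, acc_mu, resp_mu):
    return np.column_stack([rng.normal(acc_mu, 0.05, n),
                            rng.normal(resp_mu, 8.0, n)])

X = np.vstack([simulate(10, 0.70, 55),   # naive speakers
               simulate(10, 0.80, 45),   # L2 learners
               simulate(9,  0.92, 30)])  # Mandarin speakers
y = np.array([0] * 10 + [1] * 10 + [2] * 9)

lda = LinearDiscriminantAnalysis().fit(X, y)
print("resubstitution accuracy:", lda.score(X, y))
print("predicted group for one new participant:", lda.predict([[0.85, 35.0]]))
```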
RP Ning, LH (reprint author), Univ Illinois, Dept Linguist, 4080 Foreign Language Bldg,707 S Mathews Ave, Urbana, IL 61801 USA. EM uiucning@gmail.com; cls@illinois.edu; tloucks@illinois.edu CR Bauer JJ, 2003, J ACOUST SOC AM, V114, P1048, DOI 10.1121/1.1592161 Behroozmand R., 2011, BMC NEUROSCI, V12 Burnett TA, 2002, J ACOUST SOC AM, V112, P1058, DOI 10.1121/1.1487844 Burnett TA, 1998, J ACOUST SOC AM, V103, P3153, DOI 10.1121/1.423073 Callan DE, 2004, NEUROIMAGE, V22, P1182, DOI 10.1016/j.neuroimage.2004.03.006 Chandrasekaran B, 2007, BRAIN RES, V1128, P148, DOI 10.1016/j.brainres.2006.10.064 Chang EF, 2013, P NATL ACAD SCI USA, V110, P2653, DOI [10.1073/pnas.1216827110, 10.1073/pnas.1216827110/-/DCSupplemental] Chen SH, 2007, J ACOUST SOC AM, V121, P1157, DOI 10.1121/1.2404624 Chen ZC, 2010, J ACOUST SOC AM, V128, pEL355, DOI 10.1121/1.3509124 Chen ZC, 2012, BRAIN LANG, V121, P25, DOI 10.1016/j.bandl.2012.02.004 Cooper A, 2012, J ACOUST SOC AM, V131, P4756, DOI 10.1121/1.4714355 Deutsch D., 2000, J ACOUST SOC AM, V108, P2591 Donath TM, 2002, J ACOUST SOC AM, V111, P357, DOI 10.1121/1.1424870 Eliades SJ, 2008, NATURE, V453, P1102, DOI 10.1038/nature06910 Francis AL, 2008, J PHONETICS, V36, P268, DOI 10.1016/j.wocn.2007.06.005 GANDOUR JT, 1978, LANG SPEECH, V21, P1 GANDOUR J, 1983, J PHONETICS, V11, P149 Guenther FH, 2006, J COMMUN DISORD, V39, P350, DOI 10.1016/j.jcomdis.2006.06.013 Guenther FH, 2006, BRAIN LANG, V96, P280, DOI 10.1016/j.bandl.2005.06.001 Guenther FH, 1998, PSYCHOL REV, V105, P611 GUENTHER FH, 1995, PSYCHOL REV, V102, P594 Hain TC, 2000, EXP BRAIN RES, V130, P133, DOI 10.1007/s002219900237 Hain TC, 2001, J ACOUST SOC AM, V109, P2146, DOI 10.1121/1.1366319 Halle PA, 2004, J PHONETICS, V32, P395, DOI 10.1016/S0095-4470(03)00016-0 Heinks-Maldonado TH, 2005, PSYCHOPHYSIOLOGY, V42, P180, DOI 10.1111/j.1469-8986.2005.00272.x Henthorn T, 2007, AM J MED GENET A, V143A, P102, DOI 10.1002/ajmg.a.31596 Hickok G, 2011, NEURON, V69, P407, DOI 10.1016/j.neuron.2011.01.019 Houde JF, 1998, SCIENCE, V279, P1213, DOI 10.1126/science.279.5354.1213 Houde JF, 2002, J SPEECH LANG HEAR R, V45, P295, DOI 10.1044/1092-4388(2002/023) Jones JA, 2002, J PHONETICS, V30, P303, DOI 10.1006/jpho.2001.0160 Jones JA, 2000, J ACOUST SOC AM, V108, P1246, DOI 10.1121/1.1288414 JORDAN MI, 1992, COGNITIVE SCI, V16, P307, DOI 10.1207/s15516709cog1603_1 Kawato M, 1999, CURR OPIN NEUROBIOL, V9, P718, DOI 10.1016/S0959-4388(99)00028-8 Kosling K, 2013, LANG SPEECH, V56, P529, DOI 10.1177/0023830913478914 Krishnan A, 2005, COGNITIVE BRAIN RES, V25, P161, DOI 10.1016/j.cogbrainres.2005.05.004 Lalazar H, 2008, CURR OPIN NEUROBIOL, V18, P573, DOI 10.1016/j.conb.2008.11.003 LANE H, 1971, J SPEECH HEAR RES, V14, P677 Larson CR, 2001, J ACOUST SOC AM, V110, P2845, DOI 10.1121/1.1417527 Larson CR, 2000, J ACOUST SOC AM, V107, P559, DOI 10.1121/1.428323 Liu H., 2009, J ACOUST SOC AM, V127 Liu HJ, 2007, J ACOUST SOC AM, V122, P3671, DOI 10.1121/1.2800254 Liu HJ, 2010, J ACOUST SOC AM, V128, P3739, DOI 10.1121/1.3500675 Liu HJ, 2009, J ACOUST SOC AM, V125, P2299, DOI 10.1121/1.3081523 Mandell J., 2009, ADAPTIVE PITCH TEST Mattock K, 2006, INFANCY, V10, P241, DOI 10.1207/s15327078in1003_3 Mattock K, 2008, COGNITION, V106, P1367, DOI 10.1016/j.cognition.2007.07.002 Mitsuya T, 2013, J ACOUST SOC AM, V133, P2993, DOI 10.1121/1.4795786 Sakai S., 2004, P INT C AC SPEECH SI, V1, P277 Van Lancker D, 1973, J PHONETICS, V6, P19 Wang Y, 2001, BRAIN LANG, V78, P332, DOI 10.1006/brln.2001.2474 Wang Y, 2003, J COGNITIVE NEUROSCI, V15, 
P1019, DOI 10.1162/089892903770007407 Wong PCM, 2007, APPL PSYCHOLINGUIST, V28, P565, DOI 10.1017/S0142716407070312 Wood SN, 2011, J R STAT SOC B, V73, P3, DOI 10.1111/j.1467-9868.2010.00749.x Wood SN, 2006, GEN ADDITIVE MODELS Xu Y, 2004, J ACOUST SOC AM, V116, P1168, DOI 10.1121/1.1763952 Xu YS, 2006, J ACOUST SOC AM, V120, P1063, DOI 10.1121/1.2213572 NR 56 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP-OCT PY 2014 VL 63-64 BP 55 EP 69 DI 10.1016/j.specom.2014.05.001 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA AJ7NY UT WOS:000337884600005 ER PT J AU Medabalimi, AJX Seshadri, G Bayya, Y AF Medabalimi, Anand Joseph Xavier Seshadri, Guruprasad Bayya, Yegnanarayana TI Extraction of formant bandwidths using properties of group delay functions SO SPEECH COMMUNICATION LA English DT Article DE Formant frequency; Bandwidth; Group delay function; Short segments; Closed phase; Open phase ID PREDICTION PHASE SPECTRA; LINEAR-PREDICTION AB Formant frequencies represent resonances of vocal tract system during the production of speech signals. Bandwidths associated with the formant frequencies are important parameters in analysis and synthesis of speech signals. In this paper, a method is proposed to extract the bandwidths associated with formant frequencies, by analysing short segments (2-3 ms) of speech signal. The method is based on two important properties of group delay function (GDF): (a) The GDF exhibits prominent peaks at resonant frequencies and (b) the influence of one resonant frequency on other resonances is negligible in GDF. The accuracy of the method is demonstrated for synthetic signals generated using all-pole filters. The method is evaluated by extracting bandwidths of synthetic signals in closed phase and open phase regions within a pitch period. The accuracy of the proposed method is also compared with that of two other methods, one based on linear prediction analysis of speech signals, and another based on filterbank arrays for obtaining amplitude envelopes and instantaneous frequency signals. Results indicate that the method based on the properties of GDF is suitable for accurate extraction of formant bandwidths, even from short segments of speech signal within a pitch period. (C) 2014 Elsevier B.V. All rights reserved. C1 [Medabalimi, Anand Joseph Xavier; Bayya, Yegnanarayana] Int Inst Informat Technol, Hyderabad 500032, Andhra Pradesh, India. [Seshadri, Guruprasad] TATA Consultancy Serv, Innovat Labs, Bangalore 560066, Karnataka, India. RP Medabalimi, AJX (reprint author), Int Inst Informat Technol, Hyderabad 500032, Andhra Pradesh, India. EM anandjm@research.iiit.ac.in; guruprasad.seshadri@gmail.com; yegna@iiit.ac.in CR Cohen L., 1992, P IEEE 6 SP WORKSH S, P13 Deng L, 2006, INT CONF ACOUST SPEE, P369 Fant G., 1960, ACOUSTIC THEORY SPEE GAUFFIN J, 1989, J SPEECH HEAR RES, V32, P556 MAKHOUL J, 1975, P IEEE, V63, P561, DOI 10.1109/PROC.1975.9792 Murty KSR, 2008, IEEE T AUDIO SPEECH, V16, P1602, DOI 10.1109/TASL.2008.2004526 Oppenheim A. 
V., 1975, DIGITAL SIGNAL PROCE Oppenheim A.V., 1999, DISCRETE TIME SIGNAL Oppenheim A.V., 1975, DIGIT SIGNAL PROCESS POTAMIANOS A, 1995, INT CONF ACOUST SPEE, P784, DOI 10.1109/ICASSP.1995.479811 REDDY NS, 1984, IEEE T ACOUST SPEECH, V32, P1136, DOI 10.1109/TASSP.1984.1164456 Tsiakoulis P, 2013, INT CONF ACOUST SPEE, P8032, DOI 10.1109/ICASSP.2013.6639229 Xavier M.A.J., 2006, P INTERSPEECH PITTSB, P1009 Yasojima O., 2006, P IEEE INT S SIGN PR, P589 YEGNANARAYANA B, 1978, J ACOUST SOC AM, V63, P1638, DOI 10.1121/1.381864 Zheng YL, 2003, PROCEEDINGS OF THE 2003 IEEE WORKSHOP ON STATISTICAL SIGNAL PROCESSING, P601 NR 16 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP-OCT PY 2014 VL 63-64 BP 70 EP 83 DI 10.1016/j.specom.2014.04.006 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA AJ7NY UT WOS:000337884600006 ER PT J AU Uzun, E Sencar, HT AF Uzun, Erkam Sencar, Husrev T. TI A preliminary examination technique for audio evidence to distinguish speech from non-speech using objective speech quality measures SO SPEECH COMMUNICATION LA English DT Article DE Preliminary analysis of audio evidence; Speech and non-speech discrimination; Objective speech quality assessment; Audio encoding; Audio effects; Surveillance ID CLASSIFICATION; DISCRIMINATION; NETWORKS; LIBRARY; CODECS AB Forensic practitioners are faced more and more with large volumes of data. Therefore, there is a growing need for computational techniques to aid in evidence collection and analysis. With this study, we introduce a technique for preliminary analysis of audio evidence to discriminate between speech and non-speech. The novelty of our approach lies in the use of well-established speech quality measures for characterizing speech signals. These measures rely on models of human perception of speech to provide objective and reliable measurements of changes in characteristics that influence speech quality. We utilize this capability to compute quality scores between an audio and its noise-suppressed version and to model variations of these scores in speech as compared to those in non-speech audio. Tests performed on 11 datasets with widely varying characteristics show that the technique has a high discrimination capability, achieving an identification accuracy of 96 to 99% in most test cases, and offers good generalization properties across different datasets. Results also reveal that the technique is robust against encoding at low bit-rates, application of audio effects and degradations due to varying degrees of background noise. Performance comparisons made with existing studies show that the proposed method improves the state-of-the-art in audio content identification. (C) 2014 Elsevier B.V. All rights reserved. C1 [Sencar, Husrev T.] TOBB Univ Econ & Technol, Ankara, Turkey. New York Univ Abu Dhabi, Abu Dhabi, U Arab Emirates. RP Sencar, HT (reprint author), TOBB Univ Econ & Technol, Ankara, Turkey. EM euzun@etu.edu.tr; htsencar@etu.edu.tr CR Alexandre-Cortizo E., 2005, P EUROCON, V2, P1666 Alvarez L., 2011, P GAVTASC, P97 Barbedo J. G. 
A., 2006, Journal of the Audio Engineering Society, V54 BARNWELL TP, 1979, J ACOUST SOC AM, V66, P1658, DOI 10.1121/1.383664 Barthet M., 2011, P EXPL MUS CONT Beigi H., 2011, AUDIO SOURCE CLASSIF Campbell D, 2009, SIGNAL PROCESS, V89, P1489, DOI 10.1016/j.sigpro.2009.02.015 CAREY MJ, 1999, ACOUST SPEECH SIG PR, P149 Chang C.-C., 2011, ACM T INTEL SYST TEC, V2, P1, DOI DOI 10.1145/1961189.1961199 Chow D, 2004, LECT NOTES ARTIF INT, V3157, P901 DIMOLITSAS S, 1989, P IEE P I, V136, P317 Dong Y., 2005, P 14 INT C ACM WORLD, P1072 DONOHO DL, 1994, CR ACAD SCI I-MATH, V319, P1317 Dubey Rajesh Kumar, 2013, International Journal of Speech Technology, V16, DOI 10.1007/s10772-012-9162-4 Emiya V, 2011, IEEE T AUDIO SPEECH, V19, P2046, DOI 10.1109/TASL.2011.2109381 Falk T.H., 2008, P IWAENC Fenton S., 2011, P AES CONV Fry Dennis B., 1979, PHYS SPEECH Ghosal A, 2011, Proceedings of the Second International Conference on Emerging Applications of Information Technology (EAIT 2011), DOI 10.1109/EAIT.2011.19 Gonzalez R., 2012, P ICME, P556 Grancharov V, 2006, IEEE T AUDIO SPEECH, V14, P1948, DOI 10.1109/TASL.2006.883250 GRAY AH, 1976, IEEE T ACOUST SPEECH, V24, P380, DOI 10.1109/TASSP.1976.1162849 Haque MA, 2013, MULTIMED TOOLS APPL, V63, P63, DOI 10.1007/s11042-012-1023-2 Hu Y, 2008, IEEE T AUDIO SPEECH, V16, P229, DOI 10.1109/TASL.2007.911054 Hu Y., 2006, P INTERSPEECH Huijbiegts M, 2011, SPEECH COMMUN, V53, P143, DOI 10.1016/j.specom.2010.08.008 Itakura F., 1968, P 6 INT C AC, V17, pC17 Jain A, 1997, IEEE T PATTERN ANAL, V19, P153, DOI 10.1109/34.574797 Kim HC, 2002, INT C PATT RECOG, P160 Kitawaki N., 1992, ADV SPEECH SIGNAL PR, P357 Klatt D., 1982, P IEEE INT C AC SPEE, V7, P1278 Kondo K., 2012, SUBJECTIVE QUALITY M Kubichek R., 1991, 3 GLOBECOM, P1765 Lavner Y., 2009, EURASIP J AUDIO SPEE, V2 Lim C., 2011, ETRI J, V33 Loizou PC, 2011, STUD COMPUT INTELL, V346, P623 MacLean Ken, 2014, VOXFORGE OPEN SOURCE Manning CD, 2008, INTRO INFORM RETRIEV, V1 McKinney M., 2003, P ISMIR, V3, P151 Meier Paul, 2014, IDEA INT DIALECTS EN Munoz-Exposito JE, 2007, ENG APPL ARTIF INTEL, V20, P783, DOI 10.1016/j.engappai.2006.10.007 NHK Technology, 2009, OBJ PERC AUD QUAL ME NOCERINO N, 1985, SPEECH COMMUN, V4, P317, DOI 10.1016/0167-6393(85)90057-3 Ozer H, 2003, P SOC PHOTO-OPT INS, V5020, P55, DOI 10.1117/12.477313 Pikrakis A, 2008, IEEE T MULTIMEDIA, V10, P846, DOI 10.1109/TMM.2008.922870 Quackenbush S. R., 1988, OBJECTIVE MEASURES S RIX AW, 2001, ACOUST SPEECH SIG PR, P749 Rohdenburg T., 2005, P 9 INT WORKSH AC EC, P169 Sadjadi S. O., 2007, P 6 INT C INF COMM S, P1 Scheirer E, 1997, INT CONF ACOUST SPEE, P1331, DOI 10.1109/ICASSP.1997.596192 Song JH, 2008, IEEE SIGNAL PROC LET, V15, P103, DOI 10.1109/LSP.2007.911184 Tzanetakis G., 2001, P AMTA Verfaille V, 2006, IEEE T AUDIO SPEECH, V14, P1817, DOI 10.1109/TSA.2005.858531 Voran S, 1999, IEEE T SPEECH AUDI P, V7, P371, DOI 10.1109/89.771259 Wang J, 2008, INT CONF ACOUST SPEE, P2033 WANG SH, 1992, IEEE J SEL AREA COMM, V10, P819, DOI 10.1109/49.138987 Xie L, 2011, MULTIMEDIA SYST, V17, P101, DOI 10.1007/s00530-010-0205-x Yang W., 1997, P SPEECH COD WORKSH, V10 Yang W., 1999, THESIS TEMPLE U Zen H, 2009, SPEECH COMMUN, V51, P1039, DOI 10.1016/j.specom.2009.04.004 ZUE V, 1990, SPEECH COMMUN, V9, P351, DOI 10.1016/0167-6393(90)90010-7 NR 61 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
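The core idea of the method described in the abstract above is to score an audio clip against a noise-suppressed version of itself with an objective quality measure and to use the behaviour of that score to separate speech from non-speech. The sketch below substitutes a crude spectral-subtraction suppressor and a segmental-SNR-style score for the perceptual quality measures actually used in the paper; the test signals and the interpretation of the score are illustrative only.

```python
import numpy as np

def stft_mag(x, n_fft=512, hop=256):
    frames = [x[i:i + n_fft] * np.hanning(n_fft)
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.array(frames), axis=1))

def quality_score(x, n_fft=512, hop=256):
    """Segmental-SNR-style score between a clip and a crudely
    noise-suppressed version of itself (spectral subtraction with the
    20th-percentile magnitude as the noise estimate). A stand-in for the
    perceptual speech-quality measures used in the paper."""
    mag = stft_mag(x, n_fft, hop)
    noise = np.percentile(mag, 20, axis=0)           # rough noise floor
    clean = np.maximum(mag - noise, 1e-3 * mag)      # suppressed version
    err = np.maximum(mag - clean, 1e-12)
    seg_snr = 10 * np.log10(np.sum(mag ** 2, axis=1) / np.sum(err ** 2, axis=1))
    return float(np.mean(seg_snr))

rng = np.random.default_rng(2)
fs = 8000
t = np.arange(2 * fs) / fs
speech_like = (np.sin(2 * np.pi * 150 * t) * (0.5 + 0.5 * np.sin(2 * np.pi * 3 * t))
               + 0.05 * rng.standard_normal(t.size))
noise_like = rng.standard_normal(t.size)

for name, clip in [("speech-like", speech_like), ("noise-like", noise_like)]:
    print(name, round(quality_score(clip), 2))  # higher score suggests speech
```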
PD JUN-JUL PY 2014 VL 61-62 BP 1 EP 16 DI 10.1016/j.specom.2014.03.003 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA AJ6EO UT WOS:000337782700001 ER PT J AU Vieira, MN Sansao, JPH Yehia, HC AF Vieira, Maurilio N. Sansao, Joao Pedro H. Yehia, Hani C. TI Measurement of signal-to-noise ratio in dysphonic voices by image processing of spectrograms SO SPEECH COMMUNICATION LA English DT Article DE Signal-to-noise ratio; Breathiness; Dysphonic voice; 2D speech processing ID BREATHY VOCAL QUALITY; PATHOLOGICAL VOICE; ADDITIVE NOISE; PERTURBATION; SPEECH; HOARSENESS; FILTERS; ENHANCEMENT; COMPUTATION; PERCEPTION AB The measurement of glottal noise was investigated in human and synthesized dysphonic voices by means of two-dimensional (2D) speech processing. A prime objective was the reduction of measurement sensitivities to fundamental frequency (f(o)) tracking errors and phonatory aperiodicities. An available fingerprint image enhancement algorithm was used for signal-to-noise measurement in narrow band spectrographic images. This spectrographic signal-to-noise ratio estimation method ((SNR)-N-2) creates binary masks, mainly based on the orientation field of the partials, to separate energy in regions with strong harmonics from energy in noisy areas. Synthesized vowels with additive noise were used to calibrate the algorithm, validate the calibration, and systematically evaluate its dependence on f(o), shimmer (cycle-to-cycle amplitude perturbation), and jitter (cycle-to-cycle fo perturbation). In synthesized voices with known signal-to-noise ratios in the 5-40 dB range, (SNR)-N-2 estimates were, on average, accurate within +/- 3.2 dB and robust to variations in f(o) (120 Hz or 220 Hz), jitter (0-3%), and shimmer (0-30%). In human /a/ produced by dysphonic speakers, (SNR)-N-2 values and perceptual ratings of breathiness revealed a non-linear but monotonic decay of (SNR)-N-2 with increased breathiness. Comparison between (SNR)-N-2 and related acoustic measurements indicated similar behaviors regarding the relationship with breathiness and immunity to shimmer, but the other methods had marked influence of jitter. Overall, the (SNR)-N-2 method did not rely on accurate fo estimation, was robust to vocal perturbations and largely independent of vowel type, having also potential application in running speech. (C) 2014 Elsevier B.V. All rights reserved. C1 [Vieira, Maurilio N.; Yehia, Hani C.] Univ Fed Minas Gerais, Dept Elect Engn, BR-31270010 Belo Horizonte, MG, Brazil. [Sansao, Joao Pedro H.] Univ Fed Minas Gerais, Programa Posgrad Engn Eletr, BR-31270901 Belo Horizonte, MG, Brazil. [Sansao, Joao Pedro H.] Univ Fed Sao Joao del Rei, Dept Engn Telecomunicacoes & Mecatron, BR-36420000 Ouro Branco, MG, Brazil. RP Vieira, MN (reprint author), Univ Fed Minas Gerais, Dept Elect Engn, Ave Antonio Carlos 6627, BR-31270010 Belo Horizonte, MG, Brazil. EM maurilionunesv@cpdee.ufmg.br; jsansao@gmail.com; hani@cpdee.ufmg.br FU Conselho Nacional de Desenvolvimento Cientifico e TecnolOgico (CNPq); Coordenagao de Aperfeigoamento de Pessoal de Nivel Superior (Capes); Fundagdo de Amparo a Pesquisa do Estado de Minas Gerais (Fapemig) FX This research was supported by Conselho Nacional de Desenvolvimento Cientifico e TecnolOgico (CNPq), Coordenagao de Aperfeigoamento de Pessoal de Nivel Superior (Capes), and Fundagdo de Amparo a Pesquisa do Estado de Minas Gerais (Fapemig). 
The human dysphonic voices used in the study were perceptually rated by meticulous work of Ana Paula da Penha and Mariana de Sousa Dutra Borges. CR Bazen AM, 2002, IEEE T PATTERN ANAL, V24, P905, DOI 10.1109/TPAMI.2002.1017618 Bielamowicz S, 1996, J SPEECH HEAR RES, V39, P126 Chen G, 2013, J ACOUST SOC AM, V133, P1656, DOI 10.1121/1.4789931 COX NB, 1989, J ACOUST SOC AM, V85, P2165, DOI 10.1121/1.397865 DAUGMAN JG, 1985, J OPT SOC AM A, V2, P1160, DOI 10.1364/JOSAA.2.001160 DEJONCKERE PH, 1994, CLIN LINGUIST PHONET, V8, P161, DOI 10.3109/02699209408985304 DEKROM G, 1993, J SPEECH HEAR RES, V36, P254 Ding HJ, 2009, SPEECH COMMUN, V51, P259, DOI 10.1016/j.specom.2008.09.003 ESKENAZI L, 1990, J SPEECH HEAR RES, V33, P298 Ezzat T., 2007, 8 ANN C INIT SPEECH, P506 FANT G, 1979, STL QPSR, V1, P85 FEIJOO S, 1990, J SPEECH HEAR RES, V33, P324 Heman-Ackah YD, 2002, J VOICE, V16, P20, DOI 10.1016/S0892-1997(02)00067-X HILLENBRAND J, 1994, J SPEECH HEAR RES, V37, P769 HILLENBRAND J, 1987, J SPEECH HEAR RES, V30, P448 Hillenbrand J, 1996, J SPEECH HEAR RES, V39, P311 HIRANO M, 1988, ACTA OTO-LARYNGOL, V105, P432, DOI 10.3109/00016488809119497 HIRAOKA N, 1984, J ACOUST SOC AM, V76, P1648, DOI 10.1121/1.391611 Hong L, 1998, IEEE T PATTERN ANAL, V20, P777 HORII Y, 1979, J SPEECH HEAR RES, V22, P5 Horn T, 1998, ACUSTICA, V84, P175 JAIN AK, 1991, PATTERN RECOGN, V24, P1167, DOI 10.1016/0031-3203(91)90143-S Jellyman KA, 2009, LECT NOTES COMPUT SC, V5371, P63 KANE M, 1985, FOLIA PHONIATR, V37, P53 Kasuya H., 1986, ICASSP 86 Proceedings. IEEE-IECEJ-ASJ International Conference on Acoustics, Speech and Signal Processing (Cat. No.86CH2243-4) KASUYA H, 1986, J ACOUST SOC AM, V80, P1329, DOI 10.1121/1.394384 KLATT DH, 1980, J ACOUST SOC AM, V67, P971, DOI 10.1121/1.383940 KLATT DH, 1990, J ACOUST SOC AM, V87, P820, DOI 10.1121/1.398894 Kovesi P.D., 2005, MATLAB OCTAVE FUNCTI LAVER J, 1992, J VOICE, V6, P115, DOI 10.1016/S0892-1997(05)80125-0 LIEBERMAN P, 1963, J ACOUST SOC AM, V35, P344, DOI 10.1121/1.1918465 Markel JD, 1976, LINEAR PREDICTION SP MARTIN D, 1995, J SPEECH HEAR RES, V38, P765 Maryn Y, 2009, J ACOUST SOC AM, V126, P2619, DOI 10.1121/1.3224706 Murphy PJ, 2000, J ACOUST SOC AM, V107, P978, DOI 10.1121/1.428272 Murphy PJ, 1999, J ACOUST SOC AM, V105, P2866, DOI 10.1121/1.426901 Murphy PJ, 2007, J ACOUST SOC AM, V121, P1679, DOI 10.1121/1.2427123 MUTA H, 1988, J ACOUST SOC AM, V84, P1292, DOI 10.1121/1.396628 Nixon M. S., 2002, FEATURE EXTRACTION I Patel S, 2012, J SPEECH LANG HEAR R, V55, P639, DOI 10.1044/1092-4388(2011/10-0337) PROSEK RA, 1987, J COMMUN DISORD, V20, P105, DOI 10.1016/0021-9924(87)90002-5 QI YG, 1992, J ACOUST SOC AM, V92, P2569, DOI 10.1121/1.404429 QI YY, 1995, J ACOUST SOC AM, V97, P2525, DOI 10.1121/1.411972 Schoentgen J., 1982, Speech Communication, V1, DOI 10.1016/0167-6393(82)90020-6 Shue Y.-L., 2010, VOICESAUCE PROGRAM V Soon Y.I., 2003, IEEE T SPEECH AUDIO, V111, P717 TITZE IR, 1993, J SPEECH HEAR RES, V36, P1177 Vieira MN, 2002, J ACOUST SOC AM, V111, P1045, DOI 10.1121/1.1430686 Vieira M.N., 1997, THESIS U EDINBURGH U Wolfe VI, 2000, J SPEECH LANG HEAR R, V43, P697 WOLFE VI, 1987, J SPEECH HEAR RES, V30, P230 YANAGIHA.N, 1967, J SPEECH HEAR RES, V10, P531 YUMOTO E, 1982, J ACOUST SOC AM, V71, P1544, DOI 10.1121/1.387808 Zhang Y, 2005, J ACOUST SOC AM, V118, P2551, DOI 10.1121/1.2005907 NR 54 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
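The measurement described in the abstract above reduces to building a binary mask over a narrow-band spectrogram and taking the ratio of energy inside the harmonic regions to energy outside them. The sketch below keeps only that ratio step: the mask is a crude per-frame top-energy selection rather than the orientation-field mask derived from fingerprint image enhancement in the paper, and the synthetic vowel and parameter values are invented.

```python
import numpy as np

def spectrographic_snr(x, n_fft=2048, hop=256, harmonic_fraction=0.15):
    """Crude illustration of a mask-based spectrographic SNR: mark the
    highest-energy bins of each frame as 'harmonic', the rest as 'noise',
    and return 10*log10(harmonic energy / noise energy). The published
    method builds the mask from the orientation field of the partials;
    this sketch only shows how a binary mask becomes an SNR figure."""
    frames = [x[i:i + n_fft] * np.hanning(n_fft)
              for i in range(0, len(x) - n_fft + 1, hop)]
    power = np.abs(np.fft.rfft(np.array(frames), axis=1)) ** 2
    k = max(1, int(harmonic_fraction * power.shape[1]))
    thresh = np.partition(power, -k, axis=1)[:, -k][:, None]
    mask = power >= thresh                      # binary "harmonic" mask
    signal = power[mask].sum()
    noise = power[~mask].sum() + 1e-12
    return 10 * np.log10(signal / noise)

fs = 16000
t = np.arange(int(0.5 * fs)) / fs
rng = np.random.default_rng(3)
harmonics = sum(np.sin(2 * np.pi * 220 * h * t) / h for h in range(1, 6))
for noise_level in (0.01, 0.3, 1.0):            # increasing "breathiness"
    vowel = harmonics + noise_level * rng.standard_normal(t.size)
    print(noise_level, round(spectrographic_snr(vowel), 1))
```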
PD JUN-JUL PY 2014 VL 61-62 BP 17 EP 32 DI 10.1016/j.specom.2014.04.001 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA AJ6EO UT WOS:000337782700002 ER PT J AU Zhang, WQ Liu, WW Li, ZY Shi, YZ Liu, J AF Zhang, Wei-Qiang Liu, Wei-Wei Li, Zhi-Yi Shi, Yong-Zhe Liu, Jia TI Spoken language recognition based on gap-weighted subsequence kernels SO SPEECH COMMUNICATION LA English DT Article DE Spoken language recognition; Gap-weighted subsequence kernel (GWSK); n-Gram; Phone recognizer (PR); Vector space model (VSM) ID STRING KERNELS; FRONT-END; IDENTIFICATION; CLASSIFICATION AB Phone recognizers followed by vector space models (PR-VSM) is a state-of-the-art phonotactic method for spoken language recognition. This method resorts to a bag-of-n-grams, with each dimension of the super vector based on the counts of n-gram tokens. The n-gram cannot capture the long-context co-occurrence relations due to the restriction of gram order. Moreover, it is vulnerable to the errors induced by the frontend phone recognizer. In this paper, we introduce a gap-weighted subsequence kernel (GWSK) method to overcome the drawbacks of n-gram. GWSK counts the co-occurrence of the tokens in a non-contiguous way and thus is not only error-tolerant but also capable of revealing the long-context relations. Beyond this, we further propose a truncated GWSK with constraints on context length in order to remove the interference from remote tokens and lower the computational cost, and extend the idea to lattices to take the advantage of multiple hypotheses from the phone recognizer. In addition, we investigate the optimal parameter setting and computational complexity of the proposed methods. Experiments on NIST 2009 LRE evaluation corpus with several configurations show that the proposed GWSK is consistently more effective than the PR-VSM approach. (C) 2014 Elsevier B.V. All rights reserved. C1 [Zhang, Wei-Qiang; Liu, Wei-Wei; Li, Zhi-Yi; Shi, Yong-Zhe; Liu, Jia] Tsinghua Univ, Dept Elect Engn, Tsinghua Natl Lab Informat Sci & Technol, Beijing 100084, Peoples R China. RP Zhang, WQ (reprint author), Tsinghua Univ, Dept Elect Engn, Tsinghua Natl Lab Informat Sci & Technol, Beijing 100084, Peoples R China. EM wqzhang@tsinghua.edu.cn RI Zhang, Wei-Qiang/A-7088-2008 OI Zhang, Wei-Qiang/0000-0003-3841-1959 FU National Natural Science Foundation of China [61370034, 61273268, 61005019] FX This work was supported by the National Natural Science Foundation of China under Grant Nos. 61370034, 61273268 and 61005019. CR Campbell W., 2006, P OD SAN JUAN Campbell W. M., 2004, P ICASSP, P1 Campbell WM, 2007, INT CONF ACOUST SPEE, P989 Dehak N, 2011, IEEE T AUDIO SPEECH, V19, P788, DOI 10.1109/TASL.2010.2064307 Fan RE, 2008, J MACH LEARN RES, V9, P1871 Gauvain J. -L., 2004, P ICSLP, P25 Hazen T. J., 1993, P EUR 93 SEPT, V2, P1303 Hofmann T, 2008, ANN STAT, V36, P1171, DOI 10.1214/009053607000000677 Kim S., 2010, BMC BIOINFORMATICS, V11 Kruengkrai C., 2005, P 5 INT S COMM INF T, P896 Lerma M. 
A., 2008, SEQUENCES STRINGS Li HZ, 2007, IEEE T AUDIO SPEECH, V15, P271, DOI 10.1109/TASL.2006.876860 Lodhi H, 2002, J MACH LEARN RES, V2, P419, DOI 10.1162/153244302760200687 Ma B, 2007, IEEE T AUDIO SPEECH, V15, P2053, DOI 10.1109/TASL.2007.902861 Matejka P, 2005, P INT 2005 LISB PORT, P2237 Muthusamy YK, 1994, IEEE SIGNAL PROC MAG, V11, P33, DOI 10.1109/79.317925 Navratil J, 1997, INT CONF ACOUST SPEE, P1115, DOI 10.1109/ICASSP.1997.596137 Navratil J, 2001, IEEE T SPEECH AUDI P, V9, P678, DOI 10.1109/89.943345 NIST, 2009, 2009 NIST LANG REC E Penagarikano M, 2011, IEEE T AUDIO SPEECH, V19, P2348, DOI 10.1109/TASL.2011.2134088 Rousu J, 2005, J MACH LEARN RES, V6, P1323 Shawe-Taylor J., 2004, KERNEL METHODS PATTE Stolcke A., 2002, P INT C SPOK LANG PR, V2, P901 Tong R, 2009, IEEE T AUDIO SPEECH, V17, P1335, DOI 10.1109/TASL.2009.2016731 Torres-Carrasquillo P., 2008, P INT 08, P719 Torres-Carrasquillo PA, 2002, THESIS MICHIGAN STAT Vapnik V., 1995, NATURE STAT LEARNING Yin CH, 2008, NEUROCOMPUTING, V71, P944, DOI 10.1016/j.neucom.2007.02.005 Zhang W., 2006, P ICSP GUIL, V1 Zhang WQ, 2010, CHINESE J ELECTRON, V19, P124 ZISSMAN MA, 1994, INT CONF ACOUST SPEE, P305 Zissman MA, 2001, SPEECH COMMUN, V35, P115, DOI 10.1016/S0167-6393(00)00099-6 NR 32 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2014 VL 60 BP 1 EP 12 DI 10.1016/j.specom.2014.01.005 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA AH0JI UT WOS:000335804800001 ER PT J AU Xia, BY Bao, CC AF Xia, Bingyin Bao, Changchun TI Wiener filtering based speech enhancement with Weighted Denoising Auto-encoder and noise classification SO SPEECH COMMUNICATION LA English DT Article DE Speech enhancement; Weighted Denoising Auto-encoder; SNR estimation; Wiener filter; Noise classification; Gaussian mixture model ID RECOGNITION AB A novel speech enhancement method based on Weighted Denoising Auto-encoder (WDA) and noise classification is proposed in this paper. A weighted reconstruction loss function is introduced into the conventional Denoising Auto-encoder (DA), and the relationship between the power spectra of clean speech and noisy observation is described by WDA model. First, the sub-band power spectrum of clean speech is estimated by WDA model from the noisy observation. Then, the a priori SNR is estimated by the a Posteriori SNR Controlled Recursive Averaging (PCRA) approach. Finally, the clean speech is obtained by Wiener filter in frequency domain. In addition, in order to make the proposed method suitable for various kinds of noise conditions, a Gaussian Mixture Model (GMM) based noise classification method is employed. And the corresponding WDA model is used in the enhancement process. From the test results under ITU-T G.160, it is shown that, in comparison with the reference method which is the Wiener filtering method with decision-directed approach for SNR estimation, the WDA-based speech enhancement methods could achieve better objective speech quality, no matter whether the noise conditions are included in the training set or not. And the similar amount of noise reduction and SNR improvement can be obtained with smaller distortion on speech level. (C) 2014 Elsevier B.V. All rights reserved. C1 [Xia, Bingyin; Bao, Changchun] Beijing Univ Technol, Sch Elect Informat & Control Engn, Speech & Audio Signal Proc Lab, Beijing 100124, Peoples R China. 
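Once the WDA has produced an estimate of the clean sub-band power spectrum, the enhancement step described above is a standard frequency-domain Wiener filter: the a priori SNR xi is formed from estimated clean and noise powers and the gain xi / (1 + xi) is applied to the noisy spectrum. A minimal sketch of that final step only; the WDA estimate, the noise tracker and the PCRA smoothing are taken as given, and the example arrays are invented.

```python
import numpy as np

def wiener_enhance(noisy_stft, est_clean_power, est_noise_power,
                   xi_floor=10 ** (-25 / 10)):
    """Frequency-domain Wiener filtering driven by an a priori SNR estimate.
    In the paper est_clean_power would come from the WDA model and
    est_noise_power from a noise estimate; here both are given arrays, and
    the PCRA recursive smoothing of the a priori SNR is omitted."""
    xi = np.maximum(est_clean_power / np.maximum(est_noise_power, 1e-12), xi_floor)
    gain = xi / (1.0 + xi)                  # Wiener gain per frame and bin
    return gain * noisy_stft                # enhanced complex spectrum

# Toy example: one frame, 5 frequency bins.
noisy = np.array([1.0 + 0.2j, 0.1 + 0.1j, 0.8 - 0.3j, 0.05 + 0.0j, 0.4 + 0.4j])
clean_pow = np.array([0.9, 0.001, 0.6, 0.0005, 0.25])   # pretend WDA output
noise_pow = np.array([0.05, 0.05, 0.05, 0.05, 0.05])
print(np.round(np.abs(wiener_enhance(noisy, clean_pow, noise_pow)), 3))
```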
RP Bao, CC (reprint author), Beijing Univ Technol, Sch Elect Informat & Control Engn, Speech & Audio Signal Proc Lab, Beijing 100124, Peoples R China. EM baochch@bjut.edu.cn FU Beijing Natural Science Foundation Program; Beijing Municipal Commission of Education [KZ201110005005] FX This work was supported by the Beijing Natural Science Foundation Program and Scientific Research Key Program of Beijing Municipal Commission of Education (No. KZ201110005005). CR 3GPP2, 2010, ENH VAR RAT COD SPEE Bengio Yoshua, 2009, Foundations and Trends in Machine Learning, V2, DOI 10.1561/2200000006 BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 Brakel P., 2013, ANN C INT SPEECH COM, P2973 Cohen I, 2002, IEEE SIGNAL PROC LET, V9, P12, DOI 10.1109/97.988717 Dahl GE, 2012, IEEE T AUDIO SPEECH, V20, P30, DOI 10.1109/TASL.2011.2134090 DONOHO DL, 1995, IEEE T INFORM THEORY, V41, P613, DOI 10.1109/18.382009 EPHRAIM Y, 1995, IEEE T SPEECH AUDI P, V3, P251, DOI 10.1109/89.397090 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 Hinton GE, 2006, SCIENCE, V313, P504, DOI 10.1126/science.1127647 ITU-T, 2008, ITU SER G ITU-T, 2001, ITU SER P JOHNSTON JD, 1988, IEEE J SEL AREA COMM, V6, P314, DOI 10.1109/49.608 Lecun Y, 1998, P IEEE, V86, P2278, DOI 10.1109/5.726791 LIM JS, 1979, P IEEE, V67, P1586, DOI 10.1109/PROC.1979.11540 Lu X., 2013, ANN C INT SPEECH COM, P436 Maas AL, 2012, 13TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2012 (INTERSPEECH 2012), VOLS 1-3, P22 NTT, 1994, MULT SPEECH DAT TEL REYNOLDS DA, 1995, IEEE T SPEECH AUDI P, V3, P72, DOI 10.1109/89.365379 Sohn J, 1999, IEEE SIGNAL PROC LET, V6, P1 VARGA A, 1993, SPEECH COMMUN, V12, P247, DOI 10.1016/0167-6393(93)90095-3 Vincent P, 2010, J MACH LEARN RES, V11, P3371 Vincent P., 2008, ICML, P1096 Xie J., 2012, ADV NEURAL INF PROCE, P341 Xu H. -T., 2005, EUR C SPEECH COMM TE, P977 NR 25 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2014 VL 60 BP 13 EP 29 DI 10.1016/j.specom.2014.02.001 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA AH0JI UT WOS:000335804800002 ER PT J AU Lee, KS AF Lee, Ki-Seung TI A unit selection approach for voice transformation SO SPEECH COMMUNICATION LA English DT Article DE Voice conversion; Unit selection; Hidden Markov model ID TO-SPEECH SYNTHESIS; CONVERSION; RECOGNITION; ALGORITHM; FREQUENCY; NETWORKS; DESIGN; MODELS AB A voice transformation (VT) method that can make the utterance of a source speaker mimic that of a target speaker is described. Speaker individuality transformation is achieved by altering four feature parameters, which include the linear prediction coefficients cepstrum (LPCC), ALPCC, LP-residual and pitch period. The main objective of this study involves construction of an optimal sequence of features selected from a target speaker's database, to maximize both the correlation probabilities between the transformed and the source features and the likelihood of the transformed features with respect to the target model. A set of two-pass conversion rules is proposed, where the feature parameters are first selected from a database then the optimal sequence of the feature parameters is then constructed in the second pass. The conversion rules were developed using a statistical approach that employed a maximum likelihood criterion. 
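The second pass just described is a search for the candidate sequence that minimises accumulated target and concatenation costs, which can be written as a Viterbi-style dynamic programme over the pre-selected units. The sketch below shows that search with placeholder scalar costs; the paper's actual costs come from HMM likelihoods with global control variables, which are not reproduced here.

```python
import numpy as np

def select_units(candidates, target_cost, concat_cost):
    """Generic unit-selection search: for each position t there is a list of
    candidate units; pick the sequence minimising the sum of target costs
    plus concatenation costs between neighbours (dynamic programming).
    The cost functions here are placeholders, not the paper's likelihoods."""
    best = [np.array([target_cost(0, u) for u in candidates[0]])]
    back = []
    for t in range(1, len(candidates)):
        prev, cur = candidates[t - 1], candidates[t]
        scores = np.empty(len(cur))
        ptrs = np.empty(len(cur), dtype=int)
        for j, u in enumerate(cur):
            trans = best[-1] + np.array([concat_cost(p, u) for p in prev])
            ptrs[j] = int(np.argmin(trans))
            scores[j] = trans[ptrs[j]] + target_cost(t, u)
        best.append(scores)
        back.append(ptrs)
    path = [int(np.argmin(best[-1]))]
    for ptrs in reversed(back):          # trace the best sequence backwards
        path.append(int(ptrs[path[-1]]))
    return list(reversed(path))

# Toy example: units are scalar "features"; the target trajectory is [1, 2, 3].
cands = [[0.8, 1.4], [1.9, 2.6, 2.1], [3.2, 2.4]]
tgt = [1.0, 2.0, 3.0]
idx = select_units(cands,
                   target_cost=lambda t, u: abs(u - tgt[t]),
                   concat_cost=lambda a, b: 0.3 * abs(b - a))
print(idx, [cands[t][i] for t, i in enumerate(idx)])
```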
In constructing an optimal sequence of the features, a hidden Markov model (HMM) with global control variables (GCV) was employed to find the most likely combination of the features with respect to the target speaker's model. The effectiveness of the proposed transformation method was evaluated using objective tests and formal listening tests. We confirmed that the proposed method leads to perceptually more preferred results, compared with the conventional methods. (C) 2014 Elsevier B.V. All rights reserved. C1 Konkuk Univ, Dept Elect Engn, Seoul 143701, South Korea. RP Lee, KS (reprint author), Konkuk Univ, Dept Elect Engn, 1 Hwayang Dong, Seoul 143701, South Korea. EM kseung@konkuk.ac.kr CR Abe M., 1988, P IEEE ICASSP, P565 Arslan LM, 1999, SPEECH COMMUN, V28, P211, DOI 10.1016/S0167-6393(99)00015-1 Beutnagel M., 1999, P JOINT M ASA EAA DA Bi N, 1997, IEEE T SPEECH AUDI P, V5, P97 Cheng YM, 1994, IEEE T SPEECH AUDI P, V2, P544 Childers D. G., 1985, P ICASSP 85, P748 CHILDERS DG, 1994, IEEE T BIO-MED ENG, V41, P663, DOI 10.1109/10.301733 COX SJ, 1989, P IEEE INT C AC SPEE, P294 Dutoit T., 2007, P ICASSP, P15 Erickson M. L., 2003, 31 ANN S CAR PROF VO, P24 Erro D, 2013, IEEE T AUDIO SPEECH, V21, P556, DOI 10.1109/TASL.2012.2227735 Helander E, 2012, IEEE T AUDIO SPEECH, V20, P806, DOI 10.1109/TASL.2011.2165944 Huang Y. C., 2013, P IEEE T AUDIO SPEEC, V21, P51 IWAHASHI N, 1995, SPEECH COMMUN, V16, P139, DOI 10.1016/0167-6393(94)00051-B Jian Z. H., 2007, P INT SIGN PROC COMM, P32 KAIN A, 1998, ACOUST SPEECH SIG PR, P285 Kain A, 2001, INT CONF ACOUST SPEE, P813 Kawahara H, 1999, SPEECH COMMUN, V27, P187, DOI 10.1016/S0167-6393(98)00085-5 Kominek J., 2004, P 5 ISCA SPEECH SYNT, P223 Lee KS, 2002, IEICE T INF SYST, VE85D, P1297 Lee KS, 2007, IEEE T AUDIO SPEECH, V15, P641, DOI 10.1109/TASL.2006.876760 Lee KS., 1996, P ICSLP, P1401 Lee KS, 2008, IEEE T BIO-MED ENG, V55, P930, DOI 10.1109/TBME.2008.915658 LINDE Y, 1980, IEEE T COMMUN, V28, P84, DOI 10.1109/TCOM.1980.1094577 Ma JC, 2005, PROCEEDINGS OF THE 2005 IEEE INTERNATIONAL CONFERENCE ON NATURAL LANGUAGE PROCESSING AND KNOWLEDGE ENGINEERING (IEEE NLP-KE'05), P199 Manabe H, 2004, P ANN INT IEEE EMBS, V26, P4389 MIZUNO H, 1995, SPEECH COMMUN, V16, P153, DOI 10.1016/0167-6393(94)00052-C MOULINES E, 1990, SPEECH COMMUN, V9, P453, DOI 10.1016/0167-6393(90)90021-Z NARENDRANATH M, 1995, SPEECH COMMUN, V16, P207, DOI 10.1016/0167-6393(94)00058-I Potamianos G, 2003, P IEEE, V91, P1306, DOI 10.1109/JPROC.2003.817150 Rabiner L, 1993, FUNDAMENTALS SPEECH Rabiner L.R., 1978, DIGITAL PROCESSING S REYNOLDS DA, 1995, IEEE T SPEECH AUDI P, V3, P72, DOI 10.1109/89.365379 Saito D, 2012, IEEE T AUDIO SPEECH, V20, P1784, DOI 10.1109/TASL.2012.2188628 Savic M., 1991, DIGIT SIGNAL PROCESS, V4, P107 Shuang ZW, 2008, INT CONF ACOUST SPEE, P4661 Rao KS, 2010, COMPUT SPEECH LANG, V24, P474, DOI 10.1016/j.csl.2009.03.003 Stylianou Y, 1998, IEEE T SPEECH AUDI P, V6, P131, DOI 10.1109/89.661472 Summerfield A. 
Q., 1992, PHILOS T R SOC LON B, V335, P71 Sundermann D., 2006, P ICASSP, P14 Sundermann D., 2005, P IEEE WORKSH AUT SP, P369 Toda T, 2007, IEEE T AUDIO SPEECH, V15, P2222, DOI 10.1109/TASL.2007.907344 VALBRET H, 1992, SPEECH COMMUN, V11, P175, DOI 10.1016/0167-6393(92)90012-V WHITE GM, 1976, IEEE T ACOUST SPEECH, V24, P183, DOI 10.1109/TASSP.1976.1162779 Ye H, 2006, IEEE T AUDIO SPEECH, V14, P1301, DOI 10.1109/TSA.2005.860839 NR 45 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2014 VL 60 BP 30 EP 43 DI 10.1016/j.specom.2014.02.002 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA AH0JI UT WOS:000335804800003 ER PT J AU Sun, CL Zhu, Q Wan, MH AF Sun, Chengli Zhu, Qi Wan, Minghua TI A novel speech enhancement method based on constrained low-rank and sparse matrix decomposition SO SPEECH COMMUNICATION LA English DT Article DE Speech enhancement; Matrix decomposition; Low-rank matrix approximation; Robust principal component analysis ID SPECTRAL AMPLITUDE ESTIMATOR; SUBSPACE APPROACH; NOISE; SUBTRACTION; COMPLETION; ALGORITHM AB In this paper, we present a novel speech enhancement method based on the principle of constrained low-rank and sparse matrix decomposition (CLSMD). According to the proposed method, noise signal can be assumed as a low-rank component because noise spectra within different time frames are usually highly correlated with each other; while the speech signal is regarded as a sparse component since it is relatively sparse in time frequency domain. Based on these assumptions, we develop an alternative projection algorithm to separate the speech and noise magnitude spectra by imposing rank and sparsity constraints, with which the enhanced time-domain speech can be constructed from sparse matrix by inverse discrete Fourier transform and overlap-add-synthesis. The proposed method is significantly different from existing speech enhancement methods. It can estimate enhanced speech in a straightforward manner, and does not need a voice activity detector to find noise-only excerpts for noise estimation. Moreover, it can obtain better performance in low SNR conditions, and does not need to know the exact distribution of noise signal. Experimental results show the new method can perform better than conventional methods in many types of strong noise conditions, in terms of yielding less residual noise and lower speech distortion. (C) 2014 Elsevier B.V. All rights reserved. C1 [Sun, Chengli] Sci & Technol Avion Integrat Lab, Shanghai 200233, Peoples R China. [Sun, Chengli; Wan, Minghua] Nanchang Hang kong Univ, Sch Informat, Nanchang 330063, Peoples R China. [Zhu, Qi] Nanjing Univ Aeronaut & Astronaut, Dept Comp Sci & Engn, Nanjing 210016, Jiangsu, Peoples R China. RP Sun, CL (reprint author), Nanchang Hang kong Univ, Sch Informat, Nanchang 330063, Peoples R China. EM sun_chengli@163.com FU Science and Technology on Avionics Integration Laboratory; China's Aviation Science Fund [20115556007]; National Nature Science Committee of China [61362031, 61203243, 61263040, 61263032] FX We thank the anonymous reviews for their constructive comments and suggestions. This article is partially supported by the Science and Technology on Avionics Integration Laboratory and China's Aviation Science Fund (No. 20115556007), and the funds of National Nature Science Committee of China (Nos. 61362031, 61203243, 61263040, 61263032). 
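The decomposition described in the abstract above alternates two projections on the noisy magnitude spectrogram: a rank-constrained projection for the noise component and a sparsity-constrained projection for the speech component. The sketch below implements such an alternation with a truncated SVD and a keep-top-fraction threshold; the rank, sparsity level, iteration count and toy spectrogram are illustrative choices, not the paper's algorithm or parameters.

```python
import numpy as np

def clsmd(Y, rank=2, sparsity=0.15, n_iter=30):
    """Alternating projections separating a magnitude spectrogram Y into a
    low-rank part L (noise) and a sparse part S (speech). L-step: best
    rank-`rank` approximation of Y - S via truncated SVD. S-step: keep the
    largest `sparsity` fraction of entries of Y - L, zero the rest, and
    clip to be non-negative. Illustrative only."""
    L = np.zeros_like(Y)
    S = np.zeros_like(Y)
    k = int(sparsity * Y.size)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(Y - S, full_matrices=False)
        L = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        R = Y - L
        thresh = np.partition(R.ravel(), -k)[-k]
        S = np.where(R >= thresh, np.maximum(R, 0.0), 0.0)
    return L, S

# Toy magnitude "spectrogram": rank-1 noise bed plus a few strong speech bins.
rng = np.random.default_rng(4)
noise = np.outer(np.abs(rng.standard_normal(64)), np.abs(rng.standard_normal(40)))
speech = np.zeros_like(noise)
speech[10:14, 5:9] = 6.0
L, S = clsmd(noise + speech)
print("fraction of sparse-part energy inside the speech bins:",
      round(S[10:14, 5:9].sum() / max(S.sum(), 1e-9), 2))
```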
CR [Anonymous], 2001, PERC EV SPEECH QUAL BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 Cai JF, 2010, SIAM J OPTIMIZ, V20, P1956, DOI 10.1137/080738970 Candes EJ, 2011, J ACM, V58, DOI 10.1145/1970392.1970395 Candes EJ, 2010, IEEE T INFORM THEORY, V56, P2053, DOI 10.1109/TIT.2010.2044061 Chang SG, 2000, IEEE T IMAGE PROCESS, V9, P1532, DOI 10.1109/83.862633 Cohen I, 2004, IEEE SIGNAL PROC LET, V11, P725, DOI [10.1109/LSP.2004.833478, 10.1109/LSP.2004.833278] Doclo S, 2002, IEEE T SIGNAL PROCES, V50, P2230, DOI 10.1109/TSP.2002.801937 EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550 EPHRAIM Y, 1995, IEEE T SPEECH AUDI P, V3, P251, DOI 10.1109/89.397090 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 Hai-yan W., 2011, J CHIN U POSTS TELEC, V1, P13 Hermus K, 2007, EURASIP J ADV SIG PR, DOI 10.1155/2007/45821 Hu Y, 2003, IEEE T SPEECH AUDI P, V11, P334, DOI 10.1109/TSA.2003.814458 Hu Y, 2008, IEEE T AUDIO SPEECH, V16, P229, DOI 10.1109/TASL.2007.911054 Huang P.-S., 2012, ICASSP Jolliffe I. T., 2002, PRINCIPAL COMPONENT, V2nd Larsen R. M., 1998, LANCZOS BIDIAGONALIZ Lin Z, 2009, UILUENG092215 UIUC Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst Lu Y, 2008, SPEECH COMMUN, V50, P453, DOI 10.1016/j.specom.2008.01.003 Manohar K, 2006, SPEECH COMMUN, V48, P96, DOI 10.1016/j.specom.2005.08.002 Mardani M, 2013, IEEE T INFORM THEORY, V59, P5186, DOI 10.1109/TIT.2013.2257913 Moor B. D., 1993, IEEE T SIGNAL PROCES, V41, P2826 Paliwal K, 2010, SPEECH COMMUN, V52, P450, DOI 10.1016/j.specom.2010.02.004 Paliwal K, 2012, SPEECH COMMUN, V54, P282, DOI 10.1016/j.specom.2011.09.003 Peng Y., 2012, IEEE T PATTERN ANAL Plapous C, 2006, IEEE T AUDIO SPEECH, V14, P2098, DOI 10.1109/TASL.2006.872621 Quatieri T. F., 2002, DISCRETE TIME SPEECH Scalart P., 1996, P 21 IEEE INT C AC S Shannon B., 2006, P INT C SPOK LANG PR Soon IY, 2000, IEE P-VIS IMAGE SIGN, V147, P247, DOI 10.1049/ip-vis:20000323 Stark A, 2011, SPEECH COMMUN, V53, P51, DOI 10.1016/j.specom.2010.08.001 Toh KC, 2010, PAC J OPTIM, V6, P615 Vaseghi S. V., 2006, ADV DIGITAL SIGNAL P Wiener N., 1949, EXTRAPOLATION INTERP Wright J., 2009, NIPS Xu H, 2012, IEEE T INFORM THEORY, V58, P3047, DOI 10.1109/TIT.2011.2173156 Zhang Y, 2013, SPEECH COMMUN, V55, P509, DOI 10.1016/j.specom.2012.09.005 Zhou X., 2013, IEEE T PATTERN ANAL, V35 NR 40 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2014 VL 60 BP 44 EP 55 DI 10.1016/j.specom.2014.03.002 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA AH0JI UT WOS:000335804800004 ER PT J AU Larcher, A Lee, KA Ma, B Li, HZ AF Larcher, Anthony Lee, Kong Aik Ma, Bin Li, Haizhou TI Text-dependent speaker verification: Classifiers, databases and RSR2015 SO SPEECH COMMUNICATION LA English DT Article DE Speaker recognition; Text-dependent; Database ID GAUSSIAN MIXTURE-MODELS; DATA FUSION; RECOGNITION; IDENTIFICATION; SPEECH; CORPUS; HMM; NORMALIZATION; FEATURES AB The RSR2015 database, designed to evaluate text-dependent speaker verification systems under different durations and lexical constraints has been collected and released by the Human Language Technology (HLT) department at Institute for Infocomm Research ((IR)-R-2) in Singapore. English speakers were recorded with a balanced diversity of accents commonly found in Singapore. 
More than 151 h of speech data were recorded using mobile devices. The pool of speakers consists of 300 participants (143 female and 157 male speakers) between 17 and 42 years old, making the RSR2015 database one of the largest publicly available databases targeted at text-dependent speaker verification. We provide an evaluation protocol for each of the three parts of the database, together with the results of two speaker verification systems: the HiLAM system, based on a three-layer acoustic architecture, and an i-vector/PLDA system. We thus provide a reference evaluation scheme and a reference performance on the RSR2015 database to the research community. The HiLAM system outperforms the state-of-the-art i-vector system in most of the scenarios. (C) 2014 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/3.0/). C1 [Larcher, Anthony; Lee, Kong Aik; Ma, Bin; Li, Haizhou] Human Language Technol Dept 1, Inst Infocomm Res I2R, Singapore 138632, Singapore. RP Larcher, A (reprint author), Human Language Technol Dept 1, Inst Infocomm Res I2R, Fusionopolis Way 21-01,Connexis South Tower, Singapore 138632, Singapore. EM alarcher@i2r.a-star.edu.sg; kalee@i2r.a-star.edu.sg; mabin@i2r.a-star.edu.sg; hli@i2r.a-star.edu.sg CR Amino K, 2009, FORENSIC SCI INT, V185, P21, DOI 10.1016/j.forsciint.2008.11.018 Aronowitz Hagai, 2012, OD SPEAK LANG REC WO Avinash B, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 1-2, P1073 Bailly-Bailliere E, 2003, LECT NOTES COMPUT SC, V2688, P625 BenZeghiba MF, 2006, SPEECH COMMUN, V48, P1200, DOI 10.1016/j.specom.2005.08.008 Boakye K., 2004, OD SPEAK LANG REC WO, P1 Boies D., 2004, OD SPEAK LANG REC WO, P1 Bonastre J.F., 2003, EUR C SPEECH COMM TE, P2013 Bousquet P.M., 2011, ANN C INT SPEECH COM, P485 Bousquet P.M., 2012, OD SPEAK LANG REC WO, P1 Brummer N., 2010, OD SPEAK LANG REC WO, P1 Campbell J., 1994, YOHO SPEAKER VERIFIC Campbell JP, 2009, IEEE SIGNAL PROC MAG, V26, P95, DOI 10.1109/MSP.2008.931100 CAMPBELL JP, 1995, INT CONF ACOUST SPEE, P341, DOI 10.1109/ICASSP.1995.479543 CAMPBELL JP, 1999, ACOUST SPEECH SIG PR, P829 Charlet D, 2000, SPEECH COMMUN, V31, P113, DOI 10.1016/S0167-6393(99)00072-2 Charlet D, 1997, PATTERN RECOGN LETT, V18, P873, DOI 10.1016/S0167-8655(97)00064-0 Chatzis S, 2007, ICSPC: 2007 IEEE INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING AND COMMUNICATIONS, VOLS 1-3, PROCEEDINGS, P804 Che CW, 1996, INT CONF ACOUST SPEE, P673 Chen K, 1996, IEEE T NEURAL NETWOR, V7, P1309 Chen W., 2012, INT C AUD LANG IM PR, P432 Chollet G., 1996, TECHNICAL REPORT Cole R., 1998, P INT C SPOK LANG PR, P3167 Cumani S, 2012, 2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), P4361 Cumani S, 2013, INT CONF ACOUST SPEE, P7644, DOI 10.1109/ICASSP.2013.6639150 Das A., 2010, IEEE INT C AC SPEECH, P4510 Dehak N, 2011, IEEE T AUDIO SPEECH, V19, P788, DOI 10.1109/TASL.2010.2064307 Dehak N., 2011, ANN C INT SPEECH COM, P857 Dehak N., 2010, OD SPEAK LANG REC WO, P1 Dessimoz D., 2008, FORENSIC SCI INT, V167, P154 Dialogues Spotlight Technology, 2000, TECHNICAL REPORT Doddington G., 2012, OD SPEAK LANG REC WO, P1 Doddington G.R., 1998, WORKSH SPEAK REC ITS, P20 Dong C., 2008, OD SPEAK LANG REC WO, P1 Dumas B., 2005, BIOMETRICS INTERNET, V275, P59 Dutta T, 2008, CISP 2008: FIRST INTERNATIONAL CONGRESS ON IMAGE AND SIGNAL PROCESSING, VOL 2, PROCEEDINGS, P354, DOI 10.1109/CISP.2008.560 Dutta T., 2007, IMAGE VISION
COMPUT, P238 ELDA - Evaluations and Language resources Distribution Agency, 2003, S0050 RUSTEN RUSS SW FARRELL KR, 1995, INT CONF ACOUST SPEE, P349, DOI 10.1109/ICASSP.1995.479545 FARRELL KR, 1998, ACOUST SPEECH SIG PR, P1129 Faundez-Zanuy M, 2006, IEEE AERO EL SYS MAG, V21, P29, DOI 10.1109/MAES.2006.1703234 Fauve B., 2009, THESIS SWANSEA U Fierrez J, 2010, PATTERN ANAL APPL, V13, P235, DOI 10.1007/s10044-009-0151-4 Fierrez J, 2007, PATTERN RECOGN, V40, P1389, DOI 10.1016/j.patcog.2006.10.014 Finan RA, 1996, IEEE IJCNN, P1992, DOI 10.1109/ICNN.1996.549207 FORSYTH M, 1995, SPEECH COMMUN, V17, P117, DOI 10.1016/0167-6393(95)00020-O Fox N.A., 2005, INT C AUD VID BAS PE, P777 FURUI S, 1981, IEEE T ACOUST SPEECH, V29, P254, DOI 10.1109/TASSP.1981.1163530 FURUI S, 1981, IEEE T ACOUST SPEECH, V29, P342, DOI 10.1109/TASSP.1981.1163605 Garcia-Romero D., 2011, ANN C INT SPEECH COM, P249 Garcia-Salicetti S., 2003, AUDIO VIDEO BASED BI Garofolo J.S., 1993, TIMIT ACOUSTIC PHONE, P1 Gu Y., 1998, ANN C INT SPEECH COM, P125 Hasan T, 2013, INT CONF ACOUST SPEE, P7663, DOI 10.1109/ICASSP.2013.6639154 Hebert M., 2008, HDB SPEECH PROCESSIN, P743 Hebert M., 2003, EUR C SPEECH COMM TE, P1665 Hebert M, 2005, INT CONF ACOUST SPEE, P729 Heck Larry, 2001, OD SPEAK LANG REC WO, P249 Hennebert J, 2000, SPEECH COMMUN, V31, P265, DOI 10.1016/S0167-6393(99)00082-5 Jiang Y., 2012, ANN C INT SPEECH COM, P1680 Kahn J., 2011, INT C PHON SCI ICPHS, P1002 Kahn J., 2010, OD SPEAK LANG REC WO, P109 Kanagasundaram A., 2011, ANN C INT SPEECH COM, P2341 Karam ZN, 2011, 2011 IEEE STATISTICAL SIGNAL PROCESSING WORKSHOP (SSP), P525, DOI 10.1109/SSP.2011.5967749 Karlsson I, 2000, SPEECH COMMUN, V31, P121, DOI 10.1016/S0167-6393(99)00073-4 Karlsson I., 1999, GOTHENBURG PAPERS TH, P93 KATO T, 2003, ACOUST SPEECH SIG PR, P57 Kekre H., 2010, INT J BIOMETRICS BIO, V4, P100 Kelly F, 2011, LECT NOTES COMPUT SC, V6583, P113, DOI 10.1007/978-3-642-19530-3_11 Kelly F., 2012, INT C BIOM ICB, P478 Kenny P, 2013, INT CONF ACOUST SPEE, P7649, DOI 10.1109/ICASSP.2013.6639151 Kenny P, 2007, IEEE T AUDIO SPEECH, V15, P1435, DOI 10.1109/TASL.2006.881693 Kenny P., 2004, P IEEE INT C AC SPEE, V1, P37 Kinnunen T, 2010, SPEECH COMMUN, V52, P12, DOI 10.1016/j.specom.2009.08.009 Kong Aik Lee, 2011, ANN C INT SPEECH COM, P3317 Larcher A, 2013, DIGIT SIGNAL PROCESS, V23, P1910, DOI 10.1016/j.dsp.2013.07.007 Larcher A, 2012, 2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), P4773 Larcher A., 2013, ANN C INT SPEECH COM, P2768 Larcher A, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P371 Larcher A, 2013, INT CONF ACOUST SPEE, P7673, DOI 10.1109/ICASSP.2013.6639156 Larcher Anthony, 2012, ANN C INT SPEECH COM, P1580 Lawson A.D., 2009, ANN C INT SPEECH COM, P2899 Lee K.A., 2013, ANN C INT SPEECH COM, P3651 Lee Kong Aik, 2013, SLTC NEWSLETTER Lei Y., 2009, ANN C INT SPEECH COM, P2371 Li HZ, 2013, P IEEE, V101, P1136, DOI 10.1109/JPROC.2012.2237151 Li Q, 2002, IEEE T SPEECH AUDI P, V10, P146 Luan J., 2006, OD SPEAK LANG REC WO, P1 Mandasari M.I., 2011, ANN C INT SPEECH COM, P21 Marcel S, 2010, LECT NOTES COMPUT SC, V6388, P210, DOI 10.1007/978-3-642-17711-8_22 Martin A.F., 2010, ANN C INT SPEECH COM, P2726 Martin A.F., 2009, ANN C INT SPEECH COM, P2579 Martinez D., 2011, ANN C INT SPEECH COM, P861 Mason J.S., 1996, TECHNICAL REPORT Matsui T., 1993, IEEE INT C AC SPEECH, V2, P391, DOI 10.1109/ICASSP.1993.319321 Meng H., 2006, INT WORKSH MULT US A, P1 Messer K., 
1999, 2 INT C AUD VID BAS, V964, P965 MISTRETTA W, 1998, ACOUST SPEECH SIG PR, P113 Nakano S., 2004, International Astronomical Union Circular Nosratighods M, 2010, SPEECH COMMUN, V52, P753, DOI 10.1016/j.specom.2010.04.007 Ortega-Garcia J, 2010, IEEE T PATTERN ANAL, V32, P1097, DOI 10.1109/TPAMI.2009.76 Ortega-Garcia J, 2000, SPEECH COMMUN, V31, P255, DOI 10.1016/S0167-6393(99)00081-3 Pigeon Stephane, 1997, AUDIO VIDEO BASED BI Prazak J., 2011, INT C INT DAT ACQ AD, P347 Prince S., 2007, IEEE 11 INT C COMP V, P1 Przybocki M.A., 2006, OD SPEAK LANG REC 2, P1 Ramasubramanian V., 2006, IEEE VLSI DESIGN, P1 Reynolds DA, 2000, DIGIT SIGNAL PROCESS, V10, P19, DOI 10.1006/dspr.1999.0361 ROSENBERG AE, 1991, INT CONF ACOUST SPEE, P381, DOI 10.1109/ICASSP.1991.150356 Rosenberg AE, 2000, SPEECH COMMUN, V31, P131, DOI 10.1016/S0167-6393(99)00074-6 Schmidt M, 1996, INT CONF ACOUST SPEE, P105, DOI 10.1109/ICASSP.1996.540301 Senoussaoui M., 2011, ANN C INT SPEECH COM, P25 Silovsky J, 2011, 12TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2011 (INTERSPEECH 2011), VOLS 1-5, P2920 Stafylakis T., 2013, ANN C INT SPEECH COM, P3684 Steininger Silke, 2002, LREC WORKSH MULT RES Stolcke A, 2012, 2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), P4397 Sturim DE, 2002, INT CONF ACOUST SPEE, P677 Subramanya A., 2007, IEEE INT C AC SPEECH, P4 Toledano D.T., 2008, LREC Toledo-Ronen O., 2011, ANN C INT SPEECH COM, P9 van Leeuwen D.A., 2013, ANN C INT SPEECH COM, P1619 Vogt R, 2008, COMPUT SPEECH LANG, V22, P17, DOI 10.1016/j.csl.2007.05.003 Vogt R.J., 2009, ANN C INT SPEECH COM, P1563 Vogt R.J., 2008, OD SPEAK LANG REC WO, P1 Wagner M., 2006, OD SPEAK LANG REC WO, P1 Wong YW, 2011, PATTERN RECOGN LETT, V32, P1503, DOI 10.1016/j.patrec.2011.06.011 Woo R.H., 2006, OD SPEAK LANG REC WO Woo S.C., 2000, P IEEE REG 10 INT C Wu D., 2008, SPEECH RECOGNITION T Xu J., 2011, ANN C INT SPEECH COM Yegnanarayana B, 2005, IEEE T SPEECH AUDI P, V13, P575, DOI 10.1109/TSA.2005.848892 Yoma NB, 2002, SPEECH COMMUN, V38, P77, DOI 10.1016/S0167-6393(01)00044-9 You CH, 2010, IEEE T AUDIO SPEECH, V18, P1300, DOI 10.1109/TASL.2009.2032950 Young S.J., 2008, SPRINGER HDB SPEECH Young S. J., 1992, ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech and Signal Processing (Cat. No.92CH3103-9), DOI 10.1109/ICASSP.1992.225844 Zheng T.F., 2005, ORIENTAL COCOSDA NR 136 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2014 VL 60 BP 56 EP 77 DI 10.1016/j.specom.2014.03.001 PG 22 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA AH0JI UT WOS:000335804800005 ER PT J AU Ablimit, M Kawahara, T Hamdulla, A AF Ablimit, Mijit Kawahara, Tatsuya Hamdulla, Askar TI Lexicon optimization based on discriminative learning for automatic speech recognition of agglutinative language SO SPEECH COMMUNICATION LA English DT Article DE Speech recognition; Language model; Lexicon; Morpheme; Discriminative learning; Uyghur ID UNITS AB For automatic speech recognition (ASR) of agglutinative languages, selection of a lexical unit is not obvious. The morpheme unit is usually adopted to ensure sufficient coverage, but many morphemes are short, resulting in weak constraints and possible confusion. 
We propose a discriminative approach for lexicon optimization that directly contributes to ASR error reduction by taking into account not only linguistic constraints but also acoustic phonetic confusability. It is based on an evaluation function for each word defined by a set of features and their weights, which are optimized by the difference in word error rates (WERs) between ASR hypotheses obtained by the morpheme-based model and those by the word-based model. Then, word or sub-word entries with higher evaluation scores are selected to be added to the lexicon. We investigate several discriminative models to realize this approach. Specifically, we implement it with support vector machines (SVM), logistic regression (LR) model as well as the simple perceptron algorithm. This approach was successfully applied to an Uyghur large-vocabulary continuous speech recognition system, resulting in a significant reduction of WER with a modest lexicon size and a small out-of-vocabulary rate. The use of SVM for a sub-word lexicon results in the best performance, outperforming the word-based model as well as conventional statistical concatenation approaches. The proposed learning approach is realized in an unsupervised manner because it does not require correct transcription for training data. (C) 2013 Elsevier B.V. All rights reserved. C1 [Ablimit, Mijit; Kawahara, Tatsuya] Kyoto Univ, Sch Informat, Kyoto, Japan. [Hamdulla, Askar] Xinjiang Univ, Inst Informat Engn, Urumqi, Peoples R China. RP Kawahara, T (reprint author), Sakyo Ku, Kyoto 6068501, Japan. EM kawahara@i.kyoto-u.ac.jp FU JSPS; National Natural Science Foundation of China (NSFC) [61163032] FX This work is supported by JSPS Grant-in-Aid-for Scientific Research (KAKENHI) and National Natural Science Foundation of China (NSFC; grant 61163032). 
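A minimal sketch of the kind of discriminative entry scoring the Ablimit, Kawahara and Hamdulla abstract above describes, using the simple perceptron variant they mention: each candidate sub-word entry carries a feature vector, the label encodes whether hypotheses containing that entry lowered the WER relative to the word-based model, and the highest-scoring entries are added to the lexicon. The feature encoding, labels and selection threshold are assumptions for illustration, not the paper's actual feature set or training recipe.

```python
import numpy as np

def train_entry_scorer(features, labels, n_epochs=20, lr=0.1):
    """Perceptron over candidate lexicon entries.
    features: {entry: np.ndarray feature vector}
    labels:   {entry: +1 if adding the entry reduced WER, else -1}  (assumed labeling)
    """
    w = np.zeros(len(next(iter(features.values()))))
    for _ in range(n_epochs):
        for entry, x in features.items():
            y = labels[entry]
            if y * np.dot(w, x) <= 0:      # misclassified -> perceptron update
                w += lr * y * x
    return w

def select_entries(features, w, top_k=1000):
    """Return the top_k entries by evaluation score w.x for addition to the lexicon."""
    return sorted(features, key=lambda e: float(np.dot(w, features[e])), reverse=True)[:top_k]
```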
CR Ablimit M., 2010, P ICSP BEIJ Ablimit M., 2012, P IEEE ICASSP Afify M., 2006, P INTERSPEECH Arisoy E, 2009, IEEE T AUDIO SPEECH, V17, P874, DOI 10.1109/TASL.2008.2012313 Arisoy E, 2012, IEEE T AUDIO SPEECH, V20, P540, DOI 10.1109/TASL.2011.2162323 Berton A., 1996, P ICSLP Carki K., 2000, P IEEE ICASSP Collins M., 2002, P EMNLP Collins M., 2005, P ACL, P507, DOI 10.3115/1219840.1219903 Creutz M., 2006, THESIS HELSINKI U TE Creutz M., 2007, ACM T SPEECH LANG PR, V5, P1, DOI 10.1145/1322391.1322394 DELIGNE S, 1995, INT CONF ACOUST SPEE, P169, DOI 10.1109/ICASSP.1995.479391 El-Desoky A., 2009, P INTERSPEECH Fan RE, 2008, J MACH LEARN RES, V9, P1871 Geutner P., 1995, P IEEE ICASSP Goldsmith J., 2001, COMPUT LINGUIST, V2, P78 Hacioglu K., 2003, P EUR Ircing P., 2001, P EUR Jeff Kuo H.-K., 1999, P EUR Jongtaveesataporn M., 2009, SPEECH COMMUN, V2009, P379 Kawahara T., 2000, P INT C SPOK LANG PR, V4, P476 Kiecza D., 1999, P ICSP SEOUL KWON OW, 2000, ACOUST SPEECH SIG PR, P1567 Kwon OW, 2003, SPEECH COMMUN, V39, P287, DOI 10.1016/S0167-6393(02)00031-6 Larson M., 2000, P INTERSPEECH Lee A., 2001, EUROSPEECH, P1691 Masataki H, 1996, INT CONF ACOUST SPEE, P188, DOI 10.1109/ICASSP.1996.540322 Mihajlik P., 2007, P INT 2007, P1497 Nussbaum-Thom M., 2011, P INT Pellegrini T., 2007, P INTERSPEECH Pellegrini T, 2009, IEEE T AUDIO SPEECH, V17, P863, DOI 10.1109/TASL.2009.2022295 Puurula A., 2007, P ACL Roark B, 2007, COMPUT SPEECH LANG, V21, P373, DOI 10.1016/j.csl.2006.06.006 Sak H, 2012, IEEE T AUDIO SPEECH, V20, P2341, DOI 10.1109/TASL.2012.2201477 Saon G, 2001, IEEE T SPEECH AUDI P, V9, P327, DOI 10.1109/89.917678 Sarikaya R, 2008, IEEE T AUDIO SPEECH, V16, P1330, DOI 10.1109/TASL.2008.924591 Shinozaki T., 2002, P ICSLP, P717 Whittaker E., 2003, P ISCSLP BEIJ Xiang B., 2006, IEEE ICASSP NR 39 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2014 VL 60 BP 78 EP 87 DI 10.1016/j.specom.2013.09.011 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA AH0JI UT WOS:000335804800006 ER PT J AU Amano-Kusumoto, A Hosom, JP Kain, A Aronoff, JM AF Amano-Kusumoto, Akiko Hosom, John-Paul Kain, Alexander Aronoff, Justin M. TI Determining the relevance of different aspects of formant contours to intelligibility SO SPEECH COMMUNICATION LA English DT Article DE Speech intelligibility; Vowel perception; Speech synthesis ID CONVERSATIONAL SPEECH; CLEAR SPEECH; VOWEL INTELLIGIBILITY; NORMAL-HEARING; PERCEPTION; TRANSITION; LISTENERS; CHILDREN; HARD AB Previous studies have shown that "clear" speech, where the speaker intentionally tries to enunciate, has better intelligibility than "conversational" speech, which is produced in regular conversation. However, conversational and clear speech vary along a number of acoustic dimensions and it is unclear what aspects of clear speech lead to better intelligibility. Previously, Kain et al. (2008) showed that a combination of short-term spectra and duration was responsible for the improved intelligibility of one speaker. This study investigates subsets of specific features of short-term spectra including temporal aspects. Similar to Kain's study, hybrid stimuli were synthesized with a combination of features from clear speech and complementary features from conversational speech to determine which acoustic features cause the improved intelligibility of clear speech. 
Our results indicate that, although steady-state formant values of tense vowels contributed to the intelligibility of clear speech, neither the steady-state portion nor the formant transition was sufficient to yield comparable intelligibility to that of clear speech. In contrast, when the entire formant contour of conversational speech including the phoneme duration was replaced by that of clear speech, intelligibility was comparable to that of clear speech. It indicated that the combination of formant contour and duration information was relevant to the improved intelligibility of clear speech. The study provides a better understanding of the relevance of different aspects of formant contours to the improved intelligibility of clear speech. (C) 2013 Elsevier B.V. All rights reserved. C1 [Amano-Kusumoto, Akiko; Aronoff, Justin M.] House Res Inst, Dept Human Commun Sci Devices, Los Angeles, CA 90057 USA. [Amano-Kusumoto, Akiko; Hosom, John-Paul; Kain, Alexander] Oregon Hlth & Sci Univ, Ctr Spoken Language Understanding CSLU, Beaverton, OR 97006 USA. RP Amano-Kusumoto, A (reprint author), House Res Inst, Dept Human Commun Sci Devices, 2100 West Third St, Los Angeles, CA 90057 USA. EM akiko.amano@gmail.com FU NSF [0826654]; NIH [T32DC009975] FX This work was supported in part by NSF grant 0826654 and NIH grant T32DC009975. CR Amano-Kusumoto A., 2011, 11002 CSLU, P1 Amano-Kusumoto A, 2009, INT CONF ACOUST SPEE, P4677, DOI 10.1109/ICASSP.2009.4960674 [Anonymous], 1996, EL SOUND LEV MET Beskow J., 2000, P ICSLP 2000 BEIJ CH, V4, P464 Bradlow AR, 2003, J SPEECH LANG HEAR R, V46, P80, DOI 10.1044/1092-4388(2003/007) Ferguson SH, 2004, J ACOUST SOC AM, V116, P2365, DOI 10.1121/1.1788730 Ferguson SH, 2002, J ACOUST SOC AM, V112, P259, DOI 10.1121/1.1482078 FURUI S, 1986, J ACOUST SOC AM, V80, P1016, DOI 10.1121/1.393842 Hazan V, 2004, J ACOUST SOC AM, V116, P3108, DOI 10.1121/1.1806826 Helfer K S, 1998, J Am Acad Audiol, V9, P234 Hillenbrand JM, 1999, J ACOUST SOC AM, V105, P3509, DOI 10.1121/1.424676 Hosom JP, 2009, SPEECH COMMUN, V51, P352, DOI 10.1016/j.specom.2008.11.003 Kain A, 2008, J ACOUST SOC AM, V124, P2308, DOI 10.1121/1.2967844 Kain AB, 2007, SPEECH COMMUN, V49, P743, DOI 10.1016/j.specom.2007.05.001 Krause JC, 2004, J ACOUST SOC AM, V115, P362, DOI 10.1121/1.1635842 Krause JC, 2002, J ACOUST SOC AM, V112, P2165, DOI 10.1121/1.1509432 Kusumoto A., 2007, P INTERSPEECH, P370 Liu S, 2004, J ACOUST SOC AM, V116, P2374, DOI 10.1121/1.1787528 MOON SJ, 1994, J ACOUST SOC AM, V96, P40, DOI 10.1121/1.410492 Perkell JS, 2002, J ACOUST SOC AM, V112, P1627, DOI 10.1121/1.1506369 PICHENY MA, 1985, J SPEECH HEAR RES, V28, P96 PICHENY MA, 1986, J SPEECH HEAR RES, V29, P434 WINGFIELD A, 1985, J GERONTOL, V40, P579 NR 23 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD APR PY 2014 VL 59 BP 1 EP 9 DI 10.1016/j.specom.2013.12.001 PG 9 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA AD8DI UT WOS:000333496300001 ER PT J AU Kane, J Aylett, M Yanushevskaya, I Gobl, C AF Kane, John Aylett, Matthew Yanushevskaya, Irena Gobl, Christer TI Phonetic feature extraction for context-sensitive glottal source processing SO SPEECH COMMUNICATION LA English DT Article DE Voice quality; Phonation type; Glottal source; Expressive speech; Speech synthesis ID SPEAKER RECOGNITION; SPEECH RECOGNITION; NEURAL-NETWORKS; FLOW AB The effectiveness of glottal source analysis is known to be dependent on the phonetic properties of its concomitant supraglottal features. Phonetic classes like nasals and fricatives are particularly problematic. Their acoustic characteristics, including zeros in the vocal tract spectrum and aperiodic noise, can have a negative effect on glottal inverse filtering, a necessary pre-requisite to glottal source analysis. In this paper, we first describe and evaluate a set of binary feature extractors, for phonetic classes with relevance for glottal source analysis. As voice quality classification is typically achieved using feature data derived by glottal source analysis, we then investigate the effect of removing data from certain detected phonetic regions on the classification accuracy. For the phonetic feature extraction, classification algorithms based on Artificial Neural Networks (ANNs), Gaussian Mixture Models (GMMs) and Support Vector Machines (SVMs) are compared. Experiments demonstrate that the discriminative classifiers (i.e. ANNs and SVMs) in general give better results compared with the generative learning algorithm (i.e. GMMs). This accuracy generally decreases according to the sparseness of the feature (e.g., accuracy is lower for nasals compared to syllabic regions). We find best classification of voice quality when just using glottal source parameter data derived within detected syllabic regions. (C) 2013 Elsevier B.V. All rights reserved. C1 [Kane, John; Yanushevskaya, Irena; Gobl, Christer] Trinity Coll Dublin, Sch Linguist Speech & Commun Sci, Phonet & Speech Lab, Dublin, Ireland. [Aylett, Matthew] Univ Edinburgh, Sch Informat, Edinburgh EH8 9YL, Midlothian, Scotland. [Aylett, Matthew] CereProc Ltd, Edinburgh, Midlothian, Scotland. RP Kane, J (reprint author), Trinity Coll Dublin, Sch Linguist Speech & Commun Sci, Phonet & Speech Lab, Dublin, Ireland. EM kanejo@tcd.ie; matthewa@cereproc.com; yanushei@tcd.ie; cegobl@tcd.ie FU Science Foundation Ireland [09/IN.1/I2631] FX The first, third and fourth authors are supported by the Science Foundation Ireland Grant 09/IN.1/I2631 (FASTNET). CR Airas M., 2007, P INT 2007, P1410 Ali AMA, 1999, ISCAS '99: PROCEEDINGS OF THE 1999 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS, VOL 3, P118 ALKU P, 1992, SPEECH COMMUN, V11, P109, DOI 10.1016/0167-6393(92)90005-R Alku P, 2002, J ACOUST SOC AM, V112, P701, DOI 10.1121/1.1490365 Alku P, 1997, SPEECH COMMUN, V22, P67, DOI 10.1016/S0167-6393(97)00020-4 Alku P, 2011, SADHANA-ACAD P ENG S, V36, P623, DOI 10.1007/s12046-011-0041-5 Alku P, 2013, J ACOUST SOC AM, V134, P1295, DOI 10.1121/1.4812756 Alku P., 1994, P INT C SPOK LANG PR, P1619 Aylett M.P., 2007, ARTIFICIAL INTELLIGE Bishop C. 
M., 2006, PATTERN RECOGNITION Campbell N., 2003, P 15 INT C PHON SCI, P2417 Chan WN, 2007, IEEE T AUDIO SPEECH, V15, P1884, DOI 10.1109/TASL.2007.900103 Chomsky N., 1968, SOUND PATTERN ENGLIS Cullen A., 2013, P WASSS GREN FRANC Drugman T., 2011, COMPUTER SPEECH LANG, V26, P20 Fant G, 1987, SPEECH TRANSMISSION, V1, P13 Fant G., 1970, ACOUSTIC THEORY SPEE Fant G., 1985, KTH SPEECH TRANSMISS, P21 Fant Gunnar, 1985, STL QPSR, V4, P1 Gobl C, 2013, J VOICE, V27, P155, DOI 10.1016/j.jvoice.2012.09.004 Hacki T., 1989, FOLIA PHONIATR, P43 Hanson HM, 1997, J ACOUST SOC AM, V101, P466, DOI 10.1121/1.417991 HORNIK K, 1991, NEURAL NETWORKS, V4, P251, DOI 10.1016/0893-6080(91)90009-T Iliev I., 2010, COMPUT SPEECH LANG, V24, P445 Kane J, 2013, SPEECH COMMUN, V55, P295, DOI [10.1016/j.specom.2012.08.011, 10.1016/j.specom.2012.08.01] Kane J., 2013, P NOLISP MONS BELG, P1 Kane J, 2013, IEEE T AUDIO SPEECH, V21, P1170, DOI 10.1109/TASL.2013.2245653 Kane J, 2013, SPEECH COMMUN, V55, P397, DOI 10.1016/j.specom.2012.12.004 Kane J., 2013, P INT LYON FRANC Kane J., 2013, P ICASSP VANC CAN Kanokphara S., 2006, P 19 INT C IND ENG O King S, 2000, COMPUT SPEECH LANG, V14, P333, DOI 10.1006/csla.2000.0148 Kominek J., 2004, ISCA SPEECH SYNTH WO, P223 Launay B, 2002, INT CONF ACOUST SPEE, P817 Lin Q., 1987, KTH SPEECH TRANSMISS, V28, P1 Lugger M, 2008, INT CONF ACOUST SPEE, P4945, DOI 10.1109/ICASSP.2008.4518767 Mokhtari P, 2003, IEICE T INF SYST, VE86D, P574 Mokhtari P., 2002, P LANG RES EV LREC Murty KR, 2006, IEEE SIGNAL PROC LET, V13, P52, DOI 10.1109/LSP.2005.860538 Raitio T, 2014, COMPUT SPEECH LANG, V28, P648, DOI 10.1016/j.csl.2013.03.003 Raitio T, 2011, IEEE T AUDIO SPEECH, V19, P153, DOI 10.1109/TASL.2010.2045239 Richmond K., 2007, P BLIZZ CHALL WORKSH Siniscalchi SM, 2009, SPEECH COMMUN, V51, P1139, DOI 10.1016/j.specom.2009.05.004 Siniscalchi SM, 2013, NEUROCOMPUTING, V106, P148, DOI 10.1016/j.neucom.2012.11.008 Sturmel N., 2006, 7 C ADV QUANT LAR GR Szekely E, 2012, 2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), P4593 Tarek A., 2003, P NONL SPEECH PROC W Teager H. M, 1990, INT C AC SPEECH SIGN, V4, P241 Walker J, 2007, LECT NOTES COMPUT SC, V4391, P1 Young Steve J., 2007, HTK BOOK VERSION 3 4 Yu D, 2012, 2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), P4169 Zheng NH, 2007, IEEE SIGNAL PROC LET, V14, P181, DOI 10.1109/LSP.2006.884031 NR 52 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2014 VL 59 BP 10 EP 21 DI 10.1016/j.specom.2013.12.003 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA AD8DI UT WOS:000333496300002 ER PT J AU Liang, S Liu, WJ Jiang, W Xue, W AF Liang, Shan Liu, WenJu Jiang, Wei Xue, Wei TI The analysis of the simplification from the ideal ratio to binary mask in signal-to-noise ratio sense SO SPEECH COMMUNICATION LA English DT Article DE Ideal binary mask; Ideal ratio mask; W-Disjoint Orthogonality ID AUTOMATIC SPEECH RECOGNITION; SEGREGATION; SEPARATION; INTELLIGIBILITY; ALGORITHM AB For speech separation systems, the ideal binary mask (IBM) can be viewed as a simplified goal of the ideal ratio mask (IRM) which is derived from Wiener filter. The available research usually verify the rationality of this simplification from the aspect of speech intelligibility. 
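For readers unfamiliar with the two masks named in the Liang et al. abstract that begins above, the usual textbook definitions are sketched below in standard Wiener-style notation; the paper's own derivation and its choice of local criterion LC may differ in detail.

```latex
% Ideal ratio mask (Wiener-filter form) and ideal binary mask for time-frequency unit (t,f):
% |S|^2 is the clean-speech power, |N|^2 the noise power, LC a local SNR criterion in dB.
\mathrm{IRM}(t,f) = \frac{|S(t,f)|^{2}}{|S(t,f)|^{2} + |N(t,f)|^{2}}, \qquad
\mathrm{IBM}(t,f) =
\begin{cases}
1, & 10\log_{10}\dfrac{|S(t,f)|^{2}}{|N(t,f)|^{2}} > \mathrm{LC},\\[4pt]
0, & \text{otherwise.}
\end{cases}
```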
However, the difference between the two masks has not been addressed rigorously in the signal-to-noise ratio (SNR) sense. In this paper, we analytically investigate the difference between the two ideal masks under the assumption of the approximate W-Disjoint Orthogonality (AWDO) which almost holds under many kinds of interference due to the sparse nature of speech. From the analysis, one theoretical upper bound of the difference is obtained under the AWDO assumption. Some other interesting discoveries include a new ratio mask which achieves higher SNR gains than the IRM and the essential relation between the AWDO degree and the SNR gain of the IRM. (C) 2013 Elsevier B.V. All rights reserved. C1 [Liang, Shan; Liu, WenJu; Jiang, Wei; Xue, Wei] Chinese Acad Sci, Inst Automat, Natl Lab Pattern Recognit NLPR, Beijing, Peoples R China. RP Liu, WJ (reprint author), Chinese Acad Sci, Inst Automat, Natl Lab Pattern Recognit NLPR, Beijing, Peoples R China. EM sliang@nlpr.ia.ac.cn; lwj@nlpr.ia.ac.cn; wjiang@nlpr.ia.ac.cn; wxue@nlpr.ia.ac.cn FU China National Nature Science Foundation [91120303, 61273267, 90820011] FX This research was supported in part by the China National Nature Science Foundation (No. 91120303, No. 61273267 and No. 90820011). CR Barker J., 2000, P ICSLP BEIJ CHIN, P373 Bregman S., 1990, AUDITORY SCENE ANAL BROWN GJ, 1994, COMPUT SPEECH LANG, V8, P297, DOI 10.1006/csla.1994.1016 Brown G.J., 1993, J ACOUST SOC AM, V94, P2454, DOI 10.1121/1.407441 Brungart DS, 2006, J ACOUST SOC AM, V120, P4007, DOI 10.1121/1.2363929 Cooke M, 2006, J ACOUST SOC AM, V120, P2421, DOI 10.1121/1.2229005 Cooke M, 2001, SPEECH COMMUN, V34, P267, DOI 10.1016/S0167-6393(00)00034-0 Cooke M.P., 1993, MODELING AUDITORY PR Ellis D.P.W., 1996, THESIS Ellis D.P.W., 2006, COMPUTATIONAL AUDITO, P15 Ellis D.P.W., 1995, WORKSH COMP AUD SCEN, P111 Han K, 2012, J ACOUST SOC AM, V132, P3475, DOI 10.1121/1.4754541 Hu GN, 2010, IEEE T AUDIO SPEECH, V18, P2067, DOI 10.1109/TASL.2010.2041110 Hu GN, 2004, IEEE T NEURAL NETWOR, V15, P1135, DOI 10.1109/TNN.2004.832812 Hu G.N., 2001, IEEE WORKSH APPL SIG, P79 Hu Y, 2004, IEEE T SPEECH AUDI P, V12, P59, DOI 10.1109/TSA.2003.819949 Kim G, 2009, J ACOUST SOC AM, V126, P1486, DOI 10.1121/1.3184603 KRYTER KD, 1962, J ACOUST SOC AM, V34, P1689, DOI 10.1121/1.1909094 Lehmann EA, 2008, J ACOUST SOC AM, V124, P269, DOI 10.1121/1.2936367 Li N, 2008, J ACOUST SOC AM, V123, P1673, DOI 10.1121/1.2832617 Li Y.P., 2009, SPEECH COMMUN, V51, P1486 Liang S, 2012, IEEE SIGNAL PROC LET, V19, P627, DOI 10.1109/LSP.2012.2209643 Liang S, 2013, IEEE T AUDIO SPEECH, V21, P476, DOI 10.1109/TASL.2012.2226156 Loizou PC, 2011, IEEE T AUDIO SPEECH, V19, P47, DOI 10.1109/TASL.2010.2045180 Ma WY, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 1-2, P402 Mallat S., 1998, WAVELET TOUR SIGNAL Martin R, 2001, IEEE T SPEECH AUDI P, V9, P504, DOI 10.1109/89.928915 Melia T., 2007, THESIS U COLL DUBLIN Patterson R. 
D., 1988, 2341 MRC APPL PSYCH Peharz R, 2012, 2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), P249 Raj B, 2004, SPEECH COMMUN, V43, P275, DOI 10.1016/j.specom.2004.03.007 Rangachari S, 2006, SPEECH COMMUN, V48, P220, DOI 10.1016/j.specom.2005.08.005 Roman N, 2003, J ACOUST SOC AM, V114, P2236, DOI 10.1121/1.1610463 Sawada H, 2011, IEEE T AUDIO SPEECH, V19, P516, DOI 10.1109/TASL.2010.2051355 Seltzer ML, 2004, SPEECH COMMUN, V43, P379, DOI 10.1016/j.specom.2004.03.006 Srinivasan S, 2006, SPEECH COMMUN, V48, P1486, DOI 10.1016/j.specom.2006.09.003 Wang D. L., 2006, COMPUTATIONAL AUDITO, P1 Wang DLL, 1999, IEEE T NEURAL NETWOR, V10, P684, DOI 10.1109/72.761727 Weintraub M., 1985, THESIS STANFORD U Wiener N., 1949, EXTRAPOLATION INTERP Yilmaz O, 2004, IEEE T SIGNAL PROCES, V52, P1830, DOI [10.1109/TSP.2004.828896, 10.1109/TSP.2004.V8896] NR 41 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2014 VL 59 BP 22 EP 30 DI 10.1016/j.specom.2013.12.002 PG 9 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA AD8DI UT WOS:000333496300003 ER PT J AU van der Zande, P Jesse, A Cutler, A AF van der Zande, Patrick Jesse, Alexandra Cutler, Anne TI Hearing words helps seeing words: A cross-modal word repetition effect SO SPEECH COMMUNICATION LA English DT Article DE Speech perception; Audiovisual speech; Word repetition priming; Cross-modal priming ID SPEECH-PERCEPTION; SPOKEN WORDS; RECOGNITION MEMORY; CONSONANT RECOGNITION; TALKER VARIABILITY; VISUAL PROSODY; VOICE; INTELLIGIBILITY; IDENTIFICATION; REPRESENTATION AB Watching a speaker say words benefits subsequent auditory recognition of the same words. In this study, we tested whether hearing words also facilitates subsequent phonological processing from visual speech, and if so, whether speaker repetition influences the magnitude of this word repetition priming. We used long-term cross-modal repetition priming as a means to investigate the underlying lexical representations involved in listening to and seeing speech. In Experiment 1, listeners identified auditory-only words during exposure and visual-only words at test. Words at test were repeated or new and produced by the exposure speaker or a novel speaker. Results showed a significant effect of cross-modal word repetition priming but this was unaffected by speaker changes. Experiment 2 added an explicit recognition task at test. Listeners' lipreading performance was again improved by prior exposure to auditory words. Explicit recognition memory was poor, and neither word repetition nor speaker repetition improved it. This suggests that cross-modal repetition priming is neither mediated by explicit memory nor improved by speaker information. Our results suggest that phonological representations in the lexicon are shared across auditory and visual processing, and that speaker information is not transferred across modalities at the lexical level. (C) 2014 Elsevier B.V. All rights reserved. C1 [van der Zande, Patrick; Cutler, Anne] Max Planck Inst Psycholinguist, NL-6500 AH Nijmegen, Netherlands. [Jesse, Alexandra] Univ Massachusetts, Dept Psychol, Amherst, MA 01003 USA. [Cutler, Anne] Univ Western Sydney, MARCS Inst, Penrith, NSW 2751, Australia. RP van der Zande, P (reprint author), Max Planck Inst Psycholinguist, POB 310, NL-6500 AH Nijmegen, Netherlands. 
EM P.Zande@gmail.com; AJesse@psych.umass.edu; A.Cutler@uws.edu.au RI Cutler, Anne/C-9467-2012 CR Arnold P, 2001, BRIT J PSYCHOL, V92, P339, DOI 10.1348/000712601162220 Baayen R. H., 1993, CELEX LEXICAL DATABA Bates D., 2007, LME4 LINEAR MIXED EF BOND ZS, 1994, SPEECH COMMUN, V14, P325, DOI 10.1016/0167-6393(94)90026-4 Bradlow AR, 1999, PERCEPT PSYCHOPHYS, V61, P206, DOI 10.3758/BF03206883 Buchwald AB, 2009, LANG COGNITIVE PROC, V24, P580, DOI 10.1080/01690960802536357 CRAIK FIM, 1974, Q J EXP PSYCHOL, V26, P274, DOI 10.1080/14640747408400413 CREELMAN CD, 1957, J ACOUST SOC AM, V29, P655, DOI 10.1121/1.1909003 Cvejic E, 2012, COGNITION, V122, P442, DOI 10.1016/j.cognition.2011.11.013 Dodd B., 1989, VISIBLE LANG, V22, P58 Dohen M, 2004, SPEECH COMMUN, V44, P155, DOI 10.1016/j.specom.2004.10.009 Ellis A., 1982, CURR PSYCHOL RES REV, V2, P123 Ferguson SH, 2004, J ACOUST SOC AM, V116, P2365, DOI 10.1121/1.1788730 Foulkes P, 2006, J PHONETICS, V34, P409, DOI 10.1016/j.wocn.2005.08.002 Fowler CA, 2003, J MEM LANG, V49, P396, DOI 10.1016/S0749-596X(03)00072-X Gagne J. P., 1994, J ACAD REHABIL AUDIO, V27, P135 Goldinger SD, 1996, J EXP PSYCHOL LEARN, V22, P1166, DOI 10.1037/0278-7393.22.5.1166 Grant KW, 1998, J ACOUST SOC AM, V103, P2677, DOI 10.1121/1.422788 Irwin JR, 2006, PERCEPT PSYCHOPHYS, V68, P582, DOI 10.3758/BF03208760 JACKSON A, 1984, MEM COGNITION, V12, P568, DOI 10.3758/BF03213345 Jesse A, 2014, Q J EXP PSYCHOL, V67, P793, DOI 10.1080/17470218.2013.834371 Jesse A, 2011, PSYCHON B REV, V18, P943, DOI 10.3758/s13423-011-0129-2 Jesse A, 2010, ATTEN PERCEPT PSYCHO, V72, P209, DOI 10.3758/APP.72.1.209 Kamachi M, 2003, CURR BIOL, V13, P1709, DOI 10.1016/j.cub.2003.09.005 Kim J, 2004, COGNITION, V93, pB39, DOI 10.1016/j.cognition.2003.11.003 Krahmer E, 2004, HUM COM INT, V7, P191 KRICOS PB, 1982, VOLTA REV, V84, P219 Lachs L, 2004, J EXP PSYCHOL HUMAN, V30, P378, DOI 10.1037/0096-1523.30.2.378 LADEFOGED P, 1980, LANGUAGE, V56, P485, DOI 10.2307/414446 Laver J, 1979, SOCIAL MARKERS SPEEC, P1 LIBERMAN AM, 1985, COGNITION, V21, P1, DOI 10.1016/0010-0277(85)90021-6 Luce PA, 1998, MEM COGNITION, V26, P708, DOI 10.3758/BF03211391 MACLEOD A, 1987, British Journal of Audiology, V21, P131, DOI 10.3109/03005368709077786 MCGURK H, 1976, NATURE, V264, P746, DOI 10.1038/264746a0 McLennan CT, 2003, J EXP PSYCHOL LEARN, V29, P539, DOI 10.1037/0278-7393.29.4.539 McQueen JM, 2006, COGNITIVE SCI, V30, P1113, DOI 10.1207/s15516709cog0000_79 MULLENNIX JW, 1989, J ACOUST SOC AM, V85, P365, DOI 10.1121/1.397688 Munhall KG, 2004, PSYCHOL SCI, V15, P133, DOI 10.1111/j.0963-7214.2004.01502010.x Norris D, 2006, COGNITIVE PSYCHOL, V53, P146, DOI 10.1016/j.cogpsych.2006.03.001 NYGAARD LC, 1994, PSYCHOL SCI, V5, P42, DOI 10.1111/j.1467-9280.1994.tb00612.x Nygaard LC, 1998, PERCEPT PSYCHOPHYS, V60, P355, DOI 10.3758/BF03206860 PALMERI TJ, 1993, J EXP PSYCHOL LEARN, V19, P309, DOI 10.1037/0278-7393.19.2.309 R Development Core Team, 2007, R LANG ENV STAT COMP Reisberg D., 1987, HEARING EYE PSYCHOL, P97 Rosenblum LD, 2007, PSYCHOL SCI, V18, P392, DOI 10.1111/j.1467-9280.2007.01911.x Rosenblum LD, 2008, CURR DIR PSYCHOL SCI, V17, P405, DOI 10.1111/j.1467-8721.2008.00615.x SCHACTER DL, 1992, J EXP PSYCHOL LEARN, V18, P915, DOI 10.1037/0278-7393.18.5.915 Sheffert SM, 1998, MEM COGNITION, V26, P591, DOI 10.3758/BF03201165 SHEFFERT SM, 1995, J MEM LANG, V34, P665, DOI 10.1006/jmla.1995.1030 Strelnikov K, 2009, NEUROPSYCHOLOGIA, V47, P972, DOI 10.1016/j.neuropsychologia.2008.10.017 SUMBY WH, 1954, J ACOUST SOC AM, V26, P212, DOI 
10.1121/1.1907309 Summerfield Q, 1987, HEARING EYE PSYCHOL, P3 Sumner M, 2005, J MEM LANG, V52, P322, DOI 10.1016/j.jml.2004.11.004 van der Zande P, 2013, J ACOUST SOC AM, V134, P562, DOI 10.1121/1.4807814 VANSON N, 1994, J ACOUST SOC AM, V96, P1341, DOI 10.1121/1.411324 WALDEN BE, 1974, J SPEECH HEAR RES, V17, P270 Yakel DA, 2000, PERCEPT PSYCHOPHYS, V62, P1405, DOI 10.3758/BF03212142 Yehia H, 1998, SPEECH COMMUN, V26, P23, DOI 10.1016/S0167-6393(98)00048-X NR 58 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2014 VL 59 BP 31 EP 43 DI 10.1016/j.specom.2014.01.001 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA AD8DI UT WOS:000333496300004 ER PT J AU Clapham, R Middag, C Hilgers, F Martens, JP van den Brekel, M van Son, R AF Clapham, Renee Middag, Catherine Hilgers, Frans Martens, Jean-Pierre van den Brekel, Michiel van Son, Rob TI Developing automatic articulation, phonation and accent assessment techniques for speakers treated for advanced head and neck cancer SO SPEECH COMMUNICATION LA English DT Article DE Automatic evaluation; Head and neck cancer; Perceptual evaluation; Phonemic features; Phonological features; AMPEX ID OROPHARYNGEAL CANCER; SUBSTITUTION VOICES; SPEECH; INTELLIGIBILITY; QUALITY; DISORDERS; FEATURES; OUTCOMES AB Purpose: To develop automatic models for assessing the articulation, phonation and accent of speakers with head and neck cancer (Experiment 1) and to investigate whether the models can track changes over time (Experiment 2). Method: Several speech analysis methods for extracting a compact acoustic feature set that characterizes a speaker's speech are investigated. The effectiveness of a feature set for assessing a variable is assessed by feeding it to a linear regression model and by measuring the mean difference between the outputs of that model for a set of recordings and the corresponding perceptual scores for the assessed variable (Experiment 1). The models are trained and tested on recordings of 55 speakers treated non-surgically for advanced oral cavity, pharynx and larynx cancer. The perceptual scores are average unscaled ratings of a group of 13 raters. The ability of the models to track changes in perceptual scores over time is also investigated (Experiment 2). Results: Experiment 1 has demonstrated that combinations of feature sets generally result in better models, that the best articulation model outperforms the average human rater's performance and that the best accent and phonation models are deemed competitive. Scatter plots of computed and observed scores show, however, that especially low perceptual scores are difficult to assess automatically. Experiment 2 has shown that the articulation and phonation models achieve only variable success in tracking trends over time, and for only one of the time pairs are they deemed to compete with the average human rater. Nevertheless, there is a significant level of agreement between computed and observed trends when considering only a coarse classification of the trend into three classes: clearly positive, clearly negative and minor differences. Conclusions: A baseline tool to support the multi-dimensional evaluation of speakers treated non-surgically for advanced head and neck cancer now exists.
More work is required to further improve the models, particularly with respect to their ability to assess low-quality speech. (C) 2014 Elsevier B.V. All rights reserved. C1 [Clapham, Renee; Hilgers, Frans; van den Brekel, Michiel; van Son, Rob] Univ Amsterdam, Amsterdam Ctr Language & Commun, NL-1012 VT Amsterdam, Netherlands. [Clapham, Renee; Hilgers, Frans; van den Brekel, Michiel; van Son, Rob] Netherlands Canc Inst, NL-1066 CX Amsterdam, Netherlands. [Middag, Catherine; Martens, Jean-Pierre] Univ Ghent, Multimedia Lab ELIS, B-9000 Ghent, Belgium. RP Clapham, R (reprint author), Univ Amsterdam, Amsterdam Ctr Language & Commun, Spuistra 210, NL-1012 VT Amsterdam, Netherlands. EM r.p.clapham@uva.nl; Catherine.Middag@UGent.be; f.hilgers@nki.nl; martens@elis.ugent.be; M.W.M.vandenBrekel@uva.nl; r.v.son@nki.nl FU Atos Medical (Horby, Sweden); Verwelius Foundation (Naarden, the Netherlands) FX Part of this research was funded by unrestricted research grants from Atos Medical (Horby, Sweden) and the Verwelius Foundation (Naarden, the Netherlands). All speech recordings were collected by Lisette van der Molen, SLP PhD, and Irene Jacobi, PhD. The authors wish to thank Maya van Rossum, SLP PhD for her input on collecting the perceptual scores. Prof. Louis Pols is greatly acknowledged for his critical and constructive review of the manuscript. CR Breiman L., 1996, MACH LEARN, V24, P23 C Middag, 2011, P INT C SPOK LANG PR, P3005 Clapham RP, 2012, LREC 2012 - EIGHTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, P3350 De Bruijn M. J., 2011, SPEECH COMMUN, V54, P632 De Bodt MS, 2002, J COMMUN DISORD, V35, P283, DOI 10.1016/S0021-9924(02)00065-5 de Bruijn M, 2011, LOGOP PHONIATR VOCO, V36, P168, DOI 10.3109/14015439.2011.606227 de Bruijn MJ, 2009, FOLIA PHONIATR LOGO, V61, P180, DOI 10.1159/000219953 Haderlein T, 2007, EUR ARCH OTO-RHINO-L, V264, P1315, DOI 10.1007/s00405-007-0363-4 Jacobi I, 2010, EUR ARCH OTO-RHINO-L, V267, P1495, DOI 10.1007/s00405-010-1316-x Jacobi I, 2013, ANN OTO RHINOL LARYN, V122, P754 Jacobi Irene, 2009, THESIS U AMSTERDAM Kissine M., 2003, LINGUISTICS NETHERLA, P93 Maier A, 2009, SPEECH COMMUN, V51, P425, DOI 10.1016/j.specom.2009.01.004 Manfredi C, 2011, LOGOP PHONIATR VOCO, V36, P78, DOI 10.3109/14015439.2011.578077 Maryn Y, 2010, J COMMUN DISORD, V43, P161, DOI 10.1016/j.jcomdis.2009.12.004 Middag C, 2014, COMPUT SPEECH LANG, V28, P467, DOI 10.1016/j.csl.2012.10.007 Middag C., 2009, EURASIP J ADV SIGNAL Middag C., 2010, P INT 2010 Moerman M, 2004, EUR ARCH OTO-RHINO-L, V261, P541, DOI 10.1007/s00405-003-0681-0 Moerman MBJ, 2006, EUR ARCH OTO-RHINO-L, V263, P183, DOI 10.1007/s00405-005-0960-z Newman L., 2001, HEAD NECK-J SCI SPEC, V24, P68 Rietveld A. C. M., 1997, ALGEMENE FONETIEK Schuurman I., 2003, P 4 INT WORKSH LANG, P340 Shrivastav R, 2005, J SPEECH LANG HEAR R, V48, P323, DOI 10.1044/1092-4388(2005/022) Stouten F, 2006, INT CONF ACOUST SPEE, P329 van der Molen L, 2012, J VOICE, V26, pe25, DOI DOI 10.1016/J.JV0ICE.2011.08.016 VANIMMERSEEL LM, 1992, J ACOUST SOC AM, V91, P3511, DOI 10.1121/1.402840 Van Nuffelen G, 2009, INT J LANG COMM DIS, V44, P716, DOI 10.1080/13682820802342062 Verdonck-de Leeuw I.M., 2007, HEAD NECK CANC TREAT, P27 NR 29 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
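A minimal sketch of the evaluation loop the Clapham et al. abstract above describes for Experiment 1: feed a candidate acoustic feature set to a linear regression model and score it by the mean absolute difference between model outputs and the perceptual ratings. The scikit-learn call and the train/test split are assumptions for illustration, not the authors' pipeline.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def score_feature_set(X_train, y_train, X_test, y_test):
    """Fit perceptual ratings from acoustic features, then report the mean
    absolute difference between predicted and observed scores on held-out data."""
    model = LinearRegression().fit(X_train, y_train)
    return float(np.mean(np.abs(model.predict(X_test) - y_test)))
```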
PD APR PY 2014 VL 59 BP 44 EP 54 DI 10.1016/j.specom.2014.01.003 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA AD8DI UT WOS:000333496300005 ER PT J AU Deng, F Bao, F Bao, CC AF Deng, Feng Bao, Feng Bao, Chang-chun TI Speech enhancement using generalized weighted beta-order spectral amplitude estimator SO SPEECH COMMUNICATION LA English DT Article DE Speech enhancement; Auditory masking properties; Generalized weighted spectral amplitude estimator; A priori SNR estimation ID NOISE; MODELS AB In this paper, a single-channel speech enhancement method based on a generalized weighted beta-order spectral amplitude estimator is proposed. First, we derive a new kind of generalized weighted beta-order Bayesian spectral amplitude estimator, which takes full advantage of both the traditional perceptually weighted estimators and beta-order spectral amplitude estimators and yields a flexible and effective gain function. Second, according to the masking properties of the human auditory system, an adaptive estimation method for the perceptually weighted order p is proposed, based on the criterion that inaudible noise may be masked rather than removed. Thereby, the distortion of the enhanced speech is reduced. Third, based on the compressive nonlinearity of the cochlea, the spectral amplitude order beta can be interpreted as the compression rate of the spectral amplitude, and an adaptive calculation method for the parameter beta is proposed. In addition, owing to its one-frame delay, the a priori SNR estimate of the decision-directed method is inaccurate during speech activity. To overcome this drawback, we present a new a priori SNR estimation method that combines predictive estimation with the decision-directed rule. The subjective and objective test results indicate that the proposed Bayesian spectral amplitude estimator combined with the proposed a priori SNR estimation method achieves a larger segmental SNR improvement, lower log-spectral distortion and better speech quality than the reference methods. (C) 2014 Elsevier B.V. All rights reserved. C1 [Deng, Feng; Bao, Feng; Bao, Chang-chun] Beijing Univ Technol, Sch Elect Informat & Control Engn, Speech & Audio Signal Proc Lab, Beijing 100124, Peoples R China. RP Bao, CC (reprint author), Beijing Univ Technol, Sch Elect Informat & Control Engn, Speech & Audio Signal Proc Lab, Beijing 100124, Peoples R China. EM baochch@bjut.edu.cn FU Beijing Natural Science Foundation program; Scientific Research Key Program of Beijing Municipal Commission of Education [KZ201110005005]; National Natural Science Foundation of China [61072089] FX This work was supported by the Beijing Natural Science Foundation program and Scientific Research Key Program of Beijing Municipal Commission of Education (Grant No. KZ201110005005), and the National Natural Science Foundation of China (Grant No. 61072089).
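For context, the classic decision-directed rule whose one-frame delay the Deng, Bao and Bao abstract above addresses is the Ephraim-Malah a priori SNR estimate (the 1984 paper is cited in this record's reference list). A sketch in standard notation, where \hat{A} is the previous frame's enhanced amplitude, \lambda_N the noise power, \gamma the a posteriori SNR and \alpha a smoothing constant typically near 0.98:

```latex
\hat{\xi}(k,l) \;=\; \alpha\,\frac{\hat{A}^{2}(k,l-1)}{\lambda_{N}(k,l-1)}
               \;+\; (1-\alpha)\,\max\bigl\{\gamma(k,l)-1,\;0\bigr\}
```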
CR Abramson A, 2007, IEEE T AUDIO SPEECH, V15, P2348, DOI 10.1109/TASL.2007.904231 [Anonymous], 2001, REC P 862 PERC EV SP BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 Chang JH, 2006, IEEE T SIGNAL PROCES, V54, P1965, DOI 10.1109/TSP.2006.874403 Cohen I, 2002, IEEE SIGNAL PROC LET, V9, P12, DOI 10.1109/97.988717 Deng F., 2011, 2011 INT C WIR COMM, P1 DONOHO DL, 1995, IEEE T INFORM THEORY, V41, P613, DOI 10.1109/18.382009 EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550 EPHRAIM Y, 1995, IEEE T SPEECH AUDI P, V3, P251, DOI 10.1109/89.397090 EPHRAIM Y, 1992, IEEE T SIGNAL PROCES, V40, P725, DOI 10.1109/78.127947 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 Gradshteyn I. S., 2000, TABLE INTEGRALS SERI GREENWOOD DD, 1990, J ACOUST SOC AM, V87, P2592, DOI 10.1121/1.399052 ITU-T, 1993, REC P 56 OBJ MEAS AC JOHNSTON JD, 1988, IEEE J SEL AREA COMM, V6, P314, DOI 10.1109/49.608 Loizou P.C., 2007, SPEECH ENHANCEMENT T, P213 Loizou PC, 2005, IEEE T SPEECH AUDI P, V13, P857, DOI 10.1109/TSA.2005.851929 MALAH D, 1999, ACOUST SPEECH SIG PR, P789 Moore BC., 2003, INTRO PSYCHOL HEARIN Plourde E, 2008, IEEE T AUDIO SPEECH, V16, P1614, DOI 10.1109/TASL.2008.2004304 Poblete V, 2014, SPEECH COMMUN, V56, P19, DOI 10.1016/j.specom.2013.07.006 Quackenbush S. R., 1988, OBJECTIVE MEASURES S Robles L, 2001, PHYSIOL REV, V81, P1305 VARGA A, 1993, SPEECH COMMUN, V12, P247, DOI 10.1016/0167-6393(93)90095-3 Virag N, 1999, IEEE T SPEECH AUDI P, V7, P126, DOI 10.1109/89.748118 You CH, 2005, IEEE T SPEECH AUDI P, V13, P475, DOI 10.1109/TSA.2005.848883 You C.H., 2004, IEEE INT C AC SPEECH, V1, P725 You CH, 2006, SPEECH COMMUN, V48, P57, DOI 10.1016/j.specom.2005.05.012 Zhao DY, 2007, IEEE T AUDIO SPEECH, V15, P882, DOI 10.1109/TASL.2006.885256 NR 29 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2014 VL 59 BP 55 EP 68 DI 10.1016/j.specom.2014.01.002 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA AD8DI UT WOS:000333496300006 ER PT J AU Kanagasundaram, A Dean, D Sridharan, S Gonzalez-Dominguez, J Gonzalez-Rodriguez, J Ramos, D AF Kanagasundaram, A. Dean, D. Sridharan, S. Gonzalez-Dominguez, J. Gonzalez-Rodriguez, J. Ramos, D. TI Improving short utterance i-vector speaker verification using utterance variance modelling and compensation techniques SO SPEECH COMMUNICATION LA English DT Article DE Speaker verification; I-vector; PLDA; SN-LDA; SUVN; SUV ID VARIABILITY AB This paper proposes techniques to improve the performance of i-vector based speaker verification systems when only short utterances are available. Short-length utterance i-vectors vary with speaker, session variations, and the phonetic content of the utterance. Well established methods such as linear discriminant analysis (LDA), source-normalized LDA (SN-LDA) and within-class covariance normalization (WCCN) exist for compensating the session variation but we have identified the variability introduced by phonetic content due to utterance variation as an additional source of degradation when short-duration utterances are used. 
To compensate for utterance variations in short i-vector speaker verification systems using cosine similarity scoring (CSS), we have introduced a short utterance variance normalization (SUVN) technique and a short utterance variance (SUV) modelling approach at the i-vector feature level. A combination of SUVN with LDA and SN-LDA is proposed to compensate the session and utterance variations and is shown to provide improvement in performance over the traditional approach of using LDA and/or SN-LDA followed by WCCN. An alternative approach is also introduced using probabilistic linear discriminant analysis (PLDA) approach to directly model the SUV. The combination of SUVN, LDA and SN-LDA followed by SUV PLDA modelling provides an improvement over the baseline PLDA approach. We also show that for this combination of techniques, the utterance variation information needs to be artificially added to full-length i-vectors for PLDA modelling. (C) 2014 Elsevier B.V. All rights reserved. C1 [Kanagasundaram, A.; Dean, D.; Sridharan, S.] Queensland Univ Technol, SAIVT, Speech Res Lab, Brisbane, Qld 4001, Australia. [Gonzalez-Dominguez, J.; Gonzalez-Rodriguez, J.; Ramos, D.] Univ Autonoma Madrid, ATVS Biometr Recognit Grp, E-28049 Madrid, Spain. RP Kanagasundaram, A (reprint author), Queensland Univ Technol, SAIVT, Speech Res Lab, Brisbane, Qld 4001, Australia. EM a.kanagasundaram@qut.edu.au; d.dean@qut.edu.au; s.sridharan@qut.edu.au; javier.gonzalez@uam.es; joaquin.gonzalez@uam.es; daniel.ramos@uam.es FU Australian Research Council (ARC) [LP130100110]; European Commission Marie Curie ITN Bayesian Biometrics for Forensics (BBfor2) network; Spanish Ministerio de Economia y Competitividad [TEC2012-37585-C02-01] FX This project was supported by an Australian Research Council (ARC) Linkage Grant LP130100110 and by the European Commission Marie Curie ITN Bayesian Biometrics for Forensics (BBfor2) network and the Spanish Ministerio de Economia y Competitividad under the project TEC2012-37585-C02-01. CR Dehak N., 2010, OD SPEAK LANG REC WO Dehak N., 2010, IEEE T AUDIO SPEECH, P1 Dehak N., 2009, P INT C SPOK LANG PR, P1559 Garcia-Romero D, 2011, INTERSPEECH, P249 Hasan T., 2013, IEEE INT C AC SPEECH Kanagasundaram A., 2011, P INTERSPEECH, P2341 Kanagasundaram A., 2012, P OD WORKSH Kanagasundaram Ahilan, 2012, SPEAK LANG REC WORKS KENNY P, 2006, IEEE OD 2006 SPEAK L, P1 Kenny P, 2008, IEEE T AUDIO SPEECH, V16, P980, DOI 10.1109/TASL.2008.925147 Kenny P, 2005, JOINT FACTOR ANAL SP Kenny P, 2010, P OD SPEAK LANG REC Kenny P., 2013, IEEE INT C AC SPEECH McLaren M, 2011, INT CONF ACOUST SPEE, P5456 McLaren M, 2011, INT CONF ACOUST SPEE, P5460 McLaren M., 2010, P OD WORKSH McLaren M, 2012, IEEE T AUDIO SPEECH, V20, P755, DOI 10.1109/TASL.2011.2164533 NIST, 2008, NIST YEAR 2008 SPEAK NIST, 2010, NIST YEAR 2010 SPEAK Shum S., 2010, P OD Vogt R., 2008, INT 2008 BRISB AUSTR VOGT R, 2008, OD SPEAK LANG REC WO Vogt R, 2008, COMPUT SPEECH LANG, V22, P17, DOI 10.1016/j.csl.2007.05.003 Zhao XY, 2009, INT CONF ACOUST SPEE, P4049 NR 24 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
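As a point of reference for the Kanagasundaram et al. abstract above, cosine similarity scoring (CSS) between an enrolment and a test i-vector, optionally after a compensation projection such as LDA or SN-LDA, can be sketched as follows; the projection matrix and any score normalisation are left as assumptions, and this is the baseline scoring the paper builds on rather than the SUVN/SUV technique itself.

```python
import numpy as np

def css_score(w_enrol, w_test, projection=None):
    """Cosine similarity score between two i-vectors.
    projection: optional compensation matrix (e.g. LDA / SN-LDA), applied to both."""
    if projection is not None:
        w_enrol, w_test = projection @ w_enrol, projection @ w_test
    return float(np.dot(w_enrol, w_test) /
                 (np.linalg.norm(w_enrol) * np.linalg.norm(w_test)))
```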
PD APR PY 2014 VL 59 BP 69 EP 82 DI 10.1016/j.specom.2014.01.004 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA AD8DI UT WOS:000333496300007 ER PT J AU Jeong, Y AF Jeong, Yongwon TI Joint speaker and environment adaptation using Tensor Voice for robust speech recognition SO SPEECH COMMUNICATION LA English DT Article DE Acoustic model adaptation; Environment adaptation; Speaker adaptation; Speech recognition; Tensor analysis ID HIDDEN MARKOV-MODELS; MAXIMUM-LIKELIHOOD; NOISE AB We present an adaptation of a hidden Markov model (HMM)-based automatic speech recognition system to the target speaker and noise environment. Given HMMs built from various speakers and noise conditions, we build tensorvoices that capture the interaction between the speaker and noise by using a tensor decomposition. We express the updated model for the target speaker and noise environment as a product of the tensorvoices and two weight vectors, one each for the speaker and noise. An iterative algorithm is presented to determine the weight vectors in the maximum likelihood (ML) framework. With the use of separate weight vectors, the tensorvoice approach can adapt to the target speaker and noise environment differentially, whereas the eigenvoice approach, which is based on a matrix decomposition technique, cannot differentially adapt to those two factors. In supervised adaptation tests using the AURORA4 corpus, the relative improvement of performance obtained by the tensorvoice method over the eigenvoice method is approximately 10% on average for adaptation data of 6-24 s in length, and the relative improvement of performance obtained by the tensorvoice method over the maximum likelihood linear regression (MLLR) method is approximately 5.4% on average for adaptation data of 6-18 s in length. Therefore, the tensorvoice approach adapts to the target speaker and noise environment more effectively than the eigenvoice and MLLR approaches when only short adaptation data are available. (C) 2013 Elsevier B.V. All rights reserved. C1 Pusan Natl Univ, Sch Elect Engn, Pusan 609735, South Korea. RP Jeong, Y (reprint author), Pusan Natl Univ, Sch Elect Engn, Pusan 609735, South Korea. EM jeongy@pusan.ac.kr CR Acero A., 2000, P ICSLP, P869 Anastasakos T, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P1137 CARROLL JD, 1970, PSYCHOMETRIKA, V35, P283, DOI 10.1007/BF02310791 DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1 Gales MJF, 1998, COMPUT SPEECH LANG, V12, P75, DOI 10.1006/csla.1998.0043 Gales MJF, 1998, SPEECH COMMUN, V25, P49, DOI 10.1016/S0167-6393(98)00029-6 Gales MJF, 2000, IEEE T SPEECH AUDI P, V8, P417, DOI 10.1109/89.848223 Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278 Jeong Y, 2010, INT CONF ACOUST SPEE, P4870, DOI 10.1109/ICASSP.2010.5495117 Jeong Y, 2011, IEEE SIGNAL PROC LET, V18, P347, DOI 10.1109/LSP.2011.2136335 Jolliffe I.
T., 2002, PRINCIPAL COMPONENT, V2nd Kolda TG, 2009, SIAM REV, V51, P455, DOI 10.1137/07070111X Kuhn R, 2000, IEEE T SPEECH AUDI P, V8, P695, DOI 10.1109/89.876308 Lathauwer LD, 2000, SIAM J MATRIX ANAL A, V21, P1253 LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010 Li J, 2007, PROCEEDINGS OF THE 3RD INTERNATIONAL YELLOW RIVER FORUM ON SUSTAINABLE WATER RESOURCES MANAGEMENT AND DELTA ECOSYSTEM MAINTENANCE, VOL VI, P65 Lu HP, 2008, IEEE T NEURAL NETWOR, V19, P18, DOI 10.1109/TNN.2007.901277 NGUYEN P, 1999, P EUROSPEECH, P2519 O'Shaughnessy D, 2008, PATTERN RECOGN, V41, P2965, DOI 10.1016/j.patcog.2008.05.008 Pallett D.S., 1994, P HUM LANG TECHN WOR, P49, DOI 10.3115/1075812.1075824 Parihar N., 2002, AURORA WORKING GROUP PAUL DB, 1992, P WORKSH SPEECH NAT, P357, DOI 10.3115/1075527.1075614 RABINER LR, 1989, P IEEE, V77, P257, DOI 10.1109/5.18626 Rigazio L., 2001, P INTERSPEECH, P2347 Sagayama S., 1997, P IEEE WORKSH AUT SP, P396 Seltzer M. L., 2011, P INTERSPEECH, P1097 Seltzer M. L., 2011, P IEEE WORKSH AUT SP, P146 SIROVICH L, 1987, J OPT SOC AM A, V4, P519, DOI 10.1364/JOSAA.4.000519 Tsao Y, 2009, IEEE T AUDIO SPEECH, V17, P1025, DOI 10.1109/TASL.2009.2016231 TURK M, 1991, J COGNITIVE NEUROSCI, V3, P71, DOI 10.1162/jocn.1991.3.1.71 VARGA A, 1993, SPEECH COMMUN, V12, P247, DOI 10.1016/0167-6393(93)90095-3 Vasilescu M. A. O., 2007, P ICCV, P1 Vasilescu M.A.O., 2002, LNCS, V2350, P447 Vasilescu MAO, 2002, INT C PATT RECOG, P511 Vasilescu MAO, 2005, PROC CVPR IEEE, P547 Vlasic D, 2005, ACM T GRAPHIC, V24, P426, DOI 10.1145/1073204.1073209 Wang YQ, 2012, IEEE T AUDIO SPEECH, V20, P2149, DOI 10.1109/TASL.2012.2198059 NR 37 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2014 VL 58 BP 1 EP 10 DI 10.1016/j.specom.2013.10.001 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 293FE UT WOS:000329956000001 ER PT J AU De Looze, C Scherer, S Vaughan, B Campbell, N AF De Looze, Celine Scherer, Stefan Vaughan, Brian Campbell, Nick TI Investigating automatic measurements of prosodic accommodation and its dynamics in social interaction SO SPEECH COMMUNICATION LA English DT Article DE Prosodic accommodation; Dynamics; Interactional conversation; Information exchange; Speakers' involvement and affinity ID CROSS-LANGUAGE; ADULT SPEECH; MIMICRY; COMMUNICATION; CONVERSATION; COORDINATION; CONVERGENCE; PERCEPTION; SYNCHRONY; DESIRABILITY AB Spoken dialogue systems are increasingly being used to facilitate and enhance human communication. While these interactive systems can process the linguistic aspects of human communication, they are not yet capable of processing the complex dynamics involved in social interaction, such as the adaptation on the part of interlocutors. Providing interactive systems with the capacity to process and exhibit this accommodation could however improve their efficiency and make machines more socially-competent interactants. At present, no automatic system is available to process prosodic accommodation, nor do any clear measures exist that quantify its dynamic manifestation. While it can be observed to be a monotonically manifest property, it is our hypotheses that it evolves dynamically with functional social aspects. In this paper, we propose an automatic system for its measurement and the capture of its dynamic manifestation. 
We investigate the evolution of prosodic accommodation in 41 Japanese dyadic telephone conversations and discuss its manifestation in relation to its functions in social interaction. Overall, our study shows that prosodic accommodation changes dynamically over the course of a conversation and across conversations, and that these dynamics inform about the naturalness of the conversation flow, the speakers' degree of involvement and their affinity in the conversation. (C) 2013 Elsevier B.V. All rights reserved. C1 [De Looze, Celine; Vaughan, Brian; Campbell, Nick] Trinity Coll Dublin, Speech Commun Lab, Dublin 2, Ireland. [Scherer, Stefan] Univ So Calif, Inst Creat Technol, Los Angeles, CA 90089 USA. RP De Looze, C (reprint author), Trinity Coll Dublin, Speech Commun Lab, 7-9 South Leinster St, Dublin 2, Ireland. EM deloozec@tcd.ie FU Science Foundation Ireland (SFI) [09/IN.I/12631]; FASTNET project - Focus on Action in Social Talk: Network Enabling Technology FX This work was undertaken as part of the FASTNET project - Focus on Action in Social Talk: Network Enabling Technology funded by Science Foundation Ireland (SFI) 09/IN.1/I2631. CR Agarwal S.K., 2011, 2 INT WORKSH INT US Apple Inc, 2011, APPL SIRI HOM Aubanel V, 2010, SPEECH COMMUN, V52, P577, DOI 10.1016/j.specom.2010.02.008 Babel M., 2011, LANG SPEECH, V55, P231 BAILLY G, 2010, INT 2010, P1153 Banse R, 1996, J PERS SOC PSYCHOL, V70, P614, DOI 10.1037/0022-3514.70.3.614 BAVELAS JB, 1986, J PERS SOC PSYCHOL, V50, P322, DOI 10.1037/0022-3514.50.2.322 Bell L., 2003, P ICPHS CIT, V3, P833 Bernieri F.J., 1991, FUNDAMENTALS NONVERB, P401 Black JW, 1949, J SPEECH HEAR DISORD, V14, P16 Boersma P., 2006, PRAAT DOING PHONETIC Boylan P., 2004, TECHNICAL REPORT Branigan HP, 2010, J PRAGMATICS, V42, P2355, DOI 10.1016/j.pragma.2009.12.012 Breazeal C, 2002, INT J ROBOT RES, V21, P883, DOI 10.1177/0278364902021010096 Brennan S, 1996, P INT S SPOK DIAL, P41 Burgoon J.K., 1995, NUMBER CAMBRIDGE Campbell N., 2004, P LANG RES EV C LREC, P183 Chartrand TL, 1999, J PERS SOC PSYCHOL, V76, P893, DOI 10.1037//0022-3514.76.6.893 Cleland AA, 2003, J MEM LANG, V49, P214, DOI 10.1016/S0749-596X(03)00060-3 Collins B., 1998, 5 INT C SPOK LANG PR CONDON WS, 1974, CHILD DEV, V45, P456, DOI 10.2307/1127968 Coulston R., 2002, P 7 INT C SPOK LANG, V4, P2689 CROWNE DP, 1960, J CONSULT PSYCHOL, V24, P349, DOI 10.1037/h0047358 De Looze C., 2010, THESIS U PROVENCE de Looze C., 2011, P 17 INT C PHON SCI, P1294 De Looze C., 2011, P INT, P1393 de Jong NH, 2009, BEHAV RES METHODS, V41, P385, DOI 10.3758/BRM.41.2.385 Delvaux V, 2007, PHONETICA, V64, P145, DOI 10.1159/000107914 Edlund J., 2009, 10 ANN C INT SPEECH, P2779 FERGUSON CA, 1975, ANTHROPOL LINGUIST, V17, P1 FERNALD A, 1989, J CHILD LANG, V16, P477 Gallois C., 1991, CONTEXTS ACCOMMODATI, P245, DOI 10.1017/CBO9780511663673.008 GALLOIS C, 1988, LANG COMMUN, V8, P271, DOI 10.1016/0271-5309(88)90022-5 Garrod Simon C., 2006, RES LANGUAGE COMPUTA, V4, P203, DOI DOI 10.1007/S11168-006-9004-0 Giles H., 1991, CONTEXTS ACCOMMODATI, P1, DOI DOI 10.1017/CBO9780511663673 GOLDMANEISLER F, 1961, LANG SPEECH, V4, P171 Goldman-Eisler F., 1968, PSYCHOLINGUISTICS EX Google, 2011, GOOGL VOIC SEARCH Gregory SW, 1997, J NONVERBAL BEHAV, V21, P23 GREGORY S, 1993, LANG COMMUN, V13, P195, DOI 10.1016/0271-5309(93)90026-J GREGORY SW, 1982, J PSYCHOLINGUIST RES, V11, P35, DOI 10.1007/BF01067500 GROSJEAN F, 1975, PHONETICA, V31, P144 Haywood SL, 2005, PSYCHOL SCI, V16, P362, DOI 10.1111/j.0956-7976.2005.01541.x Heldner J., 2010, P INT 2010, P1 
Hess U, 2001, INT J PSYCHOPHYSIOL, V40, P129, DOI 10.1016/S0167-8760(00)00161-6 Jaffe J., 2001, RHYTHMS DIALOGUE INF Jaffe J., 1970, RHYTHMS DIALOGUE Juslin P. N., 2005, NEW HDB METHODS NONV Kleinberger T, 2007, LECT NOTES COMPUT SC, V4555, P103 Kopp S, 2010, SPEECH COMMUN, V52, P587, DOI 10.1016/j.specom.2010.02.007 Kousidis S., 2009, P SPECOM 2009 ST PET, P2 Kousidis S., 2008, P INT Lakin JL, 2003, PSYCHOL SCI, V14, P334, DOI 10.1111/1467-9280.14481 Lee CC, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 1-2, P793 Levelt W. J. M., 1982, PROCESSES BELIEFS QU, P199 Levitan R., P ACL 2011, P113 Levitan R., 2011, INTERSPEECH 2011, P3081 Levitan R., 2011, 12 ANN C INT SPEECH Lu H, 2011, LECT NOTES COMPUT SC, V6696, P188 Maganti H.K., 2007, P IEEE INT C AC SPEE, V4 MATARAZZO JOSEPH D., 1967, J EXP RES PERSONALITY, V2, P56 MAURER RE, 1983, J COUNS PSYCHOL, V30, P158, DOI 10.1037/0022-0167.30.2.158 McGarva AR, 2003, J PSYCHOLINGUIST RES, V32, P335, DOI 10.1023/A:1023547703110 MELTZER L, 1971, J PERS SOC PSYCHOL, V18, P392, DOI 10.1037/h0030993 MELTZOFF AN, 1977, SCIENCE, V198, P75, DOI 10.1126/science.198.4312.75 Miles LK, 2009, J EXP SOC PSYCHOL, V45, P585, DOI 10.1016/j.jesp.2009.02.002 Mondada L., 2001, MARGES LINGUISTIQUES, V1, P1 NATALE M, 1975, J PERS SOC PSYCHOL, V32, P790, DOI 10.1037/0022-3514.32.5.790 Nenkova A., 2008, P 46 ANN M ASS COMP, P169, DOI 10.3115/1557690.1557737 Nishimura R, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P534 OHALA JJ, 1983, PHONETICA, V40, P1 Oviatt S, 1996, IEEE MULTIMEDIA, V3, P26, DOI 10.1109/93.556458 Pardo JS, 2006, J ACOUST SOC AM, V119, P2382, DOI 10.1121/1.2178720 Parrill F, 2006, J NONVERBAL BEHAV, V30, P157, DOI 10.1007/s10919-006-0014-2 Pentland A.S., 2008, HONEST SIGNALS THEY Pickering MJ, 2004, BEHAV BRAIN SCI, V27, P169 Pickering MJ, 2008, PSYCHOL BULL, V134, P427, DOI 10.1037/0033-2909.134.3.427 Putman W.B., 1984, INT J SOCIOL LANG, V1984, P97, DOI 10.1515/ijsl.1984.46.97 Ramseyer F, 2010, LECT NOTES COMPUT SC, V5967, P182, DOI 10.1007/978-3-642-12397-9_15 Richardson MJ, 2007, HUM MOVEMENT SCI, V26, P867, DOI 10.1016/j.humov.2007.07.002 Rumsey F., 2002, SOUND RECORDING INTR SCHERER KR, 1994, J PERS SOC PSYCHOL, V66, P310, DOI 10.1037/0022-3514.66.2.310 Shepard C. A., 2001, NEW HDB LANGUAGE SOC, P33 Shockley K, 2009, TOP COGN SCI, V1, P305, DOI 10.1111/j.1756-8765.2009.01021.x Shockley K, 2007, J EXP PSYCHOL HUMAN, V33, P201, DOI 10.1037/0096-1523.33.1.201 Smith C. L., 2007, P 16 INT C PHON SCI, P313 Stanford G.W., 1996, J PERS SOC PSYCHOL, V70, P1231 Street Jr Richard L., 1983, LANG SCI, V5, P79 Suzuki N, 2007, CONNECT SCI, V19, P131, DOI 10.1080/09540090701369125 Tickle-Degnen L, 1990, PSYCHOL INQ, V1, P285, DOI DOI 10.1207/S15327965PLI0104_1 VANSUMMERS W, 1988, J ACOUST SOC AM, V84, P917 Vaughan B., 2011, P INT 2011, P1865 Vinciarelli A, 2009, IEEE SIGNAL PROC MAG, V26, P133, DOI 10.1109/MSP.2009.933382 Ward Diane, 2007, ISCA TUT RES WORKSH, P4 Ward N., 2002, 7 INT C SPOK LANG PR Webb J. 
T., 1972, STUDIES DYADIC COMMU, P115 WELKOWIT.J, 1973, J CONSULT CLIN PSYCH, V41, P472, DOI 10.1037/h0035328 Woodall G.W., 1983, J NONVERBAL BEHAV, V8, P126 ZEBROWITZ LA, 1992, J NONVERBAL BEHAV, V16, P143, DOI 10.1007/BF00988031 ZEINE L, 1988, J COMMUN DISORD, V21, P373, DOI 10.1016/0021-9924(88)90022-6 Zhou J., 2012, GERONTOLOGIST, P73 Zuengler J., 1991, CONTEXTS ACCOMMODATI, P223, DOI 10.1017/CBO9780511663673.007 NR 102 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2014 VL 58 BP 11 EP 34 DI 10.1016/j.specom.2013.10.002 PG 24 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 293FE UT WOS:000329956000002 ER PT J AU Lu, CT AF Lu, Ching-Ta TI Reduction of musical residual noise using block-and-directional-median filter adapted by harmonic properties SO SPEECH COMMUNICATION LA English DT Article DE Speech enhancement; Spectral subtraction; Musical residual noise; Block-and-directional median filter; Post-processing; Median filter ID SPEECH ENHANCEMENT; MASKING PROPERTIES; TRANSFORM AB Many speech enhancement systems can efficiently remove background noise. However, most of them suffer from musical residual noise which is very annoying to the human ear. This study proposes a post-processing system to efficiently reduce the effect of musical residual noise, enabling the enhanced speech to be improved. Noisy speech is firstly enhanced by a speech enhancement algorithm to reduce background noise. The enhanced speech is then post-processed by a block-and-directional-median (BDM) filter adapted by harmonic properties, causing the musical effect of residual noise being efficiently reduced. In the case of a speech-like spectrum, directional-median filtering is performed to slightly reduce the musical effect of residual noise, where a strong harmonic spectrum of a vowel is well maintained. The quality of post-processed speech is then ensured. On the contrary, block-median filtering is performed to greatly reduce the spectral variation in noise-dominant regions, enabling the spectral peaks of musical tones to be significantly smoothed. The musical effect of residual noise is therefore reduced. Finally, the preprocessed and post-processed spectra are integrated according to the speech-presence probability. Experimental results show that the proposed post processor can efficiently improve the performance of a speech enhancement system by reducing the musical effect of residual noise. (C) 2013 Elsevier B.V. All rights reserved. C1 Asia Univ, Dept Informat Commun, Taichung 41354, Taiwan. RP Lu, CT (reprint author), Asia Univ, Dept Informat Commun, 500 Lioufeng Rd, Taichung 41354, Taiwan. EM lucas1@ms26.hinet.net FU National Science Council, Taiwan [NSC 101-2221-E-468-010] FX This research was supported by the National Science Council, Taiwan, under contract number NSC 101-2221-E-468-010. We thank the reviewers for their valuable comments which much improve the quality of this paper. Our gratitude also goes to Dr. Timothy Williams (Asia University) for his help in English proofreading. 
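To illustrate the post-processing idea described in the abstract above, the sketch below applies plain two-dimensional median smoothing to an already-enhanced magnitude spectrogram, which suppresses the isolated spectral peaks perceived as musical noise. The paper's block-and-directional-median filter additionally adapts the filter shape to harmonic regions and blends with the speech-presence probability; none of that is reproduced here, and the frame sizes and stand-in signal are assumptions.

```python
# Minimal sketch: median smoothing of an enhanced magnitude spectrogram to
# suppress isolated musical-noise peaks. The paper's harmonic-adaptive
# block-and-directional variant is not reproduced here.
import numpy as np
from scipy.ndimage import median_filter
from scipy.signal import stft, istft

def smooth_musical_noise(x, fs, block=(3, 3)):
    f, t, X = stft(x, fs=fs, nperseg=512)
    mag, phase = np.abs(X), np.angle(X)
    mag_smoothed = median_filter(mag, size=block)   # block-median smoothing
    _, y = istft(mag_smoothed * np.exp(1j * phase), fs=fs, nperseg=512)
    return y

fs = 16000
pre_enhanced = np.random.randn(fs)                  # stand-in for an enhanced signal
out = smooth_musical_noise(pre_enhanced, fs)
```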
CR Cho E, 2012, 2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), P4569 Cohen I, 2002, IEEE SIGNAL PROC LET, V9, P12, DOI 10.1109/97.988717 Ding HJ, 2009, SPEECH COMMUN, V51, P259, DOI 10.1016/j.specom.2008.09.003 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 Esch T, 2009, INT CONF ACOUST SPEE, P4409, DOI 10.1109/ICASSP.2009.4960607 Ghanbari Y, 2006, SPEECH COMMUN, V48, P927, DOI 10.1016/j.specom.2005.12.002 Goh Z, 1998, IEEE T SPEECH AUDI P, V6, P287 Hsu CC, 2012, 2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), P4001 Hu Y, 2004, IEEE SIGNAL PROC LET, V11, P270, DOI 10.1109/LSP.2003.821714 Ibne M.N., 2003, P IEEE INT S SIGN PR, P749 ITU-T, 2001, INT TEL UN P ITU-T, 2003, INT TEL UN P Jensen J, 2012, IEEE T AUDIO SPEECH, V20, P92, DOI 10.1109/TASL.2011.2157685 Jensen J.R., 2012, IEEE T AUDIO SPEECH, V20, P948 Jin W, 2010, IEEE T AUDIO SPEECH, V18, P356, DOI 10.1109/TASL.2009.2028916 Jo S, 2010, IEEE T AUDIO SPEECH, V18, P2099, DOI 10.1109/TASL.2010.2041119 KLEIN M, 2002, ACOUST SPEECH SIG PR, P537 Leitner C., 2012, P INT C SYST SIGN IM, P464 Lu C-T, 2007, DIGIT SIGNAL PROCESS, V17, P171 Lu CT, 2003, SPEECH COMMUN, V41, P409, DOI 10.1016/S0167-6393(03)00011-6 Lu CT, 2010, COMPUT SPEECH LANG, V24, P632, DOI 10.1016/j.csl.2009.09.001 Lu C.-T., 2012, P IEEE SPRING WORLD, V2, P458 Lu CT, 2007, PATTERN RECOGN LETT, V28, P1300, DOI 10.1016/j.patrec.2007.03.001 Lu CT, 2004, ELECTRON LETT, V40, P394, DOI 10.1049/el:20040266 Lu C.-T., 2011, P IEEE INT C COMP SC, V3, P475 Lu CT, 2011, SPEECH COMMUN, V53, P495, DOI 10.1016/j.specom.2010.11.008 Martin R, 2001, IEEE T SPEECH AUDI P, V9, P504, DOI 10.1109/89.928915 Miyazaki R, 2012, IEEE T AUDIO SPEECH, V20, P2080, DOI 10.1109/TASL.2012.2196513 Plapous C, 2006, IEEE T AUDIO SPEECH, V14, P2098, DOI 10.1109/TASL.2006.872621 Rix A., 2001, P IEEE INT C AC SPEE, P749 Shimamura T, 2001, IEEE T SPEECH AUDI P, V9, P727, DOI 10.1109/89.952490 Udrea RM, 2008, SIGNAL PROCESS, V88, P1293 Virag N, 1999, IEEE T SPEECH AUDI P, V7, P126, DOI 10.1109/89.748118 Yu HJ, 2012, 2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), P4573 NR 34 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2014 VL 58 BP 35 EP 48 DI 10.1016/j.specom.2013.11.002 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 293FE UT WOS:000329956000003 ER PT J AU Schwerin, B Paliwal, K AF Schwerin, Belinda Paliwal, Kuldip TI Using STFT real and imaginary parts of modulation signals for MMSE-based speech enhancement SO SPEECH COMMUNICATION LA English DT Article DE Modulation domain; Analysis-modification-synthesis (AMS); Speech enhancement; MMSE short-time spectral magnitude estimator; Modulation spectrum; Modulation magnitude spectrum; Modulation signal; Modulation spectral subtraction ID SPECTRAL AMPLITUDE ESTIMATOR; SUBTRACTION; NOISE AB In this paper we investigate an alternate, RI-modulation (R = real, I = imaginary) AMS framework for speech enhancement, in which the real and imaginary parts of the modulation signal are processed in secondary AMS procedures. This framework offers theoretical advantages over the previously proposed modulation AMS frameworks in that noise is additive in the modulation signal and noisy acoustic phase is not used to reconstruct speech. 
Using the MMSE magnitude estimation to modify modulation magnitude spectra, initial experiments presented in this work evaluate if these advantages translate into improvements in processed speech quality. The effect of speech presence uncertainty and log-domain processing on MMSE magnitude estimation in the RI-modulation framework is also investigated. Finally, a comparison of different enhancement approaches applied in the RI-modulation framework is presented. Using subjective and objective experiments as well as spectrogram analysis, we show that RI-modulation MMSE magnitude estimation with speech presence uncertainty produces stimuli which has a higher preference by listeners than the other RI-modulation types. In comparisons to similar approaches in the modulation AMS framework, results showed that the theoretical advantages of the RI-modulation framework did not translate to an improvement in overall quality, with both frameworks yielding very similar sounding stimuli, but a clear improvement (compared to the corresponding modulation AMS based approach) in speech intelligibility was found. (C) 2013 Elsevier B.V. All rights reserved. C1 [Schwerin, Belinda; Paliwal, Kuldip] Griffith Univ, Griffith Sch Engn, Signal Proc Lab, Nathan, Qld 4111, Australia. RP Schwerin, B (reprint author), Griffith Univ, Griffith Sch Engn, Signal Proc Lab, Nathan, Qld 4111, Australia. EM b.schwerin@griffith.edu.au; k.paliwal@griffith.edu.au CR Berouti M., 1979, P IEEE INT C AC SPEE, V4, P208 BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 Cappe O, 1994, IEEE T SPEECH AUDI P, V2, P345, DOI 10.1109/89.279283 DRULLMAN R, 1994, J ACOUST SOC AM, V95, P2670, DOI 10.1121/1.409836 EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550 EPHRAIM Y, 1995, IEEE T SPEECH AUDI P, V3, P251, DOI 10.1109/89.397090 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 GRIFFIN DW, 1984, IEEE T ACOUST SPEECH, V32, P236, DOI 10.1109/TASSP.1984.1164317 Hu Y, 2007, SPEECH COMMUN, V49, P588, DOI 10.1016/j.specom.2006.12.006 Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst Paliwal K, 2011, SPEECH COMMUN, V53, P327, DOI 10.1016/j.specom.2010.10.004 Paliwal K., 1987, P IEEE INT C AC SPEE, V12, P297 Paliwal K, 2010, SPEECH COMMUN, V52, P450, DOI 10.1016/j.specom.2010.02.004 Paliwal K, 2012, SPEECH COMMUN, V54, P282, DOI 10.1016/j.specom.2011.09.003 Quackenbush S. R., 1988, OBJECTIVE MEASURES S Quatieri T. F., 2002, DISCRETE TIME SPEECH Rix A., 2001, PERCEPTUAL EVALUATIO Sohn J, 1999, IEEE SIGNAL PROC LET, V6, P1 Vary P, 2006, DIGITAL SPEECH TRANS Virag N, 1999, IEEE T SPEECH AUDI P, V7, P126, DOI 10.1109/89.748118 Wiener N., 1949, EXTRAPOLATION INTERP Zhang Y, 2013, SPEECH COMMUN, V55, P509, DOI 10.1016/j.specom.2012.09.005 Zhang Y, 2011, INT CONF ACOUST SPEE, P4744 NR 23 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
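For orientation, the sketch below shows a generic acoustic-domain analysis-modification-synthesis (AMS) chain with a Wiener-type gain standing in for the MMSE magnitude estimator. The paper instead runs MMSE magnitude estimation inside secondary AMS procedures applied to the real and imaginary modulation signals; that RI-modulation machinery, the noise tracker, and the parameter values below are all assumptions for illustration.

```python
# Generic AMS sketch with a Wiener-type gain as a stand-in for MMSE magnitude
# estimation; the paper applies the estimator in the RI-modulation domain.
import numpy as np
from scipy.signal import stft, istft

def ams_enhance(noisy, fs, noise_psd):
    _, _, X = stft(noisy, fs=fs, nperseg=512, noverlap=384)
    snr_post = (np.abs(X) ** 2) / noise_psd[:, None]   # a posteriori SNR
    xi = np.maximum(snr_post - 1.0, 1e-3)              # crude a priori SNR estimate
    gain = xi / (1.0 + xi)                             # Wiener-type spectral gain
    _, y = istft(gain * X, fs=fs, nperseg=512, noverlap=384)
    return y

fs = 16000
noisy = np.random.randn(2 * fs)                        # stand-in noisy signal
_, _, N = stft(noisy[:fs // 2], fs=fs, nperseg=512, noverlap=384)
noise_psd = np.mean(np.abs(N) ** 2, axis=1)            # noise estimate from a noise-only lead-in
enhanced = ams_enhance(noisy, fs, noise_psd)
```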
PD MAR PY 2014 VL 58 BP 49 EP 68 DI 10.1016/j.specom.2013.11.001 PG 20 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 293FE UT WOS:000329956000004 ER PT J AU Lee, K AF Lee, Kyogu TI Application of non-negative spectrogram decomposition with sparsity constraints to single-channel speech enhancement SO SPEECH COMMUNICATION LA English DT Article DE Single-channel speech enhancement; Non-negative spectrogram decomposition; Sparsity constraint; Unsupervised source separation ID NOISE AB We propose an algorithm for single-channel speech enhancement that requires no pre-trained models - neither speech nor noise models - using non-negative spectrogram decomposition with sparsity constraints. To this end, before starting the EM algorithm for spectrogram decomposition, we divide the spectral basis vectors into two disjoint groups - speech and noise groups - and impose sparsity constraints only on those in the speech group as we update the parameters. After the EM algorithm converges, the proposed algorithm successfully separates speech from noise, and no post-processing is required for speech reconstruction. Experiments with various types of real-world noises show that the proposed algorithm achieves performance significantly better than other classical algorithms or comparable to the spectrogram decomposition method using pre-trained noise models. (C) 2013 Elsevier B.V. All rights reserved. C1 Seoul Natl Univ, Grad Sch Convergence Sci & Technol, Mus & Audio Res Grp, Seoul, South Korea. RP Lee, K (reprint author), Seoul Natl Univ, Grad Sch Convergence Sci & Technol, Mus & Audio Res Grp, 1 Gwanak Ro, Seoul, South Korea. EM kglee@snu.ac.kr FU MSIP (Ministry of Science, ICT and Future Planning), Korea, under the ITRC (Information Technology Research Center) [NIPA-2013-H0301-13-4005]; NIPA (National IT Industry Promotion Agency) FX This research was supported by the MSIP (Ministry of Science, ICT and Future Planning), Korea, under the ITRC (Information Technology Research Center) support program (NIPA-2013-H0301-13-4005) supervised by the NIPA (National IT Industry Promotion Agency). CR Brand M, 1999, NEURAL COMPUT, V11, P1155, DOI 10.1162/089976699300016395 Brand M.E., 1999, UNCERTAINTY 99 Duan Z., 2012, P LVA ICA Duan Z., 2012, P INTERSPEECH EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550 Fevotte C, 2009, NEURAL COMPUT, V21, P793, DOI 10.1162/neco.2008.04-08-771 FIELD DJ, 1994, NEURAL COMPUT, V6, P559, DOI 10.1162/neco.1994.6.4.559 Gaussier E., 2005, P 28 ANN INT ACM SIG, P601, DOI DOI 10.1145/1076034.1076148 Hofmann Thomas, 1999, P 15 C UNC ART INT U Hu Y, 2003, IEEE T SPEECH AUDI P, V11, P334, DOI 10.1109/TSA.2003.814458 Joder C., 2012, P LVA ICA Kamath S., 2002, P ICASSP Lauberg H., 2008, P IEEE AS C SIGN SYS Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst Raj B., 2005, P IEEE WORKSH APPL S Rix A., 2012, P ICASSP, P749 Scalart P, 1996, INT CONF ACOUST SPEE, P629, DOI 10.1109/ICASSP.1996.543199 Shashanka M., 2007, P IEEE C AC SPEECH S Shashanka M., 2007, P NIPS Smaragdis P., 2006, P NIPS Smaragdis P., 2008, P IEEE C AC SPEECH S Vincent E, 2006, IEEE T AUDIO SPEECH, V14, P1462, DOI 10.1109/TSA.2005.858005 NR 22 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
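The record above partitions the spectral basis into speech and noise groups and penalises only the speech activations. The sketch below illustrates that partitioned-basis idea with multiplicative-update KL-NMF and an L1 penalty on the speech activations; the paper's actual formulation is an EM/PLCA decomposition, and the group sizes, penalty weight and toy spectrogram are assumptions.

```python
# Illustrative KL-NMF with an L1 sparsity penalty on the activations of the
# "speech" basis group; treat this only as a sketch of the partitioned-basis
# idea, not the paper's EM/PLCA formulation.
import numpy as np

def sparse_group_nmf(V, n_speech=20, n_noise=10, sparsity=0.1, n_iter=100, eps=1e-10):
    F, T = V.shape
    K = n_speech + n_noise
    rng = np.random.default_rng(0)
    W = rng.random((F, K)) + eps
    H = rng.random((K, T)) + eps
    penalty = np.zeros((K, 1))
    penalty[:n_speech] = sparsity            # sparsity only on the speech group
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.T @ np.ones_like(V) + penalty + eps)
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (np.ones_like(V) @ H.T + eps)
        W /= W.sum(axis=0, keepdims=True) + eps
    speech_hat = W[:, :n_speech] @ H[:n_speech]   # reconstructed speech spectrogram
    return speech_hat, W, H

V = np.abs(np.random.randn(257, 200)) + 1e-3      # stand-in magnitude spectrogram
speech_hat, W, H = sparse_group_nmf(V)
```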
PD MAR PY 2014 VL 58 BP 69 EP 80 DI 10.1016/j.specom.2013.11.008 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 293FE UT WOS:000329956000005 ER PT J AU Morrison, GS Lindh, J Curran, JM AF Morrison, Geoffrey Stewart Lindh, Jonas Curran, James M. TI Likelihood ratio calculation for a disputed-utterance analysis with limited available data SO SPEECH COMMUNICATION LA English DT Article DE Forensic; Likelihood ratio; Disputed utterance; Reliability; Hotelling's T-2; Posterior predictive density ID ELEMENTAL COMPOSITION MEASUREMENTS; FORENSIC GLASS EVIDENCE; RELIABILITY AB We present a disputed-utterance analysis using relevant data, quantitative measurements and statistical models to calculate likelihood ratios. The acoustic data were taken from an actual forensic case in which the amount of data available to train the statistical models was small and the data point from the disputed word was far out on the tail of one of the modelled distributions. A procedure based on single multivariate Gaussian models for each hypothesis led to an unrealistically high likelihood ratio value with extremely poor reliability, but a procedure based on Hotelling's T-2 statistic and a procedure based on calculating a posterior predictive density produced more acceptable results. The Hotelling's T-2 procedure attempts to take account of the sampling uncertainty of the mean vectors and covariance matrices due to the small number of tokens used to train the models, and the posterior-predictive-density analysis integrates out the values of the mean vectors and covariance matrices as nuisance parameters. Data scarcity is common in forensic speech science and we argue that it is important not to accept extremely large calculated likelihood ratios at face value, but to consider whether such values can be supported given the size of the available data and modelling constraints. (C) 2013 Elsevier B.V. All rights reserved. C1 [Morrison, Geoffrey Stewart] Univ New S Wales, Sch Elect Engn & Telecommun, Forens Voice Comparison Lab, Sydney, NSW 2052, Australia. [Lindh, Jonas] Univ Gothenburg, Sahlgrenska Acad, Inst Neurosc & Physiol, Dept Clin Neurosci & Rehabil,Div Speech & Languag, SE-40530 Gothenburg, Sweden. [Curran, James M.] Univ Auckland, Dept Stat, Auckland 1142, New Zealand. RP Morrison, GS (reprint author), Univ New S Wales, Sch Elect Engn & Telecommun, Forens Voice Comparison Lab, Sydney, NSW 2052, Australia. EM geoff-morrison@forensic-evaluation.net FU Australian Research Council; Australian Federal Police; New South Wales Police; Queensland Police; National Institute of Forensic Science; Australasian Speech Science and Technology Association; Guardia Civil through Linkage Project [LP100200142]; US National Institute of Justice [2011-DN-BX-K541] FX Morrison's contribution to this research was supported by the Australian Research Council, Australian Federal Police, New South Wales Police, Queensland Police, National Institute of Forensic Science, Australasian Speech Science and Technology Association, and the Guardia Civil through Linkage Project LP100200142. Unless otherwise explicitly attributed, the opinions expressed in this paper are those of the authors and do not necessarily represent the policies or opinions of any of the above mentioned organizations. Curran's contribution to this research was supported by grant 2011-DN-BX-K541 from the US National Institute of Justice.
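The sketch below shows only the baseline single-Gaussian log likelihood ratio the abstract above starts from, the very procedure the paper finds can give unrealistically large values when each hypothesis is trained on few tokens; the Hotelling T-2 and posterior-predictive corrections are not shown. Feature dimensions and toy token sets are assumptions.

```python
# Baseline multivariate-Gaussian log10 likelihood ratio for a disputed feature
# vector; with scarce training tokens this can overstate the evidence, which
# motivates the paper's Hotelling T^2 and posterior-predictive procedures.
import numpy as np
from scipy.stats import multivariate_normal

def log10_lr(x, tokens_h1, tokens_h2):
    """log10 LR for feature vector x under two competing hypotheses, each
    modelled by a multivariate Gaussian fit to its training tokens."""
    def fit(tokens):
        mu = tokens.mean(axis=0)
        cov = np.cov(tokens, rowvar=False)
        return multivariate_normal(mu, cov, allow_singular=True)
    ll1 = fit(tokens_h1).logpdf(x)
    ll2 = fit(tokens_h2).logpdf(x)
    return (ll1 - ll2) / np.log(10.0)

rng = np.random.default_rng(1)
h1 = rng.normal(0.0, 1.0, size=(8, 3))     # few training tokens per hypothesis
h2 = rng.normal(1.0, 1.0, size=(8, 3))
print(log10_lr(rng.normal(0.0, 1.0, size=3), h1, h2))
```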
Points of view in this document are those of the authors and do not necessarily represent the official position or policies of the US Department of Justice. CR Anderson N, 1978, MODERN SPECTRUM ANAL, P252 Bernardo JM, 1994, BAYESIAN THEORY Boersma P., 2008, PRAAT DOING PHONETIC Champod C., 2000, FORENSIC LINGUIST, V7, P238, DOI 10.1558/sll.2000.7.2.238 Curran J. M., 2005, LAW PROBABILITY RISK, V4, P115, DOI [10.1093/1pr/mgi009, DOI 10.1093/LPR/MGI009, 10 1093/Ipr/mgi009] Curran JM, 1997, SCI JUSTICE, V37, P241, DOI 10.1016/S1355-0306(97)72197-X Curran JM, 1997, SCI JUSTICE, V37, P245, DOI 10.1016/S1355-0306(97)72198-1 Curran JM, 2002, SCI JUSTICE, V42, P29, DOI 10.1016/S1355-0306(02)71794-2 Genz A, 2012, MVTNORM MULTIVARIATE Genz A., 2009, LECT NOTES STAT, V195, DOI [DOI 10.1007/978-3-642-01689-9, 10.1007/978-3-642-01689-9] Gosset WS., 1908, BIOMETRIKA, V6, P1, DOI DOI 10.1093/BIOMET/6.1.1 Hotelling H, 1931, ANN MATH STAT, V2, P360, DOI 10.1214/aoms/1177732979 Kaye David H., 2009, CORNELL JL PUB POLY, V19, P145 Lagarias JC, 1998, SIAM J OPTIMIZ, V9, P112, DOI 10.1137/S1052623496303470 Lindh Jonas, 2009, Proceedings of the 2009 5th IEEE International Conference on e-Science (e-Science 2009), DOI 10.1109/e-Science.2009.15 Lundeborg I, 2012, LOGOP PHONIATR VOCO, V37, P117, DOI 10.3109/14015439.2012.664654 MathWorks Inc, 2010, MATL SOFTW REL 2010A Morrison G.S., 2012, P 46 AUD ENG SOC AES, P203 Morrison GS, 2011, SCI JUSTICE, V51, P91, DOI 10.1016/j.scijus.2011.03.002 Nordgaard A, 2012, LAW PROBAB RISK, V11, P1, DOI DOI 10.1093/1PR/MGR020 R Development Core Team, 2013, R LANG ENV STAT COMP Stoel R.D., 2012, HDB RISK THEORY EPIS, P135, DOI DOI 10.1007/978-94-007-1433-5 Villalba J., 2011, P INTERSPEECH, P505 Zhang CL, 2013, J ACOUST SOC AM, V133, pEL54, DOI 10.1121/1.4773223 NR 24 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2014 VL 58 BP 81 EP 90 DI 10.1016/j.specom.2013.11.004 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 293FE UT WOS:000329956000006 ER PT J AU Etz, T Reetz, H Wegener, C Bahlmann, F AF Etz, Tanja Reetz, Henning Wegener, Carla Bahlmann, Franz TI Infant cry reliability: Acoustic homogeneity of spontaneous cries and pain-induced cries SO SPEECH COMMUNICATION LA English DT Article DE Infant cry; Reliability; Acoustic analysis ID AUTISM SPECTRUM DISORDER; NEWBORN-INFANTS; HEARING IMPAIRMENT; SOUND SPECTROGRAM; AGREEMENT; DIAGNOSIS; EXPOSURE; STIMULI; MELODY; RISK AB Infant cries can indicate certain developmental disorders and therefore may be suited for early diagnosis. An open research question is which type of crying (spontaneous, pain-induced) is best suited for infant cry analysis. For estimating the degree of consistency among single cries in an episode of crying, healthy infants were recorded and allocated to the four groups spontaneous cries, spontaneous non-distressed cries, pain-induced cries and pain-induced cries without the first cry after pain stimulus. 19 acoustic parameters were computed and statistically analyzed on their reliability with Krippendorff's Alpha. Krippendorff's Alpha values between 0.184 and 0.779 were reached over all groups. No significant differences between the cry groups were found. However, the non-distressed cries reached the highest alpha values in 16 out of 19 acoustic parameters by trend. 
The results show that the single cries within an infant's episode of crying are not very reliable in general. For the cry types, the non-distressed cry is the one with the best reliability making it the favorite for infant cry analysis. (C) 2013 Elsevier B.V. All rights reserved. C1 [Etz, Tanja; Wegener, Carla] Fresenius Univ Appl Sci Idstein, D-65510 Idstein, Germany. [Etz, Tanja; Reetz, Henning] Goethe Univ Frankfurt, Dept Phonet, D-60325 Frankfurt, Germany. [Bahlmann, Franz] Burgerhosp Frankfurt Main, D-60318 Frankfurt, Germany. RP Etz, T (reprint author), Mullerwies 14, D-65232 Taunusstein, Germany. EM tanja.etz@hs-fresenius.de; reetz@em.uni-frankfurt.de; wegener@hs-fresenius.de; f.bahlmann@buergerhospital-ffm.de CR ANDERSEN N, 1974, GEOPHYSICS, V39, P69, DOI 10.1190/1.1440413 APGAR V, 1953, Curr Res Anesth Analg, V32, P260 Arch-Tirado Emilio, 2004, Cir Cir, V72, P271 Artstein R, 2008, COMPUT LINGUIST, V34, P555, DOI 10.1162/coli.07-034-R2 Barr R.G., 2000, CLIN DEV MED, V152 BLINICK G, 1971, AM J OBSTET GYNECOL, V110, P948 Boersma P, 2009, FOLIA PHONIATR LOGO, V61, P305, DOI 10.1159/000245159 Boersma P., 1993, P I PHONETIC SCI, V17, P97 Boersma Paul, 2013, PRAAT DOING PHONETIC Branco A, 2007, INT J PEDIATR OTORHI, V71, P539, DOI 10.1016/j.ijporl.2006.11.009 Childers D.G., 1978, MODERN SPECTRUM ANAL CORWIN MJ, 1992, PEDIATRICS, V89, P1199 CROWE HP, 1992, CHILD ABUSE NEGLECT, V16, P19, DOI 10.1016/0145-2134(92)90005-C Esposito G, 2013, RES DEV DISABIL, V34, P2717, DOI 10.1016/j.ridd.2013.05.036 Etz T, 2012, FOLIA PHONIATR LOGO, V64, P254, DOI 10.1159/000343994 Fisichelli V.R., 1966, PSYCHON SCI, V6, P195 Fort A, 1998, MED ENG PHYS, V20, P432, DOI 10.1016/S1350-4533(98)00045-9 Furlow FB, 1997, EVOL HUM BEHAV, V18, P175, DOI 10.1016/S1090-5138(97)00006-8 GOLUB HL, 1982, PEDIATRICS, V69, P197 GRAU SM, 1995, J SPEECH HEAR RES, V38, P373 Green JA, 1998, CHILD DEV, V69, P271, DOI 10.1111/j.1467-8624.1998.tb06187.x Hayes Andrew F., 2007, COMMUNICATION METHOD, V1, P77, DOI DOI 10.1080/19312450709336664 IBM, 2011, SPSS STAT SOFTW VERS KARELITZ S, 1962, J PEDIATR-US, V61, P679, DOI 10.1016/S0022-3476(62)80338-2 Krippendorff K., 2003, CONTENT ANAL INTRO I LANDIS JR, 1977, BIOMETRICS, V33, P159, DOI 10.2307/2529310 LESTER BM, 1976, CHILD DEV, V47, P237 Lester BM, 2002, PEDIATRICS, V110, P1182, DOI 10.1542/peds.110.6.1182 Lind J., 1967, ACTA PAEDIATR SCAND, V177, P113 LIND J, 1970, DEV MED CHILD NEUROL, V12, P478 Lind K, 2002, INT J PEDIATR OTORHI, V64, P97, DOI 10.1016/S0165-5876(02)00024-1 Manfredi C, 2009, MED ENG PHYS, V31, P528, DOI 10.1016/j.medengphy.2008.10.003 Michelsson K, 2002, FOLIA PHONIATR LOGO, V54, P190, DOI 10.1159/000063190 MICHELSSON K, 1977, DEV MED CHILD NEUROL, V19, P309 Moller S, 1999, SPEECH COMMUN, V28, P175, DOI 10.1016/S0167-6393(99)00016-3 Nugent JK, 1996, CHILD DEV, V67, P1806, DOI 10.1111/j.1467-8624.1996.tb01829.x PORTER FL, 1986, CHILD DEV, V57, P790, DOI 10.2307/1130355 Press W. 
H., 2002, NUMERICAL RECIPES C, V2nd Rietveld T., 1993, STAT TECHNIQUES STUD Robb MP, 1997, FOLIA PHONIATR LOGO, V49, P35 Runefors P, 2005, FOLIA PHONIATR LOGO, V57, P90, DOI 10.1159/000083570 Runefors P, 2000, ACTA PAEDIATR, V89, P68, DOI 10.1080/080352500750029095 Sheinkopf SJ, 2012, AUTISM RES, V5, P331, DOI 10.1002/aur.1244 SHROUT PE, 1979, PSYCHOL BULL, V86, P420, DOI 10.1037//0033-2909.86.2.420 SIRVIO P, 1976, FOLIA PHONIATR, V28, P161 Thoden C.J., 1980, INFANT COMMUNICATION, P124 THODEN CJ, 1979, DEV MED CHILD NEUROL, V21, P400 Truby H.M., 1965, ACTA PAEDIATR, V54, P8, DOI 10.1111/j.1651-2227.1965.tb09308.x Varallyay G, 2007, INT J PEDIATR OTORHI, V71, P1699, DOI 10.1016/j.ijport.2007.07.005 Verduzco-Mendoza A, 2012, CIR CIR, V80, P3 Vuorenkoski V, 1966, Ann Paediatr Fenn, V12, P174 Wasz-Hockert O., 1968, CLIN DEV MED, V29 Wermke K, 2002, MED ENG PHYS, V24, P501, DOI 10.1016/S1350-4533(02)00061-9 Wermke K, 2011, CLEFT PALATE-CRAN J, V48, P321, DOI 10.1597/09-055 Wolff PH, 1969, DETERMINATS INFANT B, P81 NR 55 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2014 VL 58 BP 91 EP 100 DI 10.1016/j.specom.2013.11.006 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 293FE UT WOS:000329956000007 ER PT J AU Yousefian, N Loizou, PC Hansen, JHL AF Yousefian, Nima Loizou, Philipos C. Hansen, John H. L. TI A coherence-based noise reduction algorithm for binaural hearing aids SO SPEECH COMMUNICATION LA English DT Article DE Binaural signal processing; Coherence function; Noise reduction; SNR estimation; Speech intelligibility ID DATA SPEECH RECOGNITION; ROOM REVERBERATION; ENHANCEMENT; SYSTEMS; MODEL AB In this study, we present a novel coherence-based noise reduction technique and show how it can be employed in binaural hearing aid instruments in order to suppress any potential noise present inside a realistic low reverberant environment. The technique is based on particular assumptions on the spatial properties of the target and undesired interfering signals and suppresses (coherent) interferences without prior statistical knowledge of the noise environment. The proposed algorithm is simple, easy to implement and has the advantage of high performance in coping with adverse signal conditions such as scenarios in which competing talkers are present. The technique was assessed by measurements with normal-hearing subjects and the processed outputs in each ear showed significant improvements in terms of speech intelligibility (measured by an adaptive speech reception threshold (SRT) sentence test) over the unprocessed signals (baseline). In a mildly reverberant room with T-60 = 200, the average improvement in SRT obtained relative to the baseline was approximately 6.5 dB. In addition, the proposed algorithm was found to yield higher intelligibility and quality than those obtained by a well-established interaural time difference (ITD)-based speech enhancement algorithm. These attractive features make the proposed method a potential candidate for future use in commercial hearing aid and cochlear implant devices. (C) 2013 Published by Elsevier B.V. C1 [Yousefian, Nima; Loizou, Philipos C.; Hansen, John H. L.] Univ Texas Dallas, Dept Elect Engn, Richardson, TX 75080 USA. RP Hansen, JHL (reprint author), Univ Texas Dallas, Dept Elect Engn, 800 West Campbell Rd EC33, Richardson, TX 75080 USA. 
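As an illustration of the coherence-driven suppression idea in the abstract above, the sketch below estimates the magnitude-squared coherence between left and right microphone signals with recursive PSD averaging and uses it as a spectral gain, attenuating weakly coherent (noise-dominated) components. The paper's specific coherence function, spatial assumptions and SNR estimator are not reproduced; the smoothing constant, gain floor and stand-in signals are assumptions.

```python
# Sketch of a coherence-driven suppression gain for a two-microphone setup;
# weakly coherent time-frequency components are treated as noise and attenuated.
import numpy as np
from scipy.signal import stft, istft

def coherence_gain_enhance(left, right, fs, alpha=0.9, floor=0.1):
    _, _, L = stft(left, fs=fs, nperseg=256)
    _, _, R = stft(right, fs=fs, nperseg=256)
    Pll = Prr = Plr = None
    gains = np.empty_like(L, dtype=float)
    for m in range(L.shape[1]):                       # recursive PSD averaging
        l, r = L[:, m], R[:, m]
        Pll = np.abs(l) ** 2 if Pll is None else alpha * Pll + (1 - alpha) * np.abs(l) ** 2
        Prr = np.abs(r) ** 2 if Prr is None else alpha * Prr + (1 - alpha) * np.abs(r) ** 2
        Plr = l * np.conj(r) if Plr is None else alpha * Plr + (1 - alpha) * l * np.conj(r)
        msc = np.abs(Plr) ** 2 / (Pll * Prr + 1e-12)  # magnitude-squared coherence
        gains[:, m] = np.maximum(msc, floor)
    _, out_l = istft(gains * L, fs=fs, nperseg=256)
    _, out_r = istft(gains * R, fs=fs, nperseg=256)
    return out_l, out_r

fs = 16000
left, right = np.random.randn(fs), np.random.randn(fs)   # stand-in binaural signals
out_l, out_r = coherence_gain_enhance(left, right, fs)
```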
EM nimayou@utdallas.edu; john.hanse-n@utdallas.edu CR Aarabi P, 2004, IEEE T SYST MAN CY B, V34, P1763, DOI 10.1109/TSMCB.2004.830345 ALLEN JB, 1977, J ACOUST SOC AM, V62, P912, DOI 10.1121/1.381621 Brandstein M., 2001, MICROPHONE ARRAYS SI, V1st Bronkhorst AW, 2000, ACUSTICA, V86, P117 Chen F., 2011, SPEECH COMMUN, V54, P272 Desloge JG, 1997, IEEE T SPEECH AUDI P, V5, P529, DOI 10.1109/89.641298 Doclo S, 2002, IEEE T SIGNAL PROCES, V50, P2230, DOI 10.1109/TSP.2002.801937 Doclo S., 2006, P INT WORKSH AC ECH ERBER NP, 1975, J SPEECH HEAR DISORD, V40, P481 Griffiths L., 1982, IEEE T ANTENN PROPAG, V130, P27 HANSEN JHL, 1991, IEEE T SIGNAL PROCES, V39, P795, DOI 10.1109/78.80901 Harding S, 2006, IEEE T AUDIO SPEECH, V14, P58, DOI 10.1109/TSA.2005.860354 Hawley M.L., 2004, J ACOUST SOC AM, V115, P843 Hu Y, 2008, IEEE T AUDIO SPEECH, V16, P229, DOI 10.1109/TASL.2007.911054 IEEE Subcommittee, 1969, IEEE T AUDIO ELECTRO, V3, P225, DOI DOI 10.1109/TAU.1969.1162058 ITU, 2000, PERC EV SPEECH QUAL Kim C, 2011, INT CONF ACOUST SPEE, P5072 Krishnamurthy N, 2009, IEEE T AUDIO SPEECH, V17, P1394, DOI 10.1109/TASL.2009.2015084 LeBouquinJeannes R, 1997, IEEE T SPEECH AUDI P, V5, P484 LEVITT H, 1971, J ACOUST SOC AM, V49, P467, DOI 10.1121/1.1912375 LIM JS, 1979, P IEEE, V67, P1586, DOI 10.1109/PROC.1979.11540 Loizou PC, 2009, J ACOUST SOC AM, V125, P372, DOI 10.1121/1.3036175 Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst Maj JB, 2006, SPEECH COMMUN, V48, P957, DOI 10.1016/j.specom.2005.12.005 Palomaki KJ, 2004, SPEECH COMMUN, V43, P361, DOI 10.1016/j.specom.2004.03.005 PLOMP R, 1986, J SPEECH HEAR RES, V29, P146 RIX AW, 2001, ACOUST SPEECH SIG PR, P749 Spahr AJ, 2012, EAR HEARING, V33, P112, DOI 10.1097/AUD.0b013e31822c2549 Spriet A, 2007, EAR HEARING, V28, P62, DOI 10.1097/01.aud.0000252470.54246.54 Spriet A, 2004, SIGNAL PROCESS, V84, P2367, DOI 10.1016/j.sigpro.2004.07.028 Srinivasan S, 2006, SPEECH COMMUN, V48, P1486, DOI 10.1016/j.specom.2006.09.003 Van den Bogaert T, 2007, INT CONF ACOUST SPEE, P565 VARGA A, 1993, SPEECH COMMUN, V12, P247, DOI 10.1016/0167-6393(93)90095-3 Yousefian N, 2012, IEEE T AUDIO SPEECH, V20, P599, DOI 10.1109/TASL.2011.2162406 Yousefian N, 2009, INT CONF ACOUST SPEE, P4653, DOI 10.1109/ICASSP.2009.4960668 Yu T, 2009, INT CONF ACOUST SPEE, P213 NR 36 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2014 VL 58 BP 101 EP 110 DI 10.1016/j.specom.2013.11.003 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 293FE UT WOS:000329956000008 ER PT J AU Garcia, JE Ortega, A Miguel, A Lleida, E AF Enrique Garcia, Jose Ortega, Alfonso Miguel, Antonio Lleida, Eduardo TI Low bit rate compression methods of feature vectors for distributed speech recognition SO SPEECH COMMUNICATION LA English DT Article DE Distributed speech recognition; Neural Networks; Multi-layer perceptron; Predictive vector quantizer optimization; IP networks ID PACKET LOSS CONCEALMENT; WORLD-WIDE-WEB; CEPSTRAL PARAMETERS; QUANTIZATION; ROBUST; NETWORKS; CHANNELS AB In this paper, we present a family of compression methods based on differential vector quantization (DVQ) for encoding Mel frequency cepstral coefficients (MFCC) in distributed speech recognition (DSR) applications. The proposed techniques benefit from the existence of temporal correlation across consecutive MFCC frames as well as the presence of intra-frame redundancy. 
We present DVQ schemes based on linear prediction and non-linear methods with multi-layer perceptrons (MLP). In addition to this, we propose the use of a multipath search coding strategy based on the M-algorithm that obtains the sequence of centroids that minimize the quantization error globally instead of selecting the centroids that minimize the quantization error locally in a frame by frame basis. We have evaluated the performance of the proposed methods for two different tasks. On the one hand, two small-size vocabulary databases, Spechdat-Car and Aurora 2, have been considered obtaining negligible degradation in terms of Word Accuracy (around 1%) compared to the unquantized scheme for bit-rates as low as 0.5 kbps. On the other hand, for a large vocabulary task (Aurora 4), the proposed method achieves a WER comparable to the unquantized scheme only with 1.6 kbps. Moreover, we propose a combined scheme (differential/non-differential) that allows the system to present the same sensitivity to transmission errors than previous multi-frame coding proposals for DSR. (C) 2013 Elsevier B.V. All rights reserved. C1 [Enrique Garcia, Jose; Ortega, Alfonso; Miguel, Antonio; Lleida, Eduardo] Univ Zaragoza, Aragon Inst Engn Res I3A, Commun Technol Grp GTC, Zaragoza 50018, Spain. RP Garcia, JE (reprint author), Univ Zaragoza, Elect Engn & Commun Dept, C Maria de Luna 1, Zaragoza 50018, Spain. EM jegarlai@unizar.es; ortega@unizar.es; amiguel@unizar.es; lleida@unizar.es RI Ortega, Alfonso/J-6280-2014; Lleida, Eduardo/K-8974-2014 OI Ortega, Alfonso/0000-0002-3886-7748; Lleida, Eduardo/0000-0001-9137-4013 FU Spanish Government; European Union (FEDER) [TIN2011-28169-C05-02] FX This work has been partially funded by the Spanish Government and the European Union (FEDER) under Project TIN2011-28169-C05-02. 
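A minimal differential vector quantization loop is sketched below: the current MFCC frame is predicted from the previously decoded frame, and the residual is quantized against a small codebook with a nearest-neighbour search. The paper's MLP predictors, trained codebooks and M-algorithm multipath search are omitted; the codebook, frame dimensions and toy data are assumptions.

```python
# Minimal DVQ sketch for MFCC frames: previous-frame prediction plus
# nearest-neighbour quantisation of the residual (closed-loop at the encoder).
import numpy as np

def dvq_encode_decode(frames, codebook):
    """frames: (T, D) MFCC matrix; codebook: (K, D) residual codebook."""
    decoded = np.zeros_like(frames)
    indices = np.zeros(len(frames), dtype=int)
    prev = np.zeros(frames.shape[1])
    for t, frame in enumerate(frames):
        residual = frame - prev                         # differential encoding
        k = np.argmin(np.sum((codebook - residual) ** 2, axis=1))
        indices[t] = k                                  # transmitted index
        decoded[t] = prev + codebook[k]                 # decoder reconstruction
        prev = decoded[t]                               # closed-loop prediction
    return indices, decoded

rng = np.random.default_rng(0)
mfcc = rng.standard_normal((100, 13))                   # stand-in MFCC frames
codebook = rng.standard_normal((256, 13)) * 0.5         # hypothetical 8-bit residual codebook
idx, recon = dvq_encode_decode(mfcc, codebook)
print("mean squared reconstruction error:", np.mean((mfcc - recon) ** 2))
```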
CR [Anonymous], 2003, 202212 ETSI ES [Anonymous], 2002, 202050 ETSI ES [Anonymous], 2000, 201108 ETSI ES Bernard A, 2002, IEEE T SPEECH AUDI P, V10, P570, DOI 10.1109/TSA.2002.808141 Boulis C, 2002, IEEE T SPEECH AUDI P, V10, P580, DOI 10.1109/TSA.2002.804532 DIGALAKIS V, 1998, ACOUST SPEECH SIG PR, P989 Digalakis VV, 1999, IEEE J SEL AREA COMM, V17, P82, DOI 10.1109/49.743698 Flynn R, 2010, DIGIT SIGNAL PROCESS, V20, P1559, DOI 10.1016/j.dsp.2010.03.009 Flynn R, 2012, SPEECH COMMUN, V54, P881, DOI 10.1016/j.specom.2012.03.001 Garcia J.E., 2009, INTERSPEECH, P2587 Garcia JE, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 3 AND 4, P2378 Gomez A., 2004, COST278 ISCA TUT RES Gomez AM, 2009, SPEECH COMMUN, V51, P390, DOI 10.1016/j.specom.2008.12.002 Gomez AM, 2006, IEEE T MULTIMEDIA, V8, P1228, DOI 10.1109/TMM.2006.884611 Hirsch G., 2000, ISCA ITRW ASR 2000 Hirsch G., 2001, ETSI STQ AURORA DSR Hsu W.H., 2004, P IEEE INT C AC SPEE, V1, P69 Huang X., 2001, SPOKEN LANGUAGE PROC James A.B., 2004, P ICSLP 04 James A.B., 2004, P IEEE INT C AC SPEE, V1, P853 Jayant N.S., 1984, PRINCIPLES APPL SPEE JELINEK F, 1971, IEEE T INFORM THEORY, V17, P118, DOI 10.1109/TIT.1971.1054572 Kelleher H., 2002, P ICSLP 02 DENV US Kiss I., 2000, P INT C SPOK LANG PR LINDE Y, 1980, IEEE T COMMUN, V28, P84, DOI 10.1109/TCOM.1980.1094577 LLOYD SP, 1982, IEEE T INFORM THEORY, V28, P129, DOI 10.1109/TIT.1982.1056489 MACKAY, 1992, NEURAL COMPUT, V4, P415 Milner B, 2006, IEEE T AUDIO SPEECH, V14, P223, DOI 10.1109/TSA.2005.852997 Milner B, 2000, PIMRC 2000: 11TH IEEE INTERNATIONAL SYMPOSIUM ON PERSONAL, INDOOR AND MOBILE RADIO COMMUNICATIONS, VOLS 1 AND 2, PROCEEDINGS, P1197 Moreno A., 2000, LREC So S, 2006, SPEECH COMMUN, V48, P746, DOI 10.1016/j.specom.2005.10.002 Pearce D., 2000, P AVIOS 00 SPEECH AP Pearce D., 2004, P COST278 ISCA TUT R Peinado AM, 2005, INT CONF ACOUST SPEE, P329 POTAMIANOS A, 2001, ACOUST SPEECH SIG PR, P269 RAMASWAMY GN, 1998, ACOUST SPEECH SIG PR, P977 Schulzrinne H., 2003, 3550 RFC Segura J., 2007, HIWIRE DATABASE NOIS So S, 2008, ADV PATTERN RECOGNIT, P131, DOI 10.1007/978-1-84800-143-5_7 Srinivasamurthy N, 2006, SPEECH COMMUN, V48, P888, DOI 10.1016/j.specom.2005.11.003 Subramaniam AD, 2003, IEEE T SPEECH AUDI P, V11, P130, DOI 10.1109/TSA.2003.809192 Tan ZH, 2003, ELECTRON LETT, V39, P1619, DOI 10.1049/el:20031026 Tan Z.H., 2004, P IEEE INT C AC SPEE, P57 Tan ZH, 2005, SPEECH COMMUN, V47, P220, DOI 10.1016/j.specom.2005.05.007 Tan ZH, 2007, IEEE T AUDIO SPEECH, V15, P1391, DOI 10.1109/TASL.2006.889799 Vetterli M., 1995, WAVELETS SUBBAND COD Weerackody V, 2002, IEEE T WIREL COMMUN, V1, P282, DOI 10.1109/7693.994822 Young S., 2002, HTK BOOK, P3 NR 48 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD MAR PY 2014 VL 58 BP 111 EP 123 DI 10.1016/j.specom.2013.11.007 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 293FE UT WOS:000329956000009 ER PT J AU Xu, N Tang, YB Bao, JY Jiang, AM Liu, XF Yang, Z AF Xu, Ning Tang, Yibing Bao, Jingyi Jiang, Aiming Liu, Xiaofeng Yang, Zhen TI Voice conversion based on Gaussian processes by coherent and asymmetric training with limited training data SO SPEECH COMMUNICATION LA English DT Article DE Asymmetric training; Coherent training; Gaussian processes; Gaussian mixture model; Voice conversion ID ARTIFICIAL NEURAL-NETWORKS; TRANSFORMATION AB Voice conversion (VC) is a technique aiming to mapping the individuality of a source speaker to that of a target speaker, wherein Gaussian mixture model (GMM) based methods are evidently prevalent. Despite their wide use, two major problems remains to be resolved, i.e., over-smoothing and over-fitting. The latter one arises naturally when the structure of model is too complicated given limited amount of training data. Recently, a new voice conversion method based on Gaussian processes (GPs) was proposed, whose nonparametric nature ensures that the over-fitting problem can be alleviated significantly. Meanwhile, it is flexible to perform non-linear mapping under the framework of GPs by introducing sophisticated kernel functions. Thus this kind of method deserves to be explored thoroughly in this paper. To further improve the performance of the GP-based method, a strategy for mapping prosodic and spectral features coherently is adopted, making the best use of the intercorrelations embedded among both excitation and vocal tract features. Moreover, the accuracy in computing the kernel functions of GP can be improved by resorting to an asymmetric training strategy that allows the dimensionality of input vectors being reasonably higher than that of the output vectors without additional computational costs. Experiments have been conducted to confirm the effectiveness of the proposed method both objectively and subjectively, which have demonstrated that improvements can be obtained by GP-based method compared to the traditional GMM-based approach. (C) 2013 Elsevier B.V. All rights reserved. C1 [Xu, Ning; Tang, Yibing; Jiang, Aiming; Liu, Xiaofeng] Hohai Univ, Coll IoT Engn, Changzhou, Peoples R China. [Xu, Ning; Jiang, Aiming; Liu, Xiaofeng] Hohai Univ, Changzhou Key Lab Robot & Intelligent Technol, Changzhou, Peoples R China. [Xu, Ning] Nanjing Univ Posts & Telecommun, Minist Educ, Key Lab Broadband Wireless Commun & Sensor Networ, Nanjing, Jiangsu, Peoples R China. [Bao, Jingyi] Changzhou Inst Technol, Sch Elect Informat & Elect Engn, Changzhou, Peoples R China. [Yang, Zhen] Nanjing Univ Posts & Telecommun, Coll Telecommun & Informat Engn, Nanjing, Jiangsu, Peoples R China. RP Xu, N (reprint author), Hohai Univ, Coll IoT Engn, Changzhou, Peoples R China. 
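The sketch below shows plain Gaussian-process regression with an RBF kernel mapping aligned source frames to one target feature dimension, i.e. the nonparametric mapping the abstract above builds on. The coherent and asymmetric training strategies, kernel choices and hyperparameters of the paper are not reproduced; the toy data and values below are assumptions.

```python
# Plain GP regression sketch for a source-to-target spectral mapping
# (single output dimension); predictive mean only.
import numpy as np

def rbf(A, B, length=1.0, var=1.0):
    d2 = np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :] - 2 * A @ B.T
    return var * np.exp(-0.5 * d2 / length ** 2)

def gp_predict(X_train, y_train, X_test, noise=1e-2):
    K = rbf(X_train, X_train) + noise * np.eye(len(X_train))
    alpha = np.linalg.solve(K, y_train)
    return rbf(X_test, X_train) @ alpha            # GP predictive mean

rng = np.random.default_rng(0)
src = rng.standard_normal((200, 24))               # aligned source feature frames
tgt = src @ rng.standard_normal(24) + 0.1 * rng.standard_normal(200)  # one target dimension
pred = gp_predict(src[:150], tgt[:150], src[150:])
print("test MSE:", np.mean((pred - tgt[150:]) ** 2))
```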
EM xuningdlts@gmail.com; tangyb@hhuc.edu.cn; baojy@czu.cn; jiangam@hhuc.edu.cn; liuxf@hhuc.edu.cn; yangz@njupt.edu.cn FU National Natural Science Foundation of China [11274092, 61271335]; Fundamental Research Funds for the Central Universities [2011B11114, 2011B11314, 2012B07314, 2012B04014]; National Natural Science Foundation for Young Scholars of China [61101158, 61201301, 31101643]; Jiangsu Province Natural Science Foundation for Young Scholars of China [BK20130238]; Key Lab of Broadband Wireless Communication and Sensor Network Technology (Nanjing University of Posts and Telecommunications), Ministry of Education [NYKL201305] FX The work is supported in part by the Grant from the National Natural Science Foundation of China (11274092, 61271335), the Grant from the Fundamental Research Funds for the Central Universities (2011B11114, 2011B11314, 2012B07314, 2012B04014), the Grant from the National Natural Science Foundation for Young Scholars of China (61101158, 61201301, 31101643), the Grant from the Jiangsu Province Natural Science Foundation for Young Scholars of China (BK20130238), and the open research fund of Key Lab of Broadband Wireless Communication and Sensor Network Technology (Nanjing University of Posts and Telecommunications), Ministry of Education (NYKL201305). CR Abe M., 1998, P IEEE INT C AC SPEE, P655 Arslan LM, 1999, SPEECH COMMUN, V28, P211, DOI 10.1016/S0167-6393(99)00015-1 Bishop C. M., 2006, PATTERN RECOGNITION Chatzis SP, 2012, IEEE T NEUR NET LEAR, V23, P1862, DOI 10.1109/TNNLS.2012.2217986 Chen Y, 2003, P EUR, P2413 Desai S, 2010, IEEE T AUDIO SPEECH, V18, P954, DOI 10.1109/TASL.2010.2047683 Erro D., 2007, P 6 ISCA WORKSH SPEE, P194 Erro D, 2010, IEEE T AUDIO SPEECH, V18, P922, DOI 10.1109/TASL.2009.2038663 Erro D., 2008, THESIS U POLITECNICA Helander E, 2008, INT CONF ACOUST SPEE, P4669, DOI 10.1109/ICASSP.2008.4518698 Helander E, 2012, IEEE T AUDIO SPEECH, V20, P806, DOI 10.1109/TASL.2011.2165944 Kain A., 2001, THESIS OREGON HLTH S Lee KS, 2007, IEEE T AUDIO SPEECH, V15, P641, DOI 10.1109/TASL.2006.876760 MAKHOUL J, 1975, P IEEE, V63, P561, DOI 10.1109/PROC.1975.9792 Marume M., 2007, TECHNICAL REPORT IEI, V107, P103 NARENDRANATH M, 1995, SPEECH COMMUN, V16, P207, DOI 10.1016/0167-6393(94)00058-I Pilkington N., 2011, P INTERSPEECH, P2761 Rabiner L.R., 2009, THEORY APPL DIGITAL Rasmussen CE, 2005, ADAPT COMPUT MACH LE, P1 Snelson E.L., 2007, THESIS U LONDON Stylianou Y, 2009, INT CONF ACOUST SPEE, P3585, DOI 10.1109/ICASSP.2009.4960401 Stylianou Y, 1998, IEEE T SPEECH AUDI P, V6, P131, DOI 10.1109/89.661472 Stylianou Y., 1996, THESIS ECOLE NATL SU Toda T, 2008, SPEECH COMMUN, V50, P215, DOI 10.1016/j.specom.2007.09.001 Toda T, 2007, IEEE T AUDIO SPEECH, V15, P2222, DOI 10.1109/TASL.2007.907344 Xu Ning, 2010, Journal of Nanjing University of Posts and Telecommunications, V30 Ye H, 2006, IEEE T AUDIO SPEECH, V14, P1301, DOI 10.1109/TSA.2005.860839 NR 27 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
PD MAR PY 2014 VL 58 BP 124 EP 138 DI 10.1016/j.specom.2013.11.005 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 293FE UT WOS:000329956000010 ER PT J AU Mariooryad, S Busso, C AF Mariooryad, Soroosh Busso, Carlos TI Compensating for speaker or lexical variabilities in speech for emotion recognition SO SPEECH COMMUNICATION LA English DT Article DE Emotion recognition; Factor analysis; Feature normalization; Speaker variability ID EXPRESSION; LEVEL AB Affect recognition is a crucial requirement for future human machine interfaces to effectively respond to nonverbal behaviors of the user. Speech emotion recognition systems analyze acoustic features to deduce the speaker's emotional state. However, human voice conveys a mixture of information including speaker, lexical, cultural, physiological and emotional traits. The presence of these communication aspects introduces variabilities that affect the performance of an emotion recognition system. Therefore, building robust emotional models requires careful considerations to compensate for the effect of these variabilities. This study aims to factorize speaker characteristics, verbal content and expressive behaviors in various acoustic features. The factorization technique consists in building phoneme level trajectory models for the features. We propose a metric to quantify the dependency between acoustic features and communication traits (i.e., speaker, lexical and emotional factors). This metric, which is motivated by the mutual information framework, estimates the uncertainty reduction in the trajectory models when a given trait is considered. The analysis provides important insights on the dependency between the features and the aforementioned factors. Motivated by these results, we propose a feature normalization technique based on the whitening transformation that aims to compensate for speaker and lexical variabilities. The benefit of employing this normalization scheme is validated with the presented factor analysis method. The emotion recognition experiments show that the normalization approach can attenuate the variability imposed by the verbal content and speaker identity, yielding 4.1% and 2.4% relative performance improvements on a selected set of features, respectively. (C) 2013 Elsevier B.V. All rights reserved. C1 [Mariooryad, Soroosh; Busso, Carlos] Univ Texas Dallas, Dept Elect Engn, Multimodal Signal Proc MSP Lab, Richardson, TX 75083 USA. RP Busso, C (reprint author), Univ Texas Dallas, Dept Elect Engn, Multimodal Signal Proc MSP Lab, 800 West Campbell Rd, Richardson, TX 75083 USA. EM busso@utdallas.edu FU Samsung Telecommunications America; US National Science Foundation [IIS 1217104, IIS: 1329659] FX This study was funded by Samsung Telecommunications America and the US National Science Foundation under Grants IIS 1217104 and IIS: 1329659. CR Batliner Anton, 2010, Advances in Human-Computing Interaction, DOI 10.1155/2010/782802 Busso Carlos, 2007, IEEE 9th Workshop on Multimedia Signal Processing, 2007. MMSP 2007 Busso C, 2009, IEEE T AUDIO SPEECH, V17, P582, DOI 10.1109/TASL.2008.2009578 Busso C, 2011, INT CONF ACOUST SPEE, P5692 Busso C., 2006, 7 INT SEM SPEECH PRO, P549 BUSSO C, 2008, J LANGUAGE RESOURCES, V42, P335 Busso C., 2007, INTERSPEECH EUROSPEE, P2225 CHAPPELL DT, 1998, ACOUST SPEECH SIG PR, P885 Chauhan R, 2011, COMM COM INF SC, V168, P359 Cover T. M., 2006, ELEMENTS INFORM THEO, V2nd Dawes R. 
M., 2001, RATIONAL CHOICE UNCE Dehak N., 2011, INTERSPEECH, P857 Dehak N., 2009, INTERSPEECH, P1559 Eyben Florian, 2010, ACM INT C MULT MM 20, V25, P1459 Fu L, 2008, PAC AS WORKSH COMP I, P140 Gish H., 1993, ICASSP-93. 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing (Cat. No.92CH3252-4), DOI 10.1109/ICASSP.1993.319337 Gish H, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P466 Gong YF, 1997, IEEE T SPEECH AUDI P, V5, P33 Hall M, 2009, ACM SIGKDD EXPLORATI, V11, P10, DOI DOI 10.1145/1656274.1656278 Hansen JHL, 1996, IEEE T SPEECH AUDI P, V4, P307, DOI 10.1109/89.506935 HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423 Kenny P, 2007, IEEE T AUDIO SPEECH, V15, P1448, DOI 10.1109/TASL.2007.894527 Lee C. M., 2004, 8 INT C SPOK LANG PR, V1, P889 Lee S., 2005, 9 EUR C SPEECH COMM, P497 Lei X, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P1237 Li M, 2012, 2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), P1937 Mariooryad S, 2012, IEEE IMAGE PROC, P2605, DOI 10.1109/ICIP.2012.6467432 Mariooryad S, 2013, IEEE T AFFECT COMPUT, V4, P183, DOI 10.1109/T-AFFC.2013.11 Mariooryad S., 2013, IEEE INT C AUT FAC G Metallinou A, 2010, INT CONF ACOUST SPEE, P2462, DOI 10.1109/ICASSP.2010.5494890 Metallinou A, 2012, 2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), P2401 Metallinou A, 2010, INT CONF ACOUST SPEE, P2474, DOI 10.1109/ICASSP.2010.5494893 Mower E., 2009, INT C AFF COMP INT I Park S., 2010, INT J AERONAUT SPACE, V11, P327 Prince SJD, 2008, IEEE T PATTERN ANAL, V30, P970, DOI 10.1109/TPAMI.2008.48 RABINER LR, 1975, AT&T TECH J, V54, P297 Rahman T, 2012, 2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), P5117 Schuller B., 2006, 3 INT C SPEECH PROS Schuller B., 2011, 12 ANN C INT SPEECH, P3201 Schuller B, 2011, SPEECH COMMUN, V53, P1062, DOI 10.1016/j.specom.2011.01.011 Schuller B, 2010, IEEE T AFFECT COMPUT, V1, P119, DOI 10.1109/T-AFFC.2010.8 Sethu V., 2007, 15 INT C DIG SIGN PR, P611 Shriberg E, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P609 Tenenbaum JB, 2000, NEURAL COMPUT, V12, P1247, DOI 10.1162/089976600300015349 Vlasenko B., 2011, 12 ANN C INT SPEECH, P1577 Vlasenko B., 2011, IEEE INT C MULT EXP Vlasenko B, 2007, LECT NOTES COMPUT SC, V4738, P139 Wollmer M, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P597 Wu W, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P2102 Xia R., 2012, INT 2012 PORTL OR US, P2230 Yildirim S., 2004, 8 INT C SPOK LANG PR, P2193 Zeng ZH, 2008, IEEE T MULTIMEDIA, V10, P570, DOI 10.1109/TMM.2008.921737 NR 52 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
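The normalization strategy described in the record above is based on a whitening transformation; the sketch below shows per-speaker ZCA whitening of acoustic feature vectors, which centres and decorrelates features with that speaker's own statistics and thereby attenuates speaker-specific variability. The trajectory models, metric and exact normalization recipe of the paper are not reproduced; dimensions and toy data are assumptions.

```python
# Sketch of per-speaker whitening: features from one speaker are centred and
# decorrelated with that speaker's own covariance before emotion classification.
import numpy as np

def whiten(features, eps=1e-6):
    """features: (N, D) matrix of one speaker's acoustic feature vectors."""
    centered = features - features.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)
    W = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T    # ZCA whitening matrix
    return centered @ W

rng = np.random.default_rng(0)
speaker_feats = rng.multivariate_normal([1.0, -2.0], [[4.0, 1.5], [1.5, 1.0]], 500)
z = whiten(speaker_feats)
print(np.round(np.cov(z, rowvar=False), 2))        # approximately the identity matrix
```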
PD FEB PY 2014 VL 57 BP 1 EP 12 DI 10.1016/j.specom.2013.07.011 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 268OJ UT WOS:000328180100001 ER PT J AU Chappel, R Paliwal, K AF Chappel, Roger Paliwal, Kuldip TI An educational platform to demonstrate speech processing techniques on Android based smart phones and tablets SO SPEECH COMMUNICATION LA English DT Article DE Digital signal processing (DSP); Electronic engineering education; Speech enhancement; Android; Smart phone ID SPECTRAL AMPLITUDE ESTIMATOR; ENHANCEMENT ALGORITHMS; PHASE; RECOGNITION; TECHNOLOGY; COURSES AB This work highlights the need to adapt teaching methods in digital signal processing (DSP) for speech to suit shifts in generational learning behavior; furthermore, it suggests integrating theory into a practical smart phone or tablet application as a means to bridge the gap between traditional teaching styles and current learning styles. The application presented here is called "Speech Enhancement for Android (SEA)" and aims at assisting in the development of an intuitive understanding of course content by allowing students to interact with theoretical concepts through their personal device. SEA not only allows the student to interact with speech processing methods, but also enables the student to interact with their surrounding environment by recording and processing their own voice. A case study of students studying DSP for speech processing found that using SEA as an additional learning tool enhanced their understanding and helped to motivate them to engage in course work by giving them ready access to interactive content on a hand-held device. This paper describes the platform in detail, acting as a road-map for educational institutions, and shows how it can be integrated into a DSP-based speech processing education framework. (C) 2013 Elsevier B.V. All rights reserved. C1 [Chappel, Roger; Paliwal, Kuldip] Griffith Univ, Sch Engn, Signal Proc Lab, Brisbane, Qld 4111, Australia. RP Chappel, R (reprint author), Griffith Univ, Sch Engn, Signal Proc Lab, Nathan Campus, Brisbane, Qld 4111, Australia. EM roger.chappel@giriffithuni.edu.au CR Acero A., 2001, SPOKEN LANGUAGE PROC Allen I.E., 2007, STAT ROUND TABLE QUA, P64 Alsteris L., 2004, P IEEE INT C AC SPEE, V1, P573 ARM, 2009, WHIT PAP ARM CORT A9 ARMSTRONG RL, 1987, PERCEPT MOTOR SKILL, V64, P359 Berouti M., 1979, P IEEE INT C AC SPEE, V4, P208 Billings DM, 2005, J PROF NURS, V21, P126, DOI 10.1016/j.profnurs.2005.01.002 Caillaud B, 2000, EUR ECON REV, V44, P1091, DOI 10.1016/S0014-2921(00)00047-7 Chassaing R., 2002, DSP APPL USING C TMS CROCHIERE RE, 1981, AT&T TECH J, V60, P1633 Deci E. L., 1985, INTRINSIC MOTIVATION Deci EL, 2001, REV EDUC RES, V71, P1, DOI 10.3102/00346543071001001 Donath J, 2004, BT TECHNOL J, V22, P71, DOI 10.1023/B:BTTJ.0000047585.06264.cc EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 Finkelstein J, 2010, SECOND INTERNATIONAL CONFERENCE ON MOBILE, HYBRID, AND ON-LINE LEARNING (ELML 2010), P77, DOI 10.1109/eLmL.2010.36 Freeman D.K., 1989, IEEE INT C AC SPEECH, V1, P369 Gartner, 2011, FOR MED TABL OP SYST Gartner, 2011, MARK SHAR MOB COMM D Gonvalves F., 2001, IEEE POW EL SPEC C, V1, P85 Hassanlou K., 2009, TECHNOL SOC, V31, P125 Hayes M.
H., 1996, STAT DIGITAL SIGNAL, V1st Haykin S., 1991, INFORM SYSTEM SCI SE, V2 Helmholtz H., 1954, SENSATIONS TONE PHYS Hu Y, 2007, SPEECH COMMUN, V49, P588, DOI 10.1016/j.specom.2006.12.006 Hu Y, 2006, INT CONF ACOUST SPEE, P153 Huang X., 2001, SPOKEN LANGUAGE PROC Humphreys L, 2010, NEW MEDIA SOC, V12, P763, DOI 10.1177/1461444809349578 JACKSON J, 2001, ACOUST SPEECH SIG PR, P2721 Joiner R, 2006, COMPUT HUM BEHAV, V22, P67, DOI 10.1016/j.chb.2005.01.001 Kay S. M., 1993, SIGNAL PROCESSING SE Ko Y., 2003, P 33 ANN FRONT ED FI, V1, pT3E Kong JSL, 2012, INFORM MANAGE-AMSTER, V49, P1, DOI 10.1016/j.im.2011.10.004 Liu J., 2011, FRONT ED C FIE OCT, pF2G Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst Lubis MA, 2010, PROCD SOC BEHV, V9, DOI 10.1016/j.sbspro.2010.12.285 Maiti A., 2011, 2011 IEEE EUROCON IN, P1 MAKHOUL J, 1975, P IEEE, V63, P561, DOI 10.1109/PROC.1975.9792 MARTIN R, 1994, P EUSPICO, V2, P1182 Martin S, 2011, COMPUT EDUC, V57, P1893, DOI 10.1016/j.compedu.2011.04.003 Milanesi C., 2011, IPAD FUTURE TABLET M MILNER B, 2002, ACOUST SPEECH SIG PR, P797 Mitra S., 2006, DIGITAL SIGNAL PROCE, V3 Nussbaumer H., 1981, FAST FOURIER TRANSFO, V2 [NVIDIA NVIDIA Corporation], 2010, BEN MULT CPU COR MOB O'Neil H. F., 2003, TECHNOLOGY APPL ED L Ohm G. S., 1843, ANN PHYS CHEM, V59, P513 Oppenheim A. V., 1979, ICASSP 79. 1979 IEEE International Conference on Acoustics, Speech and Signal Processing OPPENHEIM AV, 1981, P IEEE, V69, P529, DOI 10.1109/PROC.1981.12022 Paliwal K. K., 2003, P EUR 2003, P2117 Alsteris LD, 2006, SPEECH COMMUN, V48, P727, DOI 10.1016/j.specom.2005.10.005 PICONE JW, 1993, P IEEE, V81, P1215, DOI 10.1109/5.237532 Potts J, 2011, IEEE SOUTHEASTCON, P293, DOI 10.1109/SECON.2011.5752952 Price S, 2011, INTERACT COMPUT, V23, P499, DOI 10.1016/j.intcom.2011.06.003 Price S, 2003, INTERACT COMPUT, V15, P169, DOI 10.1016/S0953-5438(03)00006-7 Proakis J., 2007, DIGITAL SIGNAL PROCE Quatieri T. F., 2002, DISCRETE TIME SPEECH Ranganath S., 2012, ASEE ANN C JUN Rienties B, 2009, COMPUT HUM BEHAV, V25, P1195, DOI 10.1016/j.chb.2009.05.012 Rosenbaum E, 2008, SOC SCI RES, V37, P350, DOI 10.1016/j.ssresearch.2007.03.003 Rosenberg M., 1976, AUTOMATIC SPEAKER VE, V64, P475 Salajan FD, 2010, COMPUT EDUC, V55, P1393, DOI 10.1016/j.compedu.2010.06.017 Shi GJ, 2006, IEEE T AUDIO SPEECH, V14, P1867, DOI 10.1109/TSA.2005.858512 Sims R, 1997, COMPUT HUM BEHAV, V13, P157, DOI 10.1016/S0747-5632(97)00004-6 Sohn J, 1999, IEEE SIGNAL PROC LET, V6, P1 Soliman S., 1990, INFORM SYSTEM SCI SE, V2 Soloway E, 2001, COMMUN ACM, V44, P15, DOI 10.1145/376134.376140 Song J., 2012, AUSTRALASIAN MARKETI, V20, P80 Srinivasan K., 1993, IEEE WORKSH SPEECH C, P85 Stark A, 2011, SPEECH COMMUN, V53, P51, DOI 10.1016/j.specom.2010.08.001 Steinfield Charles, 2007, J COMPUT-MEDIAT COMM, V12, P1143, DOI DOI 10.1111/J.1083-6101.2007.00367.X Talor F., 1983, DIGITAL FILTER DESIG Teng C.-C., 2010, 7 INT C INF TECHN NE, P471 Virag N, 1999, IEEE T SPEECH AUDI P, V7, P126, DOI 10.1109/89.748118 WANG DL, 1982, IEEE T ACOUST SPEECH, V30, P679, DOI 10.1109/TASSP.1982.1163920 Wellman B, 2001, INT J URBAN REGIONAL, V25, P227, DOI 10.1111/1468-2427.00309 White M.-A., 1989, Education & Computing, V5, DOI 10.1016/S0167-9287(89)80003-2 Wojcicki KK, 2007, INT CONF ACOUST SPEE, P729 Yerushalmy M., 2004, MOBILE PHONES ED CAS NR 79 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD FEB PY 2014 VL 57 BP 13 EP 38 DI 10.1016/j.specom.2013.08.002 PG 26 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 268OJ UT WOS:000328180100002 ER PT J AU Wu, L Wan, CY Xiao, K Wang, SP Wan, MX AF Wu, Liang Wan, Congying Xiao, Ke Wang, Supin Wan, Mingxi TI Evaluation of a method for vowel-specific voice source control of an electrolarynx using visual information SO SPEECH COMMUNICATION LA English DT Article DE Electrolarynx; Intelligibility evaluation; Visual information; Vowel-specific voice source ID SPEECH-PERCEPTION; LARYNGECTOMY; COMMUNICATION AB The electrolarynx (EL) is a widely used device for alaryngeal communication, but its low voice quality seriously reduces the intelligibility of EL speech. To improve EL speech quality, a vowel-specific voice source based on visual information about lip shape and movement and on an artificial neural network (ANN) is implemented in an experimental EL (SGVS-EL) system in real time. Five volunteers (one laryngectomee and four normal speakers) participated in the experimental evaluation of the method and SGVS-EL system. Using the ANN, participants achieved high vowel precision, with identification rates above 90% after training. The results of voicing control indicated that all subjects using SGVS-EL could achieve good vowel control performance in real time, although control errors still occurred frequently at voice initiation. However, these control errors had no significant impact on the perception of SGVS-EL speech. Intelligibility evaluation demonstrated that vowels and words produced using the SGVS-EL were more intelligible than those spoken with a commercial EL, by 30% for vowels and 18% for words. Using a controlled vowel-specific voice source is thus a feasible and effective way to improve EL speech quality and word intelligibility. (C) 2013 Elsevier B.V. All rights reserved. C1 [Wu, Liang; Wan, Congying; Xiao, Ke; Wang, Supin; Wan, Mingxi] Xi An Jiao Tong Univ, Key Lab Biomed Informat Engn, Minist Educ, Dept Biomed Engn,Sch Life Sci & Technol, Xian 710049, Peoples R China. RP Wang, SP (reprint author), 28 Xianning West Rd, Xian 710049, Shaanxi, Peoples R China. EM spwang@mail.xjtu.edu.cn; mxwan@mail.xjtu.edu.cn FU National Natural Science Foundation of China [11274250, 61271087]; Research Fund for the Doctoral Program of Higher Education of China [20120201110049] FX This work was supported by the National Natural Science Foundation of China under Grants 11274250 and 61271087, and the Research Fund for the Doctoral Program of Higher Education of China under Grant 20120201110049. The authors would like to express special appreciation to Ye Wenyuan and Rui Baozhen for their work in the experiments.
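As a rough illustration of the visual vowel-identification step described in the abstract above, the sketch below trains a small feed-forward network on hypothetical lip-shape features (mouth width, height and area). The feature set, network size and synthetic data are assumptions for illustration, not the authors' configuration.

import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)

# Hypothetical lip-shape feature vectors (width, height, area) for five vowels.
def synth_features(center, n=200):
    return rng.normal(loc=center, scale=0.05, size=(n, 3))

vowel_centers = {"a": [0.8, 0.7, 0.6], "e": [0.7, 0.4, 0.4], "i": [0.6, 0.2, 0.2],
                 "o": [0.4, 0.5, 0.3], "u": [0.3, 0.3, 0.15]}
X = np.vstack([synth_features(c) for c in vowel_centers.values()])
y = np.repeat(list(vowel_centers.keys()), 200)

# Small multilayer perceptron standing in for the ANN that selects the voice source.
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0)
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
print("predicted vowel for one new frame:", clf.predict([[0.62, 0.22, 0.2]])[0])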
CR Carlson R., 1975, AUDITORY ANAL PERCEP, P55 Carr MM, 2000, OTOLARYNG HEAD NECK, V122, P39, DOI 10.1016/S0194-5998(00)70141-0 Cho T, 1999, J PHONETICS, V27, P207, DOI 10.1006/jpho.1999.0094 Clements KS, 1997, ARCH OTOLARYNGOL, V123, P493 Dupont S, 2000, IEEE T MULTIMEDIA, V2, P141, DOI 10.1109/6046.865479 Evitts PM, 2010, J COMMUN DISORD, V43, P92, DOI 10.1016/j.jcomdis.2009.10.002 Goldstein EA, 2007, J SPEECH LANG HEAR R, V50, P335, DOI 10.1044/1092-4388(2007/024) Goldstein EA, 2004, IEEE T BIO-MED ENG, V51, P325, DOI 10.1109/TBME.2003.820373 Hillman R E, 1998, Ann Otol Rhinol Laryngol Suppl, V172, P1 MASSARO DW, 1983, J EXP PSYCHOL HUMAN, V9, P753, DOI 10.1037/0096-1523.9.5.753 Meltzner G.S., 2003, THESIS MIT Meltzner G.S., 2005, ELECTROLARYNGEAL SPE, P571 MORRIS HL, 1992, ANN OTO RHINOL LARYN, V101, P503 Neti C., 2000, AUD VIS SPEECH REC F, P764 Pawar PV, 2008, J CANC RES THER, V4, P186 QI YY, 1991, J SPEECH HEAR RES, V34, P1250 Stepp CE, 2009, IEEE T NEUR SYS REH, V17, P146, DOI 10.1109/TNSRE.2009.2017805 SUMMERFIELD Q, 1992, PHILOS T ROY SOC B, V335, P71, DOI 10.1098/rstb.1992.0009 Takahashi H, 2008, J VOICE, V22, P420, DOI 10.1016/j.jvoice.2006.10.004 TRAUNMULLER H, 1987, SPEECH COMMUN, V6, P143, DOI 10.1016/0167-6393(87)90037-9 Uemi N., 1995, Japanese Journal of Medical Electronics and Biological Engineering, V33 UMEDA N, 1977, J ACOUST SOC AM, V61, P846, DOI 10.1121/1.381374 Wan C.Y., 2012, J VOICE, V26 WEISS MS, 1979, J ACOUST SOC AM, V65, P1298, DOI 10.1121/1.382697 Wu L, 2013, IEEE T BIO-MED ENG, V60, P1965, DOI 10.1109/TBME.2013.2246789 Wu L., 2013, J VOICE, V27, p259e7 NR 26 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2014 VL 57 BP 39 EP 49 DI 10.1016/j.specom.2013.09.006 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 268OJ UT WOS:000328180100003 ER PT J AU Norouzian, A Rose, R AF Norouzian, Atta Rose, Richard TI An approach for efficient open vocabulary spoken term detection SO SPEECH COMMUNICATION LA English DT Article DE Spoken term detection; Automatic speech recognition; Index ID RECOGNITION; SEARCH; SPEECH AB A hybrid two-pass approach for facilitating fast and efficient open vocabulary spoken term detection (STD) is presented in this paper. A large vocabulary continuous speech recognition (LVCSR) system is deployed for producing word lattices from audio recordings. An index construction technique is used for facilitating very fast search of lattices for finding occurrences of both in vocabulary (IV) and out of vocabulary (OOV) query terms. Efficient search for query terms is performed in two passes. In the first pass, a subword approach is used for identifying audio segments that are likely to contain occurrences of the IV and OOV query terms from the index. A more detailed subword based search is performed in the second pass for verifying the occurrence of the query terms in the candidate segments. The performance of this STD system is evaluated in an open vocabulary STD task defined on a lecture domain corpus. It is shown that the indexing method presented here results in an index that is nearly two orders of magnitude smaller than the LVCSR lattices while preserving most of the information relevant for STD. Furthermore, despite using word lattices for constructing the index, 67% of the segments containing occurrences of the OOV query terms are identified from the index in the first pass. 
Finally, it is shown that the detection performance of the subword based term detection performed in the second pass has the effect of reducing the performance gap between OOV and IV query terms. (C) 2013 Elsevier B.V. All rights reserved. C1 [Norouzian, Atta; Rose, Richard] McGill Univ, Montreal, PQ, Canada. RP Norouzian, A (reprint author), McGill Univ, Montreal, PQ, Canada. EM atta.norouzian@mail.mcgill.ca CR Allauzen C., 2004, WORKSH INT APPR SPEE, P33 Bisani M, 2008, SPEECH COMMUN, V50, P434, DOI 10.1016/j.specom.2008.01.002 Brno University Super Lectures, 2012, SUP LECT Can D, 2011, IEEE T AUDIO SPEECH, V19, P2338, DOI 10.1109/TASL.2011.2134087 Chaudhari UV, 2012, IEEE T AUDIO SPEECH, V20, P1633, DOI 10.1109/TASL.2012.2186805 Chelba C., 2005, P 43 ANN M ASS COMP, P443, DOI 10.3115/1219840.1219895 Chen YN, 2011, INT CONF ACOUST SPEE, P5644 COOL, 2012, COURS ONL Hain T., 2008, P NIST RT07 WORKSH Hori T., 2007, INT C AC SPEECH SIGN, V4, pIV Iwata K, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P2195 Jansen A, 2011, INT CONF ACOUST SPEE, P5180 Jansen A., 2009, 10 ANN C INT SPEECH Koumpis K, 2005, IEEE SIGNAL PROC MAG, V22, P61, DOI 10.1109/MSP.2005.1511824 Mamou J., 2006, 29 ANN INT C RES DEV, P51 Mangu L, 2000, COMPUT SPEECH LANG, V14, P373, DOI 10.1006/csla.2000.0152 Manning C. D., 1999, FDN STAT NATURAL LAN Microsoft MAVIS, 2012, MAV Miller D., 2007, 8 ANN C INT SPEECH C, P314 Norouzian A., 2013, INT C AC SPEECH SIGN Norouzian A., 2013, 14 ANN C INT SPEECH Norouzian A, 2012, 2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), P5169 Norouzian A., 2010, WORKSH SPOK LANG TEC, P194 Norouzian A., 2012, 13 ANN C INT SPEECH Rose R, 2010, INT CONF ACOUST SPEE, P5282, DOI 10.1109/ICASSP.2010.5494982 Saraclar M., 2004, HUM LANG TECHN C N A Schwarz P, 2004, LECT NOTES COMPUT SC, V3206, P465 Siohan O., 2005, 9 EUR C SPEECH COMM Stolcke A., 2002, P INT C SPOK LANG PR, V2, P901 Szoke I, 2008, LECT NOTES COMPUT SC, V4892, P237 Szoke I., 2007, WORKSH SIGN PROC APP, P1 Tu T.-W., 2011, AUT SPEECH REC UND W, P383 Wang D, 2010, INT CONF ACOUST SPEE, P5294, DOI 10.1109/ICASSP.2010.5494968 Yu P, 2005, INT CONF ACOUST SPEE, P481 NR 34 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2014 VL 57 BP 50 EP 62 DI 10.1016/j.specom.2013.09.002 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 268OJ UT WOS:000328180100004 ER PT J AU Szekely, E Ahmed, Z Hennig, S Cabral, JP Carson-Berndsen, J AF Szekely, Eva Ahmed, Zeeshan Hennig, Shannon Cabral, Joao P. Carson-Berndsen, Julie TI Predicting synthetic voice style from facial expressions. An application for augmented conversations SO SPEECH COMMUNICATION LA English DT Article DE Expressive speech synthesis; Facial expressions; Multimodal application; Augmentative and alternative communication ID SPEECH SYNTHESIS; RESPONSES; EMOTIONS AB The ability to efficiently facilitate social interaction and emotional expression is an important, yet unmet requirement for speech generating devices aimed at individuals with speech impairment. Using gestures such as facial expressions to control aspects of expressive synthetic speech could contribute to an improved communication experience for both the user of the device and the conversation partner. 
For this purpose, a mapping model between facial expressions and speech is needed, that is high level (utterance-based), versatile and personalisable. In the mapping developed in this work, visual and auditory modalities are connected based on the intended emotional salience of a message: the intensity of facial expressions of the user to the emotional intensity of the synthetic speech. The mapping model has been implemented in a system called WinkTalk that uses estimated facial expression categories and their intensity values to automatically select between three expressive synthetic voices reflecting three degrees of emotional intensity. An evaluation is conducted through an interactive experiment using simulated augmented conversations. The results have shown that automatic control of synthetic speech through facial expressions is fast, non-intrusive, sufficiently accurate and supports the user to feel more involved in the conversation. It can be concluded that the system has the potential to facilitate a more efficient communication process between user and listener. (C) 2013 Elsevier B.V. All rights reserved. C1 [Szekely, Eva; Ahmed, Zeeshan; Cabral, Joao P.; Carson-Berndsen, Julie] Univ Coll Dublin, Sch Comp Sci & Informat, CNGL, Dublin 2, Ireland. [Hennig, Shannon] Ist Italian Tecnol, Genoa, Italy. RP Szekely, E (reprint author), Univ Coll Dublin, Sch Comp Sci & Informat, CNGL, Dublin 2, Ireland. EM eva.szekely@ucdconnect.ie; zeeshan.ahmed@ucdconnect.ie; shannon.hennig@iit.it; joao.cabral@ucd.ie; julie.berndsen@ucd.ie FU Science Foundation Ireland of the Centre for Next Generation Localisation at University College Dublin (UCD) [07/CE/I1142]; Istituto Italian di Tecnologia; Universita degli Studi di Genova FX This research is supported by the Science Foundation Ireland (Grant 07/CE/I1142) as part of the Centre for Next Generation Localisation (www.cngl.ie) at University College Dublin (UCD). The opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of Science Foundation Ireland. This is research is further supported by the Istituto Italian di Tecnologia and the Universita degli Studi di Genova. The authors would also like to thank Nick Campbell and the Speech Communication Lab (TCD) for their invaluable help with the interactive evaluation. 
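The utterance-level mapping described above (estimated facial-expression intensity selecting among three expressive synthetic voices) can be sketched as a simple threshold rule. The thresholds and voice names below are illustrative assumptions, not the WinkTalk settings.

def select_voice(expression_intensity, low_threshold=0.33, high_threshold=0.66):
    """Map an estimated facial-expression intensity in [0, 1] to one of three
    synthetic voices reflecting increasing emotional intensity."""
    if expression_intensity < low_threshold:
        return "voice_calm"
    if expression_intensity < high_threshold:
        return "voice_neutral"
    return "voice_expressive"

# Hypothetical usage for three utterances of an augmented conversation.
for intensity in (0.1, 0.5, 0.9):
    print(intensity, "->", select_voice(intensity))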
CR Asha, 2005, ROL RESP SPEECH LANG Bedrosian J., 1995, AUGMENTATIVE ALTERNA, V11, P6, DOI 10.1080/07434619512331277089 Beukelman D., 2012, COMMUNICATION ENHANC Braunschweiler N, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 3 AND 4, P2222 Breuer S., 2006, P LREC GEN Bulut M., 2002, P INT C SPOK LANG PR, P1265 Cabral J., 2007, P 6 ISCA WORKSH SPEE Cahn J.E., 1990, J AM VOICE I O SOC, V8, P1 Cave C., 2002, P ICSLP DENV CHILDERS DG, 1995, J ACOUST SOC AM, V97, P505, DOI 10.1121/1.412276 Cowie R., 2000, P ISCA WORKSH SPEECH, P14 Creer S., 2010, COMPUTER SYNTHESISED, P92 Critchley HD, 2000, J NEUROSCI, V20, P3033 Cvejic I., 2011, P INT FLOR Dawson M.E., 2000, ELECTRODERMAL SYSTEM, V2, P200 EKMAN P, 1976, ENVIRON PSYCH NONVER, V1, P56, DOI 10.1007/BF01115465 Fant G., 1988, SPEECH TRANSMISSION, V29, P1 Fontaine JRJ, 2007, PSYCHOL SCI, V18, P1050, DOI 10.1111/j.1467-9280.2007.02024.x Gallo LC, 2000, PSYCHOPHYSIOLOGY, V37, P289, DOI 10.1017/S0048577200982222 Gobl C., 1989, STL QPSR, P9 Guenther FH, 2009, PLOS ONE, V4, DOI 10.1371/journal.pone.0008218 Hennig S., 2012, P ISAAC PITTSB Higginbotham D. J., 1995, AUGMENTATIVE ALTERNA, V11, P2, DOI 10.1080/07434619512331277079 Higginbotham J, 2010, COMPUTER SYNTHESIZED, P50 Higginbotham J., 2002, ASSIST TECHNOL, V24, P14 HTS, 2008, HTS 2 1 TOOLK HMM BA Kawanami H., 2003, P EUR GEN Kueblbeck C., 2006, J IMAGE VISION COMPU, V24, P564 Liao C., 2012, P ICASSP KYOT Light J, 2007, AUGMENT ALTERN COMM, V23, P204, DOI 10.1080/07434610701553635 Light J., 2003, COMMUNICATIVE COMPET, P3 MACWHINNEY B, 1982, MEM COGNITION, V10, P308, DOI 10.3758/BF03202422 Massaro D.W., 2004, MULTISENSORY INTEGRA Moubayed S.A., 2011, LECT NOTES COMPUTER, V6456, P55 Mullennix J.W., 2010, COMPUTER SYNTHESIZED, P1 MURRAY IR, 1993, J ACOUST SOC AM, V93, P1097, DOI 10.1121/1.405558 Schroder M, 2004, THESIS SAARLAND U SHORE, 2012, SHOR FAC DET ENG Sprengelmeyer R, 2006, NEUROPSYCHOLOGIA, V44, P2899, DOI 10.1016/j.neuropsychologia.2006.06.020 Stern SE, 2002, J APPL PSYCHOL, V87, P411, DOI 10.1037//0021-9010.87.2.411 Swerts M, 2008, J PHONETICS, V36, P219, DOI 10.1016/j.wocn.2007.05.001 Szekely E., 2011, P INT FLOR ISCA, P2409 Szekely E., 2012, P SLPAT MONTR Szekely E., 2012, P LREC IST Szekely E., 2013, J MULTIMODA IN PRESS Taylor P, 2009, TEXT TO SPEECH SYNTH Wilkinson KM, 2011, AM J SPEECH-LANG PAT, V20, P288, DOI 10.1044/1058-0360(2011/10-0065) Wisenburn B, 2008, AUGMENT ALTERN COMM, V24, P100, DOI 10.1080/07434610701740448 Zen HG, 2007, IEICE T INF SYST, VE90D, P325, DOI 10.1093/ietisy/e90-d.1.325 Zhao Y., 2006, P INT PITTSB NR 50 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2014 VL 57 BP 63 EP 75 DI 10.1016/j.specom.2013.09.003 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 268OJ UT WOS:000328180100005 ER PT J AU Chao, YH AF Chao, Yi-Hsiang TI Using LR-based discriminant kernel methods with applications to speaker verification SO SPEECH COMMUNICATION LA English DT Article DE Likelihood Ratio; Speaker verification; Support Vector Machine; Kernel Fisher Discriminant; Multiple Kernel Learning ID SUPPORT VECTOR MACHINES; RECOGNITION; MODELS; SCORE AB Kernel methods are powerful techniques that have been widely discussed and successfully applied to pattern recognition problems. 
Kernel-based speaker verification has also been developed to use the concept of sequence kernel that is able to deal with variable-length patterns such as speech. However, constructing a proper kernel cleverly tied in with speaker verification is still an issue. In this paper, we propose the new defined kernels derived by the Likelihood Ratio (LR) test, named the LR-based kernels, in attempts to integrate kernel methods with the LR-based speaker verification framework tightly and intuitively while an LR is embedded in the kernel function. The proposed kernels have two advantages over existing methods. The first is that they can compute the kernel function without needing to represent the variable-length speech as a fixed-dimension vector in advance. The second is that they have a trainable mechanism in the kernel computation using the Multiple Kernel Learning (MKL) algorithm. Our experimental results show that the proposed methods outperform conventional speaker verification approaches. (C) 2013 Elsevier B.V. All rights reserved. C1 Chien Hsin Univ Sci & Technol, Dept Appl Geomat, Tao Yuan, Taiwan. RP Chao, YH (reprint author), Chien Hsin Univ Sci & Technol, Dept Appl Geomat, Tao Yuan, Taiwan. EM yschao@uch.edu.tw FU National Science Council, Taiwan [NSC101-2221-E-231-026] FX This work was funded by the National Science Council, Taiwan, under Grant: NSC101-2221-E-231-026. CR Auckenthaler R, 2000, DIGIT SIGNAL PROCESS, V10, P42, DOI 10.1006/dspr.1999.0360 Bengio S., 2004, P OD SPEAK LANG REC Bengio S, 2001, INT CONF ACOUST SPEE, P425, DOI 10.1109/ICASSP.2001.940858 Burges CJC, 1998, DATA MIN KNOWL DISC, V2, P121, DOI 10.1023/A:1009715923555 Campbell WM, 2006, COMPUT SPEECH LANG, V20, P210, DOI 10.1016/j.csl.2005.06.003 Campbell WM, 2006, IEEE SIGNAL PROC LET, V13, P308, DOI 10.1109/LSP.2006.870086 Chao Y.H., 2006, P ICPR2006 Chao Y.H., 2007, INT J COMPUTATIONAL, V12, P255 Chao YH, 2008, IEEE T AUDIO SPEECH, V16, P1675, DOI 10.1109/TASL.2008.2004297 Chao Y.H., 2006, P INT ICSLP Dehak N., 2010, P OD SPEAK LANG REC Dehak N., 2009, P INT C AC SPEECH SI, P4237 Gonen M, 2011, J MACH LEARN RES, V12, P2211 Herbrich R., 2002, LEARNING KERNEL CLAS Huang X., 2001, SPOKEN LANGUAGE PROC Karam Z.N., 2008, P ICASSP LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010 Liu CS, 1996, IEEE T SPEECH AUDI P, V4, P56 Luettin J., 1998, 9805 IDIAPCOM Martin A.F., 1997, P EUR Messer K., 1999, P AVBPA Mika S, 2002, THESIS U TECHNOLOGY Rakotomamonjy A, 2008, J MACH LEARN RES, V9, P2491 Ratsch G., 1999, P IEEE INT WORKSH NE, V9, P41 REYNOLDS DA, 1995, SPEECH COMMUN, V17, P91, DOI 10.1016/0167-6393(95)00009-D Reynolds DA, 2000, DIGIT SIGNAL PROCESS, V10, P19, DOI 10.1006/dspr.1999.0361 Rosenberg A. E., 1992, P INT C SPOK LANG PR, P599 Smith N, 2002, ADV NEUR IN, V14, P1197 Vapnik V, 1998, STAT LEARNING THEORY Wan V, 2005, IEEE T SPEECH AUDI P, V13, P203, DOI 10.1109/TSA.2004.841042 NR 30 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
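For background on the likelihood-ratio test that the proposed kernels embed, the sketch below computes a conventional utterance-level GMM/background-model log-likelihood ratio with scikit-learn. It is only a sketch of the LR score itself, under assumed toy data, and does not reproduce the paper's LR-based kernel construction or its multiple kernel learning.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)

# Hypothetical frame-level features: target-speaker enrollment data and background data.
target_frames = rng.normal(0.5, 1.0, size=(2000, 12))
background_frames = rng.normal(0.0, 1.2, size=(8000, 12))

speaker_gmm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0).fit(target_frames)
background_gmm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0).fit(background_frames)

def log_likelihood_ratio(utterance):
    """Average per-frame log LR between the speaker model and the background model."""
    return float(np.mean(speaker_gmm.score_samples(utterance) - background_gmm.score_samples(utterance)))

test_utterance = rng.normal(0.5, 1.0, size=(300, 12))
print("LLR score:", round(log_likelihood_ratio(test_utterance), 3))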
PD FEB PY 2014 VL 57 BP 76 EP 86 DI 10.1016/j.specom.2013.09.005 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 268OJ UT WOS:000328180100006 ER PT J AU Hyon, S Dang, JW Feng, H Wang, HC Honda, K AF Hyon, Songgun Dang, Jianwu Feng, Hui Wang, Hongcui Honda, Kiyoshi TI Detection of speaker individual information using a phoneme effect suppression method SO SPEECH COMMUNICATION LA English DT Article DE Speaker identification; Frequency warping; MFCC; Speech production; Phoneme-related effects ID ACOUSTIC CHARACTERISTICS; RECOGNITION; IDENTIFICATION; FREQUENCY; MODELS; NASAL; BAND AB Feature extraction of speaker information from speech signals is a key procedure for exploring individual speaker characteristics and also the most critical part in a speaker recognition system, which needs to preserve individual information while attenuating linguistic information. However, it is difficult to separate individual from linguistic information in a given utterance. For this reason, we investigated a number of potential effects on speaker individual information that arise from differences in articulation due to speaker-specific morphology of the speech organs, comparing English, Chinese and Korean. We found that voiced and unvoiced phonemes have different frequency distributions in speaker information and these effects are consistent across the three languages, while the effect of nasal sounds on speaker individuality is language dependent. Because these differences are confounded with speaker individual information, feature extraction is negatively affected. Accordingly, a new feature extraction method is proposed to more accurately detect speaker individual information by suppressing phoneme-related effects, where the phoneme alignment is required once in constructing a filter bank for phoneme effect suppression, but is not necessary in processing feature extraction. The proposed method was evaluated by implementing it in GMM speaker models for speaker identification experiments. It is shown that the proposed approach outperformed both Mel Frequency Cepstrum Coefficient (MFCC) and the traditional F-ratio (FFCC). The use of the proposed feature has reduced recognition errors by 32.1-67.3% for the three languages compared with MFCC, and by 6.6-31% compared with FFCC. When combining an automatic phoneme aligner with the proposed method, the result demonstrated that the proposed method can detect speaker individuality with about the same accuracy as that based on manual phoneme alignment. (C) 2013 Elsevier B.V. All rights reserved. C1 [Hyon, Songgun; Dang, Jianwu; Wang, Hongcui; Honda, Kiyoshi] Tianjin Univ, Sch Comp Sci & Technol, Tianjin, Peoples R China. [Hyon, Songgun] KimllSung Univ, Sch Comp Sci, Ryongnam, South Korea. [Dang, Jianwu] Japan Adv Inst Sci & Technol, Sch Informat Sci, Kanazawa, Ishikawa, Japan. [Dang, Jianwu; Feng, Hui; Wang, Hongcui; Honda, Kiyoshi] Tianjin Key Lab Cognit Comp & Applicat, Tianjin, Peoples R China. [Feng, Hui] Tianjin Univ, Sch Liberal Arts & Law, Tianjin, Peoples R China. RP Dang, JW (reprint author), Tianjin Univ, Sch Comp Sci & Technol, Tianjin, Peoples R China. EM h_star1020@yahoo.com; jdang@jaist.ac.jp; fenghui@tju.edu.cn; hcwang@tju.edu.cn; khonda@sannet.ne.jp FU National Basic Research Program of China [2013CB329301]; National Natural Science Foundation of China [61233009, 6117501]; JSPS KAKENHI [25330190] FX The authors would like to thank Dr. Jiahong Yuan and M.S. 
Yuan Ma for conducting the automatic phoneme alignment and for running part of the experiments. The authors would also like to thank Dr. Mark Tiede for his helpful comments. This work is supported in part by the National Basic Research Program of China (No. 2013CB329301), and in part by the National Natural Science Foundation of China under contract Nos. 61233009 and 6117501. This study was also supported in part by JSPS KAKENHI Grant (NO. 25330190). CR ATAL BS, 1976, P IEEE, V64, P460, DOI 10.1109/PROC.1976.10155 BLOMBERG M, 1991, SPEECH COMMUN, V10, P453, DOI 10.1016/0167-6393(91)90048-X Bozkurt B., 2005, P EUSIPCO AN TURK Campbell JP, 1997, P IEEE, V85, P1437, DOI 10.1109/5.628714 Dang J, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P965 DANG JW, 1994, J ACOUST SOC AM, V96, P2088, DOI 10.1121/1.410150 Dang JW, 1996, J ACOUST SOC AM, V100, P3374, DOI 10.1121/1.416978 Dang JW, 1997, J ACOUST SOC AM, V101, P456, DOI 10.1121/1.417990 DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 Fant G., 1960, ACOUSTIC THEORY SPEE Feng G, 1996, J ACOUST SOC AM, V99, P3694, DOI 10.1121/1.414967 Garofolo J.S., 1990, DARPA TIMIT ACOUSTIC Gutman D., 2002, P EUSIPCO 2002 TOUL Hansen E.G., 2004, P OD SPEAK REC WORKS HAYAKAWA S, 1994, INT CONF ACOUST SPEE, P137 Kajarekar S. S., 2001, P SPEAK OD 2001 CRET Kitamura Tatsuya, 2007, Acoustical Science and Technology, V28, DOI 10.1250/ast.28.434 Kitamura T., 2005, Acoustical Science and Technology, V26, DOI 10.1250/ast.26.16 Lu XG, 2008, SPEECH COMMUN, V50, P312, DOI 10.1016/j.specom.2007.10.005 Matsui T., 1993, P ICASSP1993 MINN MN Miyajima C, 2001, SPEECH COMMUN, V35, P203, DOI 10.1016/S0167-6393(00)00079-0 Miyajima C., 1999, P EUR 1999, P779 Moore BCJ, 1996, ACUSTICA, V82, P335 O'Shaughnessy D., 1987, SPEECH COMMUN, P150 Orman O., 2001, P SPEAK OD SPEAK REC, P219 Plumpe MD, 1999, IEEE T SPEECH AUDI P, V7, P569, DOI 10.1109/89.784109 Rabiner L, 1993, FUNDAMENTALS SPEECH REYNOLDS DA, 1995, IEEE T SPEECH AUDI P, V3, P72, DOI 10.1109/89.365379 Stevens K.N., 1998, ACOUSTIC PHONETICS SUNDBERG J, 1974, J ACOUST SOC AM, V55, P838, DOI 10.1121/1.1914609 Suzuki H., 1990, P ICSLP90, P437 Takemoto H, 2006, J ACOUST SOC AM, V120, P2228, DOI 10.1121/1.2261270 Weber F., 2002, P ICASSP 2002 ORL FL WOLF JJ, 1972, J ACOUST SOC AM, V51, P2044, DOI 10.1121/1.1913065 Yu Yibiao, 2008, Acta Acustica, V33 Yu Yibiao, 2005, Acta Acustica, V30 Yuan J., 2008, P AC 08 ZWICKER E, 1980, J ACOUST SOC AM, V68, P1523, DOI 10.1121/1.385079 NR 38 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2014 VL 57 BP 87 EP 100 DI 10.1016/j.specom.2013.09.004 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 268OJ UT WOS:000328180100007 ER PT J AU Trawicki, MB Johnson, MT AF Trawicki, Marek B. Johnson, Michael T. TI Speech enhancement using Bayesian estimators of the perceptually-motivated short-time spectral amplitude (STSA) with Chi speech priors SO SPEECH COMMUNICATION LA English DT Article DE Speech enhancement; Probability; Amplitude estimation; Phase estimation; Parameter estimation AB In this paper, the authors propose new perceptually-motivated Weighted Euclidean (WE) and Weighted Cosh (WCOSH) estimators that utilize more appropriate Chi statistical models for the speech prior with Gaussian statistical models for the noise likelihood. 
Whereas the perceptually-motivated WE and WCOSH cost functions emphasized spectral valleys rather than spectral peaks (formants) and indirectly accounted for auditory masking effects, the incorporation of the Chi distribution statistical models demonstrated distinct improvement over the Rayleigh statistical models for the speech prior. The estimators incorporate both weighting law and shape parameters on the cost functions and distributions. Performance is evaluated in terms of the Segmental Signal-to-Noise Ratio (SSNR), Perceptual Evaluation of Speech Quality (PESQ), and Signal-to-Noise Ratio (SNR) Loss objective quality measures to determine the amount of noise reduction along with overall speech quality and speech intelligibility improvement. Based on experimental results across three different input SNRs and eight unique noises along with various weighting law and shape parameters, the two general, less complicated, closed-form WE and WCOSH estimators with Chi speech priors provide significant gains in noise reduction and noticeable gains in overall speech quality and speech intelligibility over the baseline WE and WCOSH estimators with the standard Rayleigh speech priors. Overall, the goal of the work is to capitalize on the mutual benefits of the WE and WCOSH cost functions and the Chi speech priors to improve enhancement. (C) 2013 Elsevier B.V. All rights reserved. C1 [Trawicki, Marek B.; Johnson, Michael T.] Marquette Univ, Dept Elect & Comp Engn, Speech & Signal Proc Lab, Milwaukee, WI 53201 USA. RP Trawicki, MB (reprint author), Marquette Univ, Dept Elect & Comp Engn, Speech & Signal Proc Lab, POB 1881, Milwaukee, WI 53201 USA. EM marek.trawicki@marquette.edu; mike.johnson@marquette.edu CR Andrianakis I, 2009, SPEECH COMMUN, V51, P1, DOI 10.1016/j.specom.2008.05.018 Breithaupt C., 2008, INT C AC SPEECH SIGN EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 Gradshteyn IS, 2007, TABLES INTEGRALS SER, V7th GRAY RM, 1980, IEEE T ACOUST SPEECH, V28, P367, DOI 10.1109/TASSP.1980.1163421 Hu Y, 2007, SPEECH COMMUN, V49, P588, DOI 10.1016/j.specom.2006.12.006 Hu Y, 2008, IEEE T AUDIO SPEECH, V16, P229, DOI 10.1109/TASL.2007.911054 ITU, 2003, SUBJ TEST METH EV SP Johnson N.L., 1994, CONTINUOUS UNIVARIAT, VI Loizou PC, 2005, IEEE T SPEECH AUDI P, V13, P857, DOI 10.1109/TSA.2005.851929 Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst Ma JF, 2011, SPEECH COMMUN, V53, P340, DOI 10.1016/j.specom.2010.10.005 Papamichalis P.E., 1987, PRACTICAL APPROACHES Pearce D., 2000, 6 INT C SPOK LANG PR Rix A., 2001, IEEE INT C AC SPEECH Subcommittee I., 1969, IEEE T AUDIO ELECTRO, P225 NR 17 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2014 VL 57 BP 101 EP 113 DI 10.1016/j.specom.2013.09.009 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 268OJ UT WOS:000328180100008 ER PT J AU McLachlan, NM Grayden, DB AF McLachlan, Neil M. Grayden, David B.
TI Enhancement of speech perception in noise by periodicity processing: A neurobiological model and signal processing algorithm SO SPEECH COMMUNICATION LA English DT Article DE Neurocognitive; Model; Periodicity; Segregation; Algorithm; Speech ID VENTRAL COCHLEAR NUCLEUS; DIFFERENT FUNDAMENTAL FREQUENCIES; ITERATED RIPPLED NOISE; AUDITORY-NERVE FIBERS; INFERIOR COLLICULUS; PITCH STRENGTH; TEMPORAL INTEGRATION; AMPLITUDE-MODULATION; LATERAL LEMNISCUS; CONCURRENT VOWELS AB The perceived loudness of sound increases with its tonality or periodicity, and the pitch strength of tones are linearly proportional to their sound pressure level. These observations suggest a fundamental relationship between pitch strength and loudness. This relationship may be explained by the superimposition of inputs to inferior colliculus neurons from cochlear nucleus chopper cells and phase locked spike trains from the lateral lemniscus. The regularity of chopper cell outputs increases for stimuli with periodicity at the same frequency as their intrinsic chopping rate. So inputs to inferior colliculus cells become synchronized for periodic stimuli, leading to increased likelihood that they will fire and increased salience of periodic signal components at the characteristic frequency of the inferior colliculus cell. A computer algorithm to enhance speech in noise was based on this model. The periodicity of the outputs of a Gammatone filter bank after each sound onset was determined by first sampling each filter channel at a range of typical chopper cell frequencies and then passing these amplitudes through a step function to simulate the firing of coincidence detecting neurons in the inferior colliculus. Filter channel amplification was based on the maximum accumulated spike count after each onset, resulting in increased amplitudes for filter channels with greater periodicity. The speech intelligibility of stimuli in noise was not changed when the algorithm was used to remove around 14 dB of noise from stimuli with signal noise ratios of around 0 dB. This mechanism is a likely candidate for enhancing speech recognition in noise, and raises the proposition that pitch itself is an epiphenomenon that evolved from neural mechanisms that boost the hearing sensitivity of animals to vocalizations. (C) 2013 Elsevier B.V. All rights reserved. C1 [McLachlan, Neil M.] Univ Melbourne, Melbourne Sch Psychol Sci, Melbourne, Vic 3010, Australia. [Grayden, David B.] Univ Melbourne, Dept Elect & Elect Engn, NeuroEngn Lab, Melbourne, Vic 3010, Australia. [Grayden, David B.] Univ Melbourne, Ctr Neural Engn, Melbourne, Vic 3010, Australia. RP McLachlan, NM (reprint author), Univ Melbourne, Melbourne Sch Psychol Sci, Melbourne, Vic 3010, Australia. EM mcln@unimelb.edu.au; grayden@unimelb.edu.au FU Australian Research Council [DP1094830, DP120103039] FX This work was supported by Australian Research Council Discovery Project Grants DP1094830 and DP120103039. 
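A minimal sketch of the periodicity measure described in the abstract above: after an onset, one filter-channel envelope is sampled at a range of candidate chopper rates, the samples are passed through a step function (simulated coincidence-detecting neurons), and the largest accumulated count is used to weight the channel. Everything below (the envelope, rates, threshold and gain rule) is an illustrative assumption rather than the published algorithm, and the Gammatone filter bank stage is omitted.

import numpy as np

def periodicity_score(envelope, fs, rates_hz, threshold, n_periods=16):
    """For one band envelope, sample n_periods points after the first onset at each
    candidate chopper rate, pass them through a step function and return the
    largest accumulated 'spike' count over the rates."""
    onsets = np.flatnonzero(envelope > threshold)
    if onsets.size == 0:
        return 0
    start = onsets[0]
    best = 0
    for rate in rates_hz:
        period = int(round(fs / rate))
        idx = start + period * np.arange(n_periods)
        idx = idx[idx < len(envelope)]
        best = max(best, int((envelope[idx] > threshold).sum()))
    return best

# Hypothetical band envelope with 100 Hz periodicity in noise (16 kHz sampling).
fs = 16000
t = np.arange(0, 0.2, 1.0 / fs)
rng = np.random.default_rng(3)
envelope = 0.6 * (np.sin(2 * np.pi * 100 * t) > 0) + 0.2 * rng.random(t.size)

score = periodicity_score(envelope, fs, rates_hz=range(80, 401, 20), threshold=0.5)
gain = 1.0 + 0.05 * score   # assumed rule: channels with stronger periodicity are amplified
print("max spike count:", score, "assumed channel gain:", round(gain, 2))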
CR Alain C, 2007, HEARING RES, V229, P225, DOI 10.1016/j.heares.2007.01.011 ASSMANN PF, 1990, J ACOUST SOC AM, V88, P680, DOI 10.1121/1.399772 BEERENDS JG, 1989, J ACOUST SOC AM, V85, P813, DOI 10.1121/1.397974 Bench J., 1979, SPEECH HEARING TESTS, P481 BLACKBURN CC, 1992, J NEUROPHYSIOL, V68, P124 Blackburn C.C., 1991, J NEUROPHYSIOL, V65, P606 Brons I, 2013, EAR HEARING, V34, P29, DOI 10.1097/AUD.0b013e31825f299f Cant NB, 2003, BRAIN RES BULL, V60, P457, DOI 10.1016/S0361-9230(03)00050-9 Cariani PA, 2001, NEURAL NETWORKS, V14, P737, DOI 10.1016/S0893-6080(01)00056-9 CHENG YM, 1991, IEEE T SIGNAL PROCES, V39, P1943, DOI 10.1109/78.134427 COVEY E, 1991, J NEUROSCI, V11, P3456 Dau T, 1997, J ACOUST SOC AM, V102, P2892, DOI 10.1121/1.420344 de Cheveigne A, 1999, SPEECH COMMUN, V27, P175, DOI 10.1016/S0167-6393(98)00074-0 Dicke U, 2007, J ACOUST SOC AM, V121, P310, DOI 10.1121/1.2400670 Ehret G., 2005, INFERIOR COLLICULUS, P319 Fastl H., 1989, P 13 INT C AC BELGR, P11 FASTL H, 1979, HEARING RES, V1, P293, DOI 10.1016/0378-5955(79)90002-9 Ferragamo MJ, 2002, J NEUROPHYSIOL, V87, P2262, DOI 10.1152/jn.00587.2001 Frisina R.D., 1990, HEARING RES, V44, P90 Guerin A, 2006, HEARING RES, V211, P54, DOI 10.1016/j.heares.2005.10.001 Hazan V, 1998, SPEECH COMMUN, V24, P211, DOI 10.1016/S0167-6393(98)00011-9 Herzke T, 2007, ACTA ACUST UNITED AC, V93, P498 HOWELL P, 1983, SPEECH COMMUN, V2, P164, DOI 10.1016/0167-6393(83)90018-3 Hsieh IH, 2007, HEARING RES, V233, P108, DOI 10.1016/j.heares.2007.08.005 Hu GN, 2008, J ACOUST SOC AM, V124, P1306, DOI 10.1121/1.2939132 Hutchins S., 2011, J EXP PSYCHOL GEN, V141, P76, DOI DOI 10.1037/A0025064 Ishizuka K, 2006, SPEECH COMMUN, V48, P1447, DOI 10.1016/j.specom.2006.06.008 Kidd Jr G., 1989, J ACOUST SOC AM, V38, P106 Krumbholz K, 2003, CEREB CORTEX, V13, P765, DOI 10.1093/cercor/13.7.765 KRYTER KD, 1965, J ACOUST SOC AM, V38, P106, DOI 10.1121/1.1909578 Langner G, 2002, HEARING RES, V168, P110, DOI 10.1016/S0378-5955(02)00367-2 McLachlan N, 2010, PSYCHOL REV, V117, P175, DOI 10.1037/a0018063 McLachlan N, 2011, J ACOUST SOC AM, V130, P2845, DOI 10.1121/1.3643082 McLachlan N, 2009, HEARING RES, V249, P23, DOI 10.1016/j.heares.2009.01.003 McLachlan N, 2013, J EXP PSYCHOL GEN, V142, P1142, DOI 10.1037/a0030830 MEDDIS R, 1992, J ACOUST SOC AM, V91, P233, DOI 10.1121/1.402767 Meddis R, 2006, J ACOUST SOC AM, V120, P3861, DOI 10.1121/1.2372595 Milczynski M, 2012, HEARING RES, V285, P1, DOI 10.1016/j.heares.2012.02.006 MOORE BCJ, 1983, J ACOUST SOC AM, V74, P750, DOI 10.1121/1.389861 Nelson PC, 2004, J ACOUST SOC AM, V116, P2173, DOI 10.1121/1.1784442 Oertel D, 2000, P NATL ACAD SCI USA, V97, P11773, DOI 10.1073/pnas.97.22.11773 Oertel D, 1997, NEURON, V19, P959, DOI 10.1016/S0896-6273(00)80388-8 Oertel D., 1985, J ACOUST SOC AM, V78, P329 Patel AD, 2006, J ACOUST SOC AM, V119, P3034, DOI 10.1121/1.2179657 PATTERSON RD, 1995, J ACOUST SOC AM, V98, P1890, DOI 10.1121/1.414456 Pickles J.O., 2008, INTRO PHYSL HEARING, P155 PLOMP R, 1994, EAR HEARING, V15, P2 Rakowski A, 1996, ACUSTICA, V82, pS80 Richardson U, 2004, DYSLEXIA, V10, P215, DOI 10.1002/dys.276 Riquelme R, 2001, J COMP NEUROL, V432, P409, DOI 10.1002/cne.1111 ROBINSON K, 1995, J ACOUST SOC AM, V98, P1858, DOI 10.1121/1.414405 Schofield B.R., 2005, INFERIOR COLLICULUS, P140 Seither-Preisler A, 2006, HEARING RES, V218, P50, DOI 10.1016/j.heares.2006.04.005 Slaney M., 1993, 35 APPL COMP SMITH RL, 1975, BIOL CYBERN, V17, P169, DOI 10.1007/BF00364166 Soeta Y, 2004, J ACOUST SOC AM, V116, P3275, DOI 10.1121/1.1782931 
Soeta Y, 2007, J SOUND VIB, V304, P415, DOI 10.1016/j.jsv.2007.03.007 STIEBLER I, 1986, NEUROSCI LETT, V65, P336, DOI 10.1016/0304-3940(86)90285-5 Strait DL, 2012, CORTEX, V48, P360, DOI 10.1016/j.cortex.2011.03.015 STRANGE W, 1983, J ACOUST SOC AM, V74, P695, DOI 10.1121/1.389855 Turicchia L, 2005, IEEE T SPEECH AUDI P, V13, P243, DOI 10.1109/TSA.2004.841044 Vandali AE, 2011, J ACOUST SOC AM, V129, P4023, DOI 10.1121/1.3573988 VIEMEISTER NF, 1991, J ACOUST SOC AM, V90, P858, DOI 10.1121/1.401953 Wiegrebe L, 2001, J NEUROPHYSIOL, V85, P1206 Wiegrebe L, 2004, J ACOUST SOC AM, V115, P1207, DOI 10.1121/1.1643359 Yoo SD, 2007, J ACOUST SOC AM, V122, P1138, DOI 10.1121/1.2751257 Zwicker E., 1999, PSYCHOACOUSTICS FACT, P111 NR 67 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2014 VL 57 BP 114 EP 125 DI 10.1016/j.specom.2013.09.007 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 268OJ UT WOS:000328180100009 ER PT J AU Dileep, AD Sekhar, CC AF Dileep, A. D. Sekhar, C. Chandra TI Class-specific GMM based intermediate matching kernel for classification of varying length patterns of long duration speech using support vector machines SO SPEECH COMMUNICATION LA English DT Article DE Varying length pattern; Long duration speech; Set of local feature vectors; Dynamic kernels; Intermediate matching kernel; Support vector machine; Speech emotion recognition; Speaker identification ID SPEAKER VERIFICATION; RECOGNITION; IDENTIFICATION; MODELS AB Dynamic kernel based support vector machines are used for classification of varying length patterns. This paper explores the use of intermediate matching kernel (IMK) as a dynamic kernel for classification of varying length patterns of long duration speech represented as sets of feature vectors. The main issue in construction of IMK is the choice for the set of virtual feature vectors used to select the local feature vectors for matching. The components of class-independent GMM (CIGMM) have been used earlier as a representation for the set of virtual feature vectors. For every component of CIGMM, a local feature vector each from the two sets of local feature vectors that has the highest probability of belonging to that component is selected and a base kernel is computed between the selected local feature vectors. The IMK is computed as the sum of all the base kernels corresponding to different components of CIGMM. The construction of CIGMM-based IMK does not use the class-specific information, as the local feature vectors are selected using the components of CIGMM that is common for all the classes. We propose two novel methods to build a better discriminatory IMK-based SVM classifier by considering a set of virtual feature vectors specific to each class depending on the approaches to multiclass classification using SVMs. In the first method, we propose a class-wise IMK based SVM for every class by using components of GMM built for a class as the set of virtual feature vectors for that class in the one-against-the-rest approach to multiclass pattern classification. In the second method, we propose a pairwise IMK based SVM for every pair of classes by using components of GM M built for a pair of classes as the set of virtual feature vectors for that pair of classes in the one-against-one approach to multiclass classification. 
We also proposed to use the mixture coefficient weighted and responsibility term weighted base kernels in computation of class-specific IMKs to improve their discrimination ability. This paper also proposes the posterior probability weighted dynamic kernels to improve their classification performance and reduce the number of support vectors. The performance of the SVM-based classifiers using the proposed class-specific IMKs is studied for speech emotion recognition and speaker identification tasks and compared with that of the SVM-based classifiers using the state-of-the-art dynamic kernels. (C) 2013 Elsevier B.V. All rights reserved. C1 [Dileep, A. D.; Sekhar, C. Chandra] Indian Inst Technol, Dept Comp Sci & Engn, Madras 600036, Tamil Nadu, India. RP Dileep, AD (reprint author), IIT Madras, Speech & Vis Lab, Dept CSE, Madras 600036, Tamil Nadu, India. EM addileep@gmail.com; chandra@c-se.iitm.ac.in CR Boughorbel S, 2005, IEEE IJCNN, P889 Boughorbel S., 2004, P BRIT MACH VIS C BM, P137 Burges CJC, 1998, DATA MIN KNOWL DISC, V2, P121, DOI 10.1023/A:1009715923555 Burkhardt F., 2005, P INT, P1517 Campbell WM, 2006, IEEE SIGNAL PROC LET, V13, P308, DOI 10.1109/LSP.2006.870086 Chandrakala S, 2009, Proceedings 2009 International Joint Conference on Neural Networks (IJCNN 2009 - Atlanta), DOI 10.1109/IJCNN.2009.5178777 Chang C. C., 2011, ACM T INTELLIGENT SY, V2 Dileep A.D., 2011, SPEAKER FORENSICS NE, P389 Gonen M, 2008, IEEE T NEURAL NETWOR, V19, P130, DOI 10.1109/TNN.2007.903157 Jaakkola T, 2000, J COMPUT BIOL, V7, P95, DOI 10.1089/10665270050081405 Lee K. A., 2007, P INT, P294 Neiberg D., 2006, P INT 2006 PITTSB US NIST, 2002, NIST YEAR 2002 SPEAK NIST, 2003, NIST YEAR 2003 SPEAK Oudeyer PY, 2003, INT J HUMAN COMPUTER, V59, P157, DOI DOI 10.1016/S1071-581(02)00141-6 Rabiner L.R., 2003, FUNDAMENTALS SPEECH REYNOLDS DA, 1995, SPEECH COMMUN, V17, P91, DOI 10.1016/0167-6393(95)00009-D Reynolds DA, 2000, DIGIT SIGNAL PROCESS, V10, P19, DOI 10.1006/dspr.1999.0361 Sato N., 2007, J NATURAL LANGUAGE P, V14, P83 Sha F, 2006, INT CONF ACOUST SPEE, P265 Shawe-Taylor J., 2004, KERNEL METHODS PATTE Smith N., 2001, DATA DEPENDENT KERNE Steidl S., 2009, THESIS U ERLANGEN NU Tao Q, 2005, IEEE T NEURAL NETWOR, V16, P1561, DOI 10.1109/tnn.2005.857955 Wallraven C., 2003, Proceedings Ninth IEEE International Conference on Computer Vision WAN V, 2002, ACOUST SPEECH SIG PR, P669 You CH, 2010, IEEE T AUDIO SPEECH, V18, P1300, DOI 10.1109/TASL.2009.2032950 NR 27 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
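The intermediate matching kernel described in the abstract above can be sketched directly: the components of a (here class-specific) GMM act as virtual feature vectors, each component selects one local feature vector from each of the two sets, and a base kernel between the selected vectors is accumulated. The sketch below follows that description under assumed details (Gaussian base kernel, posterior-based selection, toy data) and omits the proposed mixture-coefficient and responsibility weightings.

import numpy as np
from sklearn.mixture import GaussianMixture

def intermediate_matching_kernel(set_x, set_y, gmm, gamma=0.5):
    """IMK between two variable-length sets of local feature vectors.

    For every GMM component, pick from each set the vector with the highest
    posterior probability of that component and add a Gaussian base kernel
    value between the two selected vectors."""
    post_x = gmm.predict_proba(set_x)   # shape: (len_x, n_components)
    post_y = gmm.predict_proba(set_y)
    value = 0.0
    for q in range(gmm.n_components):
        x_best = set_x[np.argmax(post_x[:, q])]
        y_best = set_y[np.argmax(post_y[:, q])]
        value += np.exp(-gamma * np.sum((x_best - y_best) ** 2))
    return value

# Hypothetical class-specific GMM trained on one class, and two utterances
# represented as sets of frame-level feature vectors.
rng = np.random.default_rng(4)
class_frames = rng.normal(size=(3000, 10))
class_gmm = GaussianMixture(n_components=4, random_state=0).fit(class_frames)
utt_a, utt_b = rng.normal(size=(220, 10)), rng.normal(size=(180, 10))
print("IMK(utt_a, utt_b) =", round(intermediate_matching_kernel(utt_a, utt_b, class_gmm), 4))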
PD FEB PY 2014 VL 57 BP 126 EP 143 DI 10.1016/j.specom.2013.09.010 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 268OJ UT WOS:000328180100010 ER PT J AU Maeno, Y Nose, T Kobayashi, T Koriyama, T Ijima, Y Nakajima, H Mizuno, H Yoshioka, O AF Maeno, Yu Nose, Takashi Kobayashi, Takao Koriyama, Tomoki Ijima, Yusuke Nakajima, Hideharu Mizuno, Hideyuki Yoshioka, Osamu TI Prosodic variation enhancement using unsupervised context labeling for HMM-based expressive speech synthesis SO SPEECH COMMUNICATION LA English DT Article DE HMM-based expressive speech synthesis; Prosodic context; Unsupervised labeling; Audiobook; Prosody control ID CORPUS-BASED SPEECH; SYNTHESIS SYSTEM; MODEL AB This paper proposes an unsupervised labeling technique using phrase-level prosodic contexts for HMM-based expressive speech synthesis, which enables users to manually enhance prosodic variations of synthetic speech without degrading the naturalness. In the proposed technique, HMMs are first trained using the conventional labels including only linguistic information, and prosodic features are generated from the HMMs. The average difference of original and generated prosodic features for each accent phrase is then calculated and classified into three classes, e.g., low, neutral, and high in the case of fundamental frequency. The created prosodic context label has a practical meaning such as high/low of relative pitch at the phrase level, and hence it is expected that users can modify the prosodic characteristic of synthetic speech in an intuitive way by manually changing the proposed labels. In the experiments, we evaluate the proposed technique in both ideal and practical conditions using speech of sales talk and fairy tale recorded under a realistic domain. In the evaluation under the practical condition, we evaluate whether the users achieve their intended prosodic modification by changing the proposed context label of a certain accent phrase for a given sentence. (C) 2013 Elsevier B.V. All rights reserved. C1 [Maeno, Yu; Nose, Takashi; Kobayashi, Takao; Koriyama, Tomoki] Tokyo Inst Technol, Interdisciplinary Grad Sch Sci & Engn, Yokohama, Kanagawa 2268502, Japan. [Ijima, Yusuke; Nakajima, Hideharu; Mizuno, Hideyuki; Yoshioka, Osamu] NTT Corp, NTT Media Intelligence Labs, Yokosuka, Kanagawa 2390847, Japan. RP Nose, T (reprint author), Tokyo Inst Technol, Interdisciplinary Grad Sch Sci & Engn, Yokohama, Kanagawa 2268502, Japan. 
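The unsupervised labeling step described in the abstract above can be sketched in a few lines: for each accent phrase, the average difference between original and HMM-generated log-F0 is computed and quantized into low/neutral/high. The phrase boundaries, threshold and data below are illustrative assumptions, not the paper's settings.

import numpy as np

def prosodic_context_labels(original_lf0, generated_lf0, phrase_bounds, threshold=0.1):
    """Assign 'low'/'neutral'/'high' to each accent phrase from the mean
    difference between original and generated log-F0."""
    labels = []
    for start, end in phrase_bounds:
        diff = np.mean(original_lf0[start:end] - generated_lf0[start:end])
        if diff > threshold:
            labels.append("high")
        elif diff < -threshold:
            labels.append("low")
        else:
            labels.append("neutral")
    return labels

# Hypothetical frame-level log-F0 contours and two accent phrases.
rng = np.random.default_rng(5)
generated = np.full(200, 5.0) + 0.02 * rng.standard_normal(200)
original = generated.copy()
original[:100] += 0.2          # first phrase realized noticeably higher than predicted
print(prosodic_context_labels(original, generated, phrase_bounds=[(0, 100), (100, 200)]))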
EM takashi.nose@ip.titech.ac.jp; takao.kobayashi@ip.titech.ac.jp; koriyama.t.aa@m.titech.ac.jp RI Koriyama, Tomoki/B-9321-2015 OI Koriyama, Tomoki/0000-0002-8347-5604 CR Pitrelli JF, 2006, IEEE T AUDIO SPEECH, V14, P1099, DOI 10.1109/TASL.2006.876123 Braunschweiler N, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 3 AND 4, P2222 Campbell N, 2005, IEICE T INF SYST, VE88D, P376, DOI 10.1093/ietisy/e88-d.3.376 Chen L., 2013, P ICASSP 2013, P7977 Doukhan D., 2011, P INTERSPEECH 2011, P3129 Erickson D., 2005, Acoustical Science and Technology, V26, DOI 10.1250/ast.26.317 Eyben F, 2012, 2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), P4009 Iida A, 2003, SPEECH COMMUN, V40, P161, DOI 10.1016/S0167-6393(02)00081-X Kawahara H, 1999, SPEECH COMMUN, V27, P187, DOI 10.1016/S0167-6393(98)00085-5 Maeno Y., 2011, P INTERSPEECH 2011, P1849 Morizane K, 2009, ORIENTAL COCOSDA 2009 - INTERNATIONAL CONFERENCE ON SPEECH DATABASE AND ASSESSMENTS, P76 Nakajima H, 2009, 2009 EIGHTH INTERNATIONAL SYMPOSIUM ON NATURAL LANGUAGE PROCESSING, PROCEEDINGS, P137, DOI 10.1109/SNLP.2009.5340932 Nakajima H., 2010, CREATION ANAL JAPANE Nose T, 2013, SPEECH COMMUN, V55, P347, DOI 10.1016/j.specom.2012.09.003 Prahallad K., 2007, P INTERSPEECH, P2901 Schroder M, 2009, AFFECTIVE INFORMATION PROCESSING, P111, DOI 10.1007/978-1-84800-306-4_7 Shinoda K., 2000, Journal of the Acoustical Society of Japan (E), V21, DOI 10.1250/ast.21.79 Strom V., 2007, P INT 2007 ANTW BELG, P1282 Suni A., 2012, P BLIZZ CHALL 2012 W Szekely E., 2011, P INT FLOR ISCA, P2409 Tsuzuki R, 2004, P INTERSPEECH 2004 I, P1185 Vainio M., 2005, P 10 INT C SPEECH CO, P309 Yamagishi J., 2003, P INTERSPEECH 2003 E, P2461 Yu K, 2010, INT CONF ACOUST SPEE, P4238, DOI 10.1109/ICASSP.2010.5495690 Zen H, 2007, IEICE T INF SYST, VE90D, P825, DOI 10.1093/ietisy/e90-d.5.825 Zhao Y, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P1750 NR 26 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2014 VL 57 BP 144 EP 154 DI 10.1016/j.specom.2013.09.014 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 268OJ UT WOS:000328180100011 ER PT J AU Origlia, A Cutugno, F Galata, V AF Origlia, A. Cutugno, F. Galata, V. TI Continuous emotion recognition with phonetic syllables SO SPEECH COMMUNICATION LA English DT Article DE Affective computing; Feature extraction; Phonetic syllables; Valence-Activation-Dominance space ID FUNDAMENTAL-FREQUENCY; TONAL PERCEPTION; SPEECH; MODEL; FEATURES; DISCRIMINATION; STYLIZATION; MODULATION; AMPLITUDE; RHYTHM AB As research on the extraction of acoustic properties of speech for emotion recognition progresses, the need of investigating methods of feature extraction taking into account the necessities of real time processing systems becomes more important. Past works have shown the importance of syllables for the transmission of emotions, while classical research methods adopted in prosody show that it is important to concentrate on specific areas of the speech signal to study intonation phenomena. Technological approaches, however, are often designed to use the whole speech signal without taking into account the qualitative variability of the spectral content. 
Given this contrast with the theoretical basis around which prosodic research is pursued, we present here a feature extraction method built on the basis of a phonetic interpretation of the concept of syllable. In particular, we concentrate on the spectral content of syllabic nuclei, thus reducing the amount of information to be processed. Moreover, we introduce feature weighting based on syllabic prominence, thus not considering all the units of analysis as being equally important. The method is evaluated on a continuous, three-dimensional model of emotions built on the classical axes of Valence, Activation and Dominance and is shown to be competitive with state-of-the-art performance. The potential impact of this approach on the design of affective computing systems is also analysed. (C) 2013 Elsevier B.V. All rights reserved. C1 [Origlia, A.; Cutugno, F.] Univ Naples Federico II, Dept Elect Engn & Informat Technol DIETI, Language Understanding & Speech Interfaces LUSI L, I-80125 Naples, Italy. [Galata, V.] CNR, Inst Cognit Sci & Technol ISTC, Padua, Italy. RP Origlia, A (reprint author), Univ Naples Federico II, Dept Elect Engn & Informat Technol DIETI, Language Understanding & Speech Interfaces LUSI L, Via Claudio 21, I-80125 Naples, Italy. EM antonio.origlia@unina.it; cutugno@unina.it; vincenzo.galata@pd.istc.cnr.it FU European Community [600958] FX Antonio Origlia and Francesco Cutugno's work was supported by the European Community, within the FP7 SHERPA IP #600958 project. The authors would like to thank the two anonymous reviewers and the editor for the helpful comments and the constructive suggestions provided. CR Arvaniti A, 2009, PHONETICA, V66, P46, DOI 10.1159/000208930 Avanzi M., 2010, P SPEECH PROS Baltruktitis T., 2013, P IEEE INT C AUT FAC Barry WJ, 2003, P 15 INT C PHON SCI, P2693 Batliner Anton, 2010, Advances in Human-Computing Interaction, DOI 10.1155/2010/782802 Boersma P., 2011, PRAAT DOING PHONETIC Boersma P., 1993, P I PHONETIC SCI, V17, P97 Breitenstein C, 2001, COGNITION EMOTION, V15, P57, DOI 10.1080/0269993004200114 Burkhardt F., 2005, P INT, P1517 Chang C.-C., 2011, ACM T INTEL SYST TEC, V2, P1, DOI DOI 10.1145/1961189.1961199 Collier R., 1990, PERCEPTUAL STUDY INT Cowie R., 2001, IEEE SIGNAL PROCESSI, V18, P33 Cowie R., 2000, P ISCA WORKSH SPEECH, P19 Cummins F, 1998, J PHONETICS, V26, P145, DOI 10.1006/jpho.1998.0070 DALESSANDRO C, 1995, COMPUT SPEECH LANG, V9, P257, DOI 10.1006/csla.1995.0013 Dellwo V, 2003, P 15 INT C PHON SCI, P471 Dellwo V., 2006, LANGUAGE LANGUAGE PR, P231 Drioli C., 2003, P VOIC QUAL GEN, P127 EKMAN P, 1992, COGNITION EMOTION, V6, P169, DOI 10.1080/02699939208411068 Espinosa HP, 2010, INT CONF ACOUST SPEE, P5138 Fernandez R, 2011, SPEECH COMMUN, V53, P1088, DOI 10.1016/j.specom.2011.05.003 FETH LL, 1972, ACUSTICA, V26, P67 Fragopanagos N, 2005, NEURAL NETWORKS, V18, P389, DOI 10.1016/j.neunet.2005.03.006 Galata V., 2010, THESIS U CALABRIA IT Gharavian D., 2012, NEURAL COMPUT APPL, V22, P1 Goudbeek M., 2009, P INT, P1575 Grimm M, 2008, 2008 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, VOLS 1-4, P865, DOI 10.1109/ICME.2008.4607572 Grimm M., 2005, P IEEE AUT SPEECH RE, P381 Grimm M, 2007, SPEECH COMMUN, V49, P787, DOI 10.1016/j.specom.2007.01.010 Gunes Hatice, 2011, Proceedings 2011 IEEE International Conference on Automatic Face & Gesture Recognition (FG 2011), DOI 10.1109/FG.2011.5771357 Hall M. 
A., 1998, THESIS HAMILTON House D, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P2048 House D., 1995, P EUR, P949 House David, 1990, TONAL PERCEPTION SPE Jespersen O., 1920, LEHRBUCH PHONETIC Jia J, 2011, IEEE T AUDIO SPEECH, V19, P570, DOI 10.1109/TASL.2010.2052246 Jittiwarangkul N., 1998, P IEEE AS PAC C CIRC, P169 Kaiser J., 1990, P IEEE INT C AC SPEE, V1, P381 Kao YH, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P1814 KLATT DH, 1973, J ACOUST SOC AM, V53, P8, DOI 10.1121/1.1913333 Ludusan B., 2011, P INT, P2413 MAIWALD D, 1967, ACUSTICA, V18, P81 Martin P., 2010, P SPEECH PROS Mary L, 2008, SPEECH COMMUN, V50, P782, DOI 10.1016/j.specom.2008.04.010 McKeown G, 2010, IEEE INT CON MULTI, P1079, DOI 10.1109/ICME.2010.5583006 Mehrabian A, 1996, CURR PSYCHOL, V14, P261, DOI 10.1007/BF02686918 Mermelstein D., 1975, J ACOUST SOC AM, V54, P880 Mertens P., 2004, P SPEECH PROS Moller E., 1974, P FACTS MOD HEAR, P227 Nicolaou MA, 2011, IEEE T AFFECT COMPUT, V2, P92, DOI 10.1109/T-AFFC.2011.9 Origlia A, 2013, COMPUT SPEECH LANG, V27, P190, DOI 10.1016/j.csl.2012.04.003 Patel S, 2011, BIOL PSYCHOL, V87, P93, DOI 10.1016/j.biopsycho.2011.02.010 Petrillo M., 2003, P EUR, P2913 Pfitzinger HR, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P1261 POLLACK I, 1968, J EXP PSYCHOL, V77, P535, DOI 10.1037/h0026051 Ramus F, 1999, COGNITION, V73, P265, DOI 10.1016/S0010-0277(99)00058-X Roach P., 2000, PRACTICAL COURSE ROSSI M, 1978, LANG SPEECH, V21, P384 ROSSI M, 1971, PHONETICA, V23, P1 RUSSELL JA, 1980, J PERS SOC PSYCHOL, V39, P1161, DOI 10.1037/h0077714 Scherer KR, 2003, SER AFFECTIVE SCI, P433 Schouten H.E.M., 1985, PERCEPT PSYCHOPHYS, V37, P369 Schuller B, 2008, 2008 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, VOLS 1-4, P1333, DOI 10.1109/ICME.2008.4607689 Schuller B, 2011, SPEECH COMMUN, V53, P1062, DOI 10.1016/j.specom.2011.01.011 Schuller Bjorn, 2009, P INTERSPEECH, P312 Seppi D., 2010, P SPEECH PROS SERGEANT RL, 1962, J ACOUST SOC AM, V34, P1625, DOI 10.1121/1.1909065 Silipo R., 1999, P 14 INT C PHON SCI, P2351 Tamburini Fabio, 2007, P INT 2007 ANTW BELG, P1809 TERKEN J, 1991, J ACOUST SOC AM, V89, P1768, DOI 10.1121/1.401019 Vlasenko B., 2011, P ICME, P4230 Vlasenko B, 2007, LECT NOTES COMPUT SC, V4738, P139 Wu SQ, 2011, SPEECH COMMUN, V53, P768, DOI 10.1016/j.specom.2010.08.013 ZWICKER EBERHARD, 1962, JOUR ACOUSTICAL SOC AMER, V34, P1425, DOI 10.1121/1.1918362 NR 74 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2014 VL 57 BP 155 EP 169 DI 10.1016/j.specom.2013.09.012 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 268OJ UT WOS:000328180100012 ER PT J AU Wolf, M Nadeu, C AF Wolf, Martin Nadeu, Climent TI Channel selection measures for multi-microphone speech recognition SO SPEECH COMMUNICATION LA English DT Article DE Automatic speech recognition; Channel (microphone) selection; Signal quality; Multi-microphone; Reverberation ID NOISE AB Automatic speech recognition in a room with distant microphones is strongly affected by noise and reverberation. In scenarios where the speech signal is captured by several arbitrarily located microphones the degree of distortion differs from one channel to another. 
In this work we deal with measures extracted from a given distorted signal that either estimate its quality or measure how well it fits the acoustic models of the recognition system. We then apply them to solve the problem of selecting the signal (i.e. the channel) that presumably leads to the lowest recognition error rate. New channel selection techniques are presented, and compared experimentally in reverberant environments with other approaches reported in the literature. Significant improvements in recognition rate are observed for most of the measures. A new measure based on the variance of the speech intensity envelope shows a good trade-off between recognition accuracy, latency and computational cost. Also, the combination of measures allows a further improvement in recognition rate. (C) 2013 Elsevier B.V. All rights reserved. C1 [Wolf, Martin; Nadeu, Climent] Univ Politecn Cataluna, TALP Res Ctr, Dept Signal Theory & Commun, ES-08034 Barcelona, Spain. RP Wolf, M (reprint author), Univ Politecn Cataluna, TALP Res Ctr, Dept Signal Theory & Commun, Jordi Girona 1-3, ES-08034 Barcelona, Spain. EM martin.wolf@upc.edu; climent.nadeu@upc.edu RI Nadeu, Climent/B-9638-2014 OI Nadeu, Climent/0000-0002-5863-0983 FU Spanish project SARAI [TEC2010-21040-C02-01] FX This work was supported by the Spanish project SARAI (reference number TEC2010-21040-C02-01). CR ATAL BS, 1974, J ACOUST SOC AM, V55, P1304, DOI 10.1121/1.1914702 Brandstein M., 2001, MICROPHONE ARRAYS de la Torre A., 2002, P ICASSP Fisher RA, 1936, ANN EUGENIC, V7, P179 HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423 HOUTGAST T, 1985, J ACOUST SOC AM, V77, P1069, DOI 10.1121/1.392224 Hui Jiang, 2005, Speech Communication, V45, DOI 10.1016/j.specom.2004.12.004 ICSI, 2003, ICSI M REC DIG CORP Janin A., 2004, P ICASSP 2004 M REC Jeub M., 2011, P EUR SIGN PROC C EU Kumar K, 2011, P HSCMA ED UK, P1 Leonard R. G., 1984, P ICASSP 84, P111 Molau S, 2001, ASRU 2001: IEEE WORKSHOP ON AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING, CONFERENCE PROCEEDINGS, P21 NIST, 2000, SPEECH QUAL ASS SPQA Obuchi Y, 2006, ELECTRON COMM JPN 2, V89, P9, DOI 10.1002/ecjb.20281 Obuchi Y., 2004, WORKSH STAT PERC AUD OPENSHAW JP, 1994, INT CONF ACOUST SPEE, P49 Petrick R., 2007, P INT, P1094 Shimizu Y, 2000, INT CONF ACOUST SPEE, P1747, DOI 10.1109/ICASSP.2000.862090 Wolf M., 2009, P 1 JOINT SIG IL MIC, P61 Wolf M., 2012, P IBERSPEECH MADR SP, P513 Wolf M., 2010, P INTERSPEECH TOK JA, P80 Wolfel M., 2007, P INT ANTW AUG, P582 Wolfel M., 2009, DISTANT SPEECH RECOG Wolfel M., 2006, P INTERSPEECH Young S., 2006, HTK BOOK HTK VERSION NR 26 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
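The channel-selection idea summarized in the Wolf and Nadeu abstract above, scoring each microphone by the variance of its speech intensity envelope, lends itself to a compact illustration. The sketch below is not the authors' implementation; it assumes mono channels at a common sampling rate, a non-overlapping frame-based envelope, and the working assumption that reverberation flattens the envelope, so the channel with the largest envelope variance is taken as the best one. Frame length and the dB floor are arbitrary choices.

```python
import numpy as np

def intensity_envelope_db(x, sr, frame_ms=20.0):
    """Short-time intensity (dB) over non-overlapping frames of a mono signal."""
    frame = max(1, int(sr * frame_ms / 1000.0))
    n_frames = len(x) // frame
    frames = np.asarray(x[:n_frames * frame], dtype=float).reshape(n_frames, frame)
    power = np.mean(frames ** 2, axis=1)
    return 10.0 * np.log10(power + 1e-12)  # small floor avoids log(0)

def select_channel_by_envelope_variance(channels, sr):
    """Score each channel by the variance of its intensity envelope and return
    the index of the highest-scoring one (illustrative criterion: reverberation
    tends to flatten the envelope, lowering its variance)."""
    scores = [float(np.var(intensity_envelope_db(x, sr))) for x in channels]
    return int(np.argmax(scores)), scores
```

In a distant-microphone setup the selected channel would then be passed to the recognizer; because only the envelope is needed, such a measure can plausibly keep latency and computational cost low, in line with the trade-off the abstract reports.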
PD FEB PY 2014 VL 57 BP 170 EP 180 DI 10.1016/j.specom.2013.09.015 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 268OJ UT WOS:000328180100013 ER PT J AU Xu, Y Prom-on, S AF Xu, Yi Prom-on, Santitham TI Toward invariant functional representations of variable surface fundamental frequency contours: Synthesizing speech melody via model-based stochastic learning SO SPEECH COMMUNICATION LA English DT Article DE Prosody modeling; Target approximation; Parallel encoding; Analysis-by-synthesis; Simulated annealing ID COMMAND-RESPONSE MODEL; STANDARD CHINESE; MATCHED STATEMENTS; MANDARIN CHINESE; FOCUS LOCATION; INTONATION; TONE; PROSODY; ENGLISH; PITCH AB Variability has been one of the major challenges for both theoretical understanding and computer synthesis of speech prosody. In this paper we show that economical representation of variability is the key to effective modeling of prosody. Specifically, we report the development of PENTAtrainer-A trainable yet deterministic prosody synthesizer based on an articulatory functional view of speech. We show with testing results on Thai, Mandarin and English that it is possible to achieve high-accuracy predictive synthesis of fundamental frequency contours with very small sets of parameters obtained through stochastic learning from real speech data. The first key component of this system is syllable-synchronized sequential target approximation implemented as the qTA model, which is designed to simulate, for each tonal unit, a wide range of contextual variability with a single invariant target. The second key component is the automatic learning of function-specific targets through stochastic global optimization, guided by a layered pseudo-hierarchical functional annotation scheme, which requires the manual labeling of only the temporal domains of the functional units. The results in terms of synthesis accuracy demonstrate that effective modeling of the contextual variability is the key also to effective modeling of function-related variability. Additionally, we show that, being both theory-based and trainable (hence data-driven), computational systems like PENTAtrainer can serve as an effective modeling tool in basic research, with which the level of falsifiability in theory testing can be raised, and also a closer link between basic and applied research in speech science can be developed. (C) 2013 Elsevier B.V. All rights reserved. C1 [Xu, Yi; Prom-on, Santitham] UCL, Dept Speech Hearing & Phonet Sci, London WC1N 1PF, England. [Prom-on, Santitham] King Mongkuts Univ Technol Thonburi, Dept Comp Engn, Fac Engn, Bangkok 10140, Thailand. RP Prom-on, S (reprint author), King Mongkuts Univ Technol Thonburi, Dept Comp Engn, Fac Engn, 126 Prachauthit Rd, Bangkok 10140, Thailand. EM yi.xu@ucl.ac.uk; santitham@cpe.kmutt.ac.th FU Royal Society; Royal Academy of Engineering through the Newton International Fellowship Scheme; Thai Research Fund through the Research Grant for New Researcher [TRG5680096]; National Science Foundation FX We would like to thank for the financial supports the Royal Society and the Royal Academy of Engineering through the Newton International Fellowship Scheme (to SP), the Thai Research Fund through the Research Grant for New Researcher (Grant Number TRG5680096 to SP), and the National Science Foundation (to YX). We thank Fang Liu for providing the English and Mandarin Chinese corpora used in this work. 
We would further like to thank the Organizers of Speech Prosody 2012 for inviting us to give the tutorial about this work at the conference. CR Abramson A.S., 1962, INT J AM LINGUIST, V28, P3 Anderson M.D., 1984, P ICASSP 1984 SAN DI, P77 Arvaniti A, 2009, PHONOLOGY, V26, P43, DOI 10.1017/S0952675709001717 Bailly G, 2005, SPEECH COMMUN, V46, P348, DOI 10.1016/j.specom.2005.04.008 Beckman M. E., 1986, PHONOLOGY YB, V3, P255, DOI 10.1017/S095267570000066X Black AW, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P1385 Boersma Paul, 2012, PRAAT DOING PHONETIC Breen M, 2012, CORPUS LINGUIST LING, V8, P277, DOI 10.1515/cllt-2012-0011 Chao Yuen Ren, 1968, GRAMMAR SPOKEN CHINE Chen G.-P., 2004, P ISCSLP, P177 Chen Mathew Y., 2000, TONE SANDHI PATTERNS Chen YY, 2006, PHONETICA, V63, P47, DOI 10.1159/000091406 Collier R., 1990, PERCEPTUAL STUDY INT COOPER WE, 1985, J ACOUST SOC AM, V77, P2142, DOI 10.1121/1.392372 Duanmu S., 2000, PHONOLOGY STANDARD C EADY SJ, 1986, J ACOUST SOC AM, V80, P402, DOI 10.1121/1.394091 EFRON B, 1979, ANN STAT, V7, P1, DOI 10.1214/aos/1176344552 Fujisaki H, 2005, SPEECH COMMUN, V47, P59, DOI 10.1016/j.specom.2005.06.009 Fujisaki H., 1990, P ICSLP 90, P841 GANDOUR J, 1994, J PHONETICS, V22, P477 Grabe E, 2007, LANG SPEECH, V50, P281 Gu WT, 2006, IEEE T AUDIO SPEECH, V14, P1155, DOI 10.1109/TASL.2006.876132 Gu WT, 2007, PHONETICA, V64, P29, DOI 10.1159/0000100060 HADDINGKOCH K, 1964, PHONETICA, V11, P175 Hermes DJ, 1998, J SPEECH LANG HEAR R, V41, P73 Hirst D. J., 2011, J SPEECH SCI, V1, P55 Hirst DJ, 2005, SPEECH COMMUN, V46, P334, DOI 10.1016/j.specom.2005.02.020 HO AT, 1977, PHONETICA, V34, P446 HOWIE JM, 1974, PHONETICA, V30, P129 Jilka M, 1999, SPEECH COMMUN, V28, P83, DOI 10.1016/S0167-6393(99)00008-4 Jokisch O., 2000, P ICSLP 2000 BEIJ, P645 KIRKPATRICK S, 1983, SCIENCE, V220, P671, DOI 10.1126/science.220.4598.671 KLATT DH, 1987, J ACOUST SOC AM, V82, P737, DOI 10.1121/1.395275 Kochanski G, 2003, SPEECH COMMUN, V39, P311, DOI 10.1016/S0167-6393(02)00047-X Konishi S, 1996, BIOMETRIKA, V83, P875, DOI 10.1093/biomet/83.4.875 Kuo YC, 2007, PHONOL PHONET, V12-2, P211 Ladd DR, 2008, CAMB STUD LINGUIST, V79, P1 Ladefoged P., 1967, 3 AREAS EXPT PHONETI Lee Y.-C., 2010, P SPEECH PROS 2010 C Liu F., 2013, J SPEECH SCI, V3, P85 Liu F, 2005, PHONETICA, V62, P70, DOI 10.1159/000090090 Mixdorff H., 2003, P EUR 2003, P873 Myers Scott, 1998, PHONOLOGY, V15, P367, DOI 10.1017/S0952675799003620 Ni J., 2004, P INT C SPEECH PROS, P95 Ni JF, 2006, J ACOUST SOC AM, V119, P1764, DOI 10.1121/1.2165071 Ni JF, 2006, SPEECH COMMUN, V48, P989, DOI 10.1016/j.specom.2006.01.002 OSHAUGHNESSY D, 1983, J ACOUST SOC AM, V74, P1155, DOI 10.1121/1.390039 Pell MD, 2001, J ACOUST SOC AM, V109, P1668, DOI 10.1121/1.1352088 Peng S.-H., 2000, PAPERS LAB PHONOLOGY, VV, P152 Perkell J. 
S., 1986, INVARIANCE VARIABILI Perrier P, 1996, J SPEECH HEAR RES, V39, P365 PETERSON GE, 1952, J ACOUST SOC AM, V24, P175, DOI 10.1121/1.1906875 PIERREHUMBERT J, 1981, J ACOUST SOC AM, V70, P985, DOI 10.1121/1.387033 Pierrehumbert Janet, 1980, THESIS MIT CAMBRIDGE Potisuk S., 1997, PHONETICA, V42, P22 Prom-on S., 2011, P 17 INT C PHON SCI, P1638 Prom-on S, 2009, J ACOUST SOC AM, V125, P405, DOI 10.1121/1.3037222 Prom-on S, 2012, J ACOUST SOC AM, V132, P421, DOI 10.1121/1.4725762 Raidt S., 2004, P SPEECH PROS 2004 N, P417 Rose P.J., 1988, PROSODIC ANAL ASIAN, P55 Ross KN, 1999, IEEE T SPEECH AUDI P, V7, P295, DOI 10.1109/89.759037 Sakurai A, 2003, SPEECH COMMUN, V40, P535, DOI 10.1016/S0167-6393(02)00177-2 Saltzman E. L., 1989, ECOL PSYCHOL, V1, P333, DOI 10.1207/s15326969eco0104_2 ShattuckHufnagel S, 1996, J PSYCHOLINGUIST RES, V25, P193, DOI 10.1007/BF01708572 Shih C., 1987, PHONETICS CHINESE TO Silverman K., 1992, P INT C SPOK LANG PR, V2, P867 SILVERMAN K, 1986, PHONETICA, V43, P76 Sun X., 2002, THESIS NW U Sun XJ, 2002, J VOICE, V16, P443, DOI 10.1016/S0892-1997(02)00119-4 Syrdal A. K., 2000, P INT C SPOK LANG PR, V3, P235 Taylor P, 2009, TEXT TO SPEECH SYNTH Taylor P, 2000, J ACOUST SOC AM, V107, P1697, DOI 10.1121/1.428453 Vainio M., 2009, P SPECOM 2009 ST PET, P164 Vainio M, 2010, J ACOUST SOC AM, V128, P1313, DOI 10.1121/1.3467767 van Santen JPH, 2000, TEXT SPEECH LANG TEC, V15, P269 Wagner M, 2010, LANG COGNITIVE PROC, V25, P905, DOI 10.1080/01690961003589492 WANG WSY, 1967, J SPEECH HEAR RES, V10, P629 WHALEN DH, 1995, J PHONETICS, V23, P349, DOI 10.1016/S0095-4470(95)80165-0 Wightman C., 1999, P IEEE ASRU 1999 KEY, P333 Wightman C., 2002, P SPEECH PROS 2002 A, P25 Wong Y.W., 2007, P 16 INT C PHON SCI, P1293 Wu W.L., 2010, P SPEECH PROS 2010 C Wu Zong Ji, 1984, ZHONGGUO YUYAN XUEBA, V2, P70 Xiaonan Shen, 1990, PROSODY MANDARIN CHI Xu Y, 1999, J PHONETICS, V27, P55, DOI 10.1006/jpho.1999.0086 Xu Y, 2012, LINGUIST REV, V29, P131, DOI 10.1515/tlr-2012-0006 Xu Y, 1997, J PHONETICS, V25, P61, DOI 10.1006/jpho.1996.0034 Xu Y, 2013, PLOS ONE, V8, DOI 10.1371/journal.pone.0062397 Xu Y., 2011, J SPEECH SCI, V1, P85 Xu Y, 2005, SPEECH COMMUN, V46, P220, DOI 10.1016/j.specom.2005.02.014 Xu Y., 2009, J PHONETICS, V37, P507 Xu Y., 2013, PROSODY ICONICITY, P33 Xu Y., 2010, PENTATRAINER1 Xu Y, 2005, J PHONETICS, V33, P159, DOI 10.1016/j.wocn.2004.11.001 Xu Y, 1998, PHONETICA, V55, P179, DOI 10.1159/000028432 Xu Y, 2001, SPEECH COMMUN, V33, P319, DOI 10.1016/S0167-6393(00)00063-7 Xu Y, 2002, J ACOUST SOC AM, V111, P1399, DOI 10.1121/1.1445789 Yang XH, 2012, PROCEEDINGS OF THE 6TH INTERNATIONAL CONFERENCE ON SPEECH PROSODY, VOLS I AND II, P543 Yip M., 2002, TONE Yuan J., 2002, P 1 INT C SPEECH PRO, P711 Zhang J., 2004, PHONETICALLY BASED P NR 101 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
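The qTA model named in the Xu and Prom-on abstract above is usually presented as sequential target approximation by a third-order critically damped linear system. The equations below follow the commonly cited formulation rather than quoting this record, so the exact notation should be read as an outside summary.

```latex
% Commonly cited form of the qTA model (outside summary, not quoted from this record).
% Each syllable-sized unit carries a linear pitch target x(t):
\[ x(t) = m\,t + b \]
% The surface f0 approaches that target within the unit as
\[ f_0(t) = x(t) + \bigl(c_1 + c_2\,t + c_3\,t^2\bigr)\, e^{-\lambda t} \]
% where c_1, c_2, c_3 are fixed by requiring f_0 and its first two derivatives
% to be continuous with the preceding unit at t = 0. The per-unit parameters
% m (target slope), b (target height) and lambda (rate of approximation) are
% what a PENTAtrainer-style learner estimates from data, e.g. by simulated
% annealing against the observed f0 contour, which is the stochastic learning
% the abstract refers to.
```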
PD FEB PY 2014 VL 57 BP 181 EP 208 DI 10.1016/j.specom.2013.09.013 PG 28 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 268OJ UT WOS:000328180100014 ER PT J AU Wagner, P Malisz, Z Kopp, S AF Wagner, Petra Malisz, Zofia Kopp, Stefah TI Gesture and speech in interaction: An overview SO SPEECH COMMUNICATION LA English DT Editorial Material ID EMBODIED CONVERSATIONAL AGENTS; BEHAVIOR MARKUP LANGUAGE; HEAD MOVEMENTS; COLLABORATIVE PROCESS; PROSODIC PROMINENCE; LISTENER RESPONSES; ANNOTATION-SCHEME; VIRTUAL HUMANS; HAND; COMMUNICATION AB Gestures and speech interact. They are linked in language production and perception, with their interaction contributing to felicitous communication. The multifaceted nature of these interactions has attracted considerable attention from the speech and gesture community. This article provides an overview of our current understanding of manual and head gesture form and function, of the principle functional interactions between gesture and speech aiding communication, transporting meaning and producing speech. Furthermore, we present an overview of research on temporal speech-gesture synchrony, including the special role of prosody in speech-gesture alignment. In addition, we provide a summary of tools and data available for gesture analysis, and describe speech-gesture interaction models and simulations in technical systems. This overview also serves as an introduction to a Special Issue covering a wide range of articles on these topics. We provide links to the Special Issue throughout this paper. (C) 2013 Elsevier B.V. All rights reserved. EM zofia.malisz@uni-bielefeld.de CR Abercrombie David, 1954, ELT J, V9, P3 Akakin HC, 2011, IMAGE VISION COMPUT, V29, P470, DOI 10.1016/j.imavis.2011.03.001 Alibali MW, 2001, J MEM LANG, V44, P169, DOI 10.1006/jmla.2000.2752 Alibali MW, 2000, LANG COGNITIVE PROC, V15, P593 Alibali MW, 1999, COGNITIVE DEV, V14, P37, DOI 10.1016/S0885-2014(99)80017-3 Allwood J, 2007, LANG RESOUR EVAL, V41, P273, DOI 10.1007/s10579-007-9061-5 Allwood J., 2003, 1 NORD S MULT COMM C, P7 Al Moubayed S, 2010, J MULTIMODAL USER IN, V3, P299, DOI 10.1007/s12193-010-0054-0 Altorfer A, 2000, BEHAV RES METH INS C, V32, P17, DOI 10.3758/BF03200785 Bavelas J, 2008, J MEM LANG, V58, P495, DOI 10.1016/j.jml.2007.02.004 Bavelas JB, 1994, RES LANG SOC INTERAC, V27, P201, DOI DOI 10.1207/S15327973RLSI2703_3 Bavelas JB, 2002, J COMMUN, V52, P566, DOI 10.1093/joc/52.3.566 Becker Raymond, 2011, P GESPIN2011 GEST SP Beckman M. 
E., 1986, PHONOLOGY YB, V3, P255, DOI 10.1017/S095267570000066X Bergmann K, 2006, P 10 WORKSH SEM PRAG, P90 Bergmann K, 2010, LECT NOTES ARTIF INT, V6356, P104, DOI 10.1007/978-3-642-15892-6_11 Bergmann K., 2013, LECT NOTES ARTIF INT, P139 Bergmann K, 2010, LECT NOTES ARTIF INT, V5934, P182, DOI 10.1007/978-3-642-12553-9_16 Bergmann Kirsten, 2011, P GESPIN2011 GEST SP Bergmann Kirsten, 2013, P INT C INT VIRT AG Bergmann Kirsten, 2009, P 8 INT C AUT AG MUL, P361 Beskow J., 2006, FOCAL ACCENT FACIAL, P52 Beskow J, 2007, LECT NOTES COMPUT SC, V4775, P250 Bevacqua Elisabetta, 2009, THESIS U PARIS 8 PAR Birdwhistell Ray L., 1970, ESSAYS BODY MOTION C Birdwhistell Ray L., 1952, INTRO KINESICS ANNOT Boersma P., 2008, PRAAT DOING PHONETIC BOLINGER D, 1983, AM SPEECH, V58, P156, DOI 10.2307/455326 Bolinger D., 1986, INTONATION ITS PARTS Bolinger Dwight, 1961, LANGUAGE, V37, P87 Bolinger Dwight, 1982, CHICAGO LINGUISTICS, P1 Bousmalis Konstantinos, 2012, IMAGE VISION COMPUT, V31, P203 Bressem J, 2011, SEMIOTICA, V184, P53, DOI 10.1515/semi.2011.022 Browman Catherine, 1986, PHONOLOGY YB, V3, P219 Brugman Hennie, 2004, P 4 INT C LANG RES E, P2065 Brugman Hennie, 2002, 3 INT C LANG RES EV, P176 BULL P, 1985, J NONVERBAL BEHAV, V9, P169, DOI 10.1007/BF01000738 Buss S. R., 2003, 3D COMPUTER GRAPHICS Butterworth B., 1978, RECENT ADV PSYCHOL L Cafaro Angelo, 2012, IVA, P67 Caldognetto Emanuela M., 2004, P LREC WORKSH MULT C, P29 Caldognetto Magno, 2001, MULTIMODALITA MULTIM Cassell J., 2000, EMBODIED CONVERSATIO Cassell Justine, 1996, COMPUTER VISION HUMA Cerrato Loredana, 2007, THESIS KTH COMPUTER Cerrato Loredana, 2005, GOTHENBURG PAPERS TH, P153 Chiu Chung-Cheng, 2011, 11 INT C INT VIRT AG CHRISTENFELD N, 1991, J PSYCHOLINGUIST RES, V20, P1, DOI 10.1007/BF01076916 Chui K, 2005, J PRAGMATICS, V37, P871, DOI 10.1016/j.pragma.2004.10.016 Cienki A, 2008, CAMB HANDB PSYCHOL, P483 Clark H. H., 1991, PERSPECTIVES SOCIALL, V13, P127, DOI DOI 10.1037/10096-006 CLARK HH, 1986, COGNITION, V22, P1, DOI 10.1016/0010-0277(86)90010-7 Condon W. S., 1971, PERCEPTION LANGUAGE Corradini A., 2002, Gesture and Sign Language in Human-Computer Interaction. International Gesture Workshop, GW 2001. Revised Papers (Lecture Notes in Artificial Intelligence Vol.2298) De Ruiter J. P., 2000, LANGUAGE GESTURE, P284, DOI DOI 10.1017/CBO9780511620850.018 De Ruiter JP, 2000, LANGUAGE GESTURE, P248 de Ruiter JP, 2010, INTERACT STUD, V11, P51, DOI 10.1075/is.11.1.05rui de Ruiter JP, 2012, TOP COGN SCI, V4, P232, DOI 10.1111/j.1756-8765.2012.01183.x DeSteno D, 2012, PSYCHOL SCI, V23, P1549, DOI 10.1177/0956797612448793 DITTMANN A. T., 1969, J PERS SOC PSYCHOL, V23, P283 DITTMANN AT, 1968, J PERS SOC PSYCHOL, V9, P79, DOI 10.1037/h0025722 Dobrogaev S.M., 1929, IAZYKOVEDENIE MAT, P105 DUNCAN S, 1972, J PERS SOC PSYCHOL, V23, P283, DOI 10.1037/h0033031 Eickeler S, 1998, INT C PATT RECOG, P1206 Ekman P., 1979, HUMAN ETHOLOGY, P169 EKMAN P, 1972, J COMMUN, V22, P353, DOI 10.1111/j.1460-2466.1972.tb00163.x Erdem U. 
M., 2002, Proceedings 16th International Conference on Pattern Recognition, DOI 10.1109/ICPR.2002.1044759 Eriksson A., 2001, PROCEEDINGS OF EUROS, P399 Ferre G., 2010, LANGUAGE RESOURCES E Feyereisen P., 1987, PSYCHOL REV, V94, P168 Gentilucci Maurizio, 2007, GESTURE, V7, P159, DOI 10.1075/gest.7.2.03gen GOLDINMEADOW S, 1993, PSYCHOL REV, V100, P279, DOI 10.1037/0033-295X.100.2.279 Goldin-Meadow S, 2000, CHILD DEV, V71, P231, DOI 10.1111/1467-8624.00138 Goldin-Meadow S, 2001, PSYCHOL SCI, V12, P516, DOI 10.1111/1467-9280.00395 Goldin-Meadow S, 1999, TRENDS COGN SCI, V3, P419, DOI 10.1016/S1364-6613(99)01397-2 Goldsmith J., 1990, AUTOSEGMENTAL METRIC Goodwin C., 1981, CONVERSATIONAL ORG I Granstrom Bjorn, 2007, P 16 INT C PHON SCI, P11 Gravano A, 2011, COMPUT SPEECH LANG, V25, P601, DOI 10.1016/j.csl.2010.10.003 Gussenhoven Carlos, 1999, LANG SPEECH, V42, P283 Gut Ulrike, 2003, KI, V17, P34 Habets B, 2011, J COGNITIVE NEUROSCI, V23, P1845, DOI 10.1162/jocn.2010.21462 HADAR U, 1984, HUM MOVEMENT SCI, V3, P237, DOI 10.1016/0167-9457(84)90018-6 HADAR U, 1983, HUM MOVEMENT SCI, V2, P35, DOI 10.1016/0167-9457(83)90004-0 BUTTERWORTH B, 1989, PSYCHOL REV, V96, P168, DOI 10.1037//0033-295X.96.1.168 Hadar Uri, 1985, J NONVERBAL BEHAV, V9, P214 Harling P. A., 1997, Progress in Gestural Interaction. Proceedings of Gesture Workshop '96 Harrison S., 2013, P TIGER 2013 TILB NL Hartmann B, 2006, LECT NOTES ARTIF INT, V3881, P188 Heldner Mattias, 2012, NORD PROS P 11 C TAR, P137 Heylen D, 2011, COGN TECHNOL, P321, DOI 10.1007/978-3-642-15184-2_17 Heylen D, 2008, LECT NOTES ARTIF INT, V4930, P241 Heylen D, 2006, INT J HUM ROBOT, V3, P241, DOI 10.1142/S0219843606000746 Heylen Dirk, 2005, P JOINT S VIRT SOC A, P45 Holler J, 2007, J LANG SOC PSYCHOL, V26, P4, DOI 10.1177/0261927X06296428 Holler J., 2009, J LANG SOC PSYCHOL, V26, P4 Holler J, 2003, GESTURE, V3, P127, DOI DOI 10.1075/GEST.3.2.02HOL Hostetter A., 2007, GESTURE, V7, P73, DOI DOI 10.1075/GEST.7.1.05HOS Hostetter AB, 2007, LANG COGNITIVE PROC, V22, P313, DOI 10.1080/01690960600632812 Hostetter AB, 2012, GESTURE, V12, P62, DOI 10.1075/gest.12.1.04hos Hostetter AB, 2008, PSYCHON B REV, V15, P495, DOI 10.3758/PBR.15.3.495 Ishii R, 2008, LECT NOTES COMPUT SC, V5208, P200 Iverson J. M., 1999, J CONSCIOUSNESS STUD, V6, P19 Iverson JM, 1998, NATURE, V396, P228, DOI 10.1038/24300 Jakobson Roman, 1972, LANGUAGE SOC, V1, P91 Jannedy S., 2005, INTERDISCIPLINARY ST, V03, P199 Jokinen K, 2010, P WORKSH EYE GAZ INT, P118 Jun Sun-Ah, 2007, PROSODIC TYPOLOGY PH, P430 Kapoor A, 2001, P 2001 WORKSH PERC U, P1, DOI 10.1145/971478.971509 Karpitiski Maciej, 2009, SPEECH LANGUAGE TECH, V11, P113 Kelso J. A. 
S., 1983, PRODUCTION SPEECH, P138 Kendon A., 2004, GESTURE VISIBLE ACTI Kendon Adam, 1972, STUDIES DYADIC COMMU, P177 Kendon Adam, 1980, NONVERBAL COMMUNICAT, P207 Kendon Adam, 2003, GESTURE, V2, P147 Kipp M, 2007, LANG RESOUR EVAL, V41, P325, DOI 10.1007/s10579-007-9053-5 Kipp M., 2004, THESIS SAARLAND U Kipp M., 2009, P INT C AFF COMP INT Kipp Michael, 2001, P WORKSH MULT COMM C Kipp Michael, 2009, LECT NOTES COMPUTER, V5509 Kipp Michael, 2012, MULTIMEDIA INFORM EX, P531 Kirchhof Carolin, 2012, 5 C INT SOC GEST STU Kirchhof Carolin, 2011, P GESPIN2011 GEST SP Kita S., 2000, LANGUAGE CULTURE COG, P162 Kita S, 2003, J MEM LANG, V48, P16, DOI 10.1016/S0749-596X(02)00505-3 Kita S., 2009, LANG COGNITIVE PROC, V24, P795 Knight Dawn, 2011, RBLA, V2, P391 Kopp S., 2004, P INT C MULT INT ICM, P97, DOI 10.1145/1027933.1027952 Kopp S, 2006, LECT NOTES ARTIF INT, V4133, P205 Kopp S, 2008, LECT NOTES ARTIF INT, V4930, P18 Kopp Stefan, 2013, P 35 ANN M COGN SCI Kousidis Spyros, 2013, TIGER 2013 TILB GEST Kousidis Spyros, 2012, P INT WORKSH FEEDB B Krahmer E, 2007, J MEM LANG, V57, P396, DOI 10.1016/j.jml.2007.06.005 Krauss R., 1999, GESTURE SPEECH SIGN, P93 Krenn B., 2004, P AISB 2004 S LANG S, P107 Kunin M, 2007, J NEUROPHYSIOL, V98, P3095, DOI 10.1152/jn.00764.2007 Lakoff John, 1980, METHAPHORS WE LIVE Lausberg H, 2009, BEHAV RES METHODS, V41, P841, DOI 10.3758/BRM.41.3.841 Lee J, 2006, LECT NOTES ARTIF INT, V4133, P243 Leonard Thomas, 2010, LANG COGNITIVE PROC, V26, P1457 Leonard Thomas, 2009, P GESPIN2009 GEST SP, P1 LEVELT WJM, 1985, J MEM LANG, V24, P133, DOI 10.1016/0749-596X(85)90021-X Li Renxiang, 2007, MOBILITY 07, P572 Loehr D. P., 2012, LAB PHONOL, V3, P71 Loehr DP, 2004, THESIS GEORGETOWN U LOEvenbruck H., 2009, SOME ASPECTS SPEECH, P211 Louwerse MM, 2012, COGNITIVE SCI, V36, P1404, DOI 10.1111/j.1551-6709.2012.01269.x Lu Peng, 2005, LECT NOTES COMPUTER, P495 Lucking A, 2013, J MULTIMODAL USER IN, V7, P5, DOI 10.1007/s12193-012-0106-8 Martell C., 2002, P ICSLP 02, P353 Martell CH, 2005, TEXT SPEECH LANG TEC, V30, P79 Mayer RE, 2012, J EXP PSYCHOL-APPL, V18, P239, DOI 10.1037/a0028616 McClave E., 1991, THESIS GEORGETOWN U MCCLAVE E, 1994, J PSYCHOLINGUIST RES, V23, P45, DOI 10.1007/BF02143175 McClave EZ, 2000, J PRAGMATICS, V32, P855, DOI 10.1016/S0378-2166(99)00079-X McNeill D., 1992, HAND MIND WHAT GESTU McNeill D., 2005, GESTURE AND THOUGHT MCNEILL D, 1985, PSYCHOL REV, V92, P350, DOI 10.1037//0033-295X.92.3.350 McNeill David, 1989, PSYCHOL REV, V94, P499 Mertins Inge, 2000, HDB MULTIMODAL SPOKE Morency L-P, 2005, P 7 INT C MULT INT, P18, DOI 10.1145/1088463.1088470 MORRELSAMUELS P, 1992, J EXP PSYCHOL LEARN, V18, P615, DOI 10.1037/0278-7393.18.3.615 Munhall KG, 2004, PSYCHOL SCI, V15, P133, DOI 10.1111/j.0963-7214.2004.01502010.x Neff M, 2008, ACM T GRAPHIC, V27, DOI 10.1145/1330511.1330516 NGUYEN L, 2012, P INT C MULT INT ACM, P289 Nobe S, 2000, LANGUAGE GESTURE, P186, DOI 10.1017/CBO9780511620850.012 Oertel C, 2013, J MULTIMODAL USER IN, V7, P19, DOI 10.1007/s12193-012-0108-6 OHALA JJ, 1984, PHONETICA, V41, P1 Ozyarek A., 2007, J COGNITIVE NEUROSCI, V19, P605 Parrell B, 2011, P 9 INT SEM SPEECH P Poggi I., 2010, P INT C LANG RES EV, P17 Poggi Isabella, 2001, P INT VIRT AG 3 INT, P235 Poggi I, 2013, J MULTIMODAL USER IN, V7, P67, DOI 10.1007/s12193-012-0102-z Priesters M.A., 2013, P TILB GEST RES M TI Rietveld Toni, 1985, J PHONETICS, V13, P299 Rochet-Capellan A, 2008, J SPEECH LANG HEAR R, V51, P1507, DOI 10.1044/1092-4388(2008/07-0173) Rosenfeld H.M., 1980, RELATIONSHIP VERBAL, P193 
Roth WM, 2001, REV EDUC RES, V71, P365, DOI 10.3102/00346543071003365 Roustan Benjamin, 2010, SPEECH PROSODY 2010 Ruttkay Z, 2007, LECT NOTES COMPUT SC, V4775, P23 Salem Maha, 2012, INT J SOC ROBOT, V1875-4805, P201 Sargin Mehmet, 2006, P IEEE INT C MULT, P893 Sargin ME, 2008, IEEE T PATTERN ANAL, V30, P1330, DOI 10.1109/TPAMI.2007.70797 Schegloff E.A., 1984, STRUCTURES SOCIAL AC, P266 Schmidt T., 2004, P LREC WORKSH XML BA SCHRODER M, 2001, P EUR 2001 AALB, V1, P87 Selting M., 1998, LINGUISTISCHE BERICH, V173, P91 Selting Margret, 2009, GESPRACHSFORSCHUNG O, V10, P353 Shattuck-Hufnagel Stefanie, 2007, NATO SECURITY SCI E, V18 Slobin Dan Isaac, 1996, RETHINKING LINGUISTI, P70 So WC, 2009, COGNITIVE SCI, V33, P115, DOI 10.1111/j.1551-6709.2008.01006.x Spoons D., 1993, INTELLIGENT MULTIMED, P257 Stetson Raymond Herbert, 1951, MOTOR PHONETICS Stone M, 2004, ACM T GRAPHIC, V23, P506, DOI 10.1145/1015706.1015753 Streeck J, 2008, GESTURE, V8, P285, DOI 10.1075/gest.8.3.02str SWERTS M, 1994, LANG SPEECH, V37, P21 Swerts M, 2008, J PHONETICS, V36, P219, DOI 10.1016/j.wocn.2007.05.001 Tamburini Fabio, 2007, P INT 2007 ANTW BELG, P1809 TERKEN J, 1991, J ACOUST SOC AM, V89, P1768, DOI 10.1121/1.401019 Theune M, 2010, LECT NOTES ARTIF INT, V5934, P195, DOI 10.1007/978-3-642-12553-9_17 Tomlinson RD, 2000, IEEE ENG MED BIOL, V19, P43, DOI 10.1109/51.827404 Treffner P, 2008, ECOL PSYCHOL, V20, P32, DOI 10.1080/10407410701766643 Trippel Thorsten, 2004, P LREC 2004 LISB POR Truong KP, 2011, P INT FLOR IT, P2973 TUITE K, 1993, SEMIOTICA, V93, P83, DOI 10.1515/semi.1993.93.1-2.83 Urban Christian, 2011, THESIS BIELEFELD U Vendler Z, 1967, LINGUISTICS PHILOS Vilhjalmsson H, 2007, LECT NOTES ARTIF INT, V4722, P99 Wachsmuth I., 1998, LECT NOTES ARTIF INT, V1317, P23 Wagner Petra, NEW THEORY COMMUNICA Wexelblat A, 1995, ACM T COMPUT-HUM INT, V2, P179, DOI 10.1145/210079.210080 Wilson Andrew D., 1996, INT C AUT FAC GEST R Wlodarczak Marcin, 2012, P INT WORKSH FEEDB B Wu Y, 1999, LECT NOTES ARTIF INT, V1739, P103 Yassinik Y., 2004, P INT C SOUND SENS M, pC97 Yehia C., 2002, J PHONETICS, V30, P555 Yoganandan N, 2009, J BIOMECH, V42, P1177, DOI 10.1016/j.jbiomech.2009.03.029 NR 218 TC 6 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2014 VL 57 BP 209 EP 232 DI 10.1016/j.specom.2013.09.008 PG 24 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 268OJ UT WOS:000328180100015 ER PT J AU Ishi, CT Ishiguro, H Hagita, N AF Ishi, Carlos Toshinori Ishiguro, Hiroshi Hagita, Norihiro TI Analysis of relationship between head motion events and speech in dialogue conversations SO SPEECH COMMUNICATION LA English DT Article DE Head motion; Paralinguistic information; Dialogue act; Inter-personal relationship; Spontaneous speech ID ANIMATION AB Head motion naturally occurs in synchrony with speech and may convey paralinguistic information (such as intentions, attitudes and emotions) in dialogue communication. With the aim of verifying the relationship between head motion events and speech utterances, analyses were conducted on motion-captured data of multiple speakers during spontaneous dialogue conversations. The relationship between head motion events and dialogue acts was firstly analyzed. 
Among the head motion types, nods occurred most frequently during speech utterances, not only for expressing dialogue acts of agreement or affirmation, but also appearing at the end of phrases with strong boundaries (including both turn-keeping and giving dialogue act functions). Head shakes usually appeared for expressing negation, while head tilts appeared mostly in interjections expressing denial, and in phrases with weak boundaries, where the speaker is thinking or did not finish uttering. The synchronization of head motion events and speech was also analyzed with a focus on the timing of nods relative to the last syllable of a phrase. Results showed that nods were highly synchronized with the center portion of backchannels, while they were more synchronized with the end portion of the last syllable in phrases with strong boundaries. Speaker variability analyses indicated that the inter-personal relationship with the interlocutor is one factor influencing the frequency of head motion events. It was found that the frequency of nods was lower for dialogue partners with a close relationship (such as family members), where speakers do not have to express careful attitudes. On the other hand, the frequency of nods (especially of multiple nods) clearly increased when the inter-personal relationship between the dialogue partners was distant. (C) 2013 Elsevier B.V. All rights reserved. C1 [Ishi, Carlos Toshinori; Hagita, Norihiro] ATR Intelligent Robot & Commun Lab, Keihanna Sci City, Kyoto 6190288, Japan. [Ishiguro, Hiroshi] ATR Hiroshi Ishiguro Special Lab, Kyoto, Japan. RP Ishi, CT (reprint author), ATR Intelligent Robot & Commun Lab, 2-2-2 Hikaridai, Keihanna Sci City, Kyoto 6190288, Japan. EM carlos@atr.jp FU Ministry of Internal Affairs and Communication FX This work is partly supported by the Ministry of Internal Affairs and Communication. We thank Kyoko Nakanishi, Maiko Hirano, Chaoran Liu, Hiroaki Hatano and Mika Morita for their contributions in data annotation and analysis. We also thank Freerk Wilbers and Judith Haas for their contributions in the collection and processing of motion data. CR Beskow J, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P1272 Burnham D., 2007, P 8 ANN C INT SPEECH, P698 Busso C, 2007, IEEE T AUDIO SPEECH, V15, P1075, DOI 10.1109/TASL.2006.885910 Dohen M., 2006, P SPEECH PROSODY, P221 Foster ME, 2007, LANG RESOUR EVAL, V41, P305, DOI 10.1007/s10579-007-9055-3 Graf H.P., 2002, P IEEE INT C AUT FAC Hofer G., 2007, P INT 2007, P722 Ishi Carlos T, 2010, Proceedings of the 5th ACM/IEEE International Conference on Human-Robot Interaction (HRI 2010), DOI 10.1109/HRI.2010.5453183 Ishi C.T., 2006, J PHONETIC SOC JPN, V10, P18 Ishi C.T., 2007, P INT 2007, P670 Ishi C.T., 2008, P INT C AUD VIS SPEE, P37 Iwano Y., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607233 Morency LP, 2007, ARTIF INTELL, V171, P568, DOI 10.1016/j.artint.2007.04.003 Munhall KG, 2004, PSYCHOL SCI, V15, P133, DOI 10.1111/j.0963-7214.2004.01502010.x Sargin M.E., 2006, P IEEE INT C MULT Sidner C.L., 2006, P HRI 2006, P290, DOI DOI 10.1145/1121241.1121291 Stegmann M.
B., 2002, BRIEF INTRO STAT SHA Swerts M, 2008, J PHONETICS, V36, P219, DOI 10.1016/j.wocn.2007.05.001 Venditti J., 1997, OSU WORKING PAPERS L, V50, P127 Watanabe T, 2004, INT J HUM-COMPUT INT, V17, P43, DOI 10.1207/s15327590ijhc1701_4 Watanuki K., 2000, P ICSLP Yehia HC, 2002, J PHONETICS, V30, P555, DOI 10.1006/jpho.2002.0165 NR 22 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2014 VL 57 BP 233 EP 243 DI 10.1016/j.specom.2013.06.008 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 268OJ UT WOS:000328180100016 ER PT J AU Rowbotham, S Holler, J Lloyd, D Wearden, A AF Rowbotham, Samantha Holler, Judith Lloyd, Donna Wearden, Alison TI Handling pain: The semantic interplay of speech and co-speech hand gestures in the description of pain sensations SO SPEECH COMMUNICATION LA English DT Article DE Co-speech gesture; Gesture-speech redundancy; Pain communication; Pain sensation; Semantic interplay of gesture and speech ID FACE-TO-FACE; ICONIC GESTURES; COMMUNICATION; VISIBILITY; TELEPHONE; DIALOGUE; BEHAVIOR; RECORD; WORDS; MODEL AB Pain is a private and subjective experience about which effective communication is vital, particularly in medical settings. Speakers often represent information about pain sensation in both speech and co-speech hand gestures simultaneously, but it is not known whether gestures merely replicate spoken information or complement it in some way. We examined the representational contribution of gestures in a range of consecutive analyses. Firstly, we found that 78% of speech units containing pain sensation were accompanied by gestures, with 53% of these gestures representing pain sensation. Secondly, in 43% of these instances, gestures represented pain sensation information that was not contained in speech, contributing additional, complementary information to the pain sensation message. Finally, when applying a specificity analysis, we found that in contrast with research in different domains of talk, gestures did not make the pain sensation information in speech more specific. Rather, they complemented the verbal pain message by representing different aspects of pain sensation, contributing to a fuller representation of pain sensation than speech alone. These findings highlight the importance of gestures in communicating about pain sensation and suggest that this modality provides additional information to supplement and clarify the often ambiguous verbal pain message. (C) 2013 Elsevier B.V. All rights reserved. C1 [Rowbotham, Samantha; Holler, Judith; Lloyd, Donna; Wearden, Alison] Univ Manchester, Sch Psychol Sci, Manchester M13 9PL, Lancs, England. [Holler, Judith] Max Planck Inst Psycholinguist, NL-6525 XD Nijmegen, Netherlands. [Lloyd, Donna] Univ Leeds, Inst Psychol Sci, Leeds LS2 9JT, W Yorkshire, England. RP Rowbotham, S (reprint author), Univ Manchester, Sch Psychol Sci, Coupland Bldg,Oxford Rd, Manchester M13 9PL, Lancs, England. 
EM samantha.rowbotham@manchester.ac.uk CR Alibali MW, 1997, J EDUC PSYCHOL, V89, P183, DOI 10.1037/0022-0663.89.1.183 Alibali MW, 2001, J MEM LANG, V44, P169, DOI 10.1006/jmla.2000.2752 Bavelas J., 2002, GESTURE, V2, P1, DOI 10.1075/gest.2.1.02bav Bavelas J, 2008, J MEM LANG, V58, P495, DOI 10.1016/j.jml.2007.02.004 Bavelas JB, 1994, RES LANG SOC INTERAC, V27, P201, DOI DOI 10.1207/S15327973RLSI2703_3 BAVELAS JB, 1992, DISCOURSE PROCESS, V15, P469 Bavelas JB, 2000, J LANG SOC PSYCHOL, V19, P163, DOI 10.1177/0261927X00019002001 Beattie G, 1999, SEMIOTICA, V123, P1, DOI 10.1515/semi.1999.123.1-2.1 Beattie G, 1999, J LANG SOC PSYCHOL, V18, P438, DOI 10.1177/0261927X99018004005 Bergmann K., 2006, BRANDIAL 06, P90 Briggs Emma, 2010, Nurs Stand, V25, P35 Bruckner CT, 2006, AM J MENT RETARD, V111, P433, DOI 10.1352/0895-8017(2006)111[433:IKIORB]2.0.CO;2 BUTTERWO.B, 1975, J PSYCHOLINGUIST RES, V4, P75, DOI 10.1007/BF01066991 Cook SW, 2009, COGNITION, V113, P98, DOI 10.1016/j.cognition.2009.06.006 Craig KD, 2009, CAN PSYCHOL, V50, P22, DOI 10.1037/a0014772 CRAIG KD, 1992, APS J, V1, P153, DOI 10.1016/1058-9139(92)90001-S Ehlich K, 1985, Theor Med, V6, P177, DOI 10.1007/BF00489662 Ekman P, 1968, RES PSYCHOTHER, P179, DOI 10.1037/10546-011 EMMORREY K., 2001, GESTURE, V1, P35, DOI DOI 10.1075/GEST.1.1.04EMM Frank A. W., 1991, WILL BODY REFLECTION Gerwing J, 2009, GESTURE, V9, P312, DOI 10.1075/gest.9.3.03ger Gerwing J, 2011, GESTURE, V11, P308, DOI 10.1075/gest.11.3.03ger GRAHAM JA, 1975, INT J PSYCHOL, V10, P57, DOI 10.1080/00207597508247319 Hartzband P, 2008, NEW ENGL J MED, V358, P1656, DOI 10.1056/NEJMp0802221 Heath C, 2002, J COMMUN, V52, P597, DOI 10.1093/joc/52.3.597 Holler J, 2003, SEMIOTICA, V146, P81 Holler J, 2002, SEMIOTICA, V142, P31 Holler J, 2009, J NONVERBAL BEHAV, V33, P73, DOI 10.1007/s10919-008-0063-9 Holler J, 2003, GESTURE, V3, P127, DOI DOI 10.1075/GEST.3.2.02HOL Hostetter AB, 2011, PSYCHOL BULL, V137, P297, DOI 10.1037/a0022128 Hyden LC, 2002, HEALTH, V6, P325, DOI 10.1177/136345930200600305 Jacobs N, 2007, J MEM LANG, V56, P291, DOI 10.1016/j.jml.2006.07.011 Kallai I, 2004, PAIN, V112, P142, DOI 10.1016/j.pain.2004.08.008 Kendon A, 1997, ANNU REV ANTHROPOL, V26, P109, DOI 10.1146/annurev.anthro.26.1.109 Kendon A., 1980, RELATIONSHIP VERBAL, P207 Kendon A., 2004, GESTURE VISIBLE ACTI Kendon Adam, 1985, PERSPECTIVES SILENCE, P215 Labus JS, 2003, PAIN, V102, P109, DOI 10.1016/S0304-3959(02)00354-8 LANDIS JR, 1977, BIOMETRICS, V33, P159, DOI 10.2307/2529310 LEVINE FM, 1991, PAIN, V44, P69, DOI 10.1016/0304-3959(91)90149-R Loeser JD, 1999, LANCET, V353, P1607, DOI 10.1016/S0140-6736(99)01311-2 Makoul G, 2001, J AM MED INFORM ASSN, V8, P610 Margalit RS, 2006, PATIENT EDUC COUNS, V61, P134, DOI 10.1016/j.pec.2005.03.004 McCahon S, 2005, CLIN J PAIN, V21, P223, DOI 10.1097/00002508-200505000-00005 McGrath John M, 2007, Health Informatics J, V13, P105, DOI 10.1177/1460458207076466 McNeill D., 1992, HAND MIND WHAT GESTU McNeill D., 2005, GESTURE AND THOUGHT MCNEILL D, 1985, PSYCHOL REV, V92, P350, DOI 10.1037//0033-295X.92.3.350 National Cancer Institute, 2011, PAIN ASS OLDFIELD RC, 1971, NEUROPSYCHOLOGIA, V9, P97, DOI 10.1016/0028-3932(71)90067-4 Padfield D., 2003, PERCEPTIONS PAIN PRKACHIN KM, 1995, J NONVERBAL BEHAV, V19, P191, DOI 10.1007/BF02173080 Rowbotham S, 2012, J NONVERBAL BEHAV, V36, P1, DOI 10.1007/s10919-011-0122-5 Rowbotham S., HLTH PSYCHO IN PRESS, V22, P19 Ruusuvuori J, 2001, SOC SCI MED, V52, P1093, DOI 10.1016/S0277-9536(00)00227-6 Ryle G., 1949, CONCEPT MIND Salovey 
P., 1992, VITAL HLTH STAT, P6 Scarry E., 1985, BODY PAIN MAKING UNM Schott GD, 2004, PAIN, V108, P209, DOI 10.1016/j.pain.2004.01.037 Swann J, 2010, NURS RESIDENTIAL CAR, V12, P212 Tian TN, 2011, PAIN, V152, P1210, DOI 10.1016/j.pain.2011.02.022 Treasure T, 1998, BRIT MED J, V317, P602 Turk D.C., 2001, HDB PAIN WAGSTAFF S, 1985, ANN RHEUM DIS, V44, P262, DOI 10.1136/ard.44.4.262 NR 64 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2014 VL 57 BP 244 EP 256 DI 10.1016/j.specom.2013.04.002 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 268OJ UT WOS:000328180100017 ER PT J AU Hoetjes, M Krahmer, E Swerts, M AF Hoetjes, Marieke Krahmer, Emiel Swerts, Marc TI Does our speech change when we cannot gesture? SO SPEECH COMMUNICATION LA English DT Article DE Gesture; Speech; Gesture prevention; Director-matcher task ID NONVERBAL BEHAVIOR; LEXICAL RETRIEVAL; HAND GESTURES; COMMUNICATION; ADDRESSEES; VISIBILITY; DIALOGUE; SPEAKING; ACCESS AB Do people speak differently when they cannot use their hands? Previous studies have suggested that speech becomes less fluent and more monotonous when speakers cannot gesture, but the evidence for this claim remains inconclusive. The present study attempts to find support for this claim in a production experiment in which speakers had to give addressees instructions on how to tie a tie; half of the participants had to perform this task while sitting on their hands. Other factors that influence the ease of communication, such as mutual visibility and previous experience, were also taken into account. No evidence was found for the claim that the inability to gesture affects speech fluency or monotony. An additional perception task showed that people were also not able to hear whether someone gestures or not. (C) 2013 Elsevier B.V. All rights reserved. C1 [Hoetjes, Marieke; Krahmer, Emiel; Swerts, Marc] Tilburg Univ, Sch Humanities, Tilburg Ctr Cognit & Commun TiCC, NL-5000 LE Tilburg, Netherlands. RP Hoetjes, M (reprint author), Tilburg Univ, Room D404,POB 90153, NL-5000 LE Tilburg, Netherlands. EM m.w.hoetjes@tilburguniversity.edu; e.j.krahmer@tilburguniversity.edu; m.g.j.swerts@tilburguniversity.edu RI Swerts, Marc/C-8855-2013 FU Netherlands Organization for Scientific Research [27770007] FX We would like to thank Bastiaan Roset and Nick Wood for statistical and technical support and help in creating the stimuli, Joost Driessen for help in transcribing the data, Martijn Goudbeek for statistical support and Katya Chown for providing background information on Dobrogaev. We received financial support from The Netherlands Organization for Scientific Research, via a Vici grant (NWO grant 27770007), which is gratefully acknowledged. Parts of this paper were presented at the Tabu dag 2009 in Groningen, at the Gesture Centre at the Max Planck Institute for Psycholinguistics, at the 2009 AVSP conference, at LabPhon 2010 and at ISGS 2010. We would like to thank the audiences for their suggestions and comments. Finally, thanks to the anonymous reviewers for their useful and constructive comments. 
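The Hoetjes, Krahmer and Swerts abstract above turns on whether speech becomes less fluent and more monotonous when gesturing is prevented. The sketch below shows two measures of the kind such a claim implies, a filled-pause rate and the spread of f0 in semitones; these are generic stand-ins, not the measures the authors report, and the token inventory and input format are assumptions.

```python
import numpy as np

FILLED_PAUSES = {"uh", "um", "eh", "ehm"}  # placeholder inventory, not from the study

def fluency_and_monotony(words, f0_hz):
    """words: list of (token, start_s, end_s) from a time-aligned transcript;
    f0_hz: f0 values (Hz) for voiced frames. Returns a filled-pause rate per
    second of speech (rough fluency measure) and the SD of f0 in semitones
    (rough monotony measure: smaller SD = flatter, more monotonous delivery)."""
    speech_time = sum(end - start for _, start, end in words)
    n_filled = sum(1 for token, _, _ in words if token.lower() in FILLED_PAUSES)
    voiced = np.asarray([f for f in f0_hz if f > 0], dtype=float)
    f0_sd_st = float(np.std(12.0 * np.log2(voiced / np.median(voiced)))) if voiced.size else 0.0
    return {
        "filled_pauses_per_s": n_filled / speech_time if speech_time > 0 else 0.0,
        "f0_sd_semitones": f0_sd_st,
    }
```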
CR Alibali MW, 2001, J MEM LANG, V44, P169, DOI 10.1006/jmla.2000.2752 Alibali MW, 2000, LANG COGNITIVE PROC, V15, P593 Bavelas J, 2008, J MEM LANG, V58, P495, DOI 10.1016/j.jml.2007.02.004 Beattie G, 1999, BRIT J PSYCHOL, V90, P35, DOI 10.1348/000712699161251 Bernardis P, 2006, NEUROPSYCHOLOGIA, V44, P178, DOI 10.1016/j.neuropsychologia.2005.05.007 Boersma P., 2010, PRAAT DOING PHONETIC BOLINGER D, 1983, AM SPEECH, V58, P156, DOI 10.2307/455326 Cave C., 1996, 4 INT C SPOK LANG PR Chown K, 2008, STUD E EUR THOUGHT, V60, P307, DOI 10.1007/s11212-008-9063-x Chu M., 2007, INTEGRATING GESTURES CLARK HH, 1986, COGNITION, V22, P1, DOI 10.1016/0010-0277(86)90010-7 Clark HH, 2004, J MEM LANG, V50, P62, DOI 10.1016/j.jml.2003.08.004 De Ruiter J. P., 2000, LANGUAGE GESTURE, P284, DOI DOI 10.1017/CBO9780511620850.018 De Ruiter J.-P., 2006, ADV SPEECH LANGUAGE, V8, P124, DOI DOI 10.1080/14417040600667285 Dobrogaev S.M., 1929, IAZYKOVEDENIE MAT, P105 Duncan S., 2000, LANGUAGE GESTURE, P141, DOI 10.1017/CBO9780511620850.010 EMMORREY K., 2001, GESTURE, V1, P35, DOI DOI 10.1075/GEST.1.1.04EMM Finlayson S., 2003, DISFLUENCY SPONTANEO Flecha-Garcia ML, 2010, SPEECH COMMUN, V52, P542, DOI 10.1016/j.specom.2009.12.003 GRAHAM JA, 1975, EUR J SOC PSYCHOL, V5, P189, DOI 10.1002/ejsp.2420050204 Gullberg M, 2006, LANG LEARN, V56, P155, DOI 10.1111/j.0023-8333.2006.00344.x Hostetter AB, 2010, J MEM LANG, V63, P245, DOI 10.1016/j.jml.2010.04.003 Hostetter A.B., 2007, 29 ANN M COGN SCI SO, P1097 Kendon A., 2004, GESTURE VISIBLE ACTI Kendon Adam, 1980, NONVERBAL COMMUNICAT, P207 Kita S, 2003, J MEM LANG, V48, P16, DOI 10.1016/S0749-596X(02)00505-3 Krahmer E, 2007, J MEM LANG, V57, P396, DOI 10.1016/j.jml.2007.06.005 Krahmer E, 2004, HUM COM INT, V7, P191 Krauss RM, 1998, CURR DIR PSYCHOL SCI, V7, P54, DOI 10.1111/1467-8721.ep13175642 Krauss R.M., 2001, GESTURE SPEECH SIGN, P93 Krauss RM, 1996, ADV EXP SOC PSYCHOL, V28, P389, DOI 10.1016/S0065-2601(08)60241-5 Kuhlen AK, 2013, PSYCHON B REV, V20, P54, DOI 10.3758/s13423-012-0341-8 Levelt W. J., 1989, SPEAKING INTENTION A McClave E, 1998, J PSYCHOLINGUIST RES, V27, P69, DOI 10.1023/A:1023274823974 McNeill D., 1992, HAND MIND WHAT GESTU Mol L, 2009, GESTURE, V9, P97, DOI 10.1075/gest.9.1.04mol Morsella E, 2005, J PSYCHOLINGUIST RES, V34, P415, DOI 10.1007/s10936-005-6141-9 Ozyurek A, 2002, J MEM LANG, V46, P688, DOI 10.1006/jmla.2001.2826 Pine KJ, 2007, DEVELOPMENTAL SCI, V10, P747, DOI 10.1111/j.1467-7687.2007.00610.x Rauscher FH, 1996, PSYCHOL SCI, V7, P226, DOI 10.1111/j.1467-9280.1996.tb00364.x RIME B, 1984, MOTIV EMOTION, V8, P311, DOI 10.1007/BF00991870 Wittenburg P., 2006, LREC 2006 NR 42 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2014 VL 57 BP 257 EP 267 DI 10.1016/j.specom.2013.06.007 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 268OJ UT WOS:000328180100018 ER PT J AU Ferre, G AF Ferre, Gaelle TI A multimodal approach to markedness in spoken French SO SPEECH COMMUNICATION LA English DT Article DE Contrast; Fronted syntactic constructions; Prosodic emphasis; Gesture reinforcement; Gesture-speech production models ID PROSODIC PROMINENCE; CONTRASTIVE FOCUS; VISUAL-PERCEPTION; SPEECH; GESTURE AB This study aims at examining the links between marked structures in the syntactic and prosodic domains (fronting and focal accent), and the way the two types of contrast can be reinforced by gestures. 
It was conducted on a corpus of 1h30 of spoken French, involving three pairs of speakers in dialogues. Results show that although the tendency is for marked constructions both in syntax and prosody not to be reinforced by gestures, there is still a higher proportion of gesture reinforcing with prosodic marking than with syntactic fronting. The paper describes which eyebrow and head movements as well as hand gestures are more liable to accompany the two operations. Beyond these findings, the study gives an insight into the current models proposed in the literature for gesture speech production. (C) 2013 Elsevier B.V. All rights reserved. C1 [Ferre, Gaelle] Univ Nantes, Sch Languages & Linguist Lab LLING, F-44312 Nantes 3, France. RP Ferre, G (reprint author), Univ Nantes, Fac Langues & Cultures Etrangeres, BP 81227, F-44312 Nantes 3, France. EM Gaelle.Ferre@univ-nantes.fr FU French National Research Agency [ANR BLAN08-2_349062] FX This research is supported by the French National Research Agency (Project number: ANR BLAN08-2_349062) and is based on a corpus and transcriptions made by various team members beside the author of the current paper, whom we would like to thank here. The OTIM project is referenced on the following webpage: http://aune.lpl.univ-aix.fr/similar to otim/. Many thanks as well to two anonymous reviewers for their useful comments on previous versions of the paper. CR Al Moubayed S., 2010, AUTONOMOUS ADAPTIVE, P55 Al Moubayed S, 2010, J MULTIMODAL USER IN, V3, P299, DOI 10.1007/s12193-010-0054-0 Astesano C., 2004, P JOURN ET PAR JEP F, P1 Beavin Bavelas J., 2000, PERSPECTIVES FLUENCY, P91 Bertrand R., 2008, TRAITEMENT AUTOMATIQ, V49, P105 Blache P, 2009, LECT NOTES ARTIF INT, V5509, P38 Boersma P., 2005, PRAAT DOING PHONETIC BOLINGER D, 1983, AM SPEECH, V58, P156, DOI 10.2307/455326 Buring D., 2007, OXFORD HDB LINGUISTI, P445 Calhoun S, 2009, STUD PRAGMAT, V8, P53 Combettes Bernard, 1999, THEMATISATION LANGUE, P231 Creissels D., 2004, COURS SYNTAXE GEN, P1 DAHAN D, 1994, J PHYS IV, V4, P501, DOI 10.1051/jp4:19945106 De Ruiter J. P., 2000, LANGUAGE GESTURE, P284, DOI DOI 10.1017/CBO9780511620850.018 Dohen M., 2004, P 8 ICSLP, P1313 Dohen M., 2005, P INT 2005, P2413 Dohen M, 2004, SPEECH COMMUN, V44, P155, DOI 10.1016/j.specom.2004.10.009 Dooley R.A., 2000, ANAL DISCOURSE MANUA Ferre G., 2007, P INT 2007 ANTW BELG Ferre G., 2011, MULTIMODAL COMMUNICA, V1, P5 Ferre G., 2010, P LREC WORKSH MULT C, P86 Fery C., 2003, NOUVEAUX DEPARTS PHO, P161 Fery Caroline, 2001, AUDIATUR VOX SAPIENT, P153 Gregory ML, 2001, J PRAGMATICS, V33, P1665, DOI 10.1016/S0378-2166(00)00063-1 Halliday M. A. K., 1967, J LINGUIST, V3, P199, DOI DOI 10.1017/S0022226700016613 Herment-Dujardin S., 2002, SPEECH PROSODY 2002, P379 Katz J, 2011, LANGUAGE, V87, P771 Kipp M., 2001, P 7 EUR C SPEECH COM, P1367 Kita S, 2007, LANG COGNITIVE PROC, V22, P1212, DOI 10.1080/01690960701461426 Kita S, 2003, J MEM LANG, V48, P16, DOI 10.1016/S0749-596X(02)00505-3 Kohler K.J., 2006, P SPEECH PROS DRESD, P748 Krahmer E, 2007, J MEM LANG, V57, P396, DOI 10.1016/j.jml.2007.06.005 Krahmer E., 2002, P SPEECH PROS 2002 A, P443 Krahmer E, 2001, SPEECH COMMUN, V34, P391, DOI 10.1016/S0167-6393(00)00058-3 Krauss R. 
M., 2000, LANGUAGE GESTURE, P261, DOI DOI 10.1017/CBO9780511620850.017 Lacheret-Dujour A., 2003, B SOC LINGUISTIQUE P, VXIII, P137 Lambrecht K., 1994, INFORM STRUCTURE SEN McNeill D., 1992, HAND MIND WHAT GESTU McNeill D., 2005, GESTURE AND THOUGHT Prevost S., 2003, CAHIERS PRAXEMATIQUE, V40, P97 Rialland A., 2002, SPEECH PROSODY 2002, P595 SELKIRK E., 1978, NORDIC PROSODY, P111 Stark E., 1999, THEMATISATION LANGUE, P337 Swerts M, 2008, J PHONETICS, V36, P219, DOI 10.1016/j.wocn.2007.05.001 Wilmes K.A., 2009, THESIS U OSNABRUCK O NR 45 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2014 VL 57 BP 268 EP 282 DI 10.1016/j.specom.2013.06.002 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 268OJ UT WOS:000328180100019 ER PT J AU Rusiewicz, HL Shaiman, S Iverson, JM Szuminsky, N AF Rusiewicz, Heather Leavy Shaiman, Susan Iverson, Jana M. Szuminsky, Neil TI Effects of perturbation and prosody on the coordination of speech and gesture SO SPEECH COMMUNICATION LA English DT Article DE Gesture; Perturbation; Prosody; Entrainment; Speech-hand coordination; Capacitance ID DELAYED AUDITORY-FEEDBACK; FINGER MOVEMENTS; LEXICAL STRESS; BODY MOVEMENT; RESPONSES; HAND; JAW; TASK; TIME; SUSCEPTIBILITY AB The temporal alignment of speech and gesture is widely acknowledged as primary evidence of the integration of spoken language and gesture systems. Yet there is a disconnect between the lack of experimental research on the variables that affect the temporal relationship of speech and gesture and the overwhelming acceptance that speech and gesture are temporally coordinated. Furthermore, the mechanism of the temporal coordination of speech and gesture is poorly represented. Recent experimental research suggests that gestures overlap prosodically prominent points in the speech stream, though the effects of other variables such as perturbation of speech are not yet studied in a controlled paradigm. The purpose of the present investigation was to further investigate the mechanism of this interaction according to a dynamic systems framework. Fifteen typical young adults completed a task that elicited the production of contrastive prosodic stress on different syllable positions with and without delayed auditory feedback while pointing to corresponding pictures. The coordination of deictic gestures and spoken language was examined as a function of perturbation, prosody, and position of the target syllable. Results indicated that the temporal parameters of gesture were affected by all three variables. The findings suggest that speech and gesture may be coordinated due to internal pulse-based temporal entrainment of the two motor systems. (C) 2013 Elsevier B.V. All rights reserved. C1 [Rusiewicz, Heather Leavy] Duquesne Univ, Dept Speech Language Pathol, Pittsburgh, PA 15282 USA. [Shaiman, Susan; Szuminsky, Neil] Univ Pittsburgh, Dept Commun Sci & Disorders, Pittsburgh, PA 15260 USA. [Iverson, Jana M.] Univ Pittsburgh, Dept Psychol, Pittsburgh, PA 15260 USA. RP Rusiewicz, HL (reprint author), Duquesne Univ, Dept Speech Language Pathol, 412 Fisher Hall,600 Forbes Ave, Pittsburgh, PA 15282 USA. EM rusiewih@duq.edu FU University of Pittsburgh School of Rehabilitation and Health Sciences Development Fund FX This project was supported by the University of Pittsburgh School of Rehabilitation and Health Sciences Development Fund. 
We are grateful for the time of the participants and notable contributions to all stages of this work by Christine Dollaghan, Thomas Campbell, J. Scott Yaruss, and Diane Williams. We also thank Jordana Birnbaum, Megan Pelletierre, and Alyssa Milloy for assistance in data analysis and entry and Megan Murray for assistance in the construction of this manuscript. Portions of this work were presented at the 2009 and 2010 Annual Convention of the American-Speech-Language-Hearing Association and the 2011 conference on Gesture and Speech in Interaction (GESPIN). CR ABBS JH, 1984, J NEUROPHYSIOL, V51, P705 ABBS JH, 1976, J SPEECH HEAR RES, V19, P19 ABBS J H, 1982, Journal of the Acoustical Society of America, V71, pS33, DOI 10.1121/1.2019343 Barbosa P.A., 2002, SPEECH PROS AIX EN P Bard C, 1999, EXP BRAIN RES, V125, P410, DOI 10.1007/s002210050697 Birdwhistell R. L., 1970, KINESICS CONTEXT ESS Birdwhistell Ray L., 1952, INTRO KINESICS ANNOT Bluedorn A.C., 2002, HUMAN ORG TIME TEMPO BULL P, 1985, J NONVERBAL BEHAV, V9, P169, DOI 10.1007/BF01000738 BURKE BD, 1975, J COMMUN DISORD, V8, P75, DOI 10.1016/0021-9924(75)90028-3 Cutler A., 1987, Computer Speech and Language, V2, DOI 10.1016/0885-2308(87)90004-0 CHANG P, 1987, J MOTOR BEHAV, V19, P265 Corriveau KH, 2009, CORTEX, V45, P119, DOI 10.1016/j.cortex.2007.09.008 Cummins F, 1998, J PHONETICS, V26, P145, DOI 10.1006/jpho.1998.0070 De Ruiter J. P., 2000, LANGUAGE GESTURE, P284, DOI DOI 10.1017/CBO9780511620850.018 De Rutter J. P., 1998, THESIS KATHOLIEKE U DITTMANN AT, 1969, J PERS SOC PSYCHOL, V11, P98, DOI 10.1037/h0027035 ELMAN JL, 1983, J SPEECH HEAR RES, V26, P106 Esteve-Gibert N., 2011, P 2 GEST SPEECH INT FOLKINS JW, 1975, J SPEECH HEAR RES, V18, P207 FOLKINS JW, 1981, J ACOUST SOC AM, V69, P1441, DOI 10.1121/1.385828 Franks IM, 1998, CAN J EXP PSYCHOL, V52, P93, DOI 10.1037/h0087284 FRANZ EA, 1992, J MOTOR BEHAV, V24, P281 Gentilucci M, 2004, EUR J NEUROSCI, V19, P190, DOI 10.1111/j.1460-9568.2004.03104.x Gentilucci M, 2001, J NEUROPHYSIOL, V86, P1685 Goffman L, 1999, J SPEECH LANG HEAR R, V42, P1003 Goldin-Meadow S, 1999, TRENDS COGN SCI, V3, P419, DOI 10.1016/S1364-6613(99)01397-2 Gracco V.L., 1986, EXP BRAIN RES, V65, P155 GRACCO VL, 1985, J NEUROPHYSIOL, V54, P418 Guenther FH, 1998, PSYCHOL REV, V105, P611 HAKEN H, 1985, BIOL CYBERN, V51, P347, DOI 10.1007/BF00336922 HISCOCK M, 1986, NEUROPSYCHOLOGIA, V24, P691, DOI 10.1016/0028-3932(86)90008-4 Howell P, 2002, J ACOUST SOC AM, V111, P2842, DOI 10.1121/1.1474444 HOWELL P, 1984, PERCEPT PSYCHOPHYS, V36, P296, DOI 10.3758/BF03206371 Iverson J. M., 1999, J CONSCIOUSNESS STUD, V6, P19 JONES MR, 1989, PSYCHOL REV, V96, P459, DOI 10.1037//0033-295X.96.3.459 Kelso J.A., 1984, J EXPER PSYCH HUM PE, V6, P812 Kelso J.A., 1984, AM J PHYSIO REGULAT, V246, pR935 KELSO JAS, 1983, J SPEECH HEAR RES, V26, P217 Kelso J.A.S., 1981, PRODUCTION SPEECH, P137 Kendon A., 1980, RELATIONSHIP VERBAL, P207 Kendon Adam, 1972, STUDIES DYADIC COMMU, P177 KOMILIS E, 1993, J MOTOR BEHAV, V25, P299 Krahmer E, 2007, J MEM LANG, V57, P396, DOI 10.1016/j.jml.2007.06.005 Krauss R. 
M., 2000, LANGUAGE GESTURE, P261, DOI DOI 10.1017/CBO9780511620850.017 Large EW, 1999, PSYCHOL REV, V106, P119, DOI 10.1037/0033-295X.106.1.119 Leonard T, 2011, LANG COGNITIVE PROC, V26, P1457, DOI 10.1080/01690965.2010.500218 LEVELT WJM, 1985, J MEM LANG, V24, P133, DOI 10.1016/0749-596X(85)90021-X Liberman AM, 2000, TRENDS COGN SCI, V4, P187, DOI 10.1016/S1364-6613(00)01471-6 Loehr D., 2007, GESTURE, V7, P179, DOI [10.1075/gest.7.2.04loe, DOI 10.1075/GEST.7.2.04LOE] Loehr D.P., 2004, DISS ABSTR INT A, V65, P2180 Lorenz S, 2007, J COMPUT CHEM, V28, P1384, DOI 10.1002/jcc.20674 MARSLENWILSON WD, 1981, PHILOS T ROY SOC B, V295, P317, DOI 10.1098/rstb.1981.0143 Mayberry R., 2000, LANGUAGE GESTURE, P199, DOI 10.1017/CBO9780511620850.013 Mayberry R. I., 1998, NEW DIR CHILD ADOLES, V79, P77 McClave E, 1998, J PSYCHOLINGUIST RES, V27, P69, DOI 10.1023/A:1023274823974 MCCLAVE E, 1994, J PSYCHOLINGUIST RES, V23, P45, DOI 10.1007/BF02143175 McNeill D., 1992, HAND AND MIND Merker B., 2009, CORTEX, V45, P1 Nobe S., 1996, DISS ABSTR INT A, V57, P4736 O'Dell M., 1999, P 14 INT C PHON SCI, V2, P1075 CONDON WS, 1966, J NERV MENT DIS, V143, P338, DOI 10.1097/00005053-196610000-00005 PAULIGNAN Y, 1991, EXP BRAIN RES, V87, P407 Peter B, 2011, TOP LANG DISORD, V31, P145, DOI 10.1097/TLD.0b013e318217b855 Port R., 1998, NONLINEAR ANAL DEV P, P5 Port RF, 2003, J PHONETICS, V31, P599, DOI 10.1016/j.wocn.2003.08.001 PRABLANC C, 1992, J NEUROPHYSIOL, V67, P455 Ringel R.L., 1963, J SPEECH HEAR RES, V13, P369 Rochet-Capellan A, 2008, J SPEECH LANG HEAR R, V51, P1507, DOI 10.1044/1092-4388(2008/07-0173) Rusiewicz HL, 2013, J SPEECH LANG HEAR R, V56, P458, DOI 10.1044/1092-4388(2012/11-0283) Rusiewicz H.L., 2011, P 2 GEST SPEECH INT Sager Rebecca, 2004, ESEM COUNTERPOINT, V1, P1 Saltzman E, 2000, HUM MOVEMENT SCI, V19, P499, DOI 10.1016/S0167-9457(00)00030-0 Saltzman E.L., 1989, ECOLOG PSYCH, V1, P330 Schmidt RA, 1999, MOTOR CONTROL LEARNI Shaiman S, 2002, EXP BRAIN RES, V146, P411, DOI 10.1007/s00221-002-1195-5 SHAIMAN S, 1989, J ACOUST SOC AM, V86, P78, DOI 10.1121/1.398223 Shriberg LD, 2003, CLIN LINGUIST PHONET, V17, P549, DOI 10.1080/0269920031000138123 SMITH A, 1992, CRIT REV ORAL BIOL M, V3, P233 SMITH A, 1986, J SPEECH HEAR RES, V29, P471 Stuart A, 2002, J ACOUST SOC AM, V111, P2237, DOI 10.1121/1.1466868 Thelen E., 2002, DYNAMIC SYSTEM APPRO Tuite K., 1993, SEMIOTICA, V93, P8 Tuller B., 1982, J ACOUST SOCI AM, V72, pS103, DOI 10.1121/1.2019693 TULLER B, 1982, J ACOUST SOC AM, V71, P1534, DOI 10.1121/1.387807 TULLER B, 1983, J EXP PSYCHOL HUMAN, V9, P829, DOI 10.1037/0096-1523.9.5.829 TURVEY MT, 1990, AM PSYCHOL, V45, P938, DOI 10.1037/0003-066X.45.8.938 van Lieshout P., 2004, SPEECH MOTOR CONTROL, P51 van Kuijk D, 1999, SPEECH COMMUN, V27, P95, DOI 10.1016/S0167-6393(98)00069-7 von Hoist E., 1973, COLLECTED PAPERS E V Wlodarczak M., 2012, 6 INT C SHANGH CHIN Yasinnik Y., 2004, M SOUND SENS 50 YEAR NR 92 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD FEB PY 2014 VL 57 BP 283 EP 300 DI 10.1016/j.specom.2013.06.004 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 268OJ UT WOS:000328180100020 ER PT J AU Esteve-Gibert, N Prieto, P AF Esteve-Gibert, Nuria Prieto, Pilar TI Infants temporally coordinate gesture-speech combinations before they produce their first words SO SPEECH COMMUNICATION LA English DT Article DE Early gestures; Early acquisition of multimodality; Early gesture-speech temporal coordination ID LANGUAGE-DEVELOPMENT; YOUNG-CHILDREN; BEHAVIOR; INTENTIONALITY; PERCEPTION; SYSTEM; FOCUS AB This study explores the patterns of gesture and speech combinations from the babbling period to the one-word stage and the temporal alignment between the two modalities. The communicative acts of four Catalan children at 0;11, 1;1, 1;3, 1;5, and 1;7 were gesturally and acoustically analyzed. Results from the analysis of a total of 4,507 communicative acts extracted from approximately 24 h of at-home recordings showed that (1) from the early single-word period onwards gesture starts being produced mainly in combination with speech rather than as a gesture-only act; (2) in these early gesture-speech combinations most of the gestures are deictic gestures (pointing and reaching gestures) with a declarative communicative purpose; and (3) there is evidence of temporal coordination between gesture and speech already at the babbling stage because gestures start before the vocalizations associated with them, the stroke onset coincides with the onset of the prominent syllable in speech, and the gesture apex is produced before the end of the accented syllable. These results suggest that during the transition between the babbling stage and single-word period infants start combining deictic gestures and speech and, when combined, the two modalities are temporally coordinated. (C) 2013 Elsevier B.V. All rights reserved. C1 [Esteve-Gibert, Nuria; Prieto, Pilar] Univ Pompeu Fabra, Dpt Translat & Language Sci, Barcelona, Spain. RP Esteve-Gibert, N (reprint author), Roc Boronat 138, Barcelona 08018, Spain. EM nuria.esteve@upf.edu RI Esteve-Gibert, Nuria/D-5342-2014 OI Esteve-Gibert, Nuria/0000-0003-2408-5849 FU Spanish Ministry of Science and Innovation [FFI2009-07648/FILO, FFI2012-31995]; Generalitat de Catalunya [2009SGR-701]; Obra Social 'La Caixa'; Consolider-Ingenio grant [CSD2007-00012] FX This research has been funded by two research grants awarded by the Spanish Ministry of Science and Innovation (FFI2009-07648/FILO "The role of tonal scaling and tonal alignment in distinguishing intonational categories in Catalan and Spanish", and FFI2012-31995 "Gestures, prosody and linguistic structure"), by a grant awarded by the Generalitat de Catalunya (2009SGR-701) to the Grup d'Estudis de Prosodia, by the grant RECERCAIXA 2012 for the project "Els precursors del llenguatge. Una guia TIC per a pares i educadors" awarded by Obra Social 'La Caixa', and by the Consolider-Ingenio 2010 (CSD2007-00012) grant.
CR Astruc L., 2013, LANG SPEECH, V56, P78 Baayen RH, 2008, J MEM LANG, V59, P390, DOI 10.1016/j.jml.2007.12.005 Bangerter A, 2004, PSYCHOL SCI, V15, P415, DOI 10.1111/j.0956-7976.2004.00694.x BATES E, 1975, MERRILL PALMER QUART, V21, P205 Bergmann K., 2011, P 2 WORKSH GEST SPEE BLAKE J, 1994, INFANT BEHAV DEV, V17, P195, DOI 10.1016/0163-6383(94)90055-8 Blake J., 2005, GESTURE, V5, P201, DOI 10.1075/gest.5.1.14bla Boersma Paul, 2012, PRAAT DOING PHONETIC Bonsdroff L., 2005, 18 SWED PHON C DEP L, P59 Butcher C., 2000, LANGUAGE GESTURE, P235, DOI DOI 10.1017/CBO9780511620850.015 Butterworth B., 1978, RECENT ADV PSYCHOL L, P347 Camaioni L, 2004, INFANCY, V5, P291, DOI 10.1207/s15327078in0503_3 Capone NC, 2004, J SPEECH LANG HEAR R, V47, P173, DOI [10.1044/1092-4388(2004/015), 10.1044/1092-4388(2004/15)] Cochet H, 2010, GESTURE, V10, P129, DOI 10.1075/gest.10.2-3.02coc Colonnesi C, 2010, DEV REV, V30, P352, DOI 10.1016/j.dr.2010.10.001 De Ruiter J. P., 2000, LANGUAGE GESTURE, P284, DOI DOI 10.1017/CBO9780511620850.018 De Ruiter J.-P., 2006, ADV SPEECH LANGUAGE, V8, P124, DOI DOI 10.1080/14417040600667285 De Rutter J. P., 1998, THESIS KATHOLIEKE U Ejiri K, 2001, DEVELOPMENTAL SCI, V4, P40, DOI 10.1111/1467-7687.00147 Ekman P., 1969, SEMIOTICA, V1, P49 Engstrand O., 2004, 17 SWED PHON C STOCK, P64 Esteve-Gibert N., 2013, J SPEECH LANG HEAR R Esteve-Gibert N., 2012, ESTEVE PRIETO CATALA Feldman R, 1996, INFANT BEHAV DEV, V19, P483, DOI 10.1016/S0163-6383(96)90008-9 Ferre G., 2010, P LREC WORKSH MULT C, P86 Frota S., 2008, P 11 INT C STUD CHIL Iverson J. M., 1999, J CONSCIOUSNESS STUD, V6, P19 Iverson JM, 2000, J NONVERBAL BEHAV, V24, P105, DOI 10.1023/A:1006605912965 Iverson JM, 2005, PSYCHOL SCI, V16, P367, DOI 10.1111/j.0956-7976.2005.01542.x Iverson JM, 2004, CHILD DEV, V75, P1053, DOI 10.1111/j.1467-8624.2004.00725.x Kendon A., 1980, RELATIONSHIP VERBAL, P207 Kendon A., 2004, GESTURE VISIBLE ACTI Kita S., 2000, LANGUAGE CULTURE COG, P162 Krahmer E, 2007, J MEM LANG, V57, P396, DOI 10.1016/j.jml.2007.06.005 Krauss R. 
M., 2000, LANGUAGE GESTURE, P261, DOI DOI 10.1017/CBO9780511620850.017 Lausberg H, 2009, BEHAV RES METHODS, V41, P841, DOI 10.3758/BRM.41.3.841 Leonard T., 2010, LANG COGNITIVE PROC, V26, P1295 LEUNG EHL, 1981, DEV PSYCHOL, V17, P215, DOI 10.1037//0012-1649.17.2.215 LEVELT WJM, 1985, J MEM LANG, V24, P133, DOI 10.1016/0749-596X(85)90021-X Liszkowski U., 2007, GESTURAL COMMUNICATI, P124 Loehr D., 2007, GESTURE, V7, P179, DOI [10.1075/gest.7.2.04loe, DOI 10.1075/GEST.7.2.04LOE] Masataka N, 2003, POINTING: WHERE LANGAUAGE, CULTURE, AND COGNITON MEET, P69 McNeill D., 1992, HAND MIND WHAT GESTU Melinger A., 2004, GESTURE, V4, P119, DOI DOI 10.1075/GEST.4.2.02MEL Nobe S., 1996, THESIS U CHICAGO CHI Oller D.K., 1976, J CHILD LANG, V3, P1 Olson S.L., 1982, INFANT BEHAV DEV, V12, P77 Ozcaliskan S., 2005, COGNITION, V96, P101, DOI [10.1016/j.cognition.2005.01.001, DOI 10.1016/J.C0GNITI0N.2005.01.001] Papaeliou CF, 2006, J CHILD LANG, V33, P163, DOI 10.1017/S0305000905007300 Prieto P., 2013, INTONATIONAL VARATIO Prieto P., 2006, LANG SPEECH, V49, P233 Rochat P, 2007, ACTA PSYCHOL, V124, P8, DOI 10.1016/j.actpsy.2006.09.004 Rochet-Capellan A, 2008, J SPEECH LANG HEAR R, V51, P1507, DOI 10.1044/1092-4388(2008/07-0173) Roustan B., 2010, P SPEECH PROS CHIC Rusiewicz H.L., 2010, THESIS U PITTSBURGH So WC, 2010, APPL PSYCHOLINGUIST, V31, P209, DOI 10.1017/S0142716409990221 Sperry LA, 2003, J AUTISM DEV DISORD, V33, P281, DOI 10.1023/A:1024454517263 Tomasello M, 2007, CHILD DEV, V78, P705, DOI 10.1111/j.1467-8624.2007.01025.x Vanrell M.M., 2011, ANEJO QUADERNS FILOL, P71 DEBOYSSONBARDIES B, 1991, LANGUAGE, V67, P297, DOI 10.2307/415108 VIHMAN MM, 1985, LANGUAGE, V61, P397, DOI 10.2307/414151 Vihman MM, 2009, CAMB HB LANG LINGUIS, P163 West B. T., 2007, LINEAR MIXED MODELS NR 63 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2014 VL 57 BP 301 EP 316 DI 10.1016/j.specom.2013.06.006 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 268OJ UT WOS:000328180100021 ER PT J AU Kim, J Cvejic, E Davis, C AF Kim, Jeesun Cvejic, Erin Davis, Chris TI Tracking eyebrows and head gestures associated with spoken prosody SO SPEECH COMMUNICATION LA English DT Article DE Visual prosody; Eyebrow movements; Focus; Sentence modality; Guided principal component analysis ID SPEECH ACOUSTICS; FACE; STRESS; PROMINENCE; KINEMATICS; ENGLISH; MOTION; VOICE AB Although it is clear that eyebrow and head movements are in some way associated with spoken prosody, the precise form of this association is unclear. To examine this, eyebrow and head movements were recorded from six talkers producing 30 sentences (with two repetitions) in three prosodic conditions (Broad focus, Narrow focus and Echoic question) in a face to face dialogue exchange task. Movement displacement and peak velocity were measured for the prosodically marked constituents (critical region) as well as for the preceding and following regions. The amount of eyebrow movement in the Narrow focus and Echoic question conditions tended to be larger at the beginning of an utterance (in the pre-critical and critical regions) than at the end (in the post-critical region). Head rotation (nodding) tended to occur later, being maximal in the critical region and still occurring often in the post-critical one. For eyebrow movements, peak velocity tended to distinguish the regions better than the displacement measure. 
The extent to which eyebrow and head movements co-occurred was also examined. Compared to the broad focussed condition, both movement types occurred more often in the narrow focussed and echoic question ones. When these double movements occurred in narrow focused utterances, brow raises tended to begin before the onset of the critical constituent and reach a peak displacement at the time of the critical constituent, whereas rigid pitch movements tended to begin at the time of critical constituent and reach peak displacement after this region. The pattern for echoic questions was similar for eyebrow motion; however, head rotations tended to begin earlier compared to the narrow focus condition. These results are discussed in terms of the differences these types of visual cues may have in production and perception. (C) 2013 Elsevier B.V. All rights reserved. C1 [Kim, Jeesun; Cvejic, Erin; Davis, Chris] Univ Western Sydney, MARCS Inst, Penrith, NSW 2751, Australia. [Cvejic, Erin] Univ New S Wales, Sch Psychiat, Sydney, NSW 2052, Australia. RP Davis, C (reprint author), Univ Western Sydney, MARCS Inst, Locked Bag 1797, Penrith, NSW 2751, Australia. EM chris.davis@uws.edu.au FU Australian Research Council [DP120104298] FX We thank Virginie Attina and Guillaume Gibert for their contribution of Matlab scripts used for the gPCA analysis and all of the speakers. We also thank two anonymous reviewers for helpful comments. Some of the results reported here were from the second author's PhD thesis. We acknowledge support from Australian Research Council (DP120104298). CR Badin P, 2002, J PHONETICS, V30, P533, DOI 10.1006/jpho.2002.0166 Bailly G, 2009, EURASIP J AUDIO SPEE, DOI 10.1155/2009/769494 Beautemps D, 2001, J ACOUST SOC AM, V109, P2165, DOI 10.1121/1.1361090 Boersma P., 2001, GLOT INT, V5, P341 Brooks V.B., 1998, J MOTOR BEHAV, V2, P117 Cave C, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P2175 Cvejic E, 2012, COGNITION, V122, P442, DOI 10.1016/j.cognition.2011.11.013 Cvejic E, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 1-2, P1433 Cvejic E, 2012, J ACOUST SOC AM, V131, P1011, DOI 10.1121/1.3676605 DEJONG KJ, 1995, J ACOUST SOC AM, V97, P491, DOI 10.1121/1.412275 Dohen M., 2009, VISUAL SPEECH RECOGN, P416 Dohen M, 2009, LANG SPEECH, V52, P177, DOI 10.1177/0023830909103166 EDWARDS J, 1991, J ACOUST SOC AM, V89, P369, DOI 10.1121/1.400674 Fagel S., 2008, P INT C AUD VIS SPEE, P59 Fitzpatrick M., 2011, P INT 2011, P2829 Flecha-Garcia ML, 2010, SPEECH COMMUN, V52, P542, DOI 10.1016/j.specom.2009.12.003 Guaitella I, 2009, LANG SPEECH, V52, P207, DOI 10.1177/0023830909103167 HADAR U, 1983, LANG SPEECH, V26, P117 HORN BKP, 1987, J OPT SOC AM A, V4, P629, DOI 10.1364/JOSAA.4.000629 House D., 2001, P EUR 2001, P387 IEEE, 1969, IEEE T AUDIO ELECTRO, VAE-17, P227 Kendon A., 2004, GESTURE VISIBLE ACTI Kim J, 2011, PERCEPTION, V40, P853, DOI 10.1068/p6941 Lucero JC, 2005, J ACOUST SOC AM, V118, P405, DOI 10.1121/1.1928807 MAEDA S, 2005, ZAS PAPERS LINGUISTI, V40, P95 Nam H, 2012, J ACOUST SOC AM, V132, P3980, DOI 10.1121/1.4763545 Schroder M., 2003, International Journal of Speech Technology, V6, DOI 10.1023/A:1025708916924 Schwartz JL, 2004, COGNITION, V93, pB69, DOI 10.1016/j.cognition.2004.01.006 Srinivasan RJ, 2003, LANG SPEECH, V46, P1 Swerts M, 2010, J PHONETICS, V38, P197, DOI 10.1016/j.wocn.2009.10.002 Swerts M, 2008, J PHONETICS, V36, P219, DOI 10.1016/j.wocn.2007.05.001
VANSUMMERS W, 1987, J ACOUST SOC AM, V82, P847, DOI 10.1121/1.395284 WING AM, 1984, PSYCHOL RES-PSYCH FO, V46, P121, DOI 10.1007/BF00308597 Yehia H, 1998, SPEECH COMMUN, V26, P23, DOI 10.1016/S0167-6393(98)00048-X Yehia HC, 2002, J PHONETICS, V30, P555, DOI 10.1006/jpho.2002.0165 NR 35 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2014 VL 57 BP 317 EP 330 DI 10.1016/j.specom.2013.06.003 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 268OJ UT WOS:000328180100022 ER PT J AU Fernandez-Baena, A Montano, R Antonijoan, M Roversi, A Miralles, D Alias, F AF Fernandez-Baena, Adso Montano, Raul Antonijoan, Marc Roversi, Arturo Miralles, David Alias, Francesc TI Gesture synthesis adapted to speech emphasis SO SPEECH COMMUNICATION LA English DT Article DE Human computer interaction; Body language; Speech analysis; Speech emphasis; Character animation; Motion capture; Motion graphs ID ANIMATION; PROMINENCE; CHARACTERS; EXPRESSION; PROSODY; BEAT AB Avatars communicate through speech and gestures to appear realistic and to enhance interaction with humans. In this context, several works have analyzed the relationship between speech and gestures, while others have been focused on their synthesis, following different approaches. In this work, we address both goals by linking speech to gestures in terms of time and intensity, to then use this knowledge to drive a gesture synthesizer from a manually annotated speech signal. To that effect, we define strength indicators for speech and motion. After validating them through perceptual tests, we obtain an intensity rule from their correlation. Moreover, we derive a synchrony rule to determine temporal correspondences between speech and gestures. These analyses have been conducted on aggressive and neutral performances to cover a broad range of emphatic levels, whose speech signal and motion have been manually annotated. Next, intensity and synchrony rules are used to drive a gesture synthesizer called gesture motion graph (GMG). These rules are validated by users from GMG output animations through perceptual tests. Results show that animations using intensity and synchrony rules perform better than those only using the synchrony rule (which in turn enhance realism with respect to random animation). Finally, we conclude that the extracted rules allow GMG to properly synthesize gestures adapted to speech emphasis from annotated speech. (C) 2013 Elsevier B.V. All rights reserved. C1 [Fernandez-Baena, Adso; Montano, Raul; Antonijoan, Marc; Roversi, Arturo; Miralles, David; Alias, Francesc] La Salle Univ Ramon Llull, GTM, Barcelona 08022, Spain. RP Fernandez-Baena, A (reprint author), La Salle Univ Ramon Llull, GTM, Barcelona 08022, Spain. EM adso@salle.url.edu RI Alias, Francesc/L-1088-2014; Antonijoan, Marc/L-5249-2014 OI Alias, Francesc/0000-0002-1921-2375; FU CENIT program [CEN-20101019]; Ministry of Science and Innovation of Spain FX We thank Eduard Ruesga and Meritxell Aragones for their work on the acquisition and processing of motion capture data. We also want to thank Dani Arguedas for his labor as actor. We acknowledge Anna Fuste for assisting in programming. This work was supported by the CENIT program number CEN-20101019, Granted by the Ministry of Science and Innovation of Spain. 
CR Abete G., 2010, P SPEECH PROS 2010 S Antonijoan M., 2012, THESIS R LLULL U LA Arikan O, 2002, ACM T GRAPHIC, V21, P483 Beckman Mary, 2002, PROBUS, V14, P9, DOI 10.1515/prbs.2002.008 Boersma P., 2001, GLOT INT, V5, P341 Bolinger D., 1986, INTONATION ITS PARTS Bulut M, 2007, INT CONF ACOUST SPEE, P1237 Cassell J., 1994, P SIGGRAPH 94, P413, DOI 10.1145/192161.192272 Cassell J, 2001, COMP GRAPH, P477 Chiu C.C., 2011, P 10 INT C INT VIRT, P127 Condon W.S., 1976, ANAL BEHAV ORG, P285 Escudero-Mancebo D., 2010, P INT C SPEECH PROS Fernandez-Baena A., 2012, FAST RESPONSE QUICK, P77 Goldman J.-P., 2011, INTERSPEECH 2011, P3233 Hall M., 2009, SIGKDD EXPLORATIONS, V11, DOI DOI 10.1145/1656274.1656278 ITU-T, 1996, P 800 METH SUBJ DET Kalinli O, 2009, IEEE T AUDIO SPEECH, V17, P1009, DOI 10.1109/TASL.2009.2014795 Kendon Adam, 1980, NONVERBAL COMMUNICAT, P207 Kipp M, 2007, LECT NOTES ARTIF INT, V4722, P15 Kipp M, 2004, THESIS BOCA RATON Kipp M., 2001, P 7 EUR C SPEECH COM, P1367 Kipp M., 2009, P INT C AFF COMP INT KOPP S., 2008, INT J SEMANTIC COMPU, V2, P115, DOI 10.1142/S1793351X08000361 Kovar L, 2002, ACM T GRAPHIC, V21, P473 Lee JH, 2002, ACM T GRAPHIC, V21, P491 Leonard T, 2011, LANG COGNITIVE PROC, V26, P1457, DOI 10.1080/01690965.2010.500218 Levine S., 2010, ACM T GRAPH, V29 Levine S., 2009, ACM T GRAPH, V28 Llull L.S.U.R., 2012, MEDIALAB MOTION CAPT Loehr DP, 2004, THESIS GEORGETOWN U Luo PC, 2009, LECT NOTES ARTIF INT, V5773, P405 MATTHEWS BW, 1975, BIOCHIM BIOPHYS ACTA, V405, P442, DOI 10.1016/0005-2795(75)90109-9 McNeill D., 1992, HAND MIND WHAT GESTU McNeill D., 2008, PHOENIX POETS SERIES MCNEILL D, 1985, PSYCHOL REV, V92, P350, DOI 10.1037//0033-295X.92.3.350 Neff M., 2008, ACM T GRAPH, V27, P5 Nobe Shuichi, 1996, REPRESENTATIONAL GES Onuma K., 2008, P EUROGRAPHICS Ortega-Llebaria M, 2011, LANG SPEECH, V54, P73, DOI 10.1177/0023830910388014 Ortiz-Lira H., 1999, ONOMAZEIN, V4, P429 Oshita M., 2011, P 4 INT C MOT GAM, P120 Pejsa T, 2010, COMPUT GRAPH FORUM, V29, P202, DOI 10.1111/j.1467-8659.2009.01591.x Pelachaud C., 2005, P 13 ANN ACM INT C M, P683, DOI 10.1145/1101149.1101301 Pierrehumbert Janet, 1980, THESIS MIT CAMBRIDGE Planet S., 2008, P 2 INT WORKSH EMOTI Platt J. 
C., 1999, FAST TRAINING SUPPOR, P185 Prieto Pilar, 2009, ESTUDIOS FONETICA EX, V18, P263 Qin Y., 2010, INT C SIGN PROC SYST Quintilian, 1920, I ORATORIA QUINTILIA Ren Z., 2011, 2011 8 INT C INF COM, P1 Renwick M., 2004, SOUND SENSE 50 YEARS, P97 Roekhaut S., 2010, 5 INT C SPEECH PROS RUSSELL JA, 1980, J PERS SOC PSYCHOL, V39, P1161, DOI 10.1037/h0077714 Shalev-Shwartz S., 2007, 24 INT C MACH LEARN, P807 Shattuck-Hufnagel S., 2007, NATO PUBLISHING SU E, V18 Silipo R., 1999, P 14 INT C PHON SCI, P2351 Silverman K., 1992, ICSLP ISCA Stone M, 2004, ACM T GRAPHIC, V23, P506, DOI 10.1145/1015706.1015753 Streefkerk B., 1999, EUROSPEECH Syrdal AK, 2001, SPEECH COMMUN, V33, P135, DOI 10.1016/S0167-6393(00)00073-X Tamburini F., 2007, INT 2007, P1809 TERKEN J, 1991, J ACOUST SOC AM, V89, P1768, DOI 10.1121/1.401019 Valbonesi L., 2002, MULTIMODAL SIGNAL AN Van Basten B.J.H., 2009, P 4 INT C FDN DIG GA, P199, DOI 10.1145/1536513.1536551 van der Sluis I, 2007, DISCOURSE PROCESS, V44, P145 VANRIJSBERGEN CJ, 1974, J DOC, V30, P365 van Welbergen H, 2010, COMPUT GRAPH FORUM, V29, P2530, DOI 10.1111/j.1467-8659.2010.01822.x Vilhjalmsson H., 2003, THESIS MIT Wachsmuth I., 1998, LECT NOTES ARTIF INT, V1317, P23 Wallbott HG, 1998, EUR J SOC PSYCHOL, V28, P879, DOI 10.1002/(SICI)1099-0992(1998110)28:6<879::AID-EJSP901>3.0.CO;2-W Wang J., 2008, ACM T GRAPH, V27, P1 Xu Jing, 2011, Instrument Techniques and Sensor NR 72 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2014 VL 57 BP 331 EP 350 DI 10.1016/j.specom.2013.06.005 PG 20 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 268OJ UT WOS:000328180100023 ER PT J AU Cucu, H Buzo, A Besacier, L Burileanu, C AF Cucu, Horia Buzo, Andi Besacier, Laurent Burileanu, Corneliu TI SMT-based ASR domain adaptation methods for under-resourced languages: Application to Romanian SO SPEECH COMMUNICATION LA English DT Article DE Under-resourced languages; Domain adaptation; Automatic speech recognition; Statistical machine translation; Language modeling AB This study investigates the possibility of using statistical machine translation to create domain-specific language resources. We propose a methodology that aims to create a domain-specific automatic speech recognition (ASR) system for a low-resourced language when in-domain text corpora are available only in a high-resourced language. Several translation scenarios (both unsupervised and semi-supervised) are used to obtain domain-specific textual data. Moreover this paper shows that a small amount of manually post-edited text is enough to develop other natural language processing systems that, in turn, can be used to automatically improve the machine translated text, leading to a significant boost in ASR performance. An in-depth analysis, to explain why and how the machine translated text improves the performance of the domain-specific ASR, is also made at the end of this paper. As bi-products of this core domain-adaptation methodology, this paper also presents the first large vocabulary continuous speech recognition system for Romanian, and introduces a diacritics restoration module to process the Romanian text corpora, as well as an automatic phonetization module needed to extend the Romanian pronunciation dictionary. (C) 2013 Elsevier B.V. All rights reserved. C1 [Cucu, Horia; Buzo, Andi; Burileanu, Corneliu] Univ Politehn Bucuresti, Bucharest, Romania. 
[Cucu, Horia; Besacier, Laurent] Univ Grenoble 1, LIG, Grenoble, France. RP Cucu, H (reprint author), Univ Politehn Bucuresti, Bucharest, Romania. EM horia.cucu@upb.ro; andi.buzo@upb.ro; laurent.besacier@imag.fr; cburileanu@messnet.pub.ro RI Buzo, Andi/B-1834-2013 OI Buzo, Andi/0000-0001-6545-5338 CR Abdillahi N, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P289 Berment V, 2004, THESIS U J FOURIER G Bertoldi N., 2009, PRAGUE B MATH LINGUI, P1 Bick E., MULTILINGUALITY INTE, P169 BILLA J, 2002, ACOUST SPEECH SIG PR, P5 Bisani M., 2003, EUR GEN SWITZ, P933 Bisani M., SPEECH COMMUNICATION, V50, P434 Bonaventura P., 1998, ACL Bonneau-Maynard H., 2005, INTERSPEECH, P3457 Burileanu D., 1999, ICPHS SAN FRANC US, V1, P503 Cai J., 2008, SLTU HAN VIETN Cucu H., 2011, U POLITEHNICA BUCH C, P179 Cucu H, 2012, EUR SIGNAL PR CONF, P1648 Cucu H., 2011, SPECOM KAZ RUSS, P81 Cucu H., 2011, THESIS POLITEHNICA U Cucu H., 2011, ASRU HAW US, P260 Domokos J., 2009, THESIS TU CLUJ NAPOC Domokos J., 2011, SPED BRAS ROM, P1 Draxler C., 2007, INTERSPEECH, P1509 Dumitru C.-O., 2008, ADV ROBOTICS AUTOMAT, P472 Gizaw S., 2008, SLTU HAN VIETN Jabaian B, 2011, INT CONF ACOUST SPEE, P5612 Jensson A., 2008, SLTU HAN VIETN Jitca D., 2003, SPED BUCH ROM, P43 Kabir A., 2011, 10 INT C SIGN PROC R, P323 Karanasou P, 2010, LECT NOTES ARTIF INT, V6233, P167, DOI 10.1007/978-3-642-14770-8_20 Koehn P., 2007, ACL PRAG CZECH REP Koehn P., 2005, MACHINE TRANSLATION, P79 Laurent A., 2009, INT 2009 BRIGHT UK, P708 Le V.B., 2003, EUROSPEECH 2003, P3117 Le V.B., IEEE T AUDIO SPEECH, V17, P1471 Lita L., 2003, ACL 2003 SAPP JAP, P152 Macoveiciuc M., MULTILINGUALITY INTE, P151 Mihajlik P., 2007, INTERSPEECH, P1497 Militaru D, 2009, FROM SPEECH PROCESSING TO SPOKEN LANGUAGE TECHNOLOGY, P21 Munteanu D.-P., 2006, THESIS TECHNICAL MIL Nakajima H., 2002, COLING 2002, V2, P716 Oancea E., 2004, COMMUNICATIONS, P221 Ordean M.A., 2009, SYNASC TIM ROM, P401 Papineni K., 2002, ACL, P311 Pellegrini T, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P285 Petrea C.S., 2010, ECIT IAS ROM, P13 Sam S, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 1-2, P254 Schultz T., MULTILINGUAL SPEECH Simard M., 2007, NAACL HLT ROCH US, P508 Stuker S., 2008, SLTU HAN VIETN Suenderman K., 2009, INT 2009 BRIGHT UK, P1475 Toma S.-A., 2009, COMPUTATION WORLD AT, P682 Tufi D., 2008, LREC MARR MOR Tufis D., 1999, INT WORKSH COMP LEX, P185 Ungurean Catalin, 2008, University "Politehnica" of Bucharest, Scientific Bulletin Series C: Electrical Engineering, V70 Ungurean C., 2011, SPED BRAS ROM, P1 Vlad A, 2007, LECT NOTES COMPUT SC, V4705, P409 NR 53 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD JAN PY 2014 VL 56 BP 195 EP 212 DI 10.1016/j.specom.2013.05.003 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 244AZ UT WOS:000326360800016 ER PT J AU Dufour, R Esteve, Y Deleglise, P AF Dufour, Richard Esteve, Yannick Deleglise, Paul TI Characterizing and detecting spontaneous speech: Application to speaker role recognition SO SPEECH COMMUNICATION LA English DT Article DE Spontaneous speech; Speaker role; Feature extraction; Speech classification; Automatic speech recognition; Role recognition ID IDENTIFICATION; DISFLUENCIES AB Processing spontaneous speech is one of the many challenges that automatic speech recognition systems have to deal with. The main characteristics of this kind of speech are disfluencies (filled pause, repetition, false start, etc.) and many studies have focused on their detection and correction. Spontaneous speech is defined in opposition to prepared speech, where utterances contain well-formed sentences close to those found in written documents. Acoustic and linguistic features made available by the use of an automatic speech recognition system are proposed to characterize and detect spontaneous speech segments from large audio databases. To better define this notion of spontaneous speech, segments of an 11-hour corpus (French Broadcast News) had been manually labeled according to three classes of spontaneity. Firstly, we present a study of these features. We then propose a two-level strategy to automatically assign a class of spontaneity to each speech segment. The proposed system reaches a 73.0% precision and a 73.5% recall on high spontaneous speech segments, and a 66.8% precision and a 69.6% recall on prepared speech segments. A quantitative study shows that the classes of spontaneity are useful information to characterize the speaker roles. This is confirmed by extending the speech spontaneity characterization approach to build an efficient automatic speaker role recognition system. (C) 2013 Elsevier B.V. All rights reserved. C1 [Dufour, Richard; Esteve, Yannick; Deleglise, Paul] Univ Le Mans, LIUM, Le Mans, France. RP Dufour, R (reprint author), Univ Avignon, LIA, Avignon, France. 
EM richard.dufour@univ-avignon.fr; yannick.esteve@lium.univ-lemans.fr; paul.deleglise@lium.univ-lemans.fr CR Amaral R., 2003, ISCA WORKSH MULT SPO, P31 Barzilay R., 2000, SEVENTEENTH NATIONAL, P679 Bazillon T., 2008, THE SIXTH INTERNATIO, P1067 Bigot B., 2010, INTERNATIONAL WORKSH, P5 Boula de Mareuil P., 2005, PROCEEDING OF THE WO Caelen-Haumont G., 2002, P 1 INT C SPEECH PRO, P195 COHEN J, 1960, EDUC PSYCHOL MEAS, V20, P37, DOI 10.1177/001316446002000104 Damnati G., 2011, INTERNATIONAL CONFER Deleglise P., 2009, CONFERENCE OF THE IN, P2123 DUEZ D, 1982, LANG SPEECH, V25, P11 Dufour R., 2010, ACM WORKSHOP ON SEAR Dufour R., 2011, CONFERENCE OF THE IN Dufour R., 2009, 13TH INTERNATIONAL C Dufour R., 2010, CONFERENCE OF THE IN Dufour R., 2009, AUTOMATIC SPEECH REC Esteve Y., 2010, LREC VALL MALT, P1686 Eugenio B.D., 2004, COMPUTATIONAL LINGUI, V30, P95, DOI 10.1162/089120104773633402 Galliano S., 2005, CONFERENCE OF THE IN Garg P.N., 2008, ACM MULTIMEDIA CONFE, P693 Goto M., 1999, SIXTH EUROPEAN CONFE, P227 Gravier G., 2012, INTERNATIONAL CONFER Hakkani-Tur D., 2007, INTERNATIONAL CONFER, V4, P1 Heeman PA, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P362 Jousse V, 2009, INT CONF ACOUST SPEE, P4557, DOI 10.1109/ICASSP.2009.4960644 Lease M, 2006, IEEE T AUDIO SPEECH, V14, P1566, DOI 10.1109/TASL.2006.878269 Liu Y, 2006, IEEE T AUDIO SPEECH, V14, P1526, DOI 10.1109/TASL.2006.878255 Liu Y., 2006, HUM LANG TECHN C NAA, P81 Liu Y., 2005, CONFERENCE OF THE IN, P3313 Luzzati D., 2004, WORKSHOP MODELISATIO, P13 MCDONOUGH J, 1994, INT CONF ACOUST SPEE, P385 Meignier S., 2010, CMU SPHINX USERS AND Mohri M, 2002, COMPUT SPEECH LANG, V16, P69, DOI 10.1006/csla.2001.0184 O'Shaughnessy D., 1993, INTERNATIONAL CONFER, V2, P724, DOI 10.1109/ICASSP.1993.319414 Peskin B., 1993, HLT 93 P WORKSH HUM, P119 Rousseau A., 2011, PROCEEDINGS OF IWSLT Salamin H, 2009, IEEE T MULTIMEDIA, V11, P1373, DOI 10.1109/TMM.2009.2030740 Schapire RE, 2000, MACH LEARN, V39, P135, DOI 10.1023/A:1007649029923 Schuller B., 2012, CONFERENCE OF THE IN Shriberg E., 1999, P INT C PHON SCI SAN, P619 Siu M, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P386 Vinciarelli A, 2007, IEEE T MULTIMEDIA, V9, P1215, DOI 10.1109/TMM.2007.902882 Yeh JF, 2006, IEEE T AUDIO SPEECH, V14, P1574, DOI 10.1109/TASL.2006.878267 NR 42 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2014 VL 56 BP 1 EP 18 DI 10.1016/j.specom.2013.07.007 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 244AZ UT WOS:000326360800001 ER PT J AU Poblete, V Yoma, NB Stern, RM AF Poblete, Victor Yoma, Nestor Becerra Stern, Richard M. TI Optimization of the parameters characterizing sigmoidal rate-level functions based on acoustic features SO SPEECH COMMUNICATION LA English DT Article DE Sigmoidal function; Auditory systems; Optimization; Acoustic features; Speech enhancement ID ROBUST SPEECH RECOGNITION; AUDITORY-NERVE FIBERS; DYNAMIC-RANGE ADAPTATION; CONTRAST GAIN-CONTROL; SPEAKER VERIFICATION; NOISY ENVIRONMENTS; NEURAL ADAPTATION; BASILAR-MEMBRANE; SCENE ANALYSIS; MODEL AB This paper describes the development of an optimal sigmoidal rate-level function that is a component of many models of the peripheral auditory system.
The optimization makes use of a set of criteria defined exclusively on the basis of physical attributes of the input sound that are inspired by physiological evidence. The criteria developed attempt to discriminate between a degraded speech signal and noise to preserve the maximum amount of information in the linear region of the sigmoidal curve, and to minimize the effects of distortion in the saturating regions. The performance of the proposed optimal sigmoidal function is validated by text-independent speaker-verification experiments with signals corrupted by additive noise at different SNRs. The experimental results suggest that the approach presented in combination with cepstral variance normalization can lead to relative reductions in equal error rate as great as 40% when compared with the use of baseline MFCC coefficients for some SNRs. (C) 2013 Elsevier B.V. All rights reserved. C1 [Poblete, Victor; Yoma, Nestor Becerra] Univ Chile, Speech Proc & Transmiss Lab, Santiago, Chile. [Stern, Richard M.] Carnegie Mellon Univ, Dept Elect & Comp Engn, Pittsburgh, PA 15213 USA. [Stern, Richard M.] Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA 15213 USA. [Poblete, Victor] Univ Austral Chile, Inst Acoust, Valdivia, Chile. RP Yoma, NB (reprint author), Univ Chile, Speech Proc & Transmiss Lab, Av Tupper 2007,POB 412-3, Santiago, Chile. EM nbecerra@ing.uchile.cl FU Conicyt-Chile [Fondecyt 1100195, ACT 1120] FX This research was funded by Conicyt-Chile under grants Fondecyt 1100195 and Team Research in Science and Technology ACT 1120. CR Ajmera PK, 2011, PATTERN RECOGN, V44, P2749, DOI 10.1016/j.patcog.2011.04.009 Allen J. B., 1985, IEEE ASSP Magazine, V2, DOI 10.1109/MASSP.1985.1163723 Barbour DL, 2011, NEUROSCI BIOBEHAV R, V35, P2064, DOI 10.1016/j.neubiorev.2011.04.009 Bures Z, 2010, EUR J NEUROSCI, V32, P155, DOI 10.1111/j.1460-9568.2010.07280.x Campbell J., 1994, YOHO SPEAKER VERIFIC Chiu YHB, 2012, IEEE T AUDIO SPEECH, V20, P900, DOI 10.1109/TASL.2011.2168209 Chiu YHB, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P1000 COHEN JR, 1989, J ACOUST SOC AM, V85, P2623, DOI 10.1121/1.397756 COSTALUPES JA, 1984, J NEUROPHYSIOL, V51, P1326 Darwin CJ, 2008, PHILOS T R SOC B, V363, P1011, DOI 10.1098/rstb.2007.2156 Dean I, 2008, J NEUROSCI, V28, P6430, DOI 10.1523/JNEUROSCI.0470-08.2008 Dean I, 2005, NAT NEUROSCI, V8, P1684, DOI 10.1038/nn1541 Dimitriadis D, 2011, IEEE T AUDIO SPEECH, V19, P1504, DOI 10.1109/TASL.2010.2092766 Gao F, 2009, PHYSIOL BEHAV, V97, P369, DOI 10.1016/j.physbeh.2009.03.004 Garcia-Lazaro JA, 2007, EUR J NEUROSCI, V26, P2359, DOI 10.1111/j.1460-9568.2007.05847.x Ghitza O., 1986, Computer Speech and Language, V1, DOI 10.1016/S0885-2308(86)80018-3 Ghitza O, 1994, IEEE T SPEECH AUDI P, V2, P115, DOI 10.1109/89.260357 Hanilci C, 2012, IEEE SIGNAL PROC LET, V19, P163, DOI 10.1109/LSP.2012.2184284 Hasan T, 2013, IEEE T AUDIO SPEECH, V21, P842, DOI 10.1109/TASL.2012.2226161 Hirsch H.G., 2000, INT WORKSH AUT SPEEC, P181 Jankowski C.R., 1992, P WORKSH SPEECH NAT, P453, DOI 10.3115/1075527.1075637 Kang SY, 2010, JARO-J ASSOC RES OTO, V11, P245, DOI 10.1007/s10162-009-0194-7 Kim C., 2006, P INT PITTSB PENNS, P1975 Kim C, 2012, 2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), P4101 Kim DS, 1999, IEEE T SPEECH AUDI P, V7, P55 Kinnunen T, 2010, SPEECH COMMUN, V52, P12, DOI 10.1016/j.specom.2009.08.009 Kinnunen T, 2012, IEEE T AUDIO SPEECH, V20, P1990, DOI
10.1109/TASL.2012.2191960 Li Q, 2011, IEEE T AUDIO SPEECH, V19, P1791, DOI 10.1109/TASL.2010.2101594 Li Q, 2010, INT CONF ACOUST SPEE, P4514, DOI 10.1109/ICASSP.2010.5495589 Lyon R., 1982, IEEE INT C AC SPEECH, V7, P1282, DOI 10.1109/ICASSP.1982.1171644 MAY BJ, 1992, J NEUROPHYSIOL, V68, P1589 Middlebrooks JC, 2004, J ACOUST SOC AM, V116, P452, DOI 10.1121/1.1760795 Miller CA, 2011, JARO-J ASSOC RES OTO, V12, P219, DOI 10.1007/s10162-010-0249-9 Ming J, 2007, IEEE T AUDIO SPEECH, V15, P1711, DOI 10.1109/TASL.2007.899278 Moore B.C.J., 2003, INTRO PSYCHOL HEARIN, P39 Nizami L, 2005, HEARING RES, V208, P26, DOI 10.1016/j.heares.2005.05.002 OHZAWA I, 1985, J NEUROPHYSIOL, V54, P651 Patterson R.D., 1992, AUDITORY PROCESSING, P67 Pfingst BE, 2011, HEARING RES, V281, P65, DOI 10.1016/j.heares.2011.05.002 Pickles J.O., 2008, INTRO PHYSL HEARING Rabinowitz NC, 2011, NEURON, V70, P1178, DOI 10.1016/j.neuron.2011.04.030 REYNOLDS DA, 1995, SPEECH COMMUN, V17, P91, DOI 10.1016/0167-6393(95)00009-D Reynolds DA, 2000, DIGIT SIGNAL PROCESS, V10, P19, DOI 10.1006/dspr.1999.0361 RHODE WS, 1993, HEARING RES, V66, P31, DOI 10.1016/0378-5955(93)90257-2 Robles L, 2001, PHYSIOL REV, V81, P1305 SACHS MB, 1974, J ACOUST SOC AM, V56, P1835, DOI 10.1121/1.1903521 Saeidi R, 2010, IEEE SIGNAL PROC LET, V17, P599, DOI 10.1109/LSP.2010.2048649 Schneider BA, 2011, ATTEN PERCEPT PSYCHO, V73, P1562, DOI 10.3758/s13414-011-0097-7 SENEFF S, 1988, J PHONETICS, V16, P55 SHAMMA S, 1988, J PHONETICS, V16, P77 SHAMMA SA, 1985, J ACOUST SOC AM, V78, P1612, DOI 10.1121/1.392799 Shao Y, 2008, INT CONF ACOUST SPEE, P1589 Shao Y, 2010, COMPUT SPEECH LANG, V24, P77, DOI 10.1016/j.csl.2008.03.004 Shao Y, 2007, INT CONF ACOUST SPEE, P277 Shin JW, 2008, IEEE SIGNAL PROC LET, V15, P257, DOI 10.1109/LSP.2008.917027 Slaney M., 1998, 1998010 INT RES CORP Stern R., 2012, TECHNIQUES NOISE ROB Stern RM, 2012, IEEE SIGNAL PROC MAG, V29, P34, DOI 10.1109/MSP.2012.2207989 Taberner AM, 2005, J NEUROPHYSIOL, V93, P557, DOI 10.1152/jn.00574.2004 Wang KS, 1994, IEEE T SPEECH AUDI P, V2, P421 Wang N, 2011, IEEE T AUDIO SPEECH, V19, P196, DOI 10.1109/TASL.2010.2045800 Watkins PV, 2011, CEREB CORTEX, V21, P178, DOI 10.1093/cercor/bhq079 Wen B, 2012, J NEUROPHYSIOL, V108, P69, DOI 10.1152/jn.00055.2012 Wen B, 2009, J NEUROSCI, V29, P13797, DOI 10.1523/JNEUROSCI.5610-08.2009 Werblin F, 1996, IEEE SPECTRUM, V33, P30, DOI 10.1109/6.490054 WINSLOW RL, 1987, J NEUROPHYSIOL, V57, P1002 Wu W, 2007, IEEE T AUDIO SPEECH, V15, P1893, DOI 10.1109/TASL.2007.899297 YATES GK, 1990, HEARING RES, V45, P203, DOI 10.1016/0378-5955(90)90121-5 Young ED, 2008, PHILOS T R SOC B, V363, P923, DOI 10.1098/rstb.2007.2151 Zilany MSA, 2010, J NEUROSCI, V30, P10380, DOI 10.1523/JNEUROSCI.0647-10.2010 NR 70 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD JAN PY 2014 VL 56 BP 19 EP 34 DI 10.1016/j.specom.2013.07.006 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 244AZ UT WOS:000326360800002 ER PT J AU Tisljar-Szabo, E Pleh, C AF Tisljar-Szabo, Eszter Pleh, Csaba TI Ascribing emotions depending on pause length in native and foreign language speech SO SPEECH COMMUNICATION LA English DT Article DE Emotion ascribing; Cross-linguistic; Speech pauses; Silent pause duration; Foreign language ID CROSS-CULTURAL DIFFERENCES; FACIAL EXPRESSIONS; VOCAL COMMUNICATION; MEXICAN CHILDREN; AMERICAN SAMPLE; SPEAKER AFFECT; RECOGNITION; PERCEPTION; DISTURBANCES; ANXIETY AB Although the relationship between emotions and speech is well documented, little is known about the role of speech pauses in emotion expression and emotion recognition. The present study investigated how speech pause length influences how listeners ascribe emotional states to the speaker. Emotionally neutral Hungarian speech samples were taken, and speech pauses were systematically manipulated to create five variants of all passages. Hungarian and Austrian participants rated the emotionality of these passages by indicating on a 1-6 point scale how angry, sad, disgusted, happy, surprised, scared, positive, and heated the speaker could have been. The data reveal that the length of silent pauses influences listeners in attributing emotional states to the speaker. Our findings argue that pauses play a relevant role in ascribing emotions and that this phenomenon might be partly independent of language. (C) 2013 Elsevier B.V. All rights reserved. C1 [Tisljar-Szabo, Eszter; Pleh, Csaba] Budapest Univ Technol & Econ, Dept Cognit Sci, H-1111 Budapest, Hungary. RP Tisljar-Szabo, E (reprint author), Univ Debrecen, Med & Hlth Sci Ctr, Dept Behav Sci, Nagyerdei krt 98,POB 45, H-4032 Debrecen, Hungary. EM jszaboeszter@gmail.com; pleh.csaba@ektf.hu FU Austria-Hungary Action Foundation FX We thank Florian Menz, professor at the Linguistic Institute, the University of Vienna for helping in providing conditions for the experiment. This work was supported by a scholarship from the Austria-Hungary Action Foundation to the first author. We thank Csaba Szabo, Agnes Lukacs, Dezso Nemeth, Karolina Janacsek, and Istvan Winkler for valuable feedback and suggestions. 
CR ALBAS DC, 1976, J CROSS CULT PSYCHOL, V7, P481, DOI 10.1177/002202217674009 Banse R, 1996, J PERS SOC PSYCHOL, V70, P614, DOI 10.1037/0022-3514.70.3.614 Banziger T, 2005, SPEECH COMMUN, V46, P252, DOI 10.1016/j.specom.2005.02.016 Beaupre MG, 2005, J CROSS CULT PSYCHOL, V36, P355, DOI 10.1177/0022022104273656 BEIER EG, 1972, J CONSULT CLIN PSYCH, V39, P166, DOI 10.1037/h0033170 BERGMANN G, 1988, Z EXP ANGEW PSYCHOL, V35, P167 Biehl M, 1997, J NONVERBAL BEHAV, V21, P3, DOI 10.1023/A:1024902500935 Breitenstein C., 1996, NEUROLOGIE REHABILIT, V2 Breitenstein C, 2001, COGNITION EMOTION, V15, P57, DOI 10.1080/0269993004200114 Burkhardt F., 2006, P SPEECH PROS 2006 D Burkhardt F., 2000, ISCA WORKSH ITRW SPE Cahn J.E., 1990, J AM VOICE I O SOC, V8, P1 CARLSON R, 1992, SPEECH COMMUN, V11, P159, DOI 10.1016/0167-6393(92)90010-5 Deppermann A., 2005, Z QUALITATIVE FORSCH, V1, P35 Ekman A, 1969, Lakartidningen, V66, P5021 EKMAN P, 1987, J PERS SOC PSYCHOL, V53, P712, DOI 10.1037/0022-3514.53.4.712 EKMAN P, 1971, J PERS SOC PSYCHOL, V17, P124, DOI 10.1037/h0030377 EKMAN P, 1969, SCIENCE, V164, P86, DOI 10.1126/science.164.3875.86 ELDRED SH, 1958, PSYCHIATR, V21, P115 Elfenbein HA, 2003, J CROSS CULT PSYCHOL, V34, P92, DOI 10.1177/0022022102239157 Elfenbein HA, 2007, EMOTION, V7, P131, DOI 10.1037/1528-3542.7.1.131 Fairbanks G, 1941, SPEECH MONOGR, V8, P85 Fonagy I., 1963, Z PHONETIK SPRACHWIS, V16, P293 Fontaine JRJ, 2007, PSYCHOL SCI, V18, P1050, DOI 10.1111/j.1467-9280.2007.02024.x GOLDMANEISLER F, 1958, Q J EXP PSYCHOL, V10, P96, DOI 10.1080/17470215808416261 Goldman-Eisler F., 1968, PSYCHOLINGUISTICS EX Gosy M., 2008, BESZEDKUTATAS, V2008, P116 Gosy M., 2008, HUMAN FACTORS VOICE, P193 Gosy M., 2003, MAGYAR NYELVOR, V127, P257 HENDERSO.A, 1966, LANG SPEECH, V9, P207 Hofmann SG, 1997, J ANXIETY DISORD, V11, P573, DOI 10.1016/S0887-6185(97)00040-6 IZARD CE, 1994, PSYCHOL BULL, V115, P288, DOI 10.1037/0033-2909.115.2.288 Izard C.E., 1971, FACE EMOTION JOHNSON WF, 1986, ARCH GEN PSYCHIAT, V43, P280 Jovicic S.T., SPECOM 2004, P77 Juslin P. 
N., 2005, NEW HDB METHODS NONV, P65 Juslin PN, 2001, EMOTION, V1, P381, DOI 10.1037//1528-3542.1.4.381 KASL SV, 1965, J PERS SOC PSYCHOL, V1, P425, DOI 10.1037/h0021918 LADD DR, 1985, J ACOUST SOC AM, V78, P435, DOI 10.1121/1.392466 Laukka P, 2008, J NONVERBAL BEHAV, V32, P195, DOI 10.1007/s10919-008-0055-9 MAHL GF, 1956, J ABNORM SOC PSYCH, V53, P1, DOI 10.1037/h0047552 MATSUMOTO D, 1993, MOTIV EMOTION, V17, P107, DOI 10.1007/BF00995188 MCCLUSKEY KW, 1981, INT J PSYCHOL, V16, P119, DOI 10.1080/00207598108247409 MCCLUSKEY KW, 1975, DEV PSYCHOL, V11, P551, DOI 10.1037/0012-1649.11.5.551 Pell MD, 2009, J PHONETICS, V37, P417, DOI 10.1016/j.wocn.2009.07.005 Pell MD, 2008, SPEECH COMMUN, V50, P519, DOI 10.1016/j.specom.2008.03.006 POPE B, 1970, J CONSULT CLIN PSYCH, V35, P128, DOI 10.1037/h0029659 ROCHESTE.SR, 1973, J PSYCHOLINGUIST RES, V2, P51, DOI 10.1007/BF01067111 SCHERER KR, 1984, J ACOUST SOC AM, V76, P1346, DOI 10.1121/1.391450 Scherer KR, 2001, J CROSS CULT PSYCHOL, V32, P76, DOI 10.1177/0022022101032001009 Scherer KR, 2003, SPEECH COMMUN, V40, P227, DOI 10.1016/S0167-6393(02)00084-5 Schonpflug U, 2008, COGNITIVE DEV, V23, P385, DOI 10.1016/j.cogdev.2008.05.002 Schrober M, 2003, SPEECH COMMUN, V40, P99, DOI 10.1016/S0167-6393(02)00078-X Szabo E., 2008, MAGYAR PSZICHOLOGIAI, V63, P651 Thompson WF, 2006, SEMIOTICA, V158, P407, DOI 10.1515/SEM.2006.017 VANBEZOOIJEN R, 1983, J CROSS CULT PSYCHOL, V14, P387, DOI 10.1177/0022002183014004001 WILLIAMS EJ, 1949, AUST J SCI RES SER A, V2, P149 NR 57 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2014 VL 56 BP 35 EP 48 DI 10.1016/j.specom.2013.07.009 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 244AZ UT WOS:000326360800003 ER PT J AU Sun, Y Gemmeke, JF Cranen, B ten Bosch, L Boves, L AF Sun, Yang Gemmeke, Jort F. Cranen, Bert ten Bosch, Louis Boves, Lou TI Fusion of parametric and non-parametric approaches to noise-robust ASR SO SPEECH COMMUNICATION LA English DT Article DE Dynamic Bayesian Network; Virtual Evidence; Sparse Classification; Early fusion; Robust speech recognition ID AUTOMATIC SPEECH RECOGNITION AB In this paper we present a principled method for the fusion of independent estimates of the state likelihood in a Dynamic Bayesian Network (DBN) by means of the Virtual Evidence option for improving speech recognition in the AURORA-2 task. A first estimate is derived from a conventional parametric Gaussian Mixture Model; a second estimate is obtained from a non-parametric Sparse Classification (SC) system. During training the parameters pertaining to the input streams can be optimized independently, but also jointly, provided that all streams represent true probability functions. During decoding the weights of the streams can be varied much more freely. It appeared that the state likelihoods in the GMM and SC streams are very different, and that this makes it necessary to apply different weights to the streams in decoding. When using optimal weights, the dual-input system can outperform the individual GMM or the SC systems for all SNR levels in test sets A and B in the AURORA-2 task. (C) 2013 Elsevier B.V. All rights reserved. C1 [Sun, Yang; Cranen, Bert; ten Bosch, Louis; Boves, Lou] Radboud Univ Nijmegen, Ctr Language & Speech Technol, NL-6525 HT Nijmegen, Netherlands. [Gemmeke, Jort F.] Katholieke Univ Leuven, Dept Elect Engn ESAT, B-3001 Heverlee, Belgium. 
RP Sun, Y (reprint author), Radboud Univ Nijmegen, Ctr Language & Speech Technol, NL-6525 HT Nijmegen, Netherlands. EM jonathan.eric.sun@gmail.com; jgemmeke@amadana.nl; B.Cranen@let.ru.nl; L.tenBosch@let.ru.nl; L.Boves@let.ru.nl FU European Community [213850 SCALE]; IWT-SBO project ALADIN [100049]; FP7-SME project OPTI-FOX [262266] FX The research leading to these results has received funding from the European Community's Seventh Framework Programme FP7/2007-2013 under Grant agreement no 213850 SCALE. The research of Jort F. Gemmeke was funded by IWT-SBO project ALADIN contract 100049. Louis ten Bosch received funding from the FP7-SME project OPTI-FOX, project reference 262266. CR Aradilla G., 2008, THESIS ECOLE POLYTEC Bilmes J., 2004, UWEETR20040016 U WAS Bilmes J., 2001, DISCR STRUCT GRAPH M Bilmes J., 2002, GMTK DOCUMENTATION Bilmes J., 2001, UWEETR20010005 U WAS Bourlard H, 1996, SPEECH COMMUN, V18, P205, DOI 10.1016/0167-6393(96)00003-9 Cetin O., 2005, THESIS U WASHINGTON Chen CP, 2007, IEEE T AUDIO SPEECH, V15, P257, DOI 10.1109/TASL.2006.876717 Cooke M, 2001, SPEECH COMMUN, V34, P267, DOI 10.1016/S0167-6393(00)00034-0 Ellis D.P.W., 2000, STREAM COMBINATION A, P1635 ETSI, 2007, 202050 ETSI ES Fiscus J. G., 1997, P IEEE WORKSH AUT SP, P347 Gemmeke J. F., 2011, P EUSIPCO, P1490 Gemmeke JF, 2011, IEEE T AUDIO SPEECH, V19, P2067, DOI 10.1109/TASL.2011.2112350 Gemmeke J.F., 2011, P IEEE WORKSH AUT SP Hirsch H., 2000, P ICSLP, V4, P29 Hurmalainen A., 2011, P INT WORKSH MACH LI Kirchhoff K., 2000, P ISCA ITRW WORKSH A, P17 Kirchhoff K, 2000, INT CONF ACOUST SPEE, P1435, DOI 10.1109/ICASSP.2000.861883 MISRA H, 2003, ACOUST SPEECH SIG PR, P741 Misra H., 2005, THESIS ECOLE POLYTEC Morris J, 2008, IEEE T AUDIO SPEECH, V16, P617, DOI 10.1109/TASL.2008.916057 Pearl J., 1988, PROBABILISTIC REASON Raj B, 2005, IEEE SIGNAL PROC MAG, V22, P101, DOI 10.1109/MSP.2005.1511828 Rasipuram R, 2011, INT CONF ACOUST SPEE, P5192 Saenko K, 2009, IEEE T PATTERN ANAL, V31, P1700, DOI 10.1109/TPAMI.2008.303 Seltzer ML, 2004, SPEECH COMMUN, V43, P379, DOI 10.1016/j.specom.2004.03.006 Subramanya A., 2007, P IEEE WORKSH AUT SP Sun Y., 2011, P EUSIPCO BARC SPAIN, P1495 Sun Y., 2010, P INT MAK JAP Sun Y., 2011, P INTERSPEECH, P1669 Valente F, 2010, SPEECH COMMUN, V52, P213, DOI 10.1016/j.specom.2009.10.002 van Dalen RC, 2011, IEEE T AUDIO SPEECH, V19, P733, DOI 10.1109/TASL.2010.2061226 Wolmer M., 2012, COMPUTER SPEECH LANG, V27, P780 Wu S., 1998, P ICASSP 98, P459 WU SL, 1998, ACOUST SPEECH SIG PR, P721 NR 36 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2014 VL 56 BP 49 EP 62 DI 10.1016/j.specom.2013.07.003 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 244AZ UT WOS:000326360800004 ER PT J AU Kim, JW Nam, CM Kim, YW Kim, HH AF Kim, Jung Wan Nam, Chung Mo Kim, Yong Wook Kim, Hyang Hee TI The development of the Geriatric Index of Communicative Ability (GICA) for measuring communicative competence of elderly: A pilot study SO SPEECH COMMUNICATION LA English DT Article DE Communicative ability; Elderly; Language; Cognition; Index ID SPEECH RECOGNITION; YOUNG; LISTENERS; NOISE; LANGUAGE; LIFE AB A change in communicative ability, among various changes arising during the aging process, may cause various difficulties for the elderly. This study aims to develop a Geriatric Index of Communicative Ability (GICA) and verify its reliability and validity.
After organizing the areas required for GICA and defining the categories for the sub-domains, relevant questions were arranged. The final version of GICA was completed through the stages of content and face validity, expert review, and pilot study. The overall reliability of GICA was good and the internal consistency (Cronbach's alpha = .786) and test-retest reliability (range of Pearson's correlation coefficients: .58-.98) were high. Based on this verification of the instrument's reliability and validity, the completed GICA was organized with three questions in each of six sub-domains: hearing, language comprehension & production, attention & memory, communication efficiency, voice and reading/writing/calculation. As a tool to measure the communicative ability of elderly people reliably and appropriately, GICA is very useful in the early identification of those with communication difficulties among the elderly. (C) 2013 Published by Elsevier B.V. C1 [Kim, Jung Wan] Daegu Univ, Dept Speech & Language Pathol, Gyongsan 712714, South Korea. [Nam, Chung Mo] Yonsei Univ, Coll Med, Dept Prevent Med, Seoul 120752, South Korea. [Kim, Yong Wook; Kim, Hyang Hee] Yonsei Univ, Coll Med, Dept & Res Inst Rehabil Med, Seoul 120752, South Korea. [Kim, Hyang Hee] Yonsei Univ, Grad Program Speech & Language Pathol, Seoul 120752, South Korea. RP Kim, HH (reprint author), Yonsei Univ, Coll Med, Dept & Res Inst Rehabil Med, Seoul 120752, South Korea. EM h.kim@yonsei.ac.kr CR Brod M, 1999, GERONTOLOGIST, V39, P25 Burzynski C.M., 1987, COMMUNICATION DISORD, P214 Christensen K. J., 1991, PSYCHOL ASSESSMENT J, V3, P168, DOI 10.1037//1040-3590.3.2.168 FEHRING RJ, 1987, HEART LUNG, V16, P625 Frattali C, 1995, AM SPEECH LANGUAGE H Friedenberg L, 1995, PSYCHOL TESTING DESI Frisina DR, 1997, HEARING RES, V106, P95, DOI 10.1016/S0378-5955(97)00006-3 Glorig A., 1965, HEARING LEVELS AD 11, V11 GORDONSALANT S, 1993, J SPEECH HEAR RES, V36, P1276 Gronlund N., 1988, CONSTRUCT ACHIEVEMEN Hegde M.N., 2001, INTRO COMMUNICATIVE Holland A. L., 1999, COMMUNICATION ACTIVI Kang Yeonwook, 2006, [Korean Journal of Psychology: General, 한국심리학회지:일반], V25, P1 Ki B, 1996, J KOREAN NEUROPSYCHI, V35, P298 Kim J.W., 2009, KOREAN J COMMUNICATI, V14, P495 Kim JY, COMMUNICATION Kim Y.S., 2001, J KOREAN ACAD FAMILY, V22, P878 Likert R., 1932, ARCH PSYCHOL, V40, P1 LOMAS J, 1989, J SPEECH HEAR DISORD, V54, P113 Mayerson M.D., 1976, J GERONTOL, V31, P29 MUELLER PB, 1984, EAR NOSE THROAT J, V63, P292 Park J., 2001, GRADE RESPONSE MODEL PICHORAFULLER MK, 1995, J ACOUST SOC AM, V97, P593, DOI 10.1121/1.412282 RIEGEL KF, 1973, GERONTOLOGIST, V13, P478 Statistics Korea, 2010, ELD STAT 2010 The Korean Gerontological Society, 2002, REV GER STUD Thorndike R. M., 1991, MEASUREMENT EVALUATI Tun P. A., 1999, J GERONTOL B-PSYCHOL, V54B, P317 Versfeld NJ, 2002, J ACOUST SOC AM, V111, P401, DOI 10.1121/1.1426376 Watson BC, 1998, J ACOUST SOC AM, V103, P3642, DOI 10.1121/1.423068 Wingfield A, 2006, J AM ACAD AUDIOL, V17, P487, DOI 10.3766/jaaa.17.7.4 Wingfield A, 2006, J NEUROPHYSIOL, V96, P2830, DOI 10.1152/jn.00628.2006 Yonan CA, 2000, PSYCHOL AGING, V15, P88, DOI 10.1037//0882-7974.15.1.88 NR 33 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD JAN PY 2014 VL 56 BP 63 EP 69 DI 10.1016/j.specom.2013.08.001 PG 7 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 244AZ UT WOS:000326360800005 ER PT J AU Amino, K Osanai, T AF Amino, Kanae Osanai, Takashi TI Native vs. non-native accent identification using Japanese spoken telephone numbers SO SPEECH COMMUNICATION LA English DT Article DE Foreign accent identification; Non-native speech; Spoken telephone numbers; Prosody; Forensic speech science ID AUTOMATIC LANGUAGE IDENTIFICATION; PERCEIVED FOREIGN ACCENT; EMOTION RECOGNITION; SPEECH; CLASSIFICATION; ADVANTAGE; ENGLISH AB In forensic investigations, it would be helpful to be able to identify a speaker's native language based on the sound of their speech. Previous research on foreign accent identification suggested that the identification accuracy can be improved by using linguistic forms in which non-native characteristics are reflected. This study investigates how native and non-native speakers of Japanese differ in reading Japanese telephone numbers, which have a specific prosodic structure called a bipodic template. Spoken Japanese telephone numbers were recorded from native speakers, and Chinese and Korean learners of Japanese. Twelve utterances were obtained from each speaker, and their F0 contours were compared between native and non-native speakers. All native speakers realised the prosodic pattern of the bipodic template while reading the telephone numbers, whereas non-native speakers did not. The metric rhythm and segmental properties of the speech samples were also analysed, and a foreign accent identification experiment was carried out using six acoustic features. By applying a logistic regression analysis, this method yielded an 81.8% correct identification rate, which is slightly better than that achieved in other studies. Discrimination accuracy between native and non-native accents was better than 90%, although discrimination between the two non-native accents was not that successful. A perceptual accent identification experiment was also conducted in order to compare automatic and human identifications. The results revealed that human listeners could discriminate between native and non-native speakers better, while they were inferior at identifying foreign accents. (C) 2013 Elsevier B.V. All rights reserved. C1 [Amino, Kanae; Osanai, Takashi] Natl Res Inst Police Sci, Kashiwa, Chiba 2770882, Japan. RP Amino, K (reprint author), Natl Res Inst Police Sci, 6-3-1 Kashiwanoha, Kashiwa, Chiba 2770882, Japan. EM amino@nrips.go.jp; osanai@nrips.go.jp FU MEXT [25350488, 24810034, 21300060] FX Portions of this work were presented at ICPhS 2011 (K. Amino and T. Osanai, Realisation of the prosodic structure of spoken telephone numbers by native and non-native speakers of Japanese, in: Proc. International Congress of Phonetic Sciences, pp. 236-239, Hong Kong, August 2011) and the ASJ meeting in 2011 (K. Amino and T. Osanai, Identification of native and nonnative speech by using Japanese spoken telephone numbers, in: Proc. Autumn Meeting Acoust. Soc. Jpn., pp. 407-410, Matsue, September 2011). This work was supported by Grants-in-Aid for Scientific Research from MEXT (25350488, 24810034, 21300060).
CR Arslan LM, 1996, SPEECH COMMUN, V18, P353, DOI 10.1016/0167-6393(96)00024-6 Arslan LM, 1997, INT CONF ACOUST SPEE, P1123, DOI 10.1109/ICASSP.1997.596139 Baumann S., 2001, P EUR 2001, P557 Beaupre MG, 2006, PERS SOC PSYCHOL B, V32, P16, DOI 10.1177/0146167205277097 Berkling K., 1998, P INT C SPOK LANG PR Blackburn C.S., 1993, P EUROSPEECH, P1241 Bloch B, 1950, LANGUAGE, V26, P86, DOI 10.2307/410409 Boersma P., 2001, GLOT INT, V5, P341 BREWER MB, 1979, PSYCHOL BULL, V86, P307, DOI 10.1037/0033-2909.86.2.307 Brousseau J., 1992, P INT C SPOK LANG PR, P1003 Cleirigh C., 1994, P INT C SPOK LANG PR, P375 Elfenbein HA, 2002, PSYCHOL BULL, V128, P243, DOI 10.1037//0033-2909.128.2.243 FLEGE JE, 1992, J ACOUST SOC AM, V91, P370, DOI 10.1121/1.402780 FLEGE JE, 1988, J ACOUST SOC AM, V84, P70, DOI 10.1121/1.396876 Fullana N., 2009, RECENT RES 2 LANGUAG, P97 Fung P., 1999, P ICASSP, P221 Hall M, 2009, ACM SIGKDD EXPLORATI, V11, P10, DOI DOI 10.1145/1656274.1656278 HANSEN JHL, 1995, INT CONF ACOUST SPEE, P836, DOI 10.1109/ICASSP.1995.479824 Hirano H., 2006, IEICE TECHNICAL REPO, V105, P23 Hirano H., 2006, IEICE TECHNICAL REPO, V106, P19 Hollien H., 2002, FORENSIC VOICE IDENT Itahashi S., 1992, P INT C SPOK LANG PR, P1015 Itahashi S., 1993, P EUR, P639 Ito C., 2006, MIT WORKING PAPERS L, V52, P65 Joh H., 2011, DICT BASIC PHONETIC Katagiri K.L., 2008, P S ED JAP AS CHUL U, P103 KIM CW, 1968, LANGUAGE, V44, P516, DOI 10.2307/411719 Kindaichi H., 2001, DICT JAPANESE ACCENT Koshimizu M., 1998, GUIDE WORLDS LANGU 2 Kulshreshtha M, 2012, FORENSIC SPEAKER RECOGNITION: LAW ENFORCEMENT AND COUNTER-TERRORISM, P71, DOI 10.1007/978-1-4614-0263-3_4 Kumpf K, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P1740 Lee J.P., 2004, P INTERSPEECH, P1245 Lee O.H., 2005, SPEECH SCI, V12, P95 Lin H, 2007, J CHINESE LINGUISTIC, V17, P127 Mackay I.R.A., 2009, RECENT RES 2 LANGUAG, P43 Manning C. 
D., 1999, FDN STAT NATURAL LAN Markham D., 1999, P INT C SPOK LANG PR, P1187 Matsuzaki H., 1999, J PHONETIC SOC JAPAN, V3, P26 Mehlhorn G., 2007, P 16 INT C PHON SCI, P1745 Min K.J., 1996, THESIS TOHOKU U SEND Mixdorff H, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P1469 Mok P., 2008, P SPEECH PROS 2008 C, P423 Muthusamy YK, 1994, IEEE SIGNAL PROC MAG, V11, P33, DOI 10.1109/79.317925 Nasu A., 2001, J OSAKA U FOREIGN ST, V25, P115 Noma H., 1998, GUIDE WORLDS LANGU 2 Piat M, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P759 POSER WJ, 1990, LANGUAGE, V66, P78, DOI 10.2307/415280 Ramus F, 1999, COGNITION, V73, P265, DOI 10.1016/S0010-0277(99)00058-X Rogers H., 1998, INT J SPEECH LANG LA, V5, P203 Saito Y., 2009, INTRO JAPANESE PHONE Tate D.A., 1979, CURRENT ISSUES PHONE, P847 Teixeira C, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P1784 Toki S., 2010, PHONETICS RES BASED Tsujimura N., 1996, INTRO JAPANESE LINGU Utsugi A., 2004, J PHONET SOC JPN, V8, P96 Vance T., 2008, SOUNDS JAPANESE Vieru-Dimulescu B., 2007, P INT WORKSH PAR SPE, P47 Wrembel M., 2009, RECENT RES 2 LANGUAG, P291 Yanguas L.R., 1998, P INT C SPOK LANG PR Zissman MA, 1996, INT CONF ACOUST SPEE, P777, DOI 10.1109/ICASSP.1996.543236 Zissman MA, 1996, IEEE T SPEECH AUDI P, V4, P31, DOI 10.1109/TSA.1996.481450 NR 61 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2014 VL 56 BP 70 EP 81 DI 10.1016/j.specom.2013.07.010 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 244AZ UT WOS:000326360800006 ER PT J AU Besacier, L Barnard, E Karpov, A Schultz, T AF Besacier, Laurent Barnard, Etienne Karpov, Alexey Schultz, Tanja TI Introduction to the special issue on processing under-resourced languages SO SPEECH COMMUNICATION LA English DT Editorial Material C1 [Besacier, Laurent] Lab Informat Grenoble, Grenoble, France. [Barnard, Etienne] North West Univ, Vanderbijlpark, South Africa. [Karpov, Alexey] Russian Acad Sci, St Petersburg Inst Informat & Automat, St Petersburg 196140, Russia. [Schultz, Tanja] Karlsruhe Inst Technol, D-76021 Karlsruhe, Germany. RP Besacier, L (reprint author), Lab Informat Grenoble, Grenoble, France. RI Karpov, Alexey/A-8905-2012 OI Karpov, Alexey/0000-0003-3424-652X NR 0 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2014 VL 56 BP 83 EP 84 DI 10.1016/j.specom.2013.09.001 PG 2 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 244AZ UT WOS:000326360800007 ER PT J AU Besacier, L Barnard, E Karpov, A Schultz, T AF Besacier, Laurent Barnard, Etienne Karpov, Alexey Schultz, Tanja TI Automatic speech recognition for under-resourced languages: A survey SO SPEECH COMMUNICATION LA English DT Article DE Under-resourced languages; Automatic speech recognition (ASR); Language portability; Speech and language resources acquisition; Statistical language modeling; Crosslingual acoustic modeling and adaptation; Automatic pronunciation generation; Lexical modeling ID ASR AB Speech processing for under-resourced languages is an active field of research, which has experienced significant progress during the past decade. 
We propose, in this paper, a survey that focuses on automatic speech recognition (ASR) for these languages. The definition of under-resourced languages and the challenges associated to them are first defined. The main part of the paper is a literature review of the recent (last 8 years) contributions made in ASR for under-resourced languages. Examples of past projects and future trends when dealing with under-resourced languages are also presented. We believe that this paper will be a good starting point for anyone interested to initiate research in (or operational development of) ASR for one or several under-resourced languages. It should be clear, however, that many of the issues and approaches presented here, apply to speech technology in general (text-to-speech synthesis for instance). (C) 2013 Published by Elsevier B.V. C1 [Besacier, Laurent] Lab Informat Grenoble, Grenoble, France. [Barnard, Etienne] North West Univ, Vanderbijlpark, South Africa. [Karpov, Alexey] Russian Acad Sci, St Petersburg Inst Informat & Automat, St Petersburg 196140, Russia. [Schultz, Tanja] Karlsruhe Inst Technol, D-76021 Karlsruhe, Germany. RP Besacier, L (reprint author), Lab Informat Grenoble, Grenoble, France. RI Karpov, Alexey/A-8905-2012 OI Karpov, Alexey/0000-0003-3424-652X CR Abdillahi N, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P289 Ablimit M, 2010, 2010 IEEE 10TH INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING PROCEEDINGS (ICSP2010), VOLS I-III, P581, DOI 10.1109/ICOSP.2010.5656065 Adda-Decker M., 2003, P EUR C SPEECH COMM, P257 [Anonymous], 2009, US NIST 2009 RT 09 R Arisoy E, 2006, SIGNAL PROCESS, V86, P2844, DOI 10.1016/j.sigpro.2005.12.002 Arisoy E., 2012, P NAACL HLT 2012 WOR, P20 Barnard E., 2009, P INTERSPEECH, P2847 Barnard E., 2010, P AAAI SPRING S ART, P8 Barnett J, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P2191 Berment V., 2004, THESIS J FOURIER U G Besacier L., 2006, IEEE ACL SLT 2006 AR Bhanuprasad K., 2008, P 3 INT JOINT C NAT, P805 Billa J., 1997, P EUR, P363 Cai J., 2008, SLTU 08 HAN VIETN Carki K., 2000, IEEE ICASSP Cetin O., 2008, SLTU 08 HAN VIETN Chan H.Y., 2012, P 2 ACM S COMP DEV Charniak E., 2003, P MT SUMM 9 NEW ORL, P40 Charoenpornsawat P., 2006, HUM LANG TECHN C HLT Chelba C, 2000, COMPUT SPEECH LANG, V14, P283, DOI 10.1006/csla.2000.0147 Cohen P., 1997, P AUT SPEECH REC UND, P591 Constantinescu A., 1997, P ASRU, P606 Creutz M., 2005, A81 HELS U TECHN Creutz M., 2007, ACM T SPEECH LANGUAG, V5 Crystal D., 2000, LANGUAGE DEATH Cucu H., 2011, P ASRU 2011 HAW US Cucu H., 2012, EUSIPCO 2012 BUC ROM Cucu H., 2013, SPEECH COMMUN, DOI DOI 10.1016/J.SPEC0M.2013.05.003 Davel M.H., 2011, P INTERSPEECH, P3153 De Vries N.J., 2011, P INTERSPEECH, P3177 De Vries N.J., 2013, SPEECH COMMUN, DOI DOI 10.1016/J.SPEC0M.2013.07.001 Denoual E., 2006, NLP ANN M TOK JAP, P731 Do T, 2010, WORKSH SPOK LANG TEC Dugast C., 1995, P EUROSPEECH, P197 Ekpenyong M., 2013, SPEECH COMMUN, DOI DOI 10.1016/J.SPEC0M.2013.02.003 Ganapathiraju A., 2000, P INT C SPOK LANG PR, V4, P504 Gebreegziabher M., 2012, SLTU WORKSH SPOK LAN Gelas H., 2011, INT 2011 FLOR IT 28 Gelas H., 2010, LAPH 12 NEW MEX US J Gemmeke J.F., 2011, IEEE ASRU 2011 HI US Ghoshal A., 2009, IEEE ICASSP Gizaw S., 2008, SLTU 2008 HAN VIETN GLASS J, 1995, SPEECH COMMUN, V17, P1, DOI 10.1016/0167-6393(95)00008-C Godfrey J. 
J., 1992, P ICASSP, V1, P517 Gokcen S., 1997, P AUT SPEECH REC UND, P599 Grezl F., 2007, P ICASSP Hermansky H., 2000, P ICASSP Huang C., 2000, P ICSLP, P818 Huet S, 2010, COMPUT SPEECH LANG, V24, P663, DOI 10.1016/j.csl.2009.10.001 Hughes T, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 3 AND 4, P1914 IPA, 1999, HDB INT PHON ASS GUI Jensson A., 2008, SLTU 08 HAN VIETN Jing Z., 2010, P INT C COMP MECH CO, V5, P320 Kanejiya D.P., 2003, P TIFR WORKSH SPOK L, P93 Kanthak S., 2003, EUR 2003 GEN SWITZ, P1145 Karanasou P, 2010, LECT NOTES ARTIF INT, V6233, P167, DOI 10.1007/978-3-642-14770-8_20 Karpov A., 2011, P INT 2011 FLOR IT, P3161 Karpov A., 2013, SPEECH COMMUN, DOI DOI 10.1016/J.SPEC0M.2013.07.004 Kiecza D., 1999, P INT C SPEECH PROC, P323 Killer M., 2003, INTERSPEECH Kipyatkova I., 2012, Proceedings of the 2012 Federated Conference on Computer Science and Information Systems (FedCSIS) KOHLER J, 1998, ACOUST SPEECH SIG PR, P417 Krauwer S., 2003, P 2003 INT WORKSH SP, P8 Kuo HKJ, 2009, 2009 IEEE WORKSHOP ON AUTOMATIC SPEECH RECOGNITION & UNDERSTANDING (ASRU 2009), P327, DOI 10.1109/ASRU.2009.5373470 Kurimo M., 2006, P HLT NAACL NY US Kurimo M., 2006, P INT 06 PITTSB PA U, P1021 Lamel L., 1995, P EUR, P185 Laurent A., 2009, INT 2009 BRIGHT UK, P708 Le V.B., 2003, EUROSPEECH 2003, P3117 Le VB, 2009, IEEE T AUDIO SPEECH, V17, P1471, DOI 10.1109/TASL.2009.2021723 Lee DG, 2009, IEEE T AUDIO SPEECH, V17, P945, DOI 10.1109/TASL.2009.2019922 Loof J., 2009, INT 2009 BRIGHT UK Lopatkova M, 2005, LECT NOTES ARTIF INT, V3658, P140 Mihajlik P., 2007, INT 07 ANTW BELG MIKOLOV T, 2010, P INT, P1045 Mohamed AR, 2012, IEEE T AUDIO SPEECH, V20, P14, DOI 10.1109/TASL.2011.2109382 Muthusamy Y.K., 1992, 2 INT C SPOK LANG PR Nakajima H., 2002, COLING 2002, V2, P716 Nanjo H, 2005, INT CONF ACOUST SPEE, P1053 Oparin I., 2008, P IEEE WORKSH SPOK L Parent G, 2010, Proceedings 2010 IEEE Spoken Language Technology Workshop (SLT 2010), DOI 10.1109/SLT.2010.5700870 Patel N, 2009, CHI2009: PROCEEDINGS OF THE 27TH ANNUAL CHI CONFERENCE ON HUMAN FACTORS IN COMPUTING SYSTEMS, VOLS 1-4, P51 Patel N, 2010, CHI2010: PROCEEDINGS OF THE 28TH ANNUAL CHI CONFERENCE ON HUMAN FACTORS IN COMPUTING SYSTEMS, VOLS 1-4, P733 Pellegrini T., 2006, ICSLP 06 PITTSB Pellegrini T., 2008, SLTU 08 HAN VIETN Pellegrini T, 2009, IEEE T AUDIO SPEECH, V17, P863, DOI 10.1109/TASL.2009.2022295 Plahl C., 2011, P ASRU US Rastrow A., 2012, P 50 ANN M ASS COMP, P175 Ronzhin A L, 2007, Pattern Recognition and Image Analysis, V17, DOI 10.1134/S1054661807020216 Rotovnik T, 2007, SPEECH COMMUN, V49, P437, DOI 10.1016/j.specom.2007.02.010 Roux J.C., 2000, P 2 INT C LANG RES E, P975 Sak H, 2010, INT CONF ACOUST SPEE, P5402, DOI 10.1109/ICASSP.2010.5494927 Sarikaya R, 2007, INT CONF ACOUST SPEE, P181 Schlippe T., 2010, INT 2010 MAK JAP 26 Schlippe T., 2012, ICASSP 2012 KYOT JAP Schlippe T., 2012, INT 2012 PORTL OR 9 Schlippe T., 2013, COMMUNICATION, DOI DOI 10.1016/J.SPEC0M.2013.06.015 Schultz T., 2002, P ICSLP, V1, P345 Schultz T, 2001, SPEECH COMMUN, V35, P31, DOI 10.1016/S0167-6393(00)00094-7 Schultz T., 2007, INT 2007 ANTW BELG Schultz T., 2013, ICASSP 2013 VANC CAN Schultz T., 2006, MULTILINGUAL SPEECH SCHULTZ T, 1998, P ICSLP SYDN, P1819 Seide F., 2011, P ASRU, P24 Siniscalchi SM, 2013, COMPUT SPEECH LANG, V27, P209, DOI 10.1016/j.csl.2012.05.001 Solera-Urena R, 2007, SPEECH COMMUN, V49, P253, DOI 10.1016/j.specom.2007.01.013 Stahlberg F., 2012, P 4 IEEE WORKSH SPOK Stahlberg F., 2013, P 1 
INT C STAT LANG Stephenson T.A., 2002, IDIAPRR242002, P10 Stolcke A., 2006, P ICASSP 2006 Stuker S., 2009, INT 2009 BRIGHT UK Stuker S., 2008, SLTU 08 HAN VIETN Stuker S., 2003, ICASSP 2003 Stuker S., 2003, P ICASSP 03 IEEE INT Suenderman K., 2009, INT 2009 BRIGHT UK, P1475 Szarvas M., 2003, P ICASSP HONG KONG C, P368 Tachbelie M., 2013, SPEECH COMMUN, DOI DOI 10.1016/J.SPEC0M.2013.01.008 Tachbelie M., 2012, SLTU WORKSH SPOK LAN Tarjan B., 2010, P 2 INT WORKSH SPOK, P10 Thomas S., 2012, P INTERSPEECH Thomas Samuel, 2012, P ICASSP Toth L., 2008, P INTERSPEECH Trentin E, 2001, NEUROCOMPUTING, V37, P91, DOI 10.1016/S0925-2312(00)00308-8 van Heerden C., 2010, P WORKSH SPOK LANG T, P17 van Niekerk D.R., 2013, SPEECH COMMUN, DOI DOI 10.1016/J.SPEC0M.2013.01.009 Vergyri D., 2004, P ICSLP, P2245 Vesely K., 2012, P SLT US Vu Ngoc Thang, 2012, P INTERSPEECH Vu N.T., 2011, P INTERSPEECH Vu N.T., 2012, P SLTU S AFR Vu N.T., 2010, P SLT US WHEATLEY B, 1994, INT CONF ACOUST SPEE, P237 Whittaker EWD, 2001, INT CONF ACOUST SPEE, P545, DOI 10.1109/ICASSP.2001.940889 Whittaker E.W.D., 2000, THESIS CAMBRIDGE U, P140 Wissing D., 2008, SO AFRICAN LINGUISTI, V26, P255 Young S., 2008, SPRINGER HDB SPEECH, P539, DOI 10.1007/978-3-540-49127-9_27 Young SJ, 1997, COMPUT SPEECH LANG, V11, P73, DOI 10.1006/csla.1996.0023 Yu D, 2012, 2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), P4169 NR 138 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2014 VL 56 BP 85 EP 100 DI 10.1016/j.specom.2013.07.008 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 244AZ UT WOS:000326360800008 ER PT J AU Schlippe, T Ochs, S Schultz, T AF Schlippe, Tim Ochs, Sebastian Schultz, Tanja TI Web-based tools and methods for rapid pronunciation dictionary creation SO SPEECH COMMUNICATION LA English DT Article DE Web-derived pronunciations; Pronunciation modeling; Rapid bootstrapping; Multilingual speech recognition AB In this paper we study the potential as well as the challenges of using the World Wide Web as a seed for the rapid generation of pronunciation dictionaries in new languages. In particular, we describe Wiktionary, a community-driven resource of pronunciations in IPA notation, which is available in many different languages. First, we analyze Wiktionary in terms of language and vocabulary coverage and compare it in terms of quality and coverage with another source of pronunciation dictionaries in multiple languages (GlobalPhone). Second, we investigate the performance of statistical grapheme-to-phoneme models in ten different languages and measure the model performance for these languages over the amount of training data. The results show that for the studied languages about 15k phone tokens are sufficient to train stable grapheme-to-phoneme models. Third, we create grapheme-to-phoneme models for ten languages using both the GlobalPhone and the Wiktionary resources. The resulting pronunciation dictionaries are carefully evaluated along several quality checks, i.e. in terms of consistency, complexity, model confidence, grapheme n-gram coverage, and phoneme perplexity. Fourth, as a crucial prerequisite for a fully automated process of dictionary generation, we implement and evaluate methods to automatically remove flawed and inconsistent pronunciations from dictionaries. 
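The sentence above ends with methods for automatically removing flawed and inconsistent pronunciations from a web-derived dictionary. The paper's actual filters are not reproduced here; the snippet below only illustrates one simple heuristic of that general kind, flagging entries whose phones-per-letter ratio is an outlier. The dictionary contents and thresholds are hypothetical.

def flag_suspect_entries(lexicon, low=0.5, high=2.0):
    """Flag (word, pronunciation) pairs whose phones-per-letter ratio
    falls outside a plausible range -- a crude consistency heuristic."""
    suspects = []
    for word, prons in lexicon.items():
        for phones in prons:
            ratio = len(phones) / max(len(word), 1)
            if ratio < low or ratio > high:
                suspects.append((word, phones, round(ratio, 2)))
    return suspects

# Hypothetical IPA-like entries; the last variant is deliberately flawed.
lexicon = {
    "speech": [["s", "p", "i:", "tS"]],
    "hello": [["h", "@", "l", "oU"], ["h", "E", "l", "oU"]],
    "a": [["eI"], ["eI", "b", "i:", "s", "i:", "d", "i:"]],
}
for entry in flag_suspect_entries(lexicon):
    print(entry)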
Last but not least, speech recognition experiments in six languages evaluate the usefulness of the dictionaries in terms of word error rates. Our results indicate that the web resources of Wiktionary can be successfully leveraged to fully automatically create pronunciation dictionaries in new languages. (C) 2013 Elsevier B.V. All rights reserved. C1 [Schlippe, Tim; Ochs, Sebastian; Schultz, Tanja] Karlsruhe Inst Technol, Inst Anthropomat, Cognit Syst Lab, D-76131 Karlsruhe, Germany. RP Schlippe, T (reprint author), Karlsruhe Inst Technol, Inst Anthropomat, Cognit Syst Lab, Adenauerring 4, D-76131 Karlsruhe, Germany. EM tim.schlippe@kit.edu FU OSEO, French State agency for innovation FX This work was partly realized as part of the Quaero Programme, funded by OSEO, French State agency for innovation. CR Besling S., 1994, KONVENS Bisani M., 2008, SPEECH COMMUNICATION Black A. W., 1998, ESCA WORKSH SPEECH S Can D., 2009, 32 ANN INT ACM SIGIR Chen S.F., 2003, EUROSPEECH Davel M., 2006, INTERSPEECH Davel M., 2009, INTERSPEECH Davel M., 2010, INTERSPEECH Davel M., 2004, ICSLP Gerosa M., 2009, P 2009 IEEE INT C AC, DOI DOI 10.1109/ICASSP.2009.4960583 Ghoshal A., 2009, ICASSP Hahn S., 2012, INT 2012 IPA I. P. A., 1999, HDB INT PHON ASS GUI Jiampojamarn S., 2007, HLT Kanthak S., 2002, ICASSP Kaplan R.M., 1994, COMPUTATIONAL LINGUI Karanasou P., 2010, P 7 INT C ADV NAT LA Killer M., 2003, EUROSPEECH Kneser R., 2000, WYTP409100002 PHIL S Kominek J., 2009, THESIS Kominek J., 2006, HLT C NAACL Laurent A., 2009, INTERSPEECH Llitjos A.F., 2002, LREC Martirosian O., 2007, PRASA Novak J., 2011, PHONETISAURUS WFST D Novak J., 2012, INT WORKSH FIN STAT Schlippe T., 2010, INTERSPEECH Schlippe T., 2012, SLT U Schlippe T., 2012, INTERSPEECH Schlippe T., 2012, ICASSP Schultz T., 2002, ICSLP Schultz T., 2007, INTERSPEECH Stueker S., 2004, SPECOM Vozila P., 2003, EUROSPEECH Vu N. T., 2010, INTERSPEECH Wells JC, 1997, HDB STANDARDS RESOUR Wikimedia, 2012, LIST WIK ED RANK ART Wolff M., 2002, PMLA Zhu X., 2001, ICASSP NR 39 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2014 VL 56 BP 101 EP 118 DI 10.1016/j.specom.2013.06.015 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 244AZ UT WOS:000326360800009 ER PT J AU de Vries, NJ Davel, MH Badenhorst, J Basson, WD de Wet, F Barnard, E de Waal, A AF de Vries, Nic J. Davel, Marelie H. Badenhorst, Jaco Basson, Willem D. de Wet, Febe Barnard, Etienne de Waal, Alta TI A smartphone-based ASR data collection tool for under-resourced languages SO SPEECH COMMUNICATION LA English DT Article DE Smartphone-based; ASR data collection; Under-resourced languages; Automatic speech recognition; ASR corpora; Speech resources; Speech data collection; Broadband speech corpora; Woefzela; On-device quality control; QC-on-the-go; Android ID DEVELOPING REGIONS; TECHNOLOGY AB Acoustic data collection for automatic speech recognition (ASR) purposes is a particularly challenging task when working with under-resourced languages, many of which are found in the developing world. We provide a brief overview of related data collection strategies, highlighting some of the salient issues pertaining to collecting ASR data for under-resourced languages. We then describe the development of a smartphone-based data collection tool, Woefzela, which is designed to function in a developing world context. 
Specifically, this tool is designed to function without any Internet connectivity, while remaining portable and allowing for the collection of multiple sessions in parallel; it also simplifies the data collection process by providing process support to various role players during the data collection process, and performs on-device quality control in order to maximise the use of recording opportunities. The use of the tool is demonstrated as part of a South African data collection project, during which almost 800 hours of ASR data was collected, often in remote, rural areas, and subsequently used to successfully build acoustic models for eleven languages. The on-device quality control mechanism (referred to as QC-on-the-go) is an interesting aspect of the Woefzela tool and we discuss this functionality in more detail. We experiment with different uses of quality control information, and evaluate the impact of these on ASR accuracy. Woefzela was developed for the Android Operating System and is freely available for use on Android smartphones. (C) 2013 Elsevier B.V. All rights reserved. C1 [de Vries, Nic J.; Badenhorst, Jaco; Basson, Willem D.; de Wet, Febe; de Waal, Alta] CSIR, Meraka Inst, Human Language Technol Res Grp, Pretoria, South Africa. [Davel, Marelie H.; Badenhorst, Jaco; Basson, Willem D.; Barnard, Etienne] North West Univ, Multilingual Speech Technol, ZA-1900 Vanderbijlpark, South Africa. [de Wet, Febe] Univ Stellenbosch, Dept Elect & Elect Engn, ZA-7602 Matieland, South Africa. RP de Vries, NJ (reprint author), CSIR, Meraka Inst, Human Language Technol Res Grp, Pretoria, South Africa. EM ndevries@csir.co.za; marelie.davel@nwu.ac.za CR Abney S., 2010, P 48 ANN M ASS COMP, P88 Ackermann U., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. 
No.96TH8206), DOI 10.1109/ICSLP.1996.607244 Badenhorst J, 2011, LANG RESOUR EVAL, V45, P289, DOI 10.1007/s10579-011-9152-1 Badenhorst J., 2012, P WORKSH SPOK LANG T, P139 Badenhorst J., 2011, P PATT REC ASS S AFR, P1 Barnard E, 2008, 2008 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY: SLT 2008, PROCEEDINGS, P13, DOI 10.1109/SLT.2008.4777828 Barnard E, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 1-2, P282 Barnard E., 2003, 9 INT WIN C STELL S, P1 Barnard E., 2009, P INTERSPEECH, P2847 Barnard E., 2010, AAAI S ART INT DEV A, P8 Basson W.D., 2012, P PATT REC ASS S AFR, P144 Botha G., 2005, P 16 ANN S PATT REC, P194 Brewer E, 2006, IEEE PERVAS COMPUT, V5, P15, DOI 10.1109/MPRV.2006.40 Brewer E, 2005, COMPUTER, V38, P25, DOI 10.1109/MC.2005.204 Constantinescu A., 1997, AUT SPEECH REC UND A, P606 Davel M., 2009, INTERSPEECH, P2851 Davel M, 2008, COMPUT SPEECH LANG, V22, P374, DOI 10.1016/j.csl.2008.01.001 Davel M.H., 2012, P WORKSH SPOK LANG T, P68 Davel M.H., 2011, P INTERSPEECH, P3153 De Vries N.J., 2011, P INTERSPEECH, P3177 De Vries N.J., 2012, THESIS N W U S AFRIC De Wet F., 2011, NATL CTR HUMAN LANGU De Wet F., 2006, P PATT REC ASS S AFR, P1 De Wet F., 2011, P INTERSPEECH, P3185 Draxler C., 2007, P INT 07 ANTW BELG, P1509 Gauvain J.-1., 1988, P ICSLP 88 SYDN AUST, P1335 Giwa O., 2011, P PATT REC ASS S AFR, P49 Grover A.S., 2011, P 20 INT C WORLD WID, P433 Grover A.S., 2009, IEEE INT C ICTD ICTD, P95 Huang X., 2001, SPOKEN LANGUAGE PROC Hughes T, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 3 AND 4, P1914 Kamper H., 2012, P WORKSH SPOK LANG T, P102 Kamvar M, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 3 AND 4, P1966 Kim DY, 2003, ASRU'03: 2003 IEEE WORKSHOP ON AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING ASRU '03, P105 Kleynhans N., 2012, P PATT REC ASS S AFR, P165 Lamel L, 2002, COMPUT SPEECH LANG, V16, P115, DOI 10.1006/csla.2001.0186 Lane Ian, 2010, P NAACL HLT, P184 Lata S., 2010, P 7 C INT LANG RES E, P2851 Lee C., 2011, P INT, P3041 Maxwell M., 2006, P WORKSH FRONT LING, P29, DOI 10.3115/1641991.1641996 McGraw I., 2010, P LREC, P19 Modipa T.I., 2012, P PATT REC ASS S AFR, P173 Parent G., 2011, P INTERSPEECH, P3037 Pentland AS, 2004, COMPUTER, V37, P78, DOI 10.1109/MC.2004.1260729 Roux JC, 2004, P LREC LISB PORT, P93 Schiel F., 2003, PRODUCTION SPEECH CO Schultz T., 2002, P ICSLP, V1, P345 Schultz T, 2001, SPEECH COMMUN, V35, P31, DOI 10.1016/S0167-6393(00)00094-7 Schultz T, 2006, MULTILINGUAL SPEECH Schuster M, 2010, LECT NOTES ARTIF INT, V6230, P8, DOI 10.1007/978-3-642-15246-7_3 Shan JL, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 1-2, P354 Grover AS, 2011, LANG RESOUR EVAL, V45, P271, DOI 10.1007/s10579-011-9151-2 Sherwani J., 2007, INT C INF COMM TECHN, P1 Sung Y., 2011, P INTERSPEECH, P2865 van Heerden C., 2010, P WORKSH SPOK LANG T, P17 Van Heerden C., 2011, P PATT REC ASS S AFR, P138 Van Heerden C.J., 2012, P WORKSH SPOK LANG T, P146 van den Heuvel H, 2008, LANG RESOUR EVAL, V42, P41, DOI 10.1007/s10579-007-9049-1 WHEATLEY B, 1994, INT CONF ACOUST SPEE, P237 Young S., 2009, HTK BOOK VERSION 3 4 NR 60 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
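The Woefzela record above describes on-device quality control applied to each recording ("QC-on-the-go"). The checks and thresholds below are illustrative guesses at that kind of screening (implausible duration, clipping, low signal level), not the tool's actual implementation.

import numpy as np

def quick_qc(samples, sample_rate, min_sec=1.0, max_sec=20.0,
             clip_frac=0.01, min_rms=0.01):
    """Return a list of quality problems for one mono recording given as
    floats in [-1, 1]. Thresholds are illustrative only."""
    samples = np.asarray(samples, dtype=float)
    problems = []
    duration = len(samples) / float(sample_rate)
    if not (min_sec <= duration <= max_sec):
        problems.append("duration_out_of_range")
    if np.mean(np.abs(samples) > 0.99) > clip_frac:
        problems.append("clipping")
    if np.sqrt(np.mean(samples ** 2)) < min_rms:
        problems.append("too_quiet")
    return problems

# Hypothetical check on a 2-second synthetic tone sampled at 16 kHz.
t = np.linspace(0, 2, 2 * 16000, endpoint=False)
print(quick_qc(0.3 * np.sin(2 * np.pi * 220 * t), 16000))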
PD JAN PY 2014 VL 56 BP 119 EP 131 DI 10.1016/j.specom.2013.07.001 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 244AZ UT WOS:000326360800010 ER PT J AU Ko, T Mak, B AF Ko, Tom Mak, Brian TI Eigentrigraphemes for under-resourced languages SO SPEECH COMMUNICATION LA English DT Article DE Eigentriphone; Eigentrigrapheme; Under-resourced language; Grapheme; Regularization; Weighted PCA AB Grapheme-based modeling has an advantage over phone-based modeling in automatic speech recognition for under-resourced languages when a good dictionary is not available. Recently we proposed a new method for parameter estimation of context-dependent hidden Markov model (HMM) called eigentriphone modeling. Eigentriphone modeling outperforms conventional tied-state HMM by eliminating the quantization errors among the tied states. The eigentriphone modeling framework is very flexible and can be applied to any group of modeling unit provided that they may be represented by vectors of the same dimension. In this paper, we would like to port the eigentriphone modeling method from a phone-based system to a grapheme-based system; the new method will be called eigentrigrapheme modeling. Experiments on four official South African under-resourced languages (Afrikaans, South African English, Sesotho, siSwati) show that the new eigentrigrapheme modeling method reduces the word error rates of conventional tied-state trigrapheme modeling by an average of 4.08% relative. (C) 2013 Elsevier B.V. All rights reserved. C1 [Ko, Tom; Mak, Brian] Hong Kong Univ Sci & Technol, Dept Comp Sci & Engn, Hong Kong, Hong Kong, Peoples R China. RP Mak, B (reprint author), Hong Kong Univ Sci & Technol, Dept Comp Sci & Engn, Hong Kong, Hong Kong, Peoples R China. EM tomko@cse.ust.hk; mak@cse.ust.hk FU Research Grants Council of the Hong Kong SAR [FSGRF12EG31, FSGRF13EG20, SRFI11EG15] FX This research is partially supported by the Research Grants Council of the Hong Kong SAR under the grant numbers FSGRF12EG31, FSGRF13EG20, and SRFI11EG15. CR Andersen O., 1996, P INT C SPOK LANG PR Bellegarda J.R., 2003, P IEEE INT C AC SPEE Beyerlein P., 1999, P IEEE AUT SPEECH RE Burget L., 2010, P IEEE INT C AC SPEE Charoenpornsawat S.H.P., 2006, P HUM LANG TECHN C N Daniels Peter T., 1996, WORLDS WRITING SYSTE Davel M., 2009, P INTERSPEECH Davel M., 2004, P INTERSPEECH de Wet F., 2011, P INTERSPEECH Kamper H., 2011, P INTERSPEECH Kanthak S., 2003, P EUR C SPEECH COMM Kanthak S., 2002, P IEEE INT C AC SPEE Ko T., 2011, P INTERSPEECH Ko T., 2011, P IEEE INT C AC SPEE Ko T., 2012, P IEEE INT C AC SPEE Ko T, 2013, IEEE T AUDIO SPEECH, V21, P1285, DOI 10.1109/TASL.2013.2248722 Kohler J., 1996, P INT C SPOK LANG PR Kuhn R, 2000, IEEE T SPEECH AUDI P, V8, P695, DOI 10.1109/89.876308 Le VB, 2009, IEEE T AUDIO SPEECH, V17, P1471, DOI 10.1109/TASL.2009.2021723 Lu L., 2011, 2011 IEEE WORKSH AUT, P365 Meng H, 1996, SPEECH COMMUN, V18, P47, DOI 10.1016/0167-6393(95)00032-1 Meraka-Institute, LWAZ ASR CORP Ogbureke K.U., 2010, P IEEE INT C AC SPEE Povey D., 2010, P IEEE INT C AC SPEE Roux J.C., 2004, P LREC Schukat-Talamazzini E.G., 1993, P EUR C SPEECH COMM Sharma-Grover A., 2010, P 2 WORKSH AFR LANG Sooful J.J., 2001, P PATT REC ASS S AFR Stuker S., 2009, THESIS U FRIDERICIAN Stuker S., 2008, P IEEE INT C AC SPEE Tempest M., 2009, DICT MAKER 2 16 USER van Heerden C., 2009, P INTERSPEECH van Huyssteen G.B., 2009, P 20 ANN S PATT REC Young S. J., 1994, P WORKSH HUM LANG TE Young S. 
J., 2006, HTK BOOK VERSION 3 4 NR 35 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2014 VL 56 BP 132 EP 141 DI 10.1016/j.specom.2013.01.010 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 244AZ UT WOS:000326360800011 ER PT J AU Imseng, D Motlicek, P Bourlard, H Garner, PN AF Imseng, David Motlicek, Petr Bourlard, Herve Garner, Philip N. TI Using out-of-language data to improve an under-resourced speech recognizer SO SPEECH COMMUNICATION LA English DT Article DE Multilingual speech recognition; Posterior features; Subspace Gaussian mixture models; Under-resourced languages; Afrikaans AB Under-resourced speech recognizers may benefit from data in languages other than the target language. In this paper, we report how to boost the performance of an Afrikaans automatic speech recognition system by using already available Dutch data. We successfully exploit available multilingual resources through (1) posterior features, estimated by multilayer perceptrons (MLP) and (2) subspace Gaussian mixture models (SGMMs). Both the MLPs and the SGMMs can be trained on out-of-language data. We use three different acoustic modeling techniques, namely Tandem, Kullback-Leibler divergence based HMMs (KL-HMM) as well as SGMMs and show that the proposed multilingual systems yield 12% relative improvement compared to a conventional monolingual HMM/GMM system only trained on Afrikaans. We also show that KL-HMMs are extremely powerful for under-resourced languages: using only six minutes of Afrikaans data (in combination with out-of-language data), KL-HMM yields about 30% relative improvement compared to conventional maximum likelihood linear regression and maximum a posteriori based acoustic model adaptation. (C) 2013 Elsevier B.V. All rights reserved. C1 [Imseng, David; Motlicek, Petr; Bourlard, Herve; Garner, Philip N.] Idiap Res Inst, Martigny, Switzerland. [Imseng, David; Bourlard, Herve] Ecole Polytech Fed Lausanne, CH-1015 Lausanne, Switzerland. RP Imseng, D (reprint author), Idiap Res Inst, Martigny, Switzerland. EM dimseng@idiap.ch FU Swiss NSF through the project Interactive Cognitive Systems (ICS) [200021_132619/1]; National Centre of Competence in Research (NCCR) in Interactive Multimodal Information Management (IM2) FX This research was supported by the Swiss NSF through the project Interactive Cognitive Systems (ICS) under contract number 200021_132619/1 and the National Centre of Competence in Research (NCCR) in Interactive Multimodal Information Management (IM2) http://wwww.im2.ch. 
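The Imseng et al. record above uses Kullback-Leibler divergence based HMMs (KL-HMM), in which each state holds a categorical distribution over phone classes and the local score compares it with the posterior vector estimated by an MLP for each frame. The sketch below only shows that comparison for made-up distributions; it is not the authors' system, and the direction of the divergence and the state distributions are assumptions made for illustration.

import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same phone set."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# Hypothetical 4-phone posterior vector from an MLP for one frame,
# and categorical distributions attached to two KL-HMM states.
frame_posterior = [0.70, 0.20, 0.05, 0.05]
state_dists = {"state_a": [0.60, 0.30, 0.05, 0.05],
               "state_b": [0.10, 0.10, 0.40, 0.40]}

scores = {s: kl_divergence(d, frame_posterior) for s, d in state_dists.items()}
print(min(scores, key=scores.get), scores)   # the lower-divergence state wins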
CR Aradilla G, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P928 Barnard E., 2009, P INTERSPEECH, P2847 Bisani M., 2004, P IEEE INT C AC SPEE, V1, P409 Bloomfield Leonard, 1933, LANGUAGE Burget L, 2010, INT CONF ACOUST SPEE, P4334, DOI 10.1109/ICASSP.2010.5495646 Cover T M, 1991, ELEMENTS INFORM THEO Davel M., 2009, P INTERSPEECH, P2851 Gales MJF, 1998, COMPUT SPEECH LANG, V12, P75, DOI 10.1006/csla.1998.0043 Gauvain J.-L., 1993, P ICASSP, V2, P558 Grezl F., 2011, P ASRU, P359 Heeringa W., 2008, P C PATT REC ASS S A, P159 Hermansky H, 2000, P ICASSP, V3, P1635 Imseng D, 2012, 2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), P4869 Imseng D., 2011, P INT, P537 Imseng D., 2012, P 3 INT WORKSH SPOK, P60 Imseng D., 2012, P INT Johnson D., 2005, QUICKNET KOHLER J, 1998, ACOUST SPEECH SIG PR, P417 KULLBACK S, 1951, ANN MATH STAT, V22, P79, DOI 10.1214/aoms/1177729694 KULLBACK S, 1987, AM STAT, V41, P340 Niesler T, 2007, SPEECH COMMUN, V49, P453, DOI 10.1016/j.specom.2007.04.001 Oostdijk NHJ, 2000, P LREC 2000 ATH, V2, P887 Povey D, 2011, INT CONF ACOUST SPEE, P4504 Povey D, 2010, INT CONF ACOUST SPEE, P4330, DOI 10.1109/ICASSP.2010.5495662 Qian Y., 2011, P ASRU HAW IEEE, P354 Schultz T, 2001, SPEECH COMMUN, V35, P31, DOI 10.1016/S0167-6393(00)00094-7 Shinoda K., 1997, P EUROSPEECH, V1, P99 Toth L, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P2695 van Heerden C., 2009, P INT BRIGHT UK, P3003 Zen H., 2007, HMM BASED SPEECH SYN NR 30 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2014 VL 56 BP 142 EP 151 DI 10.1016/j.specom.2013.01.007 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 244AZ UT WOS:000326360800012 ER PT J AU Kempton, T Moore, RK AF Kempton, Timothy Moore, Roger K. TI Discovering the phoneme inventory of an unwritten language: A machine-assisted approach SO SPEECH COMMUNICATION LA English DT Article DE Phonemic analysis; Endangered languages; Field linguistics AB There is a consensus between many linguists that half of all languages risk disappearing by the end of the century. Documentation is agreed to be a priority. This includes the process of phonemic analysis to discover the contrastive sounds of a language with the resulting benefits of further linguistic analysis, literacy, and access to speech technology. A machine-assisted approach to phonemic analysis has the potential to greatly speed up the process and make the analysis more objective. It is demonstrated that a machine-assisted approach can make a measurable contribution to a phonemic analysis for all the procedures investigated; phonetic similarity, complementary distribution, and minimal pairs. The evaluation measures introduced in this paper allows a comprehensive quantitative comparison between these phonemic analysis procedures. Given the best available data and the machine-assisted procedures described, there is a strong indication that phonetic similarity is the most important piece of evidence in a phonemic analysis. (C) 2013 Elsevier B.V. All rights reserved. C1 [Kempton, Timothy; Moore, Roger K.] Univ Sheffield, Dept Comp Sci, Sheffield S10 2TN, S Yorkshire, England. RP Kempton, T (reprint author), Univ Sheffield, Dept Comp Sci, Sheffield S10 2TN, S Yorkshire, England. 
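Among the phonemic-analysis procedures evaluated in the Kempton and Moore record above, the minimal-pair search is the easiest to state concretely: two words of equal length whose transcriptions differ in exactly one segment at the same position. The toy word list below is invented for illustration; it is not the Kua-nsi or TIMIT material used in the paper.

from itertools import combinations

def minimal_pairs(words):
    """Return pairs of words whose transcriptions differ in exactly one segment."""
    pairs = []
    for (w1, t1), (w2, t2) in combinations(words.items(), 2):
        if len(t1) == len(t2) and sum(a != b for a, b in zip(t1, t2)) == 1:
            pairs.append((w1, w2))
    return pairs

# Hypothetical phone transcriptions.
words = {
    "pat": ["p", "ae", "t"],
    "bat": ["b", "ae", "t"],
    "pad": ["p", "ae", "d"],
    "spa": ["s", "p", "a"],
}
print(minimal_pairs(words))   # e.g. [('pat', 'bat'), ('pat', 'pad')]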
EM drtimothykempton@gmail.com FU UK Engineering and Physical Sciences Research Council (EPSRC) [EP/P502748/1] FX This work is funded by the UK Engineering and Physical Sciences Research Council (EPSRC grant number EP/P502748/1). We are grateful to Andy Castro for providing the Kua-nsi recordings, Mary Pearce and Cathy Bartram for helpful discussions about phonemic analysis heuristics, and Sharon Peperkamp for answering our questions regarding previous experiments. We would also like to thank Nic de Vries for comments on the application of the tools. Image component credits: Zscout370 (Lesotho flag). CR Aslam J. A., 2005, P 14 ACM INT C INF K, P664, DOI 10.1145/1099554.1099721 BAMBER D, 1975, J MATH PSYCHOL, V12, P387, DOI 10.1016/0022-2496(75)90001-2 Burquest D., 2006, PHONOLOGICAL ANAL FU Castro A., 2010, SIL ELECT SURVEY REP, V1, P96 Clark J., 2007, INTRO PHONETICS PHON Crystal D., 2000, LANGUAGE DEATH Davis J., 2006, P 23 INT C MACH LEAR, P233, DOI DOI 10.1145/1143844.1143874 Demuth K., 2007, SESOTHO SPEECH ACQUI, P528 Dingemanse M., 2008, LANG DOC CONSERV, V2, P325 Fitt S., 1999, 6 EUR C SPEECH COMM Garofolo JS, 1993, TIMIT ACOUSTIC PHONE Gildea D, 1996, COMPUT LINGUIST, V22, P497 Gleason H., 1961, INTRO DESCRIPTIVE LI Grenoble L. A., 2006, SAVING LANGUAGES INT HAYES BRUCE, 2009, INTRO PHONOLOGY Himmelmann N., 2002, LECT ENDANGERED LANG, V5, P37 Hockett C, 1955, ASTOUNDING SCI FICTI, P97 Huckvale M., 2004, P ICSLP Jeffreys H., 1948, THEORY PROBABILITY, V2nd Kempton T., 2012, THESIS U SHEFFIELD Kondrak G, 2003, COMPUT HUMANITIES, V37, P273, DOI 10.1023/A:1025071200644 KULLBACK S, 1951, ANN MATH STAT, V22, P79, DOI 10.1214/aoms/1177729694 Kurtic E., 2012, C LANG RES EV INST T Ladefoged Peter, 2003, PHONETIC DATA ANAL I Le Calvez R., 2007, P 2 EUR COGN SCI C, P167 Moseley C, 2009, ATLAS WORLDS LANGUAG, P2009 Peperkamp S, 2006, COGNITION, V101, pB31, DOI 10.1016/j.cognition.2005.10.006 Pike Kenneth, 1947, PHONEMICS TECHNIQUE Poser B., 2008, MINPAIR VERSION 5 1 Postal P., 1968, ASPECTS PHONOLOGICAL SIL, 2008, PHON ASS V3 0 1 Siniscalchi SM, 2008, INT CONF ACOUST SPEE, P4261, DOI 10.1109/ICASSP.2008.4518596 Sproat R., 1993, J PHON, V21 Wells John C., 1982, ACCENTS ENGLISH, V3 Williams L, 1977, J PHONETICS, V5, P169 NR 35 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2014 VL 56 BP 152 EP 166 DI 10.1016/j.specom.2013.02.006 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 244AZ UT WOS:000326360800013 ER PT J AU Mohan, A Rose, R Ghalehjegh, SH Umesh, S AF Mohan, Aanchan Rose, Richard Ghalehjegh, Sina Hamidi Umesh, S. TI Acoustic modelling for speech recognition in Indian languages in an agricultural commodities task domain SO SPEECH COMMUNICATION LA English DT Article DE Automatic speech recognition; Multi-lingual speech recognition; Subspace modelling; Under-resourced languages; Acoustic-normalization AB In developing speech recognition based services for any task domain, it is necessary to account for the support of an increasing number of languages over the life of the service. This paper considers a small vocabulary speech recognition task in multiple Indian languages. To configure a multi-lingual system in this task domain, an experimental study is presented using data from two linguistically similar languages Hindi and Marathi. 
We do so by training a subspace Gaussian mixture model (SGMM) (Povey et al., 2011; Rose et al., 2011) under a multi-lingual scenario (Burget et al., 2010; Mohan et al., 2012a). Speech data was collected from the targeted user population to develop spoken dialogue systems in an agricultural commodities task domain for this experimental study. It is well known that acoustic, channel and environmental mismatch between data sets from multiple languages is an issue while building multi-lingual systems of this nature. As a result, we use a cross-corpus acoustic normalization procedure which is a variant of speaker adaptive training (SAT) (Mohan et al., 2012a). The resulting multi-lingual system provides the best speech recognition performance for both languages. Further, the effect of sharing "similar" context-dependent states from the Marathi language on the Hindi speech recognition performance is presented. (C) 2013 Elsevier B.V. All rights reserved. C1 [Mohan, Aanchan; Rose, Richard; Ghalehjegh, Sina Hamidi] McGill Univ, Dept Elect & Comp Engn, Montreal, PQ, Canada. [Umesh, S.] Indian Inst Technol, Madras 600036, Tamil Nadu, India. RP Mohan, A (reprint author), McGill Univ, Dept Elect & Comp Engn, Montreal, PQ, Canada. EM aanchan.mohan@mail.mcgill.ca; rose@ece.mcgill.ca; sina.hamidighalehjegh@mail.mcgill.ca; umeshs@ee.iitm.ac.in FU Government of India FX The authors would like to thank all of the members involved in the data collection effort and the development of the dialogue system for the project "Speech-based Access 1 for Agricultural Commodity Prices in Six Indian Languages" sponsored by the Government of India. We would also like to thank M.S. Research Scholar Raghavengra Bilgi at IIT Madras for his timely help with providing resources for this experimental study. CR Bowonder B., 2003, DEV RURAL MAZRKET EH Burget L, 2010, INT CONF ACOUST SPEE, P4334, DOI 10.1109/ICASSP.2010.5495646 Cardona G., 2003, INDOARYAN LANGUAGES, V2 Central Hindi Directorate I., 1977, DEV DEV AMPL STAND Chopde A., 2006, ITRANS INDIAN LANGUA Gales M., 2001, P IEEE INT C AC SPEE Gales M. J. F., 1998, COMPUTER SPEECH LANG, V12 Gales M. J. F., 2001, P ASRU 2001 TRENT IT, P77 Gillick L., 1989, P ICASSP, V1, P532 Hakkani-Tur D., 2006, P IEEE INT C AC SPEE, V1, pI Killer M., 2003, 8 EUR C SPEECH COMM Lee K.-F., 1989, AUTOMATIC SPEECH REC Lu L, 2012, 2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), P4877 Lu L, 2011, P IEEE ASRU, P365 Mantena G., 2011, JOINT WORKSH HANDS F, P153 Mohan A., 2012, IEEE C INF SCI SIGN Mohan A, 2012, P IEEE INT C AC SPEE Patel N, 2010, CHI2010: PROCEEDINGS OF THE 28TH ANNUAL CHI CONFERENCE ON HUMAN FACTORS IN COMPUTING SYSTEMS, VOLS 1-4, P733 Plauche M., 2007, INF TECHNOL INT DEV, V4, P69, DOI 10.1162/itid.2007.4.1.69 Povey D., 2009, TUTORIAL STYLE INTRO Povey D, 2011, COMPUT SPEECH LANG, V25, P404, DOI 10.1016/j.csl.2010.06.003 Qian Y., 2011, 12 ANN C INT SPEECH Rose RC, 2011, P IEEE INT C AC SPEE SARACLAR M, 2000, ACOUST SPEECH SIG PR, P1679 Scharf P., 2009, LINGUISTIC ISSUES EN Schultz T, 2001, SPEECH COMMUN, V35, P31, DOI 10.1016/S0167-6393(00)00094-7 Schultz T, 2006, MULTILINGUAL SPEECH Seltzer M. 
L., 2011, P INTERSPEECH, P1097 Shrishrimal Pukhraj P., 2012, INT J COMPUTER APPL, V47, P17 SINGH R, 1999, P INT C SPOK LANG PR, V1, P117 Sulaiman R., 2003, INNOVATIONS AGR EXTE Vu NT, 2011, INT CONF ACOUST SPEE, P5000 Weilhammer K., 2006, 9 INT C SPOK LANG PR Young S., 2006, HTK BOOK HTK VERSION NR 34 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2014 VL 56 BP 167 EP 180 DI 10.1016/j.specom.2013.07.005 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 244AZ UT WOS:000326360800014 ER PT J AU Tachbelie, MY Abate, ST Besacier, L AF Tachbelie, Martha Yifiru Abate, Solomon Teferra Besacier, Laurent TI Using different acoustic, lexical and language modeling units for ASR of an under-resourced language - Amharic SO SPEECH COMMUNICATION LA English DT Article DE Syllable-based acoustic modeling; Hybrid (phone syllable) acoustic modeling; Morphemebased; Speech recognition; Under-resourced languages; Amharic ID SPEECH RECOGNITION AB State-of-the-art large vocabulary continuous speech recognition systems use mostly phone based acoustic models (AMs) and word based lexical and language models. However, phone based AMs are not efficient in modeling long-term temporal dependencies and the use of words in lexical and language models leads to out-of-vocabulary (00V) problem, which is a serious issue for morphologically rich languages. This paper presents the results of our contributions on the use of different units for acoustic, lexical and language modeling for an under-resourced language (Amharic spoken in Ethiopia). Triphone, Syllable and hybrid (syllable-phone) units have been investigated for acoustic modeling. Word and morphemes have been investigated for lexical and language modeling. We have also investigated the use of longer (syllable) acoustic units and shorter (morpheme) lexical as well as language modeling units in a speech recognition system. Although hybrid AMs did not bring much improvement over context dependent syllable based recognizers in speech recognition performance with word based lexical and language model (i.e. word based speech recognition), we observed a significant word error rate (WER) reduction compared to triphone-based systems in morpheme-based speech recognition. Syllable AMs also led to a WER reduction over the triphone-based systems both in word based and morpheme based speech recognition. It was possible to obtain a 3% absolute WER reduction as a result of using syllable acoustic units in morpheme-based speech recognition. Overall, our result shows that syllable and hybrid AMs are best fitted in morpheme-based speech recognition. (C) 2013 Elsevier B.V. All rights reserved. C1 [Tachbelie, Martha Yifiru; Abate, Solomon Teferra] Univ Addis Ababa, Sch Informat Sci, Addis Ababa, Ethiopia. [Besacier, Laurent] Univ Grenoble 1, LIG, Grenoble 1, France. RP Tachbelie, MY (reprint author), Univ Addis Ababa, Sch Informat Sci, Addis Ababa, Ethiopia. EM marthayifiru@gmail.com; solomon_teferra_7@yahoo.com; laurent.besacier@imag.fr CR Abate S. 
T., 2005, P INT LISB PORT, P1601 Abate Solomon Teferra, 2007, P INT ANTW BELG, P1541 Abate Solomon Teferra, 2006, THESIS U HAMBURG GER Abate Solomon Teferra, 2007, P 2007 WORKSH COMP A, P33, DOI 10.3115/1654576.1654583 Abhinav Sethy, 2002, P ISCA PRON MOD WORK, P30 Appleyard David, 1995, C AMHARIC COMPLETE C Azim Mohamed Mostafa, 2008, WSEAS T SIGNAL PROCE, V4, P211 Bazzi I., 2002, THESIS MIT Bender M., 1976, LANGUAGES ETHIOPIA Berhanu Solomon, 2001, THESIS ADDIS ABABA U Berment V, 2004, THESIS U J FOURIER G Besacier L, 2006, INT CONF ACOUST SPEE, P1221 Carki, 2000, ICASSP 2000 IST TURK, V3, P1563 Creutz M., 2005, A81 HELS U TECHN NEU El-Desoky A., 2009, P INT 2009, P2679 Gales Mark, 2006, RECENT PROGR LARGE V Ganapathiraju A, 2001, IEEE T SPEECH AUDI P, V9, P358, DOI 10.1109/89.917681 Gelas Hadrien, 2011, P INTERSPEECH FLOR I Geutner Petra, 1995, P ICASSP, VI, P445 Girmaw Molalgne, 2004, THESIS ROYAL I TECHN Gruenstein Alexander, 2009, P SLATE BRIGHT UK Haile Alemayehu, 1995, J ETHOPIAN STUD, V28, P15 Hamalainen Annika, 2005, P SPECOM 2005, P499 Hirsimaki T., 2005, P INT INT C AD KNOWL, P121 Ircing P., 2001, P 7 EUR C SPEECH COM, P487 Kirchhoff Katrin, 2002, NOVEL SPEECH RECOGNI Leslau Wolf, 2000, INTRO GRAMMAR AMHARI Liu X, 2011, INT CONF ACOUST SPEE, P4872 Marge Matthew, 2010, P NAACL HLT Mariam Sebsibe H., 2004, P 5 ISCA SPEECH SYNT, P103 McGraw Ian, 2009, P INTERSPEECH Mohri M, 1998, LECT NOTES COMPUT SC, V1436, P144 Mulugeta Seyoum, 2001, THESIS DEP LINGUISTI Pellegrini T., 2007, P INTERSPEECH 2007, P1797 Pellegrini T, 2009, IEEE T AUDIO SPEECH, V17, P863, DOI 10.1109/TASL.2009.2022295 Pellegrini Thomas, 2006, P LREC Pellegrini Thomas, 2006, P INTERSPEECH 2006 Scott Novotney, 2010, P NAACL HLT, P207 Seid Hussien, 2005, P INT LISB PORT, P3349 Seifu Zegaye, 2003, THESIS ADDIS ABABA U Snow R., 2008, P C EMP METH NAT LAN, P254, DOI 10.3115/1613715.1613751 Stolcke Andreas, 2002, P ICSLP 2002 DENB CO, P901 Tachbelie Martha Yifiru, P SLTU 10 PEN MAL, P68 Tachbelie Martha Yifiru, 2010, THESIS U HAMBURG GER Tachbelie Martha Yifiru, 2003, THESIS ADDIS ABABA U Tachbelie M.Y., 2009, P 4 LANG TECHN C LT, P114 Tadesse Kinfe, 2002, THESIS ADDIS ABABA U Thangarajan R., 2008, S ASIAN LANGUAGE REV, V17, P71 Vesa Siivola, P EUR, P2293 Voigt Rainer M, 1987, JSS, V32, P1 Whittaker E.W.D., 2000, P 6 INT C SPOK LANG, P170 Whittaker E.W.D., 2001, IEEE WORKSH AUT SPEE, P315, DOI 10.1109/ASRU.2001.1034650 WOODLAND PC, 1995, INT CONF ACOUST SPEE, P73, DOI 10.1109/ICASSP.1995.479276 Yifiru Tachbelie Martha, 2011, P HLTD 2011, P50 Yifiru Tachbelie Martha, 2011, LECT NOTES COMPUTER, V6562, P82 Yimam Baye, 2007, YEAMARINA SEWASEW NR 56 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
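The Amharic record above attributes much of the benefit of morpheme-based lexical and language modelling to a reduced out-of-vocabulary (OOV) rate. The snippet below illustrates only the bookkeeping of that comparison; the segmentation and the toy "corpus" are invented English-like examples, not Amharic data or the paper's morphological analyser.

def oov_rate(train_units, test_units):
    """Fraction of test tokens not covered by the training vocabulary."""
    vocab = set(train_units)
    unseen = sum(1 for u in test_units if u not in vocab)
    return unseen / len(test_units)

# Hypothetical corpora: whole words vs. the same text split into morphemes.
train_words = ["houses", "walked", "walking"]
test_words = ["walkers", "housed"]
train_morphs = ["house", "s", "walk", "ed", "walk", "ing"]
test_morphs = ["walk", "er", "s", "house", "d"]

print("word OOV:     %.2f" % oov_rate(train_words, test_words))     # 1.00
print("morpheme OOV: %.2f" % oov_rate(train_morphs, test_morphs))   # lower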
PD JAN PY 2014 VL 56 BP 181 EP 194 DI 10.1016/j.specom.2013.01.008 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 244AZ UT WOS:000326360800015 ER PT J AU Karpov, A Markov, K Kipyatkova, I Vazhenina, D Ronzhin, A AF Karpov, Alexey Markov, Konstantin Kipyatkova, Irina Vazhenina, Dania Ronzhin, Andrey TI Large vocabulary Russian speech recognition using syntactico-statistical language modeling SO SPEECH COMMUNICATION LA English DT Article DE Automatic speech recognition; Slavic languages; Russian speech; Language modeling; Syntactical analysis ID SYSTEMS AB Speech is the most natural way of human communication and in order to achieve convenient and efficient human computer interaction implementation of state-of-the-art spoken language technology is necessary. Research in this area has been traditionally focused on several main languages, such as English, French, Spanish, Chinese or Japanese, but some other languages, particularly Eastern European languages, have received much less attention. However, recently, research activities on speech technologies for Czech, Polish, Serbo-Croatian, Russian languages have been steadily increasing. In this paper, we describe our efforts to build an automatic speech recognition (ASR) system for the Russian language with a large vocabulary. Russian is a synthetic and highly inflected language with lots of roots and affixes. This greatly reduces the performance of the ASR systems designed using traditional approaches. In our work, we have taken special attention to the specifics of the Russian language when developing the acoustic, lexical and language models. A special software tool for pronunciation lexicon creation was developed. For the acoustic model, we investigated a combination of knowledge-based and statistical approaches to create several different phoneme sets, the best of which was determined experimentally. For the language model (LM), we introduced a new method that combines syntactical and statistical analysis of the training text data in order to build better n-gram models. Evaluation experiments were performed using two different Russian speech databases and an internally collected text corpus. Among the several phoneme sets we created, the one which achieved the fewest word level recognition errors was the set with 47 phonemes and thus we used it in the following language modeling evaluations. Experiments with 204 thousand words vocabulary ASR were performed to compare the standard statistical n-gram LMs and the language models created using our syntactico-statistical method. The results demonstrated that the proposed language modeling approach is capable of reducing the word recognition errors. (C) 2013 Elsevier B.V. All rights reserved. C1 [Karpov, Alexey; Kipyatkova, Irina; Ronzhin, Andrey] Russian Acad Sci SPIIRAS, St Petersburg Inst Informat & Automat, St Petersburg, Russia. [Markov, Konstantin; Vazhenina, Dania] Univ Aizu, Human Interface Lab, Fukushima, Japan. RP Kipyatkova, I (reprint author), Russian Acad Sci SPIIRAS, St Petersburg Inst Informat & Automat, St Petersburg, Russia. 
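The Karpov et al. record above combines conventional n-gram language models with syntactic analysis for highly inflected Russian. The syntactic component is not reproduced here; the sketch below only shows the baseline n-gram side, a bigram model with add-one smoothing and its perplexity on a toy sentence, with toy data rather than the Russian corpus used in the paper.

import math
from collections import Counter

def train_bigram(sentences):
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        toks = ["<s>"] + sent + ["</s>"]
        unigrams.update(toks[:-1])
        bigrams.update(zip(toks[:-1], toks[1:]))
    return unigrams, bigrams

def perplexity(sent, unigrams, bigrams, vocab_size):
    toks = ["<s>"] + sent + ["</s>"]
    log_prob = 0.0
    for prev, cur in zip(toks[:-1], toks[1:]):
        # add-one (Laplace) smoothed bigram probability
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(toks) - 1))

train = [["we", "recognise", "speech"], ["we", "model", "speech"]]
uni, bi = train_bigram(train)
vocab = len(set(w for s in train for w in s)) + 2   # plus <s>, </s>
print(round(perplexity(["we", "recognise", "speech"], uni, bi, vocab), 2))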
EM kipyatkova@iias.spb.su RI Karpov, Alexey/A-8905-2012 OI Karpov, Alexey/0000-0003-3424-652X FU Ministry of Education and Science of Russia [07.514.11.4139]; grant of the President of Russia [MK-1880.2012.8]; Russian Foundation for Basic Research [12-08-01265]; Russian Humanitarian Scientific Foundation [12-04-12062] FX This research is supported by the Ministry of Education and Science of Russia (contract No. 07.514.11.4139), by the grant of the President of Russia (project No. MK-1880.2012.8), by the Russian Foundation for Basic Research (project No. 12-08-01265) and by the Russian Humanitarian Scientific Foundation (project No. 12-04-12062). CR Anisimovich K., 2012, P DIAL 2012 MOSC RUS, V2, P91 Antonova A., 2012, P INT C DIAL 2012 MO, V2, P104 Arisoy E, 2010, INT CONF ACOUST SPEE, P5538, DOI 10.1109/ICASSP.2010.5495226 Arlazarov V., 2004, P INT C SPECOM 2004, P650 Bechet F., 2009, P INT 2009 BRIGHT UK, P1039 Bellegarda JR, 2004, SPEECH COMMUN, V42, P93, DOI 10.1016/j.specom.2003.08.002 Bhanuprasad K., 2008, P 3 INT JOINT C NAT, P805 Chelba C, 2000, COMPUT SPEECH LANG, V14, P283, DOI 10.1006/csla.2000.0147 Cubberley P., 2002, RUSSIAN LINGUISTIC I Deoras A., 2012, P INT 2012 PORTL OR Huet S, 2010, COMPUT SPEECH LANG, V24, P663, DOI 10.1016/j.csl.2009.10.001 Iomdin L., 2012, P DIAL 2012 MOSC RUS, V2, P119 Ircing P., 2006, P INT C LANG RES EV, P2600 Jokisch O., 2009, P SPECOM 2009 ST PET, P515 Kanejiya D.P., 2003, P TIFR WORKSH SPOK L, P93 Kanevsky D., 1996, P 1 INT C SPEECH COM, P117 Karpov A., 2011, P INT 2011 FLOR IT, P3161 Karpov A., 2012, P 3 INT WORKSH SPOK, P84 Kipyatkova I, 2012, 2012 FEDERATED CONFERENCE ON COMPUTER SCIENCE AND INFORMATION SYSTEMS (FEDCSIS), P719 Kouznetsov V., 1999, P SPECOM 1999 MOSC R, P179 Kuo HKJ, 2009, 2009 IEEE WORKSHOP ON AUTOMATIC SPEECH RECOGNITION & UNDERSTANDING (ASRU 2009), P327, DOI 10.1109/ASRU.2009.5373470 Kurimo M., 2006, P HLT NAACL NEW YORK, P487, DOI 10.3115/1220835.1220897 Lamel L., 2012, P SLTU 2012 CAP TOWN, P156 Lamel L., 2011, P INT WORKSH SPOK LA, P121 Lee A., 2009, P APSIPA ASC, P131 Leontyeva A, 2008, LECT NOTES ARTIF INT, V5246, P373, DOI 10.1007/978-3-540-87391-4_48 Moore G.L., 2001, THESIS CAMBRIDGE U Nozhov I., 2003, THESIS, P140 Odell JJ, 1995, THESIS CAMBRIDGE U Oparin I, 2008, 2008 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY: SLT 2008, PROCEEDINGS, P189, DOI 10.1109/SLT.2008.4777872 Oparin I., 2005, P SPECOM PATR GREEC, P575 Padgett J, 2005, PHONETICA, V62, P14, DOI 10.1159/000087223 Potapova R., 2011, P INT C SPEECH COMP, P13 Psutka J., 2005, P EUR 2005 LISB PORT, P1349 Pylypenko V, 2007, INTERSPEECH 2007: 8TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION, VOLS 1-4, P1809 Rastrow A., 2012, P 50 ANN M ASS COMP, P175 Roark B., 2002, P 40 ANN M ASS COMP, P287 Ronzhin A., 2004, P INT C SPECOM 2004, P291 Schalkwyk J., 2010, ADV SPEECH RECOGNITI, P61, DOI 10.1007/978-1-4419-5951-5_4 Schultz T., 1998, P TSD 1998 BRN CZECH, P311 Shirokova A., 2007, P SPECOM 2007 MOSC R, P877 Shvedova N., 1980, RUSSIAN GRAMMAR, V1, P783 Sidorov G., 2012, LECT NOTES ARTIF INT, V7630, P1 Singh R, 2002, IEEE T SPEECH AUDI P, V10, P89, DOI 10.1109/89.985546 Skatov D., 2012, DICTASCOPE SYNTAX NA Skrelin P, 2010, LECT NOTES ARTIF INT, V6231, P392, DOI 10.1007/978-3-642-15760-8_50 Smirnova J., 2011, P 17 INT C PHON SCI, P1870 Sokirko A., 2004, P 10 INT C DIAL 2004, P559 Starostin A., 2007, P INT C DIAL 2007 MO Stolcke A., 2011, P IEEE AUT SPEECH RE Stuker S, 2008, INT CONF ACOUST SPEE, P4249, DOI 10.1109/ICASSP.2008.4518593 Stuker S., 
2004, P SPECOM 2004 SAINT, P297 Szarvas M., 2003, P ICASSP HONG KONG C, P368 Tatarnikova M., 2006, P SPECOM 2006 ST PET, P83 Vaiciunas A., 2006, THESIS VYTAUTAS MAGN Vazhenina D., 2011, P 7 INT C NLP KNOWL, P475 Vazhenina D., 2012, P JOINT INT C HUM CT, P59 Viktorov A., 2009, SPEECH TECHNOL, V2, P39 Vintsyuk T., 1968, KIBERNETICA, V1, P15 Whittaker EWD, 2001, INT CONF ACOUST SPEE, P545, DOI 10.1109/ICASSP.2001.940889 Whittaker E.W.D., 2000, THESIS CAMBRIDGE U, P140 Young S., 2009, HTK BOOK, P384 Young SJ, 1994, P ARPA HUM LANG TECH, P307, DOI 10.3115/1075812.1075885 Zablotskiy S, 2012, LREC 2012 - EIGHTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, P3374 Zaliznjak A.A., 2003, GRAMMATICAL DICT RUS, P800 Zhang JS, 2008, IEICE T INF SYST, VE91D, P508, DOI 10.1093/ictisy/e9l-d.3.508 NR 66 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2014 VL 56 BP 213 EP 228 DI 10.1016/j.specom.2013.07.004 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 244AZ UT WOS:000326360800017 ER PT J AU Van Niekerk, DR Barnard, E AF Van Niekerk, Daniel R. Barnard, Etienne TI Predicting utterance pitch targets in Yoruba for tone realisation in speech synthesis SO SPEECH COMMUNICATION LA English DT Article DE Yoruba; Tone language; Speech synthesis; Fundamental frequency ID UNIVERSALITY; INTONATION AB Pitch is a fundamental acoustic feature of speech and as such needs to be determined during the process of speech synthesis. While a range of communicative functions are attributed to pitch variation in speech of all languages, it plays a vital role in distinguishing meaning of lexical items in tone languages. As a number of factors are assumed to affect the realisation of pitch, it is important to know which mechanisms are systematically responsible for pitch realisation in order to be able to model these effectively and thus develop robust speech synthesis systems in under-resourced environments. To this end, features influencing syllable pitch targets in continuous utterances in Yoruba are investigated in a small speech corpus of 4 speakers. It is found that the previous syllable pitch level is strongly correlated with pitch changes between syllables and a number of approaches and features are evaluated in this context. The resulting models can be used to predict utterance pitch targets for speech synthesisers (whether it be concatenative or statistical parametric systems), and may also prove useful in speech-recognition systems. (C) 2013 Elsevier B.V. All rights reserved. C1 [Van Niekerk, Daniel R.; Barnard, Etienne] North West Univ, Vanderbijlpark, South Africa. [Van Niekerk, Daniel R.] CSIR, Meraka Inst, Human Language Technol Res Grp, ZA-0001 Pretoria, South Africa. RP Van Niekerk, DR (reprint author), North West Univ, Ctr Text Technol, Potchefstroom, South Africa. 
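The Yoruba record above reports that the previous syllable's pitch level is the strongest predictor of the next pitch target and evaluates regression models over such features. The sketch below fits an ordinary linear regression to made-up (previous pitch, tone) -> (current pitch) data purely to illustrate the set-up; the feature set, tone coding and data are assumptions, not the authors' corpus or models.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Hypothetical training data: previous syllable pitch (semitones re 100 Hz)
# and a tone code (0=Low, 1=Mid, 2=High) predicting the current pitch target.
n = 200
prev_pitch = rng.normal(0.0, 2.0, n)
tone = rng.integers(0, 3, n)
cur_pitch = 0.7 * prev_pitch + 1.5 * (tone - 1) + rng.normal(0.0, 0.5, n)

X = np.column_stack([prev_pitch, tone])
model = LinearRegression().fit(X, cur_pitch)
print("coefficients:", np.round(model.coef_, 2), "intercept:", round(model.intercept_, 2))
print("predicted target after a high tone at +2 st:",
      round(float(model.predict([[2.0, 2]])[0]), 2))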
EM daniel.vanniekerk@nwu.ac.za CR Adegbola T., 2009, P EACL 2009 WORKSH L, P53, DOI 10.3115/1564508.1564519 Adegbola T., 2012, 3 INT WORKSH SPOK LA, P48 Boersma P., 2001, PRAAT SYSTEM DOING P Chang C.-C., 2011, ACM T INTEL SYST TEC, V2, P1, DOI DOI 10.1145/1961189.1961199 Connell B, 2002, J PHONETICS, V30, P101, DOI 10.1006/jpho.2001.0156 Connell B., 1990, PHONOLOGY, V7, P1, DOI 10.1017/S095267570000110X Courtenay K., 1971, STUDIES AFRICAN LING, V2, P239 Davel M, 2008, COMPUT SPEECH LANG, V22, P374, DOI 10.1016/j.csl.2008.01.001 Ekpenyong Moses E., 2008, International Journal of Speech Technology, V11, DOI 10.1007/s10772-009-9037-5 Fujisaki H., 1998, 3 ESCA COCOSDA WORKS, P26 Kochanski G, 2003, SPEECH COMMUN, V39, P311, DOI 10.1016/S0167-6393(02)00047-X Laniran YO, 2003, J PHONETICS, V31, P203, DOI 10.1016/S0095-4470(02)00098-0 Louw J.A., 2006, S AFRICAN J AFRICAN, V2, P1 Odejobi Odetunji A., 2008, Computer Speech & Language, V22, DOI 10.1016/j.csl.2007.05.002 Odejobi O.A., 2007, INFOCOMP J COMPUT SC, V6, P47 Odejobi OA, 2006, COMPUT SPEECH LANG, V20, P563, DOI 10.1016/j.csl.2005.08.006 Olshen R., 1984, CLASSIFICATION REGRE, V1st Pedregosa F, 2011, J MACH LEARN RES, V12, P2825 Peng G, 2005, SPEECH COMMUN, V45, P49, DOI 10.1016/j.specom.2004.09.004 Prom-on S, 2009, J ACOUST SOC AM, V125, P405, DOI 10.1121/1.3037222 Tao JH, 2006, IEEE T AUDIO SPEECH, V14, P1145, DOI 10.1109/TASL.2006.876113 Van Niekerk D.R., 2012, 3 INT WORKSH SPOK LA, P54 van Niekerk D.R., 2009, P INT 2009 BRIGHT UK, P880 WHALEN DH, 1995, J PHONETICS, V23, P349, DOI 10.1016/S0095-4470(95)80165-0 Xu Y., 2000, 6 INT C SPOK LANG PR, P666 Xu Y, 2005, SPEECH COMMUN, V46, P220, DOI 10.1016/j.specom.2005.02.014 Young SJ, 1994, P ARPA HUM LANG TECH, P307, DOI 10.3115/1075812.1075885 Zen H., 2006, 6 INT WORKSH SPEECH, P294 NR 28 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2014 VL 56 BP 229 EP 242 DI 10.1016/j.specom.2013.01.009 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 244AZ UT WOS:000326360800018 ER PT J AU Ekpenyong, M Urua, EA Watts, O King, S Yamagishi, J AF Ekpenyong, Moses Urua, Eno-Abasi Watts, Oliver King, Simon Yamagishi, Junichi TI Statistical parametric speech synthesis for Ibibio SO SPEECH COMMUNICATION LA English DT Article DE Speech synthesis; Ibibio; Low-resource languages; HTS ID TONE CORRECTNESS AB Ibibio is a Nigerian tone language, spoken in the south-east coastal region of Nigeria. Like most African languages, it is resource-limited. This presents a major challenge to conventional approaches to speech synthesis, which typically require the training of numerous predictive models of linguistic features such as the phoneme sequence (i.e., a pronunciation dictionary plus a letter-to-sound model) and prosodic structure (e.g., a phrase break predictor). This training is invariably supervised, requiring a corpus of training data labelled with the linguistic feature to be predicted. In this paper, we investigate what can be achieved in the absence of many of these expensive resources, and also with a limited amount of speech recordings. 
We employ a statistical parametric method, because this has been found to offer good performance even on small corpora, and because it is able to directly learn the relationship between acoustics and whatever linguistic features are available, potentially mitigating the absence of explicit representations of intermediate linguistic layers such as prosody. We present an evaluation that compares systems that have access to varying degrees of linguistic structure. The simplest system only uses phonetic context (quinphones), and this is compared to systems with access to a richer set of context features, with or without tone marking. It is found that the use of tone marking contributes significantly to the quality of synthetic speech. Future work should therefore address the problem of tone assignment using a dictionary and the building of a prediction module for out-of-vocabulary words. (C) 2013 Elsevier B.V. All rights reserved. C1 [Ekpenyong, Moses] Univ Uyo, Dept Comp Sci, Uyo 520003, Akwa Ibom State, Nigeria. [Urua, Eno-Abasi] Univ Uyo, Dept Linguist & Nigerian Languages, Uyo 520003, Akwa Ibom State, Nigeria. [Watts, Oliver; King, Simon; Yamagishi, Junichi] Univ Edinburgh, Ctr Speech Technol Res, Edinburgh EH8 9AB, Midlothian, Scotland. RP Ekpenyong, M (reprint author), Univ Uyo, Dept Comp Sci, PMB 1017, Uyo 520003, Akwa Ibom State, Nigeria. EM mosesekpenyong@gmail.com CR Benoit C, 1996, SPEECH COMMUN, V18, P381, DOI 10.1016/0167-6393(96)00026-X Chomphan S, 2008, SPEECH COMMUN, V50, P392, DOI 10.1016/j.specom.2007.12.002 Chunwijitra V, 2012, SPEECH COMMUN, V54, P245, DOI 10.1016/j.specom.2011.08.006 Clark R., 2007, P BLZ3 2007 P SSW6 Ekpenyong Moses E., 2008, International Journal of Speech Technology, V11, DOI 10.1007/s10772-009-9037-5 Ekpenyong M., 2009, USEM J LANG LINGUIST, V2, P71 Essien O.E., 1990, GRAMMAR IBIBIO LANGU Gibbon D., 2004, DATA CREATION IBIBIO Gibbon D., 2001, P EUR, P83 Gibbon D., 2006, INT TUT RES WORKSH M, P1 King S., 2009, P BLIZZ CHALL WORKSH Louw J.A., 2008, P 19 ANN S PATT REC, P165 Podsiadlo M., 2007, THESIS U EDINBURGH E Shinoda K., 2000, Journal of the Acoustical Society of Japan (E), V21, DOI 10.1250/ast.21.79 Taylor P, 2009, TEXT TO SPEECH SYNTH Tucker R., 2005, P INT 2005 EUR LISB, P453 Urua E.-A., 2001, UYO IBIBIO IN PRESS Watts O., 2012, THESIS U EDINBURGH E Yamagishi J, 2009, IEEE T AUDIO SPEECH, V17, P1208, DOI 10.1109/TASL.2009.2016394 Zen H., 2009, P APSIPA ASC OCT, P121 Zen H, 2009, SPEECH COMMUN, V51, P1039, DOI 10.1016/j.specom.2009.04.004 Zen HG, 2007, IEICE T INF SYST, VE90D, P325, DOI 10.1093/ietisy/e90-d.1.325 NR 22 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2014 VL 56 BP 243 EP 251 DI 10.1016/j.specom.2013.02.003 PG 9 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 244AZ UT WOS:000326360800019 ER PT J AU Yokoyama, R Nasu, Y Iwano, K Shinoda, K AF Yokoyama, Ryo Nasu, Yu Iwano, Koji Shinoda, Koichi TI Detection of overlapped speech using lapel microphones in meeting SO SPEECH COMMUNICATION LA English DT Article DE Overlap speech detection; Spectral subtraction; Cosine distance ID BLIND SEPARATION AB We propose an overlapped speech detection method for speech recognition and speaker diarization of meetings, where each speaker wears a lapel microphone. Two novel features are utilized as inputs for a GMM-based detector. 
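A GMM-based detector of this kind can be sketched as a simple likelihood-ratio test between an "overlap" and a "single-speaker" mixture model trained on the two per-frame features defined in the sentences that follow. The sketch below is not the paper's system; the synthetic feature values, component counts and decision threshold are placeholder assumptions.

```python
# Minimal sketch, not the paper's detector: two GMMs (overlapped vs. single-speaker
# frames) over two per-frame features, with a log-likelihood-ratio decision.
# Feature extraction is omitted; synthetic feature values stand in for it.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Toy training features: column 0 = log speech power, column 1 = cross-channel spectral correlation.
single = rng.normal(loc=[0.0, 0.2], scale=[1.0, 0.1], size=(500, 2))
overlap = rng.normal(loc=[1.5, 0.6], scale=[1.0, 0.1], size=(500, 2))

gmm_single = GaussianMixture(n_components=4, random_state=0).fit(single)
gmm_overlap = GaussianMixture(n_components=4, random_state=0).fit(overlap)

def detect_overlap(frames, threshold=0.0):
    """True where the overlap log-likelihood exceeds the single-speaker one by the threshold."""
    llr = gmm_overlap.score_samples(frames) - gmm_single.score_samples(frames)
    return llr > threshold

test = np.vstack([single[:5], overlap[:5]])
print(detect_overlap(test))
```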
One is speech power after cross-channel spectral subtraction which reduces the power from the other speakers. The other is an amplitude spectral cosine correlation coefficient which effectively extracts the correlation of spectral components in a rather quiet condition. We evaluated our method using a meeting speech corpus of four speakers. The accuracy of our proposed method, 75.7%, was significantly better than that of the conventional method, 66.8%, which uses raw speech power and power spectral Pearson's correlation coefficient. (c) 2013 Elsevier B.V. All rights reserved. C1 [Yokoyama, Ryo; Nasu, Yu; Shinoda, Koichi] Tokyo Inst Technol, Dept Comp Sci, Meguro Ku, Tokyo 1528552, Japan. [Iwano, Koji] Tokyo City Univ, Fac Environm & Informat Studies, Tsuzuki Ku, Yokohama, Kanagawa 2248551, Japan. RP Yokoyama, R (reprint author), Tokyo Inst Technol, Dept Comp Sci, Meguro Ku, 2-12-1 Ookayama, Tokyo 1528552, Japan. EM yokoyama@ks.cs.titech.ac.jp RI Shinoda, Koichi/D-3198-2014 OI Shinoda, Koichi/0000-0003-1095-3203 CR Aoki M., 2001, Acoustical Science and Technology, V22, DOI 10.1250/ast.22.149 BELL AJ, 1995, NEURAL COMPUT, V7, P1129, DOI 10.1162/neco.1995.7.6.1129 Ben-Harush O., 2009, IEEE INT WORKSH MACH, P1 Boakye K, 2008, INT CONF ACOUST SPEE, P4353, DOI 10.1109/ICASSP.2008.4518619 Boakye K., 2011, P INTERSPEECH, P941 BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 Ghosh PK, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 3 AND 4, P3098 Janin A, 2003, 2003 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL I, PROCEEDINGS, P364 Maganti HK, 2007, INT CONF ACOUST SPEE, P1037 Moore D. C., 2003, P ICASSP, P497 Nasu Y, 2011, INT CONF ACOUST SPEE, P4812 Pfau T, 2001, IEEE WORKSH AUT SPEE, V1, P107, DOI 10.1109/ASRU.2001.1034599 Rickard S., 2001, P ICA2001 DEC, P651 Rozgic Viktor, 2010, Journal of Multimedia, V5, DOI 10.4304/jmm.5.4.322-331 Smaragdis P, 1998, NEUROCOMPUTING, V22, P21, DOI 10.1016/S0925-2312(98)00047-2 Stolcke A, 2011, INT CONF ACOUST SPEE, P4992 Stolcke A, 2010, INT CONF ACOUST SPEE, P4390, DOI 10.1109/ICASSP.2010.5495626 Sun H, 2011, P INTERSPEECH, P2345 Sun HW, 2010, INT CONF ACOUST SPEE, P4982, DOI 10.1109/ICASSP.2010.5495077 Valente F, 2011, INT CONF ACOUST SPEE, P4416 Valente F, 2010, INT CONF ACOUST SPEE, P4954, DOI 10.1109/ICASSP.2010.5495087 Vijayasenan D, 2012, 2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), P4173 Vijayasenan D, 2010, INT CONF ACOUST SPEE, P4950, DOI 10.1109/ICASSP.2010.5495086 Wrigley SN, 2005, IEEE T SPEECH AUDI P, V13, P84, DOI 10.1109/TSA.2004.838531 Xiao B, 2011, INT CONF ACOUST SPEE, P5216 Yamamoto K, 2006, IEICE T FUND ELECTR, VE89A, P2158, DOI 10.1093/ietfec/e89-a.8.2158 Yella S. H., 2011, P INTERSPEECH, P953 Zhu M, 2004, 09 U WAT DEP STAT AC Zwyssig E, 2012, 2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), P4177 NR 29 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD NOV-DEC PY 2013 VL 55 IS 10 BP 941 EP 949 DI 10.1016/j.specom.2013.06.013 PG 9 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 215RH UT WOS:000324227500001 ER PT J AU Zhang, WL Qu, D Zhang, WQ Li, BC AF Zhang, Wen-Lin Qu, Dan Zhang, Wei-Qiang Li, Bi-Cheng TI Rapid speaker adaptation using compressive sensing SO SPEECH COMMUNICATION LA English DT Article DE Speaker adaptation; Speaker subspace; Compressive sensing; Matching pursuit; l(1) regularization ID SPEECH RECOGNITION; INVERSE PROBLEMS; RECONSTRUCTION; REPRESENTATION; SUBSPACE AB Speaker-space-based speaker adaptation methods can obtain good performance even if the amount of adaptation data is limited. However, it is difficult to determine the optimal dimension and basis vectors of the subspace for a particular unknown speaker. Conventional methods, such as eigenvoice (EV) and reference speaker weighting (RSW), can only obtain a sub-optimal speaker subspace. In this paper, we present a new speaker-space-based speaker adaptation framework using compressive sensing. The mean vectors of all mixture components of a conventional Gaussian-Mixture-Model-Hidden-Markov-Model (GMM-HMM)-based speech recognition system are concatenated to form a supervector. The speaker adaptation problem is viewed as recovering the speaker-dependent supervector from limited speech signal observations. A redundant speaker dictionary is constructed by a combination of all the training speaker supervectors and the supervectors derived from the EV method. Given the adaptation data, the best subspace for a particular speaker is constructed in a maximum a posterior manner by selecting a proper set of items from this dictionary. Two algorithms, i.e. matching pursuit and l(1) regularized optimization, are adapted to solve this problem. With an efficient redundant basis vector removal mechanism and an iterative updating of the speaker coordinate, the matching pursuit based speaker adaptation method is fast and efficient. The matching pursuit algorithm is greedy and sub-optimal, while direct optimization of the likelihood of the adaptation data with an explicit l(1) regularization term can obtain better approximation of the unknown speaker model. The projected gradient optimization algorithm is adopted and a few iterations of the matching pursuit algorithm can provide a good initial value. Experimental results show that matching pursuit algorithm outperforms the conventional testing methods under all testing conditions. Better performance is obtained when direct l(1) regularized optimization is applied. Both methods can select a proper mixed set of the eigenvoice and reference speaker supervectors automatically for estimation of the unknown speaker models. (c) 2013 Elsevier B.V. All rights reserved. C1 [Zhang, Wen-Lin; Qu, Dan; Li, Bi-Cheng] Zhengzhou Informat Sci & Technol Inst, Zhengzhou 450002, Peoples R China. [Zhang, Wei-Qiang] Tsinghua Univ, Dept Elect Engn, Tsinghua Natl Lab Informat Sci & Technol, Beijing 100084, Peoples R China. RP Zhang, WL (reprint author), Zhengzhou Informat Sci & Technol Inst, Zhengzhou 450002, Peoples R China. 
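The dictionary-selection step described above can be illustrated with plain matching pursuit. The sketch below is not the paper's MAP formulation: it approximates a target speaker supervector as a sparse combination of dictionary supervectors, using a least-squares residual in place of the adaptation-data likelihood, and all sizes and data are toy assumptions.

```python
# Minimal sketch, not the paper's method: matching pursuit over a redundant speaker
# dictionary (training-speaker supervectors plus eigenvoices), selecting a few items
# to approximate a target supervector. A least-squares residual stands in for the
# likelihood criterion used in the paper.
import numpy as np

def matching_pursuit(target, dictionary, n_atoms=5):
    """dictionary: (dim, n_items) matrix of supervectors; returns (weights, selected indices)."""
    D = dictionary / np.linalg.norm(dictionary, axis=0, keepdims=True)
    residual = target.astype(float).copy()
    selected, weights = [], []
    for _ in range(n_atoms):
        scores = D.T @ residual
        k = int(np.argmax(np.abs(scores)))
        selected.append(k)
        weights.append(scores[k])
        residual = residual - scores[k] * D[:, k]
    return np.array(weights), np.array(selected)

rng = np.random.default_rng(2)
dim, n_speakers = 512, 60                      # toy sizes; real supervectors are much larger
dictionary = rng.normal(size=(dim, n_speakers))
true_w = np.zeros(n_speakers)
true_w[[3, 17, 42]] = [1.0, -0.5, 0.8]
target = dictionary @ true_w + 0.01 * rng.normal(size=dim)

w, idx = matching_pursuit(target, dictionary, n_atoms=3)
print("selected dictionary items:", sorted(idx.tolist()))
```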
EM zwlin_2004@163.com; qudanqu-dan@sina.com; wqzhang@tsinghua.edu.cn; lbclm@163.com RI Zhang, Wei-Qiang/A-7088-2008 OI Zhang, Wei-Qiang/0000-0003-3841-1959 FU National Natural Science Foundation of China [61175017, 61005019]; National High-Tech Research and Development Plan of China [2012AA011603] FX This work was supported in part by the National Natural Science Foundation of China (No. 61175017 and No. 61005019) and the National High-Tech Research and Development Plan of China (No. 2012AA011603). CR Boominathan V, 2012, 2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), P4381 Boyd S., 2004, CONVEX OPTIMIZATION Bruckstein AM, 2009, SIAM REV, V51, P34, DOI 10.1137/060657704 Candes EJ, 2006, IEEE T INFORM THEORY, V52, P489, DOI 10.1109/TIT.2005.862083 Chang E, 2001, P EUR, P2799 Chen S., 1999, P EUR C SPEECH COMM, V3, P1087 Cho HY, 2010, ETRI J, V32, P795, DOI 10.4218/etrij.10.1510.0062 Daubechies I, 2004, COMMUN PUR APPL MATH, V57, P1413, DOI 10.1002/cpa.20042 Digalakis VV, 1996, IEEE T SPEECH AUDI P, V4, P294, DOI 10.1109/89.506933 Donoho DL, 2006, IEEE T INFORM THEORY, V52, P1289, DOI 10.1109/TIT.2006.871582 Figueiredo MAT, 2007, IEEE J-STSP, V1, P586, DOI 10.1109/JSTSP.2007.910281 Gemmeke JF, 2011, IEEE T AUDIO SPEECH, V19, P2067, DOI 10.1109/TASL.2011.2112350 Hahm S, 2010, INT CONF ACOUST SPEE, P4302, DOI 10.1109/ICASSP.2010.5495672 Hazen T. J., 1997, P EUR, P2047 Huo Q, 1997, IEEE T SPEECH AUDI P, V5, P161 Kenny P, 2004, IEEE T SPEECH AUDI P, V12, P579, DOI 10.1109/TSA.2004.825668 Kua JMK, 2011, INT CONF ACOUST SPEE, P4548 Kua JMK, 2013, SPEECH COMMUN, V55, P707, DOI 10.1016/j.specom.2013.01.005 Kuhn R, 2000, IEEE T SPEECH AUDI P, V8, P695, DOI 10.1109/89.876308 Li JY, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 3 AND 4, P1656 Lu L, 2011, IEEE SIGNAL PROC LET, V18, P419, DOI 10.1109/LSP.2011.2157820 Mak B., 2006, P ICASSP, V1 MALLAT SG, 1993, IEEE T SIGNAL PROCES, V41, P3397, DOI 10.1109/78.258082 Naseem Imran, 2010, Proceedings of the 2010 20th International Conference on Pattern Recognition (ICPR 2010), DOI 10.1109/ICPR.2010.1083 Olsen P. A, 2011, P ASRU, P53 Olsen PA, 2012, 2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), P4317 Petersen KB, 2008, MATRIX COOKBOOK Povey D, 2012, COMPUT SPEECH LANG, V26, P35, DOI 10.1016/j.csl.2011.04.002 Shinoda K, 2010, IEICE T INF SYST, VE93D, P2348, DOI 10.1587/transinf.E93.D.2348 Shinoda K., 2000, Journal of the Acoustical Society of Japan (E), V21, DOI 10.1250/ast.21.79 Teng W. X., 2007, P INTERSPEECH, P258 Teng WX, 2009, INT CONF ACOUST SPEE, P4381, DOI 10.1109/ICASSP.2009.4960600 Tibshirani R, 1996, J ROY STAT SOC B MET, V58, P267 Tropp JA, 2007, IEEE T INFORM THEORY, V53, P4655, DOI 10.1109/TIT.2007.909108 Wiesler S, 2011, INT CONF ACOUST SPEE, P5324 WOODLAND PC, 1994, INT CONF ACOUST SPEE, P125 Young S., 2009, HTK BOOK HTK VERSION Yu D, 2012, 2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), P4409 NR 38 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD NOV-DEC PY 2013 VL 55 IS 10 BP 950 EP 963 DI 10.1016/j.specom.2013.06.012 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 215RH UT WOS:000324227500002 ER PT J AU Yu, RS AF Yu, Rongshan TI Speech enhancement based on soft audible noise masking and noise power estimation SO SPEECH COMMUNICATION LA English DT Article DE Speech enhancement; Speech processing; Auditory model; Perceptual model; Noise estimation; Noise suppression; Noise tracking ID SPECTRAL AMPLITUDE ESTIMATOR; AUDITORY-SYSTEM; SUPPRESSION; EPHRAIM AB This paper presents a perceptual model based speech enhancement algorithm. The proposed algorithm measures the amount of the audible noise in the input noisy speech based on estimation of short-time spectral power of noise signal, and masking threshold calculated from the estimated spectrum of clean speech. An appropriate amount of noise reduction is chosen based on the result to achieve good noise suppression without introducing significant distortion to the clean speech. To mitigate the problem of "musical noise", the amount of noise reduction is linked directly to the estimation of short-term noise spectral amplitude instead of noise variance so that the spectral peaks of noise can be better suppressed. Good performance of the proposed speech enhancement system is confirmed through objective and subjective tests. (c) 2013 Elsevier B.V. All rights reserved. C1 [Yu, Rongshan] Inst Infocomm Res I2R, Dept Signal Proc, Singapore, Singapore. [Yu, Rongshan] Dolby Labs Inc, San Francisco, CA 94103 USA. RP Yu, RS (reprint author), Inst Infocomm Res I2R, Dept Signal Proc, 1 Fusionopolis Way,21-01 Connexis South Tower, Singapore, Singapore. EM ryu@i2r.a-star.edu.sg CR 3GPP, 2001, 26978 3GPP TR 3GPP, 2001, 26090 3GPP TS 3GPP2, 2004, CS00140 3GPP2 [Anonymous], 2003, P835 ITUT [Anonymous], 2001, PERC EV SPEECH QUAL, P862 BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 Cappe O, 1994, IEEE T SPEECH AUDI P, V2, P345, DOI 10.1109/89.279283 EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 Erkelens J. S, 2008, IEEE INT C AC SPEECH Fisher W. M., 1986, P DARPA WORKSH SPEEC, P93 Gustafsson S, 1998, IEEE INT C AC SPEECH Hansen JHL, 2006, IEEE T AUDIO SPEECH, V14, P2049, DOI 10.1109/TASL.2006.876883 Hu Y, 2006, 2006 IEEE INT C AC S Hu Y, 2004, IEEE SIGNAL PROC LET, V11, P270, DOI 10.1109/LSP.2003.821714 ISO/IEC, 1992, JTC1SC29WG11IS111723 ITU, 1993, P56 ITUT, P56 ITU-T, 1996, P800 ITUT, P800 Jabloun F, 2003, IEEE T SPEECH AUDI P, V11, P700, DOI 10.1109/TSA.2003.818031 JOHNSTON JD, 1988, IEEE J SEL AREA COMM, V6, P314, DOI 10.1109/49.608 Lin L, 2003, IEEE INT C AC SPEECH Martin R, 2001, IEEE T SPEECH AUDI P, V9, P504, DOI 10.1109/89.928915 Princen J. P., 1987, P ICASSP 87 DALL TX, V4, P2161 RICE SO, 1948, AT&T TECH J, V27, P109 Tsoukalas DE, 1997, IEEE T SPEECH AUDI P, V5, P497, DOI 10.1109/89.641296 Virag N, 1999, IEEE T SPEECH AUDI P, V7, P126, DOI 10.1109/89.748118 Widrow B, 1985, ADAPTIVE SIGNAL PROC Wolfe PJ, 2003, EURASIP J APPL SIG P, V2003, P1043, DOI 10.1155/S1110865703304111 Yu RS, 2009, INT CONF ACOUST SPEE, P4421 NR 29 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD NOV-DEC PY 2013 VL 55 IS 10 BP 964 EP 974 DI 10.1016/j.specom.2013.05.006 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 215RH UT WOS:000324227500003 ER PT J AU Djendi, M Scalart, P Gilloire, A AF Djendi, Mohamed Scalart, Pascal Gilloire, Andre TI Analysis of two-sensors forward BSS structure with post-filters in the presence of coherent and incoherent noise SO SPEECH COMMUNICATION LA English DT Article DE Cepstral distance; Spectral distortion; Signal to noise ratio (SNR); Speech to coherent noise (SCR) ratio; Coherent noise to diffuse noise (CDR) ratio; BSS; Speech enhancement; Forward and Backward BSS structure ID LOW SIGNAL-DISTORTION; SPECTRAL AMPLITUDE ESTIMATOR; SPEECH ENHANCEMENT; SEPARATION; CANCELER; INTELLIGIBILITY; CANCELLATION; ALGORITHMS; REDUCTION; FIELD AB We consider the speech enhancement problem in a moving car through a blind source separation BSS scheme involving two sensors. To correct the distortion brought by this structure we have proposed in previous work (Djendi et al., 2007) two frequency-domain methods to compute the post-filters placed at the output of the forward BSS structure (FBSS). In this work, we consider the case where the noises at the sensor inputs contain coherent and non-coherent components. We provide an analysis of the performance (output SNR and the distortion criterion) of the FBSS structure with post-filters as a function of two new parameters: the coherent to diffuse ratio (CDR) and the speech to coherent ratio (SCR). Simulation results show perfect agreement between theoretical and experimental results. Crown Copyright (c) 2013 Published by Elsevier B.V. All rights reserved. C1 [Djendi, Mohamed] Blida Univ, LATSI Lab, Blida 09000, Algeria. [Djendi, Mohamed; Scalart, Pascal] Univ Rennes, IRISA ENSSAT, F-22305 Lannion, France. [Gilloire, Andre] France Telecom R&D, TECHISSTP, F-22307 Lannion, France. RP Djendi, M (reprint author), Univ Rennes, IRISA ENSSAT, F-22305 Lannion, France. EM m_djendi@yahoo.fr; pascal.scalart@enssat.fr; andre.gilloire@wanadoo.fr CR ALKINDI MJ, 1989, SIGNAL PROCESS, V17, P241, DOI 10.1016/0165-1684(89)90005-4 Araki S, 2005, INT CONF ACOUST SPEE, P81 Bentler R, 2008, INT J AUDIOL, V47, P447, DOI 10.1080/14992020802033091 BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 Charkani N. H., 1996, THESIS U RENNES 1 FR DJENDI M, 2006, INT CONF ACOUST SPEE, P744 Djendi M, 2007, P EUSIPCO POZN, V1, P218 Djendi M, 2009, P EUSICPO GLASG UK, V1, P165 EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 Even J, 2009, 2009 IEEE/SP 15TH WORKSHOP ON STATISTICAL SIGNAL PROCESSING, VOLS 1 AND 2, P513, DOI 10.1109/SSP.2009.5278525 FERRARA ER, 1980, IEEE T ACOUST SPEECH, V28, P474, DOI 10.1109/TASSP.1980.1163432 Gabrea M, 1996, P EUISIPCO 1996 TRIE, V2, P983 Gabrea M, 2003, P ICASSP, V2, P904 Goodwin G. 
C, 1985, INFO SYST SCI Guerin A, 2002, THESIS NATL POLLYTEC Hu Y, 2007, SPEECH COMMUN, V49, P588, DOI 10.1016/j.specom.2006.12.006 Hu Y, 2003, IEEE T SPEECH AUDI P, V11, P334, DOI 10.1109/TSA.2003.814458 Ikeda S, 1999, IEICE T FUND ELECTR, VE82A, P1517 Ikeda S, 1999, IEEE T SIGNAL PROCES, V47, P665, DOI 10.1109/78.747774 Jan T, 2009, INT CONF ACOUST SPEE, P1713, DOI 10.1109/ICASSP.2009.4959933 Kim G, 2009, J ACOUST SOC AM, V126, P1486, DOI 10.1121/1.3184603 Lim J, 1978, IEEE T ACOUST SPEECH, VASSP-37, P471 Marro C, 1998, IEEE T SPEECH AUDI P, V6, P240, DOI 10.1109/89.668818 Martin R, 2001, IEEE T SPEECH AUDI P, V9, P504, DOI 10.1109/89.928915 Matsuoka K., 2002, P 41 SICE ANN C AUG, V4, P2138, DOI 10.1109/SICE.2002.1195729 McCowan IA, 2003, IEEE T SPEECH AUDI P, V11, P709, DOI 10.1109/TSA.2003.818212 Mourjopoulos J. N, 1982, P ICASSP, V1, P1858 MOURJOPOULOS JN, 1994, J AUDIO ENG SOC, V42, P884 Plapous C, 2005, INT CONF ACOUST SPEE, P157 Plapous C., 2004, P IEEE INT C AC SPEE, V1, P289 Plapous C, 2006, IEEE T AUDIO SPEECH, V14, P2098, DOI 10.1109/TASL.2006.872621 Rangachari S, 2006, SPEECH COMMUN, V48, P220, DOI 10.1016/j.specom.2005.08.005 Sato M, 2005, IEICE T FUND ELECTR, VE88A, P2055, DOI 10.1093/ietfec/e88-a.8.2055 Scalart P, 1996, INT CONF ACOUST SPEE, P629, DOI 10.1109/ICASSP.1996.543199 Simmer K.U., 2001, MICROPHONE ARRAYS, P36 Sugiyama A, 1989, P IEEE ICASSP, V2, P892 VANGERVEN S, 1995, IEEE T SIGNAL PROCES, V43, P1602, DOI 10.1109/78.398721 Wang DL, 2009, J ACOUST SOC AM, V125, P2336, DOI 10.1121/1.3083233 Weiss RJ, 2010, COMPUT SPEECH LANG, V24, P16, DOI 10.1016/j.csl.2008.03.003 Zelinski R, 1988, P ICASSP, V5 NR 41 TC 6 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV-DEC PY 2013 VL 55 IS 10 BP 975 EP 987 DI 10.1016/j.specom.2013.06.001 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 215RH UT WOS:000324227500004 ER PT J AU Shi, YY Wiggers, P Jonker, CM AF Shi, Yangyang Wiggers, Pascal Jonker, Catholijn M. TI Classifying the socio-situational settings of transcripts of spoken discourses SO SPEECH COMMUNICATION LA English DT Article DE Socio-situational setting; Support vector machine; Dynamic Bayesian networks; Genre classification; Part of speech ID CLASSIFICATION; LANGUAGE; TEXT AB In this paper, we investigate automatic classification of the socio-situational settings of transcripts of a spoken discourse. Knowledge of the socio-situational setting can be used to search for content recorded in a particular setting or to select context-dependent models for example in speech recognition. The subjective experiment we report on in this paper shows that people correctly classify 68% the socio-situational settings. Based on the cues that participants mentioned in the experiment, we developed two types of automatic socio-situational setting classification methods; a static socio-situational setting classification method using support vector machines (S3C-SVM), and a dynamic socio-situational classification method applying dynamic Bayesian networks (S3C-DBN). Using these two methods, we developed classifiers applying various features and combinations of features. The S3C-SVM method with sentence length, function word ratio, single occurrence word ratio, part of speech (POS) and words as features results in a classification accuracy of almost 90%. 
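The static S3C-SVM variant just described rests on shallow stylistic features. A minimal sketch of such a feature extractor plus a linear SVM is given below; it is not the authors' implementation, and the tiny function-word list, tokenisation and example "transcripts" are illustrative stand-ins.

```python
# Minimal sketch, not the authors' S3C-SVM: stylistic features of the kind listed above
# (mean sentence length, function-word ratio, single-occurrence-word ratio) fed to a
# linear SVM. The function-word list, tokenisation and the two toy documents are
# illustrative assumptions only.
from collections import Counter
import numpy as np
from sklearn.svm import LinearSVC

FUNCTION_WORDS = {"the", "a", "of", "and", "to", "in", "is", "that", "it", "you"}

def features(transcript):
    sentences = [s.split() for s in transcript.lower().split(".") if s.strip()]
    words = [w for s in sentences for w in s]
    counts = Counter(words)
    mean_sent_len = np.mean([len(s) for s in sentences])
    func_ratio = sum(counts[w] for w in FUNCTION_WORDS) / len(words)
    single_ratio = sum(1 for c in counts.values() if c == 1) / len(words)
    return [mean_sent_len, func_ratio, single_ratio]

docs = ["the committee approved the proposal and the budget in the session.",
        "you know it is like that. yeah it is."]
labels = ["broadcast", "conversation"]

clf = LinearSVC().fit([features(d) for d in docs], labels)
print(clf.predict([features("it is that you know. yeah.")]))
```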
Using a bigram S3C-DBN with Pos tag and word features results in a dynamic classifier which can obtain nearly 89% classification accuracy. The dynamic classifiers not only can achieve similar results as the static classifiers, but also can track the socio-situational setting while processing a transcript or conversation. On discourses with a static social situational setting, the dynamic classifiers only need the initial 25% of data to achieve a classification accuracy close to the accuracy achieved when all data of a transcript is used. (c) 2013 Elsevier B.V. All rights reserved. C1 [Shi, Yangyang; Jonker, Catholijn M.] Delft Univ Technol, Intelligent Syst Dept, NL-2600 AA Delft, Netherlands. [Wiggers, Pascal] Amsterdam Univ Appl Sci HvA, CREATE IT Appl Res, Amsterdam, Netherlands. RP Shi, YY (reprint author), HB12-290,Mekelweg 4, NL-2628 CD Delft, Netherlands. EM shiyang1983@gmail.com CR Argamon S., 1998, P 1 INT WORKSH INN I Baeza-Yates R., 1999, MODERN INFORM RETRIE Chang C.-C., 2011, ACM T INTEL SYST TEC, V2, P1, DOI DOI 10.1145/1961189.1961199 Chen GY, 2008, APPLIED COMPUTING 2008, VOLS 1-3, P2353 CORTES C, 1995, MACH LEARN, V20, P273, DOI 10.1023/A:1022627411411 Dean T., 1989, Computational Intelligence, V5, DOI 10.1111/j.1467-8640.1989.tb00324.x Fan RE, 2008, J MACH LEARN RES, V9, P1871 Feldman S, 2009, INT CONF ACOUST SPEE, P4781, DOI 10.1109/ICASSP.2009.4960700 Firth J. R., 1957, STUDIES LINGUISTIC A, P1930 Iyer R., 1994, P ARPA WORKSH HUM LA, P82, DOI 10.3115/1075812.1075828 Joachims T., 1998, MACH LEARN ECML 98, V1398, P137, DOI DOI 10.1007/BFB0026683 KARLGREN J, 1994, P 15 INT C COMP LING, V2, P1071, DOI 10.3115/991250.991324 Kessler B., 1997, P 35 ANN M ASS COMP, P32 Labov William, 1972, SOCIOLINGUISTIC PATT Langley P., 1992, P 10 NAT C ART INT, P223 Lee Y., 2002, P ACL SIGIR C RES DE, P145 LEVINSON SC, 1979, LINGUISTICS, V17, P365, DOI 10.1515/ling.1979.17.5-6.365 Murphy K.P., 2002, THESIS U CALIFORNIA Obin N, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 3 AND 4, P3070 Oostdijk N., 1999, BUILDING CORPUS SPOK Oostdijk N., 2002, P 3 INT C LANG RES E, P340 Pearl J., 1988, PROBABILISTIC REASON Peng FC, 2003, LECT NOTES COMPUT SC, V2633, P335 Ries K, 2000, P INT C LANG RESS EV Rosenfeld R, 2000, P IEEE, V88, P1270, DOI 10.1109/5.880083 Santini M., 2004, 7 ANN CLUK RES C Santini M., 2006, JADT 2006 8 JOURN Shi Y, 2010, 22 BEN C ART INT, P154 Stamatatos E., 2000, P 18 INT C COMP LING, V2, P808, DOI 10.3115/992730.992763 Theodoridis S, 2009, PATTERN RECOGNITION, 4RTH EDITION, P1 Tong S, 2000, J MACHINE LEARNING R, V2, P45, DOI DOI 10.1162/153244302760185243 Van Gijsel S, 2006, P 8 INT C STAT AN TE, V2, P961 Wiggers P., P INT C TEXT SPEECH, P366 NR 33 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD NOV-DEC PY 2013 VL 55 IS 10 BP 988 EP 1002 DI 10.1016/j.specom.2013.06.011 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 215RH UT WOS:000324227500005 ER PT J AU Yook, S Nam, KW Kim, H Kwon, SY Kim, D Lee, S Hong, SH Jang, DP Kim, IY AF Yook, Sunhyun Nam, Kyoung Won Kim, Heepyung Kwon, See Youn Kim, Dongwook Lee, Sangmin Hong, Sung Hwa Jang, Dong Pyo Kim, In Young TI Modified segmental signal-to-noise ratio reflecting spectral masking effect for evaluating the performance of hearing aid algorithms SO SPEECH COMMUNICATION LA English DT Article DE Speech quality test; Hearing aid; Masking threshold; Segmental signal to noise ratio ID SPEECH ENHANCEMENT; QUALITY; RECOGNITION AB Most traditional objective indices don't distinguish between a real sound and a perceived sound, and therefore, these indices have limitations in regard to the evaluation of the real effect of an algorithm under investigation on the auditory perception of a hearing-impaired person. Though several objective indices, such as perceptual evaluation of speech quality (PESQ) and composite measurements, that reflect the psychoacoustic factors were already in use, it is helpful to develop more objective indices that take into account human psychoacoustic factors in order to accurately evaluate the performance of hearing aid algorithms. In this study, a new objective index that reflects the spectral masking effect into the calculation of the conventional segmental signal-to-noise ratio (segSNR) was proposed. The performance of this index was evaluated by analyzing the correlation of the result and (1) the mean opinion score and (2) the speech recognition threshold tests of 15 normal-hearing volunteers and 15 hearing-impaired patients. The correlation values of the proposed index were relatively high (0.83-0.97) across various ambient noise situations. Based on these experimental results, the proposed index has the potential to be widely used as a measuring index for the performance evaluation of various hearing aid algorithms prior to conducting clinical experiments. (c) 2013 Elsevier B.V. All rights reserved. C1 [Yook, Sunhyun; Nam, Kyoung Won; Kim, Heepyung; Jang, Dong Pyo; Kim, In Young] Hanyang Univ, Dept Biomed Engn, Seoul 133791, South Korea. [Kwon, See Youn; Hong, Sung Hwa] Samsung Med Ctr, Dept Otolaryngol Head & Neck Surg, Seoul 135710, South Korea. [Kim, Dongwook] Samsung Adv Inst Technol, Bio & Hlth Lab, Yongin 446712, South Korea. [Lee, Sangmin] Inha Univ, Dept Elect Engn, Inchon 402751, South Korea. RP Kim, IY (reprint author), Hanyang Univ, Dept Biomed Engn, Seoul 133791, South Korea. EM soonhyun@bme.hanyang.ac.kr; kwnam@bme.hanyang.ac.kr; heepyung@bme.hanyang.ac.kr; sykwon@bme.hanyang.ac.kr; steve7.kim@sam-sung.com; sanglee@inha.ac.kr; hongsh@skku.edu; dongpjang@gmail.com; iykim@hanyang.ac.kr FU Strategic Technology Development Program of Ministry of Knowledge Economy, KOREA [10031764]; Seoul R&BD Program, KOREA [SS100022] FX This work was supported by grants from the Strategic Technology Development Program of Ministry of Knowledge Economy, KOREA (No. 10031764) and from Seoul R&BD Program, KOREA (No. SS100022). The authors are grateful to Silvia Allegro and Phonak for allowing access to their database. CR Alam M. 
J, 2009, P IEEE INT C ICCIT, V12, P483 Amehraye A, 2009, INT J INF COMMUN ENG, V5, P2 Arehart KH, 2010, EAR HEARING, V31, P420, DOI 10.1097/AUD.0b013e3181d3d4f3 Beerends JG, 2002, J AUDIO ENG SOC, V50, P765 BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 Brandenburg K, 1992, INT C TEST MEAS, P11 Buchler M, 2005, EURASIP J APPL SIG P, V18, P2991 EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550 Garcia R. A, 1999, AUDIO ENG SOC, V107, P5073 Han H, 2011, INT J AUDIOL, V50, P59, DOI 10.3109/14992027.2010.526637 Hu Y, 2008, IEEE T AUDIO SPEECH, V16, P229, DOI 10.1109/TASL.2007.911054 IEEE Subcommittee, 1969, IEEE T AUDIO ELECTRO, V3, P225, DOI DOI 10.1109/TAU.1969.1162058 ISO, 2003, 2262003 ISO JOHNSTON JD, 1988, IEEE J SEL AREA COMM, V6, P314, DOI 10.1109/49.608 Kim J. S, 2008, KOREAN ACAD AUDIOL, V4, P126 Li JF, 2011, J ACOUST SOC AM, V129, P3291, DOI 10.1121/1.3571422 Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst Mourjopoulos J, 1991, AUDIO ENG SOC, V90, P3064 Nam KW, 2013, SPEECH COMMUN, V55, P544, DOI 10.1016/j.specom.2012.11.002 Quackenbush S, 1988, OBJECTIVE MEASURES S, P47 Richards D.L., 1965, Electronics Letters, V1, DOI 10.1049/el:19650037 Scalart P, 1969, P IEEE INT C AC SPEE, V2, P629 Tribolet J., 1988, P IEEE INT C AC SPEE, V3, P586 Turner CW, 2004, J ACOUST SOC AM, V115, P1729, DOI 10.1121/1.1687425 Van Engen KJ, 2007, J ACOUST SOC AM, V121, P519, DOI 10.1121/1.2400666 Yost W. A., 1994, FUNDAMENTALS HEARING NR 26 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV-DEC PY 2013 VL 55 IS 10 BP 1003 EP 1010 DI 10.1016/j.specom.2013.05.005 PG 8 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 215RH UT WOS:000324227500006 ER PT J AU Chen, F Wong, LLN Hu, Y AF Chen, Fei Wong, Lena L. N. Hu, Yi TI A Hilbert-fine-structure-derived physical metric for predicting the intelligibility of noise-distorted and noise-suppressed speech SO SPEECH COMMUNICATION LA English DT Article DE Speech intelligibility; Hilbert fine-structure signal; Speech transmission index ID HEARING-IMPAIRED LISTENERS; TEMPORAL ENVELOPE; REDUCTION ALGORITHMS; AUDITORY-PERCEPTION; ARTICULATION INDEX; CUES; COHERENCE; QUALITY; ABILITY; RECOGNITION AB Despite the established importance of temporal fine-structure (TFS) on speech perception in noise, existing speech transmission metrics use primarily envelope information to model speech intelligibility variance. This study proposes a new physical metric for predicting speech intelligibility using information obtained from the Hilbert-derived TFS waveform. It is found that by making explicit use of coherence information contained in the complex spectra of the Hilbert-derived TFS waveforms of the clean and corrupted speech signals, and assessing the extent to which the coherence in the Hilbert fine structure is affected following the linear or non-linear processing (e.g., noise distortion, speech enhancement, etc.) of the stimulus, the predictive power of the intelligibility measure can be significantly improved for noise-distorted and noise-suppressed speech signals. 
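The core operation, comparing the Hilbert-derived fine structure of clean and corrupted signals, can be sketched as follows. This is not the proposed metric itself: the paper works on an auditory-style band decomposition with its own coherence formulation, whereas the sketch runs full-band, extracts the TFS as the cosine of the instantaneous phase, and summarises agreement with ordinary magnitude-squared coherence on a toy signal.

```python
# Minimal sketch, not the proposed measure: extract the Hilbert temporal fine structure
# (TFS) of clean and degraded signals and summarise how well the degraded TFS tracks
# the clean one using magnitude-squared coherence. Band decomposition and the mapping
# to intelligibility are omitted; the toy signal and noise level are assumptions.
import numpy as np
from scipy.signal import hilbert, coherence

fs = 16000
t = np.arange(0, 1.0, 1.0 / fs)
clean = np.sin(2 * np.pi * 300 * t) * (1 + 0.5 * np.sin(2 * np.pi * 4 * t))  # toy "speech"
noisy = clean + 0.5 * np.random.default_rng(3).normal(size=clean.size)

def fine_structure(x):
    """Unit-envelope carrier: cosine of the instantaneous phase of the analytic signal."""
    return np.cos(np.angle(hilbert(x)))

f, Cxy = coherence(fine_structure(clean), fine_structure(noisy), fs=fs, nperseg=512)
print("mean TFS coherence:", round(float(np.mean(Cxy)), 3))
```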
When evaluated with speech recognition scores obtained with normal-hearing listeners, including a total of sixty-four noise-suppressed conditions with nonlinear distortions and eight noisy conditions without subsequent noise reduction, the proposed TFS-based measure was found to predict speech intelligibility better than most envelope- and coherence-based measures. High correlation was maintained for all types of maskers tested, with a maximum correlation of r = 0.95 achieved in car and street noise conditions. (c) 2013 Elsevier B.V. All rights reserved. C1 [Chen, Fei; Wong, Lena L. N.] Univ Hong Kong, Prince Philip Dent Hosp, Div Speech & Hearing Sci, Hong Kong, Hong Kong, Peoples R China. [Hu, Yi] Univ Wisconsin, Dept Elect Engn & Comp Sci, Milwaukee, WI 53211 USA. RP Chen, F (reprint author), Univ Hong Kong, Prince Philip Dent Hosp, Div Speech & Hearing Sci, 34 Hosp Rd, Hong Kong, Hong Kong, Peoples R China. EM feichen1@hku.hk; huy@uwm.edu FU Faculty Research Fund, Faculty of Education, The University of Hong Kong; Seed Funding for Basic Research, The University of Hong Kong; General Research Fund (GRF) FX This research was supported by Faculty Research Fund, Faculty of Education, The University of Hong Kong, by Seed Funding for Basic Research, The University of Hong Kong, and by General Research Fund (GRF), administered by the Hong Kong Research Grants council. The authors thank the Associate Editor, Dr. Karen Livescu, and two anonymous reviewers for their constructive and helpful comments. CR ANSI, 1997, S351997 ANSI Arehart KH, 2007, J ACOUST SOC AM, V122, P1150, DOI 10.1121/1.2754061 Baher H, 2001, ANALOG DIGITAL SIGNA Carter C, 1973, IEEE T AUDIO ELECTRO, VAU-21, P337 Chen F, 2013, J ACOUST SOC AM, V133, pEL405, DOI 10.1121/1.4800189 Chen F, 2010, J ACOUST SOC AM, V128, P3715, DOI 10.1121/1.3502473 Chen F, 2012, J ACOUST SOC AM, V131, P4104, DOI 10.1121/1.3695401 Drennan WR, 2005, EAR HEARING, V26, P461, DOI 10.1097/01.aud.0000179690.30137.21 DRULLMAN R, 1994, J ACOUST SOC AM, V95, P1053, DOI 10.1121/1.408467 DRULLMAN R, 1994, J ACOUST SOC AM, V95, P2670, DOI 10.1121/1.409836 FLETCHER H, 1950, J ACOUST SOC AM, V22, P89, DOI 10.1121/1.1906605 FRENCH NR, 1947, J ACOUST SOC AM, V19, P90, DOI 10.1121/1.1916407 Gilbert G, 2006, J ACOUST SOC AM, V119, P2438, DOI 10.1121/1.2173522 Goldsworthy RL, 2004, J ACOUST SOC AM, V116, P3679, DOI 10.1121/1.1804628 Gomez AM, 2012, SPEECH COMMUN, V54, P503, DOI 10.1016/j.specom.2011.11.001 GREENWOOD DD, 1990, J ACOUST SOC AM, V87, P2592, DOI 10.1121/1.399052 Gustafsson H, 2001, IEEE T SPEECH AUDI P, V9, P799, DOI 10.1109/89.966083 Hirsch H, 2000, ISCA TUT RES WORKSH Hollube I., 1996, J ACOUST SOC AM, V100, P1703 Hopkins K, 2008, J ACOUST SOC AM, V123, P1140, DOI 10.1121/1.2824018 HOUTGAST T, 1971, ACUSTICA, V25, P355 HOUTGAST T, 1985, J ACOUST SOC AM, V77, P1069, DOI 10.1121/1.392224 Hu Y, 2007, J ACOUST SOC AM, V122, P1777, DOI 10.1121/1.2766778 Jorgensen S, 2011, J ACOUST SOC AM, V130, P1475, DOI 10.1121/1.3621502 KATES JM, 1992, J ACOUST SOC AM, V91, P2236, DOI 10.1121/1.403657 Kates JM, 2005, J ACOUST SOC AM, V117, P2224, DOI 10.1121/1.1862575 Kong YY, 2006, J ACOUST SOC AM, V120, P2830, DOI 10.1121/1.2346009 KRYTER KD, 1962, J ACOUST SOC AM, V34, P1689, DOI 10.1121/1.1909094 KRYTER KD, 1962, J ACOUST SOC AM, V34, P1698, DOI 10.1121/1.1909096 Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst Lorenzi C, 2006, P NATL ACAD SCI USA, V103, P18866, DOI 10.1073/pnas.0607364103 Ma JF, 2009, J ACOUST SOC AM, V125, P3387, DOI 10.1121/1.3097493 Moore BCJ, 
2008, JARO-J ASSOC RES OTO, V9, P399, DOI 10.1007/s10162-008-0143-x Myers R, 1990, CLASSICAL MODERN REG, V2nd PAVLOVIC CV, 1987, J ACOUST SOC AM, V82, P413, DOI 10.1121/1.395442 SHANNON RV, 1995, SCIENCE, V270, P303, DOI 10.1126/science.270.5234.303 Smith ZM, 2002, NATURE, V416, P87, DOI 10.1038/416087a STEENEKEN HJM, 1980, J ACOUST SOC AM, V67, P318, DOI 10.1121/1.384464 STEENEKEN HJM, 1982, ACUSTICA, V51, P229 STEIGER JH, 1980, PSYCHOL BULL, V87, P245, DOI 10.1037//0033-2909.87.2.245 Taal CH, 2011, IEEE T AUDIO SPEECH, V19, P2125, DOI 10.1109/TASL.2011.2114881 van Buuren RA, 1999, J ACOUST SOC AM, V105, P2903, DOI 10.1121/1.426943 Zeng FG, 2004, J ACOUST SOC AM, V116, P1351, DOI 10.1121/1.1777938 NR 43 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV-DEC PY 2013 VL 55 IS 10 BP 1011 EP 1020 DI 10.1016/j.specom.2013.06.016 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 215RH UT WOS:000324227500007 ER PT J AU Ozimek, E Kocinski, J Kutzner, D Sek, A Wicher, A AF Ozimek, Edward Kocinski, Jedrzej Kutzner, Dariusz Sek, Aleksander Wicher, Andrzej TI Speech intelligibility for different spatial configurations of target speech and competing noise source in a horizontal and median plane SO SPEECH COMMUNICATION LA English DT Article DE Speech intelligibility; Speech-in-noise test; Spatial perception; Monaural and binaural perception ID IMPAIRED LISTENERS; UNMASKING; HEARING; MASKING; RECOGNITION; SEPARATION; THRESHOLD; LOCATION; RELEASE AB The speech intelligibility for different configurations of a target signal (speech) and masker (babble noise) in a horizontal and a median plane was investigated. The sources were placed at the front, in the back or in the right hand side (at different angular configurations) of a dummy head. The speech signals were presented to listeners via headphones at different signal-to-noise ratios (SNR). Three different types of listening mode (binaural and monaural for the right or left ear) were tested. It was found that the binaural mode gave the lowest, i.e. 'the best', speech reception threshold (SRT) values compared to the other modes, except for the cases when both the target and masker were at the same position. With regard to the monaural modes, SRTs were generally worse than those for the binaural mode. The new data gathered for the median plane revealed that a change in elevation of the speech source had a small, but statistically significant, influence on speech intelligibility. It was found that when speech elevation was increased, speech intelligibility decreased. (c) 2013 Elsevier B.V. All rights reserved. C1 [Ozimek, Edward; Kocinski, Jedrzej; Kutzner, Dariusz; Sek, Aleksander; Wicher, Andrzej] Adam Mickiewicz Univ, Fac Phys, Inst Acoust, PL-61614 Poznan, Poland. RP Ozimek, E (reprint author), Adam Mickiewicz Univ, Inst Acoust, Ul Umultowska 85, PL-61614 Poznan, Poland. EM ozimaku@amu.edu.pl FU European Union [004171]; Polish-Norwegian Research Fund, State Ministry of Science and Higher Education [N N518 502139]; National Science Centre [UMO-2011/03/B/HS6/03709] FX This work was supported by Grants from: the European Union FP6: Project 004171 HEARCOM, the Polish-Norwegian Research Fund, State Ministry of Science and Higher Education: Project Number N N518 502139 and National Science Centre: Project Number UMO-2011/03/B/HS6/03709. 
CR Allen K, 2008, J ACOUST SOC AM, V123, P1562, DOI 10.1121/1.2831774 Bosman AJ, 1995, AUDIOLOGY, V34, P260 Brand T, 2002, J ACOUST SOC AM, V111, P2801, DOI 10.1121/1.1479152 BRONKHORST AW, 1988, J ACOUST SOC AM, V83, P1508, DOI 10.1121/1.395906 Brungart DS, 2012, J ACOUST SOC AM, V132, P2545, DOI 10.1121/1.4747005 Brungart DS, 2002, J ACOUST SOC AM, V112, P664, DOI 10.1121/1.1490592 COX RM, 1991, J SPEECH HEAR RES, V34, P904 Drullman R, 2000, J ACOUST SOC AM, V107, P2224, DOI 10.1121/1.428503 Edmonds BA, 2006, J ACOUST SOC AM, V120, P1539, DOI 10.1121/1.2228573 Freyman RL, 1999, J ACOUST SOC AM, V106, P3578, DOI 10.1121/1.428211 Freyman RL, 2001, J ACOUST SOC AM, V109, P2112, DOI 10.1121/1.1354984 Garadat S. N., 2006, J ACOUST SOC AM, V121, P1047 Hawley ML, 2004, J ACOUST SOC AM, V115, P833, DOI 10.1121/1.1639908 Kocinski J., 2005, Archives of Acoustics, V30 Kollmeier B., 1997, J ACOUST SOC AM, V102, P1085 Lin WY, 2003, J NEUROSCI, V23, P8143 Litovsky RY, 2005, J ACOUST SOC AM, V117, P3091, DOI 10.1121/1.1873913 Muller C, 1992, PERZEPTIVE ANAL WEIT Ozimek E., 2009, INT J AUDIOL, V48, P440 Peissig J, 1997, J ACOUST SOC AM, V101, P1660, DOI 10.1121/1.418150 Shinn-Cunningham BG, 2001, J ACOUST SOC AM, V110, P1118, DOI 10.1121/1.1386633 Versfeld NJ, 2000, J ACOUST SOC AM, V107, P1671, DOI 10.1121/1.428451 Yost W. A., 1993, HUMAN PSYCHOPHYSICS NR 23 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV-DEC PY 2013 VL 55 IS 10 BP 1021 EP 1032 DI 10.1016/j.specom.2013.06.009 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 215RH UT WOS:000324227500008 ER PT J AU Cerva, P Silovsky, J Zdansky, J Nouza, J Seps, L AF Cerva, Petr Silovsky, Jan Zdansky, Jindrich Nouza, Jan Seps, Ladislav TI Speaker-adaptive speech recognition using speaker diarization for improved transcription of large spoken archives SO SPEECH COMMUNICATION LA English DT Article DE Speaker adaptive; Automatic speech recognition; Speaker adaptation; Speaker diarization; Automatic transcription; Large spoken archives ID SYSTEM; NORMALIZATION; ADAPTATION; ACCESS; WORD AB This paper deals with speaker-adaptive speech recognition for large spoken archives. The goal is to improve the recognition accuracy of an automatic speech recognition (ASR) system that is being deployed for transcription of a large archive of Czech radio. This archive represents a significant part of Czech cultural heritage, as it contains recordings covering 90 years of broadcasting. A large portion of these documents (100,000 h) is to be transcribed and made public for browsing. To improve the transcription results, an efficient speaker-adaptive scheme is proposed. The scheme is based on integration of speaker diarization and adaptation methods and is designed to achieve a low Real-Time Factor (RTF) of the entire adaptation process, because the archive's size is enormous. It thus employs just two decoding passes, where the first one is carried out using the lexicon with a reduced number of items. Moreover, the transcripts from the first pass serve not only for adaptation, but also as the input to the speaker diarization module, which employs two-stage clustering. The output of diarization is then utilized for a cluster-based unsupervised Speaker Adaptation (SA) approach that also utilizes information based on the gender of each individual speaker. 
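The speaker-clustering stage on which the cluster-based adaptation relies can be sketched as agglomerative clustering of per-segment feature averages with a distance threshold. The sketch below is not the paper's two-stage procedure; the "segment embeddings" and threshold are toy assumptions, and it only shows how segments would be grouped before estimating one adapted model per cluster.

```python
# Minimal sketch, not the paper's diarization module: group speech segments into speaker
# clusters by agglomerative clustering of per-segment feature vectors. Each resulting
# cluster's segments would then be pooled for cluster-based unsupervised adaptation.
# Segment vectors and the distance threshold below are toy assumptions.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(4)
# Toy "segment embeddings": three underlying speakers, ten segments each.
speakers = [rng.normal(loc=m, scale=0.3, size=(10, 8)) for m in (0.0, 3.0, 6.0)]
segments = np.vstack(speakers)

clusterer = AgglomerativeClustering(n_clusters=None, distance_threshold=5.0, linkage="average")
labels = clusterer.fit_predict(segments)
print("estimated number of speakers:", clusterer.n_clusters_)
```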
Presented experimental results on various types of programs show that our adaptation scheme yields a significant Word Error Rate (WER) reduction from 22.24% to 18.85% over the Speaker Independent (SI) system while operating at a reasonable RTF. (c) 2013 Elsevier B.V. All rights reserved. C1 [Cerva, Petr; Silovsky, Jan; Zdansky, Jindrich; Nouza, Jan; Seps, Ladislav] Tech Univ Liberec, Inst Informat Technol & Elect, Liberec 46117, Czech Republic. RP Cerva, P (reprint author), Tech Univ Liberec, Inst Informat Technol & Elect, Studentska 2, Liberec 46117, Czech Republic. EM petr.cerva@tul.cz; jan.silovsky@tul.cz; jindrich.zdansky@tul.cz; jan.nouza@tul.cz; ladislav.seps@tul.cz RI Nouza, Jan/E-9914-2011 FU Czech Science Foundation [P103/11/P499]; Student Grant Scheme (SGS) at the Technical University of Liberec FX This work was supported by the Czech Science Foundation (project no. P103/11/P499) and by the Student Grant Scheme (SGS) at the Technical University of Liberec. CR Acero A, 1991, ACOUSTICS SPEECH SIG, V2, P893 Anastasakos T, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P1137 Anguera X, 2006, INT 06 PITTSB PA US Breslin C, 2011, INTERSPEECH, P1085 Byrne W, 2004, IEEE T SPEECH AUDI P, V12, P420, DOI 10.1109/TSA.2004.828702 Cerva P, 2011, 12TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2011 (INTERSPEECH 2011), VOLS 1-5, P2576 Chen S. S., 1998, P DARPA BROADC NEWS, P127 Chu SM, 2008, INT CONF ACOUST SPEE, P4329, DOI 10.1109/ICASSP.2008.4518613 Chu SM, 2010, INT CONF ACOUST SPEE, P4374, DOI 10.1109/ICASSP.2010.5495639 Dehak N, 2011, IEEE T AUDIO SPEECH, V19, P788, DOI 10.1109/TASL.2010.2064307 Fiscus J. G., 1997, P IEEE WORKSH AUT SP, P347 Franco-Pedroso J, 2010, FALA 2010, V2010, P415 Gales MJF, 1998, COMPUT SPEECH LANG, V12, P75, DOI 10.1006/csla.1998.0043 Gales MJF, 1996, COMPUT SPEECH LANG, V10, P249, DOI 10.1006/csla.1996.0013 Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278 Glass J., 2007, P INTERSPEECH, P2553 Hain T, 2012, IEEE T AUDIO SPEECH, V20, P486, DOI 10.1109/TASL.2011.2163395 Hansen JH, 2005, IEEE T SPEECH AUDI P, V13, P712, DOI 10.1109/TSA.2005.852088 Kenny P, 2005, IEEE T SPEECH AUDI P, V13, P345, DOI 10.1109/TSA.2004.840940 Kim D. Y., 2004, P 2004 INT C SPOK LA KNESER R, 1995, INT CONF ACOUST SPEE, P181, DOI 10.1109/ICASSP.1995.479394 Lee L, 1996, INT CONF ACOUST SPEE, P353 Liu D, 2005, INTERPEECH 2005 LISB, P281 luc Gauvain J., 1998, ICSLP 98, P1335 Mangu L, 2000, COMPUT SPEECH LANG, V14, P373, DOI 10.1006/csla.2000.0152 Matsoukas S, 2006, IEEE T AUDIO SPEECH, V14, P1541, DOI 10.1109/TASL.2006.878257 Moattar MH, 2012, SPEECH COMMUN, V54, P1065, DOI 10.1016/j.specom.2012.05.002 Ng T, 2012, P INT 2012 INT SPEEC, P1967 NIST, 2009, 2009 RT 09 RICH TRAN Nouza J, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P1650 Nouza J, 2010, IEEE MEDITERR ELECT, P202, DOI 10.1109/MELCON.2010.5476306 Nouza J, 2012, COMM COM INF SC, V247, P27 Ordelman R, 2006, P 17 EUR C ART INT E Povey D, 2011, ASRU, P158 Shinoda K, 2005, ELECTRON COMM JPN 3, V88, P25, DOI 10.1002/ecjc.20207 Silovsky J, 2012, P INT 2012 INT SPEEC, P478 Silovsky J, 2012, 2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), P4193 Stolcke A, 2008, LECT NOTES COMPUT SC, V4625, P450 Uebel L. 
F, 1999, 6 EUR C SPEECH COMM Welling L, 2002, IEEE T SPEECH AUDI P, V10, P415, DOI 10.1109/TSA.2002.803435 Zdansky J, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P2182 NR 41 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV-DEC PY 2013 VL 55 IS 10 BP 1033 EP 1046 DI 10.1016/j.specom.2013.06.017 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 215RH UT WOS:000324227500009 ER PT J AU Tiemounou, S Jeannes, RL Barriac, V AF Tiemounou, Sibiri Jeannes, Regine Le Bouquin Barriac, Vincent TI On the identification of relevant degradation indicators in super wideband listening quality assessment models SO SPEECH COMMUNICATION LA English DT Article DE Perceptual dimensions; Super wideband; Voice quality assessment; Objective models; Diagnostic ID PERCEPTUAL EVALUATION; ITU STANDARD; SPEECH; PESQ AB Recently, new objective speech quality evaluation methods, designed and adapted to new high voice quality contexts, have been developed. One interest of these methods is that they integrate voice quality perceptual dimensions reflecting the effects of frequency response distortions, discontinuities, noise and/or speech level deviations respectively. This makes it possible to use these methods also to provide diagnostic information about specific aspects of the transmission systems' quality, as perceived by end-users. In this paper, we present and analyze in depth two of these approaches namely POLQA (Perceived Objective Listening Quality Assessment) and DIAL (Diagnostic Instrumental Assessment of Listening quality), in terms of quality degradation indicators related to the perceptual dimensions these models could embed. The main goal of our work is to find and propose the most robust quality degradation indicators to reliably characterize the impact of degradations relative to the perceptual dimensions described above and to identify the underlying technical causes in super wideband telephone communications [50, 14000] Hz. To do so, the first step of our study was to identify in both models the correspondence between perceptual dimensions and quality degradation indicators. Such indicators could be either present in the model itself or derived from our own investigation of the model. In a second step, we analyzed the performance and robustness of the identified quality degradation indicators on speech samples only impaired by one degradation (representative of one perceptual dimension) at a time. This study highlighted the reliability of some of the quality degradation indicators embedded in the two models under study and stood for a first step in the evaluation of performance of these indicators to quantify the degradation for which they were designed. (c) 2013 Elsevier B.V. All rights reserved. C1 [Tiemounou, Sibiri; Barriac, Vincent] Orange Labs Lannion, F-22307 Lannion, France. [Tiemounou, Sibiri; Jeannes, Regine Le Bouquin] INSERM, U1099, F-35000 Rennes, France. [Tiemounou, Sibiri; Jeannes, Regine Le Bouquin] Univ Rennes 1, LTSI, F-35000 Rennes, France. RP Tiemounou, S (reprint author), Orange Labs Lannion, 2 Av Pierre Marzin, F-22307 Lannion, France. 
EM sibiri.tiemounou@orange.com; regine.le-bouquin-jeannes@univ-rennes1.fr; vincent.barriac@orange.com FU Rohde & Schwarz -SwissQual; Opticom FX The authors wish to thank the following companies for granting us permission to use their SWB subjectively scored databases and publish results based on them: Rohde & Schwarz - SwissQual, TNO, Deutsche Telekom, Net-scout-Psytechnics. The use and analysis of the new POL-QA/P.863 standard model have been made possible thanks to the support of Rohde & Schwarz -SwissQual and Opticom. CR [Anonymous], 2011, P 863 PERC OBJ LIST, P863 [Anonymous], 1996, METH SUBJ DET TRANSM, P800 [Anonymous], 2001, PERC EV SPEECH QUAL, P862 Beerends J. G, 2004, JOINT C GERM FRENCH Beerends JG, 2007, J AUDIO ENG SOC, V55, P1059 Beerends JG, 2002, J AUDIO ENG SOC, V50, P765 BEERENDS JG, 1994, J AUDIO ENG SOC, V42, P115 Cote N., 2011, INTEGRAL DIAGNOSTIC, P133 Cote N, 2006, P 2 ISCA DEGA TUT RE, P115 Danno V, 2013, ADAPTIVE MULTIRATE W FU Q, 2000, ACOUST SPEECH SIG PR, P1511 Huo L., 2008, P 155 M AC SOC AM 5, P1 Huo L, 2008, P 155 M AC SOC AM 5 Huo L., 2007, IEEE WORKSH APPL SIG ITU-T Rec, 2005, WID EXT REC P 862 AS IUT-T Rec, 1988, SPEC INT REF SYST, P48 Leman A., 2010, P 38 INT C AUD ENG S MCDERMOT.BJ, 1969, J ACOUST SOC AM, V45, P774, DOI 10.1121/1.1911465 Raake A, 2006, SPEECH QUALITY VOIP, P74 Rix AW, 2002, J AUDIO ENG SOC, V50, P755 Scholz K, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P1523 Sen D, 2008, J AUDIO ENG SOC, V131, P4087 Sen D, 2012, INT TELECOMMUNICATIO Waltermann M, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P2170 Waltermann M, 2008, VOIC COMM ITG C DE A, P1 Zielinski S, 2008, J AUDIO ENG SOC, V56, P427 NR 26 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV-DEC PY 2013 VL 55 IS 10 BP 1047 EP 1063 DI 10.1016/j.specom.2013.06.010 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 215RH UT WOS:000324227500010 ER PT J AU Short, G Hirose, K Minematsu, N AF Short, Greg Hirose, Keikichi Minematsu, Nobuaki TI Japanese lexical accent recognition for a CALL system by deriving classification equations with perceptual experiments SO SPEECH COMMUNICATION LA English DT Article DE Japanese pitch accent; Automatic recognition; Perception; Fundamental frequency; Resynthesis; Stimulus continua AB For non-native learners of Japanese, the pitch accent can be cumbersome to acquire without proper instruction. A Computer Assisted Language Learning (CALL) system could aid these learners in this acquisition provided that it can generate helpful feedback based on automatic analysis of the learner's utterance. For this, it is necessary to consider that the characteristics of a given learner's Japanese production will be largely influenced by his or her native tongue. For example, non-natives may produce pitch contours that natives do not produce. A standard approach to carry out recognition for error detection is to use a machine learning algorithm making use of an array composed of a variety of features. However, a method motivated by perceptual analysis may be better for a CALL system. With such a method, it should be possible to better understand the human recognition process and the causal relationships between contour and perception, which could be useful for feedback. 
Also, since accent recognition is a perceptual process, it may be possible to improve automatic recognition for non-native speech with such a method. Thus, we carry out listening tests making use of experiments using resynthesized speech to construct a method. First, we inspect which variables the probability of a pitch level transition is dependent on, and from this inspection, derive equations to calculate the probability at the disyllable level. Then, to recognize the word-level pattern, the location of each transition was determined from the probabilities for each two syllable pair. This method makes it possible to recognize all pitch patterns and to give more in-depth feedback. We conduct recognition experiments using these functions and achieve results that performed comparably to the inter-labeler agreement rate and outperformed SVM-based methods for non-native speech. (C) 2013 Elsevier B.V. All rights reserved. C1 [Short, Greg; Hirose, Keikichi] Univ Tokyo, Grad Sch Informat Sci & Technol, Tokyo, Japan. [Minematsu, Nobuaki] Univ Tokyo, Grad Sch Engn, Tokyo, Japan. RP Short, G (reprint author), Univ Tokyo, Grad Sch Informat Sci & Technol, Tokyo, Japan. EM shortg29@gmail.com CR Boersma Paul, 2012, PRAAT DOING PHONETIC Chang C.-C., 2011, ACM T INTEL SYST TEC, V2, P1, DOI DOI 10.1145/1961189.1961199 Fujisaki H., 1968, 6TH INT C AC, P95 Fujisaki Hiroya, 1977, NIHONGO 5 ONIN, P63 Hirano-Cook E., 2011, THESIS Ishi C, 2001, I ELECT INFORM COMMU, V101, P65 Ishi C. T, 2003, SPEECH COMMUN, V101, P23 Kato S, 2012, P SPEECH PROS, V2, P198 Kawai G, 1999, P EUR 99, V1, P177 Kumagai Y, 1999, IPSJ SIG NOTES, V99, P22 Lee A., 2001, P EUR C SPEECH COMM, P1691 Liberman A, 1957, J EXP PSYCHOL, P358 Masuda-Katsuse I., 2006, Acoustical Science and Technology, V27, DOI 10.1250/ast.27.97 Mattingly I, 1971, COGNITIVE PSYCHOL, P131 Minematsu N, 1996, TECHNICAL REPORT IEI, P69 Nakagawa C., 2002, TOKYOGO AKUSENTO SHU, P73 Neri A, 2002, P CALL C, P179 NHK Broadcasting Culture Research Institute, 2005, JAP ACC DICT Oyakawa T, 1971, 19712 U CAL BERK LIN, V1971 Short G, 2011, SP2010128 IEICE, P79 Short G, 2012, NONNATIVE CORPUS Shport I., 2008, ACQUISITION JAPANESE, P165 Sinha A., 2009, INT J PSYCHOL COUNSE, V1, P117 Takahashi S, 1999, INT WORKSH EMBR IMPL, P23 Valbret H., 1992, SPEECH COMMUN, V1, P145 Yoshimura T, 1994, P AUT M AC SOC JAP, P173 NR 26 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV-DEC PY 2013 VL 55 IS 10 BP 1064 EP 1080 DI 10.1016/j.specom.2013.07.002 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 215RH UT WOS:000324227500011 ER PT J AU Zhang, Y Zhao, YX AF Zhang, Yi Zhao, Yunxin TI Modulation domain blind speech separation in noisy environments SO SPEECH COMMUNICATION LA English DT Article DE Time-frequency masking; Direction of arrival; Modulation frequency; Blind speech separation ID INDEPENDENT COMPONENT ANALYSIS; PERMUTATION ALIGNMENT; NONSTATIONARY SOURCES; SPECTRAL SUBTRACTION; MIXTURE-MODELS; FREQUENCY; LOCALIZATION; SIGNALS; REAL AB We propose a noise robust blind speech separation (BSS) method by using two microphones. We perform BSS in the modulation domain to take advantage of the improved signal sparsity and reduced musical tone noise in this domain over the conventional acoustic frequency domain processing. 
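A modulation-domain representation of the kind assumed by the processing steps described next can be formed by taking an acoustic STFT and then a second STFT along time for each acoustic-frequency bin. The sketch below illustrates only that representation, not the paper's MRISS, ALMM or masking stages; the window lengths and the toy signal are illustrative choices.

```python
# Minimal sketch of a modulation-domain representation (not the paper's full pipeline):
# an acoustic STFT followed by a second STFT across frames for each acoustic-frequency
# bin, giving acoustic-frequency x modulation-frequency coefficients.
import numpy as np
from scipy.signal import stft

fs = 8000
t = np.arange(0, 2.0, 1.0 / fs)
x = np.sin(2 * np.pi * 500 * t) * (1 + 0.8 * np.sin(2 * np.pi * 6 * t))  # 6 Hz modulated tone

# Stage 1: acoustic STFT (32 ms frames, 8 ms hop).
f_ac, t_ac, X = stft(x, fs=fs, nperseg=256, noverlap=192)
frame_rate = fs / (256 - 192)                  # frames per second along the time axis

# Stage 2: STFT of each bin's magnitude trajectory -> modulation spectrum.
f_mod, t_mod, M = stft(np.abs(X), fs=frame_rate, nperseg=32, noverlap=24, axis=-1)
print("acoustic bins x modulation bins x modulation frames:", M.shape)
```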
We first use modulation domain real and imaginary spectral subtraction (MRISS) to enhance both magnitude and phase spectra of the noisy speech mixture inputs. We then estimate the direction of arrivals (DOAs) of the speech sources from subband inter-sensor phase differences (IPDs) by using an asymmetric Laplacian mixture model (ALMM), cluster the full-band IPDs via the estimated DOAs, and perform time frequency masking to separate the source signals, all in the modulation domain. Experimental evaluations in five types of noises have shown that the performance of the proposed method is robust in 0-10 dB SNRs and it is superior to acoustic domain separation without MRISS. (C) 2013 Elsevier B.V. All rights reserved. C1 [Zhang, Yi; Zhao, Yunxin] Univ Missouri, Dept Comp Sci, Columbia, MO 65211 USA. RP Zhao, YX (reprint author), Univ Missouri, Dept Comp Sci, Columbia, MO 65211 USA. EM yzcb3@mail.missouri.edu; ZhaoY@missouri.edu CR Aichner R, 2006, SIGNAL PROCESS, V86, P1260, DOI 10.1016/j.sigpro.2005.06.022 Amari S, 1997, FIRST IEEE SIGNAL PROCESSING WORKSHOP ON SIGNAL PROCESSING ADVANCES IN WIRELESS COMMUNICATIONS, P101, DOI 10.1109/SPAWC.1997.630083 Araki S, 2003, EURASIP J APPL SIG P, V2003, P1157, DOI 10.1155/S1110865703305074 Araki S., 2003, IEEE INT C AC SPEECH, P509 Araki S, 2006, INT CONF ACOUST SPEE, P33 ASANO F, 2001, ACOUST SPEECH SIG PR, P2729 Atlas L, 2003, EURASIP J APPL SIG P, V2003, P668, DOI 10.1155/S1110865703305013 BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 Choi S, 2000, ELECTRON LETT, V36, P848, DOI 10.1049/el:20000623 COMON P, 1994, SIGNAL PROCESS, V36, P287, DOI 10.1016/0165-1684(94)90029-9 Douglas SC, 2003, SPEECH COMMUN, V39, P65, DOI 10.1016/S0167-6393(02)00059-6 Ellis DPW, 2006, INT CONF ACOUST SPEE, P957 Falk T. H., 2007, P ISCA C INT SPEECH, P970 Hazewinkel M., 2001, ENCY MATH Hu R, 2008, EURASIP, V4, DOI 101155/2008/349214 HUANG J, 1995, IEEE T INSTRUM MEAS, V44, P733 Hurley N, 2009, IEEE T INFORM THEORY, V55, P4723, DOI 10.1109/TIT.2009.2027527 Ichir MM, 2006, IEEE T IMAGE PROCESS, V15, P1887, DOI 10.1109/TIP.2006.877068 IKRAM MZ, 2002, ACOUST SPEECH SIG PR, P881 Jian M., 1998, P IEEE INT S CIRC SY, V5, P293 Joho M., 2000, INDEPENDENT COMPONEN, P81 JOURJINE A, 2000, ACOUST SPEECH SIG PR, P2985 Kawamoto M, 1998, NEUROCOMPUTING, V22, P157, DOI 10.1016/S0925-2312(98)00055-1 Khademul M, 2007, IEEE T AUDIO SPEECH, V15, P893 Kinnunen T, 2008, P ISCA SPEAK LANG RE KURITA S, 2000, ACOUST SPEECH SIG PR, P3140 Lee DD, 1999, NATURE, V401, P788 Lu X, 2010, SPEECH COMMUN, V52, P1, DOI 10.1016/j.specom.2009.08.006 Winter S, 2007, EURASIP J ADV SIG PR, DOI 10.1155/2007/24717 Mandel MI, 2010, IEEE T AUDIO SPEECH, V18, P382, DOI 10.1109/TASL.2009.2029711 Martin R, 2001, IEEE T SPEECH AUDI P, V9, P504, DOI 10.1109/89.928915 Matsuoka K., 2001, P INT C IND COMP AN, P722 Mitchell T.M., 1997, MACHINE LEARNING Mitianoudis N, 2007, IEEE T AUDIO SPEECH, V15, P1818, DOI 10.1109/TASL.2007.899281 Paliwal K, 2010, SPEECH COMMUN, V52, P450, DOI 10.1016/j.specom.2010.02.004 Paliwal K. K., 2011, P INTERSPEECH FLOR I, P1209 Papoulis A, 1991, PROBABILITY RANDOM V, V3rd Parra L, 2000, IEEE T SPEECH AUDI P, V8, P320, DOI 10.1109/89.841214 Pearlmutter BA, 2004, LECT NOTES COMPUT SC, V3195, P478 Peterson J. 
M., 2003, P ICASSP 2003, VVI, P581 Reynolds DA, 2000, DIGIT SIGNAL PROCESS, V10, P19, DOI 10.1006/dspr.1999.0361 Roman N, 2003, J ACOUST SOC AM, V114, P2236, DOI 10.1121/1.1610463 Roweis ST, 2001, ADV NEUR IN, V13, P793 RWCP, 2001, RWCP SOUND SCEN DAT Sawada H., 2007, P IEEE WORKSH APPL S, P139 Sawada H, 2004, IEEE T SPEECH AUDI P, V12, P530, DOI 10.1109/TSA.2004.832994 Sawada H, 2011, IEEE T AUDIO SPEECH, V19, P516, DOI 10.1109/TASL.2010.2051355 Schmidt M. N, 2006, P ICSLP, V2, P2 Schobben L, 2002, IEEE T SIGNAL PROCES, V50, P1855 Thompson JK, 2003, P IEEE INT C AC SPEE, V5, P397 VIELVA L, 2002, ACOUST SPEECH SIG PR, P3049 Vincent E, 2006, IEEE T AUDIO SPEECH, V14, P1462, DOI 10.1109/TSA.2005.858005 Vu D. H. T, 2008, ITG C VOIC COMM, P1 Yamanouchi K, 2007, PROC WRLD ACAD SCI E, V19, P113 Yilmaz O, 2004, IEEE T SIGNAL PROCES, V52, P1830, DOI 10.1109/TSP.2004.828896 Yu KM, 2005, COMMUN STAT-THEOR M, V34, P1867, DOI 10.1080/03610920500199018 ZADEH LA, 1950, P IRE, V38, P291, DOI 10.1109/JRPROC.1950.231083 Zhang Y, 2013, SPEECH COMMUN, V55, P509, DOI 10.1016/j.specom.2012.09.005 ZUE V, 1990, SPEECH COMMUN, V9, P351, DOI 10.1016/0167-6393(90)90010-7 NR 59 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV-DEC PY 2013 VL 55 IS 10 BP 1081 EP 1099 DI 10.1016/j.specom.2013.06.014 PG 19 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 215RH UT WOS:000324227500012 ER PT J AU Milner, B AF Milner, Ben TI Enhancing speech at very low signal-to-noise ratios using non-acoustic reference signals SO SPEECH COMMUNICATION LA English DT Article DE Noise estimation; Speech enhancement; Speech quality; Speech intelligibility ID SPECTRAL AMPLITUDE ESTIMATOR; ENHANCEMENT ALGORITHMS; RECOGNITION AB An investigation is made into whether non-acoustic noise reference signals can be used for noise estimation, and subsequently speech enhancement, in very low signal-to-noise ratio (SNR) environments where conventional noise estimation methods may be less effective. The environment selected is Formula 1 motor racing, where SNRs frequently fall to -15 dB. Analysis reveals three primary noise sources (engine, airflow and tyre) which are found to relate to data parameters measured by the car's onboard computer, namely engine speed, road speed and throttle opening. This leads to the proposal of a two-stage noise reduction system that first uses engine speed to cancel engine noise within an adaptive filtering framework. Secondly, a maximum a posteriori (MAP) framework is developed to estimate airflow and tyre noise from the data parameters; this estimated noise is then removed. Objective measurements comparing noise estimation with conventional methods show the proposed method to be substantially more accurate. Subjective quality tests using comparative mean opinion score listening tests found that the proposed method achieves +1.43 compared to +0.66 for a conventional method. In subjective intelligibility tests, 81.8% of words were recognised correctly using the proposed method in comparison to 76.7% with no noise compensation and 66.0% for the conventional method. (C) 2013 Published by Elsevier B.V. C1 Univ E Anglia, Sch Comp Sci, Norwich NR4 7TJ, Norfolk, England. RP Milner, B (reprint author), Univ E Anglia, Sch Comp Sci, Norwich NR4 7TJ, Norfolk, England.
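The engine-noise cancellation stage described in this abstract can be pictured with a generic normalised-LMS noise canceller, sketched below in Python. The 120 Hz "engine" reference here is a synthetic stand-in for an engine-speed-derived reference signal, and the filter order and step size are arbitrary; none of this reproduces the paper's actual adaptive filtering framework.

import numpy as np

def nlms_cancel(noisy, ref, order=32, mu=0.01):
    # Adaptive noise cancellation: filter the reference and subtract it from the mic signal.
    w = np.zeros(order)
    out = np.zeros_like(noisy)
    for n in range(order, len(noisy)):
        x = ref[n - order:n][::-1]            # most recent reference samples
        e = noisy[n] - w @ x                  # error = enhanced output sample
        w += mu * e * x / (x @ x + 1e-8)      # normalised LMS weight update
        out[n] = e
    return out

fs = 8000
t = np.arange(2 * fs) / fs
speech = 0.3 * np.sin(2 * np.pi * 300 * t)                   # stand-in for the speech component
ref = np.sin(2 * np.pi * 120 * t)                            # synthetic engine-speed-derived reference
noisy = speech + 0.8 * np.sin(2 * np.pi * 120 * t + 0.7)     # engine harmonic leaking into the microphone
print(np.var(noisy - speech), np.var(nlms_cancel(noisy, ref) - speech))  # residual noise power before/after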
EM b.milner@uea.ac.uk CR Aboulnasr T, 1997, IEEE T SIGNAL PROCES, V45, P631, DOI 10.1109/78.558478 Afify M, 2009, IEEE T AUDIO SPEECH, V17, P1325, DOI 10.1109/TASL.2009.2018017 Cohen I, 2003, IEEE T SPEECH AUDI P, V11, P466, DOI 10.1109/TSA.2003.811544 EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 Fransen J., 1994, CUEDFINFENGTR192 Gillick L., 1989, ICASSP-89: 1989 International Conference on Acoustics, Speech and Signal Processing (IEEE Cat. No.89CH2673-2), DOI 10.1109/ICASSP.1989.266481 Hillier V, 2004, FUNDAMENTALS MOTOR V HIRSCH HG, 1995, INT CONF ACOUST SPEE, P153, DOI 10.1109/ICASSP.1995.479387 Hu Y, 2006, INT CONF ACOUST SPEE, P153 ITU-T, 1996, P 800 METH SUB DET T ITU-T, 2003, P 835 SUB TEST METH KWONG RH, 1992, IEEE T SIGNAL PROCES, V40, P1633, DOI 10.1109/78.143435 Linde Y., 1980, IEEE T COMMUN, V28, P94 Loizou PC, 2011, IEEE T AUDIO SPEECH, V19, P47, DOI 10.1109/TASL.2010.2045180 Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst Martin R, 2001, IEEE T SPEECH AUDI P, V9, P504, DOI 10.1109/89.928915 Milner B., 2011, INTERSPEECH Pearce D., 2000, ICSLP, V4, P29 Puder H., 2003, EUROSPEECH, P1397 Puder H., 2000, EUSIPCO, P1851 Rangachari S., 2006, SPEECH COMMUN, V48, P22 Taghia J., 2011, ICASSP Therrien C., 1992, DISCRETE RANDOM SIGN TUCKER R, 1992, IEE PROC-I, V139, P377 Vaseghi S., 2000, ICASSP, P213 Vaseghi S. V., 2006, ADV DIGITAL SIGNAL P Widrow B, 1985, ADAPTIVE SIGNAL PROC NR 28 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2013 VL 55 IS 9 BP 879 EP 892 DI 10.1016/j.specom.2013.04.004 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 189US UT WOS:000322294400001 ER PT J AU Zhang, XR Demuynck, K Van Hamme, H AF Zhang, Xueru Demuynck, Kris Van Hamme, Hugo TI Rapid speaker adaptation in latent speaker space with non-negative matrix factorization SO SPEECH COMMUNICATION LA English DT Article DE Speaker adaptation; NMF; SAT; fMLLR; Eigenvoice ID MAXIMUM-LIKELIHOOD AB A novel speaker adaptation algorithm based on Gaussian mixture weight adaptation is described. A small number of latent speaker vectors are estimated with non-negative matrix factorization (NMF). These latent vectors encode the distinctive systematic patterns of Gaussian usage observed when modeling the individual speakers that make up the training data. Expressing the speaker dependent Gaussian mixture weights as a linear combination of a small number of latent vectors reduces the number of parameters that must be estimated from the enrollment data. The resulting fast adaptation algorithm, using 3 s of enrollment data only, achieves similar performance as fMLLR adapting on 100+ s of data. In order to learn richer Gaussian usage patterns from the training data, the NMF-based weight adaptation is combined with vocal tract length normalization (VTLN) and speaker adaptive training (SAT), or with a simple Gaussian exponentiation scheme that lowers the dynamic range of the Gaussian likelihoods. Evaluation on the Wall Street Journal tasks shows a 5% relative word error rate (WER) reduction over the speaker independent recognition system which already incorporates VTLN. The WER can be lowered further by combining weight adaptation with Gaussian mean adaptation by means of eigenvoice speaker adaptation. (C) 2013 Elsevier B.V. All rights reserved. 
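The weight adaptation idea in this abstract, expressing a speaker's Gaussian mixture weights as a non-negative combination of a few latent vectors, can be sketched with an off-the-shelf NMF as below. The matrix sizes, the toy Gaussian-usage counts and the enrollment step are invented for illustration; this is not the authors' estimator.

import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(1)
n_speakers, n_gauss, n_latent = 40, 200, 5
usage = rng.gamma(shape=0.5, size=(n_speakers, n_gauss))      # toy per-speaker Gaussian-usage counts
weights = usage / usage.sum(axis=1, keepdims=True)            # speaker-dependent mixture weights

nmf = NMF(n_components=n_latent, init="nndsvda", max_iter=500)
H = nmf.fit_transform(weights)      # per-speaker coordinates in the latent speaker space
V = nmf.components_                 # latent vectors of systematic Gaussian usage

# "Enrollment": estimate latent coordinates for an unseen speaker from sparse counts,
# then rebuild a smoothed, adapted weight vector as their non-negative combination.
new_counts = rng.gamma(shape=0.5, size=(1, n_gauss))
h_new = nmf.transform(new_counts / new_counts.sum())
adapted = h_new @ V
adapted /= adapted.sum()
print(H.shape, V.shape, adapted.shape)

Because only n_latent coefficients are estimated at enrollment, very little adaptation data is needed, which is the point of working in a low-dimensional latent speaker space.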
C1 [Zhang, Xueru; Demuynck, Kris; Van Hamme, Hugo] Katholieke Univ Leuven, Dept Elect Engn ESAT, B-3001 Louvain, Belgium. RP Zhang, XR (reprint author), Katholieke Univ Leuven, Dept Elect Engn ESAT, Kasteelpk Arenberg 10,Bus 2441, B-3001 Louvain, Belgium. EM Xueru.Zhang@esat.kuleuven.be; Kris.Demuynck@esat.kuleuven.be; Hugo.Vanhamme@esat.kuleuven.be RI Van hamme, Hugo/D-6581-2012 FU Dutch-Flemish IMPact program (NWO-iMinds) FX This work is funded by the Dutch-Flemish IMPact program (NWO-iMinds). CR Anastasakos T, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P1137 Anastasakos T, 1997, INT CONF ACOUST SPEE, P1043, DOI 10.1109/ICASSP.1997.596119 Baum L. E., 1972, INEQUALITIES, V3, P1 CHEN KT, 2000, P ICSLP, V3, P742 Chen S. F., 1998, TECHNICAL REPORT DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1 Demuynck K., 2001, THESIS KATHOLIEKE U Duchateau J., 2006, P ITRW SPEECH REC IN, P59 Duchateau J, 2008, INT CONF ACOUST SPEE, P4269, DOI 10.1109/ICASSP.2008.4518598 Gales MJF, 1998, COMPUT SPEECH LANG, V12, P75, DOI 10.1006/csla.1998.0043 Gaussier E., 2005, P 28 ANN INT ACM SIG, P601, DOI DOI 10.1145/1076034.1076148 Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278 Hofmann T, 1999, SIGIR'99: PROCEEDINGS OF 22ND INTERNATIONAL CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, P50 Hofmann T., 1999, P 15 C UNC AI Huang X., 2001, SPOKEN LANGUAGE PROC Kuhn R, 2000, IEEE T SPEECH AUDI P, V8, P695, DOI 10.1109/89.876308 Lee DD, 2001, ADV NEUR IN, V13, P556 Lee DD, 1999, NATURE, V401, P788 LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010 Leggetter C.J., 1994, CUEDFINFENGTR181 Nguyens P., 1998, THESIS I EURECOM Smaragdis P., 2003, IEEE WORKSH APPL SIG, P177 Smaragdis P., 2006, P ADV MOD AC PROC WO Tuerk C., 1993, P EUR, P351 Van hamme H., 2008, P ISCA TUT RES WORKS Virtanen T., 2007, IEEE T AUDIO SPEECH, V15, P291 Woodland P.C., 2001, ITRW ADAPTATION METH, P11 Xu W, 2003, P 26 ANN INT ACM SIG, P267, DOI DOI 10.1145/860435.860485 Zavaliagkos G, 1996, INT CONF ACOUST SPEE, P725, DOI 10.1109/ICASSP.1996.543223 Zhang XR, 2011, INT CONF ACOUST SPEE, P4456 Zhang XR, 2012, 2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), P4349 NR 31 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2013 VL 55 IS 9 BP 893 EP 908 DI 10.1016/j.specom.2013.05.001 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 189US UT WOS:000322294400002 ER PT J AU Rasilo, H Rasanen, O Laine, UK AF Rasilo, Heikki Rasanen, Okko Laine, Unto K. TI Feedback and imitation by a caregiver guides a virtual infant to learn native phonemes and the skill of speech inversion SO SPEECH COMMUNICATION LA English DT Article DE Language acquisition; Speech inversion; Articulatory modeling; Imitation; Phonetic learning; Caregiver feedback ID VOCAL-TRACT; MODEL; ACQUISITION; PERCEPTION; VOWEL; SHAPES; DISCRIMINATION; CATEGORIES; MOVEMENTS; SOUNDS AB Despite large-scale research, development of robust machines for imitation and inversion of human speech into articulatory movements has remained an unsolved problem. 
We propose a set of principles that can partially explain real infants' speech acquisition processes and the emergence of imitation skills and demonstrate a simulation where a learning virtual infant (LeVI) learns to invert and imitate a virtual caregiver's speech. Based on recent findings in infants' language acquisition, LeVI learns the phonemes of his native language in a babbling phase using only caregiver's feedback as guidance and to map acoustically differing caregiver's speech into its own articulation in a phase where LeVI is imitated by the caregiver with similar, but not exact, utterances. After the learning stage, LeVI is able to recognize vowels from the virtual caregiver's VCVC utterances perfectly and all 25 Finnish phonemes with an average accuracy of 88.42%. The place of articulation of consonants is recognized with an accuracy of 96.81%. LeVI is also able to imitate the caregiver's speech since the recognition occurs directly in the domain of articulatory programs for phonemes. The learned imitation ability (speech inversion) is strongly language dependent since it is based on the phonemic programs learned from the caregiver. The findings suggest that caregivers' feedback can act as an important signal in guiding infants' articulatory learning, and that the speech inversion problem can be effectively approached from the perspective of early speech acquisition. (C) 2013 Elsevier B.V. All rights reserved. C1 [Rasilo, Heikki; Rasanen, Okko; Laine, Unto K.] Aalto Univ, Sch Elect Engn, Dept Signal Proc & Acoust, FI-00076 Aalto, Finland. [Rasilo, Heikki] Vrije Univ Brussel, Artificial Intelligence Lab, B-1050 Brussels, Belgium. RP Rasilo, H (reprint author), Aalto Univ, Sch Elect Engn, Dept Signal Proc & Acoust, POB 13000, FI-00076 Aalto, Finland. EM heikki.rasilo@aalto.fi; okko.rasane-n@aalto.fi; unto.laine@aalto.fi FU graduate school of Electronics, Telecommunications and Automation (ETA); Finnish Foundation for Technology Promotion (TES); KAUTE foundation; Nokia foundation FX This study was supported by the graduate school of Electronics, Telecommunications and Automation (ETA), Finnish Foundation for Technology Promotion (TES), KAUTE foundation and the Nokia foundation. The authors would like to thank Bart de Boer for his valuable comments on the manuscript. CR Ananthakrishnan G., 2011, P INTERSPEECH 2011, P765 ATAL BS, 1978, J ACOUST SOC AM, V63, P1535, DOI 10.1121/1.381848 Beaumont S. 
L., 1993, 1 LANGUAGE, V13, P235, DOI 10.1177/014272379301303805 Bickley C.A., 1989, THESIS MIT Bresch E, 2006, J ACOUST SOC AM, V120, P1791, DOI 10.1121/1.2335423 D'Ausillo A, 2009, CURR BIOL, V19, P381, DOI 10.1016/j.cub.2009.01.017 DAVIS BL, 1995, J SPEECH HEAR RES, V38, P1199 EIMAS PD, 1971, SCIENCE, V171, P303, DOI 10.1126/science.171.3968.303 ELBERS L, 1982, COGNITION, V12, P45, DOI 10.1016/0010-0277(82)90029-4 FLANAGAN JL, 1980, J ACOUST SOC AM, V68, P780, DOI 10.1121/1.384817 FLASH T, 1985, J NEUROSCI, V5, P1688 Goldstein MH, 2008, PSYCHOL SCI, V19, P515, DOI 10.1111/j.1467-9280.2008.02117.x Goldstein MH, 2003, P NATL ACAD SCI USA, V100, P8030, DOI 10.1073/pnas.1332441100 Goodluck H, 1991, LANGUAGE ACQUISITION Gros-Louis J, 2006, INT J BEHAV DEV, V30, P509, DOI 10.1177/0165025406071914 Guenther FH, 2006, J COMMUN DISORD, V39, P350, DOI 10.1016/j.jcomdis.2006.06.013 Guenther FH, 2006, BRAIN LANG, V96, P280, DOI 10.1016/j.bandl.2005.06.001 GUENTHER FH, 1995, PSYCHOL REV, V102, P594 Hiroya S, 2004, IEEE T SPEECH AUDI P, V12, P175, DOI 10.1109/TSA.2003.822636 Hornstein J., 2007, IEEE RSJ INT C INT R, P3442 Hornstein J., 2007, S LANG ROB AV PORT Hornstein J., 2008, IROS 2008 WORKSH FRO Houston DM, 2000, J EXP PSYCHOL HUMAN, V26, P1570, DOI 10.1037//0096-1523.26.5.1570 Howard IS, 2011, MOTOR CONTROL, V15, P85 HUANG XD, 1992, IEEE T SIGNAL PROCES, V40, P1062, DOI 10.1109/78.134469 Ishihara H, 2009, IEEE T AUTON MENT DE, V1, P217, DOI 10.1109/TAMD.2009.2038988 Jones SS, 2007, PSYCHOL SCI, V18, P593, DOI 10.1111/j.1467-9280.2007.01945.x KENT RD, 1982, J ACOUST SOC AM, V72, P353, DOI 10.1121/1.388089 Kokkinaki T, 2000, J REPROD INFANT PSYC, V18, P173 KUHL PK, 1991, PERCEPT PSYCHOPHYS, V50, P93, DOI 10.3758/BF03212211 Kuhl PK, 1996, J ACOUST SOC AM, V100, P2425, DOI 10.1121/1.417951 LIBERMAN AM, 1985, COGNITION, V21, P1, DOI 10.1016/0010-0277(85)90021-6 MAEDA S, 1990, NATO ADV SCI I D-BEH, V55, P131 Markey K. L., 1994, THESIS U COLORADO BO MCGURK H, 1976, NATURE, V264, P746, DOI 10.1038/264746a0 Meltzoff AN, 1999, J COMMUN DISORD, V32, P251, DOI 10.1016/S0021-9924(99)00009-X Meltzoff AN, 1990, SELF TRANSITION INFA, P139 MERMELST.P, 1973, J ACOUST SOC AM, V53, P1070, DOI 10.1121/1.1913427 Miura K, 2007, ADV ROBOTICS, V21, P1583 Miura K, 2008, INT C DEVEL LEARN, P262, DOI 10.1109/DEVLRN.2008.4640840 MIYAWAKI K, 1975, PERCEPT PSYCHOPHYS, V18, P331, DOI 10.3758/BF03211209 Narayanan S., 2011, P INT, P837 Oller D. K., 2000, EMERGENCE SPEECH CAP Ouni S., 2005, J ACOUST SOC AM, V118, P411 Plummer A.R., 2012, P INT PORTL OR US Rasanen O, 2012, PATTERN RECOGN, V45, P606, DOI 10.1016/j.patcog.2011.05.005 Rasanen O., 2012, Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2012), DOI 10.1109/ICASSP.2012.6289052 Rasanen O., 2009, P INT 09 BRIGHT ENGL, P852 Rasanen O., 2012, P INT 2012 PORTL OR Rasanen O, 2012, SPEECH COMMUN, V54, P975, DOI 10.1016/j.specom.2012.05.001 Rasilo H., 2013, WORKSH SPEECH UNPUB Rasilo H, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 3 AND 4, P2414 Rasilo H., 2011, P INT 11 FLOR IT, P2693 Rasilo H., 2013, ARTICULATORY MODEL S, DOI DOI 10.1016/J.SPEC0M.2013.05.002 Schroeter J, 1994, IEEE T SPEECH AUDI P, V2, P133, DOI 10.1109/89.260356 Sorokin VN, 2000, SPEECH COMMUN, V30, P55, DOI 10.1016/S0167-6393(99)00031-X Stark RE, 1980, CHILD PHONOLOGY, P73 Tikhonov A. 
N., 1977, SOLUTION ILL POSED P Toda T., 2004, P ICSLP JEJ ISL KOR, P1129 TREHUB SE, 1976, CHILD DEV, V47, P466, DOI 10.2307/1128803 Vaz M.J.L.R.M., 2009, THESIS U MINHO ESCOL WERKER JF, 1981, CHILD DEV, V52, P349, DOI 10.2307/1129249 WERKER JF, 1984, INFANT BEHAV DEV, V7, P49, DOI 10.1016/S0163-6383(84)80022-3 Westermann G, 2004, BRAIN LANG, V89, P393, DOI 10.1016/S0093-934X(03)00345-6 Wiik K., 1965, THESIS U TURKU Wilson SM, 2004, NAT NEUROSCI, V7, P701, DOI 10.1038/nn1263 Yoshikawa Y, 2003, CONNECT SCI, V15, P245, DOI 10.1080/09540090310001655075 NR 67 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2013 VL 55 IS 9 BP 909 EP 931 DI 10.1016/j.specom.2013.05.002 PG 23 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 189US UT WOS:000322294400003 ER PT J AU Bao, XL Zhu, J AF Bao, Xulei Zhu, Jie TI An Improved Method for Late-Reverberant Suppression Based on Statistical Model SO SPEECH COMMUNICATION LA English DT Article DE Dereverberation; Late reverberant spectral variance estimator; Shape parameter; Maximum likelihood; Probabilistic speech model ID SPEECH DEREVERBERATION; LINEAR PREDICTION; TIME; RECOGNITION; SIGNALS AB The model-based late reverberant spectral variance (LRSV) estimator is considered an effective approach to speech dereverberation, as it constructs a simple expression for the LRSV from the past spectral variance of the reverberant signal. In this paper, we develop a new LRSV estimator based on time-varying room impulse responses (RIRs), under the assumption that in a noisy and reverberant environment the background noise is composed of reverberant noise and direct-path noise. In the LRSV estimator, more than one past spectral variance of the reverberant signal is used to obtain a smoother shape parameter, which leads to better dereverberation performance than the classic methods. Since this shape parameter is affected by the LRSV estimation error and may in turn affect subsequent LRSV estimation, we combine the smoother-shape-parameter-based LRSV estimator with a maximum likelihood (ML) algorithm in the spectral domain to obtain a more reliable LRSV estimate. Furthermore, we apply the proposed LRSV estimator before rather than after speech enhancement in noisy and reverberant environments. Experimental results demonstrate that our new LRSV estimator is more effective for both noise-free and noisy reverberant speech. (C) 2013 Elsevier B.V. All rights reserved. C1 [Bao, Xulei; Zhu, Jie] Shanghai Jiao Tong Univ, Dept Elect Engn, Shanghai 200240, Peoples R China. RP Bao, XL (reprint author), Shanghai Jiao Tong Univ, Dept Elect Engn, Shanghai 200240, Peoples R China.
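For orientation, the classic model-based LRSV idea that this paper builds on can be pictured as follows: under an exponentially decaying RIR model, the late-reverberant spectral variance is a delayed and attenuated copy of the reverberant signal's spectral variance. The Python sketch below uses that textbook form with illustrative parameter values (T60, frame shift, late-reverberation delay); it is a sketch of the classic estimator only, not the improved estimator proposed here.

import numpy as np

def lrsv_exponential(rev_psd, t60=0.6, frame_shift=0.016, late_delay_frames=4):
    # rev_psd: (frames, bins) spectral variances of the observed reverberant signal.
    decay = 3.0 * np.log(10.0) / t60                          # decay constant of the exponential RIR model
    atten = np.exp(-2.0 * decay * frame_shift * late_delay_frames)
    lrsv = np.zeros_like(rev_psd)
    lrsv[late_delay_frames:, :] = atten * rev_psd[:-late_delay_frames, :]
    return lrsv                                               # per-frame, per-bin late-reverberant variance

rng = np.random.default_rng(4)
rev_psd = rng.random((100, 129)) + 0.1                        # toy reverberant spectral variances
gain = np.maximum(1.0 - lrsv_exponential(rev_psd) / (rev_psd + 1e-12), 0.05)  # spectral-subtraction gain
print(gain.shape, float(gain.min()), float(gain.max()))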
EM qunzhong@sjtu.edu.cn CR Bao X., 2012, P IEEE INT C MULT EX, P467 Delcroix M, 2009, IEEE T AUDIO SPEECH, V17, P324, DOI 10.1109/TASL.2008.2010214 Erkelens JS, 2010, IEEE T AUDIO SPEECH, V18, P1746, DOI 10.1109/TASL.2010.2051271 Erkelens JS, 2010, INT CONF ACOUST SPEE, P4706, DOI 10.1109/ICASSP.2010.5495178 Erkelens JS, 2008, IEEE T AUDIO SPEECH, V16, P1112, DOI 10.1109/TASL.2008.2001108 Gillespie B., 2003, P 2003 IEEE INT C AC, VI, P676 Gomez R., 2009, P INTERSPEECH, P1223 Habets EAP, 2009, IEEE SIGNAL PROC LET, V16, P770, DOI 10.1109/LSP.2009.2024791 Hikichi T, 2007, EURASIP J ADV SIG PR, DOI 10.1155/2007/31013 Huang YTA, 2005, IEEE T SPEECH AUDI P, V13, P882, DOI 10.1109/TSA.2005.851941 Jeub M., 2010, P INT C DIGT SIGN PR, P1 Kingsbury BED, 1997, INT CONF ACOUST SPEE, P1259, DOI 10.1109/ICASSP.1997.596174 Kinoshita K, 2006, INT CONF ACOUST SPEE, P817 Kumar K, 2010, INT CONF ACOUST SPEE, P4282, DOI 10.1109/ICASSP.2010.5495667 Lebert K., 2001, ACTA ACOUST, V87, P359 Lu XG, 2011, COMPUT SPEECH LANG, V25, P571, DOI 10.1016/j.csl.2010.10.002 Nakatani T, 2007, IEEE T AUDIO SPEECH, V15, P80, DOI 10.1109/TASL.2006.872620 Nakatani T, 2008, INT CONF ACOUST SPEE, P85, DOI 10.1109/ICASSP.2008.4517552 Nakatani T, 2010, IEEE T AUDIO SPEECH, V18, P1717, DOI 10.1109/TASL.2010.2052251 SCHROEDE.MR, 1965, J ACOUST SOC AM, V37, P409, DOI 10.1121/1.1909343 Sehr A, 2010, IEEE T AUDIO SPEECH, V18, P1676, DOI 10.1109/TASL.2010.2050511 Yoshioka T, 2009, IEEE T AUDIO SPEECH, V17, P231, DOI 10.1109/TASL.2008.2008042 NR 22 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2013 VL 55 IS 9 BP 932 EP 940 DI 10.1016/j.specom.2013.04.003 PG 9 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 189US UT WOS:000322294400004 ER PT J AU Santos, JF Cosentino, S Hazrati, O Loizou, PC Falk, TH AF Santos, Joao F. Cosentino, Stefano Hazrati, Oldooz Loizou, Philipos C. Falk, Tiago H. TI Objective speech intelligibility measurement for cochlear implant users in complex listening environments SO SPEECH COMMUNICATION LA English DT Article DE Intelligibility; Cochlear implants; Reverberation; Noise; Objective metrics ID NORMAL-HEARING; NOISE; REVERBERATION; PERCEPTION; QUALITY; MODEL; MASKING; IDENTIFICATION; STIMULATION; ENHANCEMENT AB Objective intelligibility measurement allows for reliable, low-cost, and repeatable assessment of innovative speech processing technologies, thus dispensing costly and time-consuming subjective tests. To date, existing objective measures have focused on normal hearing model, and limited use has been found for restorative hearing instruments such as cochlear implants (CIs). In this paper, we have evaluated the performance of five existing objective measures, as well as proposed two refinements to one particular measure to better emulate CI hearing, under complex listening conditions involving noise-only, reverberation-only, and noise-plus-reverberation. Performance is assessed against subjectively rated data. Experimental results show that the proposed CI-inspired objective measures outperformed all existing measures; gains by as much as 22% could be achieved in rank correlation. (c) 2013 Elsevier B.V. All rights reserved. C1 [Santos, Joao F.; Falk, Tiago H.] Inst Natl Rech Sci INRS EMT, Montreal, PQ, Canada. [Cosentino, Stefano] UCL, Ear Inst, London, England. [Hazrati, Oldooz; Loizou, Philipos C.] 
Univ Texas Dallas, Dept Elect Engn, Richardson, TX 75083 USA. RP Falk, TH (reprint author), Inst Natl Rech Sci INRS EMT, Montreal, PQ, Canada. EM jfsantos@emt.inrs.ca; stefano.cosenti-no.10@ucl.ac.uk; hazrati@utdallas.edu; falk@emt.inrs.ca FU Natural Sciences and Engineering Research Council of Canada; UCL; Neurelec; National Institute of Deafness and Other Communication Disorders Grant [R01 DC 010494] FX THF and JFS thank the Natural Sciences and Engineering Research Council of Canada for their financial support. SC acknowledges funding from UCL and Neurelec. PCL and OH were supported by a National Institute of Deafness and Other Communication Disorders Grant (R01 DC 010494). CR [Anonymous], 2004, P563 ITUT ANSI, 1997, S351997 ANSI Arai T, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P2490 Chen F, 2013, BIOMED SIGNAL PROCES, V8, P311, DOI 10.1016/j.bspc.2012.11.007 Chen F, 2012, J MED BIOL ENG, V32, P189, DOI 10.5405/jmbe.885 Chen F, 2011, EAR HEARING, V32, P331, DOI 10.1097/AUD.0b013e3181ff3515 Cosentino S., 2012, P INT C INF SCI SIGN, P4710 Dorman MF, 1997, J ACOUST SOC AM, V102, P2403, DOI 10.1121/1.419603 Drgas S, 2010, HEARING RES, V269, P162, DOI 10.1016/j.heares.2010.06.016 Dudley H, 1939, J ACOUST SOC AM, V11, P169, DOI 10.1121/1.1916020 Ewert SD, 2000, J ACOUST SOC AM, V108, P1181, DOI 10.1121/1.1288665 Falk TH, 2010, IEEE T AUDIO SPEECH, V18, P90, DOI 10.1109/TASL.2009.2023679 Falk TH, 2010, IEEE T AUDIO SPEECH, V18, P1766, DOI 10.1109/TASL.2010.2052247 GLASBERG BR, 1990, HEARING RES, V47, P103, DOI 10.1016/0378-5955(90)90170-T Goldsworthy RL, 2004, J ACOUST SOC AM, V116, P3679, DOI 10.1121/1.1804628 Hazrati O., 2012, INT J AUDIOLOGY Holube I, 1996, J ACOUST SOC AM, V100, P1703, DOI 10.1121/1.417354 ITU, 2001, P862 ITUT Kates JM, 2005, INT INTEG REL WRKSP, P53, DOI 10.1109/ASPAA.2005.1540166 Kokkinakis K, 2011, J ACOUST SOC AM, V129, P3221, DOI 10.1121/1.3559683 Kokkinakis K, 2011, J ACOUST SOC AM, V130, P1099, DOI 10.1121/1.3614539 Kokkinakis K, 2011, INT CONF ACOUST SPEE, P2420 Loizou PC, 2005, J ACOUST SOC AM, V118, P2791, DOI 10.1121/1.2065847 Lorenzi C., 2008, 1 INT S AUD AUD RES, P263 Ma JF, 2009, J ACOUST SOC AM, V125, P3387, DOI 10.1121/1.3097493 Malfait L, 2006, IEEE T AUDIO SPEECH, V14, P1924, DOI 10.1109/TASL.2006.883177 Moller S, 2011, IEEE SIGNAL PROC MAG, V28, P18, DOI 10.1109/MSP.2011.942469 Moore BCJ, 2008, JARO-J ASSOC RES OTO, V9, P399, DOI 10.1007/s10162-008-0143-x Moore BCJ, 1996, ACUSTICA, V82, P335 Nabelek A., 1989, J ACOUST SOC AM, V86, P318 Nabelek A. K., 1993, ACOUSTICAL FACTORS A, P15 Neuman AC, 2010, EAR HEARING, V31, P336, DOI 10.1097/AUD.0b013e3181d3d514 Pearson K., 1894, Philosophical Transactions, V185a, P71, DOI 10.1098/rsta.1894.0003 PLOMP R, 1986, J SPEECH HEAR RES, V29, P146 Poissant SF, 2006, J ACOUST SOC AM, V119, P1606, DOI 10.1121/1.2168428 Qin MK, 2003, J ACOUST SOC AM, V114, P446, DOI 10.1121/1.1579009 Rothauser E. 
H., 1969, IEEE T AUDIO ELECTRO, V17, P225, DOI DOI 10.1109/TAU.1969.1162058 Santos J., 2012, INTERSPEECH Schroder J., 2009, P INT C AC NAG DAGA, P606 Vandali AE, 2000, EAR HEARING, V21, P608, DOI 10.1097/00003446-200012000-00008 Van den Bogaert T, 2009, J ACOUST SOC AM, V125, P360, DOI 10.1121/1.3023069 Watkins AJ, 2000, ACUSTICA, V86, P532 Wilson BS, 2008, HEARING RES, V242, P3, DOI 10.1016/j.heares.2008.06.005 Xu L, 2008, HEARING RES, V242, P132, DOI 10.1016/j.heares.2007.12.010 Yang LP, 2005, J ACOUST SOC AM, V117, P1001, DOI 10.1121/1.1852873 Zheng YF, 2011, EAR HEARING, V32, P569, DOI 10.1097/AUD.0b013e318216eba6 NR 46 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2013 VL 55 IS 7-8 BP 815 EP 824 DI 10.1016/j.specom.2013.04.001 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 178YU UT WOS:000321484100001 ER PT J AU Lutfi, SL Fernandez-Martinez, F Lucas-Cuesta, JM Lopez-Lebon, L Montero, JM AF Lebai Lutfi, Syaheerah Fernandez-Martinez, Fernando Manuel Lucas-Cuesta, Juan Lopez-Lebon, Lorena Manuel Montero, Juan TI A satisfaction-based model for affect recognition from conversational features in spoken dialog systems SO SPEECH COMMUNICATION LA English DT Article DE Automatic affect detection; Affective spoken dialog system; Domestic environment; HiFi agent; Social intelligence; Dialog features; Conversational cues; User bias; Predicting user satisfaction ID AFFECTIVE SPEECH; EMOTION; ANNOTATION; LANGUAGE; TUTORS; CUES AB Detecting user affect automatically during real-time conversation is the main challenge towards our greater aim of infusing social intelligence into a natural-language mixed-initiative High-Fidelity (Hi-Fi) audio control spoken dialog agent. In recent years, studies on affect detection from voice have moved on to using realistic, non-acted data, which is subtler. However, it is more challenging to perceive subtler emotions and this is demonstrated in tasks such as labeling and machine prediction. This paper attempts to address part of this challenge by considering the role of user satisfaction ratings and also conversational/dialog features in discriminating contentment and frustration, two types of emotions that are known to be prevalent within spoken human-computer interaction. However, given the laboratory constraints, users might be positively biased when rating the system, indirectly making the reliability of the satisfaction data questionable. Machine learning experiments were conducted on two datasets, users and annotators, which were then compared in order to assess the reliability of these datasets. Our results indicated that standard classifiers were significantly more successful in discriminating the abovementioned emotions and their intensities (reflected by user satisfaction ratings) from annotator data than from user data. These results corroborated that: first, satisfaction data could be used directly as an alternative target variable to model affect, and that they could be predicted exclusively by dialog features. Second, these were only true when trying to predict the abovementioned emotions using annotator's data, suggesting that user bias does exist in a laboratory-led evaluation. (c) 2013 Elsevier B.V. All rights reserved. C1 [Lebai Lutfi, Syaheerah; Manuel Lucas-Cuesta, Juan; Lopez-Lebon, Lorena; Manuel Montero, Juan] Univ Politecn Madrid, Speech Technol Grp, E-28040 Madrid, Spain. 
[Lebai Lutfi, Syaheerah] Univ Sci Malaysia, Sch Comp Sci, George Town, Malaysia. [Fernandez-Martinez, Fernando] Univ Carlos III Madrid, Dept Signal Theory & Commun, Multimedia Proc Grp GPM, E-28903 Getafe, Spain. RP Montero, JM (reprint author), Univ Politecn Madrid, Speech Technol Grp, E-28040 Madrid, Spain. EM syaheerah@die.upm.es; ffm@tsc.uc3m.es; juanmak@die.upm.es; lorena.llebon@alumnos.upm.es; juancho@die.upm.es RI Montero, Juan M/K-2381-2014; Fernandez-Martinez, Fernando/M-2935-2014 OI Montero, Juan M/0000-0002-7908-5400; FU European Union [287678]; TIMPANO [TIN2011-28169-C05-03]; ITALIHA(CAM-UPM); INAPRA [DPI2010-21247-C02-02]; SD-TEAM [TIN2008-06856-C05-03]; MA2VICMR (Comunidad Autonoma de Madrid) [S2009/TIC-1542]; University Science of Malaysia; Malaysian Ministry of Higher Education FX The work leading to these results has received funding from the European Union under Grant agreement No. 287678. It has also been supported by TIMPANO(TIN2011-28169-C05-03), ITALIHA(CAM-UPM), INAPRA (DPI2010-21247-C02-02), SD-TEAM (TIN2008-06856-C05-03) and MA2VICMR (Comunidad Autonoma de Madrid, S2009/TIC-1542) projects. The corresponding author thanks University Science of Malaysia and the Malaysian Ministry of Higher Education for the PhD funding. Authors also thank all the other members of the Speech Technology Group for the continuous and fruitful discussion on these topics. CR Ai H, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P797 Ai H., 2007, 8 SIGDIAL WORKSH DIS Ang J., 2002, P INT C SPOK LANG PR [Anonymous], 2001, PERC EV SPEECH QUAL BAILEY JE, 1983, MANAGE SCI, V29, P530, DOI 10.1287/mnsc.29.5.530 Banse R, 1996, J PERS SOC PSYCHOL, V70, P614, DOI 10.1037/0022-3514.70.3.614 Barra-Chicote R., 2007, P WISP Barra-Chicote R, 2010, SPEECH COMMUN, V52, P394, DOI 10.1016/j.specom.2009.12.007 Barra-Chicote R., 2009, P INT, P336 Barra R, 2006, INT CONF ACOUST SPEE, P1085 Batliner A, 2011, COMPUT SPEECH LANG, V25, P4, DOI 10.1016/j.csl.2009.12.003 Burkhardt F., 2009, P IEEE Callejas Z, 2008, LECT NOTES ARTIF INT, V5078, P221, DOI 10.1007/978-3-540-69369-7_25 Callejas Z, 2008, SPEECH COMMUN, V50, P416, DOI 10.1016/j.specom.2008.01.001 Callejas Z, 2008, SPEECH COMMUN, V50, P646, DOI 10.1016/j.specom.2008.04.004 Charfuelan M., 2000, TALN Cowie R., 2010, ESSENTIAL ROLE HUMAN, P151 DANIELI M, 1995, AAAI SPRING S EMP ME, P34 Devillers L., 2002, LREC Devillers L, 2011, COMPUT SPEECH LANG, V25, P1, DOI 10.1016/j.csl.2010.07.002 D'Mello SK, 2008, USER MODEL USER-ADAP, V18, P45, DOI 10.1007/s11257-007-9037-6 DOLL WJ, 1988, MIS QUART, V12, P259, DOI 10.2307/248851 Dybkjaer L, 2004, SPEECH COMMUN, V43, P33, DOI 10.1016/j.specom.2004.02.001 Ekman P., 1978, FACIAL ACTION CODING Engelbrecht K.-P., 2009, P SIGDIAL WORKSH DIS, P170, DOI 10.3115/1708376.1708402 Fernandez-Martinez F., 2010, P 7 C INT LANG RES E Fernandez-Martinez F., 2010, P IEEE WORKSH DAT EX Fernandez-Martinez F., 2008, P IEEE WORKSH SPOK L Field A., 2005, DISCOVERING STAT USI, V2nd Forbes-Riley K, 2011, SPEECH COMMUN, V53, P1115, DOI 10.1016/j.specom.2011.02.006 Forbes-Riley K, 2011, COMPUT SPEECH LANG, V25, P105, DOI 10.1016/j.csl.2009.12.002 Gelbrich K., 2009, SCHMALENBACH BUSINES, V61, P40 Grichkovtsova I, 2012, SPEECH COMMUN, V54, P414, DOI 10.1016/j.specom.2011.10.005 Grothendieck J, 2009, INT CONF ACOUST SPEE, P4745, DOI 10.1109/ICASSP.2009.4960691 Hone K.S., 2000, NAT LANG ENG, V6, P287, DOI 10.1017/S1351324900002497 Kernbach S., 2005, J SERV MARK, V19, P438, DOI 10.1108/08876040510625945 
Laukka P, 2011, COMPUT SPEECH LANG, V25, P84, DOI 10.1016/j.csl.2010.03.004 Lee CM, 2005, IEEE T SPEECH AUDI P, V13, P293, DOI 10.1109/TSA.2004.838534 Liscombe J., 2005, P INT LISB PORT, P1845 Litman DJ, 2006, SPEECH COMMUN, V48, P559, DOI 10.1016/j.specom.2005.09.008 Locke E. A., 1976, NATURE CAUSES JOB SA Lutfi S., 2010, P BRAIN INSP COGN SY Lutfi S. L., 2009, P 9 INT C EP ROB EPI, P221 Lutfi SL, 2009, HEALTHINF 2009: PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON HEALTH INFORMATICS, P488 Mairesse F, 2007, J ARTIF INTELL RES, V30, P457 Moller S, 2007, COMPUT SPEECH LANG, V21, P26, DOI 10.1016/j.csl.2005.11.003 Moller S., 2005, QUALITY TELEPHONE BA Nicholson J, 2000, NEURAL COMPUT APPL, V9, P290, DOI 10.1007/s005210070006 Oudeyer PY, 2003, INT J HUMAN COMPUTER, V59, P157, DOI 10.1016/S1071-581(02)00141-6 Pell MD, 2009, J PHONETICS, V37, P417, DOI 10.1016/j.wocn.2009.07.005 Picard R. W., 1999, P HCI INT 8 INT C HU, V1, P829 Podsakoff PM, 2003, J APPL PSYCHOL, V88, P879, DOI 10.1037/0021-9101.88.5.879 Porayska-Pomsta K, 2008, USER MODEL USER-ADAP, V18, P125, DOI 10.1007/s11257-007-9041-x Reeves B., 1996, MEDIA EQUATION PEOPL Riccardi G, 2005, LECT NOTES COMPUT SC, V3814, P144 Saris W. E., 2010, SURVEY RES METHODS, V4, P61 Schuller B., 2012, COMPUTER SPEECH LANG Shami M., 2007, LECT NOTES COMPUTER, V4441, P43, DOI 10.1007/978-3-540-74122-05 Tcherkassof A, 2007, EUR J SOC PSYCHOL, V37, P1325, DOI 10.1002/ejsp.427 Toivanen J, 2004, LANG SPEECH, V47, P383 Truong KP, 2012, SPEECH COMMUN, V54, P1049, DOI 10.1016/j.specom.2012.04.006 Vidrascu L., 2005, INTERSPEECH 05, P1841 Vogt T, 2005, 2005 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), VOLS 1 AND 2, P474, DOI 10.1109/ICME.2005.1521463 Walker M., 2000, P LANG RES EV C LREC Witten I.H., 2005, DATA MINING PRACTICA NR 65 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2013 VL 55 IS 7-8 BP 825 EP 840 DI 10.1016/j.specom.2013.04.005 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 178YU UT WOS:000321484100002 ER PT J AU Tan, LN Alwan, A AF Tan, Lee Ngee Alwan, Abeer TI Multi-band summary correlogram-based pitch detection for noisy speech SO SPEECH COMMUNICATION LA English DT Article DE Pitch detection; Multi-band; Correlogram; Comb-filter; Noise-robust ID DETECTION ALGORITHMS; CLASSIFICATION; RECOGNITION; SIGNALS; ROBUST AB A multi-band summary correlogram (MBSC)-based pitch detection algorithm (PDA) is proposed. The PDA performs pitch estimation and voiced/unvoiced (V/UV) detection via novel signal processing schemes that are designed to enhance the MBSC's peaks at the most likely pitch period. These peak-enhancement schemes include comb-filter channel-weighting to yield each individual subband's summary correlogram (SC) stream, and stream-reliability-weighting to combine these SCs into a single MBSC. V/UV detection is performed by applying a constant threshold on the maximum peak of the enhanced MBSC. Narrowband noisy speech sampled at 8 kHz is generated from the Keele (development set) and CSTR (Centre for Speech Technology Research; evaluation set) corpora. Both 4-kHz full-band speech and G.712-filtered telephone speech are simulated.
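As a drastically simplified, single-band illustration of correlogram-style pitch detection with threshold-based V/UV (the MBSC's comb-filter channel-weighting and stream-reliability-weighting are not reproduced), the Python sketch below estimates one pitch value per frame from the normalised autocorrelation; the lag range and the 0.5 threshold are arbitrary choices.

import numpy as np

def pitch_frame(frame, fs, f0_min=60.0, f0_max=400.0, vuv_thresh=0.5):
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    ac = ac / (ac[0] + 1e-12)                   # normalised autocorrelation (one "correlogram" row)
    lo, hi = int(fs / f0_max), int(fs / f0_min)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return (fs / lag, True) if ac[lag] > vuv_thresh else (0.0, False)

fs = 8000
t = np.arange(int(0.04 * fs)) / fs
rng = np.random.default_rng(5)
print(pitch_frame(np.sin(2 * np.pi * 200 * t), fs))        # voiced-like frame -> approx. 200 Hz
print(pitch_frame(rng.standard_normal(len(t)), fs))        # noise-like frame -> unvoiced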
When evaluated solely on pitch estimation accuracy, assuming voicing detection is perfect, the proposed algorithm has the lowest gross pitch error for noisy speech in the evaluation set among the algorithms evaluated (RAPT, YIN, etc.). The proposed PDA also achieves the lowest average pitch detection error, when both pitch estimation and voicing detection errors are taken into account. (c) 2013 Elsevier B.V. All rights reserved. C1 [Tan, Lee Ngee; Alwan, Abeer] Univ Calif Los Angeles, Dept Elect Engn, Los Angeles, CA 90095 USA. RP Tan, LN (reprint author), Univ Calif Los Angeles, Dept Elect Engn, 56-125B Engn 4 Bldg,Box 951594, Los Angeles, CA 90095 USA. EM ngee@seas.ucla.edu; alwan@ee.ucla.edu FU Defense Advanced Research Projects Agency (DARPA) [D10PC20024] FX This material is based on work supported in part by the Defense Advanced Research Projects Agency (DARPA) under Contract No. D10PC20024. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the view of the DARPA or its Contracting Agent, the US Department of the Interior, National Business Center, Acquisition and Property Management Division, Southwest Branch. The views expressed are those of the author and do not reflect the official policy or position of the Department of Defense or the US Government. Approved for Public Release, Distribution Unlimited. The authors thank the editor and the reviewers for their comments. CR Ahmadi S, 1999, IEEE T SPEECH AUDI P, V7, P333, DOI 10.1109/89.759042 ATAL BS, 1976, IEEE T ACOUST SPEECH, V24, P201, DOI 10.1109/TASSP.1976.1162800 Bagshaw P., 1993, P EUR C SPEECH COMM, P1003 Beritelli F, 2007, ELECTRON LETT, V43, P249, DOI [10.1049/el:20073800, 10.1049/e1:20073800] Beskow J., 2000, P ICSLP 2000 BEIJ CH, V4, P464 Camacho A, 2008, J ACOUST SOC AM, V124, P1638, DOI 10.1121/1.2951592 Cariani PA, 1996, J NEUROPHYSIOL, V76, P1698 Cheveigne A, 2002, J ACOUST SOC AM, V111, P1917 Chu W, 2009, INT CONF ACOUST SPEE, P3969 DELGUTTE B, 1980, J ACOUST SOC AM, V68, P843, DOI 10.1121/1.384824 DRULLMAN R, 1995, J ACOUST SOC AM, V97, P585, DOI 10.1121/1.413112 Frerking M. E., 1994, DIGITAL SIGNAL PROCE HERMES DJ, 1988, J ACOUST SOC AM, V83, P257, DOI 10.1121/1.396427 Hirsch H-Guenter, 2005, FANT FILTERING NOISE ITU, 1996, REC G 712 TRANSM PER LICKLIDER JCR, 1951, EXPERIENTIA, V7, P128, DOI 10.1007/BF02156143 Loughlin PJ, 1996, J ACOUST SOC AM, V100, P1594, DOI 10.1121/1.416061 Luengo I, 2007, INT CONF ACOUST SPEE, P1057 MEDAN Y, 1991, IEEE T SIGNAL PROCES, V39, P40, DOI 10.1109/78.80763 MEDDIS R, 1991, J ACOUST SOC AM, V89, P2866, DOI 10.1121/1.400725 Nakatani T, 2004, J ACOUST SOC AM, V116, P3690, DOI 10.1121/1.1787522 Oh K., 1984, P IEEE ICASSP, P85 PATTERSON RD, 1992, ADV BIOSCI, V83, P429 Plante F., 1995, P EUR, P837 RABINER LR, 1976, IEEE T ACOUST SPEECH, V24, P399, DOI 10.1109/TASSP.1976.1162846 ROSS MJ, 1974, IEEE T ACOUST SPEECH, VAS22, P353, DOI 10.1109/TASSP.1974.1162598 Rouat J, 1997, SPEECH COMMUN, V21, P191, DOI 10.1016/S0167-6393(97)00002-2 Secrest B. G., 1982, Proceedings of ICASSP 82. IEEE International Conference on Acoustics, Speech and Signal Processing Secrest B. G., 1983, Proceedings of ICASSP 83. 
IEEE International Conference on Acoustics, Speech and Signal Processing Shah J., 2004, P IEEE INT C AC SPEE, P17 Slaney M., 1990, P IEEE INT C AC SPEE, P357 SUN XJ, 2002, ACOUST SPEECH SIG PR, P333 Talkin D., 1995, SPEECH CODING SYNTHE, P497 Tan LN, 2011, INT CONF ACOUST SPEE, P4464 VARGA A, 1993, SPEECH COMMUN, V12, P247, DOI 10.1016/0167-6393(93)90095-3 Walker K., 2012, ISCA ODYSSEY Wu MY, 2003, IEEE T SPEECH AUDI P, V11, P229, DOI 10.1109/TSA.2003.811539 NR 37 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2013 VL 55 IS 7-8 BP 841 EP 856 DI 10.1016/j.specom.2013.03.001 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 178YU UT WOS:000321484100003 ER PT J AU Mattheyses, W Latacz, L Verhelst, W AF Mattheyses, Wesley Latacz, Lukas Verhelst, Werner TI Comprehensive many-to-many phoneme-to-viseme mapping and its application for concatenative visual speech synthesis SO SPEECH COMMUNICATION LA English DT Article DE Visual speech synthesis; Viseme classification; Phoneme-to-viseme mapping; Context-dependent visemes; Visemes ID WORD-RECOGNITION; ANIMATION; PERFORMANCE; CONSONANTS; COARTICULATION; IMPLEMENTATION; PERCEPTION; SELECTION; MODEL; FACE AB The use of visemes as atomic speech units in visual speech analysis and synthesis systems is well-established. Viseme labels are determined using a many-to-one phoneme-to-viseme mapping. However, due to visual coarticulation effects, an accurate mapping from phonemes to visemes should define a many-to-many mapping scheme instead. In this research it was found that neither the use of standardized nor speaker-dependent many-to-one viseme labels could satisfy the quality requirements of concatenative visual speech synthesis. Therefore, a novel technique to define a many-to-many phoneme-to-viseme mapping scheme is introduced, which makes use of both tree-based and k-means clustering approaches. We show that these many-to-many viseme labels more accurately describe the visual speech information as compared to both phoneme-based and many-to-one viseme-based speech labels. In addition, we found that the use of these many-to-many visemes improves the precision of the segment selection phase in concatenative visual speech synthesis using limited speech databases. Furthermore, the resulting synthetic visual speech was both objectively and subjectively found to be of higher quality when the many-to-many visemes are used to describe the speech database and the synthesis targets. (c) 2013 Elsevier B.V. All rights reserved. C1 [Mattheyses, Wesley; Latacz, Lukas; Verhelst, Werner] Vrije Univ Brussel, Dept ETRO DSSP, B-1050 Brussels, Belgium. [Verhelst, Werner] IMinds, B-9050 Ghent, Belgium. RP Mattheyses, W (reprint author), Vrije Univ Brussel, Dept ETRO DSSP, Pl Laan 2, B-1050 Brussels, Belgium. 
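One ingredient of such a mapping, clustering visual feature vectors of phoneme realisations so that a single phoneme can receive several viseme labels depending on its context, can be sketched with plain k-means as below. The two-dimensional "mouth-shape" features and the choice of three clusters are synthetic placeholders, not the paper's feature set or its tree-based variant.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Toy data: visual feature vectors for realisations of /t/ in two vowel contexts and of /p/.
feats = np.vstack([rng.normal([0.2, 0.8], 0.05, (30, 2)),    # /t/ before spread vowels
                   rng.normal([0.7, 0.3], 0.05, (30, 2)),    # /t/ before rounded vowels
                   rng.normal([0.1, 0.1], 0.05, (30, 2))])   # /p/
phones = ["t"] * 60 + ["p"] * 30

viseme_ids = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(feats)
mapping = {}
for ph, vis in zip(phones, viseme_ids):
    mapping.setdefault(ph, set()).add(int(vis))
print(mapping)    # /t/ ends up with more than one viseme label -> a many-to-many style mapping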
EM wmatthey@etro.vub.ac.be; llatac-z@etro.vub.ac.be; wverhels@etro.vub.ac.be CR Arslan LM, 1999, SPEECH COMMUN, V27, P81, DOI 10.1016/S0167-6393(98)00068-5 Aschenberner B., 2005, PHONEME VISEME MAPPI Auer J., 1997, J ACOUST SOC AM, V102, P3704 Bailly G., 2003, International Journal of Speech Technology, V6, DOI 10.1023/A:1025700715107 BAUM LE, 1970, ANN MATH STAT, V41, P164, DOI 10.1214/aoms/1177697196 Beskow J., 2005, P INT 2005 LISB PORT, P793 BINNIE CA, 1974, J SPEECH HEAR RES, V17, P619 Bozkurt E., 2007, P SIGN PROC COMM APP, P1 Breen AP, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P2159 Bregler C., 1997, P ACM SIGGRAPH, P353, DOI 10.1145/258734.258880 Cappelletta Luca, 2012, Proceedings of the 1st International Conference on Pattern Recognition Applications and Methods. ICPRAM 2012 Cohen M. M., 1993, Models and Techniques in Computer Animation COHEN MM, 1990, BEHAV RES METH INSTR, V22, P260 Cootes TF, 2001, IEEE T PATTERN ANAL, V23, P681, DOI 10.1109/34.927467 Corthals P., 1984, TIJDSCHR LOG AUDIOLJ, V14, P126 Costa P., 2010, P ACM SSPNET INT S F, P20, DOI 10.1145/1924035.1924047 Dahai Yu, 2010, IPSJ T COMPUTER VISI, V2, P25 De Martino JM, 2006, COMPUT GRAPH-UK, V30, P971, DOI 10.1016/j.cag.2006.08.017 DEMUYNCK K, 2008, P ICSLP, P495 Deng Z, 2008, COMPUT GRAPH FORUM, V27, P2096, DOI 10.1111/j.1467-8659.2008.01192.x EBERHARDT SP, 1990, J ACOUST SOC AM, V88, P1274, DOI 10.1121/1.399704 Eggermont J. P. M., 1964, TAALVERWERVING BIJ G Elisei F., 2001, P AUD VIS SPEECH PRO, P90 Ezzat T, 2002, ACM T GRAPHIC, V21, P388 Ezzat T, 2000, INT J COMPUT VISION, V38, P45, DOI 10.1023/A:1008166717597 Fagel S, 2004, SPEECH COMMUN, V44, P141, DOI 10.1016/j.specom.2004.10.006 FISHER CG, 1968, J SPEECH HEAR RES, V11, P796 Galanes F., 1998, P INT C AUD VIS SPEE, P191 Govokhina O., 2007, P 6 ISCA WORKSH SPEE, P1 Govokhina O., 2006, P JOURN ET PAR, P305 Hazen T.J., 2004, P INT C MULT INT, P235, DOI 10.1145/1027933.1027972 Hilder S., 2010, P INT C AUD VIS SPEE, P154 Hunt AJ, 1996, INT CONF ACOUST SPEE, P373, DOI 10.1109/ICASSP.1996.541110 JACKSON PL, 1988, VOLTA REV, V90, P99 Jeffers J., 1971, SPEECHREADING LIPREA Keating Patricia A., 1988, PHONOLOGY, V5, P275, DOI 10.1017/S095267570000230X Kent R. D., 1977, J PHONETICS, V15, P115 LeGoff B, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P2163 Lesner S. 
A., 1981, J ACADEMY REHABILITA, V14, P252 Liu K., 2011, P IEEE GLOB TEL C GL, P1 Lloyd S., 1982, IEEE T INFORMATION T, V28, P129, DOI DOI 10.1109/TIT.1982.1056489 Mattheyses W., 2010, P INT C AUD VIS SPEE, P148 Mattheyses W., 2011, P INT C AUD VIS SPEE, P1113 Mattheyses W, 2008, LECT NOTES COMPUT SC, V5237, P125, DOI 10.1007/978-3-540-85853-9_12 Mattheyses W, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 1-2, P1113 Mattheyses W., 2009, EURASIP J Mattys SL, 2002, PERCEPT PSYCHOPHYS, V64, P667, DOI 10.3758/BF03194734 Melenchon J., 2007, P INT C AUD VIS SPEE, P191 Melenchon J, 2009, IEEE T AUDIO SPEECH, V17, P459, DOI 10.1109/TASL.2008.2010213 MONTGOMERY AA, 1983, J ACOUST SOC AM, V73, P2134, DOI 10.1121/1.389537 MYERS C, 1980, IEEE T ACOUST SPEECH, V28, P623, DOI 10.1109/TASSP.1980.1163491 OHMAN SEG, 1967, J ACOUST SOC AM, V41, P310 Olshen R., 1984, CLASSIFICATION REGRE, V1st OWENS E, 1985, J SPEECH HEAR RES, V28, P381 Pandzic I., 2003, MPEG 4 FACIAL ANIMAT Potamianos G., 2004, ISSUES VISUAL AUDIO Rogozan A., 1999, International Journal on Artificial Intelligence Tools (Architectures, Languages, Algorithms), V8, DOI 10.1142/S021821309900004X Saenko E., 2004, ARTICULARY FEATURES Senin P., 2008, DYNAMIC TIME WARPING Tamura M, 1998, P INT C AUD VIS SPEE Taylor S.L., 2012, P 11 ACM SIGGRAPH EU, P275 Tekalp AM, 2000, SIGNAL PROCESS-IMAGE, V15, P387, DOI 10.1016/S0923-5965(99)00055-7 Theobald B., 2008, P INTERSPEECH, P1875 Theobald BJ, 2004, SPEECH COMMUN, V44, P127, DOI 10.1016/j.specom.2004.07.002 Theobald BJ, 2012, IEEE T AUDIO SPEECH, V20, P2378, DOI 10.1109/TASL.2012.2202651 VANSON N, 1994, J ACOUST SOC AM, V96, P1341, DOI 10.1121/1.411324 Verma A., 2003, P IEEE INT C AC SPEE, P720 Visser M, 1999, LECT NOTES ARTIF INT, V1692, P349 Ypsilos I. A., 2004, Proceedings. 2nd International Symposium on 3D Data Processing, Visualization, and Transmission, DOI 10.1109/TDPVT.2004.1335143 Zelezny M, 2006, SIGNAL PROCESS, V86, P3657, DOI 10.1016/j.sigpro.2006.02.039 NR 70 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2013 VL 55 IS 7-8 BP 857 EP 876 DI 10.1016/j.specom.2013.02.005 PG 20 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 178YU UT WOS:000321484100004 ER PT J AU Mattheyses, W Latacz, L Verhelst, W AF Mattheyses, Wesley Latacz, Lukas Verhelst, Werner TI Comprehensive many-to-many phoneme-to-viseme mapping and its application for concatenative visual speech synthesis (vol 55, pg 857, 2013) SO SPEECH COMMUNICATION LA English DT Correction C1 [Mattheyses, Wesley; Latacz, Lukas; Verhelst, Werner] Vrije Univ Brussel, Dept ETRO DSSP, B-1050 Brussels, Belgium. [Verhelst, Werner] IMinds, B-9050 Ghent, Belgium. RP Mattheyses, W (reprint author), Vrije Univ Brussel, Dept ETRO DSSP, Pl Laan 2, B-1050 Brussels, Belgium. EM wmatthey@etro.vub.ac.be CR Mattheyses W, 2013, SPEECH COMMUN, V55, P857, DOI 10.1016/j.specom.2013.02.005 NR 1 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2013 VL 55 IS 7-8 BP 877 EP 877 DI 10.1016/j.specom.2013.05.004 PG 1 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 178YU UT WOS:000321484100005 ER PT J AU Rao, KS Vuppala, AK AF Rao, K. 
Sreenivasa Vuppala, Anil Kumar TI Non-uniform time scale modification using instants of significant excitation and vowel onset points SO SPEECH COMMUNICATION LA English DT Article DE Instants of significant excitation; Epochs; Vowel onset point; Time scale modification; Non-uniform time scale modification; Uniform time scale modification ID SPEECH AB In this paper, a non-uniform time scale modification (TSM) method is proposed for increasing or decreasing speech rate. The proposed method modifies the durations of vowel and pause segments by different modification factors. Vowel segments are modified by factors based on their identities, and pause segments by uniform factors based on the desired speaking rate. Consonant and transition (consonant-to-vowel) segments are not modified in the proposed TSM. These modification factors are derived from the analysis of slow and fast speech collected from professional radio artists. In the proposed TSM method, vowel onset points (VOPs) are used to mark the consonant, transition and vowel regions, and instants of significant excitation (ISE) are used to perform TSM as required. The VOPs indicate the instants at which the onsets of vowels take place. The ISE, also known as epochs, indicate the instants of glottal closure during voiced speech, and some random excitations such as burst onset during non-voiced speech. In this work, VOPs are determined using multiple sources of evidence from excitation source, spectral peaks, modulation spectrum and uniformity in epoch intervals. The ISEs are determined using a zero-frequency filter method. The performance of the proposed non-uniform TSM scheme is compared with uniform and existing non-uniform TSM schemes using epoch and time domain pitch synchronous overlap and add (TD-PSOLA) methods. (C) 2013 Elsevier B.V. All rights reserved. C1 [Rao, K. Sreenivasa] Indian Inst Technol, Sch Informat Technol, Kharagpur 721302, W Bengal, India. [Vuppala, Anil Kumar] IIIT Hyderabad, LTRC, Hyderabad, Andhra Pradesh, India. RP Rao, KS (reprint author), Indian Inst Technol, Sch Informat Technol, Kharagpur 721302, W Bengal, India. EM ksrao@iitkgp.ac.in; anil.vuppala@gmail.com CR Bonada J, 2000, P 2000 INT COMP MUS, P396 Deller J. R., 1993, DISCRETE TIME PROCES di Marino J., 2001, P IEEE INT C AC SPEE, V2, P853 Donnellan Olivia, 2003, P 3 IEEE INT C ADV L Duxbury C, 2001, P DIG AUD EFF C DAFX, P1 Duxbury C, 2002, P AES 112 CONV MUN G, P5530 Gangashetty S. V., 2004, THESIS IIT MADRAS Grofit S, 2008, IEEE T AUDIO SPEECH, V16, P106, DOI 10.1109/TASL.2007.909444 Hainsworth SW, 2001, PROCEEDINGS OF THE 2001 IEEE WORKSHOP ON THE APPLICATIONS OF SIGNAL PROCESSING TO AUDIO AND ACOUSTICS, P23, DOI 10.1109/ASPAA.2001.969533 Hogg RV, 1987, ENG STAT Ilk HG, 2006, SIGNAL PROCESS, V86, P127, DOI 10.1016/j.sigpro.2005.05.006 Klapuri A., 1999, ACOUST SPEECH SIG PR, P3089 Kumar Anil Vuppala, 2012, INT J ELECT COMMUNIC, V66, P697 Mahadeva Prasanna S R, 2009, IEEE Transactions on Audio, Speech and Language Processing, V17, DOI 10.1109/TASL.2008.2010884 Moulines, 1995, SPEECH COMMUN, V16, P175 MOULINES E, 1990, SPEECH COMMUN, V9, P453, DOI 10.1016/0167-6393(90)90021-Z Pickett J. 
M., 1999, ACOUSTICS SPEECH COM PORTNOFF MR, 1981, IEEE T ACOUST SPEECH, V29, P374, DOI 10.1109/TASSP.1981.1163581 Prakash Dixit R, 1991, J PHONETICS, V19, P213 QUATIERI TF, 1992, IEEE T SIGNAL PROCES, V40, P497, DOI 10.1109/78.120793 Rodet X, 2001, P INT COMP MUS C ICM, P30 Roebel A, 2003, P INT C DIG AUD EFF, P344 Slaney Malcolm, P IEEE INT C AC SPEE Sreenivasa Rao K., 2009, SPEECH COMMUN, V51, P1263 Sreenivasa-Rao K, 2006, IEEE T SPEECH AUDIO, V14, P972 Sri Rama Murty K, 2008, IEEE T SPEECH AUDIO, V16, P1602 Stevens K.N., 1999, ACOUSTIC PHONETICS NR 27 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2013 VL 55 IS 6 BP 745 EP 756 DI 10.1016/j.specom.2013.03.002 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 165IQ UT WOS:000320477500001 ER PT J AU Low, SY Pham, DS Venkatesh, S AF Low, Siow Yong Duc Son Pham Venkatesh, Svetha TI Compressive speech enhancement SO SPEECH COMMUNICATION LA English DT Article DE Compressed sensing; Speech enhancement; Sparsity ID VOICE ACTIVITY DETECTION; SIGNAL RECOVERY; BEAMFORMER; NOISE; PURSUIT AB This paper presents an alternative approach to speech enhancement by using compressed sensing (CS). CS is a new sampling theory which states that sparse signals can be reconstructed from far fewer measurements than required by Nyquist sampling. As such, CS can be exploited to reconstruct only the sparse components (e.g., speech) from the mixture of sparse and non-sparse components (e.g., noise). This is possible because, in a time-frequency representation, the speech signal is sparse whilst most noise is non-sparse. Derivation shows that, on average, the signal-to-noise ratio (SNR) in the compressed domain is greater than or equal to that in the uncompressed domain. Experimental results concur with the derivation, and the proposed CS scheme achieves better or similar perceptual evaluation of speech quality (PESQ) scores and segmental SNR compared to conventional methods over a wide range of input SNRs. (C) 2013 Elsevier B.V. All rights reserved. C1 [Low, Siow Yong] Curtin Univ, Miri, Malaysia. [Duc Son Pham] Curtin Univ, Dept Comp, Bentley, WA, Australia. [Venkatesh, Svetha] Deakin Univ, Ctr Pattern Recognit & Data Analyt, Geelong, Vic 3217, Australia. RP Low, SY (reprint author), Curtin Univ, Sarawak Campus, Miri, Malaysia. EM siowyong@curtin.edu.my; DucSon.Pham@curtin.edu.au; svetha.venkatesh@deakin.edu.au CR Benesty J., 2005, SPEECH ENHANCEMENT S BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 Boufounos Petros, 2007, IEEE WORKSH STAT SIG, P299 Brandstein M., 2001, DIGITAL SIGNAL PROCE Candes E. J., 2006, P INT C MATH MADR SP Candes EJ, 2008, IEEE SIGNAL PROC MAG, V25, P21, DOI 10.1109/MSP.2007.914731 Candes EJ, 2006, IEEE T INFORM THEORY, V52, P5406, DOI 10.1109/TIT.2006.885507 Candes EJ, 2006, IEEE T INFORM THEORY, V52, P489, DOI 10.1109/TIT.2005.862083 Chen SSB, 1998, SIAM J SCI COMPUT, V20, P33, DOI 10.1137/S1064827596304010 Christensen M.
G., 2009, P AS C SIGN SYST COM, P356 Cohen I, 2003, IEEE T SPEECH AUDI P, V11, P466, DOI 10.1109/TSA.2003.811544 Dam H.Q, 2004, IEEE INT S CIRC SYST, V3, P433 Davis A, 2005, INT CONF ACOUST SPEE, P65 Donoho DL, 2006, IEEE T INFORM THEORY, V52, P1289, DOI 10.1109/TIT.2006.871582 EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550 Gardner TJ, 2006, P NATL ACAD SCI USA, V103, P6094, DOI 10.1073/pnas.0601707103 Garofolo J., 1988, GETTING STARTED DARP Ghosh PK, 2011, IEEE T AUDIO SPEECH, V19, P600, DOI 10.1109/TASL.2010.2052803 Giacobello D, 2012, IEEE T AUDIO SPEECH, V20, P1644, DOI 10.1109/TASL.2012.2186807 Golub G.H., 1996, MATRIX COMPUTATIONS Griffin A, 2011, IEEE T AUDIO SPEECH, V19, P1382, DOI 10.1109/TASL.2010.2090656 Hoshuyama O, 1999, IEEE T SIGNAL PROCES, V47, P2677, DOI 10.1109/78.790650 Hu Y, 2008, IEEE T AUDIO SPEECH, V16, P229, DOI 10.1109/TASL.2007.911054 ITU, 2001, 862 ITU, P862 Jancovic P, 2012, SPEECH COMMUN, V54, P108, DOI 10.1016/j.specom.2011.07.005 Karvanen J, 2003, P 4 INT S IND COMP A, P125 Kim SJ, 2007, IEEE J-STSP, V1, P606, DOI 10.1109/JSTSP.2007.910971 Kokkinakis K, 2008, J ACOUST SOC AM, V123, P2379, DOI 10.1121/1.2839887 Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst Lotter T, 2003, EURASIP J APPL SIG P, V2003, P1147, DOI 10.1155/S1110865703305025 Low S.Y, 2002, ICCS 2002 8 INT C CO, V2, P1020 Low S.Y, 2005, IEEE INT C AC SPEECH, V3, P69 Lu CT, 2011, SPEECH COMMUN, V53, P495, DOI 10.1016/j.specom.2010.11.008 Martin R, 2001, IEEE T SPEECH AUDI P, V9, P504, DOI 10.1109/89.928915 Miyazaki R, 2012, IEEE T AUDIO SPEECH, V20, P2080, DOI 10.1109/TASL.2012.2196513 O'Shaughnessy D, 2000, SPEECH COMMUNICATION, V2nd Paliwal K, 2010, SPEECH COMMUN, V52, P450, DOI 10.1016/j.specom.2010.02.004 Pham D.T, 2009, LECT NOTES COMPUTER, V5441 Pinter I, 1996, COMPUT SPEECH LANG, V10, P1, DOI 10.1006/csla.1996.0001 Principi E, 2010, J ELECT COMPUTER ENG, P1 Rachlin Y, 2008, 46 ALL C COMM CONTR Shawe-Taylor J, 2002, ADV NEUR IN, V14, P511 Sreenivas TV, 2009, INT CONF ACOUST SPEE, P4125, DOI 10.1109/ICASSP.2009.4960536 Tropp JA, 2007, IEEE T INFORM THEORY, V53, P4655, DOI 10.1109/TIT.2007.909108 Uemura Y, 2008, INT WORKSH AC ECH NO Veen B. V., 1988, IEEE ASSP MAG APR, V5, P4 Wahlberg B, 2012, IFAC S SYST ID, V1, P16 Wu DS, 2011, ANN OPER RES, V185, P1, DOI 10.1007/s10479-010-0822-y Yang J., 1993, INT C AC SPEECH SIGN, V2, P363, DOI 10.1109/ICASSP.1993.319313 Yu T, 2009, INT CONF ACOUST SPEE, P213 NR 50 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2013 VL 55 IS 6 BP 757 EP 768 DI 10.1016/j.specom.2013.03.003 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 165IQ UT WOS:000320477500002 ER PT J AU Hansen, JHL Suh, JW Leonard, MR AF Hansen, John H. L. Suh, Jun-Won Leonard, Matthew R. TI In-set/out-of-set speaker recognition in sustained acoustic scenarios using sparse data SO SPEECH COMMUNICATION LA English DT Article DE Speaker recognition; In-set/out-of-set; Sparse data; Environmental noise ID GAUSSIAN MIXTURE-MODELS; VERIFICATION; IDENTIFICATION; NOISE; ENROLLMENT; DISTANCE; SYSTEMS; CORPUS AB This study addresses the problem of identifying in-set versus out-of-set speakers in noise for limited train/test durations in situations where rapid detection and tracking is required. 
The objective is to form a decision as to whether the current input speaker is accepted as a member of an enrolled in-set group or rejected as an outside speaker. A new scoring algorithm that combines log likelihood scores across an energy-frequency grid is developed where high-energy speaker dependent frames are fused with weighted scores from low-energy noise dependent frames. By leveraging the balance between the speaker versus background noise environment, it is possible to realize an improvement in overall equal error rate performance. Using speakers from the TIMIT database with 5 s of train and 2 s of test, the average optimum relative EER performance improvement for the proposed full selective leveraging approach is +31.6%. The optimum relative EER performance improvement using 10 s of NIST SRE-2008 is +10.8% using the proposed approach. The results confirm that for situations in which the background environment type remains constant between train and test, an in-set/out-of-set speaker recognition system that takes advantage of information gathered from the environmental noise can be formulated which realizes significant improvement when only extremely limited amounts of train/test data is available. (C) 2013 Elsevier B.V. All rights reserved. C1 [Hansen, John H. L.; Suh, Jun-Won; Leonard, Matthew R.] Univ Texas Dallas, Erik Jonsson Sch Engn & Comp Sci, Ctr Robust Speech Syst, Dept Elect Engn, Richardson, TX 75080 USA. RP Hansen, JHL (reprint author), Univ Texas Dallas, Erik Jonsson Sch Engn & Comp Sci, Ctr Robust Speech Syst, Dept Elect Engn, 2601 N Floyd Rd,EC33, Richardson, TX 75080 USA. EM John.Hansen@utdallas.edu FU AFRL [FA8750-09-C-0067]; University of Texas at Dallas from the Distinguished University Chair in Telecommunications Engineering FX This project was funded by AFRL through a subcontract to RADC Inc. under FA8750-09-C-0067, and partially by the University of Texas at Dallas from the Distinguished University Chair in Telecommunications Engineering held by J.H.L. Hansen. CR Akbacak M, 2007, IEEE T AUDIO SPEECH, V15, P465, DOI 10.1109/TASL.2006.881694 Angkititrakul P, 2004, IEEE ICASSP, V1, P169 Angkititrakul P, 2007, IEEE T AUDIO SPEECH, V15, P498, DOI 10.1109/TASL.2006.881689 Ariyaeeinia AM, 2006, IEE P-VIS IMAGE SIGN, V153, P618, DOI 10.1049/ip-vis:20050273 Auckenthaler R, 2000, DIGIT SIGNAL PROCESS, V10, P42, DOI 10.1006/dspr.1999.0360 BEN M, 2003, ACOUST SPEECH SIG PR, P69 Do MN, 2003, IEEE SIGNAL PROC LET, V10, P115, DOI 10.1109/LSP.2003.809034 DODDINGTON GR, 1985, P IEEE, V73, P1651, DOI 10.1109/PROC.1985.13345 Garofolo JS, 1993, TIMIT ACOUSTIC PHONE Hansen JH, 2005, IEEE T SPEECH AUDI P, V13, P712, DOI 10.1109/TSA.2005.852088 HANSEN JHL, 1995, IEEE T SPEECH AUDI P, V3, P169, DOI 10.1109/89.388143 Logan B, 2001, IEEE INT C MULTIMEDI, V0, P190 Muller C, 2005, ESTIMATING ACOUSTIC Muller C, 2007, SPEAKER CLASSIFICATI, V1 NIST SRE: U.S. 
National Institute of Standards and Technology, 2011, NIST YEAR 2008 SPEAK Prakash V, 2007, IEEE T AUDIO SPEECH, V15, P2044, DOI 10.1109/TASL.2007.902058 Reynolds DA, 2000, DIGIT SIGNAL PROCESS, V10, P19, DOI 10.1006/dspr.1999.0361 Rose RC, 1994, IEEE T SPEECH AUDI P, V2, P245, DOI 10.1109/89.279273 Shao Y, 2007, INT CONF ACOUST SPEE, P277 Stahl V, 2000, INT CONF ACOUST SPEE, P1875, DOI 10.1109/ICASSP.2000.862122 Suh JW, 2012, J ACOUST SOC AM, V131, P1515, DOI 10.1121/1.3672707 Xiang B, 2003, IEEE T SPEECH AUDI P, V11, P447, DOI 10.1109/TSA.2003.815822 ZHansen J.H.L, 2004, GETTING STARTED CU M NR 23 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2013 VL 55 IS 6 BP 769 EP 781 DI 10.1016/j.specom.2013.01.006 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 165IQ UT WOS:000320477500003 ER PT J AU Bayya, Y Gowda, DN AF Bayya, Yegnanarayana Gowda, Dhananjaya N. TI Spectro-temporal analysis of speech signals using zero-time windowing and group delay function SO SPEECH COMMUNICATION LA English DT Article DE Zero-time windowing; Zero-frequency filtering; Group delay function; NGD spectrum; HNGD spectrum ID VOCAL-TRACT; LINEAR-PREDICTION; EXTRACTION; REPRESENTATIONS; RECOGNITION; RESONANCES; SPECTRUM; F0 AB Traditional methods for estimating the vocal tract system characteristics typically compute the spectrum using a window size of 20-30 ms. The resulting spectrum is the average characteristics of the vocal tract system within the window segment. Also, the effect of pitch harmonics need to be countered in the process of spectrum estimation. In this paper, we propose a new approach for estimating the spectrum using a highly decaying window function. The impulse-like window function used is an approximation to integration operation in the frequency domain, and the operation is referred to as zero-time windowing analogous to the zero-frequency filtering operation in frequency domain. The apparent loss in spectral resolution due to the use of a highly decaying window function is restored by successive differencing in the frequency domain. The spectral resolution is further improved by the use of group delay function which has an additive property on the individual resonances as against the multiplicative nature of the magnitude spectrum. The effectiveness of the proposed approach in estimating the spectrum is evaluated in terms of its robustness to additive noise, and in formant estimation. (C) 2013 Elsevier B.V. All rights reserved. C1 [Bayya, Yegnanarayana] Int Inst Informat Technol, Hyderabad, Andhra Pradesh, India. [Gowda, Dhananjaya N.] Aalto Univ, Dept Informat & Comp Sci, Espoo, Finland. RP Gowda, DN (reprint author), Aalto Univ, Dept Informat & Comp Sci, Espoo, Finland. EM dhananjaya.gowda@aalto.fi FU Department of Information Technology, Government of India; Academy of Finland (Finnish Centre of Excellence in Computational Inference Research COIN) [251170] FX The authors would like to thank the Department of Information Technology, Government of India for supporting this activity through sponsored research projects. The second author would also like to thank The Academy of Finland (Finnish Centre of Excellence in Computational Inference Research COIN, 251170) for supporting his stay in Finland as a Postdoctoral Researcher. CR Abe T, 2006, IEEE T AUDIO SPEECH, V14, P1292, DOI 10.1109/TSA.2005.858545 Anand M. 
Joseph, 2006, P INT C SPOK LANG PR, P1009 Beskow J., 2000, P ICSLP 2000 BEIJ CH, V4, P464 Deller J., 2000, DISCRETE TIME PROCES Deng L, 2006, IEEE T AUDIO SPEECH, V14, P425, DOI 10.1109/TSA.2005.855841 Deng L, 2006, P INT C AC SPEECH SI, P1 Garofolo JS, 1993, TIMIT ACOUSTIC PHONE Gianfelici F, 2007, IEEE T AUDIO SPEECH, V15, P823, DOI 10.1109/TASL.2006.889744 Kawahara H, 2008, INT CONF ACOUST SPEE, P3933, DOI 10.1109/ICASSP.2008.4518514 Kawahara H, 1999, SPEECH COMMUN, V27, P187, DOI 10.1016/S0167-6393(98)00085-5 MAKHOUL J, 1975, P IEEE, V63, P561, DOI 10.1109/PROC.1975.9792 Murty KSR, 2008, IEEE T AUDIO SPEECH, V16, P1602, DOI 10.1109/TASL.2008.2004526 Oppenheim A.V, 1975, DIGIT SIGNAL PROCESS, P1 Rabiner L. R., 2010, THEORY APPL DIGITAL Santhanam B, 2000, IEEE T COMMUN, V48, P473, DOI 10.1109/26.837050 Yegnanarayana B, 2009, IEEE T AUDIO SPEECH, V17, P614, DOI 10.1109/TASL.2008.2012194 VARGA A, 1993, SPEECH COMMUN, V12, P247, DOI 10.1016/0167-6393(93)90095-3 Vargas J, 2008, IEEE T AUDIO SPEECH, V16, P1, DOI 10.1109/TASL.2007.907573 Welling L, 1998, IEEE T SPEECH AUDI P, V6, P36, DOI 10.1109/89.650308 YEGNANARAYANA B, 1992, IEEE T SIGNAL PROCES, V40, P2281, DOI 10.1109/78.157227 YEGNANARAYANA B, 1978, J ACOUST SOC AM, V63, P1638, DOI 10.1121/1.381864 Yegnanarayana B, 1998, IEEE T SPEECH AUDI P, V6, P313, DOI 10.1109/89.701359 NR 22 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2013 VL 55 IS 6 BP 782 EP 795 DI 10.1016/j.specom.2013.02.007 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 165IQ UT WOS:000320477500004 ER PT J AU Zhang, CL Morrison, GS Enzinger, E Ochoa, F AF Zhang, Cuiling Morrison, Geoffrey Stewart Enzinger, Ewald Ochoa, Felipe TI Effects of telephone transmission on the performance of formant-trajectory-based forensic voice comparison - Female voices SO SPEECH COMMUNICATION LA English DT Article DE Formant; Telephone; Landline; Mobile; Forensic; Validity ID ACOUSTIC CHARACTERISTICS; ENGLISH VOWELS; SPEECH; RECOGNITION; TRACKING; RELIABILITY AB In forensic-voice-comparison casework a common scenario is that the suspect's voice is recorded directly using a microphone in an interview room but the offender's voice is recorded via a telephone system. Acoustic-phonetic approaches to forensic voice comparison often include analysis of vowel formants, and the second formant is often assumed to be relatively robust to telephone-transmission effects. This study assesses the effects of telephone transmission on the performance of formant-trajectory-based forensic-voice-comparison systems. The effectiveness of both human-supervised and fully-automatic formant tracking is investigated. Human-supervised formant tracking is generally considered to be more accurate and reliable but requires a substantial investment of human labor. Measurements were made of the formant trajectories of haul tokens in a database of recordings of 60 female speakers of Chinese using one human-supervised and five fully-automatic formant trackers. Measurements were made under high-quality, landline-to-landline, mobile-to-mobile, and mobile-to-landline conditions. High-quality recordings were treated as suspect samples and telephone-transmitted recordings as offender samples. Discrete cosine transforms (DCT) were fitted to the formant trajectories and likelihood ratios were calculated on the basis of the DCT coefficients. 
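As a concrete illustration of the parameterisation step described in the abstract above, the following sketch fits a small number of DCT coefficients to a single formant trajectory before any likelihood-ratio modelling. It assumes equally spaced formant measurements and NumPy/SciPy; the trajectory values, the number of retained coefficients, and the variable names are illustrative and are not taken from the paper.

# Sketch: parameterise a formant trajectory with a few DCT coefficients
# (illustrative values; not the paper's data or settings).
import numpy as np
from scipy.fftpack import dct, idct

# Hypothetical F2 trajectory (Hz) sampled at equally spaced points across a token.
f2_track = np.array([1950., 1820., 1640., 1480., 1350., 1230., 1150., 1100.])

n_coeffs = 4                                  # keep only the low-order coefficients
coeffs = dct(f2_track, type=2, norm='ortho')  # full DCT-II of the trajectory
features = coeffs[:n_coeffs]                  # low-order coefficients capture the smooth shape

# Reconstruct the smoothed trajectory from the retained coefficients to see how much
# of the original contour the low-order DCT terms capture.
padded = np.zeros_like(coeffs)
padded[:n_coeffs] = features
smooth_track = idct(padded, type=2, norm='ortho')

print("DCT features:", np.round(features, 1))
print("RMS fit error (Hz):", round(float(np.sqrt(np.mean((f2_track - smooth_track) ** 2))), 1))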
For each telephone-transmission condition the formant-trajectory system was fused with a baseline mel-frequency cepstral-coefficient (MFCC) system, and performance was assessed relative to the baseline system. The systems based on human-supervised formant measurement always outperformed the systems based on fully-automatic formant measurement; however, in conditions involving mobile telephones neither the former nor the latter type of system provided meaningful improvement over the baseline system, and even in the other conditions the high cost in skilled labor for human-supervised formant-trajectory measurement is probably not warranted given the relatively good performance that can be obtained using other less-costly procedures. (C) 2013 Elsevier B.V. All rights reserved. C1 [Zhang, Cuiling] China Criminal Police Univ, Dept Forens Sci & Technol, Shenyang 110854, Liaoning, Peoples R China. [Zhang, Cuiling; Morrison, Geoffrey Stewart; Enzinger, Ewald; Ochoa, Felipe] Univ New S Wales, Sch Elect Engn & Telecommun, UNSW Sydney, Forens Voice Comparison Lab, Sydney, NSW 2052, Australia. RP Morrison, GS (reprint author), Univ New S Wales, Sch Elect Engn & Telecommun, UNSW Sydney, Forens Voice Comparison Lab, Sydney, NSW 2052, Australia. EM geoff-morrison@forensic-voice-comparison.net FU Australian Research Council; Australian Federal Police; New South Wales Police; Queensland Police; National Institute of Forensic Science; Australasian Speech Science and Technology Association; Guardia Civil through Linkage Project [LP100200142]; China Scholarship Council; Ministry of Education of the People's Republic of China [NCET-11-0836]; International Association of Forensic Phonetics and Acoustics Research Grant FX This research received support from multiple sources, including the following: Australian Research Council, Australian Federal Police, New South Wales Police, Queensland Police, National Institute of Forensic Science, Australasian Speech Science and Technology Association, and the Guardia Civil through Linkage Project LP100200142; China Scholarship Council State-Sponsored Scholarship Program for Visiting Scholars; Ministry of Education of the People's Republic of China "Program for New Century Excellent Talents in University" (NCET-11-0836); International Association of Forensic Phonetics and Acoustics Research Grant. Thanks to Terrance M. Nearey for providing clarifications on some of the algorithmic details of NAH2002 and FORMANTMEASURER. Unless otherwise explicitly attributed, the opinions expressed are those of the authors and do not necessarily represent the policies or opinions of any of the above mentioned organizations or individuals. Earlier versions of this paper were presented at the Special Session on Forensic Acoustics at the 162nd Meeting of the Acoustical Society of America, San Diego, November 2011 [J. Acoust. Soc. Amer. 130, 2519, doi: 10.1121/1.3655044]; at the 21st Annual Conference of the International Association for Forensic Phonetics and Acoustics, Santander, August 2012; and at the UNSW Forensic Speech Science Conference, Sydney, December 2012. 
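The abstract reports that, for each condition, the formant-trajectory system was fused with a baseline MFCC system, but the fusion method itself is not spelled out there. Linear logistic-regression fusion of per-trial log-likelihood-ratio scores (in the spirit of the FoCal toolbox cited in the reference list below) is one common choice, so the sketch shows that variant; the scores, labels, and the scikit-learn implementation are assumptions made purely for illustration.

# Sketch: linear logistic-regression fusion of two forensic-comparison systems
# (formant-trajectory LLRs and baseline MFCC LLRs). Scores and labels are invented.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Per-trial log-likelihood ratios from the two systems (columns) on calibration data.
scores = np.array([[ 2.1,  1.5], [ 0.8,  1.9], [ 1.7,  0.4], [ 2.5,  2.2],   # same-speaker trials
                   [-1.2, -0.5], [-2.3, -1.9], [-0.4, -1.1], [-1.8, -2.4]])  # different-speaker trials
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])

clf = LogisticRegression().fit(scores, labels)

# decision_function returns log posterior odds under the training prior;
# subtracting the training log prior odds yields a fused log-likelihood ratio.
prior_logodds = np.log(labels.mean() / (1.0 - labels.mean()))
new_trial = np.array([[1.2, -0.3]])          # LLRs from the two systems for a new trial
fused_llr = clf.decision_function(new_trial) - prior_logodds
print("fusion weights:", clf.coef_, "offset:", clf.intercept_)
print("fused LLR:", fused_llr)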
CR Aitken CGG, 2004, J ROY STAT SOC C-APP, V53, P109, DOI 10.1046/j.0035-9254.2003.05271.x Aitken CGG, 2004, J ROY STAT SOC C-APP, V53, P665, DOI 10.1111/j.1467-9876.2004.02031.x Anderson N, 1978, MODERN SPECTRUM ANAL, P252 ASSMANN PF, 1987, J ACOUST SOC AM, V81, P520, DOI 10.1121/1.394918 Boersma P., 2011, PRAAT DOING PHONETIC Boersma P., 1993, P I PHONETIC SCI, V17, P97 Brummer N, 2006, COMPUT SPEECH LANG, V20, P230, DOI 10.1016/j.csl.2005.08.001 BRUMMER N, 2005, FOCAL TOOLBOX TOOLS Byrne C, 2004, INT J SPEECH LANG LA, V11, P83, DOI 10.1558/sll.2004.11.1.83 CHEN NF, 2009, P INT 2009 INT SPEEC, P2203 de Castro A, 2009, INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, VOLS 1-5, P2343 Deng L, 2007, IEEE T AUDIO SPEECH, V15, P13, DOI 10.1109/TASL.2006.876724 Duckworth M, 2011, INT J SPEECH LANG LA, V18, P35, DOI 10.1558/ijsll.v18i1.35 FURUI S, 1986, IEEE T ACOUST SPEECH, V34, P52, DOI 10.1109/TASSP.1986.1164788 Gold E., 2011, INT J SPEECH LANG LA, V18, P143, DOI DOI 10.1558/IJS11.V18I2.293 Gonzalez-Rodriguez J, 2007, IEEE T AUDIO SPEECH, V15, P2104, DOI 10.1109/TASL.2007.902747 Gonzalez-Rodriguez J., 2011, P INT 2011 INT SPEEC, P133 Guillemin BJ, 2008, INT J SPEECH LANG LA, V15, P193, DOI 10.1558/ijsll.v15i2.193 HILLENBRAND J, 1995, J ACOUST SOC AM, V97, P3099, DOI 10.1121/1.411872 Kondaurova MV, 2012, J ACOUST SOC AM, V132, P1039, DOI 10.1121/1.4728169 Kunzel HJ, 2002, FORENSIC LINGUIST, V9, P83 Kunzel H.J., 2001, FORENSIC LINGUIST, V8, P80 Lawrence S, 2008, INT J SPEECH LANG LA, V15, P161, DOI 10.1558/ijsll.v15i2.161 Markel JD, 1976, LINEAR PREDICTION SP Morrison G, 2010, SOUNDLABELLER ERGONO Morrison G., 2013, VOWEL INHERENT SPECT, P263, DOI DOI 10.1007/978-3-642-14209-3_11 Morrison G, 2009, ROBUST VERSION TRAIN Morrison GS, 2011, SCI JUSTICE, V51, P91, DOI 10.1016/j.scijus.2011.03.002 Morrison GS, 2012, AUST J FORENSIC SCI, V44, P155, DOI 10.1080/00450618.2011.630412 Morrison GS, 2009, J ACOUST SOC AM, V125, P2387, DOI 10.1121/1.3081384 Morrison GS, 2011, SPEECH COMMUN, V53, P242, DOI 10.1016/j.specom.2010.09.005 MORRISON GS, 2007, FORENSIC LIKELIHOOD Morrison GS, 2013, AUST J FORENSIC SCI, V45, P173, DOI 10.1080/00450618.2012.733025 Mustafa K, 2006, IEEE T AUDIO SPEECH, V14, P435, DOI 10.1109/TSA.2005.855840 Nearey T. M., 2002, J ACOUST SOC AM, V112, P2323 Nolan F, 2002, FORENSIC LINGUIST, V9, P74, DOI 10.1558/sll.2002.9.1.74 OLIVE JP, 1971, J ACOUST SOC AM, V50, P661, DOI 10.1121/1.1912681 Pelecanos J., 2001, P SPEAK OD SPEAK REC, P213 Pigeon S, 2000, DIGIT SIGNAL PROCESS, V10, P237, DOI 10.1006/dspr.1999.0358 Remez RE, 2011, J ACOUST SOC AM, V130, P2173, DOI 10.1121/1.3631667 Reynolds DA, 2000, DIGIT SIGNAL PROCESS, V10, P19, DOI 10.1006/dspr.1999.0361 ROSE P, 2003, EXPERT EVIDENCE Rudoy D., 2010, THESIS HARVARD U CAM Rudoy D., 2007, P INT, P526 SCHAFER RW, 1970, J ACOUST SOC AM, V47, P634, DOI 10.1121/1.1911939 SJOLANDER K, 2011, WAVESURFER VERSION 1 Sjolander K., 2000, P ICSLP, P464 Talkin D., 1987, J ACOUST SOC AM S, V82, pS55 Thomson RI, 2009, J ACOUST SOC AM, V126, P1447, DOI 10.1121/1.3177260 Vallabha GK, 2002, SPEECH COMMUN, V38, P141, DOI 10.1016/S0167-6393(01)00049-8 van Leeuwen David A, 2007, Speaker Classification I. Fundamentals, Features, and Methods. (Lecture Notes in Artificial Intelligence vol. 
4343), DOI 10.1007/978-3-540-74200-5_19 Xue SA, 2006, J VOICE, V20, P391, DOI 10.1016/j.jvoice.2005.05.001 Zhang C., 2011, FORENSIC DATABASE AU Zhang C, 2011, P 17 INT C PHON SCI, P2280 ZHANG C, 2012, HUMAN SUPERVISED FUL Zhang CL, 2013, J ACOUST SOC AM, V133, pEL54, DOI 10.1121/1.4773223 NR 56 TC 4 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2013 VL 55 IS 6 BP 796 EP 813 DI 10.1016/j.specom.2013.01.011 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 165IQ UT WOS:000320477500005 ER PT J AU Pardede, HF Iwano, K Shinoda, K AF Pardede, Hilman F. Iwano, Koji Shinoda, Koichi TI Feature normalization based on non-extensive statistics for speech recognition SO SPEECH COMMUNICATION LA English DT Article DE Robust speech recognition; Normalization; q-Logarithm; Non-extensive statistics ID VECTOR TAYLOR-SERIES; NONEXTENSIVE STATISTICS; CROSS-TERMS; NOISE; MODEL; ENTROPY; ENHANCEMENT; ENVIRONMENT; CALCULUS; SPECTRA AB Most compensation methods to improve the robustness of speech recognition systems in noisy environments such as spectral subtraction, CMN, and MVN, rely on the fact that noise and speech spectra are independent. However, the use of limited window in signal processing may introduce a cross-term between them, which deteriorates the speech recognition accuracy. To tackle this problem, we introduce the q-logarithmic (q-log) spectral domain of non-extensive statistics and propose q-log spectral mean normalization (q-LSMN) which is an extension of log spectral mean normalization (LSMN) to this domain. The recognition experiments on a synthesized noisy speech database, the Aurora-2 database, showed that q-LSMN was consistently better than the conventional normalization methods, CMN, LSMN, and MVN. Furthermore, q-LSMN was even more effective when applied to a real noisy environment in the CEN-SREC-2 database. It significantly outperformed ETSI AFE front-end. (C) 2013 Elsevier B.V. All rights reserved. C1 [Pardede, Hilman F.; Shinoda, Koichi] Tokyo Inst Technol, Dept Comp Sci, Grad Sch Informat Sci & Engn, Meguro Ku, Tokyo 1528552, Japan. [Iwano, Koji] Tokyo City Univ, Fac Environm & Informat Studies, Tsuzuki Ku, Yokohama, Kanagawa 2248551, Japan. RP Pardede, HF (reprint author), Tokyo Inst Technol, Dept Comp Sci, Grad Sch Informat Sci & Engn, Meguro Ku, Ookayama 2-12-1, Tokyo 1528552, Japan. EM hilman@ks.cs.titech.ac.jp RI Shinoda, Koichi/D-3198-2014 OI Shinoda, Koichi/0000-0003-1095-3203 FU [24650079] FX This work is supported by Grant in Aid for Challenging Exploratory Research No. 24650079. 
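As a rough illustration of the normalization described in the abstract above, the sketch below applies a Tsallis q-logarithm, ln_q(x) = (x^(1-q) - 1)/(1-q), to a power-spectral matrix and removes the per-bin mean over frames; as q approaches 1 this reduces to ordinary log spectral mean normalization (LSMN). The filter-bank values, the choice q = 0.9, and the function names are illustrative assumptions, not details taken from the paper.

# Sketch: q-log spectral mean normalization (q-LSMN) on a (frames x bins) power spectrum.
import numpy as np

def q_log(x, q):
    # Tsallis q-logarithm; defined for x > 0, and equal to the natural log when q = 1.
    if q == 1.0:
        return np.log(x)
    return (np.power(x, 1.0 - q) - 1.0) / (1.0 - q)

def q_lsmn(power_spec, q=0.9, eps=1e-10):
    # Map to the q-log domain, then remove the per-bin mean across frames.
    z = q_log(np.maximum(power_spec, eps), q)
    return z - z.mean(axis=0, keepdims=True)

# Illustrative noisy "spectrogram": 5 frames x 4 filter-bank bins.
spec = np.abs(np.random.default_rng(0).normal(size=(5, 4))) ** 2 + 1e-3
print(q_lsmn(spec).round(3))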
CR Acero A., 2000, P ICSLP, P869 Agarwal A., 1999, P IEEE WORKSH AUT SP, P12 Avendano C, 1997, IEEE T SPEECH AUDI P, V5, P372, DOI 10.1109/89.593318 Berouti M., 1979, P IEEE INT C AC SPEE, V4, P208 Bezerianos A, 2003, ANN BIOMED ENG, V31, P221, DOI 10.1114/1.1541013 BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 Borges EP, 2004, PHYSICA A, V340, P95, DOI 10.1016/j.physa.2004.03.082 Cohen I, 2005, IEEE T SPEECH AUDI P, V13, P870, DOI 10.1109/TSA.2005.851940 Deng L, 2004, IEEE T SPEECH AUDI P, V12, P133, DOI 10.1109/TSA.2003.820201 Doblinger G., 1995, P 4 EUR C SPEECH COM, P1513 ETSI standard doc, 2002, 2002050 ETSI ES Evans N., 2006, P IEEE INT C AC SPEE, V1, P1520 Faubel F, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P553 FURUI S, 1981, IEEE T ACOUST SPEECH, V29, P254, DOI 10.1109/TASSP.1981.1163530 Gales MJF, 1996, IEEE T SPEECH AUDI P, V4, P352, DOI 10.1109/89.536929 Gradojevic N, 2011, IEEE SIGNAL PROC MAG, V28, P116, DOI 10.1109/MSP.2011.941843 HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423 Hirsch H.-G, 2000, P ISCA ITRW ASR2000, P181 HIRSCH HG, 1995, INT CONF ACOUST SPEE, P153, DOI 10.1109/ICASSP.1995.479387 Itahashi S., 1990, P INT C SPOKEN LANGU, P1081 Ito Y., 2000, P INTERSPEECH, P530 JEONG JC, 1992, IEEE T SIGNAL PROCES, V40, P2608, DOI 10.1109/78.157305 Jiulin D., 2007, ASTROPHYS SPACE SCI, V312, P47, DOI 10.1007/s10509-007-9611-8 KADAMBE S, 1992, IEEE T SIGNAL PROCES, V40, P2498, DOI 10.1109/78.157292 Kim C, 2009, INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, VOLS 1-5, P28 Kim DY, 1998, SPEECH COMMUN, V24, P39, DOI 10.1016/S0167-6393(97)00061-7 KOBAYASHI T, 1984, IEEE T ACOUST SPEECH, V32, P1087, DOI 10.1109/TASSP.1984.1164416 Li JY, 2009, COMPUT SPEECH LANG, V23, P389, DOI 10.1016/j.csl.2009.02.001 LOCKWOOD P, 1992, SPEECH COMMUN, V11, P215, DOI 10.1016/0167-6393(92)90016-Z Mauuary L., 1996, P EUSIPCO McAuley J, 2005, IEEE T SPEECH AUDI P, V13, P956, DOI 10.1109/TSA.2005.851952 Ming J, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P1061 Moreno PJ, 1996, INT CONF ACOUST SPEE, P733, DOI 10.1109/ICASSP.1996.543225 Moret MA, 2011, PHYSICA A, V390, P3055, DOI 10.1016/j.physa.2011.04.008 Nakamura S, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P2330 Nivanen L, 2003, REP MATH PHYS, V52, P437, DOI 10.1016/S0034-4877(03)80040-X Olemskoi A, 2010, EPL-EUROPHYS LETT, V89, DOI 10.1209/0295-5075/89/50007 Pardede H.F., 2011, P INTERSPEECH, P1645 Plastino AR, 2004, ASTROPHYS SPACE SCI, V290, P275, DOI 10.1023/B:ASTR.0000032529.67037.21 Rufiner HL, 2004, PHYSICA A, V332, P496, DOI [10.1016/j.physa.2003.09.050, 10.1016/i.physa.2003.09.050] SCHROEDER MR, 1979, J ACOUST SOC AM, V66, P1647, DOI 10.1121/1.383662 TSALLIS C, 1988, J STAT PHYS, V52, P479, DOI 10.1007/BF01016429 Viikki O, 1998, SPEECH COMMUN, V25, P133, DOI 10.1016/S0167-6393(98)00033-8 Weili S., 2009, P INT C MECH AUT, P1004 Wilk G, 2002, PHYSICA A, V305, P227, DOI 10.1016/S0378-4371(01)00666-5 Zhang YD, 2008, SENSORS-BASEL, V8, P7518, DOI 10.3390/s8117518 Zhu QF, 2002, IEEE SIGNAL PROC LET, V9, P275, DOI 10.1109/LSP.2002.801722 NR 47 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD JUN PY 2013 VL 55 IS 5 BP 587 EP 599 DI 10.1016/j.specom.2013.02.004 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 141RS UT WOS:000318744800001 ER PT J AU Lander, K Capek, C AF Lander, Karen Capek, Cheryl TI Investigating the impact of lip visibility and talking style on speechreading performance SO SPEECH COMMUNICATION LA English DT Article DE Speechreading; Speechreadability; Lip visibility; Speaking style ID AUDIOVISUAL SPEECH-PERCEPTION; CONVERSATIONAL SPEECH; CLEAR SPEECH; HEARING; FACE; INTELLIGIBILITY; LISTENERS AB It has long been known that visual information from a talker's mouth and face plays an important role in the perception and understanding of spoken language. The reported experiments explore the impact of lip visibility (Experiments 1 & 2) and speaking style (Experiment 2) on talker speechreadability. Specifically we compare speechreading performance (words in Experiment 1; sentences in Experiment 2 with low level auditory input) from talkers with natural lips, with brightly coloured lips and with concealed lips. Results reveal that highlighting the lip area by the application of lipstick or concealer improves speechreading, relative to natural lips. Furthermore, speaking in a clear (rather than conversational) manner improves speechreading performance, with no interaction between lip visibility and speaking style. Results are discussed in relation to practical methods of improving speechreading and in relation to attention and movement parameters. (C) 2013 Elsevier B.V. All rights reserved. C1 [Lander, Karen; Capek, Cheryl] Univ Manchester, Sch Psychol Sci, Manchester M13 9PL, Lancs, England. RP Lander, K (reprint author), Univ Manchester, Sch Psychol Sci, Oxford Rd, Manchester M13 9PL, Lancs, England. EM karen.lander@manchester.ac.uk CR Auer ET, 2010, J AM ACAD AUDIOL, V21, P163, DOI 10.3766/jaaa.21.3.4 Bench J., 1979, SPEECH HEARING TESTS Bernstein LE, 2001, J SPEECH LANG HEAR R, V44, P5, DOI 10.1044/1092-4388(2001/001) Bradlow AR, 2002, J ACOUST SOC AM, V112, P272, DOI 10.1121/1.1487837 DeCarlo D, 2000, INT J COMPUT VISION, V38, P99, DOI 10.1023/A:1008122917811 DEMOREST ME, 1992, J SPEECH HEAR RES, V35, P876 Ferguson SH, 2002, J ACOUST SOC AM, V112, P259, DOI 10.1121/1.1482078 Gagne JP, 2002, SPEECH COMMUN, V37, P213, DOI 10.1016/S0167-6393(01)00012-7 Helfer KS, 1997, J SPEECH LANG HEAR R, V40, P432 IJSSELDIJK FJ, 1992, J SPEECH HEAR RES, V35, P466 Irwin A, 2011, SPEECH COMMUN, V53, P807, DOI 10.1016/j.specom.2011.01.010 Krause J.C., 2004, J ACOUST SOC AM, V115, P363 KRICOS PB, 1985, VOLTA REV, V87, P5 Lander K, 2008, Q J EXP PSYCHOL, V61, P961, DOI 10.1080/17470210801908476 Lansing L.R., 2003, PERCEPT PSYCHOPHYS, V65, P536 MARASSA LK, 1995, J SPEECH HEAR RES, V38, P1387 Massaro D. W., 1998, PERCEIVING TALKING F MASSARO DW, 1993, PERCEPT PSYCHOPHYS, V53, P549, DOI 10.3758/BF03205203 McGrath M., 1985, THESIS U NOTTINGHAM Mills A. 
E., 1987, HEARING EYE PSYCHOL, P145 Munhall KG, 1998, HEARING EYE, P123 PICHENY MA, 1985, J SPEECH HEAR RES, V28, P96 Preminger JE, 1998, J SPEECH LANG HEAR R, V41, P564 Reisberg D., 1987, HEARING EYE PSYCHOL Rosenblum LD, 1996, J SPEECH HEAR RES, V39, P1159 Smiljanic R., 2009, LANGUAGE LINGUISTICS, V3, P236, DOI DOI 10.1111/J.1749-818X.2008.00112.X SUMBY WH, 1954, J ACOUST SOC AM, V26, P212, DOI 10.1121/1.1907309 SUMMERFIELD Q, 1992, PHILOS T ROY SOC B, V335, P71, DOI 10.1098/rstb.1992.0009 Valentine G, 2008, SOC CULT GEOGR, V9, P469, DOI 10.1080/14649360802175691 Vatikiotis-Bateson E, 1998, PERCEPT PSYCHOPHYS, V60, P926, DOI 10.3758/BF03211929 Vogt M., 1997, P ESCA WORKSH AUD VI NR 31 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2013 VL 55 IS 5 BP 600 EP 605 DI 10.1016/j.specom.2013.01.003 PG 6 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 141RS UT WOS:000318744800002 ER PT J AU Maia, R Akamine, M Gales, MJF AF Maia, Ranniery Akamine, Masami Gales, Mark J. F. TI Complex cepstrum for statistical parametric speech synthesis SO SPEECH COMMUNICATION LA English DT Article DE Speech synthesis; Statistical parametric speech synthesis; Spectral analysis; Cepstral analysis; Complex cepstrum; Glottal source models ID ALGORITHM AB Statistical parametric synthesizers have typically relied on a simplified model of speech production. In this model, speech is generated using a minimum-phase filter, implemented from coefficients derived from spectral parameters, driven by a zero or random phase excitation signal. This excitation signal is usually constructed from fundamental frequencies and parameters used to control the balance between the periodicity and aperiodicity of the signal. The application of this approach to statistical parametric synthesis has partly been motivated by speech coding theory. However, in contrast to most real-time speech coders, parametric speech synthesizers do not require causality. This allows the standard simplified model to be extended to represent the natural mixed-phase characteristics of speech signals. This paper proposes the use of the complex cepstrum to model the mixed phase characteristics of speech through the incorporation of phase information in statistical parametric synthesis. The phase information is contained in the anti-causal portion of the complex cepstrum. These parameters have a direct connection with the shape of the glottal pulse of the excitation signal. Phase parameters are extracted on a frame-basis and are modeled in the same fashion as the minimum-phase synthesis filter parameters. At synthesis time, phase parameter trajectories are generated and used to modify the excitation signal. Experimental results show that the use of such complex cepstrum-based phase features results in better synthesized speech quality. Listening test results yield an average preference of 60% for the system with the proposed phase feature on both female and male voices. (C) 2013 Elsevier B.V. All rights reserved. C1 [Maia, Ranniery; Gales, Mark J. F.] Toshiba Res Europe Ltd, Cambridge Res Lab, Cambridge CB4 0GZ, England. [Akamine, Masami] Toshiba Co Ltd, Ctr Corp Res & Dev, Saiwai Ku, Kawasaki, Kanagawa 2128582, Japan. RP Maia, R (reprint author), Toshiba Res Europe Ltd, Cambridge Res Lab, 208 Cambridge Sci Pk,Milton Rd, Cambridge CB4 0GZ, England. 
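To make the central quantity of the abstract above concrete, the sketch below computes a DFT-based complex cepstrum for one windowed frame (the textbook definition: inverse FFT of log magnitude plus j times the unwrapped, delay-compensated phase) and splits it into causal and anti-causal parts, the latter carrying the phase information the paper exploits. The synthetic frame, the frame length, and the split at n/2 are illustrative assumptions; no claim is made that this matches the parameterisation used in the paper.

# Sketch: complex cepstrum of a single frame and its causal/anti-causal split.
import numpy as np

def complex_cepstrum(frame):
    # Complex cepstrum: IFFT of log|X| + j*unwrapped phase, after removing the
    # linear-phase (pure delay) component so the log-spectrum phase is continuous.
    n = len(frame)
    spec = np.fft.fft(frame)
    phase = np.unwrap(np.angle(spec))
    delay = int(np.round(phase[n // 2] / np.pi))
    phase = phase - np.pi * delay * np.arange(n) / (n // 2)
    log_spec = np.log(np.abs(spec) + 1e-12) + 1j * phase
    return np.fft.ifft(log_spec).real, delay

rng = np.random.default_rng(1)
frame = np.hanning(64) * rng.normal(size=64)     # stand-in for a windowed speech frame
ceps, delay = complex_cepstrum(frame)

n = len(ceps)
causal = ceps[1:n // 2]          # positive quefrencies: minimum-phase-like envelope part
anti_causal = ceps[n // 2 + 1:]  # negative quefrencies: the phase (mixed-phase) information
print("removed delay (samples):", delay)
print("first causal coefficients:", np.round(causal[:4], 3))
print("last anti-causal coefficients:", np.round(anti_causal[-4:], 3))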
EM ranniery.maia@crl.toshiba.co.uk; masa.akamine@toshiba.co.jp; mjfg@crl.toshiba.co.uk CR BEDNAR JB, 1985, IEEE T ACOUST SPEECH, V33, P1014, DOI 10.1109/TASSP.1985.1164655 Bhanu B., 1980, IEEE T ACOUSTICS SPE, P583 Buchholz S., 2011, P INT, P3053 Buchholz S., 2007, TOSHIBA ENTRY 2007 B Cabral J., 2007, P 6 ISCA SPEECH SYNT, P113 Chu WC, 2003, SPEECH CODING ALGORI Deller J., 2000, DISCRETE TIME PROCES Drugman T., 2009, P INTERSPEECH, P1779 Drugman T, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 1-2, P657 Drugman T, 2011, SPEECH COMMUN, V53, P855, DOI 10.1016/j.specom.2011.02.004 Jackson PJB, 2001, IEEE T SPEECH AUDI P, V9, P713, DOI 10.1109/89.952489 Kawahara H., 2001, P MAVEBA, P13 Maia R., 2012, Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2012), DOI 10.1109/ICASSP.2012.6288938 Maia R., 2007, P ISCA SSW6, P131 Oppenheim Alan V., 2010, DISCRETE TIME SIGNAL, V3rd QUATIERI TF, 1979, IEEE T ACOUST SPEECH, V27, P328, DOI 10.1109/TASSP.1979.1163252 Raitio T, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P1881 SAMPA, COMPUTER READABLE PH Toda T, 2007, IEICE T INF SYST, VE90D, P816, DOI 10.1093/ietisy/e90-d.5.816 Tokuda K., 1994, P INT C SPOK LANG PR, P1043 TRIBOLET JM, 1977, IEEE T ACOUST SPEECH, V25, P170, DOI 10.1109/TASSP.1977.1162923 VERHELST W, 1986, IEEE T ACOUST SPEECH, V34, P43, DOI 10.1109/TASSP.1986.1164787 Vondra M, 2011, LECT NOTES COMPUT SC, V6456, P324, DOI 10.1007/978-3-642-18184-9_27 Yamagishi J., 2010, THE CSTR EMIME HTS S Yoshimura T., 2001, P EUROSPEECH, P2263 Zen H, 2009, SPEECH COMMUN, V51, P1039, DOI 10.1016/j.specom.2009.04.004 Zen HG, 2007, IEICE T INF SYST, VE90D, P325, DOI 10.1093/ietisy/e90-d.1.325 NR 27 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2013 VL 55 IS 5 BP 606 EP 618 DI 10.1016/j.specom.2012.12.008 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 141RS UT WOS:000318744800003 ER PT J AU Xia, BY Bao, CC AF Xia, Bingyin Bao, Changchun TI Compressed domain speech enhancement method based on ITU-T G.722.2 SO SPEECH COMMUNICATION LA English DT Article DE Speech enhancement; Compressed domain; CELP; G.722.2; Parameter modification AB Based on the bit-stream of ITU-T G.722.2 speech coding standard, through the modification of codebook gains in the codec, a compressed domain speech enhancement method that is compatible with the discontinuous transmission (DTX) mode and frame erasure condition is proposed in this paper. In non-DTX mode, the Voice Activity Detection (VAD) is carried out in the compressed domain, and the background noise is classified into full-band distributed noise and low-frequency distributed noise. Then, the noise intensity is estimated based on the algebraic codebook power, and the a priori SNR is estimated according to the noise type. Next, the codebook gains are jointly modified under the rule of energy compensation. Especially, the adaptive comb filter is adopted to remove the residual noise in the excitation signal in low-frequency distributed noise. Finally, the modified codebook gains are re-quantized in speech or excitation domain. For non-speech frames in DTX mode, the logarithmic frame energy is attenuated to remove the noise, while the spectral envelope is kept unchanged. 
When frame erasure occurs, the recovered algebraic codebook gain is exponentially attenuated, and based on the reconstructed algebraic codebook vector, all the codec parameters are re-quantized to form the error-concealed bit-stream. The result of performance evaluation under ITU-T G.160 shows that, with much lower computational complexity, the proposed method achieves better noise reduction, SNR improvement, and objective speech quality than the state-of-the-art compressed-domain methods. The subjective speech quality test shows that the speech quality of the proposed method is better than that of the method that only modifies the algebraic codebook gain, and similar to that of the method assisted by a linear-domain speech enhancement method. (C) 2013 Elsevier B.V. All rights reserved. C1 [Xia, Bingyin; Bao, Changchun] Beijing Univ Technol, Sch Elect Informat & Control Engn, Speech & Audio Signal Proc Lab, Beijing 100124, Peoples R China. RP Bao, CC (reprint author), Beijing Univ Technol, Sch Elect Informat & Control Engn, Speech & Audio Signal Proc Lab, Beijing 100124, Peoples R China. EM baochch@bjut.edu.cn FU Beijing Natural Science Foundation Program; Scientific Research Key Program of Beijing Municipal Commission of Education [KZ201110005005]; Funding Project for Academic Human Resources Development in Institutions of Higher Learning under the Jurisdiction of Beijing Municipality; Postgraduate Science Foundation of Beijing University of Technology [ykj-2012-7284]; Huawei Technologies Co., Ltd. FX This work was supported by the Beijing Natural Science Foundation Program and Scientific Research Key Program of Beijing Municipal Commission of Education (No. KZ201110005005), the Funding Project for Academic Human Resources Development in Institutions of Higher Learning under the Jurisdiction of Beijing Municipality, the 10th Postgraduate Science Foundation of Beijing University of Technology (ykj-2012-7284), and Huawei Technologies Co., Ltd. CR [Anonymous], 2001, P862 ITUT Chandran R, 2000, PROCEEDINGS OF THE 43RD IEEE MIDWEST SYMPOSIUM ON CIRCUITS AND SYSTEMS, VOLS I-III, P10, DOI 10.1109/MWSCAS.2000.951575 Duetsch N., 2004, P 5 ITG FACHB, P357 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 Fapi E.T., 2008, P 7 INT C SOURC CHAN ITU-T, 2002, G7222 ITUT ITU-T, 2002, G7222 ITUT G ITU-T, 2003, G7222 ITUT G ITU-T, 2008, G160 ITUT ITU-T (Telecommunication Standardization Sector International Telecommunication Union), 2005, G191 ITUT Loizou PC, 2005, IEEE T SPEECH AUDI P, V13, P857, DOI 10.1109/TSA.2005.851929 MARTIN R, 1994, P EUSPICO, V2, P1182 Schroeder M.R., 1985, P IEEE INT C AC SPEE, V3, P937 Sukkar R.A., 2006, United States Patent Application, Patent No. [US 2006/0217970 Al, 20060217970] Taddei H., 2004, P IEEE INT C AC SPEE, V1, P1497 VARGA A, 1993, SPEECH COMMUN, V12, P247, DOI 10.1016/0167-6393(93)90095-3 NR 16 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2013 VL 55 IS 5 BP 619 EP 640 DI 10.1016/j.specom.2013.02.001 PG 22 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 141RS UT WOS:000318744800004 ER PT J AU Daqrouq, K Al Azzawi, KY AF Daqrouq, K. Al Azzawi, K. Y.
TI Arabic vowels recognition based on wavelet average framing linear prediction coding and neural network SO SPEECH COMMUNICATION LA English DT Article DE Arabic vowel; LPC; Average framing; Wavelet; Probabilistic neural network ID SPEECH RECOGNITION; ROBUSTNESS; ALGORITHM; ENTROPY AB In this work, an average framing linear prediction coding (AFLPC) technique for speaker-independent Arabic vowels recognition system was proposed. Usually, linear prediction coding (LPC) has been applied in many speech recognition applications, however, the combination of modified LPC termed AFLPC with wavelet transform (WT) is proposed in this study for vowel recognition. The investigation procedure was based on feature extraction and classification. In the stage of feature extraction, the distinguished resonance of vocal tract of Arabic vowel characteristics was extracted using the AFLPC technique. LPC order of 30 was found to be the best according to the system performance. In the phase of classification, probabilistic neural network (PNN) was applied because of its rapid response and ease in implementation. In practical investigation, performances of different wavelet transforms in conjunction with AFLPC were compared with one another. In addition, the capability analysis on the proposed system was examined by comparing with other systems proposed in latest literature. Referring to our experimental results, the PNN classifier could achieve a better recognition rate with discrete wavelet transform and AFLPC as a feature extraction method termed (LPCDWTF). (C) 2013 Elsevier B.V. All rights reserved. C1 [Daqrouq, K.] King Abdulaziz Univ, Elect & Comp Eng Dept, Jeddah 21413, Saudi Arabia. [Al Azzawi, K. Y.] Univ Technol Baghdad, Electromech Engn Dept, Baghdad, Iraq. RP Daqrouq, K (reprint author), King Abdulaziz Univ, Elect & Comp Eng Dept, Jeddah 21413, Saudi Arabia. EM haleddaq@yahoo.com RI Daqrouq, Khaled/K-1293-2012 CR Abu-Rabia A., 1999, J PSYCHOLINGUIST RES, V28, P93 Alghamdi M., 1998, J KING SAUD U, V10, P3 Alotaibi Y., 2009, P 1 INT C DEC FGIT J, P10 Alotaibi Y., 2009, P BIOID MULTICOMM MA Alotaibi YA, 2005, INFORM SCIENCES, V173, P115, DOI 10.1016/j.ins.2004.07.008 Amrouche A., 2009, ENG APPL ARTIFICIAL Amrouche A, 2003, Proceedings of the 46th IEEE International Midwest Symposium on Circuits & Systems, Vols 1-3, P689 Anani M., 1999, P 14 INT C PHON SCI, V9, P2117 Andrianopoulos MV, 2001, J VOICE, V15, P194, DOI 10.1016/S0892-1997(01)00021-2 Atal BS, 2006, IEEE SIGNAL PROC MAG, V23, P154, DOI 10.1109/MSP.2006.1598091 Avci D, 2009, EXPERT SYST APPL, V36, P6295, DOI 10.1016/j.eswa.2008.07.012 Avci E., 2006, EXPERT SYST APPL, V33, P582 Avci E, 2007, EXPERT SYST APPL, V32, P485, DOI 10.1016/j.eswa.2005.12.004 Cherif A, 2001, APPL ACOUST, V62, P1129, DOI 10.1016/S0003-682X(01)00007-X Daqrouq K, 2011, ENG APPL ARTIF INTEL, V24, P796, DOI 10.1016/j.engappai.2011.01.001 Daqrouq K., 2009, INT J INFORM SCI COM, V1 Daqrouq Khaled, 2010, International Journal of Speech Technology, V13, DOI 10.1007/s10772-010-9073-1 DAUBECHIES I, 1988, COMMUN PUR APPL MATH, V41, P909, DOI 10.1002/cpa.3160410705 Delac K, 2009, IMAGE VISION COMPUT, V27, P1108, DOI 10.1016/j.imavis.2008.10.007 Engin A., 2007, EXPERT SYSTEMS APPL, V32, P485 GOWDY JN, 2000, ACOUST SPEECH SIG PR, P1351 Hachkar Z., 2011, MACHINE LEARNING PAT, V2 Hermansky H., 1993, ICASSP-93. 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing (Cat. 
No.92CH3252-4), DOI 10.1109/ICASSP.1993.319236 HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423 Jongman A, 2011, J PHONETICS, V39, P85, DOI 10.1016/j.wocn.2010.11.007 Junqua JC, 1994, IEEE T SPEECH AUDI P, V2, P406, DOI 10.1109/89.294354 Karray L, 2003, SPEECH COMMUN, V40, P261, DOI 10.1016/S0167-6393(02)00066-3 Kirchoff K., 2002, TECHNICAL REPORT Kirschhoff K., 2003, P INT C ASSP ICASSP, P344 Kotnik B., 2003, P IEEE EUROCON 2003, P131 Lazli L, 2003, LECT NOTES ARTIF INT, V2734, P379 Lee S, 2008, CLIN LINGUIST PHONET, V22, P523, DOI 10.1080/02699200801945120 Lei Z., 2005, CIRC SYST SIGNAL PRO, V24, P287, DOI 10.1007/s00034-004-0529-x MACKOWIAK PA, 1992, JAMA-J AM MED ASSOC, V268, P1578, DOI 10.1001/jama.268.12.1578 Mallat S., 1998, WAVELET TOUR SIGNAL MALLAT SG, 1989, IEEE T PATTERN ANAL, V11, P674, DOI 10.1109/34.192463 Mayo R., 1995, J NATL BLACK ASS SPE, V17, P32 Mokbel C., 1995, EUR C SPEECH COMM TE, P141 Mokbel C, 1997, SPEECH COMMUN, V23, P141, DOI 10.1016/S0167-6393(97)00042-3 Natour Y.S., 2010, J VOICE, V25, pe75 Saeed K., 2005, P IEEE 7 INT C DSPA, P528 Saeed K, 2005, INFORMATION PROCESSING AND SECURITY SYSTEMS, P55, DOI 10.1007/0-387-26325-X_6 Selouani SA, 2001, ISSPA 2001: SIXTH INTERNATIONAL SYMPOSIUM ON SIGNAL PROCESSING AND ITS APPLICATIONS, VOLS 1 AND 2, PROCEEDINGS, P719 Titze IR, 1995, WORKSH AC VOIC AN SU Tryon WW, 2001, PSYCHOL METHODS, V6, P371, DOI 10.1037//1082-989X.6.4.371 Tufekci Z., 2000, Proceedings of the IEEE SoutheastCon 2000. `Preparing for The New Millennium' (Cat. No.00CH37105), DOI 10.1109/SECON.2000.845444 Uchida S, 2002, INT C PATT RECOG, P572 VISHWANATH M, 1994, IEEE T SIGNAL PROCES, V42, P673, DOI 10.1109/78.277863 Wu J.-D., 2009, SPEAKER IDENTIFICATI Wu J.-D., 2009, EXPERT SYSTEMS APPL Xue SA, 2006, CLIN LINGUIST PHONET, V20, P691, DOI 10.1080/02699200500297716 Zitouni I, 2009, COMPUT SPEECH LANG, V23, P257, DOI 10.1016/j.csl.2008.06.001 NR 52 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2013 VL 55 IS 5 BP 641 EP 652 DI 10.1016/j.specom.2013.01.002 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 141RS UT WOS:000318744800005 ER PT J AU Mehrabani, M Hansen, JHL AF Mehrabani, Mahnoosh Hansen, John H. L. TI Singing speaker clustering based on subspace learning in the GMM mean supervector space SO SPEECH COMMUNICATION LA English DT Article DE Speaker clustering; Singing; Speaking styles; Subspace learning ID SPEECH; RECOGNITION; MODELS; VERIFICATION; MIXTURE; COMPENSATION; ADAPTATION; RECORDINGS; SEPARATION; STRESS AB In this study, we propose algorithms based on subspace learning in the GMM mean supervector space to improve performance of speaker clustering with speech from both reading and singing. As a speaking style, singing introduces changes in the time-frequency structure of a speaker's voice. The purpose of this study is to introduce advancements for speech systems such as speech indexing and retrieval which improve robustness to intrinsic variations in speech production. Speaker clustering techniques such as k-means and hierarchical are explored for analysis of acoustic space differences of a corpus consisting of reading and singing of lyrics for each speaker. Furthermore, a distance based on fuzzy c-means membership degrees is proposed to more accurately measure clustering difficulty or speaker confusability. 
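The abstract above proposes a confusability measure built on fuzzy c-means membership degrees without giving its exact form in the abstract itself; the sketch below computes standard fuzzy c-means memberships for toy stand-ins for GMM mean supervectors and uses the margin between the two largest memberships as one plausible confusability indicator. The data, the fuzzifier m = 2, and the margin-based indicator are assumptions made for illustration only.

# Sketch: fuzzy c-means membership degrees and a simple confusability indicator.
import numpy as np

def fcm_memberships(points, centers, m=2.0, eps=1e-12):
    # Standard fuzzy c-means membership update for fixed centers:
    # u[i, k] = 1 / sum_j (d(x_i, c_k) / d(x_i, c_j)) ** (2 / (m - 1))
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2) + eps
    ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=2)

rng = np.random.default_rng(0)
# Toy stand-ins for per-utterance GMM mean supervectors of two speakers (3-D for brevity).
points = np.vstack([rng.normal(0.0, 1.0, size=(10, 3)), rng.normal(2.0, 1.0, size=(10, 3))])
centers = np.array([[0.0, 0.0, 0.0], [2.0, 2.0, 2.0]])

u = fcm_memberships(points, centers)
u_sorted = np.sort(u, axis=1)
margin = u_sorted[:, -1] - u_sorted[:, -2]
print("mean membership margin (small = speakers hard to separate):", round(float(margin.mean()), 3))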
Two categories of subspace learning methods are studied: unsupervised based on LPP, and supervised based on PLDA. Our proposed clustering method based on PLDA is a two-stage algorithm: first, initial clusters are obtained using full-dimension supervectors; next, each cluster is refined in a PLDA subspace, resulting in a more speaker-dependent representation that is less sensitive to speaking style. It is shown that LPP improves average clustering accuracy by 5.1% absolute versus a hierarchical baseline for a mixture of reading and singing, and PLDA-based clustering increases accuracy by 9.6% absolute versus a k-means baseline. The advancements offer novel techniques to improve model formulation for speech applications including speaker ID, audio search, and audio content analysis. (C) 2012 Elsevier B.V. All rights reserved. C1 [Mehrabani, Mahnoosh; Hansen, John H. L.] Univ Texas Dallas, Erik Jonsson Sch Engn & Comp Sci, CRSS, Dallas, TX 75230 USA. RP Hansen, JHL (reprint author), Univ Texas Dallas, Erik Jonsson Sch Engn & Comp Sci, CRSS, Dallas, TX 75230 USA. EM mahmehrabani@utdallas.edu; john.hansen@utdallas.edu CR Belkin M, 2002, ADV NEUR IN, V14, P585 Ben M., 2004, P ICSLP Bezdek J.C., 1981, PATTERN RECOGNITION Bishop C. M., 1995, NEURAL NETWORKS PATT Bou-Ghazale SE, 1998, IEEE T SPEECH AUDI P, V6, P201, DOI 10.1109/89.668815 Campbell WM, 2006, IEEE SIGNAL PROC LET, V13, P308, DOI 10.1109/LSP.2006.870086 Chu SM, 2009, IEEE INT CON MULTI, P494 Chu SM, 2009, INT CONF ACOUST SPEE, P4089, DOI 10.1109/ICASSP.2009.4960527 Dunn J.C., 1973, FUZZY RELATIVE ISODA Faltlhauser R., 2001, IEEE WORKSH AUT SPEE, P57, DOI 10.1109/ASRU.2001.1034588 Fan X, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P1313 Hansen J. H.
L., 2000, IMPACT SPEECH STRESS Hansen JHL, 2009, IEEE T AUDIO SPEECH, V17, P366, DOI 10.1109/TASL.2008.2009019 Hansen JHL, 1996, SPEECH COMMUN, V20, P151, DOI 10.1016/S0167-6393(96)00050-7 He XF, 2005, IEEE T PATTERN ANAL, V27, P328 He Xiaofei, 2003, P C ADV NEUR INF PRO, V16, P153 HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423 Ioffe S, 2006, LECT NOTES COMPUT SC, V3954, P531 Kuhn R, 2000, IEEE T SPEECH AUDI P, V8, P695, DOI 10.1109/89.876308 Li YP, 2007, IEEE T AUDIO SPEECH, V15, P1475, DOI 10.1109/TASL.2006.889789 Lippmann R., 1987, P INT C AC SPEECH SI, V12, P705 Makhoul J, 2000, P IEEE, V88, P1338, DOI 10.1109/5.880087 Mehrabani M., 2012, P INTERSPEECH Ozerov A, 2007, IEEE T AUDIO SPEECH, V15, P1564, DOI 10.1109/TASL.2007.899291 Prince S., 2007, IEEE 11 INT C COMP V, P1 Rabiner L., 1993, FUNDAM SPEECH RECOGN, V103 Reynolds D., 2009, P INTERSPEECH, P6 REYNOLDS DA, 1995, SPEECH COMMUN, V17, P91, DOI 10.1016/0167-6393(95)00009-D Reynolds DA, 2000, DIGIT SIGNAL PROCESS, V10, P19, DOI 10.1006/dspr.1999.0361 Shriberg E., 2008, P INTERSPEECH SOLOMONOFF A, 1998, ACOUST SPEECH SIG PR, P757 Tang H, 2012, IEEE T PATTERN ANAL, V34, P959, DOI 10.1109/TPAMI.2011.174 Tang H, 2009, INT CONF ACOUST SPEE, P4101 Tranter SE, 2006, IEEE T AUDIO SPEECH, V14, P1557, DOI 10.1109/TASL.2006.878256 Tsai WH, 2004, COMPUT MUSIC J, V28, P68, DOI 10.1162/0148926041790630 Tsai WH, 2005, INT CONF ACOUST SPEE, P725 WARD JH, 1963, J AM STAT ASSOC, V58, P236, DOI 10.2307/2282967 Wooters C, 2008, LECT NOTES COMPUT SC, V4625, P509 Wu W, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P2102 Zhang C, 2011, IEEE T AUDIO SPEECH, V19, P883, DOI 10.1109/TASL.2010.2066967 Zhang C., 2007, P INTERSPEECH, V2007, P2289 NR 41 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2013 VL 55 IS 5 BP 653 EP 666 DI 10.1016/j.specom.2012.11.001 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 141RS UT WOS:000318744800006 ER PT J AU Erath, BD Zanartu, M Stewart, KC Plesniak, MW Sommer, DE Peterson, SD AF Erath, Byron D. Zanartu, Matias Stewart, Kelley C. Plesniak, Michael W. Sommer, David E. Peterson, Sean D. TI A review of lumped-element models of voiced speech SO SPEECH COMMUNICATION LA English DT Article DE Lumped-mass models; Glottal aerodynamics; Acoustics; Vocal fold models; Vocal tract; Subglottal system; Acoustic interaction ID VOCAL-FOLD MODEL; TRACT AREA FUNCTION; GLOTTAL AIR-FLOW; IN-VITRO MODELS; VISCOELASTIC SHEAR PROPERTIES; INTRINSIC LARYNGEAL MUSCLES; INVERSE FILTERING TECHNIQUE; VERTICAL-BAR UTTERANCES; DOMAIN ACOUSTIC MODEL; 2-MASS MODEL AB Voiced speech is a highly complex process involving coupled interactions between the vocal fold structure, aerodynamics, and acoustic field. Reduced-order lumped-element models of the vocal fold structure, coupled with various aerodynamic and acoustic models, have proven useful in a wide array of speech investigations. These simplified models of speech, in which the vocal folds are approximated as arrays of lumped masses connected to one another via springs and dampers to simulate the viscoelastic tissue properties, have been used to study phenomena ranging from sustained vowels and pitch glides to polyps and vocal fold paralysis. 
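To ground the modelling framework sketched in the abstract above, the code below integrates the smallest possible lumped-element unit: a single mass on a spring and damper pushed laterally by a steady subglottal pressure while the glottis is open. All parameter values are illustrative; as the comments note, such a bare one-mass system with constant driving pressure and no acoustic load simply relaxes to a displaced equilibrium, which is one reason the models reviewed in the paper use two or more masses (mucosal-wave phase lag) and/or coupling to the vocal tract and subglottal system to obtain self-sustained oscillation.

# Sketch: minimal one-mass, spring-damper "vocal fold" under a steady driving pressure.
# Illustrative parameters only; published studies typically use two-mass
# (Ishizaka-Flanagan type) or body-cover formulations with acoustic loading.
import numpy as np

m, k, b = 0.1, 80000.0, 20.0       # mass (g), stiffness (dyn/cm), damping (dyn*s/cm)
L, x0 = 1.4, 0.02                  # vibrating length (cm), rest half-width (cm)
Ps = 8000.0                        # subglottal pressure (dyn/cm^2, ~0.8 kPa)
dt, steps = 1e-5, 5000             # semi-implicit Euler step (s) and number of steps

x, v = 0.0, 0.0                    # lateral displacement (cm) and velocity (cm/s)
half_width = []
for _ in range(steps):
    open_glottis = (x0 + x) > 0.0
    drive = Ps * L * (0.5 if open_glottis else 1.0)   # crude medial-surface pushing force
    a = (drive - b * v - k * x) / m
    v += a * dt
    x += v * dt
    half_width.append(x0 + x)

half_width = np.array(half_width)
# Without a second mass or inertive acoustic load, the motion is a damped transient
# that settles at the open-glottis equilibrium rather than a sustained oscillation.
print("final half-width (cm): %.4f  (open-glottis equilibrium: %.4f)"
      % (half_width[-1], x0 + 0.5 * Ps * L / k))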
Over the past several decades a variety of structural, aerodynamic, and acoustic models have been developed and deployed into the lumped-element modeling framework. This paper aims to provide an overview of advances in lumped-element models and their constituents, with particular emphasis on their physical foundations and limitations. Examples of the application of lumped-element models to speech studies will also be addressed, as well as an outlook on the direction and future of these models. (C) 2013 Elsevier B.V. All rights reserved. C1 [Erath, Byron D.] Clarkson Univ, Dept Mech & Aeronaut Engn, Potsdam, NY 13699 USA. [Zanartu, Matias] Univ Tecn Federico Santa Maria, Dept Elect Engn, Valparaiso, Chile. [Stewart, Kelley C.; Plesniak, Michael W.] George Washington Univ, Dept Mech & Aerosp Engn, Washington, DC 20052 USA. [Sommer, David E.; Peterson, Sean D.] Univ Waterloo, Dept Mech & Mechatron Engn, Waterloo, ON N2L 3G1, Canada. RP Erath, BD (reprint author), Clarkson Univ, Dept Mech & Aeronaut Engn, Potsdam, NY 13699 USA. EM berath@clarkson.edu; matias.zanartu@usm.cl; kstewart@gwu.edu; plesniak@gwu.edu; peterson@mme.uwaterloo.ca RI Zanartu, Matias/I-3133-2012 OI Zanartu, Matias/0000-0001-5581-4392 FU National Science Foundation [CBET 1036280]; UTFSM; CONICYT; [FONDECYT 11110147] FX This material is based upon work supported by the National Science Foundation under Grant No. CBET 1036280. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. The work of Maths Zanartu was supported by UTFSM and CONICYT, Grant FONDECYT 11110147 CR Agarwal M, 2003, J VOICE, V17, P97, DOI 10.1016/S0892-1997(03)00012-2 Agarwal M, 2004, THESIS BOWLING GREEN Alipour F, 2001, ANN OTO RHINOL LARYN, V110, P550 ALIPOURHAGHIGHI F, 1991, J ACOUST SOC AM, V90, P1326, DOI 10.1121/1.401924 Alipour F., 2013, J ACOUST SOC AM, V132, P1017 Alku P, 2009, J ACOUST SOC AM, V125, P3289, DOI 10.1121/1.3095801 ARNOLD GODFREY E., 1961, LARYNGOSCOPE, V71, P687 Avanzini F, 2008, SPEECH COMMUN, V50, P95, DOI 10.1016/j.specom.2007.07.002 Avanzini F., 2001, 7 EUR C SPEECH COMM, P51 Avanzini F, 2006, ACTA ACUST UNITED AC, V92, P731 Baer T., 1981, 11 ASHA, V11, P38 Bailly L, 2008, J ACOUST SOC AM, V124, P3296, DOI 10.1121/1.2977740 Bailly L, 2010, J ACOUST SOC AM, V127, P3212, DOI 10.1121/1.3365220 Baken R.J., 1997, PROFESSIONAL VOICE S Benjamin B., 1987, ANN OTO RHINOL LARYN, V99, P530 Birkholz P., 2011, P INT 2011 FLOR IT, P2681 Birkholz P., 2011, STUDIENTEXTE SPRACHK, P184 Birkholz P, 2011, IEEE T AUDIO SPEECH, V19, P1422, DOI 10.1109/TASL.2010.2091632 Birkholz P., 2011, 1 INT WORKSH PERF SP Birkhoz P, 2007, IEEE T AUDIO SPEECH, V15, P1218, DOI 10.1109/TASL.2006.889731 Bocklet T., 2011, P IEEE WORKSH AUT SP, P478 BROWN BL, 1974, J ACOUST SOC AM, V55, P313, DOI 10.1121/1.1914504 Bunton K, 2010, J ACOUST SOC AM, V127, pEL146, DOI 10.1121/1.3313921 Bunton K., 2011, J ACOUST SOC AM, V129, P2626 Chan RW, 2000, J ACOUST SOC AM, V107, P565, DOI 10.1121/1.428354 Chan RW, 1999, J ACOUST SOC AM, V106, P2008, DOI 10.1121/1.427947 Chen LQ, 2008, LECT NOTES COMPUT SC, V5209, P1 CHILDERS DG, 1986, J ACOUST SOC AM, V80, P1309, DOI 10.1121/1.394382 Cisonni J, 2011, ACTA ACUST UNITED AC, V97, P291, DOI 10.3813/AAA.918409 Cook D.D., 2010, 7 INT C VOIC PHYS BI CRANEN B, 1995, J PHONETICS, V23, P165, DOI 10.1016/S0095-4470(95)80040-9 CRANEN B, 1987, J ACOUST SOC AM, V81, P734, DOI 10.1121/1.394842 Dejonckere PH, 2009, FOLIA 
PHONIATR LOGO, V61, P171, DOI 10.1159/000219952 de Vries MP, 1999, J ACOUST SOC AM, V106, P3620, DOI 10.1121/1.428214 de Vries MP, 2002, J ACOUST SOC AM, V111, P1847, DOI 10.1121/1.1323716 Dollinger M, 2002, IEEE T BIO-MED ENG, V49, P773, DOI 10.1109/TBME.2002.800755 Drechsel JS, 2008, J ACOUST SOC AM, V123, P4434, DOI 10.1121/1.2897040 Dresel Christian, 2006, Logoped Phoniatr Vocol, V31, P61, DOI 10.1080/14015430500363232 Drioli C, 2002, MED ENG PHYS, V24, P453, DOI 10.1016/S1350-4533(02)00057-7 Dursun G, 1996, J VOICE, V10, P206, DOI 10.1016/S0892-1997(96)80048-8 Erath BD, 2011, CHAOS, V21, DOI 10.1063/1.3615726 Erath BD, 2012, INT J HEAT FLUID FL, V35, P93, DOI 10.1016/j.ijheatfluidflow.2012.03.006 Erath BD, 2006, J ACOUST SOC AM, V120, P1000, DOI 10.1121/1.2213522 Erath BD, 2010, INT J HEAT FLUID FL, V31, P468, DOI 10.1016/j.ijheatfluidflow.2010.02.014 Erath BD, 2006, EXP FLUIDS, V40, P683, DOI 10.1007/s00348-006-0106-0 Erath BD, 2011, J ACOUST SOC AM, V130, P389, DOI 10.1121/1.3586785 Erath BD, 2006, EXP FLUIDS, V41, P735, DOI 10.1007/s00348-006-0196-8 Erath B.D., 2010, J ACOUST SOC AM, V129, pEL64 Erath BD, 2010, EXP FLUIDS, V49, P131, DOI 10.1007/s00348-009-0809-0 ERIKSSON LJ, 1980, J ACOUST SOC AM, V68, P545, DOI 10.1121/1.384768 Fant G., 1960, ACOUSTIC THEORY SPEE Fant G., 1987, STL QPSR, V28, P13 Flanagan J., 1972, SPEECH ANAL SYNTHESI FLANAGAN JL, 1968, IEEE T ACOUST SPEECH, VAU16, P57, DOI 10.1109/TAU.1968.1161949 Fraile R, 2012, BIOMED SIGNAL PROCES, V7, P65, DOI 10.1016/j.bspc.2011.04.002 Fulcher LP, 2006, AM J PHYS, V74, P386, DOI 10.1119/1.2173272 GAY T, 1972, ANN OTO RHINOL LARYN, V81, P401 Goldberg D. E, 1989, GENETIC ALGORITHMS S Gunter HE, 2003, J ACOUST SOC AM, V113, P994, DOI 10.1121/1.1534100 GUPTA V, 1973, J ACOUST SOC AM, V54, P1607, DOI 10.1121/1.1914457 HANSON DG, 1988, LARYNGOSCOPE, V98, P541 Hanson HM, 1997, J ACOUST SOC AM, V101, P466, DOI 10.1121/1.417991 HARTMAN DE, 1984, ARCH OTOLARYNGOL, V110, P394 HERZEL H, 1995, CHAOS, V5, P30, DOI 10.1063/1.166078 HERZEL H, 1995, NONLINEAR DYNAM, V7, P53 Hess MM, 1998, J VOICE, V12, P50, DOI 10.1016/S0892-1997(98)80075-1 HILLMAN RE, 1989, J SPEECH HEAR RES, V32, P373 Hirano M., 1975, OTOLOGIA FUKUOKA S1, V21, P239 Hirano M., 1977, DYNAMIC ASPECTS SPEE, P13 Hirano M., 1981, VOCAL FOLD PHYSL, P33 HIRANO M, 1974, FOLIA PHONIATR, V26, P89 Hirano M., 1983, VOCAL FOLD PHYSL CON, P22 HIRANO M, 1970, FOLIA PHONIATR, V22, P1 Hirschberg A, 1996, VOCAL FOLD, P31 Ho JC, 2011, J ACOUST SOC AM, V129, P1531, DOI 10.1121/1.3543971 Hofmans GCJ, 2003, J ACOUST SOC AM, V113, P1658, DOI 10.1121/1.1547459 HOLMBERG EB, 1988, J ACOUST SOC AM, V84, P511, DOI 10.1121/1.396829 Honda K, 2004, IEICE T INF SYST, VE87D, P1050 Horacek J, 2005, J FLUID STRUCT, V20, P853, DOI 10.1016/j.jfluidstructs.2005.05.003 ISHIZAKA K, 1976, J ACOUST SOC AM, V60, P1193, DOI 10.1121/1.381221 ISHIZAKA K, 1976, J ACOUST SOC AM, V60, P190, DOI 10.1121/1.381064 Ishizaka K, 1968, J ACOUST SOC JPN, V24, P312 ISHIZAKA K, 1972, AT&T TECH J, V51, P1233 Isshiki N., 1977, FUNCTIONAL SURG LARY Jiang J, 2000, OTOLARYNG CLIN N AM, V33, P699, DOI 10.1016/S0030-6665(05)70238-3 Jiang JJ, 2001, J ACOUST SOC AM, V110, P2120, DOI 10.1121/1.1395596 JIANG JJQ, 1994, J VOICE, V8, P132, DOI 10.1016/S0892-1997(05)80305-4 Johns M.M., 2003, HEAD NECK SURG, V11, P456 Kaneko T., 1972, J JPN SOC BRONCHOESO, V25, P133 Kelly J.L., 1973, STANFORD REV SPR, P1 Khosla S, 2007, ANN OTO RHINOL LARYN, V116, P217 Khosla S, 2008, CURR OPIN OTOLARYNGO, V16, P183, DOI 10.1097/MOO.0b013e3282ff5fc5 Khosla S, 
2008, ANN OTO RHINOL LARYN, V117, P134 KLATT DH, 1990, J ACOUST SOC AM, V87, P820, DOI 10.1121/1.398894 Kob M., 2002, THESIS U TECHNOLOGY KOIZUMI T, 1993, LARYNGOSCOPE, V103, P1035 KOIZUMI T, 1987, J ACOUST SOC AM, V82, P1179, DOI 10.1121/1.395254 Krane M, 2007, J ACOUST SOC AM, V122, P3659, DOI 10.1121/1.2409485 Kroger BJ, 2011, LECT NOTES COMPUT SC, V6456, P354, DOI 10.1007/978-3-642-18184-9_31 Kroger BJ, 2007, LARYNGO RHINO OTOL, V86, P365, DOI 10.1055/s-2006-944981 Kroger BJ, 2011, COGN COMPUT, V3, P449, DOI 10.1007/s12559-010-9071-2 Kroger B.J., 2011, PALADYN J BEHAV ROBO, V2, P82 Kuo J., 1998, THESIS HARVARD MIT D Li S, 2007, LECT NOTES COMPUT SC, V4561, P147 Liljencrants J., 1985, THESIS ROYAL I TECHN Liljencrants J., 1991, STL QPSR, V32, P1 Lo CY, 2000, ARCH SURG-CHICAGO, V135, P204, DOI 10.1001/archsurg.135.2.204 LOFQVIST A, 1995, SPEECH COMMUN, V16, P49, DOI 10.1016/0167-6393(94)00049-G LOGEMANN JA, 1978, J SPEECH HEAR DISORD, V43, P47 Lohscheller J, 2008, IEEE T MED IMAGING, V27, P300, DOI 10.1109/TMI.2007.903690 Lohscheller J, 2007, MED IMAGE ANAL, V11, P400, DOI 10.1016/j.media.2007.04.005 Lous NJC, 1998, ACUSTICA, V84, P1135 Lowell SY, 2006, J ACOUST SOC AM, V120, P386, DOI 10.1121/1.2204442 Lucero JC, 2005, J SOUND VIB, V282, P1247, DOI 10.1016/j.jsv.2004.05.008 Lucero JC, 2005, J ACOUST SOC AM, V117, P1362, DOI 10.1121/1.1853235 Luo HX, 2009, J ACOUST SOC AM, V126, P816, DOI 10.1121/1.3158942 Maeda S., 1982, Speech Communication, V1, DOI 10.1016/0167-6393(82)90017-6 MASSEY EW, 1985, SOUTHERN MED J, V78, P316 MCGOWAN RS, 1995, SPEECH COMMUN, V16, P67, DOI 10.1016/0167-6393(94)00048-F MCGOWAN RS, 1988, J ACOUST SOC AM, V83, P696, DOI 10.1121/1.396165 McGowan RS, 2010, J ACOUST SOC AM, V127, pEL215, DOI 10.1121/1.3397283 Mehta DD, 2011, J ACOUST SOC AM, V130, P3999, DOI 10.1121/1.3658441 Mergell P, 1997, SPEECH COMMUN, V22, P141, DOI 10.1016/S0167-6393(97)00016-2 Miller DG, 2005, FOLIA PHONIATR LOGO, V57, P278, DOI 10.1159/000087081 Mittal R, 2013, ANNU REV FLUID MECH, V45, P437, DOI 10.1146/annurev-fluid-011212-140636 Mokhtari P, 2008, SPEECH COMMUN, V50, P179, DOI 10.1016/j.specom.2007.08.001 Mongeau L, 1997, J ACOUST SOC AM, V102, P1121, DOI 10.1121/1.419864 Neubauer J, 2007, J ACOUST SOC AM, V121, P1102, DOI 10.1121/1.2409488 Park JB, 2008, J ACOUST SOC AM, V124, P1171, DOI 10.1121/1.2945116 Park JB, 2007, J ACOUST SOC AM, V121, P442, DOI 10.1121/1.2401652 PELORSON X, 1995, ACTA ACUST, V3, P191 PELORSON X, 1994, J ACOUST SOC AM, V96, P3416, DOI 10.1121/1.411449 Pelorson X, 1996, ACUSTICA, V82, P358 Perlman A. 
L., 1985, THESIS U IOWA IOWA C Qin XL, 2009, IEEE T BIO-MED ENG, V56, P1744, DOI 10.1109/TBME.2009.2015772 Qiu Q.J., 2002, P 2 INT S INSTR SCI, V3, P541 ROTHENBE.M, 1973, J ACOUST SOC AM, V53, P1632, DOI 10.1121/1.1913513 ROTHENBERG M, 1977, J ACOUST SOC AM, V61, P1063, DOI 10.1121/1.381392 Rothenberg M., 1984, VOCAL FOLD PHYSL BIO, P465 Rothenberg M., 1981, STL QPSR, V4, P1 Rupitsch SJ, 2011, J SOUND VIB, V330, P4447, DOI 10.1016/j.jsv.2011.05.008 Ruty N, 2007, J ACOUST SOC AM, V121, P479, DOI 10.1121/1.2384846 Scherer RC, 2010, J ACOUST SOC AM, V128, P828, DOI 10.1121/1.3455838 Schlichting H., 1968, BOUNDARY LAYER THEOR Schroeter J., 2008, HDB SPEECH PROCESSIN, P413 Schwarz R, 2008, J ACOUST SOC AM, V123, P2717, DOI [10.1121/1.2902167, 10.1121/1.29021671] Schwarz R, 2006, IEEE T BIO-MED ENG, V53, P1099, DOI 10.1109/TBME.2006.873396 Sciamarella D., 2003, EUROSPEECH Sciamarella D, 2009, SPEECH COMMUN, V51, P344, DOI 10.1016/j.specom.2008.10.004 Sciamarella D, 2004, ACTA ACUST UNITED AC, V90, P746 SERCARZ JA, 1992, ANN OTO RHINOL LARYN, V101, P567 SMITH ME, 1992, J SPEECH HEAR RES, V35, P545 SOBEY IJ, 1983, J FLUID MECH, V134, P247, DOI 10.1017/S0022112083003341 Sommer DE, 2013, J ACOUST SOC AM, V133, pEL214, DOI 10.1121/1.4790662 Sommer DE, 2012, J ACOUST SOC AM, V132, pEL271, DOI 10.1121/1.4734013 STEINECKE I, 1995, J ACOUST SOC AM, V97, P1874, DOI 10.1121/1.412061 Stevens K.N., 1998, ACOUSTIC PHONETICS STEVENS KN, 1955, J ACOUST SOC AM, V27, P484, DOI 10.1121/1.1907943 Story B. H., 1995, THESIS U IOWA IOWA C Story B. H., 2002, Acoustical Science and Technology, V23, DOI 10.1250/ast.23.195 Story BH, 2010, J SPEECH LANG HEAR R, V53, P1514, DOI 10.1044/1092-4388(2010/09-0127) STORY BH, 1995, J ACOUST SOC AM, V97, P1249, DOI 10.1121/1.412234 Story B.H., 2007, EMOTIONS HUMAN VOICE, V1, P123 Story BH, 2008, J ACOUST SOC AM, V123, P327, DOI 10.1121/1.2805683 Story BH, 1996, J ACOUST SOC AM, V100, P537, DOI 10.1121/1.415960 Story BH, 2007, J ACOUST SOC AM, V121, P3770, DOI 10.1121/1.2730621 Story BH, 2005, J ACOUST SOC AM, V117, P3231, DOI 10.1121/1.1869752 Story B.H., 2009, J ACOUST SOC AM, V125, P2637 Takemoto H, 2006, J ACOUST SOC AM, V119, P1037, DOI 10.1121/1.2151823 Tao C, 2008, PHYS REV E, V77, DOI 10.1103/PhysRevE.77.061922 Tao C, 2007, IEEE T BIO-MED ENG, V54, P794, DOI 10.1109/TBME.2006.889182 Tao C, 2007, J BIOMECH, V40, P2191, DOI 10.1016/j.jbiomech.2006.10.030 Tao C, 2007, J ACOUST SOC AM, V122, P2270, DOI 10.1121/1.2773960 Titze I, 2008, J ACOUST SOC AM, V123, P1902, DOI 10.1121/1.2832339 Titze I. 
R., 2006, MYOELASTIC AERODYNAM TITZE IR, 1994, J VOICE, V8, P99, DOI 10.1016/S0892-1997(05)80302-9 Titze IR, 2002, J ACOUST SOC AM, V112, P1064, DOI 10.1121/1.1496080 TITZE IR, 1973, PHONETICA, V28, P129 Titze IR, 2009, J ACOUST SOC AM, V126, P1530, DOI 10.1121/1.3160296 TITZE IR, 1988, J ACOUST SOC AM, V83, P1536, DOI 10.1121/1.395910 Titze IR, 2002, J ACOUST SOC AM, V111, P367, DOI 10.1121/1.1417526 Titze IR, 2004, J VOICE, V18, P292, DOI 10.1016/j.jvoice.2003.12.010 Titze I.R., 1979, TRANSCR 9 S CAR PROF, P23 Titze IR, 1994, PRINCIPLES VOICE PRO Titze IR, 2008, J ACOUST SOC AM, V123, P2733, DOI 10.1121/1.2832337 Titze IR, 1997, J ACOUST SOC AM, V101, P2234, DOI 10.1121/1.418246 TITZE IR, 1974, PHONETICA, V29, P1 Tokuda IT, 2007, J ACOUST SOC AM, V122, P519, DOI 10.1121/1.2741210 Tokuda IT, 2008, CHAOS, V18, DOI 10.1063/1.2825295 Tokuda IT, 2010, J ACOUST SOC AM, V127, P1528, DOI 10.1121/1.3299201 Triep M, 2005, EXP FLUIDS, V39, P232, DOI 10.1007/s00348-005-1015-3 Triep M, 2010, J ACOUST SOC AM, V127, P1537, DOI 10.1121/1.3299202 VANDENBERG J, 1958, J SPEECH HEAR RES, V1, P227 VANDENBE.JW, 1968, ANN NY ACAD SCI, V155, P129 van den BERG J., 1959, PRACTICA OTO RHINO LARYNGOL, V21, P425 Vilain CE, 2004, J SOUND VIB, V276, P475, DOI 10.1016/j.jsv.2003.07.035 Voigt D, 2010, J ACOUST SOC AM, V128, pEL347, DOI 10.1121/1.3493637 Wegel RL, 1930, J ACOUST SOC AM, V1, P1, DOI 10.1121/1.1915199 WODICKA GR, 1989, IEEE T BIO-MED ENG, V36, P925, DOI 10.1109/10.35301 WONG D, 1991, J ACOUST SOC AM, V89, P383, DOI 10.1121/1.400472 Wurzbacher T., 2004, INT C VOIC PHYS BIOM Wurzbacher T, 2008, J ACOUST SOC AM, V123, P2324, DOI 10.1121/1.2835435 Wurzbacher T, 2006, J ACOUST SOC AM, V120, P1012, DOI 10.1121/1.2211550 Xue Q, 2010, J ACOUST SOC AM, V128, P818, DOI 10.1121/1.3458839 Yamana T, 2000, J VOICE, V14, P1, DOI 10.1016/S0892-1997(00)80089-2 Yang AX, 2012, J ACOUST SOC AM, V131, P1378, DOI 10.1121/1.3676622 Yang AX, 2011, J ACOUST SOC AM, V130, P948, DOI 10.1121/1.3605551 Yang AX, 2010, J ACOUST SOC AM, V127, P1014, DOI 10.1121/1.3277165 Yumoto E, 2002, AURIS NASUS LARYNX, V29, P41, DOI 10.1016/S0385-8146(01)00122-5 Zanartu M, 2007, J ACOUST SOC AM, V121, P1119, DOI 10.1121/1.2409491 Zanartu M, 2011, J ACOUST SOC AM, V129, P326, DOI 10.1121/1.3514536 Zanartu M., 2006, THESIS PURDUE U Zanartu M., 2010, THESIS PURDUE U Zhang C, 2002, J ACOUST SOC AM, V112, P2147, DOI 10.1121/1.1506694 Zhang Y, 2008, J SOUND VIB, V316, P248, DOI 10.1016/j.jsv.2008.02.026 Zhang Y, 2004, J ACOUST SOC AM, V115, P2270, DOI 10.1121/1.699392 Zhang Y, 2008, CHAOS, V18, DOI 10.1063/1.2988251 Zhang Y, 2005, CHAOS, V15, DOI 10.1063/1.1916186 Zhang Y, 2004, J ACOUST SOC AM, V115, P1266, DOI 10.1121/1.1648974 Zhang ZY, 2006, J ACOUST SOC AM, V119, P3995, DOI 10.1121/1.2195268 Zhang ZY, 2006, J ACOUST SOC AM, V120, P1558, DOI 10.1121/1.2225682 Zhao W, 2002, J ACOUST SOC AM, V112, P2134, DOI 10.1121/1.1506693 Zheng X, 2011, J ACOUST SOC AM, V130, P404, DOI 10.1121/1.3592216 Zheng XD, 2009, ANN BIOMED ENG, V37, P625, DOI 10.1007/s10439-008-9630-9 Zhuang P, 2009, LARYNGOSCOPE, V119, P811, DOI 10.1002/lary.20165 NR 225 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
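For the Erath et al. record above on lumped-element vocal fold models, the following is a minimal illustrative sketch rather than any published model: a single mass-spring-damper element of the fold driven by a Bernoulli-type pressure load and integrated with semi-implicit Euler. All parameter values and the one-mass simplification are assumptions made only to show the structure of the framework; the models reviewed in the paper couple two or more such elements and add aerodynamic and acoustic loading.

import numpy as np

# Illustrative, non-physiological parameters for a single lumped element.
m, k, d = 1e-4, 40.0, 0.02      # mass [kg], stiffness [N/m], damping [N s/m]
x0 = 1e-3                       # rest half-width of the glottis [m]
P_sub = 800.0                   # subglottal pressure [Pa]
depth, length = 3e-3, 1.4e-2    # medial surface depth and fold length [m]

def aero_force(x):
    # Toy Bernoulli-type loading: pressure acts on the medial surface while
    # the glottis is open and vanishes when the folds are in contact.
    return P_sub * depth * length if (x + x0) > 0.0 else 0.0

# Semi-implicit Euler integration of m*x'' + d*x' + k*x = F(x).
dt, T = 1e-5, 0.05
x, v = 0.0, 0.0
half_width = []
for _ in range(int(T / dt)):
    a = (aero_force(x) - d * v - k * x) / m
    v += a * dt
    x += v * dt
    half_width.append(x + x0)   # glottal half-width over time [m]

print("final glottal half-width [mm]:", round(1e3 * half_width[-1], 3))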
PD JUN PY 2013 VL 55 IS 5 BP 667 EP 690 DI 10.1016/j.specom.2013.02.002 PG 24 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 141RS UT WOS:000318744800007 ER PT J AU Koniaris, C Salvi, G Engwall, O AF Koniaris, Christos Salvi, Giampiero Engwall, Olov TI On mispronunciation analysis of individual foreign speakers using auditory periphery models SO SPEECH COMMUNICATION LA English DT Article DE Second language learning; Auditory model; Distortion measure; Perceptual assessment; Pronunciation error detection; Phoneme ID SPEECH RECOGNITION; QUANTITATIVE MODEL; PRONUNCIATION; PERCEPTION; SYSTEM; ACCENT; REPRESENTATIONS AB In second language (L2) learning, a major difficulty is to discriminate between the acoustic diversity within an L2 phoneme category and that between different categories. We propose a general method for automatic diagnostic assessment of the pronunciation of nonnative speakers based on models of the human auditory periphery. Considering each phoneme class separately, the geometric shape similarity between the native auditory domain and the non-native speech domain is measured. The phonemes that deviate the most from the native pronunciation for a set of L2 speakers are detected by comparing the geometric shape similarity measure with that calculated for native speakers on the same phonemes. To evaluate the system, we have tested it with different non-native speaker groups from various language backgrounds. The experimental results are in accordance with linguistic findings and human listeners' ratings, particularly when both the spectral and temporal cues of the speech signal are utilized in the pronunciation analysis. (C) 2013 Elsevier B.V. All rights reserved. C1 [Koniaris, Christos; Salvi, Giampiero; Engwall, Olov] KTH Royal Inst Technol, Sch Comp Sci & Commun, Ctr Speech Technol, SE-10044 Stockholm, Sweden. RP Koniaris, C (reprint author), KTH Royal Inst Technol, Sch Comp Sci & Commun, Ctr Speech Technol, Lindstedtsvagen 24, SE-10044 Stockholm, Sweden. EM koniaris@kth.se; giampi@kth.se; engwall@kth.se FU Swedish Research Council [80449001] FX This work is supported by the Swedish Research Council project 80449001 Computer-Animated LAnguage TEAchers (CALATEA). The authors wish to thank our colleague Dr. Mats Blomberg, Dr. Saikat Chatterjee from the Communication Theory Laboratory, KTH - Royal Institute of Technology, and the anonymous reviewers for helpful suggestions. CR ANDERSONHSIEH J, 1992, LANG LEARN, V42, P529, DOI 10.1111/j.1467-1770.1992.tb01043.x Andrianopoulos MV, 2001, J VOICE, V15, P61, DOI 10.1016/S0892-1997(01)00007-8 Bannert R., 1984, FOLIA LINGUIST, V18, P193, DOI 10.1515/flin.1984.18.1-2.193 Bregman AS., 1990, AUDITORY SCENE ANAL Chatterjee S, 2011, IEEE T AUDIO SPEECH, V19, P1813, DOI 10.1109/TASL.2010.2101597 Dau T, 1996, J ACOUST SOC AM, V99, P3615, DOI 10.1121/1.414959 Dau T, 1996, J ACOUST SOC AM, V99, P3623, DOI 10.1121/1.414960 DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 Digalakis V., 1992, THESIS BOSTON U BOST Digalakis V, 1993, IEEE T SPEECH AUDI P, V1, P431, DOI 10.1109/89.242489 Eddins D.A., 1995, TEMPORAL INTEGRATION Eskenazi M., 1998, SPEECH TECH LANG LEA, P77 Flege J. 
E., 1995, SPEECH PERCEPTION LI, P233 Franco H, 1997, INT CONF ACOUST SPEE, P1471, DOI 10.1109/ICASSP.1997.596227 Franco H, 2010, LANG TEST, V27, P401, DOI 10.1177/0265532210364408 GARDNER WR, 1995, IEEE T SPEECH AUDI P, V3, P367, DOI 10.1109/89.466658 Guion SG, 2000, J ACOUST SOC AM, V107, P2711, DOI 10.1121/1.428657 Kawai G., 1998, INT C SPOK LANG PROC, P1823 Kluender K. R., 1989, ECOL PSYCHOL, V1, P121, DOI 10.1207/s15326969eco0102_2 Koniaris C, 2011, INT CONF ACOUST SPEE, P5704 Koniaris C, 2010, INT CONF ACOUST SPEE, P4342, DOI 10.1109/ICASSP.2010.5495648 Koniaris C., 2012, INT S AUT DET ERR PR, P59 Koniaris C., 2011, INTERSPEECH, P1157 Koniaris C, 2010, J ACOUST SOC AM, V127, pEL73, DOI 10.1121/1.3284545 Koniaris C., 2012, INTERSPEECH KUHL PK, 1993, J PHONETICS, V21, P125 Menzel W., 2000, WORK INT SPEECH TECH, P49 Moustroufas N, 2007, COMPUT SPEECH LANG, V21, P219, DOI 10.1016/j.csl.2006.04.001 MUNRO MJ, 1995, LANG LEARN, V45, P73, DOI 10.1111/j.1467-1770.1995.tb00963.x Neumeyer L, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P1457 Piske T, 2001, J PHONETICS, V29, P191, DOI 10.1006/jpho.2001.0134 Plasberg JH, 2007, IEEE T AUDIO SPEECH, V15, P310, DOI 10.1109/TASL.2006.876722 Pressnitzer D, 2008, CURR BIOL, V18, P1124, DOI 10.1016/j.cub.2008.06.053 Richardson M, 2003, SPEECH COMMUN, V41, P511, DOI 10.1016/S0167-6393(03)00031-1 Schmid PM, 1999, J SPEECH LANG HEAR R, V42, P56 Sjolander K., 2003, FONETIK, P93 Stevens K.N., 1998, ACOUSTIC PHONETICS Strik H, 2009, SPEECH COMMUN, V51, P845, DOI 10.1016/j.specom.2009.05.007 Tepperman J, 2008, IEEE T AUDIO SPEECH, V16, P8, DOI 10.1109/TASL.2007.909330 Thoren B., 2008, THESIS STOCKHOLM U S van de Par S., 2002, ACOUST SPEECH SIG PR, P1805 Wei S, 2009, SPEECH COMMUN, V51, P896, DOI 10.1016/j.specom.2009.03.004 WERKER JF, 1984, INFANT BEHAV DEV, V7, P49, DOI 10.1016/S0163-6383(84)80022-3 Wik P, 2009, SPEECH COMMUN, V51, P1024, DOI 10.1016/j.specom.2009.05.006 Witt SM, 2000, SPEECH COMMUN, V30, P95, DOI 10.1016/S0167-6393(99)00044-8 NR 45 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2013 VL 55 IS 5 BP 691 EP 706 DI 10.1016/j.specom.2013.01.004 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 141RS UT WOS:000318744800008 ER PT J AU Kua, JMK Epps, J Ambikairajah, E AF Kua, Jia Min Karen Epps, Julien Ambikairajah, Eliathamby TI i-Vector with sparse representation classification for speaker verification SO SPEECH COMMUNICATION LA English DT Article DE Speaker recognition; Sparse representation classification; l(1)-Minimization; i-Vectors; Support vector machine; Cosine Distance Scoring ID MACHINES; RECOGNITION; SELECTION; MODELS; SYSTEMS; KERNEL; RECONSTRUCTION; REGULARIZATION; IDENTIFICATION; NORMALIZATION AB Sparse representation-based methods have very lately shown promise for speaker recognition systems. This paper investigates and develops an i-vector based sparse representation classification (SRC) as an alternative classifier to support vector machine (SVM) and Cosine Distance Scoring (CDS) classifier, producing an approach we term i-vector-sparse representation classification (i-SRC). Unlike SVM which fixes the support vector for each target example, SRC allows the supports, which we term sparse coefficient vectors, to be adapted to the test signal being characterized. 
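The sketch below illustrates the generic sparse representation classification rule this abstract describes: code the test vector sparsely over a dictionary of training examples, then assign the class with the smallest reconstruction residual. It uses synthetic data and scikit-learn's Lasso as an assumed stand-in for an l1 solver; it is not the paper's i-vector front-end or NIST SRE setup, and the identity-matrix dictionary augmentation discussed in the abstract is omitted.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
dim, n_per_class, classes = 50, 20, 5

# Dictionary: columns are synthetic "training i-vectors", grouped by speaker.
D = rng.normal(size=(dim, n_per_class * classes))
D /= np.linalg.norm(D, axis=0)                     # unit-norm atoms
labels = np.repeat(np.arange(classes), n_per_class)

# Test vector: a noisy combination of atoms belonging to speaker 2.
y = D[:, labels == 2] @ rng.random(n_per_class) + 0.01 * rng.normal(size=dim)

# l1-regularised sparse coding of y over the dictionary.
coder = Lasso(alpha=0.01, fit_intercept=False, max_iter=10000)
x = coder.fit(D, y).coef_

# Class-wise reconstruction residuals; the decision is the smallest residual.
residuals = [np.linalg.norm(y - D[:, labels == c] @ x[labels == c])
             for c in range(classes)]
print("decided speaker:", int(np.argmin(residuals)))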
Furthermore, similarly to CDS, SRC does not require a training phase. We also analyze different types of sparseness methods and dictionary composition to determine the best configuration for speaker recognition. We observe that including an identity matrix in the dictionary helps to remove sensitivity to outliers and that sparseness methods based on l(1) and l(2) norm offer the best performance. A combination of both techniques achieves a 18% relative reduction in EER over a SRC system based on l(1) norm and without identity matrix. Experimental results on NIST 2010 SRE show that the i-SRC consistently outperforms i-SVM and i-CDS in EER by 0.14-0.81%, and the fusion of i-CDS and i-SRC achieves a relative EER reduction of 8-19% over i-SRC alone. (C) 2013 Elsevier B.V. All rights reserved. C1 [Kua, Jia Min Karen; Epps, Julien; Ambikairajah, Eliathamby] Univ New S Wales, Sch Elect Engn & Telecommun, Unsw Sydney, NSW 2052, Australia. [Epps, Julien; Ambikairajah, Eliathamby] NICTA, ATP Res Lab, Eveleigh 2015, Australia. RP Kua, JMK (reprint author), Univ New S Wales, Sch Elect Engn & Telecommun, Unsw Sydney, NSW 2052, Australia. EM j.kua@unswalumni.com; j.epps@unsw.edu.au; ambi@ee.unsw.edu.au CR Aharon M, 2006, IEEE T SIGNAL PROCES, V54, P4311, DOI 10.1109/TSP.2006.881199 Alex Solomonoff C.Q., 2004, P OD SPEAK LANG REC, P57 Amaldi E, 1998, THEOR COMPUT SCI, V209, P237, DOI 10.1016/S0304-3975(97)00115-1 Ariki Y, 1996, INT CONF ACOUST SPEE, P319, DOI 10.1109/ICASSP.1996.541096 Ariki Y., 1994, ICSLP 94. 1994 International Conference on Spoken Language Processing Baraniuk RG, 2007, IEEE SIGNAL PROC MAG, V24, P118, DOI 10.1109/MSP.2007.4286571 Blake C. L., 1998, UCI REPOSITORY MACHI, V460 Bruckstein AM, 2009, SIAM REV, V51, P34, DOI 10.1137/060657704 Brummer N., 2010, P NIST 2010 SPEAK RE Burges CJC, 1998, DATA MIN KNOWL DISC, V2, P121, DOI 10.1023/A:1009715923555 Campadelli P, 2005, NEUROCOMPUTING, V68, P281, DOI 10.1016/j.neucom.2005.03.005 Campbell WM, 2006, COMPUT SPEECH LANG, V20, P210, DOI 10.1016/j.csl.2005.06.003 Campbell WM, 2006, INT CONF ACOUST SPEE, P97 Campbell WM, 2006, IEEE SIGNAL PROC LET, V13, P308, DOI 10.1109/LSP.2006.870086 Candes E., 2005, 1 MAGIC RECOVERY SPA Candes E. 
J., 2006, P INT C MATH Candes EJ, 2006, IEEE T INFORM THEORY, V52, P5406, DOI 10.1109/TIT.2006.885507 Candes EJ, 2006, IEEE T INFORM THEORY, V52, P489, DOI 10.1109/TIT.2005.862083 Chapelle O, 2002, MACH LEARN, V46, P131, DOI 10.1023/A:1012450327387 Dehak N., 2009, P INTERSPEECH Dehak N, 2011, IEEE T AUDIO SPEECH, V19, P788, DOI 10.1109/TASL.2010.2064307 Dehak N, 2009, INT CONF ACOUST SPEE, P4237, DOI 10.1109/ICASSP.2009.4960564 Donoho D., 2005, SPARSELAB, P25 Donoho DL, 2006, COMMUN PUR APPL MATH, V59, P797, DOI 10.1002/cpa.20132 Fauve BGB, 2007, IEEE T AUDIO SPEECH, V15, P1960, DOI 10.1109/TASL.2007.902877 Figueiredo MAT, 2007, IEEE J-STSP, V1, P586, DOI 10.1109/JSTSP.2007.910281 Friedman J, 2010, J STAT SOFTW, V33, P1 Frohlich H, 2005, IEEE IJCNN, P1431 Georghiades AS, 2001, IEEE T PATTERN ANAL, V23, P643, DOI 10.1109/34.927464 Gunasekara N.A., 2010, METALEARNING STRING Hatch AO, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P1471 Huang K., 2007, ADV NEURAL INFORM PR, V19, P609 Ji SH, 2008, IEEE T SIGNAL PROCES, V56, P2346, DOI 10.1109/TSP.2007.914345 Kanevsky D., 2010, P INTERSPEECH Karam ZN, 2008, INT CONF ACOUST SPEE, P4117, DOI 10.1109/ICASSP.2008.4518560 Kenny P, 2005, JOINT FACTOR ANAL SP Kinnunen T, 2010, SPEECH COMMUN, V52, P12, DOI 10.1016/j.specom.2009.08.009 Koh K., 2007, 11 IS MATLAB SOLVER Kreutz-Delgado K, 2003, NEURAL COMPUT, V15, P349, DOI 10.1162/089976603762552951 Kua JMK, 2011, INT CONF ACOUST SPEE, P4548 Lei Y., 2010, CRSS SYSTEMS 2010 NI Li M., 2011, P INTERSPEECH Li Ming, 2011, P ICASSP Mairal J., 2009, ADV NEURAL INFORM PR, V21, P1033 Martin A.F., 2010, NIST 2010 SPEAKER RE McLaren M., 2009, P ICASSP, P4041 McLaren M, 2009, LECT NOTES COMPUT SC, V5558, P474, DOI 10.1007/978-3-642-01793-3_49 Moreno P.J., 2003, P 8 EUR C SPEECH COM, P2965 Naseem Imran, 2010, Proceedings of the 2010 20th International Conference on Pattern Recognition (ICPR 2010), DOI 10.1109/ICPR.2010.1083 Plumbley MD, 2007, LECT NOTES COMPUT SC, V4666, P406 REYNOLDS DA, 1995, SPEECH COMMUN, V17, P91, DOI 10.1016/0167-6393(95)00009-D Reynolds DA, 2000, DIGIT SIGNAL PROCESS, V10, P19, DOI 10.1006/dspr.1999.0361 Sainath T.N., 2010, SPARSE REPRESENTATIO Sainath TN, 2010, INT CONF ACOUST SPEE, P4370, DOI 10.1109/ICASSP.2010.5495638 Sedlak F, 2011, INT CONF ACOUST SPEE, P4544 Solomonoff A, 2005, INT CONF ACOUST SPEE, P629 Suh J.W., 2011, P ICASSP Tao DC, 2006, IEEE T PATTERN ANAL, V28, P1088 Tibshirani R, 1996, J ROY STAT SOC B MET, V58, P267 Tikhonov A. N., 1977, SOLUTION ILL POSED P Vainsencher D., 2010, PREPRINT Villalba J., 2010, I3A NIST SRE2010 SYS Wan V., 2000, IEEE INT WORKSH NEUR, P775 WAN V, 2002, ACOUST SPEECH SIG PR, P669 Webb A. R., 2002, STAT PATTERN RECOGNI, V2nd Wright J., 2008, IEEE T PATTERN ANAL, V31, P210 Yang A., 2007, FEATURE SELECTION FA Yang AY, 2010, P IEEE, V98, P1077, DOI 10.1109/JPROC.2010.2040797 Zou H, 2005, J ROY STAT SOC B, V67, P301, DOI 10.1111/j.1467-9868.2005.00503.x NR 69 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2013 VL 55 IS 5 BP 707 EP 720 DI 10.1016/j.specom.2013.01.005 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 141RS UT WOS:000318744800009 ER PT J AU Kurtic, E Brown, GJ Wells, B AF Kurtic, Emina Brown, Guy J. 
Wells, Bill TI Resources for turn competition in overlapping talk SO SPEECH COMMUNICATION LA English DT Article DE Overlapping talk; Turn competition; Prosody; Turn-taking; Turn end projection ID ORGANIZATION; GAZE; CONVERSATION; ALIGNMENT; PAUSES; GAPS AB Overlapping talk occurs frequently in multi-party conversations, and is a domain in which speakers may pursue various communicative goals. The current study focuses on turn competition. Specifically, we seek to identify the phonetic differences that discriminate turn-competitive from non-competitive overlaps. Conversation analysis techniques were used to identify competitive and non-competitive overlaps in a corpus of multi-party recordings. We then generated a set of potentially predictive features relating to prosody (F0, intensity, speech rate, pausing) and overlap placement (overlap duration, point of overlap onset, recycling etc.). Decision tree classifiers were trained on the features and tested on a classification task, in order to determine which features and feature combinations best differentiate competitive overlaps from non-competitive overlaps. It was found that overlap placement features played a greater role than prosodic features in indicating turn competition. Among the prosodic features tested, F0 and intensity were the most effective predictors of turn competition. Also, our decision tree models suggest that turn competitive and non-competitive overlaps can be initiated by a new speaker at many different points in the current speaker's turn. These findings have implications for the design of dialogue systems, and suggest novel hypotheses about how speakers deploy phonetic resources in everyday talk. (C) 2012 Elsevier B.V. All rights reserved. C1 [Kurtic, Emina; Wells, Bill] Univ Sheffield, Dept Human Commun Sci, Sheffield S10 2TA, S Yorkshire, England. [Kurtic, Emina; Brown, Guy J.] Univ Sheffield, Dept Comp Sci, Sheffield S1 4DP, S Yorkshire, England. RP Wells, B (reprint author), Univ Sheffield, Dept Human Commun Sci, 31 Claremont Crescent, Sheffield S10 2TA, S Yorkshire, England. EM bill.wells@sheffield.ac.uk FU University of Sheffield; UK Arts and Humanities Research Council [1-62874195] FX The research reported here was supported by a University of Sheffield Project Studentship. Preparation of the article was facilitated by UK Arts and Humanities Research Council Grant 1-62874195. We are grateful to our annotators for their time and effort; to Ahmed Aker for invaluable assistance at various stages of the research; to Gareth Walker and John Local for their sustained interest and encouragement; and to Jens Edlund and an anonymous reviewer for their constructive comments on an earlier draft. 
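A minimal sketch of the modelling step described in the abstract above: training and cross-validating a decision tree that separates turn-competitive from non-competitive overlaps using prosodic and overlap-placement features. The feature set mirrors the abstract's description, but the data here are random placeholders rather than the authors' annotated corpus, and scikit-learn's CART implementation stands in for the tree learners actually used.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical per-overlap feature matrix: prosodic cues (F0, intensity,
# speech rate) and placement cues (duration, onset position, recycling).
rng = np.random.default_rng(1)
n = 200
X = np.column_stack([
    rng.normal(0, 1, n),    # mean F0 of the overlapping talk (z-scored)
    rng.normal(0, 1, n),    # mean intensity (z-scored)
    rng.normal(0, 1, n),    # speech rate (z-scored)
    rng.uniform(0, 2, n),   # overlap duration [s]
    rng.uniform(0, 1, n),   # onset position within the current turn (0..1)
    rng.integers(0, 2, n),  # recycling of overlapped material (0/1)
])
y = rng.integers(0, 2, n)   # 1 = turn-competitive, 0 = non-competitive (toy labels)

tree = DecisionTreeClassifier(max_depth=4, random_state=0)
scores = cross_val_score(tree, X, y, cv=5)
print("cross-validated accuracy: %.2f +/- %.2f" % (scores.mean(), scores.std()))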
CR Adda-Decker M., 2008, P 6 INT LANG RES EV Barkhuysen P, 2008, J ACOUST SOC AM, V123, P354, DOI 10.1121/1.2816561 Bavelas JB, 2002, J COMMUN, V52, P566, DOI 10.1093/joc/52.3.566 Boersma P., 2001, GLOT INT, V5, P341 Carletta J, 2007, LANG RESOUR EVAL, V41, P181, DOI 10.1007/s10579-007-9040-x Cetin O., 2006, P 3 JOINT WORKSH MUL Couper-Kuhlen E., 1993, ENGLISH SPEECH RHYTH Dellwo V., 2006, P SPEECH PROS 2006 D Dhillon R, 2004, TR04002 ICSI French P., 1983, J PRAGMATICS, V7, P701 Gardner R., 2001, PRAGMATICS NEW SERIE, V92 GOODWIN C, 1980, SOCIOL INQ, V50, P272, DOI 10.1111/j.1475-682X.1980.tb00023.x GOODWIN MH, 1986, SEMIOTICA, V62, P51 Gorisch J, 2012, LANG SPEECH, V55, P57, DOI 10.1177/0023830911428874 Gravano A, 2011, COMPUT SPEECH LANG, V25, P601, DOI 10.1016/j.csl.2010.10.003 Hain T, 2012, IEEE T AUDIO SPEECH, V20, P486, DOI 10.1109/TASL.2011.2163395 Heldner M, 2010, J PHONETICS, V38, P555, DOI 10.1016/j.wocn.2010.08.002 Heldner M, 2011, J ACOUST SOC AM, V130, P508, DOI 10.1121/1.3598457 Hjalmarsson A, 2011, SPEECH COMMUN, V53, P23, DOI 10.1016/j.specom.2010.08.003 Janin A, 2003, 2003 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL I, PROCEEDINGS, P364 Jefferson G., 1987, INTERACTION LANGUAGE, V9, P153 Jefferson G., 1983, 2 EXPLORATIONS ORG O Jefferson G., 2003, CONVERSATION ANAL ST Jefferson Gail, 2004, CONVERSATION ANAL ST, P13 KENDON A, 1967, ACTA PSYCHOL, V26, P22, DOI 10.1016/0001-6918(67)90005-4 Kurtic E, 2009, STUD PRAGMAT, V8, P183 Kurtic E., 2012, P INT C LANG RES EV Kurtic E., 2010, P INT 2010 MAK JAP Lee C., 2008, P INT C SPOK LANG PR Lerner G, 1999, LANGUAGE TURN SEQUEN, P225 Lerner G., 1999, CONVERSATION ANAL ST Levinson Stephen C, 2006, ROOTS HUMAN SOCIALIT, P39 Local J, 2005, PHONETICA, V62, P120, DOI 10.1159/000090093 Local J, 2005, FIGURE OF SPEECH: A FESTSCHRIFT FOR JOHN LAVER, P263 Mondada L., 2011, INTEGRATING GESTURES Olshen R., 1984, CLASSIFICATION REGRE, V1st Quinlan R., 1994, C4 5 PROGRAMS MACHIN Reidsma D, 2011, J MULTIMODAL USER IN, V4, P97, DOI 10.1007/s12193-011-0060-x SACKS H, 1974, LANGUAGE, V50, P696, DOI 10.2307/412243 Schegloff Emanuel A., 1982, ANAL DISCOURSE TEXT, P71 Schegloff Emanuel A., 2001, HDB SOCIOLOGICAL THE, P287, DOI 10.1007/0-387-36274-6_15 Schegloff EA, 2000, LANG SOC, V29, P1 Schegloff Emanuel Abraham, 1987, TALK SOCIAL ORG, P70 Selting M., 1998, INTERACTION LINGUIST, V4, P1 Shriberg E., 2001, P 7 EUR C SPEECH COM Shriberg E., 2001, ISCA TUT RES WORKSH Sidnell J, 2001, J PRAGMATICS, V33, P1263, DOI 10.1016/S0378-2166(00)00062-X Stivers T, 2008, RES LANG SOC INTERAC, V41, P31, DOI 10.1080/08351810701691123 Szczepek-Reed Beatrice, 2006, PROSODIC ORIENTATION WALKER MB, 1982, J SOC PSYCHOL, V117, P305 Wells B, 1998, LANG SPEECH, V41, P265 Wells B., 2004, SOUND PATTERNS INTER, P119 Yngve Victor, 1970, 6 REG M CHIC LING SO, P567 NR 53 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD JUN PY 2013 VL 55 IS 5 BP 721 EP 743 DI 10.1016/j.specom.2012.10.002 PG 23 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 141RS UT WOS:000318744800010 ER PT J AU Zhang, Y Zhao, YX AF Zhang, Yi Zhao, Yunxin TI Real and imaginary modulation spectral subtraction for speech enhancement SO SPEECH COMMUNICATION LA English DT Article DE Spectral subtraction; Noise reduction; Speech phase; Modulation frequency ID ADDITIVE NOISE; RECOGNITION; PHASE; PERFORMANCE; SEPARATION AB In this paper, we propose a novel spectral subtraction method for noisy speech enhancement. Instead of taking the conventional approach of carrying out subtraction on the magnitude spectrum in the acoustic frequency domain, we propose to perform subtraction on the real and imaginary spectra separately in the modulation frequency domain, where the method is referred to as MRISS. By doing so, we are able to enhance magnitude as well as phase through spectral subtraction. We conducted objective and subjective evaluation experiments to compare the performance of the proposed MRISS method with three existing methods, including modulation frequency domain magnitude spectral subtraction (MSS), nonlinear spectral subtraction (NSS), and minimum mean square error estimation (MMSE). The objective evaluation used the criteria of segmental signal-to-noise ratio (Segmental SNR), PESQ, and average Itakura-Saito spectral distance (ISD). The subjective evaluation used a mean preference score with 14 participants. Both objective and subjective evaluation results have demonstrated that the proposed method outperformed the three existing speech enhancement methods. A further analysis has shown that the winning performance of the proposed MRISS method comes from improvements in the recovery of both acoustic magnitude and phase spectrum. (C) 2012 Elsevier B.V. All rights reserved. C1 [Zhang, Yi; Zhao, Yunxin] Univ Missouri, Dept Comp Sci, Columbia, MO 65211 USA. RP Zhao, YX (reprint author), Univ Missouri, Dept Comp Sci, Columbia, MO 65211 USA. EM yzcb3@mail.missouri.edu; Zhaoy@missouri.edu CR Aarabi P, 2004, IEEE T SYST MAN CY B, V34, P1763, DOI 10.1109/TSMCB.2004.830345 [Anonymous], 2001, RWCP SOUND SC DAT RE Araki S, 2006, INT CONF ACOUST SPEE, P33 BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 Chen JD, 2006, IEEE T AUDIO SPEECH, V14, P1218, DOI 10.1109/TSA.2005.860851 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 Evans NWD, 2006, INT CONF ACOUST SPEE, P145 Flanagan J. L., 1992, P IEEE INT C AC SPEE, V1, P285 Hansen J. H. L., 1998, P INT C SPOK LANG PR, V7, P2819 Hegde RM, 2007, IEEE T AUDIO SPEECH, V15, P190, DOI 10.1109/TASL.2006.876858 HIRSCH HG, 1995, INT CONF ACOUST SPEE, P153, DOI 10.1109/ICASSP.1995.479387 KAMATH S, 2002, ACOUST SPEECH SIG PR, P4164 Kleinschmidt T, 2011, COMPUT SPEECH LANG, V25, P585, DOI 10.1016/j.csl.2010.09.001 Lin L, 2003, ELECTRON LETT, V39, P754, DOI 10.1049/el:20030480 Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst Lu Y, 2008, SPEECH COMMUN, V50, P453, DOI 10.1016/j.specom.2008.01.003 Makhoul J., 1979, P IEEE INT C AC SPEE, V23, P208 Martin R, 2001, IEEE T SPEECH AUDI P, V9, P504, DOI 10.1109/89.928915 Martin R., 1994, P 7 EUR SIGN PROC C, P1182 Nakagawa S., 2004, P INT C SPOK LANG PR, V23, P477 Paliwal K, 2010, SPEECH COMMUN, V52, P450, DOI 10.1016/j.specom.2010.02.004 Papoulis A, 1991, PROBABILITY RANDOM V, V3rd Savoji M. 
H., 2010, TELECOMMUN IST, P895 Schluter R, 2001, INT CONF ACOUST SPEE, P133 Shannon BJ, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P1423 WANG DL, 1982, IEEE T ACOUST SPEECH, V30, P679, DOI 10.1109/TASSP.1982.1163920 Wojcicki K, 2008, IEEE SIGNAL PROC LET, V15, P461, DOI 10.1109/LSP.2008.923579 Yellin D, 1996, IEEE T SIGNAL PROCES, V44, P106, DOI 10.1109/78.482016 Yilmaz O, 2004, IEEE T SIGNAL PROCES, V52, P1830, DOI [10.1109/TSP.2004.828896, 10.1109/TSP.2004.V8896] Yiteng Huang, 2004, AUDIO SIGNAL PROCESS Yoma NB, 1998, IEEE T SPEECH AUDI P, V6, P579, DOI 10.1109/89.725325 Zhu D, 2004, P IEEE INT C AC SPEE, V1, P125 Zhu QF, 2002, IEEE SIGNAL PROC LET, V9, P275, DOI 10.1109/LSP.2002.801722 NR 33 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2013 VL 55 IS 4 BP 509 EP 522 DI 10.1016/j.specom.2012.09.005 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 122NW UT WOS:000317325800001 ER PT J AU Mirzahasanloo, TS Kehtarnavaz, N Gopalakrishna, V Loizou, PC AF Mirzahasanloo, Taher S. Kehtarnavaz, Nasser Gopalakrishna, Vanishree Loizou, Philipos C. TI Environment-adaptive speech enhancement for bilateral cochlear implants using a single processor SO SPEECH COMMUNICATION LA English DT Article DE Bilateral cochlear implants; Single-processor speech enhancement for bilateral cochlear implants; Environment-adaptive speech enhancement ID SPECTRAL AMPLITUDE ESTIMATOR; NOISE-REDUCTION; ALGORITHMS; CHILDREN; RECOGNITION; PERCEPTION; BENEFITS AB A computationally efficient speech enhancement pipeline in noisy environments based on a single-processor implementation is developed for utilization in bilateral cochlear implant systems. A two-channel joint objective function is defined and a closed form solution is obtained based on the weighted-Euclidean distortion measure. The computational efficiency and no need for synchronization aspects of this pipeline make it a suitable solution for real-time deployment. A speech quality measure is used to show its effectiveness in six different noisy environments as compared to a similar one-channel enhancement pipeline when using two separate processors or when using independent sequential processing. (C) 2012 Elsevier B.V. All rights reserved. C1 [Mirzahasanloo, Taher S.; Kehtarnavaz, Nasser; Gopalakrishna, Vanishree; Loizou, Philipos C.] Univ Texas Dallas, Dept Elect Engn, Richardson, TX 75080 USA. RP Kehtarnavaz, N (reprint author), Univ Texas Dallas, Dept Elect Engn, Richardson, TX 75080 USA. EM mirzahasanloo@utdallas.edu; kehtar@utdallas.edu FU NIH/NIDCD [DC010494] FX This work was supported by Grant No. DC010494 from NIH/NIDCD. CR Abramowitz M., 1965, HDB MATH FUNCTIONS Algazi V. 
R., 2001, IEEE WORKSH APPL SIG, P99 [Anonymous], 2000, P862 ITUT Chen J., 2006, EURASIP J APPL SIG P, V26, P19 Ching T Y C, 2007, Trends Amplif, V11, P161, DOI 10.1177/1084713807304357 Cohen I, 2003, IEEE T SPEECH AUDI P, V11, P466, DOI 10.1109/TSA.2003.811544 EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 Erkelens J, 2007, SPEECH COMMUN, V49, P530, DOI 10.1016/j.specom.2006.06.012 Erkelens JS, 2008, IEEE T AUDIO SPEECH, V16, P1112, DOI 10.1109/TASL.2008.2001108 Fetterman BL, 2002, OTOLARYNG HEAD NECK, V126, P257, DOI 10.1067/mhn.2002.123044 Gopalakrishna V, 2012, IEEE T BIO-MED ENG, V59, P1691, DOI 10.1109/TBME.2012.2191968 Hu Y, 2007, SPEECH COMMUN, V49, P588, DOI 10.1016/j.specom.2006.12.006 Hu Y., 2007, J ACOUST SOC AM, V128, P128 IEEE Subcommittee, 1969, IEEE T AUDIO ELECTRO, V3, P225, DOI DOI 10.1109/TAU.1969.1162058 Kokkinakis K, 2010, J ACOUST SOC AM, V127, P3136, DOI 10.1121/1.3372727 Kuhn-Inacker H, 2004, INT J PEDIATR OTORHI, V68, P1257, DOI 10.1016/j.ijporl.2004.04.029 Litovsky RY, 2006, INT J AUDIOL, V45, pS78, DOI 10.1080/14992020600782956 Litovsky RY, 2004, ARCH OTOLARYNGOL, V130, P648, DOI 10.1001/archotol.130.5.648 Loizou Philipos C, 2006, Adv Otorhinolaryngol, V64, P109 Loizou PC, 2011, STUD COMPUT INTELL, V346, P623 Loizou PC, 2005, J ACOUST SOC AM, V118, P2791, DOI 10.1121/1.2065847 Loizou PC, 2005, IEEE T SPEECH AUDI P, V13, P857, DOI 10.1109/TSA.2005.851929 Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst Lotter T, 2005, EURASIP J APPL SIG P, V2005, P1110, DOI 10.1155/ASP.2005.1110 Martin R, 2001, IEEE T SPEECH AUDI P, V9, P504, DOI 10.1109/89.928915 Mirzahasanloo T. S., 2012, IEEE INT C ENG MED B, V2012, P2271 Muller J, 2002, EAR HEARING, V23, P198 Quackenbush S. R., 1988, OBJECTIVE MEASURES S Remus JJ, 2005, EURASIP J APPL SIG P, V2005, P2979, DOI 10.1155/ASP.2005.2979 van Hoesel RJM, 2003, J ACOUST SOC AM, V113, P1617, DOI 10.1121/1.1539520 van Hoesel RJM, 2004, AUDIOL NEURO-OTOL, V9, P234, DOI 10.1159/000078393 NR 32 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2013 VL 55 IS 4 BP 523 EP 534 DI 10.1016/j.specom.2012.10.004 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 122NW UT WOS:000317325800002 ER PT J AU Dam, HH Rimantho, D Nordholm, S AF Dam, Hai Huyen Rimantho, Dedi Nordholm, Sven TI Second-order blind signal separation with optimal step size SO SPEECH COMMUNICATION LA English DT Article DE Blind signal separation; Convolutive mixture; Conjugate gradient; Fast convergent; Optimal step size ID NONSTATIONARY SOURCES; CONJUGATE-GRADIENT AB This paper proposes a new computational procedure for solving the second-order gradient-based blind signal separation (BSS) problem with convolutive mixtures. The problem is formulated as a constrained optimization problem where the time domain constraints on the unmixing matrices are added to ease the permutation effects associated with convolutive mixtures. A linear transformation using QR factorization is developed to transform the constrained optimization problem into an unconstrained problem. A conjugate gradient procedure with the step size derived optimally at each iteration is then proposed to solve the optimization problem. 
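The closed-form "optimal step size" of this abstract has its textbook analogue in conjugate gradient minimisation of a quadratic cost, where the exact best step along each conjugate direction is available analytically. The sketch below shows that generic iteration (Fletcher-Reeves directions, exact line search); it is not the paper's second-order BSS objective or its QR-based reparameterisation, only the optimisation pattern the abstract refers to.

import numpy as np

def conjugate_gradient(A, b, w0, tol=1e-10, max_iter=100):
    # Minimise f(w) = 0.5 w^T A w - b^T w (A symmetric positive definite)
    # with the exact optimal step alpha = r^T r / (p^T A p) at each iteration.
    w = w0.copy()
    r = b - A @ w               # negative gradient
    p = r.copy()
    for _ in range(max_iter):
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)          # optimal step along direction p
        w += alpha * p
        r_new = r - alpha * Ap
        if np.linalg.norm(r_new) < tol:
            break
        beta = (r_new @ r_new) / (r @ r)    # Fletcher-Reeves update
        p = r_new + beta * p
        r = r_new
    return w

# Example on a small symmetric positive definite system.
rng = np.random.default_rng(0)
M = rng.normal(size=(6, 6))
A = M @ M.T + 6 * np.eye(6)
b = rng.normal(size=6)
w = conjugate_gradient(A, b, np.zeros(6))
print("residual norm:", np.linalg.norm(A @ w - b))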
The advantage of the procedure is that it has low computational complexity, as it does not require multiple evaluations of the objective function. In addition, fast convergence of the conjugate gradient algorithm makes it suitable for online implementation. The convergence of the conjugate gradient algorithm with optimal step size is compared to the fixed step size case and the optimal step size steepest descent algorithm. Evaluations are performed in real and simulated environments. Crown Copyright (C) 2012 Published by Elsevier B.V. All rights reserved. C1 [Dam, Hai Huyen; Rimantho, Dedi] Curtin Univ Technol, Dept Math & Stat, Perth, WA, Australia. [Nordholm, Sven] Curtin Univ Technol, Dept Elect & Comp Engn, Perth, WA, Australia. RP Dam, HH (reprint author), Curtin Univ Technol, Dept Math & Stat, Perth, WA, Australia. EM H.dam@curtin.edu.au; dedi.rimantho@student.curtin.edu.au; S.Nordholm@curtin.edu.au RI Nordholm, Sven/J-5247-2014 FU ARC [DP120103859] FX This research was supported by ARC Discovery Project DP120103859. CR Benesty J., 2005, SPEECH ENHANCEMENT BORAY GK, 1992, IEEE T CIRCUITS-I, V39, P1, DOI 10.1109/81.109237 Buchner H, 2005, IEEE T SPEECH AUDI P, V13, P120, DOI 10.1109/TSA.2004.838775 Dam HH, 2008, IEEE SIGNAL PROC LET, V15, P79, DOI 10.1109/LSP.2007.910234 Dam HH, 2007, IEEE T SIGNAL PROCES, V55, P4198, DOI 10.1109/TSP.2007.894406 FLETCHER R, 1964, COMPUT J, V7, P149, DOI 10.1093/comjnl/7.2.149 Garofolo JS, 1993, TIMIT ACOUSTIC PHONE Hyvarinen A, 2001, INDEPENDENT COMPONEN McCormick G., 1983, NONLINEAR PROGRAMMIN Parra L, 2000, IEEE T SPEECH AUDI P, V8, P320, DOI 10.1109/89.841214 Schobben DWE, 2002, IEEE T SIGNAL PROCES, V50, P1855, DOI 10.1109/TSP.2002.800417 Wang WW, 2005, IEEE T SIGNAL PROCES, V53, P1654, DOI 10.1109/TSP.2005.845433 Yong P.C., 2011, P 19 EUR SIGN PROC C, P211 NR 13 TC 0 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2013 VL 55 IS 4 BP 535 EP 543 DI 10.1016/j.specom.2012.10.003 PG 9 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 122NW UT WOS:000317325800003 ER PT J AU Nam, KW Ji, YS Han, J Lee, S Kim, D Hong, SH Jang, DP Kim, IY AF Nam, Kyoung Won Ji, Yoon Sang Han, Jonghee Lee, Sangmin Kim, Dongwook Hong, Sung Hwa Jang, Dong Pyo Kim, In Young TI Clinical evaluation of the performance of a blind source separation algorithm combining beamforming and independent component analysis in hearing aid use SO SPEECH COMMUNICATION LA English DT Article DE Hearing aid; Independent component analysis; Beamforming; Noise reduction ID FREQUENCY-DOMAIN ICA; SPEECH RECOGNITION; NOISE; QUALITY; SYSTEMS AB There have been several reports on improved blind source separation algorithms that combine beamforming and independent component analysis. However, none of the prior reports verified the clinical efficacy of such combinational algorithms in real hearing aid situations. In the current study, we evaluated the clinical efficacy of such a combinational algorithm using the mean opinion score and speech recognition threshold tests in various types of real-world hearing aid situations involving environmental noise. Parameters of the testing algorithm were adjusted to match the geometric specifications of the real behind-the-ear type hearing aid housing. The study included 15 normal-hearing volunteers and 15 hearing-impaired patients. 
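As a rough sketch of the two-stage structure evaluated in the Nam et al. record, the code below steers two delay-and-sum beamformers at a two-microphone mixture and passes their outputs to FastICA. The array geometry, integer-sample delays, synthetic signals and the use of scikit-learn's FastICA are illustrative assumptions; they do not reproduce the frequency-domain hearing-aid algorithm that was clinically tested.

import numpy as np
from sklearn.decomposition import FastICA

fs = 16000

def delay_and_sum(mics, delays):
    # Fixed beamformer: advance each microphone by an integer-sample delay,
    # then average the aligned channels.
    aligned = [np.roll(m, -d) for m, d in zip(mics, delays)]
    return np.mean(aligned, axis=0)

# Two synthetic sources mixed onto two microphones with different delays.
rng = np.random.default_rng(0)
t = np.arange(fs) / fs
s1 = np.sign(np.sin(2 * np.pi * 3.0 * t))        # toy "target speech"
s2 = rng.laplace(size=fs)                        # toy "interfering noise"
mic1 = s1 + 0.6 * np.roll(s2, 4)
mic2 = 0.7 * np.roll(s1, 3) + s2

# Stage 1: two beamformers steered towards different assumed directions.
front = delay_and_sum([mic1, mic2], delays=[0, 3])
side = delay_and_sum([mic1, mic2], delays=[4, 0])

# Stage 2: ICA on the beamformer outputs to reduce residual interference.
ica = FastICA(n_components=2, random_state=0)
separated = ica.fit_transform(np.column_stack([front, side]))
print("separated signals:", separated.shape)     # (samples, 2)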
Experimental results demonstrated that the testing algorithm improved the speech intelligibility of all of the participants in noisy environments, and the clinical efficacy of the combinational algorithm was superior to either the beamforming or independent component analysis algorithms alone. Despite the computational complexity of the testing algorithm, our experimental results and the rapid enhancement of hardware technology indicate that the testing algorithm has the potential to be applied to real hearing aids in the near future, thereby improving the speech intelligibility of hearing-impaired patients in noisy environments. (C) 2012 Elsevier B.V. All rights reserved. C1 [Nam, Kyoung Won; Ji, Yoon Sang; Jang, Dong Pyo; Kim, In Young] Hanyang Univ, Dept Biomed Engn, Seoul 133791, South Korea. [Han, Jonghee; Kim, Dongwook] Samsung Adv Inst Technol, Bio & Hlth Lab, Yongin 446712, South Korea. [Lee, Sangmin] Inha Univ, Dept Elect Engn, Inchon 402751, South Korea. [Hong, Sung Hwa] Samsung Med Ctr, Dept Otolaryngol Head & Neck Surg, Seoul 135710, South Korea. RP Kim, IY (reprint author), Hanyang Univ, Dept Biomed Engn, Seoul 133791, South Korea. EM kwnam@bme.hanyang.ac.kr; ysji@bme.hanyang.ac.kr; apaper@bme.hanyang.ac.kr; sanglee@inha.ac.kr; steve7.kim@samsung.com; hongsh@skku.edu; dongpjang@gmail.com; iykim@hanyang.ac.kr FU Strategic Technology Development Program of the Ministry of Knowledge Economy [10031764]; Seoul University Industry Collaboration Foundation [SS100022] FX This work was supported by grants No. 10031764, from the Strategic Technology Development Program of the Ministry of Knowledge Economy, and No. SS100022, from theSeoul University Industry Collaboration Foundation. CR Duran-Diaz I, 2012, DIGIT SIGNAL PROCESS, V22, P1126, DOI 10.1016/j.dsp.2012.05.014 [Anonymous], 2012, P800 ITUT Arehart KH, 2010, EAR HEARING, V31, P420, DOI 10.1097/AUD.0b013e3181d3d4f3 Bentler Ruth A, 2005, J Am Acad Audiol, V16, P473, DOI 10.3766/jaaa.16.7.7 BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 Chien JT, 2006, IEEE T AUDIO SPEECH, V14, P1245, DOI 10.1109/TSA.2005.858061 EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550 Guo W., 2012, CECNET 12, P3001 Han H, 2011, INT J AUDIOL, V50, P59, DOI 10.3109/14992027.2010.526637 KEATING PA, 1994, SPEECH COMMUN, V14, P131, DOI 10.1016/0167-6393(94)90004-3 Kocinski J, 2011, SPEECH COMMUN, V53, P390, DOI 10.1016/j.specom.2010.11.002 Kocinski J, 2008, SPEECH COMMUN, V50, P29, DOI 10.1016/j.specom.2007.06.003 Kokkinakis K, 2008, J ACOUST SOC AM, V123, P2379, DOI 10.1121/1.2839887 Lee TW, 1999, NEURAL COMPUT, V11, P417, DOI 10.1162/089976699300016719 Liquan Z., 2008, WICOM 08, P1 Luo FL, 2002, IEEE T SIGNAL PROCES, V50, P1583 Lv Q, 2005, LECT NOTES COMPUT SC, V3497, P538 Madhu N., 2006, IWAENC 06, P1 Mangalathu-Arumana J, 2012, NEUROIMAGE, V60, P2247, DOI 10.1016/j.neuroimage.2012.02.030 Marques I, 2012, SOFT COMPUT, V16, P1525, DOI 10.1007/s00500-012-0826-4 Mitianoudis N, 2004, LECT NOTES COMPUT SC, V3195, P669 Park HM, 1999, ELECTRON LETT, V35, P2011, DOI 10.1049/el:19991358 Parra LC, 2002, IEEE T SPEECH AUDI P, V10, P352, DOI 10.1109/TSA.2002.803443 Saruwatari H, 2003, EURASIP J APPL SIG P, V2003, P1135, DOI 10.1155/S1110865703305104 SARUWATARI H, 2001, ACOUST SPEECH SIG PR, P2733 Saruwatari H, 2006, IEEE T AUDIO SPEECH, V14, P666, DOI 10.1109/TSA.2005.855832 Scalart P, 1996, INT CONF ACOUST SPEE, P629, DOI 10.1109/ICASSP.1996.543199 Smaragdis P, 1998, NEUROCOMPUTING, V22, P21, DOI 
10.1016/S0925-2312(98)00047-2 Thiede T, 2000, J AUDIO ENG SOC, V48, P3 Tunner C.W., 2004, J ACOUST SOC AM, V115, P1729 Ukai S, 2005, INT CONF ACOUST SPEE, P85 Van Engen KJ, 2007, J ACOUST SOC AM, V121, P519, DOI 10.1121/1.2400666 VARGA A, 1993, SPEECH COMMUN, V12, P247, DOI 10.1016/0167-6393(93)90095-3 NR 33 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2013 VL 55 IS 4 BP 544 EP 552 DI 10.1016/j.specom.2012.11.002 PG 9 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 122NW UT WOS:000317325800004 ER PT J AU Gonseth, C Vilain, A Vilain, C AF Gonseth, Chloe Vilain, Anne Vilain, Coriandre TI An experimental study of speech/gesture interactions and distance encoding SO SPEECH COMMUNICATION LA English DT Article DE Speech/gesture interaction; Pointing; Distance encoding; Sound symbolism ID SPEECH PRODUCTION; GESTURE; LANGUAGE AB This paper explores the possible encoding of distance information in vocal and manual pointing and its relationship with the linguistic structure of deictic words, as well as speech/gesture cooperation within the process of deixis. Two experiments required participants to point at and/or name a close or distant target, with speech only, with gesture only, or with speech gesture. Acoustic, articulatory, and manual data were recorded. We investigated the interaction between vocal and manual pointing, with respect to the distance to the target. There are two major findings. First, distance significantly affects both articulatory and manual pointing, since participants perform larger vocal and manual gestures to designate a more distant target. Second, modality influences both deictic speech and gesture, since pointing is more emphatic in unimodal use of either over bimodal use of both, to compensate for the loss of the other mode. These findings suggest that distance is encoded in both vocal and manual pointing. We also demonstrate that the correlates of distance encoding in the vocal modality can be related to the typology of deictic words. Finally, our data suggest a two-way interaction between speech and gesture, and support the hypothesis that these two modalities are cooperating within a single communication system. (C) 2012 Elsevier B.V. All rights reserved. C1 [Gonseth, Chloe; Vilain, Anne; Vilain, Coriandre] Grenoble Univ, CNRS UMR 5216, Gipsa Lab, Speech & Cognit Dept, F-38402 St Martin Dheres, France. RP Gonseth, C (reprint author), Grenoble Univ, CNRS UMR 5216, Gipsa Lab, Speech & Cognit Dept, 11 Rue Math,Grenoble Campus,BP 46, F-38402 St Martin Dheres, France. 
EM chloe.gonseth@gipsa-lab.grenoble-inp.fr; anne.vilain@gipsa-lab.grenoble-inp.fr; coriandre.vilain@gipsa-lab.grenoble-inp.fr CR Arbib MA, 2005, BEHAV BRAIN SCI, V28, P105, DOI 10.1017/S0140525X05000038 Astafiev SV, 2003, J NEUROSCI, V23, P4689 Bernardis P, 2006, NEUROPSYCHOLOGIA, V44, P178, DOI 10.1016/j.neuropsychologia.2005.05.007 Boersma P., 2001, GLOT INT, V5, P341 Bonfiglioli C, 2009, COGNITION, V111, P270, DOI 10.1016/j.cognition.2009.01.006 Bruner J.S., 1983, CHILD TALK Butterworth G., 1998, DEV SENSORY MOTOR CO, P171 Chieffi S, 2009, BEHAV BRAIN RES, V203, P200, DOI 10.1016/j.bbr.2009.05.003 de Ruiter JP, 1998, THESIS CATHOLIC U NI Diessel H., 1999, DEMONSTRATIVES FORM, P42 DIESSEL HOLGER, 2011, WORLD ATLAS LANGUAGE Enfield NJ, 2003, LANGUAGE, V79, P82, DOI 10.1353/lan.2003.0075 Feyereisen P, 1997, J MEM LANG, V36, P13, DOI 10.1006/jmla.1995.2458 GENTILUCCI M, 1991, NEUROPSYCHOLOGIA, V29, P361, DOI 10.1016/0028-3932(91)90025-4 BUTTERWORTH B, 1989, PSYCHOL REV, V96, P168, DOI 10.1037//0033-295X.96.1.168 Hostetter AB, 2008, PSYCHON B REV, V15, P495, DOI 10.3758/PBR.15.3.495 Iverson JM, 2005, PSYCHOL SCI, V16, P367, DOI 10.1111/j.0956-7976.2005.01542.x Johansson N., 2011, THESIS LUNDS U Kendon A., 2004, GESTURE VISIBLE ACTI Kita S., 2003, POINTING LANGUAGE CU Kita S, 2003, J MEM LANG, V48, P16, DOI 10.1016/S0749-596X(02)00505-3 Engberg-Pedersen E, 2003, POINTING: WHERE LANGAUAGE, CULTURE, AND COGNITON MEET, P269 Krahmer E, 2007, J MEM LANG, V57, P396, DOI 10.1016/j.jml.2007.06.005 Krause MA, 1997, J COMP PSYCHOL, V111, P330, DOI 10.1037/0735-7036.111.4.330 Krauss R. M., 2000, LANGUAGE GESTURE, P261, DOI DOI 10.1017/CBO9780511620850.017 Leavens DA, 2005, CURR DIR PSYCHOL SCI, V14, P185, DOI 10.1111/j.0963-7214.2005.00361.x Levelt W. J., 1989, SPEAKING INTENTION A LEVELT WJM, 1985, J MEM LANG, V24, P133, DOI 10.1016/0749-596X(85)90021-X Levelt WJM, 1999, BEHAV BRAIN SCI, V22, P1 LINDBLOM B, 1979, J PHONETICS, V7, P147 Loevenbruck H, 2005, J NEUROLINGUIST, V18, P237, DOI 10.1016/j.jneuroling.2004.12.002 MAEDA S, 1991, J PHONETICS, V19, P321 McNeill D., 2005, GESTURE AND THOUGHT McNeill D., 2000, LANGUAGE AND GESTURE Rochet-Capellan A., 2008, P INT SEM SPEECH PRO Sapir E., 1949, SELECTED WRITINGS E, P61 Tomasello M, 2005, BEHAV BRAIN SCI, V28, P675, DOI 10.1017/S0140525X05000129 Traunmuller H., 1987, PSYCHOPHYSICS SPEECH, P293 Traunmuller H., 1996, TMH QPSR, V2, P147 Ultan R., 1978, UNIVERSALS HUMAN LAN, V2 Volterra Virginia, 2005, NATURE NURTURE ESSAY, P3 WOODWORTH NL, 1991, LINGUISTICS, V29, P273, DOI 10.1515/ling.1991.29.2.273 NR 42 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2013 VL 55 IS 4 BP 553 EP 571 DI 10.1016/j.specom.2012.11.003 PG 19 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 122NW UT WOS:000317325800005 ER PT J AU Cooke, M Mayo, C Valentini-Botinhao, C Stylianou, Y Sauert, B Tang, Y AF Cooke, Martin Mayo, Catherine Valentini-Botinhao, Cassia Stylianou, Yannis Sauert, Bastian Tang, Yan TI Evaluating the intelligibility benefit of speech modifications in known noise conditions SO SPEECH COMMUNICATION LA English DT Article DE Speech intelligibility; Speech modification; Synthetic speech ID LISTENING CONDITIONS; HEARING; ENVIRONMENTS; ENHANCEMENT; RECOGNITION; ALGORITHM; MODEL; CLEAR AB The use of live and recorded speech is widespread in applications where correct message reception is important. 
Furthermore, the deployment of synthetic speech in such applications is growing. Modifications to natural and synthetic speech have therefore been proposed which aim at improving intelligibility in noise. The current study compares the benefits of speech modification algorithms in a large-scale speech intelligibility evaluation and quantifies the equivalent intensity change, defined as the amount in decibels that unmodified speech would need to be adjusted by in order to achieve the same intelligibility as modified speech. Listeners identified keywords in phonetically-balanced sentences representing ten different types of speech: plain and Lombard speech, five types of modified speech, and three forms of synthetic speech. Sentences were masked by either a stationary or a competing speech masker. Modification methods varied in the manner and degree to which they exploited estimates of the masking noise. The best-performing modifications led to equivalent intensity changes of around 5 dB in moderate and high noise levels for the stationary masker, and 3-4 dB in the presence of competing speech. These gains exceed those produced by Lombard speech. Synthetic speech in noise was always less intelligible than plain natural speech, but modified synthetic speech reduced this deficit by a significant amount. (C) 2013 Elsevier B.V. All rights reserved. C1 [Cooke, Martin] Ikerbasque Basque Sci Fdn, Bilbao, Spain. [Cooke, Martin; Tang, Yan] Univ Basque Country, Language & Speech Lab, Vitoria, Spain. [Mayo, Catherine; Valentini-Botinhao, Cassia] Univ Edinburgh, Ctr Speech Technol Res, Edinburgh, Midlothian, Scotland. [Stylianou, Yannis] ICS FORTH, Inst Comp Sci, Iraklion, Greece. [Sauert, Bastian] Rhein Westfal TH Aachen, Inst Commun Syst & Data Proc, Aachen, Germany. RP Cooke, M (reprint author), Univ Basque Country, Language & Speech Lab, Vitoria, Spain. EM m.cooke@ikerbasque.org FU European Community [213850]; Future and Emerging Technologies (FET) programme under FET-Open grant [256230] FX We thank Vasilis Karaiskos for help in running the listening tests, Julian Villegas for contributions to the recording of speech material, and T-C. Zorila, V. Kandia and D. Erro for useful discussions on developing SSDRC and TMDRC. The research leading to these results was partly funded from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement 213850 (SCALE) and by the Future and Emerging Technologies (FET) programme under FET-Open grant number 256230 (LISTA). CR ANSI, 1997, S351997 ANSI Bell S.T., 1992, J SPEECH HEAR RES, V35, P950 Blesser B.A., 1969, IEEE T AUDIO ELECTRO, V17 Boersma P., 2001, GLOT INT, V5, P341 Bradlow AR, 2007, J ACOUST SOC AM, V121, P2339, DOI 10.1121/1.2642103 Christiansen C, 2010, SPEECH COMMUN, V52, P678, DOI 10.1016/j.specom.2010.03.004 Cooke M, 2010, J ACOUST SOC AM, V128, P2059, DOI 10.1121/1.3478775 Cooke M, 2006, J ACOUST SOC AM, V119, P1562, DOI 10.1121/1.2166600 DREHER JJ, 1957, J ACOUST SOC AM, V29, P1320, DOI 10.1121/1.1908780 Dreschler WA, 2001, AUDIOLOGY, V40, P148 Erro D., 2012, P INTERSPEECH FESTEN JM, 1990, J ACOUST SOC AM, V88, P1725, DOI 10.1121/1.400247 Hazan V., 1996, SPEECH HEARING LANGU, V9, P43 Hazan V, 2011, J ACOUST SOC AM, V130, P2139, DOI 10.1121/1.3623753 Taal C. H., 2012, Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2012), DOI 10.1109/ICASSP.2012.6288810 Holland J. 
H., 1975, ADAPTATION NATURAL A, V2nd Howell P, 2006, PERCEPT PSYCHOPHYS, V68, P139, DOI 10.3758/BF03193664 Huang D., 2010, P SSW7 KYOT JAP, P258 Kates JM, 1998, SPRING INT SER ENG C, P235 Kawahara H, 1999, SPEECH COMMUN, V27, P187, DOI 10.1016/S0167-6393(98)00085-5 Langner B, 2005, INT CONF ACOUST SPEE, P265 Lindblom B., 1990, SPEECH PRODUCTION SP, V55, P403 Lombard E., 1911, ANN MALADIES OREILLE, V37, P101 Lu YY, 2008, J ACOUST SOC AM, V124, P3261, DOI 10.1121/1.2990705 McLoughlin IV, 1997, DSP 97: 1997 13TH INTERNATIONAL CONFERENCE ON DIGITAL SIGNAL PROCESSING PROCEEDINGS, VOLS 1 AND 2, P591 Moore R.K., 2011, 17 INT C PHON SCI, P1422 NIEDERJOHN RJ, 1976, IEEE T ACOUST SPEECH, V24, P277, DOI 10.1109/TASSP.1976.1162824 Patel R, 2008, J SPEECH LANG HEAR R, V51, P209, DOI 10.1044/1092-4388(2008/016) PICHENY MA, 1985, J SPEECH HEAR RES, V28, P96 Pittman AL, 2001, J SPEECH LANG HEAR R, V44, P487, DOI 10.1044/1092-4388(2001/038) Raitio T., 2012, P ICASSP, P4015 Raitio T., 2011, P INTERSPEECH, P2781 RIX AW, 2001, ACOUST SPEECH SIG PR, P749 Rothauser E. H., 1969, IEEE T AUDIO ELECTRO, V17, P225, DOI DOI 10.1109/TAU.1969.1162058 Sauert B., 2010, P ITG FACHT SPRACHK Sauert B, 2006, INT CONF ACOUST SPEE, P493 Sauert B., 2011, P C EL SPRACHS ESSV, P333 Skowronski MD, 2006, SPEECH COMMUN, V48, P549, DOI 10.1016/j.specom.2005.09.003 SoX, 2012, SOX SOUND EXCHANGE S STUDEBAKER GA, 1987, J ACOUST SOC AM, V81, P1130, DOI 10.1121/1.394633 Summers W V, 1988, J Acoust Soc Am, V84, P917, DOI 10.1121/1.396660 Taal CH, 2011, IEEE T AUDIO SPEECH, V19, P2125, DOI 10.1109/TASL.2011.2114881 Tang Y., 2010, P INTERSPEECH, P1636 Tang Y., 2011, P INTERSPEECH, P345 Tang Y., 2012, P INTERSPEECH Toda T, 2007, IEICE T INF SYST, VE90D, P816, DOI 10.1093/ietisy/e90-d.5.816 Uther M, 2007, SPEECH COMMUN, V49, P2, DOI 10.1016/j.specom.2006.10.003 Valentini-Botinhao C., 2012, Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2012), DOI 10.1109/ICASSP.2012.6288794 Valentini-Botinhao C., 2012, P INTERSPEECH Yamagishi J., 2008, P BLIZZ CHALL WORKSH Yamagishi J, 2009, IEEE T AUDIO SPEECH, V17, P66, DOI 10.1109/TASL.2008.2006647 Yamagishi J, 2009, IEEE T AUDIO SPEECH, V17, P1208, DOI 10.1109/TASL.2009.2016394 Yoo SD, 2007, J ACOUST SOC AM, V122, P1138, DOI 10.1121/1.2751257 Zen H, 2009, SPEECH COMMUN, V51, P1039, DOI 10.1016/j.specom.2009.04.004 Zorila T. C., 2012, P INTERSPEECH ZWICKER E, 1961, J ACOUST SOC AM, V33, P248, DOI 10.1121/1.1908630 NR 56 TC 11 Z9 11 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2013 VL 55 IS 4 BP 572 EP 585 DI 10.1016/j.specom.2013.01.001 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 122NW UT WOS:000317325800006 ER PT J AU Dai, P Soon, IY AF Dai, Peng Soon, Ing Yann TI An improved model of masking effects for robust speech recognition system SO SPEECH COMMUNICATION LA English DT Article DE Automatic speech recognition; Auditory modeling; Simultaneous masking; Temporal masking; AURORA2 ID AUDITORY-SYSTEM; LATERAL INHIBITION; WORD RECOGNITION; FRONT-END; FREQUENCY; NOISE; ADAPTATION; PERCEPTION; FEATURES; NERVE AB Performance of an automatic speech recognition system drops dramatically in the presence of background noise unlike the human auditory system which is more adept at noisy speech recognition. 
This paper proposes a novel auditory modeling algorithm which is integrated into the feature extraction front-end for Hidden Markov Model (HMM). The proposed algorithm is named LTFC which simulates properties of the human auditory system and applies it to the speech recognition system to enhance its robustness. It integrates simultaneous masking, temporal masking and cepstral mean and variance normalization into ordinary mel-frequency cepstral coefficients (MFCC) feature extraction algorithm for robust speech recognition. The proposed method sharpens the power spectrum of the signal in both the frequency domain and the time domain. Evaluation tests are carried out on the AURORA2 database. Experimental results show that the word recognition rate using our proposed feature extraction method has been effectively increased. (c) 2012 Elsevier B.V. All rights reserved. C1 [Dai, Peng; Soon, Ing Yann] Nanyang Technol Univ, Sch Elect & Elect Engn, Singapore 639798, Singapore. RP Dai, P (reprint author), Nanyang Technol Univ, Sch Elect & Elect Engn, Singapore 639798, Singapore. EM daip0001@e.ntu.edu.sg; eiysoon@ntu.edu.sg CR Allen JB, 1994, IEEE T SPEECH AUDI P, V2, P567, DOI 10.1109/89.326615 Bouquin L., 1995, P IEEE INT C AC SPEE, V1, P800 Brookes M., VOICEBOX Chen CP, 2007, IEEE T AUDIO SPEECH, V15, P257, DOI 10.1109/TASL.2006.876717 Chen JD, 2007, SPEECH COMMUN, V49, P305, DOI 10.1016/j.specom.2007.02.002 CHENG YM, 1991, IEEE T SIGNAL PROCES, V39, P1943, DOI 10.1109/78.134427 Dai P., 2010, SPEECH COMMUN, V53, P229 Dai P., 2009, P ICICS MAC, P1 Dai P, 2012, SPEECH COMMUN, V54, P402, DOI 10.1016/j.specom.2011.10.004 FURUI S, 1986, IEEE T ACOUST SPEECH, V34, P52, DOI 10.1109/TASSP.1986.1164788 Gold B., 2000, SPEECH AUDIO SIGNAL Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616 Hermansky H, 1998, SPEECH COMMUN, V25, P3, DOI 10.1016/S0167-6393(98)00027-2 Holmberg M, 2006, IEEE T AUDIO SPEECH, V14, P43, DOI 10.1109/TSA.2005.860349 HOUTGAST T, 1972, J ACOUST SOC AM, V51, P1885, DOI 10.1121/1.1913048 JESTEADT W, 1982, J ACOUST SOC AM, V71, P950, DOI 10.1121/1.387576 Lu X., 2000, P IEEE SIGN PROC SOC, V2, P785 McGovern S. G., MODEL ROOM ACOUSTICS MILNER B, 2002, ACOUST SPEECH SIG PR, P797 Mokbel C, 1996, SPEECH COMMUN, V19, P185, DOI 10.1016/0167-6393(96)00032-5 Oxenham AJ, 2001, J ACOUST SOC AM, V109, P732, DOI 10.1121/1.1336501 Palomaki KJ, 2011, SPEECH COMMUN, V53, P924, DOI 10.1016/j.specom.2011.03.005 Park KY, 2003, NEUROCOMPUTING, V52-4, P615, DOI 10.1016/S0925-2312(02)00791-9 Pearce D., 2000, P ISCA ITRW ASR, P17 Roeser R.J., 2000, AUDIOLOGY DIAGNOSIS SHAMMA SA, 1985, J ACOUST SOC AM, V78, P1612, DOI 10.1121/1.392799 SHAMMA SA, 1985, J ACOUST SOC AM, V78, P1622, DOI 10.1121/1.392800 Strope B, 1997, IEEE T SPEECH AUDI P, V5, P451, DOI 10.1109/89.622569 Tchorz J, 1999, J ACOUST SOC AM, V106, P2040, DOI 10.1121/1.427950 Togneri R., 2010, P IEEE INT C AC SPEE, P1618 Virag N, 1999, IEEE T SPEECH AUDI P, V7, P126, DOI 10.1109/89.748118 Zhang B, 2012, ACTA ACUST UNITED AC, V98, P328, DOI 10.3813/AAA.918516 Zhu WZ, 2005, INT CONF ACOUST SPEE, P245 NR 33 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
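Of the three components that the Dai and Soon record above combines into its feature extraction front-end, cepstral mean and variance normalisation is the simplest to state; the sketch below applies per-utterance CMVN to a stand-in MFCC matrix. The random matrix is a placeholder for real MFCC features, and the masking-based spectral sharpening stages of the proposed LTFC front-end are not reproduced here.

import numpy as np

def cmvn(features, eps=1e-8):
    # Per-utterance cepstral mean and variance normalisation: each coefficient
    # track is shifted to zero mean and scaled to unit variance.
    mean = features.mean(axis=0, keepdims=True)
    std = features.std(axis=0, keepdims=True)
    return (features - mean) / (std + eps)

# Toy stand-in for a (frames x 13) MFCC matrix from one utterance.
rng = np.random.default_rng(0)
mfcc = rng.normal(loc=5.0, scale=3.0, size=(300, 13))

normalised = cmvn(mfcc)
print(normalised.mean(axis=0).round(6))   # ~0 per coefficient
print(normalised.std(axis=0).round(6))    # ~1 per coefficient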
PD MAR PY 2013 VL 55 IS 3 BP 387 EP 396 DI 10.1016/j.specom.2012.12.005 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 115UI UT WOS:000316837000001 ER PT J AU Kane, J Gobl, C AF Kane, John Gobl, Christer TI Automating manual user strategies for precise voice source analysis SO SPEECH COMMUNICATION LA English DT Article DE Voice source; Glottal source; LF model; Inverse filtering; Voice quality ID PLUS NOISE MODEL; GLOTTAL FLOW; SPEECH SYNTHESIS; SIGNALS; WAVE; PARAMETRIZATION; MINIMIZATION; PREDICTION; ALGORITHM AB A large part of the research carried out at the Phonetics and Speech Laboratory is concerned with the role of the voice source in the prosody of spoken language, including its linguistic and expressive dimensions. Due to the lack of robustness of automatic voice source analysis methods we have tended to use labour intensive methods which require pulse-by-pulse manual optimisation. This has affected the feasibility of conducting analysis on large volumes of data. To address this, a new method is proposed for automatic parameterisation of the deterministic component of the voice source by simulating the strategies used in the manual optimisation approach. The method involves a combination of exhaustive search, dynamic programming and optimisation methods, with settings derived from analysis of previous manual voice source analysis. A quantitative evaluation demonstrated clearly closer model parameter values to our reference values, compared with a standard time domain-based approach and a phase minimisation method. A complementary qualitative analysis illustrated broadly similar findings, in terms of voice source dynamics in various placements of focus, when using the proposed algorithm compared with a previous study which employed the manual optimisation approach. (c) 2012 Elsevier B.V. All rights reserved. C1 [Kane, John; Gobl, Christer] Trinity Coll Dublin, Phonet & Speech Lab, Ctr Language & Commun Studies, Sch Linguist Speech & Commun Sci, Dublin, Ireland. RP Kane, J (reprint author), Trinity Coll Dublin, Phonet & Speech Lab, Ctr Language & Commun Studies, Sch Linguist Speech & Commun Sci, Dublin, Ireland. EM kanejo@tcd.ie; cegobl@tcd.ie FU Science Foundation Ireland [07/CE/I1142, 09/IN.1/I2631]; Irish Department of Arts, Heritage and the Gaeltacht (ABAIR project) FX This work is supported by the Science Foundation Ireland, Grant 07/CE/I1142 (Centre for Next Generation Localisation, www.cngl.ie) and Grant 09/IN.1/I2631 (FASTNET) as well as by the Irish Department of Arts, Heritage and the Gaeltacht (ABAIR project). We would like to thank Dr. Irena Yanushevskaya for carrying out the manually optimised voice source analysis used in this study. The authors would like to thank the anonymous reviewers whose comments and suggestions have helped us to significantly improve this paper. CR Airas M., 2007, P INT 2007, P1410 ALKU P, 1992, SPEECH COMMUN, V11, P109, DOI 10.1016/0167-6393(92)90005-R Alku P, 2002, J ACOUST SOC AM, V112, P701, DOI 10.1121/1.1490365 Alku P, 2011, SADHANA-ACAD P ENG S, V36, P623, DOI 10.1007/s12046-011-0041-5 Alku P, 2009, J ACOUST SOC AM, V125, P3289, DOI 10.1121/1.3095801 Arroabarren I., 2003, P EUROSPEECH 0901, P57 Baayen R. 
Harald, 2008, ANAL LINGUISTIC DATA Bozkurt B, 2005, IEEE SIGNAL PROC LET, V12, P344, DOI 10.1109/LSP.2005.843770 Brent RP, 1973, ALGORITHMS MINIMIZAT Cabral JP, 2011, INT CONF ACOUST SPEE, P4704 Degottex G, 2011, IEEE T AUDIO SPEECH, V19, P1080, DOI 10.1109/TASL.2010.2076806 Degottex G, 2011, INT CONF ACOUST SPEE, P5128 Degottex G., 2009, P SPECOM ST PET, P226 Doval B., 2001, P EUROSPEECH SCAND Drugman T., 2009, P INTERSPEECH, P1779 Drugman T., 2009, P INTERSPEECH, P116 Drugman T., 2009, P INTERSPEECH, P2891 Drugman T, 2012, IEEE T AUDIO SPEECH, V20, P994, DOI 10.1109/TASL.2011.2170835 Drugman T, 2012, COMPUT SPEECH LANG, V26, P20, DOI 10.1016/j.csl.2011.03.003 Fant G., 1985, Q PROGR STATUS REPOR, V4, P1 Fant G., 1995, STL QPSR, V36, P119 Frohlich M, 2001, J ACOUST SOC AM, V110, P479, DOI 10.1121/1.1379076 Gobl C., 2003, THESIS KTH SPEECH MU Gobl C., 2003, AMPLITUDE BASED SOUR, P151 Gobl C., 2010, P INT 2010, P2606 HACKI T, 1989, FOLIA PHONIATR, V41, P43 Hanson HM, 2001, J PHONETICS, V29, P451, DOI 10.1006/jpho.2001.0146 Hanson HM, 1997, J ACOUST SOC AM, V101, P466, DOI 10.1121/1.417991 ITAKURA F, 1975, IEEE T ACOUST SPEECH, VAS23, P67, DOI 10.1109/TASSP.1975.1162641 Kominek J., 2004, ISCA SPEECH SYNTH WO, P223 Kreiman J, 2006, ANAL SYNTHESIS PATHO NELDER JA, 1965, COMPUT J, V7, P308 NEY H, 1983, IEEE T SYST MAN CYB, V13, P208 Ni Chasaide A., 2011, P ICPHS HONG KONG, P1470 Ni Chasaide A., 1999, COARTICULATION THEOR, P300 O' Brien D., 2011, P IR SIGN SYST C ISS O' Cinneide A., 2011, P INTERSPEECH, P57 Pantazis Y, 2008, INT CONF ACOUST SPEE, P4609, DOI 10.1109/ICASSP.2008.4518683 Raitio T, 2011, IEEE T AUDIO SPEECH, V19, P153, DOI 10.1109/TASL.2010.2045239 Rodet X., 2007, P C DIG AUD EFF DAFX, P1 Strik H., 1993, P 3 EUR C SPEECH TEC, V1, P103 Strik H, 1998, J ACOUST SOC AM, V103, P2659, DOI 10.1121/1.422786 Stylianou Y, 2001, IEEE T SPEECH AUDI P, V9, P21, DOI 10.1109/89.890068 Talkin D., 1995, SPEECH CODING SYNTHE, P495 Thomas MRP, 2009, IEEE T AUDIO SPEECH, V17, P1557, DOI 10.1109/TASL.2009.2022430 TIMCKE R, 1958, ARCHIV OTOLARYNGOL, V68, P1 Vainio M, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 1-2, P921 Veeneman D., 1985, ACOUSTICS SPEECH SIG, V33, P369 Walker J, 2007, LECT NOTES COMPUT SC, V4391, P1 WONG DY, 1979, IEEE T ACOUST SPEECH, V27, P350, DOI 10.1109/TASSP.1979.1163260 Yanushevskaya I, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 1-2, P462 Yegnanarayana B, 1998, IEEE T SPEECH AUDI P, V6, P313, DOI 10.1109/89.701359 NR 52 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
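As a hedged illustration of the numerical-optimisation stage mentioned in the voice-source record above (whose reference list includes Nelder-Mead simplex search), here is a minimal sketch of fitting a parametric pulse to one observed glottal cycle by minimising a time-domain squared error. The pulse model below is a deliberately simple stand-in, not the LF model, and all parameter names are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

def model_pulse(params, n):
    """Toy stand-in for a glottal-pulse model (NOT the LF model):
    a damped sinusoid parameterised by amplitude, frequency and decay."""
    amp, freq, damp = params
    t = np.linspace(0.0, 1.0, n)
    return amp * np.sin(2.0 * np.pi * freq * t) * np.exp(-damp * t)

def fit_pulse(observed, init=(1.0, 2.0, 1.0)):
    """Fit the toy model to one observed cycle by Nelder-Mead simplex search,
    mirroring the per-pulse time-domain error minimisation step."""
    err = lambda p: np.sum((model_pulse(p, len(observed)) - observed) ** 2)
    return minimize(err, init, method='Nelder-Mead').x
```

The exhaustive-search and dynamic-programming components of the published method are not represented in this sketch.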
PD MAR PY 2013 VL 55 IS 3 BP 397 EP 414 DI 10.1016/j.specom.2012.12.004 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 115UI UT WOS:000316837000002 ER PT J AU Hahm, SJ Watanabe, S Ogawa, A Fujimoto, M Hori, T Nakamura, A AF Hahm, Seong-Jun Watanabe, Shinji Ogawa, Atsunori Fujimoto, Masakiyo Hori, Takaaki Nakamura, Atsushi TI Prior-shared feature and model space speaker adaptation by consistently employing map estimation SO SPEECH COMMUNICATION LA English DT Article DE Speech recognition; Speaker adaptation; Feature space normalization; Model space adaptation; Prior distribution sharing ID CONTINUOUS SPEECH RECOGNITION; MAXIMUM-LIKELIHOOD APPROACH; HIDDEN MARKOV-MODELS; LINEAR-REGRESSION; NORMALIZATION; PARAMETERS; INFERENCE AB The purpose of this paper is to describe the development of a speaker adaptation method that improves speech recognition performance regardless of the amount of adaptation data. For that purpose, we propose the consistent employment of a maximum a posteriori (MAP)-based Bayesian estimation for both feature space normalization and model space adaptation. Namely, constrained structural maximum a posteriori linear regression (CSMAPLR) is first performed in a feature space to compensate for the speaker characteristics, and then, SMAPLR is performed in a model space to capture the remaining speaker characteristics. A prior distribution stabilizes the parameter estimation especially when the amount of adaptation data is small. In the proposed method, CSMAPLR and SMAPLR are performed based on the same acoustic model. Therefore, the dimension-dependent variations of feature and model spaces can be similar. Dimension-dependent variations of the transformation matrix are explained well by the prior distribution. Therefore, by sharing the same prior distribution between CSMAPLR and SMAPLR, their parameter estimations can be appropriately regularized in both spaces. Experiments on large vocabulary continuous speech recognition using the Corpus of Spontaneous Japanese (CSJ) and the MIT Open-CourseWare corpus (MIT-OCW) confirm the effectiveness of the proposed method compared with other conventional adaptation methods with and without using speaker adaptive training. (c) 2012 Elsevier B.V. All rights reserved. C1 [Hahm, Seong-Jun; Watanabe, Shinji; Ogawa, Atsunori; Fujimoto, Masakiyo; Hori, Takaaki; Nakamura, Atsushi] NTT Corp, NTT Commun Sci Labs, Seika, Kyoto 6190237, Japan. RP Hahm, SJ (reprint author), NTT Corp, NTT Commun Sci Labs, 2-4 Hikaridai, Seika, Kyoto 6190237, Japan. 
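The adaptation record above rests on MAP estimation regularised by a prior distribution. As a minimal, generic sketch of that principle (not of CSMAPLR or SMAPLR themselves), the following shows MAP estimation of a single Gaussian mean, where a prior weight keeps the estimate stable when adaptation data are scarce; `prior_weight` is an illustrative hyperparameter.

```python
import numpy as np

def map_mean(data, prior_mean, prior_weight=10.0):
    """MAP estimate of a Gaussian mean: a count-weighted interpolation between
    the prior mean and the sample mean of the adaptation data.  With little
    data the prior dominates; with ample data the estimate follows the data."""
    data = np.asarray(data, dtype=float)
    n = len(data)
    sample_mean = data.mean() if n > 0 else 0.0
    return (prior_weight * prior_mean + n * sample_mean) / (prior_weight + n)

# With 2 adaptation frames the estimate stays near the prior;
# with 2000 frames it moves essentially to the adaptation data's own mean.
```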
EM seongjun.hahm@lab.ntt.co.jp RI Hahm, Seong-Jun/I-6719-2013 CR Anastasakos T, 1997, INT CONF ACOUST SPEE, P1043, DOI 10.1109/ICASSP.1997.596119 BAHL LR, 1983, IEEE T PATTERN ANAL, V5, P179 Breslin C., 2010, P INT JAP, P1644 CHEN KT, 2000, P ICSLP, V3, P742 Chou W., 1999, P ICASSP, P1 DIGALAKIS VV, 1995, IEEE T SPEECH AUDI P, V3, P357, DOI 10.1109/89.466659 Eide E, 1996, INT CONF ACOUST SPEE, P346, DOI 10.1109/ICASSP.1996.541103 Gales MJF, 1998, COMPUT SPEECH LANG, V12, P75, DOI 10.1006/csla.1998.0043 Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278 Glass J., 2007, P INTERSPEECH, P2553 Hahm S, 2010, INT CONF ACOUST SPEE, P4302, DOI 10.1109/ICASSP.2010.5495672 Hahm SJ, 2010, IEICE T INF SYST, VE93D, P1927, DOI 10.1587/transinf.E93.D.1927 Hazen TJ, 2000, SPEECH COMMUN, V31, P15, DOI 10.1016/S0167-6393(99)00059-X Hori T, 2007, IEEE T AUDIO SPEECH, V15, P1352, DOI 10.1109/TASL.2006.889790 Huang J., 2005, P INT C MULT EXP, P338 Kuhn R, 2000, IEEE T SPEECH AUDI P, V8, P695, DOI 10.1109/89.876308 LEE CH, 1991, IEEE T SIGNAL PROCES, V39, P806, DOI 10.1109/78.80902 LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010 Lei X, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P773 Maekawa K, 2000, P LREC2000, V2, P947 MENG XL, 1993, BIOMETRIKA, V80, P267, DOI 10.2307/2337198 Moreno PJ, 1996, INT CONF ACOUST SPEE, P733, DOI 10.1109/ICASSP.1996.543225 Nakano Y, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P2286 Povey D, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P1145 Pye D, 1997, INT CONF ACOUST SPEE, P1047, DOI 10.1109/ICASSP.1997.596120 Rabiner L, 1993, FUNDAMENTALS SPEECH Shinoda K, 2001, IEEE T SPEECH AUDI P, V9, P276, DOI 10.1109/89.906001 Siohan O, 2001, IEEE T SPEECH AUDI P, V9, P417, DOI 10.1109/89.917687 Siohan O, 2002, COMPUT SPEECH LANG, V16, P5, DOI 10.1006/csla.2001.0181 Strang G., 2003, INTRO LINEAR ALGEBRA Watanabe S, 2004, IEEE T SPEECH AUDI P, V12, P365, DOI 10.1109/TSA.2004.828640 Watanabe S, 2010, IEEE T AUDIO SPEECH, V18, P395, DOI 10.1109/TASL.2009.2029717 Watanabe S., 2011, IEEE INT WORKSH MACH, P1 Woodland P. C., 2001, ISCA TUT RES WORKSH Yu K, 2007, IEEE T AUDIO SPEECH, V15, P1932, DOI 10.1109/TASL.2007.901300 Yu K, 2006, INT CONF ACOUST SPEE, P217 Zajic Z, 2009, LECT NOTES ARTIF INT, V5729, P274, DOI 10.1007/978-3-642-04208-9_39 NR 37 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2013 VL 55 IS 3 BP 415 EP 431 DI 10.1016/j.specom.2012.12.002 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 115UI UT WOS:000316837000003 ER PT J AU Xu, T Wang, WW Dai, W AF Xu, Tao Wang, Wenwu Dai, Wei TI Sparse coding with adaptive dictionary learning for underdetermined blind speech separation SO SPEECH COMMUNICATION LA English DT Article DE Underdetermined blind speech separation (BSS); Sparse representation; Signal recovery; Adaptive dictionary learning ID AUDIO SOURCE SEPARATION; OVERCOMPLETE DICTIONARIES; NONSTATIONARY SOURCES; MATCHING PURSUITS; LEAST-SQUARES; MIXTURES; REPRESENTATIONS; IDENTIFICATION; APPROXIMATION; DECOMPOSITION AB A block-based approach coupled with adaptive dictionary learning is presented for underdetermined blind speech separation. 
The proposed algorithm, derived as a multi-stage method, is established by reformulating the underdetermined blind source separation problem as a sparse coding problem. First, the mixing matrix is estimated in the transform domain by a clustering algorithm. Then a dictionary is learned by an adaptive learning algorithm for which three algorithms have been tested, including the simultaneous codeword optimization (SimCO) technique that we have proposed recently. Using the estimated mixing matrix and the learned dictionary, the sources are recovered from the blocked mixtures by a signal recovery approach. The separated source components from all the blocks are concatenated to reconstruct the whole signal. The block-based operation has the advantage of improving considerably the computational efficiency of the source recovery process without degrading its separation performance. Numerical experiments are provided to show the competitive separation performance of the proposed algorithm, as compared with the state-of-the-art approaches. Using mutual coherence and sparsity index, the performance of a variety of dictionaries that are applied in underdetermined speech separation is compared and analyzed, such as the dictionaries learned from speech mixtures and ground truth speech sources, as well as those predefined by mathematical transforms such as discrete cosine transform (DCT) and short time Fourier transform (STFT). (c) 2013 Elsevier B.V. All rights reserved. C1 [Xu, Tao; Wang, Wenwu] Univ Surrey, Dept Elect Engn, Guildford GU2 7XH, Surrey, England. [Dai, Wei] Univ London Imperial Coll Sci Technol & Med, Dept Elect & Elect Engn, London SW7 2AZ, England. RP Xu, T (reprint author), Univ Surrey, Dept Elect Engn, Guildford GU2 7XH, Surrey, England. EM t.xu@surrey.ac.uk; w.wang@surrey.ac.uk; wei.dai1@imperial.ac.uk FU Engineering and Physical Sciences Research Council (EPSRC) of the UK [EP/H050000/1, EP/H012842/1]; Centre for Vision Speech and Signal Processing (CVSSP); China Scholarship Council (CSC); MOD University Defence Research Centre (UDRC) in Signal Processing FX We thank the Associate Editor Dr. Bin Ma and the anonymous reviewers for their helpful comments for improving our paper, and Dr. Mark Barnard for proofreading the manuscript. This work was supported in part by the Engineering and Physical Sciences Research Council (EPSRC) of the UK (Grant Nos. EP/H050000/1 and EP/H012842/1), the Centre for Vision Speech and Signal Processing (CVSSP), and the China Scholarship Council (CSC), and in part by the MOD University Defence Research Centre (UDRC) in Signal Processing. CR Aharon M, 2006, IEEE T SIGNAL PROCES, V54, P4311, DOI 10.1109/TSP.2006.881199 Alinaghi A, 2011, INT CONF ACOUST SPEE, P209 Araki S, 2007, SIGNALS COMMUN TECHN, P243, DOI 10.1007/978-1-4020-6479-1_9 Arberet S, 2010, IEEE T SIGNAL PROCES, V58, P121, DOI 10.1109/TSP.2009.2030854 Beck A, 2009, SIAM J IMAGING SCI, V2, P183, DOI 10.1137/080716542 Berg EVD, 2008, SIAM J SCI COMPUT, V31, P890, DOI DOI 10.1137/080714488 Blumensath T, 2008, IEEE T SIGNAL PROCES, V56, P2370, DOI 10.1109/TSP.2007.916124 Bofill P, 2001, SIGNAL PROCESS, V81, P2353, DOI 10.1016/S0165-1684(01)00120-7 Chen S. 
S., 1999, SIAM J SCI COMPUT, V20, P33 Cichocki A., 2009, NONNEGATIVE MATRIX T Cichocki A., 2003, ADAPTIVE BLIND SIGNA Comon P, 1998, P SOC PHOTO-OPT INS, V3461, P2, DOI 10.1117/12.325670 Comon P, 2004, IEEE T SIGNAL PROCES, V52, P11, DOI 10.1109/TSP.2003.820073 Dai W, 2009, IEEE T INFORM THEORY, V55, P2230, DOI 10.1109/TIT.2009.2016006 Dai W, 2012, IEEE T SIGNAL PROCES, V60, P6340, DOI 10.1109/TSP.2012.2215026 Daubechies I, 2010, COMMUN PUR APPL MATH, V63, P1 Demmel J.W., 1997, APPL NUMERICAL LINEA Donoho D., 2005, SPARSELAB, P25 Donoho DL, 2006, IEEE T INFORM THEORY, V52, P1289, DOI 10.1109/TIT.2006.871582 Donoho DL, 2006, COMMUN PUR APPL MATH, V59, P797, DOI 10.1002/cpa.20132 Edelman A, 1998, SIAM J MATRIX ANAL A, V20, P303, DOI 10.1137/S0895479895290954 Elad M., 2010, IEEE T SIGNAL PROCES, V58, P1558 Elad M, 2006, IEEE T IMAGE PROCESS, V15, P3736, DOI 10.1109/TIP.2006.881969 Friedlander M.P., 2008, SPG11 SPECTRAL PROJE Gowreesunker BV, 2008, INT CONF ACOUST SPEE, P33 Gowreesunker BV, 2009, LECT NOTES COMPUT SC, V5441, P34, DOI 10.1007/978-3-642-00599-2_5 Gribonval R, 2006, IEEE T INFORM THEORY, V52, P255, DOI 10.1109/TIT.2005.860474 Gribonval R., 2006, ESANN 06, P323 Hulle M.V., 1999, IEEE WORKSH NEUR NET, P315 Hyvarinen A, 2001, INDEPENDENT COMPONEN Jafari MG, 2011, IEEE J-STSP, V5, P1025, DOI 10.1109/JSTSP.2011.2157892 Jan T, 2011, SPEECH COMMUN, V53, P524, DOI 10.1016/j.specom.2011.01.002 JOURJINE A, 2000, ACOUST SPEECH SIG PR, P2985 Kim SJ, 2007, IEEE J-STSP, V1, P606, DOI 10.1109/JSTSP.2007.910971 Kim S.-J., 2007, L1LS L1 REGULARISED Kowalski M, 2010, IEEE T AUDIO SPEECH, V18, P1818, DOI 10.1109/TASL.2010.2050089 Luo YH, 2006, IEEE T SIGNAL PROCES, V54, P2198, DOI 10.1109/TSP.2006.873367 Mailhe B., 2012, Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2012), DOI 10.1109/ICASSP.2012.6288688 Makino S, 2007, SIGNALS COMMUN TECHN, P1, DOI 10.1007/978-1-4020-6479-1 MALLAT SG, 1993, IEEE T SIGNAL PROCES, V41, P3397, DOI 10.1109/78.258082 Mandel MI, 2010, IEEE T AUDIO SPEECH, V18, P382, DOI 10.1109/TASL.2009.2029711 Mohimani H, 2009, IEEE T SIGNAL PROCES, V57, P289, DOI 10.1109/TSP.2008.2007606 Needell D., 2008, APPL COMPUT HARMON A, V26, P301, DOI DOI 10.1016/J.ACHA.2008.07.002 Nion D, 2008, SIGNAL PROCESS, V88, P749, DOI 10.1016/j.sigpro.2007.07.024 Parra L, 2000, IEEE T SPEECH AUDI P, V8, P320, DOI 10.1109/89.841214 Pati Y. 
C., 1993, P 27 AS C SIGN SYST, V1, P40, DOI DOI 10.1109/ACSSC.1993.342465 Pedersen MS, 2008, IEEE T NEURAL NETWOR, V19, P475, DOI 10.1109/TNN.2007.911740 Peleg T, 2012, IEEE T SIGNAL PROCES, V60, P2286, DOI 10.1109/TSP.2012.2188520 Plumbley M.D., 2012, P ICML WORKSH SPARS Plumbley MD, 2010, P IEEE, V98, P995, DOI 10.1109/JPROC.2009.2030345 Sawada H, 2011, IEEE T AUDIO SPEECH, V19, P516, DOI 10.1109/TASL.2010.2051355 Smaragdis P, 1998, NEUROCOMPUTING, V22, P21, DOI 10.1016/S0925-2312(98)00047-2 Sudhakar P., 2011, THESIS U RENNES 1 FR Tibshirani R, 1996, J ROY STAT SOC B MET, V58, P267 Tichavsky P, 2011, IEEE T SIGNAL PROCES, V59, P1037, DOI 10.1109/TSP.2010.2096221 Tropp JA, 2004, IEEE T INFORM THEORY, V50, P2231, DOI 10.1109/TIT.2004.834793 Vincent E, 2009, LECT NOTES COMPUT SC, V5441, P734, DOI 10.1007/978-3-642-00599-2_92 Vincent E, 2006, IEEE T AUDIO SPEECH, V14, P1462, DOI 10.1109/TSA.2005.858005 Wang W., 2008, P ICARN LIV UK SEP 2, P5 Wang W, 2007, PROC MONOGR ENG WATE, P347 Wang WW, 2005, IEEE T SIGNAL PROCES, V53, P1654, DOI 10.1109/TSP.2005.845433 Wang WW, 2009, IEEE T SIGNAL PROCES, V57, P2858, DOI 10.1109/TSP.2009.2016881 Wang WW, 2008, IEEE IJCNN, P3681 Wang WW, 2008, EURASIP J ADV SIG PR, DOI 10.1155/2008/231367 Xu T, 2010, INT CONF ACOUST SPEE, P2022, DOI 10.1109/ICASSP.2010.5494935 Xu T, 2009, 2009 IEEE/SP 15TH WORKSHOP ON STATISTICAL SIGNAL PROCESSING, VOLS 1 AND 2, P493 Xu T.C., 2011, P IEEE INT C MACH LE, P1 Yilmaz O, 2004, IEEE T SIGNAL PROCES, V52, P1830, DOI [10.1109/TSP.2004.828896, 10.1109/TSP.2004.V8896] Zibulevsky M, 2001, NEURAL COMPUT, V13, P863, DOI 10.1162/089976601300014385 NR 69 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2013 VL 55 IS 3 BP 432 EP 450 DI 10.1016/j.specom.2012.12.003 PG 19 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 115UI UT WOS:000316837000004 ER PT J AU Neiberg, D Salvi, G Gustafson, J AF Neiberg, Daniel Salvi, Giampiero Gustafson, Joakim TI Semi-supervised methods for exploring the acoustics of simple productive feedback SO SPEECH COMMUNICATION LA English DT Article DE Social signal processing; Affective annotation; Feedback modelling; Grounding ID TURN-TAKING; EMOTION; DIALOGUE; EXPRESSION; CUES; CONVERSATION; ORGANIZATION; RESPONSES; AGENTS; MODEL AB This paper proposes methods for exploring acoustic correlates to feedback functions. A sub-language of Swedish, simple productive feedback, is introduced to facilitate investigations of the functional contributions of base tokens, phonological operations and prosody. The function of feedback is to convey the listeners' attention, understanding and affective states. In order to handle the large number of possible affective states, the current study starts by performing a listening experiment where humans annotated the functional similarity of feedback tokens with different prosodic realizations. By selecting a set of stimuli that had different prosodic distances from a reference token, it was possible to compute a generalised functional distance measure. The resulting generalised functional distance measure showed to be correlated to prosodic distance but the correlations varied as a function of base tokens and phonological operations. In a subsequent listening test, a small representative sample of feedback tokens were rated for understanding, agreement, interest, surprise and certainty. 
These ratings were found to explain a significant proportion of the generalised functional distance. By combining the acoustic analysis with an explorative visualisation of the prosody, we have established a map between human perception of similarity between feedback tokens, their measured distance in acoustic space, and the link to the perception of the function of feedback tokens with varying realisations. (c) 2013 Elsevier B.V. All rights reserved. C1 [Neiberg, Daniel; Salvi, Giampiero; Gustafson, Joakim] KTH Royal Inst Technol, Dept Speech Mus & Hearing, S-10044 Stockholm, Sweden. RP Neiberg, D (reprint author), KTH Royal Inst Technol, Dept Speech Mus & Hearing, Lindstedtsv 24, S-10044 Stockholm, Sweden. EM neiberg@speech.kth.se FU Swedish Research Council (VR) project "Introducing interactional phenomena in speech synthesis" [2009-4291]; Swedish Research Council (VR) project "Biologically inspired statistical methods for flexible automatic speech understanding" [2009-4599] FX Funding was provided by the Swedish Research Council (VR) projects "Introducing interactional phenomena in speech synthesis" (2009-4291) and "Biologically inspired statistical methods for flexible automatic speech understanding" (2009-4599). CR Al Moubayed S., 2010, P FONETIK, P11 Allwood J, 2007, LANG RESOUR EVAL, V41, P273, DOI 10.1007/s10579-007-9061-5 Allwood J., 1987, TEMA KOMMUNIKATION, V1, P89 Allwood J., 1992, Journal of Semantics, V9, DOI 10.1093/jos/9.1.1 Audibert N., 2011, COGNITION EMOTION, P37 Banse R, 1996, J PERS SOC PSYCHOL, V70, P614, DOI 10.1037/0022-3514.70.3.614 Baron-Cohen S., 2004, MIND READING INTERAC BARSALOU LW, 1985, J EXP PSYCHOL LEARN, V11, P629, DOI 10.1037/0278-7393.11.1-4.629 Batliner A, 2003, SPEECH COMMUN, V40, P117, DOI 10.1016/S0167-6393(02)00079-1 Bell L., 2000, 6 INT C SPOK LANG PR Benus S., 2007, P 16 INT C PHON SCI, P1065 Bunt H., 2007, P 8 SIGDIAL WORKSH D, P283 Buschmeier H, 2011, P 11 INT C INT VIRT, P169, DOI 10.1007/978-3-642-23974-8_19 Cassell J., 2007, P WORKSH EMB LANG PR, P41, DOI 10.3115/1610065.1610071 Cerrato L., 2006, THESIS KTH ROYAL I T CETIN O, 2006, P ICSLP PITTSB, P293 Chen A, 2004, LANG SPEECH, V47, P311 CLARK HH, 1994, SPEECH COMMUN, V15, P243, DOI 10.1016/0167-6393(94)90075-2 CLARK HH, 1989, COGNITIVE SCI, V13, P259, DOI 10.1207/s15516709cog1302_7 Dietrich S, 2006, PROG BRAIN RES, V156, P295, DOI 10.1016/S0079-6123(06)56016-9 DITTMANN AT, 1968, J PERS SOC PSYCHOL, V9, P79, DOI 10.1037/h0025722 Duncan Jr S., 1972, J PERSONALITY SOCIAL, P23 Duncan S., 1974, LANG SOC, V3, P161, DOI DOI 10.1017/S0047404500004322 Duncan Jr S, 1977, FACE TO FACE INTERAC Edlund J., 2009, NORD PROS P 10 C OCT, P57 Edlund J, 2008, SPEECH COMMUN, V50, P630, DOI 10.1016/j.specom.2008.04.002 Edlund J., 2010, P 7 C INT LANG RES E, P2992 Edlund J., 2005, P INT 2005 LISB PORT, P2389 Ekman P, 1972, NEBRASKA S MOTIVATIO, P207 EKMAN P, 1992, COGNITION EMOTION, V6, P169, DOI 10.1080/02699939208411068 Fries Charles C., 1952, STRUCTURE ENGLISH IN Fujimoto D. 
T., 2007, J OSAKA JOGAKUIN 2 Y, V37, P35 Gardner R, 2001, LISTENERS TALK RESPO Goodwin C., 1981, CONVERSATIONAL ORG I Goudbeek M, 2010, J ACOUST SOC AM, V128, P1322, DOI 10.1121/1.3466853 Gratch J, 2007, LECT NOTES ARTIF INT, V4722, P125 Gravano A, 2012, COMPUT LINGUIST, V38, P1, DOI 10.1162/COLI_a_00083 Gravano A., 2008, P 4 SPEECH PROS C CA Greenberg Joseph H., 1978, WORD STRUCTURE, V3, P297 Gustafson J., 2002, P ISCA WORKSH MULT D Gustafson J., 2010, 5 WORKSH DISFL SPONT Gustafson J, 2008, LECT NOTES ARTIF INT, V5078, P240, DOI 10.1007/978-3-540-69369-7_27 Heldner M., 2011, 12 ANN C INT SPEECH Hirschberg J., 1999, P AUT SPEECH REC UND, P349 Hjalmarsson A., 2008, P SIGDIAL 2008 COL O Hjalmarsson A., 2010, THESIS ROYAL I TECHN House David, 1990, TONAL PERCEPTION SPE KENDON A, 1967, ACTA PSYCHOL, V26, P22, DOI 10.1016/0001-6918(67)90005-4 Kopp S., 2006, ZIF WORKSH, P18 Kopp S, 2010, SPEECH COMMUN, V52, P587, DOI 10.1016/j.specom.2010.02.007 Krahmer E, 2002, SPEECH COMMUN, V36, P133, DOI 10.1016/S0167-6393(01)00030-9 Lai C., 2009, P INT 09 BRIGHT UK Lai C., 2010, P INT 2010 MAK JAP Larsson S., 2002, THESIS GOTEBORG U Laskowski K., 2004, P ISLP 2004 JEJ ISL, P973 Levitt E. A., 1964, COMMUNICATION EMOTIO, P87 Liscombe J., 2003, P EUR 2003 LLOYD SP, 1982, IEEE T INFORM THEORY, V28, P129, DOI 10.1109/TIT.1982.1056489 McGraw KO, 1996, PSYCHOL METHODS, V1, P390, DOI 10.1037//1082-989X.1.4.390 Neiberg D., 2012, INT WORKSH FEEDB BEH Neiberg D., 2010, INTERSPEECH 2010, P2562 Neiberg D., 2011, INTERSPEECH 2011 Neiberg D., 2011, INTERSPEECH 2011, P1581 Neiberg D, 2011, INT CONF ACOUST SPEE, P5836 NILSENOVA MARIE, 2006, THESIS U AMSTERDAM Payr S, 2011, APPL ARTIF INTELL, V25, P441, DOI 10.1080/08839514.2011.586616 Pell MD, 2009, J PHONETICS, V37, P417, DOI 10.1016/j.wocn.2009.07.005 Reese Brian, 2007, THESIS U TEXAS AUSTI Reidsma D, 2011, J MULTIMODAL USER IN, V4, P97, DOI 10.1007/s12193-011-0060-x Russell JA, 2003, PSYCHOL REV, V110, P145, DOI 10.1037/0033-295X.110.1.145 SACKS H, 1974, LANGUAGE, V50, P696, DOI 10.2307/412243 Sauter D., 2010, P NATL ACAD SCI USA, P107 Sauter DA, 2010, Q J EXP PSYCHOL, V63, P2251, DOI 10.1080/17470211003721642 Schegloff EA, 2000, LANG SOC, V29, P1 SCHERER KR, 1986, PSYCHOL BULL, V99, P143, DOI 10.1037//0033-2909.99.2.143 Scherer KR, 2009, COGNITION EMOTION, V23, P1307, DOI 10.1080/02699930902928969 SCHERER KR, 1991, MOTIV EMOTION, V15, P123, DOI 10.1007/BF00995674 Schroder M., 2009, 3 INT C AFF COMP INT, P1 Sigurd B., 1984, SPRAKVARD, P3 Skantze G., 2007, THESIS ROYAL I TECHN Sloman A., 2010, CLOSE ENGAGEMENTS AR SOKAL ROBERT R., 1962, TAXON, V11, P33, DOI 10.2307/1217208 Stocksmeier T., 2007, P INT, P1290 Stolcke A, 2000, COMPUT LINGUIST, V26, P339, DOI 10.1162/089120100561737 Stromqvist S, 1999, J PRAGMATICS, V31, P1245, DOI 10.1016/S0378-2166(98)00104-0 Traum D., 1994, THESIS U ROCHESTER Wallers A., 2006, THESIS KTH STOCKHOLM Ward N., 2006, PRAGMAT COGN, V14, P129, DOI 10.1075/pc.14.1.08war Ward N, 2000, J PRAGMATICS, V32, P1177, DOI 10.1016/S0378-2166(99)00109-5 Ward Nigel G, 2007, Computer Assisted Language Learning, V20, DOI 10.1080/09588220701745825 Yngve Victor, 1970, 6 REG M CHIC LING SO, P567 NR 91 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD MAR PY 2013 VL 55 IS 3 BP 451 EP 469 DI 10.1016/j.specom.2012.12.007 PG 19 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 115UI UT WOS:000316837000005 ER PT J AU Nakagawa, S Iwami, K Fujii, Y Yamamoto, K AF Nakagawa, Seiichi Iwami, Keisuke Fujii, Yasuhisa Yamamoto, Kazumasa TI A robust/fast spoken term detection method based on a syllable n-gram index with a distance metric SO SPEECH COMMUNICATION LA English DT Article DE Spoken term detection; Syllable recognition; N-gram; Distant n-gram; Out-of-Vocabulary; Mis-recognition ID RETRIEVAL; SPEECH AB For spoken document retrieval, it is crucial to consider Out-of-vocabulary (OOV) and the mis-recognition of spoken words. Consequently, sub-word unit based recognition and retrieval methods have been proposed. This paper describes a Japanese spoken term detection method for spoken documents that robustly considers OOV words and mis-recognition. To solve the problem of OOV keywords, we use individual syllables as the sub-word unit in continuous speech recognition. To address OOV words, recognition errors, and high-speed retrieval, we propose a distant n-gram indexing/retrieval method that incorporates a distance metric in a syllable lattice. When applied to syllable sequences, our proposed method outperformed a conventional DTW method between syllable sequences and was about 100 times faster. The retrieval results show that we can detect OOV words in a database containing 44 h of audio in less than 10 m sec per query with an F-measure of 0.54. (c) 2012 Elsevier B.V. All rights reserved. C1 [Nakagawa, Seiichi; Iwami, Keisuke; Fujii, Yasuhisa; Yamamoto, Kazumasa] Toyohashi Univ Technol, Dept Comp Sci & Engn, Toyohashi, Aichi 4418580, Japan. RP Iwami, K (reprint author), Toyohashi Univ Technol, Dept Comp Sci & Engn, 1-1 Hibarigaoka,Tempaku Cho, Toyohashi, Aichi 4418580, Japan. 
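To illustrate the kind of syllable n-gram indexing described in the spoken-term-detection record above, here is a minimal inverted-index sketch. It performs exact n-gram matching only; the distance metric and error-tolerant matching that the paper adds are omitted, and the example syllable strings are made up.

```python
from collections import defaultdict

def build_ngram_index(docs, n=2):
    """Inverted index from syllable n-grams to (doc_id, position) postings.
    `docs` maps a document id to its recognised syllable sequence."""
    index = defaultdict(list)
    for doc_id, syllables in docs.items():
        for i in range(len(syllables) - n + 1):
            index[tuple(syllables[i:i + n])].append((doc_id, i))
    return index

def lookup(index, query_syllables, n=2):
    """Return documents containing every n-gram of the query (exact match only;
    the distance-metric matching of the paper is not modelled here)."""
    hits = None
    for i in range(len(query_syllables) - n + 1):
        docs = {d for d, _ in index.get(tuple(query_syllables[i:i + n]), [])}
        hits = docs if hits is None else hits & docs
    return hits or set()

# Example: index = build_ngram_index({"talk1": ["to", "yo", "ha", "shi"]})
#          lookup(index, ["yo", "ha"])  ->  {"talk1"}
```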
EM nakagawa@slp.cs.tut.ac.jp; iwami@slp.cs.tut.ac.jp; fujii@slp.cs.tut.ac.jp; kyama@slp.cs.tut.ac.jp CR Akbacak M, 2008, INT CONF ACOUST SPEE, P5240, DOI 10.1109/ICASSP.2008.4518841 Akiba T., 2011, 9 NTCIR WORKSH SPOK, P1 Allauzen C., 2004, WORKSH INT APPR SPEE, P33 Can D, 2009, INT CONF ACOUST SPEE, P3957, DOI 10.1109/ICASSP.2009.4960494 Chaudhari UV, 2012, IEEE T AUDIO SPEECH, V20, P1633, DOI 10.1109/TASL.2012.2186805 Chen B., 2000, ICASSP, P2985 Dharanipragada S, 2002, IEEE T SPEECH AUDI P, V10, P542, DOI 10.1109/TSA.2002.804543 Fujii Y., 2011, MUSP, P110 Itoh Y, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 1-2, P677 Iwami K., 2010, SLT, P200 Iwami K, 2011, INT CONF ACOUST SPEE, P5664 Iwami K., 2013, IPSJ, V54 K Ng, 1998, ICSLP, P1088 Kanda N., 2008, MMSP, P939 Katsurada K., 2009, INTERSPEECH, P2147 Larson M., 2003, EUROSPEECH, P1217 Mamou J, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P2106 Mamou Jonathan, 2007, SIGIR 07, P615 Meng H.M., 2000, ICSLP, P101 Natori S, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 1-2, P681 Ng C, 2000, SPEECH COMMUN, V32, P61, DOI 10.1016/S0167-6393(00)00024-8 Nishizaki H., 2002, HLT, P144 Parada C., 2009, ASRU, P404 Parada C, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 1-2, P1269 Saito H., 2012, 6 WORKSH SPOK DOC PR Sakamoto N., 2013, SPRING M ASJ Saraclar M, 2004, HLT-NAACL 2004: HUMAN LANGUAGE TECHNOLOGY CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, PROCEEDINGS OF THE MAIN CONFERENCE, P129 Wang HM, 2000, SPEECH COMMUN, V32, P49, DOI 10.1016/S0167-6393(00)00023-6 Wechsler M., 1998, SIGIR 98, P20 NR 29 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2013 VL 55 IS 3 BP 470 EP 485 DI 10.1016/j.specom.2012.12.001 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 115UI UT WOS:000316837000006 ER PT J AU Winters, S O'Brien, MG AF Winters, Stephen O'Brien, Mary Grantham TI Perceived accentedness and intelligibility: The relative contributions of F0 and duration SO SPEECH COMMUNICATION LA English DT Article DE Foreign accent; Intelligibility; Second language; Perception; Prosody ID GLOBAL FOREIGN ACCENT; IN-NOISE RECOGNITION; VOICE-ONSET-TIME; L2 SPEECH; ENGLISH; PERCEPTION; LISTENERS; PROSODY; EXPERIENCE; SPEAKERS AB The current study sought to determine the relative contributions of suprasegmental and segmental features to the perception of foreign accent and intelligibility in both first language (L1) and second language (L2) German and English speech. Suprasegmental and segmental features were manipulated independently by transferring (1) native intonation contours and/or syllable durations onto non-native segments and (2) non-native intonation contours and/or syllable durations onto native segments in both English and German. These resynthesized stimuli were then presented, in an intelligibility task, to native speakers of German and English who were proficient in both languages. Both of these groups of speakers and monolingual native speakers of English also rated the foreign accentedness of the manipulated stimuli. 
In general, tokens became more accented and less intelligible, the more they were manipulated. Tokens were also less accented and more intelligible when produced by speakers of (and in) the listeners' L1. Nonetheless, in certain L2 productions, there was both a reduction in perceived accentedness and decreased intelligibility for tokens in which native prosody was applied to non-native segments, indicating a disconnect between the perceptual processing of intelligibility and accent. (c) 2012 Elsevier B.V. All rights reserved. C1 [Winters, Stephen] Univ Calgary, Dept Linguist, Calgary, AB T2N 1N4, Canada. [O'Brien, Mary Grantham] Univ Calgary, Dept German Slav & East Asian Studies, Calgary, AB T2N 1N4, Canada. RP O'Brien, MG (reprint author), Univ Calgary, Dept German Slav & East Asian Studies, Craigie Hall C208,2500 Univ Dr NW, Calgary, AB T2N 1N4, Canada. EM swinters@ucalgary.ca; mgobrien@ucal-gary.ca FU German Academic Exchange Service (DAAD) FX Both authors contributed equally to this project. We would like to thank Kelly-Ann Casey, Roswita Dressler and Tara Dainton for providing paid research assistance and the members of the audience of the Germanic Linguistics Annual Conference 2009 for their feedback on the pilot study data. Any errors that remain are our own. This work was supported by a grant from the German Academic Exchange Service (DAAD). CR ANDERSONHSIEH J, 1992, LANG LEARN, V42, P529, DOI 10.1111/j.1467-1770.1992.tb01043.x Atterer M, 2004, J PHONETICS, V32, P177, DOI 10.1016/S0095-4470(03)00039-1 Backman N., 1979, INTERLANGUAGE STUDIE, V4, P239 Baker RE, 2011, J PHONETICS, V39, P1, DOI 10.1016/j.wocn.2010.10.006 Baker W, 2008, LANG SPEECH, V51, P317, DOI 10.1177/0023830908099068 Bannert R., 1995, PHONUM, V3, P7 Baumann S., 2000, LINGUISTISCHE BERICH, V181, P1 Baumann Stefan, 2006, METHODS EMPIRICAL PR, P153 Beckman M. E., 1994, TOBI ANNOTATION CONV Bel B., 2004, P SPEECH PROS 2004 N, P721 Bent T, 2003, J ACOUST SOC AM, V114, P1600, DOI 10.1121/1.1603234 Boersma P., 2007, PRAAT DOING PHONETIC Bolinger D., 1998, INTONATION SYSTEMS S, P45 Boula de Mareuil P., 2004, P SPEECH PROS 2004 N, P681 Bradlow AR, 2007, J ACOUST SOC AM, V121, P2339, DOI 10.1121/1.2642103 DELATTRE PC, 1955, J ACOUST SOC AM, V27, P769, DOI 10.1121/1.1908024 de Mareuil PB, 2006, PHONETICA, V63, P247, DOI 10.1159/000097308 Derwing T. 
M., 1997, STUDIES 2 LANGUAGE A, V19, P1, DOI [DOI 10.1017/S0272263197001010, 10.1017/S0272263197001010] Derwing T.M., 2005, TESOL Q, V39, P379 Eckert H, 1994, MENSCHEN IHRE STIMME Escudero P, 2009, J PHONETICS, V37, P452, DOI 10.1016/j.wocn.2009.07.006 ESSER J, 1978, PHONETICA, V35, P41 Fery C, 1993, GERMAN INTONATIONAL FLEGE JE, 1984, J ACOUST SOC AM, V76, P692, DOI 10.1121/1.391256 Flege JE, 1997, J PHONETICS, V25, P437, DOI 10.1006/jpho.1997.0052 Fournier R, 2006, J PHONETICS, V34, P29, DOI 10.1016/j.wocn.2005.03.002 Fox A., 1984, GERMAN INTONATION Fries Charles C., 1964, HONOUR D JONES, P242 Gibbon D, 1998, INTONATION SYSTEMS S, P78 Goethe-Institut, 2004, EINST Gonzalez-Bueno M, 1997, IRAL-INT REV APPL LI, V35, P251, DOI 10.1515/iral.1997.35.4.251 GRABE E, 1998, COMP INTONATIONAL PH Grosser W., 1997, 2 LANGUAGE SPEECH ST, P211 Gulikers L., 1995, LEON CELEX LEXICAL D Gut U., 2009, NONNATIVE SPEECH COR Gut U., 2003, FREMDSPRACHEN LEHREN, V32, P133 Hahn LD, 2004, TESOL QUART, V38, P201 Heinrich A, 2010, SPEECH COMMUN, V52, P1038, DOI 10.1016/j.specom.2010.09.009 Hoehle B., 2009, LINGUISTICS, V47, P359 Holm S., 2008, THESIS NORWEGIAN U S James A., 2000, NEW SOUNDS 2000, P1 Jilka M., 2007, NONNATIVE PROSODY PH, P77 Jilka M., 2000, THESIS U STUTTGARD S Kennedy S, 2008, CAN MOD LANG REV, V64, P459, DOI 10.3138/cmlr.64.3.459 Kohler K. J., 2004, TRADITIONAL PHONOLOG, P205 Ladefoged Peter, 2011, COURSE PHONETICS, V6th Lai YH, 2009, LANG COGNITIVE PROC, V24, P1265, DOI 10.1080/01690960802113850 Magen HS, 1998, J PHONETICS, V26, P381, DOI 10.1006/jpho.1998.0081 Major R. C., 1987, STUDIES 2ND LANGUAGE, V9, P63, DOI 10.1017/S0272263100006513 MAJOR RC, 1986, SECOND LANG RES, V2, P53, DOI 10.1177/026765838600200104 Major RC, 2007, STUD SECOND LANG ACQ, V29, P539, DOI 10.1017/S0272263107070428 Mennen I, 2004, J PHONETICS, V32, P543, DOI 10.1016/j.wocn.2004.02.002 Mennen I., 2007, NONNATIVE PROSODY PH, P53 MILLER GA, 1955, J ACOUST SOC AM, V27, P338, DOI 10.1121/1.1907526 Moyer A, 1999, STUDIES 2 LANGUAGE A, V21, P81, DOI DOI 10.1017/50272263199001035 Munro M. J., 1995, STUDIES 2 LANGUAGE A, V17, P17, DOI 10.1017/S0272263100013735 Munro M. 
J., 2001, STUDIES 2 LANGUAGE A, V23, P451 Munro MJ, 1995, LANG SPEECH, V38, P289 Munro MJ, 2006, STUD SECOND LANG ACQ, V28, P111, DOI 10.1017/S0272263106060049 Munro MJ, 2010, SPEECH COMMUN, V52, P626, DOI 10.1016/j.specom.2010.02.013 Munro Murray J., 2008, PHONOLOGY 2 LANGUAGE, P193 Ng ML, 2008, INT J SPEECH-LANG PA, V10, P404, DOI 10.1080/17549500802399007 O'Brien M.G., 2011, ACHIEVEMENTS PERSPEC, P205 Oxford University Press, 2009, OXFORD ONLINE PLACEM Pennington MC, 2000, MOD LANG J, V84, P372, DOI 10.1111/0026-7902.00075 Pfitzinger H.P., 2010, P 5 INT C SPEECH PRO, P1 Pinet M, 2010, J ACOUST SOC AM, V128, P1357, DOI 10.1121/1.3466857 Ramirez Verdugo D., 2002, ICAME J, V26, P115 Riney TJ, 1999, LANG LEARN, V49, P275, DOI 10.1111/0023-8333.00089 Riney TJ, 2005, TESOL QUART, V39, P441 Riney TJ, 2000, TESOL QUART, V34, P711, DOI 10.2307/3587782 Shah A.P., 2003, THESIS CITY U NEW YO Sidaras SK, 2009, J ACOUST SOC AM, V125, P3306, DOI 10.1121/1.3101452 Spitzer SM, 2007, J ACOUST SOC AM, V122, P3678, DOI 10.1121/1.2801545 Stibbard RM, 2006, J ACOUST SOC AM, V120, P433, DOI 10.1121/1.2203595 SYRDAL A, 1998, ACOUST SPEECH SIG PR, P273 Tajima K, 1997, J PHONETICS, V25, P1, DOI 10.1006/jpho.1996.0031 Trofimovich P, 2006, STUD SECOND LANG ACQ, V28, P1, DOI 10.1017/S0272263106060013 van Leyden K, 2006, PHONETICA, V63, P149, DOI 10.1159/000095306 Wegener H., 1998, 2 SPRACHE LERNEN EMP, P21 Winters S., 2011, 162 ANN M AC SOC AM NR 81 TC 1 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2013 VL 55 IS 3 BP 486 EP 507 DI 10.1016/j.specom.2012.12.006 PG 22 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 115UI UT WOS:000316837000007 ER PT J AU Veisi, H Sameti, H AF Veisi, Hadi Sameti, Hossein TI Speech enhancement using hidden Markov models in Mel-frequency domain SO SPEECH COMMUNICATION LA English DT Article DE HMM-based speech enhancement; Mel-frequency; Parallel cepstral and spectral (PCS) ID PARAMETER GENERATION; NOISY SPEECH; RECOGNITION AB Hidden Markov model (HMM)-based minimum mean square error speech enhancement method in Mel-frequency domain is focused on and a parallel cepstral and spectral (PCS) modeling is proposed. Both Mel-frequency spectral (MFS) and Mel-frequency cepstral (MFC) features are studied and experimented for speech enhancement. To estimate clean speech waveform from a noisy signal, an inversion from the Mel-frequency domain to the spectral domain is required which introduces distortion artifacts in the spectrum estimation and the filtering. To reduce the corrupting effects of the inversion, the PCS modeling is proposed. This method performs concurrent modeling in both cepstral and magnitude spectral domains. In addition to the spectrum estimator, magnitude spectrum, log-magnitude spectrum and power spectrum estimators are also studied and evaluated in the HMM-based speech enhancement framework. The performances of the proposed methods are evaluated in the presence of five noise types with different SNR levels and the results are compared with several established speech enhancement methods especially auto-regressive HMM-based speech enhancement. The experimental results for both subjective and objective tests confirm the superiority of the proposed methods in the Mel-frequency domain over the reference methods, particularly for non-stationary noises. (C) 2012 Elsevier B.V. All rights reserved. 
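The speech-enhancement record above hinges on mapping a Mel-domain estimate back to the linear spectrum, a step it notes introduces distortion. The sketch below illustrates that round trip with the pseudo-inverse of a Mel filterbank matrix; the filterbank is assumed given (any triangular Mel filterbank of shape n_filters x n_bins will do, such as the one sketched earlier in this document), and this is not the authors' PCS model.

```python
import numpy as np

def mel_and_back(power_spectrum, mel_fb):
    """Project one frame's power spectrum (length n_bins) onto the Mel filterbank
    and map it back with the Moore-Penrose pseudo-inverse.  The round trip is
    lossy, which is the inversion distortion the record above aims to reduce."""
    mel_spec = mel_fb @ power_spectrum          # forward: linear -> Mel domain
    inv = np.linalg.pinv(mel_fb)                # approximate inverse mapping
    return np.maximum(inv @ mel_spec, 0.0)      # back-projection, floored at zero
```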
C1 [Veisi, Hadi; Sameti, Hossein] Sharif Univ Technol, Dept Comp Engn, Tehran, Iran. RP Veisi, H (reprint author), Sharif Univ Technol, Dept Comp Engn, Tehran, Iran. EM veisi@ce.sharif.edu; sameti@sharif.edu FU Iranian Telecommunication Research Center (ITRC) FX This research was partially supported by the Iranian Telecommunication Research Center (ITRC). CR Arakawa T., 2006, P IEEE INT C AC SPEE, P1 Berouti M., 1979, P IEEE INT C AC SPEE, P208 Chen B, 2007, SPEECH COMMUN, V49, P134, DOI 10.1016/j.specom.2006.12.005 EPHRAIM Y, 1992, IEEE T SIGNAL PROCES, V40, P1303, DOI 10.1109/78.139237 EPHRAIM Y, 1989, IEEE T ACOUST SPEECH, V37, P1846, DOI 10.1109/29.45532 EPHRAIM Y, 1992, IEEE T SIGNAL PROCES, V40, P725, DOI 10.1109/78.127947 EPHRAIM Y, 1984, P IEEE T SPEECH AUD, V32, P1109 EPHRAIM Y, 1992, P IEEE, V80, P1526, DOI 10.1109/5.168664 EPHRAIM Y, 1985, P IEEE INT C AC SPEE, V33, P443 Gales M.J.F., 1995, THESIS U CAMBRIDGE S GRIFFIN DW, 1984, IEEE T ACOUST SPEECH, V32, P236, DOI 10.1109/TASSP.1984.1164317 Hu Y, 2004, IEEE T SPEECH AUDI P, V12, P59, DOI 10.1109/TSA.2003.819949 Imai S., 1983, Proceedings of ICASSP 83. IEEE International Conference on Acoustics, Speech and Signal Processing LIM JS, 1978, IEEE T ACOUST SPEECH, V26, P197 Logan B.T., 1998, THESIS CAMBRIDGE U Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst Martin R, 2005, IEEE T SPEECH AUDI P, V13, P845, DOI 10.1109/TSA.2005.851927 Perceptual Evaluation of Speech Quality (PESQ), 2001, OBJECTIVE METHOD END, P862 Porter J., 1984, P IEEE INT C AC SPEE, P53 RABINER LR, 1989, P IEEE, V77, P257, DOI 10.1109/5.18626 Sameti H., 1994, THESIS U WATERLOO WA Sameti H, 1998, IEEE T SPEECH AUDI P, V6, P445, DOI 10.1109/89.709670 Sasou A, 2006, SPEECH COMMUN, V48, P1100, DOI 10.1016/j.specom.2006.03.002 Segura J. C., 2001, P EUROSPEECH2001, P221 Srinivasan S, 2007, IEEE T AUDIO SPEECH, V15, P441, DOI 10.1109/TASL.2006.881696 Stouten V, 2006, SPEECH COMMUN, V48, P1502, DOI 10.1016/j.specom.2005.12.006 TOKUDA K, 1995, INT CONF ACOUST SPEE, P660, DOI 10.1109/ICASSP.1995.479684 Tokuda K, 2000, INT CONF ACOUST SPEE, P1315, DOI 10.1109/ICASSP.2000.861820 Veisi H, 2011, DIGIT SIGNAL PROCESS, V21, P36, DOI 10.1016/j.dsp.2010.07.004 Wolfe PJ, 2003, EURASIP J APPL SIG P, V2003, P1043, DOI 10.1155/S1110865703304111 You C.H., 2003, P IEEE INT C AC SPEE, P852 Yu D, 2008, IEEE T AUDIO SPEECH, V16, P1061, DOI 10.1109/TASL.2008.921761 Zhao DY, 2007, IEEE T AUDIO SPEECH, V15, P882, DOI 10.1109/TASL.2006.885256 NR 33 TC 6 Z9 7 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2013 VL 55 IS 2 BP 205 EP 220 DI 10.1016/j.specom.2012.08.005 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 056SO UT WOS:000312511900001 ER PT J AU Bolanos, D Cole, RA Ward, WH Tindal, GA Schwanenflugel, PJ Kuhn, MR AF Bolanos, Daniel Cole, Ronald A. Ward, Wayne H. Tindal, Gerald A. Schwanenflugel, Paula J. Kuhn, Melanie R. TI Automatic assessment of expressive oral reading SO SPEECH COMMUNICATION LA English DT Article DE Oral reading fluency; Expressive reading; Prosody; Children's read speech; Education ID YOUNG READERS; FLUENCY; PROSODY; CLASSIFICATION; AGREEMENT; CHILDREN; THINGS AB We investigated the automatic assessment of expressive children's oral reading of grade level text passages using a standardized rubric. 
After a careful review of the reading literature and a close examination of the rubric, we designed a novel set of prosodic and lexical features to characterize fluent expressive reading. A number of complementary sources of information were used to design the features, each of them motivated by research on different components of reading fluency. Features are connected to the child's reading rate, to the presence and number of pauses, filled-pauses and word-repetitions, the correlation between punctuation marks and pauses, the length of word groupings, syllable stress and duration and the location of pitch peaks and contours. The proposed features were evaluated on a corpus of 783 one-minute reading sessions from 313 students reading grade-leveled passages without assistance (cold unassisted reading). Experimental results show that the proposed lexical and prosodic features provide complementary information and are able to capture the characteristics of expressive reading. The results showed that on both the 2-point and the 4-point expressiveness scales, computer-generated ratings of expressiveness agreed with human raters better than the human raters agreed with each other. The results of the study suggest that automatic assessment of expressive oral reading can be combined with automatic measures of word accuracy and reading rate to produce an accurate multidimensional estimate of children's oral reading ability. (C) 2012 Elsevier B.V. All rights reserved. C1 [Bolanos, Daniel; Cole, Ronald A.; Ward, Wayne H.] Boulder Language Technol, Boulder, CO 80301 USA. [Ward, Wayne H.] Univ Colorado, Boulder, CO 80309 USA. [Tindal, Gerald A.] Univ Oregon, Eugene, OR 97403 USA. [Schwanenflugel, Paula J.] Univ Georgia, Athens, GA 30602 USA. [Kuhn, Melanie R.] Boston Univ, Boston, MA 02215 USA. RP Bolanos, D (reprint author), Boulder Language Technol, 2960 Ctr Green Court,Suite 200, Boulder, CO 80301 USA. EM dani@bltek.com; rcole@bltek.com; wward@bltek.com; geraldt@uoregon.edu; pschwan@uga.edu; melaniek@bu.edu FU U.S. Department of Education [R305B070434]; National Science Foundation [0733323]; NIH [R43 DC009926-01] FX We gratefully acknowledge the help of Angel Stobaugh, Director of Literacy Education at Boulder Valley School District, and the principals and teachers who allowed us to visit their schools and classrooms. We appreciate the amazing efforts of Jennifer Borum, who organized the FLORA data collection effort, and the efforts of Linda Hill, Suzan Heglin and the rest of the human experts who scored the text passages. This work was supported by U.S. Department of Education award number R305B070434, National Science Foundation award number 0733323 and NIH award number R43 DC009926-01. CR Alonzo J., 2006, EASYCBM ONLINE PROGR Altman D, 1991, PRACTICAL STAT MED R Benjamin RG, 2010, READ RES QUART, V45, P388, DOI 10.1598/RRQ.45.4.2 Bolanos D., 2011, ACM T SPEECH LANGUAG, V7 Bolanos D., 2012, IEEE SPOK LANG TECHN Brenier J.M., 2005, P EUR 9 EUR C SPEECH, P3297 Chafe W., 1988, COMMUNICATION, P396 Chang C.-C., 2001, LIBSVM LIB SUPPORT V Chang Y., 2008, JMLR WORKSH C P WCCI, V3 Chard DJ, 2002, J LEARN DISABIL-US, V35, P386, DOI 10.1177/00222194020350050101 Clay M., 1971, J VERB LEARN VERB BE, P133 COHEN J, 1968, PSYCHOL BULL, V70, P213, DOI 10.1037/h0026256 COHEN J, 1960, EDUC PSYCHOL MEAS, V20, P37, DOI 10.1177/001316446002000104 Cole R., 2006, TRCSLR200602 U COL Cole R., 2006, TRCSLR200603 U COL Cowie R, 2002, LANG SPEECH, V45, P47 Daane M. 
C., 2005, 2006469 NCES US DEP Duong M, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 1-2, P769 Duong M., 2011, ACM T SPEECH LANGUAG, V7 Fisher B., 1996, TSYLB2 1 1 SYLLABIFI Fuchs L. S., 2001, SCI STUD READ, V5, P239, DOI DOI 10.1207/S1532799XSSR0503_3 Good R. H., 2002, DYNAMIC INDICATORS B, V6th Good R. H., 2001, SCI STUD READ, V5, P257, DOI DOI 10.1207/S1532799XSSR0503_ Good R. H., 2007, DYNAMIC INDICATORS B Guyon I, 2002, MACH LEARN, V46, P389, DOI 10.1023/A:1012487302797 Hasbrouck J, 2006, READ TEACH, V59, P636, DOI 10.1598/RT.59.7.3 Kane M., 2006, ED MEASUREMENT, V4th, P17 Kuhn M., 2005, READING PSYCHOL, V26, P127, DOI [10.1080/02702710590930492, DOI 10.1080/02702710590930492] Kuhn M., 2000, TECHNICAL REPORT Kuhn M. R., 2010, READING RES Q, V45, P230, DOI DOI 10.1598/RRQ.45.2.4 Liu Y, 2006, IEEE T AUDIO SPEECH, V14, P1526, DOI 10.1109/TASL.2006.878255 LOGAN GD, 1988, PSYCHOL REV, V95, P492, DOI 10.1037//0033-295X.95.4.492 Miller J, 2006, J EDUC PSYCHOL, V98, P839, DOI 10.1037/0022-0663.98.4.839 Miller J, 2008, READ RES QUART, V43, P336, DOI 10.1598/RRQ.43.4.2 Mostow J, 2009, FR ART INT, V200, P189, DOI 10.3233/978-1-60750-028-5-189 National Reading Panel N., 2000, TECHNICAL REPORT Patel R, 2011, SPEECH COMMUN, V53, P431, DOI 10.1016/j.specom.2010.11.007 Pinnell G., 1995, 1995726 NCES US DEP Platt J., 1999, ADV LARGE MARGIN CLA, V10, P61 Platt JC, 2000, ADV NEUR IN, V12, P547 Rasinski T., 2004, ASSESSING READING FL Rasinski T. V., 2009, LITERACY RES INSTRUC, V48, P350, DOI DOI 10.1080/19388070802468715 Rasinski T. V., 1991, THEOR PRACT, V30, P211, DOI DOI 10.1080/00405849109543502 Rosenfeld R., 1994, CMU STAT LANGUAGE MO Schwanenflugel P., 2011, COMMUNICATION Schwanenflugel P., 2012, FLUENCY INS IN PRESS Schwanenflugel PJ, 2004, J EDUC PSYCHOL, V96, P119, DOI 10.1037/0022-0663.96.1.119 Shinn, 1998, ADV APPL CURRICULUM Shinn M. R., 2002, AIMSWEB TRAINING WOR Shobaki K., 2000, P ICSLP 2000 BEIJ CH Sjolander K., 1997, TECHNICAL REPORT Snow C. E., 1998, PREVENTING READING D Spearman C, 1904, AM J PSYCHOL, V15, P72, DOI 10.2307/1412159 Steidl S, 2005, INT CONF ACOUST SPEE, P317 Valencia SW, 2010, READ RES QUART, V45, P270, DOI 10.1598/RRQ.45.3.1 Vapnik V., 1995, NATURE STAT LEARNING Vicsi K, 2010, SPEECH COMMUN, V52, P413, DOI 10.1016/j.specom.2010.01.003 Wayman MM, 2007, J SPEC EDUC, V41, P85 NR 58 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2013 VL 55 IS 2 BP 221 EP 236 DI 10.1016/j.specom.2012.08.002 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 056SO UT WOS:000312511900002 ER PT J AU Alam, MJ Kinnunen, T Kenny, P Ouellet, P O'Shaughnessy, D AF Alam, Md Jahangir Kinnunen, Tomi Kenny, Patrick Ouellet, Pierre O'Shaughnessy, Douglas TI Multitaper MFCC and PLP features for speaker verification using i-vectors SO SPEECH COMMUNICATION LA English DT Article DE Speaker verification; Multi-taper spectrum; Feature extraction; i-Vectors; MFCC; PLP ID SPECTRAL ESTIMATION; HARMONIC-ANALYSIS; VARIANCE; RECOGNITION; WINDOWS; SPEECH; SPHERE AB In this paper we study the performance of the low-variance multi-taper Mel-frequency cepstral coefficient (MFCC) and perceptual linear prediction (PLP) features in a state-of-the-art i-vector speaker verification system. The MFCC and PLP features are usually computed from a Hamming-windowed periodogram spectrum estimate. 
Such a single-tapered spectrum estimate has large variance, which can be reduced by averaging spectral estimates obtained using a set of different tapers, leading to a so-called multi-taper spectral estimate. The multi-taper spectrum estimation method has proven to be powerful especially when the spectrum of interest has a large dynamic range or varies rapidly. Multi-taper MFCC features were also recently studied in speaker verification with promising preliminary results. In this study our primary goal is to validate those findings using an up-to-date i-vector classifier on the latest NIST 2010 SRE data. In addition, we also propose to compute robust perceptual linear prediction (PLP) features using multitapers. Furthermore, we provide a detailed comparison between different taper weight selections in the Thomson multi-taper method in the context of speaker verification. Speaker verification results on the telephone (det5) and microphone speech (det1, det2, det3 and det4) of the latest NIST 2010 SRE corpus indicate that the multi-taper methods outperform the conventional periodogram technique. Instead of simply averaging (using uniform weights) the individual spectral estimates in forming the multi-taper estimate, weighted averaging (using non-uniform weights) improves performance. Compared to the MFCC and PLP baseline systems, the sine-weighted cepstrum estimator (SWCE) based multi-taper method provides average relative reductions of 12.3% and 7.5% in equal error rate, respectively. For the multi-peak multi-taper method, the corresponding reductions are 12.6% and 11.6%, respectively. Finally, the Thomson multi-taper method provides error reductions of 9.5% and 5.0% in EER for MFCC and PLP features, respectively. We conclude that both the MFCC and PLP features computed via multitapers provide systematic improvements in recognition accuracy. (C) 2012 Elsevier B.V. All rights reserved. C1 [Alam, Md Jahangir] Univ Quebec, INRS EMT, Montreal, PQ H5A 1K6, Canada. [Alam, Md Jahangir; Kenny, Patrick; Ouellet, Pierre] CRIM, Montreal, PQ, Canada. [Kinnunen, Tomi] Univ Eastern Finland, Sch Comp, Joensuu, Finland. RP Alam, MJ (reprint author), Univ Quebec, INRS EMT, 800 La Gauchetiere W,Suite 6900, Montreal, PQ H5A 1K6, Canada. EM alam@emt.inrs.ca; tkinnu@cs.joensuu.fi; Patrick.Kenny@crim.ca; Pierre.Ouellet@crim.ca; dougo@emt.inrs.ca FU Academy of Finland [132129] FX The work of T. Kinnunen was supported by the Academy of Finland (Project No. 132129). We would like to thank the anonymous reviewers for their comments that helped to improve the content of this paper. CR Alam J., 2011, LECT NOTES ARTIF INT, V7015, P239 Alam M.J., 2011, P IEEE AUT SPEECH RE, P547 Brummer N, 2010, P OD SPEAK LANG REC DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 Dehak N, 2011, IEEE T AUDIO SPEECH, V19, P788, DOI 10.1109/TASL.2010.2064307 Djuric P., 1999, DIGITAL SIGNAL PROCE Garcia-Romero D., 2011, P INTERSPEECH, P249 Gold B., 2000, SPEECH AUDIO SIGNAL Hansson M, 1997, IEEE T SIGNAL PROCES, V45, P778, DOI 10.1109/78.558503 Hansson-Sandsten M, 2009, INT CONF ACOUST SPEE, P3077, DOI 10.1109/ICASSP.2009.4960274 HARRIS FJ, 1978, P IEEE, V66, P51, DOI 10.1109/PROC.1978.10837 HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423 Honig Florian, 2005, P INTERSPEECH, P2997 Hu Y, 2004, IEEE T SPEECH AUDI P, V12, P59, DOI 10.1109/TSA.2003.819949 Kay S. 
M., 1988, MODERN SPECTRAL ESTI Kenny P, 2007, IEEE T AUDIO SPEECH, V15, P1448, DOI 10.1109/TASL.2007.894527 Kenny P, 2010, P OD SPEAK LANG REC Kenny P, 2007, IEEE T AUDIO SPEECH, V15, P1435, DOI 10.1109/TASL.2006.881693 Kinnunen T, 2012, IEEE T AUDIO SPEECH, V20, P1990, DOI 10.1109/TASL.2012.2191960 Kinnunen T., 2010, P INTERSPEECH, P2734 Matejka Pavel, 2006, P IEEE OD 2006 SPEAK, P57 McCoy EJ, 1998, IEEE T SIGNAL PROCES, V46, P655, DOI 10.1109/78.661333 National Institute of Standards and Technology, NIST SPEAK REC EV Pelecanos J., 2001, P SPEAK OD SPEAK REC, P213 Percival D.B., 1993, SPECTRAL ANAL PHYS A Reynolds DA, 2000, DIGIT SIGNAL PROCESS, V10, P19, DOI 10.1006/dspr.1999.0361 Reynolds D.A., 2008, UNIVERSAL BACKGROUND RIEDEL KS, 1995, IEEE T SIGNAL PROCES, V43, P188, DOI 10.1109/78.365298 Sandberg J, 2010, IEEE SIGNAL PROC LET, V17, P343, DOI 10.1109/LSP.2010.2040228 Senoussaoui M., 2011, P INTERSPEECH, P25 Senoussaoui M., 2010, P OD SPEAK LANG REC SLEPIAN D, 1961, AT&T TECH J, V40, P43 THOMSON DJ, 1990, PHILOS T ROY SOC A, V332, P539, DOI 10.1098/rsta.1990.0130 THOMSON DJ, 1982, P IEEE, V70, P1055, DOI 10.1109/PROC.1982.12433 WALDEN AT, 1994, IEEE T SIGNAL PROCES, V42, P479, DOI 10.1109/78.275635 Wieczorek MA, 2007, J FOURIER ANAL APPL, V13, P665, DOI 10.1007/s00041-006-6904-1 Wieczorek MA, 2005, GEOPHYS J INT, V162, P655, DOI 10.1111/j-1365-246X.2005.02687.x XIANG B, 2002, ACOUST SPEECH SIG PR, P681 Young S., 2006, HTK BOOK NR 39 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2013 VL 55 IS 2 BP 237 EP 251 DI 10.1016/j.specom.2012.08.007 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 056SO UT WOS:000312511900003 ER PT J AU Wollmer, M Schuller, B Rigoll, G AF Woellmer, Martin Schuller, Bjoern Rigoll, Gerhard TI Keyword spotting exploiting Long Short-Term Memory SO SPEECH COMMUNICATION LA English DT Article DE Keyword spotting; Long Short-Term Memory; Recurrent neural networks; Dynamic Bayesian Networks ID BIDIRECTIONAL LSTM NETWORKS; RECURRENT NEURAL-NETWORKS; SPEECH RECOGNITION; ARCHITECTURES; MODELS AB We investigate various techniques for keyword spotting which are exclusively based on acoustic modeling and do not presume the existence of an in-domain language model. Since adequate context modeling is nevertheless necessary for word spotting, we show how the principle of Long Short-Term Memory (LSTM) can be incorporated into the decoding process. We propose a novel technique that exploits LSTM in combination with Connectionist Temporal Classification in order to improve performance by using a self-learned amount of contextual information. All considered approaches are evaluated on read speech as contained in the TIMIT corpus as well as on the SEMAINE database which consists of spontaneous and emotionally colored speech. As further evidence for the effectiveness of LSTM modeling for keyword spotting, results on the CHiME task are shown. (C) 2012 Elsevier B.V. All rights reserved. C1 [Woellmer, Martin; Schuller, Bjoern; Rigoll, Gerhard] Tech Univ Munich, Inst Human Machine Commun, D-80333 Munich, Germany. RP Wollmer, M (reprint author), Tech Univ Munich, Inst Human Machine Commun, Theresienstr 90, D-80333 Munich, Germany. 
EM woellmer@tum.de FU Federal Republic of Germany through the German Research Foundation (DFG) [SCHU2508/4-1] FX The research leading to these results has received funding from the Federal Republic of Germany through the German Research Foundation (DFG) under Grant No. SCHU2508/4-1. CR Benayed Y., 2003, P ICASSP, P588 BILMES J, 2002, ACOUST SPEECH SIG PR, P3916 Bilmes J. A., 2003, MATH FDN SPEECH LANG, P191 Bilmes JA, 2005, IEEE SIGNAL PROC MAG, V22, P89 Charles F, 2007, LECT NOTES COMPUT SC, V4871, P210 Chen B., 2004, P INTERSPEECH Christensen H., 2010, P INTERSPEECH, P1918 Dekel O, 2004, WORKSH MULT INT REL, P146 Fernandez S, 2007, LECT NOTES COMPUT SC, V4669, P220 Gemmeke JF, 2011, IEEE T AUDIO SPEECH, V19, P2067, DOI 10.1109/TASL.2011.2112350 Gers FA, 2000, NEURAL COMPUT, V12, P2451, DOI 10.1162/089976600300015015 Graves A, 2005, NEURAL NETWORKS, V18, P602, DOI 10.1016/j.neunet.2005.06.042 Graves A., 2006, P 23 INT C MACH LEAR, P369, DOI DOI 10.1145/1143844.1143891 Graves A., 2008, THESIS TU MUNCHEN Graves A., 2008, ADV NEURAL INFORM PR, V20, P1 Grezl F, 2008, INT CONF ACOUST SPEE, P4729, DOI 10.1109/ICASSP.2008.4518713 Hermansky H., 2008, P EUR C SPEECH COMM, P361 Heylen D., 2008, P 4 INT WORKSH HUM C, P1 Hochreiter S, 1997, NEURAL COMPUT, V9, P1735, DOI 10.1162/neco.1997.9.8.1735 Hochreiter S., 2001, FIELD GUIDE DYNAMICA, P1 Jensen F. V., 1996, INTRO BAYESIAN NETWO Keshet J., 2007, THESIS HEBREW U Keshet J, 2009, SPEECH COMMUN, V51, P317, DOI 10.1016/j.specom.2008.10.002 Ketabdar H., 2006, IDAIP RR, P1 McKee GJ, 2010, AGR ISSUES POLICIES, P1 Nijholt A, 2000, STUD FUZZ SOFT COMP, V45, P148 Parveen S., 2004, P ICASSP Principi E., 2009, P HSI CAT IT, P216 Rigoll G., 1994, IEEE T AUDIO SPEECH, V2 ROSE RC, 1995, COMPUT SPEECH LANG, V9, P309, DOI 10.1006/csla.1995.0015 Rosevear R. D., 1990, Power Technology International Schroder M, 2012, IEEE T AFFECT COMPUT, V3, P165, DOI 10.1109/T-AFFC.2011.34 Schuller B., 2011, P 1 INT AUD VIS EM C, P415 Schuster M, 1997, IEEE T SIGNAL PROCES, V45, P2673, DOI 10.1109/78.650093 Szoke Igor, 2010, Proceedings 2010 IEEE Spoken Language Technology Workshop (SLT 2010), DOI 10.1109/SLT.2010.5700849 Trentin E, 2001, NEUROCOMPUTING, V37, P91, DOI 10.1016/S0925-2312(00)00308-8 Vergyri D., 2007, P INTERSPEECH Wang H.C., 1997, P ROCLING 10 INT C, P325 Weninger F., 2012, Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2012), DOI 10.1109/ICASSP.2012.6288963 Weninger F., 2011, P CHIME WORKSH FLOR, P24 Wik P, 2009, SPEECH COMMUN, V51, P1024, DOI 10.1016/j.specom.2009.05.006 Wollmer M, 2009, INT CONF ACOUST SPEE, P3949, DOI 10.1109/ICASSP.2009.4960492 Wollmer M, 2009, NEUROCOMPUTING, V73, P366, DOI 10.1016/j.neucom.2009.08.005 Wollmer M, 2010, COGN COMPUT, V2, P180, DOI 10.1007/s12559-010-9041-8 Wollmer M, 2011, IEEE T INTELL TRANSP, V12, P574, DOI 10.1109/TITS.2011.2119483 Wollmer Martin, 2009, Proceedings of the 2009 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU 2009), DOI 10.1109/ASRU.2009.5373544 Wollmer M, 2011, INT CONF ACOUST SPEE, P4860 Wollmer M, 2010, IEEE J-STSP, V4, P867, DOI 10.1109/JSTSP.2010.2057200 Wollmer M, 2010, INT CONF ACOUST SPEE, P5274, DOI 10.1109/ICASSP.2010.5494980 Young S., 2006, HTK BOOK V3 4 Zhu QF, 2005, LECT NOTES COMPUT SC, V3361, P223 NR 51 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD FEB PY 2013 VL 55 IS 2 BP 252 EP 265 DI 10.1016/j.specom.2012.08.006 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 056SO UT WOS:000312511900004 ER PT J AU Yeh, CY Chang, SC Hwang, SH AF Yeh, Cheng-Yu Chang, Shun-Chieh Hwang, Shaw-Hwa TI A consistency analysis on an acoustic module for Mandarin text-to-speech SO SPEECH COMMUNICATION LA English DT Article DE Consistency analysis; Hidden Markov model (HMM); Vector quantization (VQ); Acoustic module; Text-to-speech (TTS); Speech synthesis ID PROSODIC INFORMATION; CHINESE; SYSTEM; UNITS; CONVERSION; ALGORITHM AB In this work, a consistency analysis on an acoustic module for a Mandarin text-to-speech (TTS) is presented as a way to improve the speech quality. Found by an inspection on the pronunciation process of human beings, the consistency can be interpreted as a high correlation of a warping curve between the spectrum and the prosody intra a syllable. Through three steps in the procedure of the consistency analysis, the HMM algorithm is used firstly to decode HMM-state sequences within a syllable at the same time as to divide them into three segments. Secondly, based on a designated syllable, the vector quantization (VQ) with the Linde-Buzo-Gray (LBG) algorithm is used to train the VQ codebooks of each segment. Thirdly, the prosodic vector of each segment is encoded as an index by VQ codebooks, and then the probability of each possible path is evaluated as a prerequisite to analyze the consistency. It is demonstrated experimentally that a consistency is definitely acquired in case the syllable is located exactly in the same word. These results offer a research direction that the warping process between the spectrum and the prosody intra a syllable must be considered in a TTS system to improve the speech quality. (C) 2012 Elsevier B.V. All rights reserved. C1 [Yeh, Cheng-Yu] Natl Chin Yi Univ Technol, Dept Elect Engn, Taichung 41170, Taiwan. [Chang, Shun-Chieh; Hwang, Shaw-Hwa] Natl Taipei Univ Technol, Dept Elect Engn, Taipei 10608, Taiwan. RP Yeh, CY (reprint author), Natl Chin Yi Univ Technol, Dept Elect Engn, 57,Sec 2,Zhongshan Rd, Taichung 41170, Taiwan. EM cy.yeh@ncut.edu.tw; t6319011@ntut.edu.tw; hsf@ntut.edu.tw FU Ministry of Economic Affairs [100-EC-17-A-03-S1-123]; National Science Council, Taiwan, Republic of China [NSC 95-2221-E-027-090] FX This research was financially supported by the Ministry of Economic Affairs under Grant No. 100-EC-17-A-03-S1-123 and the National Science Council under Grant No. NSC 95-2221-E-027-090, Taiwan, Republic of China. 
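The consistency analysis described in the abstract above trains vector quantization codebooks with the Linde-Buzo-Gray (LBG) algorithm. A minimal NumPy sketch of LBG training (split every centroid, then refine with nearest-neighbour reassignment) is given below; the toy data, codebook size and split factor are assumptions and do not reproduce the authors' per-segment setup.

# Sketch: LBG vector quantization - grow a codebook by splitting and refine each
# size with Lloyd iterations (nearest-centroid assignment + re-averaging).
import numpy as np

def lbg(data, codebook_size=8, eps=0.01, n_iter=20):
    # codebook_size should be a power of two for clean binary splitting.
    codebook = data.mean(axis=0, keepdims=True)        # start with one centroid
    while len(codebook) < codebook_size:
        codebook = np.vstack([codebook * (1 + eps),     # split every centroid
                              codebook * (1 - eps)])
        for _ in range(n_iter):
            d = np.linalg.norm(data[:, None, :] - codebook[None, :, :], axis=2)
            nearest = d.argmin(axis=1)
            for k in range(len(codebook)):              # move centroids to cell means
                members = data[nearest == k]
                if len(members):
                    codebook[k] = members.mean(axis=0)
    return codebook

rng = np.random.default_rng(0)
prosodic_vectors = rng.normal(size=(500, 4))            # toy "prosodic" vectors
cb = lbg(prosodic_vectors)
codes = np.linalg.norm(prosodic_vectors[:, None] - cb[None], axis=2).argmin(axis=1)
print(cb.shape, np.bincount(codes))

In the paper's setting, a codebook of this kind would be trained per HMM-state segment of a designated syllable, and each segment's prosodic vector would then be encoded as the index of its nearest codeword.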
CR Bellegarda JR, 2010, IEEE T AUDIO SPEECH, V18, P1455, DOI 10.1109/TASL.2009.2035209 Chalamandaris A, 2010, IEEE T CONSUM ELECTR, V56, P1890, DOI 10.1109/TCE.2010.5606343 Chen SH, 1998, IEEE T SPEECH AUDI P, V6, P226 Chou FC, 2002, IEEE T SPEECH AUDI P, V10, P481, DOI 10.1109/TSA.2002.803437 CHOU FC, 1998, ACOUST SPEECH SIG PR, P893 Chou FC, 1997, INT CONF ACOUST SPEE, P923 Chou FC, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P1624 Dey S, 2007, ASIA S PACIF DES AUT, P298 Guo Q., 2008, P TRIANGL S ADV ICT, P1 Huang X.D., 2001, HIDDEN MARKOV MODELS, P377 Hwang SH, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P1421 Hwang SH, 2005, GESTS INT T SPEECH S, V2, P91 Karabetsos S, 2009, IEEE T CONSUM ELECTR, V55, P613, DOI 10.1109/TCE.2009.5174430 KLATT DH, 1987, J ACOUST SOC AM, V82, P737, DOI 10.1121/1.395275 LEE LS, 1989, IEEE T ACOUST SPEECH, V37, P1309 LINDE Y, 1980, IEEE T COMMUN, V28, P84, DOI 10.1109/TCOM.1980.1094577 MOULINES E, 1990, SPEECH COMMUN, V9, P453, DOI 10.1016/0167-6393(90)90021-Z OMALLEY MH, 1990, COMPUTER, V23, P17, DOI 10.1109/2.56867 RABINER LR, 1989, P IEEE, V77, P257, DOI 10.1109/5.18626 Spelta Cristiano, 2010, IEEE Embedded Systems Letters, V2, DOI 10.1109/LES.2010.2052019 Wu CH, 2001, SPEECH COMMUN, V35, P219, DOI 10.1016/S0167-6393(00)00075-3 Yeh CY, 2005, IEE P-VIS IMAGE SIGN, V152, P793, DOI 10.1049/ip-vis:20045095 Yeh C.Y., 2010, P ISCCSP, P1 YING ZW, 2001, ACOUST SPEECH SIG PR, P809 Yoshimura T., 2000, P ICASSP, P1315 Yue D. J., 2010, P ICALIP, P1652 Zen H, 2009, SPEECH COMMUN, V51, P1039, DOI 10.1016/j.specom.2009.04.004 Zhu Y., 2002, P TENCON, P204 NR 28 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2013 VL 55 IS 2 BP 266 EP 277 DI 10.1016/j.specom.2012.08.009 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 056SO UT WOS:000312511900005 ER PT J AU Degottex, G Lanchantin, P Roebel, A Rodet, X AF Degottex, Gilles Lanchantin, Pierre Roebel, Axel Rodet, Xavier TI Mixed source model and its adapted vocal tract filter estimate for voice transformation and synthesis SO SPEECH COMMUNICATION LA English DT Article DE Mixed source; Glottal model; Vocal tract filter; Voice quality; Voice transformation; Speech synthesis ID SPEECH SYNTHESIS; HMM; REPRESENTATION; EXCITATION; NOISE AB In current methods for voice transformation and speech synthesis, the vocal tract filter is usually assumed to be excited by a flat amplitude spectrum. In this article, we present a method using a mixed source model defined as a mixture of the Liljencrants-Fant (LF) model and Gaussian noise. Using the LF model, the base approach used in this work is therefore close to a vocoder using exogenous input like ARX-based methods or the Glottal Spectral Separation (GSS) method. Such approaches are therefore dedicated to voice processing promising an improved naturalness compared to generic signal models. To estimate the Vocal Tract Filter (VTF), using spectral division like in GSS, we show that a glottal source model can be used with any envelope estimation method, conversely to the ARX approach where a least-squares AR solution is used. We therefore derive a VTF estimate which takes into account the amplitude spectra of both deterministic and random components of the glottal source. 
The proposed mixed source model is controlled by a small set of intuitive and independent parameters. The relevance of this voice production model is evaluated, through listening tests, in the context of resynthesis, HMM-based speech synthesis, breathiness modification and pitch transposition. (C) 2012 Elsevier B.V. All rights reserved. C1 [Degottex, Gilles; Lanchantin, Pierre; Roebel, Axel; Rodet, Xavier] Ircam, CNRS, Anal Synth Team, STMS,UMR9912, F-75004 Paris, France. RP Degottex, G (reprint author), Univ Crete, Dept Comp Sci, Iraklion 71409, Crete, Greece. EM gilles.degottex@ircam.fr FU Affective Avatar ANR project; Respoken and AngelStudio FEDER projects; Centre National de la Recherche Scientifique (CNRS) FX This research was partly supported by the Affective Avatar ANR project, by the Respoken and AngelStudio FEDER projects and the Centre National de la Recherche Scientifique (CNRS) for the PhD grant. Authors also would like to thank Chunghsin Yeh and Joao Cabral for the discussions, their time and their code, the reviewers for their numerous and precise remarks and especially the listeners for their precious ears. CR Agiomyrgiannakis Y., 2009, P IEEE INT C AC SPEE, P3589 Agiomyrgiannakis Y., 2008, P INTERSPEECH, P1849 Alku P, 1999, CLIN NEUROPHYSIOL, V110, P1329, DOI 10.1016/S1388-2457(99)00088-7 Assembly T.I.R., 2003, TECHNICAL REPORT BANNO H, 1998, ACOUST SPEECH SIG PR, P861 Bechet F., 2001, TRAITEMENT AUTOMATIQ, V42, P47 Bonada J., 2008, THESIS U POMPEU FABR Cabral J. P., 2010, THESIS U EDINBURGH U Cabral JP, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P1829 Cabral JP, 2011, INT CONF ACOUST SPEE, P4704 de Cheveigne A, 2002, J ACOUST SOC AM, V111, P1917, DOI 10.1121/1.1458024 Degottex G, 2011, IEEE T AUDIO SPEECH, V19, P1080, DOI 10.1109/TASL.2010.2076806 Degottex G, 2011, INT CONF ACOUST SPEE, P5128 Degottex G., 2010, THESIS UPMC FRANCE del Pozo A, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P1457 Drugman T., 2009, P INTERSPEECH, P116 Drugman T., 2009, INTERSPEECH Fant G., 1995, STL QPSR, V36, P119 Fant Gunnar, 1985, STL QPSR, V4, P1 Flanagan J.L., 1966, BELL SYSTEM TECHNICA Gales MJF, 1999, IEEE T SPEECH AUDI P, V7, P272, DOI 10.1109/89.759034 GRIFFIN DW, 1988, IEEE T ACOUST SPEECH, V36, P1223, DOI 10.1109/29.1651 Hamon C., 1989, P INT C AC SPEECH SI, V89, P238 Hedelin P., 1984, P IEEE INT C AC SPEE, P21 Henrich N., 2001, THESIS UPMC FRANCE HERMES DJ, 1991, SPEECH COMMUN, V10, P497, DOI 10.1016/0167-6393(91)90053-V Imai S., 1979, ELECT COMM, V62-A, P10 Kawahara H., 2001, MAVEBA Kawahara H, 1999, SPEECH COMMUN, V27, P187, DOI 10.1016/S0167-6393(98)00085-5 Kim SJ, 2007, IEICE T INF SYST, VE90D, P378, DOI 10.1093/ietisy/e90-d.1.378 Lanchantin P., 2008, P INT C LANG RES EV, P2403 Lanchantin P, 2010, INT CONF ACOUST SPEE, P4630, DOI 10.1109/ICASSP.2010.5495550 Laroche J., 1993, P IEEE INT C AC SPEE, P550 Markel JD, 1976, LINEAR PREDICTION SP MCAULAY RJ, 1986, IEEE T ACOUST SPEECH, V34, P744, DOI 10.1109/TASSP.1986.1164910 Mehta D, 2005, INT INTEG REL WRKSP, P199, DOI 10.1109/ASPAA.2005.1540204 MILLER RL, 1959, J ACOUST SOC AM, V31, P667, DOI 10.1121/1.1907771 OPPENHEI.AV, 1968, PR INST ELECTR ELECT, V56, P1264, DOI 10.1109/PROC.1968.6570 Pantazis Y., 2010, IEEE T AUDIO SPEECH, V19, P290 Peelers G., 2001, THESIS UPMC FRANCE Raitio T, 2011, IEEE T AUDIO SPEECH, V19, P153, DOI 10.1109/TASL.2010.2045239 RODET X, 1984, COMPUT MUSIC J, V8, P15, 
DOI 10.2307/3679810 Roebel A., 2007, PATTERN RECOGN, V28, P1343 STEVENS KN, 1971, J ACOUST SOC AM, V50, P1180, DOI 10.1121/1.1912751 Stylianou Y, 2001, IEEE T SPEECH AUDI P, V9, P21, DOI 10.1109/89.890068 Stylianou Y., 1996, THESIS TELECOMPARIS Tokuda K., 1995, P EUROSPEECH, P757 Tokuda K, 2002, PROCEEDINGS OF THE 2002 IEEE WORKSHOP ON SPEECH SYNTHESIS, P227 Tokuda K, 2002, IEICE T INF SYST, VE85D, P455 Tooher M., 2003, P ISCA VOIC QUAL FUN, P41 Valbret H., 1992, P ICASSP SAN FRANC U, V1, P145, DOI 10.1109/ICASSP.1992.225951 Vincent D, 2007, INT CONF ACOUST SPEE, P525 Young S., 1994, TECHNICAL REPORT Zen H, 2007, P ISCA WORKSH SPEECH Zen H., 2004, P ICSLP Zivanovic M, 2008, COMPUT MUSIC J, V32, P57, DOI 10.1162/comj.2008.32.2.57 NR 56 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2013 VL 55 IS 2 BP 278 EP 294 DI 10.1016/j.specom.2012.08.010 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 056SO UT WOS:000312511900006 ER PT J AU Kane, J Gobl, C AF Kane, John Gobl, Christer TI Evaluation of glottal closure instant detection in a range of voice qualities SO SPEECH COMMUNICATION LA English DT Article DE Glottal closure instant; Phonation type; Voice quality; Glottal epochs ID GROUP DELAY FUNCTION; SIGNIFICANT EXCITATION; SPEECH SIGNALS; ELECTROGLOTTOGRAPHIC SIGNALS; WAVELET TRANSFORM; EPOCH EXTRACTION; DYPSA ALGORITHM; PREDICTION; PHONATION; ENVELOPE AB Recently developed speech technology platforms, such as statistical speech synthesis and voice transformation systems, facilitate the modification of voice characteristics. To fully exploit the potential of such platforms, speech analysis algorithms need to be able to handle the different acoustic characteristics of a variety of voice qualities. Glottal closure instant (GCI) detection is typically required in the analysis stages, and thus the importance of robust GCI algorithms is evident. The current study examines some important analysis signals relevant to GCI detection, for a range of phonation types. Furthermore, a new algorithm is proposed which builds on an existing GCI algorithm to optimise the performance when analysing speech involving different phonation types. Results suggest improvements in the GCI detection rate for creaky voice due to a reduction in false positives. When there is a lack of prominent peaks in the Linear Prediction residual, as found for breathy and harsh voice, the results further indicate some enhancement of GCI identification accuracy for the proposed method. (C) 2012 Elsevier B.V. All rights reserved. C1 [Kane, John; Gobl, Christer] Trinity Coll Dublin, Sch Linguist Speech & Commun Sci, Phonet & Speech Lab, Dublin, Ireland. RP Kane, J (reprint author), Trinity Coll Dublin, Sch Linguist Speech & Commun Sci, Phonet & Speech Lab, Dublin, Ireland. EM kanejo@tcd.ie; cegobl@tcd.ie FU Science Foundation Ireland [07/CE/I1142, 09/IN.1/I 2631] FX This research was supported by the Science Foundation Ireland, Grant 07/CE/I1142 (Centre for Next Generation Localisation, http://www.cngl.ie) and Grant 09/IN.1/I 2631 (FASTNET). The authors would like to thank the anonymous reviewers for their insightful comments and suggestions. 
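The glottal closure instant (GCI) detection discussed in the abstract above typically starts from prominent peaks in the Linear Prediction residual. The rough sketch below (NumPy/SciPy; the LP order, peak thresholds and the synthetic test signal are assumptions) shows only that baseline idea, not the proposed algorithm or its phonation-type optimisation.

# Sketch: a crude glottal closure instant (GCI) detector that picks prominent
# peaks in the linear prediction (LP) residual of voiced speech.
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter, find_peaks

def lpc_error_filter(x, order=18):
    # Autocorrelation-method LPC solved as a Toeplitz system; returns A(z).
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])
    return np.concatenate(([1.0], -a))

def detect_gcis(x, fs, order=18, max_f0=400.0):
    residual = lfilter(lpc_error_filter(x, order), [1.0], x)  # impulsive near GCIs
    spacing = int(fs / max_f0)                  # at most one GCI per pitch period
    peaks, _ = find_peaks(np.abs(residual), distance=spacing,
                          height=3 * np.abs(residual).std())
    return peaks / fs                           # GCI times in seconds

# Toy test: an impulse-train-excited resonator stands in for a voiced vowel.
fs = 16000
excitation = np.zeros(fs // 2)
excitation[::fs // 120] = 1.0                   # 120 Hz "glottal" pulse train
vowel = lfilter([1.0], [1.0, -2 * 0.97 * np.cos(2 * np.pi * 500 / fs), 0.97 ** 2],
                excitation)
print(detect_gcis(vowel, fs)[:5])               # roughly multiples of 1/120 s

As the abstract notes, this residual-peak picture breaks down for breathy or harsh voice, where the residual lacks prominent peaks, which is what motivates the proposed refinements.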
CR Agiomyrgiannakis Y., 2009, P IEEE INT C AC SPEE, P3589 ALKU P, 1992, SPEECH COMMUN, V11, P109, DOI 10.1016/0167-6393(92)90005-R ANANTHAPADMANABHA TV, 1979, IEEE T ACOUST SPEECH, V27, P309, DOI 10.1109/TASSP.1979.1163267 Cabral J., 2011, P INTERSPEECH, P1989 Cabral JP, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P1829 Carlson R, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P1300 CHENG YM, 1989, IEEE T ACOUST SPEECH, V37, P1805, DOI 10.1109/29.45529 d'Alessandro C., 2011, SADHANA 5, V36 Degottex G., 2009, P SPECOM ST PET, P226 Drugman T., 2009, P INTERSPEECH, P116 Drugman T., 2009, P INTERSPEECH, P2891 Drugman T., 2011, THESIS U MONS Drugman T, 2012, IEEE T AUDIO SPEECH, V20, P994, DOI 10.1109/TASL.2011.2170835 Drugman T., 2012, P INTERSPEECH, P130 Drugman T., 2011, P INTERSPEECH, P1973 Esling J. H., 2003, P 15 ICPHS BARC, P1049 Fant G., 1985, 4 KTH SPEECH TRANSM, V4, P1 GOBL C, 1992, SPEECH COMMUN, V11, P481, DOI 10.1016/0167-6393(92)90055-C Guruprasad S., 2007, P INT ANTW BELG, P554 Henrich N, 2004, J ACOUST SOC AM, V115, P1321, DOI 10.1121/1.1646401 Ishi CT, 2008, IEEE T AUDIO SPEECH, V16, P47, DOI 10.1109/TASL.2007.910791 Ishi C.T., 2008, SPEECH COMMUN, V50, P531 KADAMBE S, 1992, IEEE T INFORM THEORY, V38, P917, DOI 10.1109/18.119752 Kay S. M., 1988, MODERN SPECTRAL ESTI Kominek J., 2004, ISCA SPEECH SYNTH WO, P223 KOUNOUDES A, 2002, ACOUST SPEECH SIG PR, P349 Laver J, 1980, PHONETIC DESCRIPTION MONSEN RB, 1977, J ACOUST SOC AM, V62, P981, DOI 10.1121/1.381593 MOULINES E, 1990, SPEECH COMMUN, V9, P401, DOI 10.1016/0167-6393(90)90017-4 Murty KSR, 2008, IEEE T AUDIO SPEECH, V16, P1602, DOI 10.1109/TASL.2008.2004526 Naylor PA, 2007, IEEE T AUDIO SPEECH, V15, P34, DOI 10.1109/TASL.2006.876878 Ni Chasaide A., 1993, LANG SPEECH, V2, P303 O' Cinneide A., 2011, P INTERSPEECH, P57 OGDEN RICHARD, 2001, J INT PHON ASSOC, V31, P139 Podesva RJ, 2007, J SOCIOLING, V11, P478, DOI 10.1111/j.1467-9841.2007.00334.x Raitio T, 2011, IEEE T AUDIO SPEECH, V19, P153, DOI 10.1109/TASL.2010.2045239 Rao KS, 2006, IEEE T AUDIO SPEECH, V14, P972, DOI 10.1109/TSA.2005.858051 Rao KS, 2007, IEEE SIGNAL PROC LET, V14, P762, DOI 10.1109/LSP.2007.896454 Schnell K, 2007, LECT NOTES ARTIF INT, V4885, P221 Schroder M., 2001, P EUROSPEECH 2001 SE, P561 Schwartz R., 1990, P IEEE INT C AC SPEE, P81 SMITS R, 1995, IEEE T SPEECH AUDI P, V3, P325, DOI 10.1109/89.466662 Sturmel N, 2009, INT CONF ACOUST SPEE, P4517, DOI 10.1109/ICASSP.2009.4960634 Stylianou Y., 1999, 6 EUR C SPEECH COMM Talkin D., 1989, J ACOUST SOC AM, V85, P149 Talkin D., 1995, SPEECH CODING SYNTHE Thomas MRP, 2009, IEEE T AUDIO SPEECH, V17, P1557, DOI 10.1109/TASL.2009.2022430 Thomas MRP, 2012, IEEE T AUDIO SPEECH, V20, P82, DOI 10.1109/TASL.2011.2157684 Tuan V., 1999, 6 EUR C SPEECH COMM Van den Berg J., 1968, MANUAL PHONETICS, P278 Villavicencio F, 2006, INT CONF ACOUST SPEE, P869 Vincent D, 2005, P INT LISB, P333 NR 52 TC 6 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD FEB PY 2013 VL 55 IS 2 BP 295 EP 314 DI 10.1016/j.specom.2012.08.01 PG 20 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 056SO UT WOS:000312511900007 ER PT J AU Dromey, C Jang, GO Hollis, K AF Dromey, Christopher Jang, Gwi-Ok Hollis, Kristi TI Assessing correlations between lingual movements and formants SO SPEECH COMMUNICATION LA English DT Article DE Formant; Kinematic; Diphthong; Articulation ID MOTOR SPEECH DISORDERS; VOCAL-TRACT STEADINESS; LOUDNESS MANIPULATIONS; SPASMODIC DYSPHONIA; TONGUE MOVEMENTS; SPEAKING RATE; DYSARTHRIA; SPEAKERS; SYSTEM AB First and second formant histories have been used in studies of both normal and disordered speech to indirectly measure the activity of the vocal tract. The purpose of the present study was to determine the extent to which formant measures are reflective of lingual movements during diphthong production. Twenty native speakers of American English from the western United States produced four diphthongs in a sentence context while tongue movement was measured with a magnetic tracking system. Correlations were computed between the vertical tongue movements and the first formant, as well as between the anteroposterior movements and the second formant during the transition phase of the diphthong. In many instances the acoustic measures were clearly reflective of the kinematic data. However, there were also exceptions, where the acoustic and kinematic records were not congruent. These instances were evaluated quantitatively and qualitatively in an effort to understand the cause of the discrepancy. Factors such as coarticulation, motor equivalence (including the influence of structures other than the tongue), and nonlinearities in the linkage between movement and acoustics could account for these findings. Recognizing potential influences on the acoustic kinematic relationship may be valuable in the interpretation of articulatory acoustic data on the individual speaker level. (C) 2012 Elsevier B.V. All rights reserved. C1 [Dromey, Christopher; Jang, Gwi-Ok; Hollis, Kristi] Brigham Young Univ, Dept Commun Disorders, Provo, UT 84602 USA. RP Dromey, C (reprint author), Brigham Young Univ, Dept Commun Disorders, 133 John Taylor Bldg, Provo, UT 84602 USA. EM dromey@byu.edu FU David O. McKay School of Education at Brigham Young University FX We express appreciation to the individuals who volunteered their time to participate in this study. Funding was provided by the David O. McKay School of Education at Brigham Young University. This paper was based on the master's theses of the second and third authors. CR BARLOW SM, 1983, J SPEECH HEAR RES, V26, P283 Beddor PS, 2009, LANGUAGE, V85, P785 Boersma P., 2007, PRAAT CANNITO MP, 1989, RECENT ADVANCES IN CLINICAL DYSARTHRIA, P243 Dagli AS, 1997, EUR ARCH OTO-RHINO-L, V254, P78, DOI 10.1007/BF01526184 Dromey C, 2006, SPEECH COMMUN, V48, P463, DOI 10.1016/j.specom.2005.05.003 Dromey C, 2008, J SPEECH LANG HEAR R, V51, P196, DOI 10.1044/1092-4388(2008/015) Ferrand C. 
T., 2007, SPEECH SCI INTEGRATE GERRATT BR, 1983, J SPEECH HEAR RES, V26, P297 Glass G V., 1984, STAT METHODS ED PSYC Hollis K.L., 2009, THESIS B YOUNG U HUGHES OM, 1976, PHONETICA, V33, P199 Jong G.O., 2010, THESIS B YOUNG U Kent RD, 1999, J COMMUN DISORD, V32, P141, DOI 10.1016/S0021-9924(99)00004-0 LISS JM, 1992, J ACOUST SOC AM, V92, P2984, DOI 10.1121/1.404364 Mathworks, 2009, MATLAB Mefferd AS, 2010, J SPEECH LANG HEAR R, V53, P1206, DOI 10.1044/1092-4388(2010/09-0083) Nissen SL, 2007, PHONETICA, V64, P201, DOI [10.1159/000121373, 10.1159/006121373] OHALA JJ, 1993, LANG SPEECH, V36, P155 PERKELL JS, 1992, J ACOUST SOC AM, V92, P3078, DOI 10.1121/1.404204 SCHONLE PW, 1987, BRAIN LANG, V31, P26, DOI 10.1016/0093-934X(87)90058-7 SMITH A, 1995, EXP BRAIN RES, V104, P493 SPSS, 2012, SPSS STEVENS KN, 1989, J PHONETICS, V17, P3 Stevens KN, 2010, J PHONETICS, V38, P10, DOI 10.1016/j.wocn.2008.10.004 Stone M, 1996, J ACOUST SOC AM, V99, P3728, DOI 10.1121/1.414969 Story BH, 2009, J ACOUST SOC AM, V126, P825, DOI 10.1121/1.3158816 Tasko SM, 2002, J SPEECH LANG HEAR R, V45, P127, DOI 10.1044/1092-4388(2002/010) Tingley S, 2000, J MED SPEECH-LANG PA, V8, P249 Tjaden K, 2004, J SPEECH LANG HEAR R, V47, P766, DOI 10.1044/1092-4388(2004/058) WEISMER G, 1992, J ACOUST SOC AM, V91, P1085, DOI 10.1121/1.402635 Weismer G, 2003, J ACOUST SOC AM, V113, P3362, DOI 10.1121/1.1572142 Zwirner P, 1997, EUR ARCH OTO-RHINO-L, V254, P391, DOI 10.1007/BF01642557 ZWIRNER P, 1992, J SPEECH HEAR RES, V35, P761 NR 34 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2013 VL 55 IS 2 BP 315 EP 328 DI 10.1016/j.specom.2012.09.001 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 056SO UT WOS:000312511900008 ER PT J AU Iwata, T Watanabe, S AF Iwata, Tomoharu Watanabe, Shinji TI Influence relation estimation based on lexical entrainment in conversation SO SPEECH COMMUNICATION LA English DT Article DE Conversation analysis; Influence; Latent variable model; Entrainment ID EM ALGORITHM; SPEECH RECOGNITION; LANGUAGE MODEL; NETWORK AB In conversations, people tend to mimic their companions' behavior depending on their level of trust. This phenomenon is known as entrainment. We propose a probabilistic model for estimating influences among speakers from conversation data involving multiple people by modeling lexical entrainment. The proposed model estimates word use as a function of the weighted sum of the earlier word use of other speakers. The weights represent influences between speakers. The influences can be efficiently estimated by using the expectation maximization (EM) algorithm. We also develop its online inference procedures for sequentially modeling the dynamics of influence relations. Experiments performed on two meeting data sets one in Japanese and one in English demonstrate the effectiveness of the proposed method. (C) 2012 Elsevier B.V. All rights reserved. C1 [Iwata, Tomoharu; Watanabe, Shinji] NTT Commun Sci Labs, Kyoto 6190237, Japan. RP Iwata, T (reprint author), NTT Commun Sci Labs, Kyoto 6190237, Japan. EM iwata.tomoharu@lab.ntt.co.jp CR Auyeung A., 2010, P ACM C HYP HYP HT 1, P245 Blei D, 2006, P INT C MACH LEARN, V1, P113 Brennan S. 
E., 1996, INT S SPOK DIAL, P41 Chang F, 2006, PSYCHOL REV, V113, P234, DOI 10.1037/0033-295X.113.2.234 Chartrand TL, 1999, J PERS SOC PSYCHOL, V76, P893, DOI 10.1037//0022-3514.76.6.893 Coulston R., 2002, P 7 INT C SPOK LANG, V4, P2689 De Mori R., 2011, P 12 ANN C INT SPEEC, P3081 DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1 DIMBERG U, 1982, PSYCHOPHYSIOLOGY, V19, P643, DOI 10.1111/j.1469-8986.1982.tb02516.x Fiscus J.G., 2007, MULTIMODAL TECHNOLOG, P373 Gildea D., 1999, P EUROSPEECH, P2167 Giles H., 1991, ACCOMMODATION THEORY Hori T, 2010, IEEE WORKSH SPOK LAN, P412 Hori T., 2011, IEEE T AUDIO SPEECH Iwata Tomoharu, 2011, P 12 ANN C INT SPEEC, P3089 Ji Gang, 2004, P HLT NACAC 2004 MAS, P133, DOI 10.3115/1613984.1614018 KUHN R, 1990, IEEE T PATTERN ANAL, V12, P570, DOI 10.1109/34.56193 Mehler A, 2010, ENTROPY-SWITZ, V12, P1440, DOI 10.3390/e12061440 Neal RM, 1998, NATO ADV SCI I D-BEH, V89, P355 Nenkova A., 2008, P 46 ANN M ASS COMP, P169, DOI 10.3115/1557690.1557737 Otsuka K., 2008, P ACM ICMI, P257, DOI 10.1145/1452392.1452446 Pickering MJ, 2004, BEHAV BRAIN SCI, V27, P169 Purver M, 2006, COLING/ACL 2006, VOLS 1 AND 2, PROCEEDINGS OF THE CONFERENCE, P17 Reitter D, 2011, COGNITIVE SCI, V35, P587, DOI 10.1111/j.1551-6709.2010.01165.x Reitter D., 2006, P HUM LANG TECHN C N, P121, DOI 10.3115/1614049.1614080 Renals S., 2007, P IEEE WORKSH AUT SP, P238 Sato M, 2000, NEURAL COMPUT, V12, P407, DOI 10.1162/089976600300015853 Scissors LE, 2008, CSCW: 2008 ACM CONFERENCE ON COMPUTER SUPPORTED COOPERATIVE WORK, CONFERENCE PROCEEDINGS, P277 Valente F., 2011, P 12 ANN C INT SPEEC, P3077 Vinciarelli A, 2007, IEEE T MULTIMEDIA, V9, P1215, DOI 10.1109/TMM.2007.902882 Watanabe S, 2011, COMPUT SPEECH LANG, V25, P440, DOI 10.1016/j.csl.2010.07.006 NR 31 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2013 VL 55 IS 2 BP 329 EP 339 DI 10.1016/j.specom.2012.08.012 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 056SO UT WOS:000312511900009 ER PT J AU Jeong, Y AF Jeong, Yongwon TI Unified framework for basis-based speaker adaptation based on sample covariance matrix of variable dimension SO SPEECH COMMUNICATION LA English DT Article DE Speech recognition; Speaker adaptation; Two-dimensional principal component analysis; Eigenvoice speaker adaptation ID HIDDEN MARKOV-MODELS; 2-DIMENSIONAL PCA; RECOGNITION AB We present a unified framework for basis-based speaker adaptation techniques, which subsumes eigenvoice speaker adaptation using principal component analysis (PCA) and speaker adaptation using two-dimensional PCA (2DPCA). The basic idea is to partition a Gaussian mean vector of a hidden Markov model (HMM) for each state and mixture component into a group of subvectors and stack all the subvectors of a training speaker model into a matrix. The dimension of the matrix varies according to the dimension of the subvector. As a result, the basis vectors derived from the PCA of training model matrices have variable dimension and so does the speaker weight in the adaptation equation. When the amount of adaptation data is small, adaptation using the speaker weight of small dimension with the basis vectors of large dimension can give good performance, whereas when the amount of adaptation data is large, adaptation using the speaker weight of large dimension with the basis vectors of small dimension can give good performance. 
In the experimental results, when the dimension of basis vectors was chosen between those of the eigenvoice method and the 2DPCA-based method, the model showed the balanced performance between the eigenvoice method and the 2DPCA-based method. (C) 2012 Elsevier B.V. All rights reserved. C1 Pusan Natl Univ, Sch Elect Engn, Pusan 609735, South Korea. RP Jeong, Y (reprint author), Pusan Natl Univ, Sch Elect Engn, Pusan 609735, South Korea. EM jeongy@pusan.ac.kr CR Chen SC, 2004, PATTERN RECOGN, V37, P1081, DOI 10.1016/j.patcog.2003.09.004 Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278 Gottumukkal R, 2004, PATTERN RECOGN LETT, V25, P429, DOI 10.1016/j.patrec.2003.11.005 Jeong Y, 2010, IEEE SIGNAL PROC LET, V17, P193, DOI 10.1109/LSP.2009.2036696 Jolliffe I. T., 2002, PRINCIPAL COMPONENT, V2nd Jung HY, 2002, ETRI J, V24, P469, DOI 10.4218/etrij.02.0202.0003 Kuhn R, 2000, IEEE T SPEECH AUDI P, V8, P695, DOI 10.1109/89.876308 LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010 LIM YJ, 1995, INT CONF ACOUST SPEE, P89 RABINER LR, 1989, P IEEE, V77, P257, DOI 10.1109/5.18626 Shankar S., 2008, P IEEE C COMP VIS PA, P1, DOI 10.1145/1544012.1544028 Wang LW, 2005, PATTERN RECOGN LETT, V26, P57, DOI 10.1016/j.patrec.2004.08.016 Yang J, 2004, IEEE T PATTERN ANAL, V26, P131 NR 13 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2013 VL 55 IS 2 BP 340 EP 346 DI 10.1016/j.specom.2012.09.002 PG 7 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 056SO UT WOS:000312511900010 ER PT J AU Nose, T Kobayashi, T AF Nose, Takashi Kobayashi, Takao TI An intuitive style control technique in HMM-based expressive speech synthesis using subjective style intensity and multiple-regression global variance model SO SPEECH COMMUNICATION LA English DT Article DE HMM-based expressive speech synthesis; Multiple-regression HSMM; Style control; Style intensity; Multiple-regression global variance model ID HIDDEN MARKOV MODEL; SYNTHESIS SYSTEM; RECOGNITION; EMOTION AB To control intuitively the intensities of emotional expressions and speaking styles for synthetic speech, we introduce subjective style intensities and multiple-regression global variance (MRGV) models into hidden Markov model (HMM)-based expressive speech synthesis. A problem in the conventional parametric style modeling and style control techniques is that the intensities of styles appearing in synthetic speech strongly depend on the training data. To alleviate this problem, the proposed technique explicitly takes into account subjective style intensities perceived for respective training utterances using multiple-regression hidden semi-Markov models (MRHSMMs). As a result, synthetic speech becomes less sensitive to the variation of style expressivity existing in the training data. Another problem is that the synthetic speech generally suffers from the over-smoothing effect of model parameters in the model training, so the variance of the generated speech parameter trajectory becomes smaller than that of the natural speech. To alleviate this problem for the case of style control, we extend the conventional variance compensation method based on a GV model for a single-style speech to the case of multiple styles with variable style intensities by deriving the MRGV modeling. 
The objective and subjective experimental results show that these two techniques significantly enhance the intuitive style control of synthetic speech, which is essential for the speech synthesis system to communicate para-linguistic information correctly to the listeners. (C) 2012 Elsevier B.V. All rights reserved. C1 [Nose, Takashi; Kobayashi, Takao] Tokyo Inst Technol, Interdisciplinary Grad Sch Sci & Engn, Yokohama, Kanagawa 2268502, Japan. RP Nose, T (reprint author), Tokyo Inst Technol, Interdisciplinary Grad Sch Sci & Engn, Yokohama, Kanagawa 2268502, Japan. EM takashi.nose@ip.titech.ac.jp; takao.kobayashi@ip.titech.ac.jp FU JSPS [23700195, 24300071] FX Part of this work was supported by JSPS Grant-in-Aid for Scientific Research 23700195 and 24300071. CR Anastasakos T, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P1137 Pitrelli JF, 2006, IEEE T AUDIO SPEECH, V14, P1099, DOI 10.1109/TASL.2006.876123 Cowie R, 2003, SPEECH COMMUN, V40, P5, DOI 10.1016/S0167-6393(02)00071-7 Erickson D., 2005, Acoustical Science and Technology, V26, DOI 10.1250/ast.26.317 FUJINAGA K, 2001, ACOUST SPEECH SIG PR, P513 Gales MJF, 1998, COMPUT SPEECH LANG, V12, P75, DOI 10.1006/csla.1998.0043 Gales MJF, 2000, IEEE T SPEECH AUDI P, V8, P417, DOI 10.1109/89.848223 Iida A, 2003, SPEECH COMMUN, V40, P161, DOI 10.1016/S0167-6393(02)00081-X Kawahara H, 1999, SPEECH COMMUN, V27, P187, DOI 10.1016/S0167-6393(98)00085-5 Koriyama T, 2011, P INTERSPEECH, P2657 KUREMATSU A, 1990, SPEECH COMMUN, V9, P357, DOI 10.1016/0167-6393(90)90011-W Miyanaga K., 2004, P INTERSPEECH 2004 I, P1437 Niwase N, 2005, IEICE T INF SYST, VE88D, P2492, DOI 10.1093/ietisy/e88-d.11.2492 Nose T, 2009, IEICE T INF SYST, VE92D, P489, DOI 10.1587/transinf.E92.D.489 Nose T, 2007, IEICE T INF SYST, VE90D, P1406, DOI 10.1093/ietisy/e90-d.9.1406 Scherer K. R., 1977, MOTIV EMOTION, V1, P331, DOI 10.1007/BF00992539 Schroder M., 2001, P EUROSPEECH 2001 SE, P561 SCHULLER B, 2003, ACOUST SPEECH SIG PR, P1 Shinoda K., 2000, Journal of the Acoustical Society of Japan (E), V21, DOI 10.1250/ast.21.79 Tachibana M, 2005, IEICE T INF SYST, VE88D, P2484, DOI 10.1093/ietisy/e88-d.11.2484 Toda T, 2007, IEICE T INF SYST, VE90D, P816, DOI 10.1093/ietisy/e90-d.5.816 Tsuzuki R, 2004, P INTERSPEECH 2004 I, P1185 Yamagishi J., 2008, HTS 2008 SYSTEM YET Yamagishi J, 2003, IEICE T FUND ELECTR, VE86A, P1956 Yamagishi J, 2003, IEICE T INF SYST, VE86D, P534 Yamagishi J., 2003, P INTERSPEECH 2003 E, P2461 Yoshimura T, 1999, P EUR, P2347 Yu K, 2011, SPEECH COMMUN, V53, P914, DOI 10.1016/j.specom.2011.03.003 Zen H, 2009, SPEECH COMMUN, V51, P1039, DOI 10.1016/j.specom.2009.04.004 NR 29 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
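One ingredient of the style-control record above, global variance (GV) compensation, can be pictured as rescaling each dimension of a generated parameter trajectory so that its variance matches a target GV, with the target predicted from the style-intensity vector in the multiple-regression case. The sketch below uses random placeholder regression weights and toy data; it is a simplified post-processing view, not the MRGV model trained within the MRHSMM framework.

# Sketch: variance compensation of a generated parameter trajectory using a
# global variance (GV) target predicted linearly from a style-intensity vector.
# The regression weights here are random placeholders, not trained values.
import numpy as np

rng = np.random.default_rng(1)
T, D = 200, 24                                   # frames x cepstral dimensions
generated = rng.normal(scale=0.4, size=(T, D))   # over-smoothed trajectory (low GV)

style = np.array([0.7])                          # perceived style intensity in [0, 1]
H = np.abs(rng.normal(size=(D, 1)))              # placeholder regression matrix
b = np.full(D, 0.5)                              # placeholder bias (mean GV)
target_gv = H @ style + b                        # regression-style prediction of GV

mean = generated.mean(axis=0)
scale = np.sqrt(target_gv / generated.var(axis=0))
compensated = (generated - mean) * scale + mean  # per-dimension variance scaling

print(np.allclose(compensated.var(axis=0), target_gv))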
PD FEB PY 2013 VL 55 IS 2 BP 347 EP 357 DI 10.1016/j.specom.2012.09.003 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 056SO UT WOS:000312511900011 ER PT J AU Yong, PC Nordholm, S Dam, HH AF Yong, Pei Chee Nordholm, Sven Dam, Hai Huyen TI Optimization and evaluation of sigmoid function with a priori SNR estimate for real-time speech enhancement SO SPEECH COMMUNICATION LA English DT Article DE Speech enhancement; SNR estimation; Decision-directed approach; Sigmoid function; Objective evaluation ID SPECTRAL AMPLITUDE ESTIMATOR; NOISE AB In this paper, an a priori signal-to-noise ratio (SNR) estimator with a modified sigmoid gain function is proposed for real-time speech enhancement. The proposed sigmoid gain function has three parameters, which can be optimized such that they match conventional gain functions. In addition, the joint temporal dynamics between the SNR estimate and the spectral gain function is investigated to improve the performance of the speech enhancement scheme. As the widely-used decision-directed (DD) a priori SNR estimate has a well-known one-frame delay that leads to the degradation of speech quality, a modified a priori SNR estimator is proposed for the DD approach to overcome this delay. Evaluations are performed by utilizing the objective evaluation metric that measures the trade-off between the noise reduction, the speech distortion and the musical noise in the enhanced signal. The results are compared using the PESQ and the SNRseg measures as well as subjective listening tests. Simulation results show that the proposed gain function, which can flexibly model exponential distributions, is a potential alternative speech enhancement gain function. (C) 2012 Elsevier B.V. All rights reserved. C1 [Yong, Pei Chee; Nordholm, Sven; Dam, Hai Huyen] Curtin Univ Technol, Bentley, WA 6102, Australia. RP Yong, PC (reprint author), 1-40 Marquis St, Bentley, WA 6102, Australia. EM peichee.yong@postgrad.curtin.edu.au CR Alam M. J., 2009, J ELECT ELECT ENG, V9, P809 Andrianakis I, 2009, SPEECH COMMUN, V51, P1, DOI 10.1016/j.specom.2008.05.018 Berouti M., 1979, P IEEE INT C AC SPEE, V4, P208 BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 Breithaupt C, 2011, IEEE T AUDIO SPEECH, V19, P277, DOI 10.1109/TASL.2010.2047681 Breithaupt C, 2008, INT CONF ACOUST SPEE, P4037, DOI 10.1109/ICASSP.2008.4518540 Breithaupt C, 2008, INT CONF ACOUST SPEE, P4897, DOI 10.1109/ICASSP.2008.4518755 Cappe O, 1994, IEEE T SPEECH AUDI P, V2, P345, DOI 10.1109/89.279283 Chang JH, 2006, IEEE T SIGNAL PROCES, V54, P1965, DOI 10.1109/TSP.2006.874403 Cohen I, 2004, IEEE SIGNAL PROC LET, V11, P725, DOI [10.1109/LSP.2004.833478, 10.1109/LSP.2004.833278] Davis A., 2006, P 14 EUR SIGN PROC C EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 Gustafsson S, 2002, IEEE T SPEECH AUDI P, V10, P245, DOI 10.1109/TSA.2002.800553 Hansen J. H. 
L., 1998, P INT C SPOK LANG PR, V7, P2819 Hendriks RC, 2010, INT CONF ACOUST SPEE, P4266, DOI 10.1109/ICASSP.2010.5495680 Hu Y, 2008, IEEE T AUDIO SPEECH, V16, P229, DOI 10.1109/TASL.2007.911054 Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst Paliwal K, 2010, SPEECH COMMUN, V52, P450, DOI 10.1016/j.specom.2010.02.004 Paliwal K, 2012, SPEECH COMMUN, V54, P282, DOI 10.1016/j.specom.2011.09.003 Park YS, 2007, IEICE T COMMUN, VE90B, P2182, DOI 10.1093/ietcom/e90-b.8.2182 Plapous C, 2006, IEEE T AUDIO SPEECH, V14, P2098, DOI 10.1109/TASL.2006.872621 Plourde E, 2009, IEEE SIGNAL PROC LET, V16, P485, DOI 10.1109/LSP.2009.2018225 Quackenbush S. R., 1988, OBJECTIVE MEASURES S RIX AW, 2001, ACOUST SPEECH SIG PR, P749 Scalart P, 1996, INT CONF ACOUST SPEE, P629, DOI 10.1109/ICASSP.1996.543199 Suhadi S, 2011, IEEE T AUDIO SPEECH, V19, P186, DOI 10.1109/TASL.2010.2045799 Uemura Y., 2008, P INT WORKSH AC ECH Yong P.C., 2011, P 19 EUR SIGN PROC C, P211 NR 29 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2013 VL 55 IS 2 BP 358 EP 376 DI 10.1016/j.specom.2012.09.004 PG 19 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 056SO UT WOS:000312511900012 ER PT J AU Oonishi, T Iwano, K Furui, S AF Oonishi, Tasuku Iwano, Koji Furui, Sadaoki TI A noise-robust speech recognition approach incorporating normalized speech/non-speech likelihood into hypothesis scores SO SPEECH COMMUNICATION LA English DT Article DE Speech recognition; Noise robustness; Gaussian mixture model adaptation ID GAUSSIAN MIXTURE-MODELS AB In noisy environments, speech recognition decoders often incorrectly produce speech hypotheses for non-speech periods, and non-speech hypotheses, such as silence or a short pause, for speech periods. It is crucial to reduce such errors to improve the performance of speech recognition systems. This paper proposes an approach using normalized speech/non-speech likelihoods calculated using adaptive speech and non-speech GMMs to weight the scores of recognition hypotheses produced by the decoder. To achieve good decoding performance, the GMMs are adapted to the variations of acoustic characteristics of input utterances and environmental noise, using either of the two modern on-line unsupervised adaptation methods, switching Kalman filter (SKF) or maximum a posteriori (MAP) estimation. Experimental results on real-world in-car speech, the Drivers' Japanese Speech Corpus in a Car Environment (DJSC), and the AURORA-2 database show that the proposed method significantly improves recognition accuracy compared to a conventional approach using front-end voice activity detection (VAD). Results also confirm that our method significantly improves recognition accuracy under various noise and task conditions. (C) 2012 Elsevier B.V. All rights reserved. C1 [Oonishi, Tasuku; Furui, Sadaoki] Tokyo Inst Technol, Meguro Ku, Tokyo 1528552, Japan. [Iwano, Koji] Tokyo City Univ, Tsuzuki Ku, Yokohama, Kanagawa 2248551, Japan. RP Oonishi, T (reprint author), Tokyo Inst Technol, Meguro Ku, 2-12-1 Ookayama, Tokyo 1528552, Japan. EM oonishi@furui.cs.titech.ac.jp; iwano@tcu.ac.jp; furui@cs.titech.ac.jp CR Dixon P. 
R., 2007, P IEEE ASRU, P443 ETSI, 2002, 202050 ETSI ES Fujimoto M, 2007, INT CONF ACOUST SPEE, P797 Fujimoto M., 2007, P INT 07 AUG, P2933 Gales M.J.F, 1995, THESIS CAMBRIDGE U Hiraki K., 2008, IEICE TECHNICAL REPO, P93 Itou K., 1999, Journal of the Acoustical Society of Japan (E), V20 Iwano Koji, 2002, P ICSLP, P941 Metze F., 2002, P INT C SPOK LANG PR, P2133 Oonishi Tasuku, 2010, P INTERSPEECH, P3122 Pearce D., 2000, P ICSLP, V4, P29 Ramirez J., 2007, ROBUST SPEECH RECOGN, P1 Reynolds DA, 2000, DIGIT SIGNAL PROCESS, V10, P19, DOI 10.1006/dspr.1999.0361 Satoshi Tamura, 2004, P INT C AC SPEECH SI, V1, P857 Shiell D., 2009, VISUAL SPEECH RECOGN, P1 SINGH R, 2001, ACOUST SPEECH SIG PR, P273 Zhang YX, 2008, PATTERN RECOGN LETT, V29, P735, DOI 10.1016/j.patrec.2007.12.006 NR 17 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2013 VL 55 IS 2 BP 377 EP 386 DI 10.1016/j.specom.2012.10.001 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 056SO UT WOS:000312511900013 ER PT J AU Black, MP Katsamanis, A Baucom, BR Lee, CC Lammert, AC Christensen, A Georgiou, PG Narayanan, SS AF Black, Matthew P. Katsamanis, Athanasios Baucom, Brian R. Lee, Chi-Chun Lammert, Adam C. Christensen, Andrew Georgiou, Panayiotis G. Narayanan, Shrikanth S. TI Toward automating a human behavioral coding system for married couples' interactions using speech acoustic features SO SPEECH COMMUNICATION LA English DT Article DE Behavioral signal processing (BSP); Couple therapy; Dyadic interaction; Human behavior analysis; Prosody; Emotion recognition ID RANDOMIZED CLINICAL-TRIAL; FUNDAMENTAL-FREQUENCY; SPEAKER DIARIZATION; NONVERBAL BEHAVIOR; THERAPY; SATISFACTION; EMOTIONS; DISTRESS; COMMUNICATION; RECOGNITION AB Observational methods are fundamental to the study of human behavior in the behavioral sciences. For example, in the context of research on intimate relationships, psychologists' hypotheses are often empirically tested by video recording interactions of couples and manually coding relevant behaviors using standardized coding systems. This coding process can be time-consuming, and the resulting coded data may have a high degree of variability because of a number of factors (e.g., inter-evaluator differences). These challenges provide an opportunity to employ engineering methods to aid in automatically coding human behavioral data. In this work, we analyzed a large corpus of married couples' problem-solving interactions. Each spouse was manually coded with multiple session-level behavioral observations (e.g., level of blame toward other spouse), and we used acoustic speech features to automatically classify extreme instances for six selected codes (e.g., "low" vs. "high" blame). Specifically, we extracted prosodic, spectral, and voice quality features to capture global acoustic properties for each spouse and trained gender-specific and gender-independent classifiers. The best overall automatic system correctly classified 74.1% of the instances, an improvement of 3.95% absolute (5.63% relative) over our previously reported best results. We compare performance for the various factors: across codes, gender, classifier type, and feature type. (C) 2011 Elsevier B.V. All rights reserved. C1 [Black, Matthew P.; Katsamanis, Athanasios; Lee, Chi-Chun; Lammert, Adam C.; Georgiou, Panayiotis G.; Narayanan, Shrikanth S.] 
Univ So Calif, Signal Anal & Interpretat Lab, Los Angeles, CA 90089 USA. [Baucom, Brian R.; Narayanan, Shrikanth S.] Univ So Calif, Dept Psychol, Los Angeles, CA 90089 USA. [Christensen, Andrew] Univ Calif Los Angeles, Dept Psychol, Los Angeles, CA 90095 USA. RP Black, MP (reprint author), Univ So Calif, Signal Anal & Interpretat Lab, 3710 McClintock Ave, Los Angeles, CA 90089 USA. EM matthepb@usc.edu FU National Science Foundation; Viterbi Research Innovation Fund FX This research was supported in part by the National Science Foundation and the Viterbi Research Innovation Fund. Special thanks to the Couple Therapy research staff for collecting, transcribing, and coding the data. CR Atkins D., 2005, ANN M ASS BEH COGN T Batliner A, 2011, COGN TECHNOL, P71, DOI 10.1007/978-3-642-15184-2_6 Baucom B, 2007, J SOC CLIN PSYCHOL, V26, P689, DOI 10.1521/jscp.2007.26.6.689 Baucom BR, 2009, J CONSULT CLIN PSYCH, V77, P160, DOI 10.1037/a0014405 Baucom D., 2004, COUPLE OBSERVATIONAL, P159 Baucom DH, 1998, J CONSULT CLIN PSYCH, V66, P53, DOI 10.1037//0022-006X.66.1.53 Beck JG, 2006, BEHAV RES THER, V44, P737, DOI 10.1016/j.brat.2005.05.004 Black M., 2011, P INTERSPEECH BLACK M, 2010, P INT CHIB JAP, P2030 Boersma P., 2001, GLOT INT, V5, P341 Brune M, 2008, J NERV MENT DIS, V196, P282, DOI 10.1097/NMD.0b013e31816a4922 Bulut M, 2008, J ACOUST SOC AM, V123, P4547, DOI 10.1121/1.2909562 Burkhardt F, 2009, INT CONF ACOUST SPEE, P4761, DOI 10.1109/ICASSP.2009.4960695 Busso C, 2009, IEEE T AUDIO SPEECH, V17, P582, DOI 10.1109/TASL.2008.2009578 Busso C., 2009, ROLE PROSODY AFFECTI, P309 Campbell N., 2000, ISCA TUT RES WORKSH Chen K, 2004, P ICASSP, V1, P509 Christensen A, 2006, J CONSULT CLIN PSYCH, V74, P1180, DOI 10.1037/0022-006X.74.6.1180 Christensen A, 2004, J CONSULT CLIN PSYCH, V72, P176, DOI 10.1037/0022-006X.72.2.176 Christensen A., 1995, CLIN HDB COUPLE THER, P31 Christensen A., 1990, GENDER ISSUES CONT S, V6, P113 Christensen A, 2010, J CONSULT CLIN PSYCH, V78, P225, DOI 10.1037/a0018132 Cowie R, 2009, PHILOS T R SOC B, V364, P3515, DOI 10.1098/rstb.2009.0139 Coy A, 2007, SPEECH COMMUN, V49, P384, DOI 10.1016/j.specom.2006.11.002 de Cheveigne A, 2002, J ACOUST SOC AM, V111, P1917, DOI 10.1121/1.1458024 Devillers L, 2005, NEURAL NETWORKS, V18, P407, DOI 10.1016/j.neunet.2005.03.007 Devillers L, 2011, COMPUT SPEECH LANG, V25, P1, DOI 10.1016/j.csl.2010.07.002 Douglas-Cowie E, 2007, LECT NOTES COMPUT SC, V4738, P488 Douglas-Cowie E, 2003, SPEECH COMMUN, V40, P33, DOI 10.1016/S0167-6393(02)00070-5 Eyben F., 2010, ACM MULTIMEDIA, P1459 Fan RE, 2008, J MACH LEARN RES, V9, P1871 Fredman SJ, 2008, J FAM PSYCHOL, V22, P71, DOI 10.1037/0893-3200.22.1.71 Georgiou P.G., 2011, AFFECTIVE COMPUTING Ghosh P.K., 2010, IEEE T AUDIO SPEECH, V19, P600 Gibson J., 2011, P INTERSPEECH Gobl C, 2003, SPEECH COMMUN, V40, P189, DOI 10.1016/S0167-6393(02)00082-1 Gonzaga GC, 2007, J PERS SOC PSYCHOL, V93, P34, DOI 10.1037/0022-3514.93.1.34 GOTTMAN JM, 1989, J CONSULT CLIN PSYCH, V57, P47, DOI 10.1037/0022-006X.57.1.47 GOTTMAN J, 1977, J MARRIAGE FAM, V39, P461, DOI 10.2307/350902 Grimm M, 2007, SPEECH COMMUN, V49, P787, DOI 10.1016/j.specom.2007.01.010 Han KJ, 2008, IEEE T AUDIO SPEECH, V16, P1590, DOI 10.1109/TASL.2008.2002085 Heavey C., 2002, COUPLES INTERACTION Heyman RE, 2001, PSYCHOL ASSESSMENT, V13, P5, DOI 10.1037//1040-3590.13.1.5 Hops H., 1971, TECHNICAL REPORT Joachims T., 1998, EUR C MACH LEARN CHE, V1398, P137 Jones J., 1998, COUPLES INTERACTION Jurafsky D., 2009, HUMAN LANGUAGE TECHN, P638 Juslin P. 
N., 2005, NEW HDB METHODS NONV, P65 KARNEY BR, 1995, PSYCHOL BULL, V118, P3, DOI 10.1037/0033-2909.118.1.3 Katsamanis A., 2011, AFFECTIVE COMPUTING Katsamanis A., 2011, VER LARG SCAL PHON W Keen D, 2005, RES DEV DISABIL, V26, P243, DOI 10.1016/j.ridd.2004.07.002 Kerig P. K., 2004, COUPLE OBSERVATIONAL Lee C., 2011, P INTERSPEECH Lee C, 2009, P INT, P320 Lee CC, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 1-2, P793 Lee CM, 2005, IEEE T SPEECH AUDI P, V13, P293, DOI 10.1109/TSA.2004.838534 Margolin G, 1998, Clin Child Fam Psychol Rev, V1, P195, DOI 10.1023/A:1022608117322 Margolin G, 2004, DEV PSYCHOPATHOL, V16, P753, DOI 10.1017/S0954579404004766 McNemar Q, 1947, PSYCHOMETRIKA, V12, P153, DOI 10.1007/BF02295996 Mendenhall W., 2007, STAT ENG SCI, P302 Moreno P.J., 1998, P ICSLP SYDN AUSTR Murray K., 2001, SIGDIAL WORKSH DISC O'Brien M, 1994, Violence Vict, V9, P45 Ranganath R., 2009, C EMP METH NAT LANG, P334 Rozgic V., 2010, P INTERSPEECH Rozgic V, 2011, INT CONF ACOUST SPEE, P2368 Schuller B., 2009, P INT BRIGHT UK, V2009, P312 Schuller B., 2007, P INT, P2253 SCHULLER B., 2010, P INTERSPEECH 2010 M, P2794 Schuller B., 2009, ROLE PROSODY AFFECTI, P285 Schuller B, 2008, INT CONF ACOUST SPEE, P4501, DOI 10.1109/ICASSP.2008.4518656 Sevier M, 2008, BEHAV THER, V39, P137, DOI 10.1016/j.beth.2007.06.001 Shoham V, 1998, J FAM PSYCHOL, V12, P557 Tranter SE, 2006, IEEE T AUDIO SPEECH, V14, P1557, DOI 10.1109/TASL.2006.878256 Traunmuller H., 1994, TECHNICAL REPORT Vinciarelli A, 2009, IMAGE VISION COMPUT, V27, P1743, DOI 10.1016/j.imavis.2008.11.007 Williams-Baucom KJ, 2010, PERS RELATIONSHIP, V17, P41 Yildirim S., 2010, COMPUT SPEECH LANG, V25, P29 NR 79 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2013 VL 55 IS 1 BP 1 EP 21 DI 10.1016/j.specom.2011.12.003 PG 21 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 055NF UT WOS:000312422900001 ER PT J AU Hofe, R Ell, SR Fagan, MJ Gilbert, JM Green, PD Moore, RK Rybchenko, SI AF Hofe, Robin Ell, Stephen R. Fagan, Michael J. Gilbert, James M. Green, Phil D. Moore, Roger K. Rybchenko, Sergey I. TI Small-vocabulary speech recognition using a silent speech interface based on magnetic sensing SO SPEECH COMMUNICATION LA English DT Article DE Silent speech interfaces; Clinical speech technology; Articulography; Multi-modal speech recognition; Speech articulation AB This paper reports on word recognition experiments using a silent speech interface based on magnetic sensing of articulator movements. A magnetic field was generated by permanent magnet pellets fixed to relevant speech articulators. Magnetic field sensors mounted on a wearable frame measured the fluctuations of the magnetic field during speech articulation. These sensor data were used in place of conventional acoustic features for the training of hidden Markov models. Both small vocabulary isolated word recognition and connected digit recognition experiments are presented. Their results demonstrate the ability of the system to capture phonetic detail at a level that is surprising for a device without any direct access to voicing information. (C) 2012 Elsevier B.V. All rights reserved. C1 [Hofe, Robin; Green, Phil D.; Moore, Roger K.] Univ Sheffield, Dept Comp Sci, Sheffield S1 4DP, S Yorkshire, England. [Ell, Stephen R.] 
Hull & E Yorkshire Hosp Trust, Castle Hill Hosp, Cottingham HU16 5JQ, England. [Fagan, Michael J.; Gilbert, James M.; Rybchenko, Sergey I.] Univ Hull, Dept Engn, Kingston Upon Hull HU6 7RX, Yorks, England. RP Hofe, R (reprint author), Univ Sheffield, Dept Comp Sci, Sheffield S1 4DP, S Yorkshire, England. EM r.hofe@sheffield.ac.uk FU Henry Smith Charity; Action Medical Research [AP 1164] FX The MVOCA was developed under the REdRESS project, which has been funded by a generous grant from The Henry Smith Charity and Action Medical Research (Grant Number AP 1164). CR BORDEN GJ, 1979, BRAIN LANG, V7, P307, DOI 10.1016/0093-934X(79)90025-7 Brumberg J.S., 2010, SPEECH COMMUNICATION, V52 Denby B., 2006, IEEE INT C AC SPEECH Denby B, 2010, SPEECH COMMUN, V52, P270, DOI 10.1016/j.specom.2009.08.002 ETSI, 2000, 201108V111 ETSI Fagan MJ, 2008, MED ENG PHYS, V30, P419, DOI 10.1016/j.medengphy.2007.05.003 Gilbert JM, 2010, MED ENG PHYS, V32, P1189, DOI 10.1016/j.medengphy.2010.08.011 Gillick L., 1989, P IEEE C AC SPEECH S Hofe R., 2011, P INT 2011 FLOR IT Hofe R., 2010, P INT 2010 MAK JAP Kroos C., 2008, 8 INT SEM SPEECH PRO LEONARD R., 1984, IEEE INT C AC SPEECH Levinson SE, 2005, MATHEMATICAL MODELS FOR SPEECH TECHNOLOGY, P1, DOI 10.1002/0470020911 Maier-Hein L., 2005, P AUT SPEECH REC UND Petajan E., 1988, CHI 88 P SIGCHI C HU Qin C., 2008, P INT 2008 BRISB AUS RABINER LR, 1989, P IEEE, V77, P257, DOI 10.1109/5.18626 SCHONLE PW, 1983, BIOMED TECH, V28, P263, DOI 10.1515/bmte.1983.28.11.263 Schultz T, 2010, SPEECH COMMUN, V52, P341, DOI 10.1016/j.specom.2009.12.002 Wand M., 2011, INT C BIOINSP SYST S Young S., 2009, HTK BOOK HTK VERSION NR 21 TC 1 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2013 VL 55 IS 1 BP 22 EP 32 DI 10.1016/j.specom.2012.02.001 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 055NF UT WOS:000312422900002 ER PT J AU Boersma, P Chladkova, K AF Boersma, Paul Chladkova, Katerina TI Detecting categorical perception in continuous discrimination data SO SPEECH COMMUNICATION LA English DT Article DE Categorical perception; Dense sampling; Discrimination; Maximum likelihood ID VOWEL PERCEPTION; SPEECH; IDENTIFICATION; MEMORY AB We present a method for assessing categorical perception from continuous discrimination data. Until recently, categorical perception of speech has exclusively been measured by discrimination and identification experiments with a small number of different stimuli, each of which is presented multiple times. Experiments by Rogers and Davis (2009), however, suggest that using non-repeating stimuli yields a more reliable measure of categorization. If this idea is applied to a single phonetic continuum, the continuum has to be densely sampled and the obtained discrimination data is nearly continuous. In the present study, we describe a maximum-likelihood method that is appropriate for analysing such continuous discrimination data. (C) 2012 Elsevier B.V. All rights reserved. C1 [Boersma, Paul; Chladkova, Katerina] Univ Amsterdam, Amsterdam Ctr Language & Commun, NL-1012 WX Amsterdam, Netherlands. RP Boersma, P (reprint author), Univ Amsterdam, Amsterdam Ctr Language & Commun, NL-1012 WX Amsterdam, Netherlands. 
EM paul.boersma@uva.nl; k.chladkova@uva.nl CR AKAIKE H, 1974, IEEE T AUTOMAT CONTR, VAC19, P716, DOI 10.1109/TAC.1974.1100705 BABAUD J, 1986, IEEE T PATTERN ANAL, V8, P26 Best C.T., 1992, SR109110 HASK LAB, P89 Boersma P., 1993, P I PHONETIC SCI, V17, P97 Boersma Paul, 1997, P I PHONETIC SCI U A, V21, P43 EIMAS PD, 1963, LANG SPEECH, V6, P206 Fisher R.A., 1922, PHILOS T R SOC A, V222, P309, DOI DOI 10.1098/RSTA.1922.0009 Flannery B. P., 1992, NUMERICAL RECIPES C Gerrits E, 2004, PERCEPT PSYCHOPHYS, V66, P363, DOI 10.3758/BF03194885 KEWLEYPORT D, 1995, J ACOUST SOC AM, V97, P3139, DOI 10.1121/1.413106 KLATT DH, 1990, J ACOUST SOC AM, V87, P820, DOI 10.1121/1.398894 LIBERMAN AM, 1957, J EXP PSYCHOL, V54, P358, DOI 10.1037/h0044417 MERMELSTEIN P, 1978, J ACOUST SOC AM, V63, P572, DOI 10.1121/1.381756 NEAREY TM, 1990, J PHONETICS, V18, P347 PISONI DB, 1975, MEM COGNITION, V3, P7, DOI 10.3758/BF03198202 PISONI DB, 1973, PERCEPT PSYCHOPHYS, V13, P253, DOI 10.3758/BF03214136 Pitt MA, 2002, PSYCHOL REV, V109, P472, DOI 10.1037//0033-295X.109.3.472 Polka L, 2003, SPEECH COMMUN, V41, P221, DOI 10.1016/S0167-6393(02)00105-X Rogers J. C., 2009, P INT, P376 Weenink D., 1992, PRAAT DOING PHONETIC Wilks SS, 1938, ANN MATH STAT, V9, P60, DOI 10.1214/aoms/1177732360 NR 21 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2013 VL 55 IS 1 BP 33 EP 39 DI 10.1016/j.specom.2012.05.003 PG 7 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 055NF UT WOS:000312422900003 ER PT J AU Qiu, W Li, BZ Li, XW AF Qiu, Wei Li, Bing-Zhao Li, Xue-Wen TI Speech recovery based on the linear canonical transform SO SPEECH COMMUNICATION LA English DT Article DE Linear canonical transform; Chirp signal; The AM-FM model; Speech; Signal reconstruction ID FRACTIONAL FOURIER-TRANSFORM; UNCERTAINTY PRINCIPLES; ENERGY OPERATORS; OPTICS; SEPARATION; FREQUENCY; SIGNALS; REPRESENTATIONS; DEMODULATION; AMPLITUDE AB As is well known, speech signal processing is one of the most active directions in signal processing. There exist many speech signal models, such as the sinusoidal model, the STRAIGHT model, the AM-FM model, the Gaussian mixture model and so on. This paper investigates the AM-FM speech model using the linear canonical transform (LCT). The LCT can be considered as a generalization of the traditional Fourier transform and the fractional Fourier transform, and has proved to be a powerful tool for non-stationary signal processing. This has opened up the possibility of a new range of potentially promising and useful applications based on the LCT. Firstly, two novel speech recovery methods based on the AM-FM model are presented in this paper: one depends on LCT-domain filtering; the other is based on chirp signal parameter estimation to restore the speech signal in the LCT domain. Then, experimental results are presented to verify the performance of the proposed methods. Finally, a summary and the conclusions of the paper are given. (C) 2012 Published by Elsevier B.V. C1 [Qiu, Wei; Li, Bing-Zhao; Li, Xue-Wen] Beijing Inst Technol, Dept Math, Beijing 100081, Peoples R China. RP Li, BZ (reprint author), Beijing Inst Technol, Dept Math, Beijing 100081, Peoples R China. 
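The linear canonical transform used in the abstract above, with parameter matrix (a, b; c, d), ad - bc = 1 and b != 0, can be approximated numerically by direct quadrature of its integral kernel; c does not appear in the kernel and is fixed by the unimodularity constraint. The sketch below picks LCT parameters that compact a linear chirp, which is the property LCT-domain filtering exploits; the sampling grid, chirp rate and parameter values are illustrative choices, not the paper's settings.

# Sketch: brute-force numerical linear canonical transform (LCT) of a sampled
# signal via quadrature of the LCT kernel
#   K(u, t) = sqrt(1/(2j*pi*b)) * exp(1j/(2b) * (a*t^2 - 2*u*t + d*u^2)).
import numpy as np

def lct(f, t, u, a, b, d):
    dt = t[1] - t[0]
    kernel = np.exp(1j / (2 * b) * (a * t[None, :] ** 2
                                    - 2 * u[:, None] * t[None, :]
                                    + d * u[:, None] ** 2))
    return np.sqrt(1 / (2j * np.pi * b)) * (kernel @ f) * dt

# Toy example: an LCT whose chirp term cancels the signal's chirp rate
# concentrates the linear chirp near a single point in the LCT domain.
t = np.linspace(-4, 4, 1024)
u = t.copy()
chirp = np.exp(1j * 1.0 * t ** 2)                 # linear chirp, rate 1.0
spectrum = lct(chirp, t, u, a=-2.0, b=1.0, d=0.0)
print(u[np.argmax(np.abs(spectrum))])             # energy piles up near u = 0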
EM li_bingzhaobit@bit.edu.cn RI Li, Bing-Zhao/B-5165-2009 OI Li, Bing-Zhao/0000-0002-3850-4656 FU National Natural Science Foundation of China [60901058, 61171195]; Beijing Natural Science Foundation [1102029] FX This work was supported by the National Natural Science Foundation of China (Nos. 60901058 and 61171195), and also supported partially by Beijing Natural Science Foundation (No. 1102029). CR ABATZOGLOU TJ, 1986, IEEE T AERO ELEC SYS, V22, P708, DOI 10.1109/TAES.1986.310805 Aizenberg I, 2006, IEEE T SIGNAL PROCES, V54, P4261, DOI 10.1109/TSP.2006.881189 ALMEIDA LB, 1994, IEEE T SIGNAL PROCES, V42, P3084, DOI 10.1109/78.330368 BARGMANN V, 1961, COMMUN PUR APPL MATH, V14, P187, DOI 10.1002/cpa.3160140303 BOVIK AC, 1993, IEEE T SIGNAL PROCES, V41, P3245, DOI 10.1109/78.258071 CHEW KC, 1994, IEEE T SIGNAL PROCES, V42, P1939 COLLINS SA, 1970, J OPT SOC AM, V60, P1168 Dimitriadis D., 2005, IEEE SIGNAL PROCESSI, V12, P425 Fan HY, 2006, OPT LETT, V31, P2622, DOI 10.1364/OL.31.002622 FRIEDLANDER B, 1995, IEEE T SIGNAL PROCES, V43, P917, DOI 10.1109/78.376844 Gianfelici F, 2007, IEEE T AUDIO SPEECH, V15, P823, DOI 10.1109/TASL.2006.889744 Huang NE, 1998, P ROY SOC A-MATH PHY, V454, P903 James DFV, 1996, OPT COMMUN, V126, P207, DOI 10.1016/0030-4018(95)00708-3 Koc A, 2008, IEEE T SIGNAL PROCES, V56, P2383, DOI 10.1109/TSP.2007.912890 KOSTENBAUDER AG, 1990, IEEE J QUANTUM ELECT, V26, P1148, DOI 10.1109/3.108113 Li BZ, 2009, SIGNAL PROCESS, V89, P851, DOI 10.1016/j.sigpro.2008.10.030 Li BZ, 2007, SIGNAL PROCESS, V87, P983, DOI 10.1016/j.sigpro.2006.09.008 Li CP, 2012, SIGNAL PROCESS, V92, P1658, DOI 10.1016/j.sigpro.2011.12.024 MARAGOS P, 1993, IEEE T SIGNAL PROCES, V41, P1532, DOI 10.1109/78.212729 Moshinsky M., 2006, J MATH PHYS, V27, P665 Pei SC, 2002, IEEE T SIGNAL PROCES, V50, P11 Qi L, 2003, SCI CHINA SER E, V33, P749 Santhanam B, 2000, IEEE T COMMUN, V48, P473, DOI 10.1109/26.837050 Sharma KK, 2008, IEEE T SIGNAL PROCES, V56, P2677, DOI 10.1109/TSP.2008.917384 Sharma KK, 2006, OPT COMMUN, V265, P454, DOI 10.1016/j.optcom.2006.03.062 Smith III J.O., 1992, P ICMC 87 Stern A, 2008, J OPT SOC AM A, V25, P647, DOI 10.1364/JOSAA.25.000647 Tao R., 2009, FRACTIONAL FOURIER T Tao R, 2008, IEEE T SIGNAL PROCES, V56, P4199, DOI 10.1109/TSP.2008.925579 Teager H. M., 1989, NATO ADV STUDY I SPE Xu Tian-Zhou, 2012, MATH PROBL ENG, P19 Yao J, 2002, IEEE T BIO-MED ENG, V49, P1299, DOI 10.1109/TMBE.2002.804590 Yin H., 2008, 6 INT S CHIN SPOK LA Zhao J, 2009, IEEE T SIGNAL PROCES, V57, P2856, DOI 10.1109/TSP.2009.2020039 NR 34 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2013 VL 55 IS 1 BP 40 EP 50 DI 10.1016/j.specom.2012.06.002 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 055NF UT WOS:000312422900004 ER PT J AU Arsikere, H Leung, GKF Lulich, SM Alwan, A AF Arsikere, Harish Leung, Gary K. F. Lulich, Steven M. Alwan, Abeer TI Automatic estimation of the first three subglottal resonances from adults' speech signals with application to speaker height estimation SO SPEECH COMMUNICATION LA English DT Article DE Subglottal resonances; Automatic estimation; Bilingual speakers; Speaker height ID AMERICAN ENGLISH VOWELS; BODY-SIZE; NORMALIZATION; RECOGNITION; WEIGHT; SCALE AB Recent research has demonstrated the usefulness of subglottal resonances (SGRs) in speaker normalization.
However, existing algorithms for estimating SGRs from speech signals have limited applicability: they are effective with isolated vowels only. This paper proposes a novel algorithm for estimating the first three SGRs (Sg1, Sg2 and Sg3) from continuous adults' speech. While Sg1 and Sg2 are estimated based on the phonological distinction they provide between vowel categories, Sg3 is estimated based on its correlation with Sg2. The RMS estimation errors (approximately 30, 60 and 100 Hz for Sg1, Sg2 and Sg3, respectively) are not only comparable to the standard deviations in the measurements, but are also independent of vowel content and language (English and Spanish). Since SGRs correlate with speaker height while remaining roughly constant for a given speaker (unlike vocal tract parameters), the proposed algorithm is applied to the task of height estimation using speech signals. The proposed height estimation method matches state-of-the-art algorithms in performance (mean absolute error = 5.3 cm), but uses much less training data and a much smaller feature set. Our results, with additional analysis of physiological data, suggest the existence of a limit to the accuracy of speech-based height estimation. (C) 2012 Elsevier B.V. All rights reserved. C1 [Arsikere, Harish; Leung, Gary K. F.; Alwan, Abeer] Univ Calif Los Angeles, Dept Elect Engn, Los Angeles, CA 90095 USA. [Lulich, Steven M.] Washington Univ, Dept Psychol, St Louis, MO 63130 USA. RP Arsikere, H (reprint author), Univ Calif Los Angeles, Dept Elect Engn, 56-125B Engn 4 Bldg,Box 951594, Los Angeles, CA 90095 USA. EM hari.arsikere@gmail.com; garyleung@ucla.edu; slulich@wustl.edu; alwan@ee.ucla.edu FU NSF [0905381] FX We would like to thank John R. Morton for his role in recording and labeling the WashU-UCLA corpora. We are also thankful to Dr. Mitchell S. Sommers for his valuable suggestions, and to Melissa Erickson for help with manual measurements. The work was supported in part by NSF Grant no. 0905381. CR [Anonymous], 2001, TRANSMISSION PERFORM Arsikere H, 2011, INT CONF ACOUST SPEE, P4616 Arsikere H., 2010, J ACOUST SOC AM, V128, P2288, DOI [10.1121/1.3508029, DOI 10.1121/1.3508029] Arsikere H., 2011, J ACOUST SOC AM EXPR, V129, P197 Arsikere H., 2012, Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2012), DOI 10.1109/ICASSP.2012.6288792 Beskow J., 2000, P ICSLP 2000 BEIJ CH, V4, P464 Boersma P., PRAAT SPEECH PROCESS Cheyne H.A., 2002, THESIS MIT Chi X., 2004, J ACOUST SOC AM, V115, P2540 Chi XM, 2007, J ACOUST SOC AM, V122, P1735, DOI 10.1121/1.2756793 CHISTOVICH LA, 1985, J ACOUST SOC AM, V77, P789, DOI 10.1121/1.392049 Csapo T. G., 2009, P INT, P484 DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 Dogil G., 2011, TONES FEATURES PHONE, P137 Dusan S., 2005, P EUROSPEECH 2005 LI, P1989 Fitch WT, 1999, J ACOUST SOC AM, V106, P1511, DOI 10.1121/1.427148 Ganchev T, 2010, LECT NOTES ARTIF INT, V6040, P81 Ganchev T., 2010, P 2010 EUR SIGN PROC, P800 Garofolo J., 1988, GETTING STARTED DARP Gonzalez J, 2004, J PHONETICS, V32, P277, DOI 10.1016/S0095-4470(03)00049-4 Graczi T.E., 2011, P INT HILLENBRAND J, 1995, J ACOUST SOC AM, V97, P3099, DOI 10.1121/1.411872 Honda K, 2010, J PHONETICS, V38, P33, DOI 10.1016/j.wocn.2008.11.002 Hrdlicka A., 1925, OLD AM Jung Y., 2009, THESIS MIT HARVARD KUNZEL HJ, 1989, PHONETICA, V46, P117 Lulich S.
M., 2010, J ACOUST SOC AM, V128 Lulich SM, 2011, J ACOUST SOC AM, V130, P2108, DOI 10.1121/1.3632091 Lulich SM, 2010, J PHONETICS, V38, P20, DOI 10.1016/j.wocn.2008.10.006 Lulich Steven M., 2006, THESIS MIT Madsack A, 2008, P LABPHON, V11, P91 MAKHOUL J, 1975, P IEEE, V63, P561, DOI 10.1109/PROC.1975.9792 Nelson D, 1997, INT CONF ACOUST SPEE, P1643, DOI 10.1109/ICASSP.1997.598822 Pellom B. L., 1997, 40 MIDW S CIRC SYST, P873 PETERSON GE, 1952, J ACOUST SOC AM, V24, P175, DOI 10.1121/1.1906875 Rendall D, 2005, J ACOUST SOC AM, V117, P944, DOI 10.1121/1.1848011 Sjolander K., 1997, SNACK SOUND TOOLKIT Sonderegger M., 2004, THESIS MIT Stevens K.N., 1998, ACOUSTIC PHONETICS SYRDAL AK, 1986, J ACOUST SOC AM, V79, P1086, DOI 10.1121/1.393381 TRAUNMULLER H, 1990, J ACOUST SOC AM, V88, P97 Umesh S, 1999, IEEE T SPEECH AUDI P, V7, P40, DOI 10.1109/89.736329 vanDommelen WA, 1995, LANG SPEECH, V38, P267 Wang S., 2009, P INT, P1619 Wang SZ, 2008, INT CONF ACOUST SPEE, P4277 Wang SZ, 2009, J ACOUST SOC AM, V126, P3268, DOI 10.1121/1.3257185 Wang SZ, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P1717 NR 47 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2013 VL 55 IS 1 BP 51 EP 70 DI 10.1016/j.specom.2012.06.004 PG 20 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 055NF UT WOS:000312422900005 ER PT J AU Ming, Y Ruan, QQ Gao, GD AF Ming, Yue Ruan, Qiuqi Gao, Guodong TI A Mandarin edutainment system integrated virtual learning environments SO SPEECH COMMUNICATION LA English DT Article DE Mandarin learning; Pronunciation evaluation; Virtual Reality (VR); Edutainment; Virtual learning environment (VLE); 3D face recognition ID REALITY; GAMES AB In this paper, a novel Mandarin edutainment system is developed for learning Mandarin in immersive, interactive Virtual Learning Environments (VLE). Our system mainly comprises two parts: speech technology support and virtual 3D game design. First, 3D face recognition technology is introduced to distinguish different learners and provide personalized learning services based on the characteristics of the individuals. Then, a Mandarin pronunciation recognition and assessment scheme is constructed using state-of-the-art speech processing technology. Given the distinctive rhythmic differences between Mandarin and Western languages, we integrate prosodic parameters into the recognition and evaluation model to highlight Mandarin characteristics and improve the evaluation performance. In order to promote the engagement of foreign learners, we embed our technology framework into a Virtual Reality (VR) game environment. The character design reflects traditional Chinese culture, and the plots give consideration to both pronunciation learning and learners' interest while providing scoring feedback. In the experimental design, we first test the correlation of recognition results and machine scores with the different error types and with human scores. Then, we evaluate the usability, likeability, and knowledgeability of the whole VLE system. We divide the learners into three categories according to their Mandarin level, and they provide feedback via a questionnaire. The results show that our system can effectively promote foreign learners' engagement and improve their Mandarin level. (C) 2012 Elsevier B.V.
All rights reserved. C1 [Ming, Yue; Ruan, Qiuqi] Beijing JiaoTong Univ, Inst Informat Sci, Beijing 100044, Peoples R China. [Gao, Guodong] Beijing Traff Control Technol CO Ltd, Beijing 100044, Peoples R China. RP Ming, Y (reprint author), Beijing JiaoTong Univ, Inst Informat Sci, Beijing 100044, Peoples R China. EM myname35875235@126.com FU National Natural Science Foundation [60973060]; Research Fund for the Doctoral Program [20080004001]; Beijing Program [YB20081000401]; Fundamental Research Funds for the Central Universities [2009YJS025] FX This work is supported by National Natural Science Foundation (60973060), the Research Fund for the Doctoral Program (20080004001) and Beijing Program (YB20081000401) and the Fundamental Research Funds for the Central Universities (2009YJS025). Informedia digital video understanding lab in Carneige Mellon University provides the portions of experimental materials and environments. The authors would like also thank the Associate Editor and the anonymous reviewers for their helpful comments. CR Amory A., 1998, P ED MEDIA ED TELECO, P50 Asakawa S., 2005, P EUROSPEECH, P165 Conati C, 2002, LECT NOTES COMPUT SC, V2363, P944 Cucchiarini C., 1998, P ICSLP, P1738 Cucchiarini C., 1997, P IEEE WORKSH ASRU S, P622 de Wet F, 2009, SPEECH COMMUN, V51, P864, DOI 10.1016/j.specom.2009.03.002 Franco H, 2000, SPEECH COMMUN, V30, P121, DOI 10.1016/S0167-6393(99)00045-X Franco H.L., 1997, P ICASSP, P1465 Hauptmann Alex, 2012, IEEE INT C MULT EXP Hodges B, 2009, PRAGMAT COGN, V17, P628, DOI 10.1075/p&c.17.3.08hod Huang Jui-Ting, 2006, SPEECH PROSODY Kearney P., 2004, P ED MEDIA WORLD C E, P3915 Li XX, 2007, LECT NOTES COMPUT SC, V4402, P188 Minematsu N., 2004, P ICSLP, P1317 Neueyer L., 2000, SPEECH COMMUN, V30, P83 Neumeyer L., 1996, P ICSLP, P217 Pan ZG, 2006, COMPUT GRAPH-UK, V30, P20, DOI 10.1016/j.cag.2005.10.004 Rabiner L, 1993, FUNDAMENTALS SPEECH Raux A., 2002, P ICSLP, P737 Tsubota Y., 2004, P ICSLP, P849 Van Lier V., 2000, SOCIOCULTURAL THEORY, P245 Virvou M, 2008, COMPUT EDUC, V50, P154, DOI 10.1016/j.compedu.2006.04.004 Williams G, 1999, COMPUT SPEECH LANG, V13, P395, DOI 10.1006/csla.1999.0129 Witt SM, 2000, SPEECH COMMUN, V30, P95, DOI 10.1016/S0167-6393(99)00044-8 Witt S.M., 1999, THESIS Young MF, 2000, THEORETICAL FOUNDATIONS OF LEARNING ENVIRONMENTS, P147 Young M.F., 2004, HDB RES ED COMMUNICA Yue Ming, 2012, IMAGE VISIO IN PRESS Zhang YB, 2008, INT CONF ACOUST SPEE, P5065 NR 29 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2013 VL 55 IS 1 BP 71 EP 83 DI 10.1016/j.specom.2012.06.007 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 055NF UT WOS:000312422900006 ER PT J AU Wang, FJ Swegles, K AF Wang, Fangju Swegles, Kyle TI Modeling user behavior online for disambiguating user input in a spoken dialogue system SO SPEECH COMMUNICATION LA English DT Article DE Spoken dialogue system; Automatic speech recognition; Natural language processing; Disambiguation; Reinforcement learning; User behavior modeling ID SIMULATION; STATE AB A spoken dialogue system (SDS) interacts with its user in a spoken natural language. It interprets user speech input and responds to the user. User speech in a spoken natural language may be ambiguous. A challenge in building an SDS is dealing with ambiguity. 
Without good abilities for disambiguation, an SDS can hardly have meaningful and smooth dialogues with its user in practical applications. The existing techniques for disambiguation are mainly based on statistical knowledge about language use. In practical situations, such knowledge alone is inadequate. In our research, we develop a new disambiguation technique, which is based on application of knowledge about user activity behavior, in addition to knowledge about language use. The technique is named MUBOD, standing for modeling user behavior online for disambiguation. The core component of MUBOD is an online reinforcement learning algorithm that is used to learn the knowledge and apply the knowledge for disambiguation. In this paper, we describe the technique and its implementation, and present and analyze some initial experimental results. (C) 2012 Elsevier B.V. All rights reserved. C1 [Wang, Fangju; Swegles, Kyle] Univ Guelph, Sch Comp Sci, Guelph, ON N1G 2W1, Canada. RP Wang, FJ (reprint author), Univ Guelph, Sch Comp Sci, Guelph, ON N1G 2W1, Canada. EM fjwang@uoguelph.ca FU Natural Sciences and Engineering Research Council of Canada (NSERC) FX This research is supported by the Natural Sciences and Engineering Research Council of Canada (NSERC). This paper is based on Kyle Swegles's MSc thesis work. CR Chickering DM, 2007, USER MODEL USER-ADAP, V17, P71, DOI 10.1007/s11257-006-9020-7 Chotimongkol A., 2001, P 7 EUR C SPEECH COM, P1829 Cifarelli C, 2007, J CLASSIF, V24, P205, DOI 10.1007/s00357-007-0012-z Frampton M, 2009, KNOWL ENG REV, V24, P375, DOI 10.1017/S0269888909990166 Frampton M, 2006, COLING/ACL 2006, VOLS 1 AND 2, PROCEEDINGS OF THE CONFERENCE, P185 Gabsdil M., 2004, P ANN M ASS COMP LIN, P343, DOI 10.3115/1218955.1218999 Gasic M., 2011, P IEEE WORKSH AUT SP, P312 Griol D, 2008, SPEECH COMMUN, V50, P666, DOI 10.1016/j.specom.2008.04.001 Gruenstein A., 2008, P SIGDIAL, P11, DOI 10.3115/1622064.1622067 Higashinaka R, 2006, SPEECH COMMUN, V48, P417, DOI 10.1016/j.specom.2005.06.011 JOKINEN K., 2010, SPOKEN DIALOGUE SYST Jonson R., 2006, P SPOK LANG TECHN WO, P174 Jung S, 2009, COMPUT SPEECH LANG, V23, P479, DOI 10.1016/j.csl.2009.03.002 Jurafsky D., 2007, SPEECH LANGUAGE PROC Jurcicek F, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 1-2, P90 Keizer S, 2008, 2008 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY: SLT 2008, PROCEEDINGS, P121, DOI 10.1109/SLT.2008.4777855 Lemon O., 2009, P EACL, P505, DOI 10.3115/1609067.1609123 Levin E., 2000, IEEE T SPEECH AUDIO, V8, P1 Lidstone George James, 1920, T FACULTY ACTUARIES, V8, P182 Litman D.J., 2000, P 18 C COMP LING ASS, V1, P502, DOI 10.3115/990820.990893 Manning C. D., 1999, FDN STAT NATURAL LAN Png SW, 2011, INT CONF ACOUST SPEE, P2156 Schatzmann J., 2006, KNOWL ENG REV, V21, P1 Scheffler Konrad, 2002, P 2 INT C HUM LANG T, P12, DOI 10.3115/1289189.1289246 Sutton R. 
S., 2005, REINFORCEMENT LEARNI Tetreault JR, 2008, SPEECH COMMUN, V50, P683, DOI 10.1016/j.specom.2008.05.002 Thomson B, 2010, COMPUT SPEECH LANG, V24, P562, DOI 10.1016/j.csl.2009.07.003 Thomson B., 2010, SPOK LANG TECHN WORK, P271 Torres F, 2008, COMPUT SPEECH LANG, V22, P230, DOI 10.1016/j.csl.2007.09.002 Walker W., 2004, SPHINX 4 FLEXIBLE OP Williams J., 2005, P KNOWL REAS PRACT D Williams JD, 2007, COMPUT SPEECH LANG, V21, P393, DOI 10.1016/j.csl.2006.06.008 NR 32 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2013 VL 55 IS 1 BP 84 EP 98 DI 10.1016/j.specom.2012.06.006 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 055NF UT WOS:000312422900007 ER PT J AU Sepulveda, A Guido, RC Castellanos-Dominguez, G AF Sepulveda, Alexander Guido, Rodrigo Capobianco Castellanos-Dominguez, G. TI Estimation of relevant time-frequency features using Kendall coefficient for articulator position inference SO SPEECH COMMUNICATION LA English DT Article DE Articulatory information; Wavelet-packet transform; Kendall coefficient; GMM; Neural networks ID VOCAL-TRACT AB The determination of relevant acoustic information for the inference of articulators position is an open issue. This paper presents a method to estimate those acoustic features better related to articulators movement. The input feature set is based on time-frequency representation calculated from the speech signal, whose parametrization is achieved using the wavelet-packet transform. The main focus is on measuring the relevant acoustic information, in terms of statistical association, for the inference of articulator positions. The rank correlation Kendall coefficient is used as the relevance measure. Attained statistical association is validated using the chi(2) information measure. The maps of relevant time frequency features are calculated for the MOCHA-TIMIT database, where the articulatory information is represented by trajectories of specific positions in the vocal tract. Relevant maps are estimated over the whole speech signal as well as on specific phones, for which a given articulator is known to be critical. The usefulness of the relevant maps is tested in an acoustic-to-articulatory mapping system based on gaussian mixture models. (C) 2012 Elsevier B.V. All rights reserved. C1 [Guido, Rodrigo Capobianco] Univ Sao Paulo, Inst Phys Sao Carlos IFSC, Dept Phys & Informat, BR-13566590 Sao Carlos, SP, Brazil. [Sepulveda, Alexander; Castellanos-Dominguez, G.] Univ Nacl Colombia, Signal Proc & Recognit Grp, Manizales, Colombia. RP Guido, RC (reprint author), Univ Sao Paulo, Inst Phys Sao Carlos IFSC, Dept Phys & Informat, Ave Trabalhador Saocarlense 400, BR-13566590 Sao Carlos, SP, Brazil. EM guido@ifsc.usp.br FU Administrative Department of Science, Technology and Innovation of Colombia (COLCIENCIAS); Red de Macrouniversidades de America Latina-Grupo Santander FX We thank the reviewers for valuable suggestions and constructive criticisms of an earlier version of this paper. This work was supported mainly by Administrative Department of Science, Technology and Innovation of Colombia (COLCIENCIAS); and, it was also financed by the Red de Macrouniversidades de America Latina-Grupo Santander, who provided the facilities to carry out a six month internship at SpeechLab, IFSC, University of Sao Paulo (Sao Carlos, SP, Brazil). 
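The abstract above measures relevance as the statistical association, via the Kendall rank correlation coefficient, between time-frequency features and articulator trajectories. Below is a minimal sketch of such a relevance map; the array layout (frames by wavelet-packet energies, one articulator coordinate per frame) is an assumption for illustration, not the authors' implementation.

import numpy as np
from scipy.stats import kendalltau

def relevance_map(features, trajectory):
    # Kendall tau between each time-frequency feature and an articulator trajectory.
    # features: (n_frames, n_features) wavelet-packet energies (hypothetical layout)
    # trajectory: (n_frames,) articulator position, e.g. a lower-lip y-coordinate
    taus = np.empty(features.shape[1])
    for j in range(features.shape[1]):
        tau, _ = kendalltau(features[:, j], trajectory)
        taus[j] = tau
    return np.abs(taus)  # magnitude of association used as a relevance score

# usage sketch: rank features by relevance before acoustic-to-articulatory mapping
# rel = relevance_map(wp_energies, lip_y); top = np.argsort(rel)[::-1][:50]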
CR Addison P.S., 2002, ILLUSTRATED WAVELET Akansu AN, 2001, MULTIRESOLUTION SIGN Al-Moubayed S., 2010, INTERSPEECH 2010 Bishop C. M., 2006, PATTERN RECOGNITION Choueiter G., 2007, IEEE T AUDIO SPEECH Dickinson J., 2003, NONPARAMETRIC STAT I Farooq O, 2001, IEEE SIGNAL PROC LET, V8, P196, DOI 10.1109/97.928676 Hasegawa-Johnson M., 2000, INT C SPOK LANG PROC, P133 Hogden J., 1996, J ACOUSTICAL SOC AM, V100 Jackson P., 2009, SPEECH COMMUNICATION Kent R. D., 2002, ACOUSTIC ANAL SPEECH Maji P, 2009, IEEE T BIO-MED ENG, V56, P1063, DOI 10.1109/TBME.2008.2004502 Mallat S., 1998, WAVELET TOUR SIGNAL Ozbek Y., 2011, IEEE T AUDIO SPEECH Papcun G., 1992, J ACOUSTICAL SOC AM Richmond K, 2003, COMPUT SPEECH LANG, V17, P153, DOI 10.1016/S0885-2308(03)00005-6 Richmond Korin, 2001, THESIS U EDINBURGH E Schroeter J, 1994, IEEE T SPEECH AUDI P, V2, P133, DOI 10.1109/89.260356 Silva J., 2009, IEEE T SIGNAL PROCES, V57 Sorokin VN, 2000, SPEECH COMMUN, V30, P55, DOI 10.1016/S0167-6393(99)00031-X Suzuki T., 2009, BMC BIOINFORMATICS, V10 Toda T, 2008, SPEECH COMMUN, V50, P215, DOI 10.1016/j.specom.2007.09.001 Toutios A., 2008, 16 EUR SIGN PROC C E Wrench A., 1999, MOCHA TIMIT ARTICULA Yang HH, 2000, SPEECH COMMUN, V31, P35, DOI 10.1016/S0167-6393(00)00007-8 Zhang L, 2008, IEEE SIGNAL PROC LET, V15, P245, DOI 10.1109/LSP.2008.917004 NR 26 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2013 VL 55 IS 1 BP 99 EP 110 DI 10.1016/j.specom.2012.06.005 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 055NF UT WOS:000312422900008 ER PT J AU Yagli, C Turan, MAT Erzin, E AF Yagli, Can Turan, M. A. Tugtekin Erzin, Engin TI Artificial bandwidth extension of spectral envelope along a Viterbi path SO SPEECH COMMUNICATION LA English DT Article DE Artificial bandwidth extension; Source-filter separation; Line spectral frequency; Joint temporal analysis ID NARROW-BAND; SPEECH AB In this paper, we propose a hidden Markov model (HMM)-based wideband spectral envelope estimation method for the artificial bandwidth extension problem. The proposed HMM-based estimator decodes an optimal Viterbi path based on the temporal contour of the narrowband spectral envelope and then performs the minimum mean square error (MMSE) estimation of the wideband spectral envelope on this path. Experimental evaluations are performed to compare the proposed estimator to the state-of-the-art HMM and Gaussian mixture model based estimators using both objective and subjective evaluations. Objective evaluations are performed with the log-spectral distortion (LSD) and the wideband perceptual evaluation of speech quality (PESQ) metrics. Subjective evaluations are performed with the A/B pair comparison listening test. Both objective and subjective evaluations yield that the proposed wideband spectral envelope estimator consistently improves performances over the state-of-the-art estimators. (C) 2012 Elsevier B.V. All rights reserved. C1 [Yagli, Can; Turan, M. A. Tugtekin; Erzin, Engin] Koc Univ, Coll Engn, Multimedia Vis & Graph Lab, TR-34450 Istanbul, Turkey. RP Erzin, E (reprint author), Koc Univ, Coll Engn, Multimedia Vis & Graph Lab, TR-34450 Istanbul, Turkey. 
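The record above decodes a Viterbi state path from the temporal contour of the narrowband spectral envelope and then takes a per-state MMSE estimate of the wideband envelope. Below is a rough, generic NumPy illustration of that idea, not the paper's estimator: the per-state Gaussians, transition matrix and joint narrowband/wideband statistics are assumed to come from training and are passed in as hypothetical arrays.

import numpy as np

def viterbi(loglik, log_trans, log_prior):
    # loglik: (T, K) per-frame state log-likelihoods of the narrowband features
    T, K = loglik.shape
    delta = log_prior + loglik[0]
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans      # scores[i, j]: best path ending in i, moving to j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + loglik[t]
    path = np.empty(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path

def mmse_wideband(x_nb, path, mu_x, mu_y, S_yx, S_xx):
    # Per-frame conditional mean E[y | x, state] under a joint Gaussian per state.
    # mu_x: (K, dx), mu_y: (K, dy), S_yx: (K, dy, dx), S_xx: (K, dx, dx)
    y = np.empty((len(path), mu_y.shape[1]))
    for t, k in enumerate(path):
        gain = S_yx[k] @ np.linalg.inv(S_xx[k])
        y[t] = mu_y[k] + gain @ (x_nb[t] - mu_x[k])
    return y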
EM canyagli@ku.edu.tr; mturan@ku.edu.tr; eerzin@ku.edu.tr CR Agiomyrgiannakis Y, 2007, IEEE T AUDIO SPEECH, V15, P377, DOI 10.1109/TASL.2006.881702 [Anonymous], 1993, SPEC INT REF SYST Cheng YM, 1994, IEEE T SPEECH AUDI P, V2, P544 Enbom N., 1999, IEEE WORKSH SPEECH C, P171 Erzin E, 2009, IEEE T AUDIO SPEECH, V17, P1316, DOI 10.1109/TASL.2009.2016733 ITU- T, 2005, WID EXT REC P 862 AS Jax P., 2004, AUDIO BANDWIDTH EXTE, P171 Jax P, 2003, SIGNAL PROCESS, V83, P1707, DOI 10.1016/S0165-1684(03)00082-3 Park KY, 2000, INT CONF ACOUST SPEE, P1843 Voran S., 1997, IEEE WORKSH SPEECH C, P81 Yagli C., 2012, ABE SPEECH SYNTHESIS Yagli C, 2011, INT CONF ACOUST SPEE, P5096 NR 12 TC 2 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2013 VL 55 IS 1 BP 111 EP 118 DI 10.1016/j.specom.2012.07.003 PG 8 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 055NF UT WOS:000312422900009 ER PT J AU Fan, X Hansen, JHL AF Fan, Xing Hansen, John H. L. TI Acoustic analysis and feature transformation from neutral to whisper for speaker identification within whispered speech audio streams SO SPEECH COMMUNICATION LA English DT Article DE Speaker identification; Whispered speech; Vocal effort; Robust speaker verification ID MAXIMUM-LIKELIHOOD; EM ALGORITHM; RECOGNITION; VOWELS AB Whispered speech is an alternative speech production mode from neutral speech, which is used by talkers intentionally in natural conversational scenarios to protect privacy and to avoid certain content from being overheard or made public. Due to the profound differences between whispered and neutral speech in vocal excitation and vocal tract function, the performance of automatic speaker identification systems trained with neutral speech degrades significantly. In order to better understand these differences and to further develop efficient model adaptation and feature compensation methods, this study first analyzes the speaker and phoneme dependency of these differences by a maximum likelihood transformation estimation from neutral speech towards whispered speech. Based on analysis results, this study then considers a feature transformation method in the training phase that leads to a more robust speaker model for speaker ID on whispered speech without using whispered adaptation data from test speakers. Three estimation methods that model the transformation from neutral to whispered speech are applied, including convolutional transformation (ConvTran), constrained maximum likelihood linear regression (CMLLR), and factor analysis (FA). A speech mode independent (SMI) universal background model (UBM) is trained using collected real neutral features and transformed pseudo-whisper features generated with the estimated transformation. Text-independent, closed-set speaker ID results using the UT-VocalEffort II corpus show performance improvement by using the proposed training framework. The best performance of 88.87% is achieved by using the ConvTran model, which represents a relative improvement of 46.26% compared to the 79.29% accuracy of the GMM-UBM baseline system. This result suggests that synthesizing pseudo-whispered speaker and background training data with the ConvTran model results in improved speaker ID robustness to whispered speech. (C) 2012 Elsevier B.V. All rights reserved. C1 [Hansen, John H. L.]
Univ Texas Dallas, Dept Elect Engn, Erik Jonsson Sch Engn & Comp Sci, Ctr Robust Speech Syst, Richardson, TX 75080 USA. RP Hansen, JHL (reprint author), Univ Texas Dallas, Dept Elect Engn, Erik Jonsson Sch Engn & Comp Sci, Ctr Robust Speech Syst, 2601 N Floyd Rd,EC33, Richardson, TX 75080 USA. EM John.Hansen@utdallas.edu FU AFRL [FA8750-09-C-0067]; University of Texas at Dallas from the Distinguished University Chair in Telecommunications Engineering FX This project was funded by AFRL through a subcontract to RADC Inc. under FA8750-09-C-0067 (Approved for public release, distribution unlimited), and partially by the University of Texas at Dallas from the Distinguished University Chair in Telecommunications Engineering held by J. Hansen. CR Bou-Ghazale SE, 1998, IEEE T SPEECH AUDI P, V6, P201, DOI 10.1109/89.668815 DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 Deguchi D., 2010, APSIPA ASC BIOP SING, P502 Dehak N., 2009, INTERSPEECH, P1559 DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1 Deng L, 2004, IEEE T SPEECH AUDI P, V12, P218, DOI 10.1109/TSA.2003.822627 Eklund I, 1996, PHONETICA, V54, P1 Fan X, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P1313 Fan X., 2009, ISCA INTERSPEECH 09, P896 Gales MJF, 1998, COMPUT SPEECH LANG, V12, P75, DOI 10.1006/csla.1998.0043 Garofolo JS, 1993, LINGUISTIC DATA CONS GavidiaCeballos L, 1996, IEEE T BIO-MED ENG, V43, P373, DOI 10.1109/10.486257 Ito T, 2005, SPEECH COMMUN, V45, P139, DOI 10.1016/j.specom.2003.10.005 Jin Q., 2007, IEEE INT C MULT EXP, P1027 Jovicic ST, 1998, ACUSTICA, V84, P739 KALLAIL KJ, 1984, J SPEECH HEAR RES, V27, P245 Kenny P, 2007, IEEE T AUDIO SPEECH, V15, P1435, DOI 10.1109/TASL.2006.881693 Kenny P, 2005, IEEE T SPEECH AUDI P, V13, P345, DOI 10.1109/TSA.2004.840940 Kullback S., 1968, INFORM THEORY STAT Lei Y, 2009, INT CONF ACOUST SPEE, P4337, DOI 10.1109/ICASSP.2009.4960589 Li JY, 2009, COMPUT SPEECH LANG, V23, P389, DOI 10.1016/j.csl.2009.02.001 Matsuura M, 1999, ADV EARTHQ ENGN, P133 MEYEREPPLER W, 1957, J ACOUST SOC AM, V29, P104, DOI 10.1121/1.1908631 Moreno PJ, 1996, INT CONF ACOUST SPEE, P733, DOI 10.1109/ICASSP.1996.543225 Morris RW, 2002, MED ENG PHYS, V24, P515, DOI 10.1016/S1350-4533(02)00060-7 THOMAS IB, 1969, J ACOUST SOC AM, V46, P468, DOI 10.1121/1.1911712 Tipping ME, 1999, J ROY STAT SOC B, V61, P611, DOI 10.1111/1467-9868.00196 Ye H, 2006, IEEE T AUDIO SPEECH, V14, P1301, DOI 10.1109/TSA.2005.860839 Zhang C., 2007, INTERSPEECH 07, P2289 Zhang C., 2009, INTERSPEECH, P860 NR 30 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2013 VL 55 IS 1 BP 119 EP 134 DI 10.1016/j.specom.2012.07.002 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 055NF UT WOS:000312422900010 ER PT J AU Gibert, G Leung, Y Stevens, CJ AF Gibert, Guillaume Leung, Yvonne Stevens, Catherine J. TI Control of speech-related facial movements of an avatar from video SO SPEECH COMMUNICATION LA English DT Article DE Talking head; Auditory-visual speech; Puppetry; Facial animation; Face tracking ID ACTIVE SHAPE MODEL; EXPRESSION; FACE; TRACKING; HEAD AB Several puppetry techniques have been recently proposed to transfer emotional facial expressions to an avatar from a user's video. 
Whereas generation of facial expressions may not be sensitive to small tracking errors, generation of speech-related facial movements would be severely impaired. Since incongruent facial movements can drastically influence speech perception, we proposed a more effective method to transfer speech-related facial movements from a user to an avatar. After a facial tracking phase, speech articulatory parameters (controlling the jaw and the lips) were determined from the set of landmark positions. Two additional processes calculated the articulatory parameters which controlled the eyelids and the tongue from the 2D Discrete Cosine Transform coefficients of the eyes and inner mouth images. A speech in noise perception experiment was conducted on 25 participants to evaluate the system. Increase in intelligibility was shown for the avatar and human auditory visual conditions compared to the avatar and human auditory-only conditions, respectively. Depending on the vocalic context, the results of the avatar auditory visual presentation were different: all the consonants were better perceived in /a/ vocalic context compared to /i/ and /u/ because of the lack of depth information retrieved from video. This method could be used to accurately animate avatars for hearing impaired people using information technologies and telecommunication. (C) 2012 Elsevier B.V. All rights reserved. C1 [Gibert, Guillaume] INSERM, U846, F-69500 Bron, France. [Gibert, Guillaume] Stem Cell & Brain Res Inst, F-69500 Bron, France. [Gibert, Guillaume] Univ Lyon 1, F-69003 Lyon, France. [Gibert, Guillaume; Leung, Yvonne; Stevens, Catherine J.] Univ Western Sydney, Marcs Inst, Penrith, NSW 2751, Australia. [Stevens, Catherine J.] Univ Western Sydney, Sch Social Sci & Psychol, Penrith, NSW 2751, Australia. RP Gibert, G (reprint author), Stem Cell & Brain Res Inst, INSERM, U846, 18 Ave Doyen Lepine, F-69675 Bron, France. EM guillaume.gibert@inserm.fr; y.leung@uws.edu.au; kj.stevens@uws.edu.au RI Gibert, Guillaume/M-5816-2014 FU Australian Research Council; National Health and Medical Research Council [TS0669874]; SWoOZ project [ANR 11 PDOC 019 01] FX We thank James Heathers for manually segmenting the images. This work was supported by the Thinking Head project, a Special Initiative scheme of the Australian Research Council and the National Health and Medical Research Council (TS0669874) (Burnham et al., 2006) and by the SWoOZ project (ANR 11 PDOC 019 01). CR Allbeck J., 2010, INTELLIGENT VIRTUAL, V6356, P420 Badin P., 2006, 7 INT SEM SPEECH PRO Bailly G., 2003, International Journal of Speech Technology, V6, DOI 10.1023/A:1025700715107 Benoit C, 1998, SPEECH COMMUN, V26, P117, DOI 10.1016/S0167-6393(98)00045-4 Besle J, 2004, EUR J NEUROSCI, V20, P2225, DOI 10.1111/j.1460-9568.2004.03670.x Boker SM, 2009, PHILOS T R SOC B, V364, P3485, DOI 10.1098/rstb.2009.0152 Brand M, 1999, P 26 ANN C COMP GRAP Burnham D., 2006, TS0669874 ARCNH MRC Caridakis G., 2007, LANGUAGE RESOURCES E, P41 Chibelushi C., 2003, CVONLINE LINE COMPEN COOTES TF, 1995, COMPUT VIS IMAGE UND, V61, P38, DOI 10.1006/cviu.1995.1004 Fanelli G., 2007, 3DTV C, P1 Forster KI, 2003, BEHAV RES METH INS C, V35, P116, DOI 10.3758/BF03195503 Gallou S., 2007, P 7 INT C INT VIRT A Gibert G, 2005, J ACOUST SOC AM, V118, P1144, DOI 10.1121/1.1944587 Gross R, 2010, IMAGE VISION COMPUT, V28, P807, DOI 10.1016/j.imavis.2009.08.002 Jordan T. 
R., 1998, HEARING EYE, P155 Kuratate T., 1998, INT C AUD VIS SPEECH, P185 Lee SW, 2007, IET COMPUT VIS, V1, P17, DOI 10.1049/iet-cvi:20045243 Massaro D, 2011, AM J PSYCHOL, V124, P341, DOI 10.5406/amerjpsyc.124.3.0341 Massaro D. W., 1998, PERCEIVING TALKING F MCGURK H, 1976, NATURE, V264, P746, DOI 10.1038/264746a0 Milborrow S, 2008, LECT NOTES COMPUT SC, V5305, P504 Morishima S, 2001, IEEE SIGNAL PROC MAG, V18, P26, DOI 10.1109/79.924886 Ouni S., 2007, EURASIP J AUDIO SPEE Ouni S., 2003, INT C PHON SCI ICPHS Reveret L., 2000, 6 INT C SPOK LANG PR Rosenblum LD, 1996, J SPEECH HEAR RES, V39, P1159 Saragih JM, 2011, AUTOMATIC FACE GESTU SUMBY WH, 1954, J ACOUST SOC AM, V26, P212, DOI 10.1121/1.1907309 Theobald BJ, 2009, LANG SPEECH, V52, P369, DOI 10.1177/0023830909103181 Theobald B.-J., 2007, P 9 861 INT C MULT I Viola P., 2001, P IEEE COMP SOC C CO, V1, P1 Vlasic D, 2005, ACM T GRAPHIC, V24, P426, DOI 10.1145/1073204.1073209 Weise T., 2009, 8 ACM SIGGRAPH EUR S NR 35 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2013 VL 55 IS 1 BP 135 EP 146 DI 10.1016/j.specom.2012.07.001 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 055NF UT WOS:000312422900011 ER PT J AU Lammert, A Goldstein, L Narayanan, S Iskarous, K AF Lammert, Adam Goldstein, Louis Narayanan, Shrikanth Iskarous, Khalil TI Statistical methods for estimation of direct and differential kinematics of the vocal tract SO SPEECH COMMUNICATION LA English DT Article DE Speech production; Direct kinematics; Differential kinematics; Task dynamics; Articulatory synthesis; Kinematic estimation; Statistical machine learning; Locally-weighted regression; Artificial neural networks ID TO-ARTICULATORY INVERSION; LOCALLY WEIGHTED REGRESSION; NEURAL-NETWORK MODEL; SPEECH PRODUCTION-MODEL; MARQUARDT ALGORITHM; MOTOR CONTROL; MOVEMENTS; ACOUSTICS; MANIPULATORS; GESTURES AB We present and evaluate two statistical methods for estimating kinematic relationships of the speech production system: artificial neural networks and locally-weighted regression. The work is motivated by the need to characterize this motor system, with particular focus on estimating differential aspects of kinematics. Kinematic analysis will facilitate progress in a variety of areas, including the nature of speech production goals, articulatory redundancy and, relatedly, acoustic-to-articulatory inversion. Statistical methods must be used to estimate these relationships from data since they are infeasible to express in closed form. Statistical models are optimized and evaluated - using a heldout data validation procedure on two sets of synthetic speech data. The theoretical and practical advantages of both methods are also discussed. It is shown that both direct and differential kinematics can be estimated with high accuracy, even for complex, nonlinear relationships. Locally-weighted regression displays the best overall performance, which may be due to practical advantages in its training procedure. Moreover, accurate estimation can be achieved using only a modest amount of training data, as judged by convergence of performance. The algorithms are also applied to real-time MRI data, and the results are generally consistent with those obtained from synthetic data. (C) 2012 Elsevier B.V. All rights reserved. C1 [Lammert, Adam; Narayanan, Shrikanth] Univ So Calif, SAIL, Los Angeles, CA 90089 USA. 
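The abstract above evaluates locally-weighted regression for estimating both the direct (forward) kinematic map and its differential form. Below is a minimal sketch of locally-weighted linear regression that returns a prediction and the local Jacobian at a query point; the array layout and kernel bandwidth are assumptions for illustration, not the authors' implementation.

import numpy as np

def lwr_predict(Xq, X, Y, h=0.1):
    # Locally-weighted affine regression around query point Xq.
    # X: (N, d_in) input samples (e.g. articulator positions)
    # Y: (N, d_out) output samples (e.g. acoustic or task variables)
    w = np.exp(-np.sum((X - Xq) ** 2, axis=1) / (2 * h ** 2))   # Gaussian kernel weights
    Xa = np.hstack([X, np.ones((len(X), 1))])                   # affine design matrix
    XtW = Xa.T * w                                              # equals Xa.T @ diag(w)
    beta = np.linalg.solve(XtW @ Xa, XtW @ Y)                   # (d_in + 1, d_out)
    yq = np.hstack([Xq, 1.0]) @ beta                            # direct kinematics at Xq
    J = beta[:-1].T                                             # local Jacobian dY/dX (differential kinematics)
    return yq, J

# usage sketch: yq, J = lwr_predict(x_query, X_train, Y_train, h=0.05)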
[Goldstein, Louis; Narayanan, Shrikanth; Iskarous, Khalil] Univ So Calif, Dept Linguist, Los Angeles, CA 90089 USA. [Goldstein, Louis; Iskarous, Khalil] Haskins Labs Inc, New Haven, CT 06511 USA. RP Lammert, A (reprint author), Univ So Calif, SAIL, 3710 McClintock Ave, Los Angeles, CA 90089 USA. EM lammert@usc.edu FU NIH NIDCD [02717]; NIH [DC008780, DC007124]; Annenberg Foundation FX This work was supported by NIH NIDCD Grant 02717, NIH R01 Grant DC008780, NIH Grant DC007124, as well as a graduate fellowship from the Annenberg Foundation. We would also like to acknowledge Elliot Saltzman for his technical insights, and Hosung Nam for his help with understanding TADA. CR ABBS JH, 1984, J NEUROPHYSIOL, V51, P705 Al Moubayed S, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 1-2, P937 Ananthakrishnan G., 2009, P INT BRIGHT UK, P2799 ATAL BS, 1978, J ACOUST SOC AM, V63, P1535, DOI 10.1121/1.381848 Atal B.S., 1989, J ACOUSTICAL SOC AM, P86 Atkeson CG, 1997, ARTIF INTELL REV, V11, P11, DOI 10.1023/A:1006559212014 BAILLY G, 1991, J PHONETICS, V19, P9 Balestrino A., 1984, P 9 IFAC WORLD C, V5, P2435 BENNETT DJ, 1991, IEEE T ROBOTIC AUTOM, V7, P597, DOI 10.1109/70.97871 Bernstein N., 1967, COORDINATION REGULAT Bishop C. M., 2006, PATTERN RECOGNITION BOE LJ, 1992, J PHONETICS, V20, P27 Bresch E, 2009, IEEE T MED IMAGING, V28, P323, DOI 10.1109/TMI.2008.928920 BULLOCK D, 1993, J COGNITIVE NEUROSCI, V5, P408, DOI 10.1162/jocn.1993.5.4.408 CLEVELAND WS, 1979, J AM STAT ASSOC, V74, P829, DOI 10.2307/2286407 CLEVELAND WS, 1988, J ECONOMETRICS, V37, P87, DOI 10.1016/0304-4076(88)90077-2 CLEVELAND WS, 1988, J AM STAT ASSOC, V83, P596, DOI 10.2307/2289282 Cootes TF, 2001, IEEE T PATTERN ANAL, V23, P681, DOI 10.1109/34.927467 Cybenko G., 1989, Mathematics of Control, Signals, and Systems, V2, DOI 10.1007/BF02551274 D'Souza A., 2001, P CIRAS Duch W., 1999, NEURAL COMPUTING SUR, V2, P163 Fels S., 2005, P AUD VIS SPEECH PRO, P119 Gerard J.-M., 2006, SPEECH PRODUCTION MO, P85 Gerard JM, 2003, REC RES DEV BIOMECH, V1, P49 Ghosh PK, 2011, J ACOUST SOC AM, V130, pEL251, DOI 10.1121/1.3634122 Ghosh PK, 2010, J ACOUST SOC AM, V128, P2162, DOI 10.1121/1.3455847 GUENTHER FH, 1994, BIOL CYBERN, V72, P43, DOI 10.1007/BF00206237 Guenther FH, 1998, PSYCHOL REV, V105, P611 GUENTHER FH, 1995, PSYCHOL REV, V102, P594 Guigon E, 2007, J NEUROPHYSIOL, V97, P331, DOI 10.1152/jn.00290.2006 HAGAN MT, 1994, IEEE T NEURAL NETWOR, V5, P989, DOI 10.1109/72.329697 Hinton GE, 2006, NEURAL COMPUT, V18, P1527, DOI 10.1162/neco.2006.18.7.1527 HIROYA S, 2002, ACOUST SPEECH SIG PR, P437 Hiroya S, 2004, IEEE T SPEECH AUDI P, V12, P175, DOI 10.1109/TSA.2003.822636 Hiroya S., 2003, P INT WORKSH SPEECH, P9 Hiroya S., 2002, P ICSLP, P2305 Hogden J, 1996, J ACOUST SOC AM, V100, P1819, DOI 10.1121/1.416001 HOLLERBACH JM, 1982, TRENDS NEUROSCI, V5, P189, DOI 10.1016/0166-2236(82)90111-4 Hollerbach JM, 1996, INT J ROBOT RES, V15, P573, DOI 10.1177/027836499601500604 Homilc K., 1989, NEURAL NETWORKS, V2 Iskarous K., 2003, P ICPHS Jacobs R. 
A., 1991, Neural Computation, V3, DOI 10.1162/neco.1991.3.1.79 JORDAN MI, 1992, J MATH PSYCHOL, V36, P396, DOI 10.1016/0022-2496(92)90029-7 JORDAN MI, 1992, COGNITIVE SCI, V16, P307, DOI 10.1207/s15516709cog1603_1 Jordan M.I., 1995, HDB BRAIN THEORY NEU Kaburagi T., 1998, P ICSLP Kello CT, 2004, J ACOUST SOC AM, V116, P2354, DOI 10.1121/1.1715112 Kelso S., 1984, J EXP PSYCHOL, V10, P812 KHATIB O, 1987, IEEE T ROBOTIC AUTOM, V3, P43 Lammert A., 2011, J ACOUST SOC AM, V130, P2549 Lammert A., 2010, P INTERSPEECH Lammert A.C., 2008, P SAPA 08, P29 Lawrence S., 1996, Proceedings of the Seventh Australian Conference on Neural Networks (ACNN'96) McGowan RS, 2009, J ACOUST SOC AM, V126, P2011, DOI 10.1121/1.3184581 MERMELST.P, 1965, J ACOUST SOC AM, V37, P1186, DOI 10.1121/1.1939448 MERMELST.P, 1973, J ACOUST SOC AM, V53, P1070, DOI 10.1121/1.1913427 Mitra V, 2010, IEEE J-STSP, V4, P1027, DOI 10.1109/JSTSP.2010.2076013 Mitra V., 2009, P ICASSP Mitra V., 2011, IEEE T AUDIO SPEECH Mooring B.W., 1991, FUNDAMENTALS MANIPUL Mottet D, 2001, J EXP PSYCHOL HUMAN, V27, P1275, DOI 10.1037/0096-1523.27.6.1275 Nakamura K., 2006, P ICASSP NAKAMURA Y, 1986, J DYN SYST-T ASME, V108, P163 Nakanishi J, 2008, INT J ROBOT RES, V27, P737, DOI 10.1177/0278364908091463 Nam H., 2004, J ACOUST SOC AM, V115, P2430, DOI DOI 10.1016/J.SPECOM.2005.07.003 Nam H., 2006, TADA TASK DYNAMICS A Nam H., 2010, P ICASSP Narayanan S., 2004, JASA, V109, P2446 Narayanan S., 2011, P INTERSPEECH Panchapagesan S, 2011, J ACOUST SOC AM, V129, P2144, DOI 10.1121/1.3514544 PAPCUN G, 1992, J ACOUST SOC AM, V92, P688, DOI 10.1121/1.403994 Payan Y., 1997, SPEECH COMMUN, V22, P187 Perrier P, 1996, J PHONETICS, V24, P53, DOI 10.1006/jpho.1996.0005 Perrier P, 2003, J ACOUST SOC AM, V114, P1582, DOI 10.1121/1.1587737 Qin C., 2007, P INTERSPEECH Qin C., 2010, P INTERSPEECH Rahim M. G., 1991, P IEEE INT C AC SPEE, P485, DOI 10.1109/ICASSP.1991.150382 Richmond K., 2010, P INTERSPEECH, P577 Rubin P., 1996, P 1 ETRW SPEECH PROD Rumelhart D., 1986, PARALLEL DISTRIBUTED, V1 Rumelhart D., 1986, NATURE, V323 Saltzman E, 2000, HUM MOVEMENT SCI, V19, P499, DOI 10.1016/S0167-9457(00)00030-0 Saltzman E., 2006, DYNAMICS SPEECH PROD SALTZMAN E, 1987, PSYCHOL REV, V94, P84, DOI 10.1037//0033-295X.94.1.84 Saltzman E. L., 1989, ECOL PSYCHOL, V1, P333, DOI 10.1207/s15326969eco0104_2 Scholz J. P., 1999, EXP BRAIN RES, V126, P189 Schroeter J, 1994, IEEE T SPEECH AUDI P, V2, P133, DOI 10.1109/89.260356 Sciavicco L., 2005, MODELLING CONTROL RO Shiga Y., 2004, P INTERSPEECH Sklar M.E., 1989, P 2 C REC ADV ROB, P178 SOECHTING JF, 1982, BRAIN RES, V248, P392, DOI 10.1016/0006-8993(82)90601-1 Ting J.A., 2008, P ICRA PAS CA Toda T., 2004, P 5 ISCA SPEECH SYNT, P31 Toda T., 2008, SPEECH COMMUNICATION, V50 Toledo A, 2005, IEEE T NEURAL NETWOR, V16, P988, DOI 10.1109/TNN.2005.849849 Vogt F., 2005, J ACOUST SOC AM, V117, P2542 Vogt F., 2006, P ISSP, P51 WAKITA H, 1973, IEEE T ACOUST SPEECH, VAU21, P417, DOI 10.1109/TAU.1973.1162506 WAMPLER CW, 1986, IEEE T SYST MAN CYB, V16, P93, DOI 10.1109/TSMC.1986.289285 WHITNEY DE, 1969, IEEE T MAN MACHINE, VMM10, P47, DOI 10.1109/TMMS.1969.299896 Wilamowski BM, 2008, IEEE T IND ELECTRON, V55, P3784, DOI 10.1109/TIE.2008.2003319 Winkler R., 2011, P INTERSPEECH Winkler R., 2011, P ISSP Wolovich W. A., 1984, Proceedings of the 23rd IEEE Conference on Decision and Control (Cat. No. 84CH2093-3) Wrench A. 
A., 2000, P 5 SEM SPEECH PROD, P305 Wu JM, 2008, IEEE T NEURAL NETWOR, V19, P2032, DOI 10.1109/TNN.2008.2003271 NR 106 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2013 VL 55 IS 1 BP 147 EP 161 DI 10.1016/j.specom.2012.08.001 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 055NF UT WOS:000312422900012 ER PT J AU Deoras, A Mikolov, T Kombrink, S Church, K AF Deoras, Anoop Mikolov, Tomas Kombrink, Stefan Church, Kenneth TI Approximate inference: A sampling based modeling technique to capture complex dependencies in a language model SO SPEECH COMMUNICATION LA English DT Article DE Long-span language models; Recurrent neural networks; Speech recognition; Decoding AB In this paper, we present strategies to incorporate long context information directly during the first pass decoding and also for the second pass lattice re-scoring in speech recognition systems. Long-span language models that capture complex syntactic and/or semantic information are seldom used in the first pass of large vocabulary continuous speech recognition systems due to the prohibitive increase in the size of the sentence-hypotheses search space. Typically, n-gram language models are used in the first pass to produce N-best lists, which are then re-scored using long-span models. Such a pipeline produces biased first pass output, resulting in sub-optimal performance during re-scoring. In this paper we show that computationally tractable variational approximations of the long-span and complex language models are a better choice than the standard n-gram model for the first pass decoding and also for lattice re-scoring. (C) 2012 Elsevier B.V. All rights reserved. C1 [Deoras, Anoop] Microsoft Corp, Mountain View, CA 94043 USA. [Mikolov, Tomas; Kombrink, Stefan] Brno Univ Technol, Speech FIT, Brno, Czech Republic. [Church, Kenneth] IBM TJ Watson Res Ctr, Yorktown Hts, NY USA. RP Deoras, A (reprint author), Microsoft Corp, 1065 La Ave, Mountain View, CA 94043 USA. EM Anoop.Deoras@microsoft.com; imikolov@fit.vutbr.cz; kombrink@fit.vutbr.cz; kwchurch@us.ibm.com FU HLT-COE Johns Hopkins University; Technology Agency of the Czech Republic [TA01011328]; Grant Agency of Czech Republic [102/08/0707] FX HLT-COE Johns Hopkins University partially funded this research. BUT researchers were partly funded by the Technology Agency of the Czech Republic Grant No. TA01011328, and Grant Agency of Czech Republic Project No. 102/08/0707. Frederick Jelinek's contribution is acknowledged towards this work. He would be a co-author if he were available and willing to give his consent. We thank anonymous reviewers for their many helpful comments and suggestions. CR Allauzen C., 2003, P ASS COMP LING ACL Bengio Y., 2007, SCALING LEARNING ALG Bickel P. J., 1977, MATH STAT BASIC IDEA Bishop C. 
M., 2006, PATTERN RECOGNITION Boden M., 2002, GUIDE RECURRENT NEUR Calzolari N., 2010, P 7 C INT LANG RES E Chelba C, 2000, COMPUT SPEECH LANG, V14, P283, DOI 10.1006/csla.2000.0147 Chen SF, 2009, 2009 IEEE WORKSHOP ON AUTOMATIC SPEECH RECOGNITION & UNDERSTANDING (ASRU 2009), P299, DOI 10.1109/ASRU.2009.5373380 Cover T M, 1991, ELEMENTS INFORM THEO Deng L., 2011, P AS PAC SIGN INF PR Deoras A., 2010, P IEEE SPOK LANG TEC Deoras A., 2011, P IEEE INT C AC SPEE Deoras A., 2011, THESIS J HOPKINS U Deoras A., 2011, P 2011 C EMP METH NA Deoras A, 2009, 2009 IEEE WORKSHOP ON AUTOMATIC SPEECH RECOGNITION & UNDERSTANDING (ASRU 2009), P282, DOI 10.1109/ASRU.2009.5373438 ELMAN JL, 1990, COGNITIVE SCI, V14, P179, DOI 10.1207/s15516709cog1402_1 Filimonov D., 2009, P 2009 C EMP METH NA GEMAN S, 1984, IEEE T PATTERN ANAL, V6, P721 Glass J., 2007, P ICSLP INT Hain T, 2005, P RICH TRANSCR 2005 Jordon M.I., 1998, LEARNING GRAPHICAL M Mikolov T., 2011, P ICSLP INT Mikolov T., 2010, P ICSLP INT Mikolov T., 2011, P IEEE WORKSH AUT SP Mikolov T., 2011, P IEEE INT C AC SPEE Momtazi S., 2010, P ICSLP INT Nederhof M.-J., 2004, P ASS COMP LING ACL Nederhof M.-J., 2005, P ASS COMP LING ACL Roark B, 2001, COMPUT LINGUIST, V27, P249, DOI 10.1162/089120101750300526 Rumelhart D.E., 1986, PARALLEL DISTRIBUTED, V1, P318 SHANNON CE, 1951, AT&T TECH J, V30, P50 Soltau H., 2010, P IEEE WORKSH SPOK L Stolcke A., 1998, P DARPA BROADC NEWS, P8 Stolcke A., 1994, P ASS COMP LING ACL Xu P., 2005, THESIS J HOPKINS U NR 35 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2013 VL 55 IS 1 BP 162 EP 177 DI 10.1016/j.specom.2012.08.004 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 055NF UT WOS:000312422900013 ER PT J AU Cmejla, R Rusz, J Bergl, P Vokral, J AF Cmejla, Roman Rusz, Jan Bergl, Petr Vokral, Jan TI Bayesian changepoint detection for the automatic assessment of fluency and articulatory disorders SO SPEECH COMMUNICATION LA English DT Article DE Changepoint detection; Speech pathology; Speech signal processing; Disfluency; Articulation disorder ID CHANGE-POINT ANALYSIS; PARKINSONS-DISEASE; TIME-SERIES; SPEECH; VOICE; DYSARTHRIA; SEVERITY; SPEAKERS; HEALTHY; MODEL AB The accurate changepoint detection of different signal segments is a frequent challenge in a wide range of applications. With regard to speech utterances, the changepoints are related to significant spectral changes, mostly represented by the borders between two phonemes. The main aim of this study is to design a novel Bayesian autoregressive changepoint detector (BACD) and test its feasibility in the evaluation of fluency and articulatory disorders. The originality of the proposed method consists in its normalizing of a posteriori probability using Bayesian evidence and designing a recursive algorithm for reliable practice. For further evaluation of the BACD, we used data from (a) 118 people with various severity of stuttering to assess the extent of speech disfluency using a short reading passage, and (b) 24 patients with early Parkinson's disease and 22 healthy speakers for evaluation of articulation accuracy using fast syllable repetition. Subsequently, we designed two measures for each type of disorder. While speech disfluency has been related to greater distances between spectral changes, inaccurate dysarthric articulation has instead been associated with lower spectral changes. 
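As a simplified stand-in for the Bayesian autoregressive changepoint idea described above, the sketch below scores candidate changepoints in one analysis window using least-squares AR fits and a profiled Gaussian likelihood, normalized into a pseudo-posterior; it is not the paper's recursive evidence computation, and the model order and margin are illustrative choices.

import numpy as np

def ar_rss(x, p=8):
    # Residual sum of squares of an order-p autoregressive least-squares fit.
    X = np.column_stack([x[p - i - 1:len(x) - i - 1] for i in range(p)])
    y = x[p:]
    coef, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ coef
    return float(r @ r), len(y)

def changepoint_posterior(x, p=8, margin=40):
    # Pseudo-posterior over changepoint positions k (flat prior; model-size
    # penalties are identical for all k and cancel in the normalization).
    ks = np.arange(margin, len(x) - margin)
    scores = np.empty(len(ks))
    for i, k in enumerate(ks):
        rss1, n1 = ar_rss(x[:k], p)
        rss2, n2 = ar_rss(x[k:], p)
        scores[i] = -0.5 * (n1 * np.log(rss1 / n1 + 1e-12) + n2 * np.log(rss2 / n2 + 1e-12))
    post = np.exp(scores - scores.max())
    return ks, post / post.sum()

# usage sketch: ks, post = changepoint_posterior(frame_of_speech); k_hat = ks[np.argmax(post)]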
These findings have been confirmed by statistically significant differences, which were achieved in separating several degrees of disfluency and distinguishing healthy from parkinsonian speakers. In addition, a significant correlation was found between the automatic assessment of speech fluency and the judgment of human experts. In conclusion, the method proposed provides a cost-effective, easily applicable and freely available evaluation of speech disorders, as well as other areas requiring reliable techniques for changepoint detection. In a more modest scope, BACD may be used in diagnosis of disease severity, monitoring treatment, and support for therapist evaluation. (C) 2012 Elsevier B.V. All rights reserved. C1 [Cmejla, Roman; Rusz, Jan; Bergl, Petr] Czech Tech Univ, Dept Circuit Theory, Fac Elect Engn, Prague 16627 6, Czech Republic. [Rusz, Jan] Charles Univ Prague, Fac Med 1, Dept Neurol, Prague, Czech Republic. [Vokral, Jan] Gen Univ Hosp Prague, Fac Med 1, Dept Phoniatr, Prague, Czech Republic. RP Cmejla, R (reprint author), Czech Tech Univ, Dept Circuit Theory, Fac Elect Engn, Prague 16627 6, Czech Republic. EM cmejla@fel.cvut.cz FU Czech Science Foundation [GACR P102/12/2230]; Czech Ministry of Health [NT 12288-5/2011, NT11460-4/2010]; Czech Ministry of Education [MSM 0021620849] FX The authors are obliged to Miroslava Hrbkova, Libor Cerny, Hana Ruzickova, Jiri Klempir, Veronika Majerova, Jana Picmausova, Jan Roth, and Evzen Ruzicka for provision of clinical data. The study was partly supported by the Czech Science Foundation, project GACR P102/12/2230, Czech Ministry of Health, projects NT 12288-5/2011 and NT11460-4/2010, and Czech Ministry of Education, project MSM 0021620849. CR Ackermann H, 1997, BRAIN LANG, V56, P312, DOI 10.1006/brln.1997.1851 Ajmera J, 2004, IEEE SIGNAL PROC LET, V11, P649, DOI 10.1109/LSP.2004.831666 APPEL U, 1983, INFORM SCIENCES, V29, P27, DOI 10.1016/0020-0255(83)90008-7 Asgari M, 2010, IEEE ENG MED BIO, P5201, DOI 10.1109/IEMBS.2010.5626104 Asgari M., 2010, IEEE INT WORKSH MACH, P462 Basseville M., 1993, INFORM SYSTEM SCI SE Bergl P., 2010, THESIS CZECH TU PRAG Bergl P, 2007, Proceedings of the Fifth IASTED International Conference on Biomedical Engineering, P171 Bergl P, 2006, PROC WRLD ACAD SCI E, V18, P33 Beskow J., 2000, P ICSLP 2000 BEIJ CH, V4, P464 Bocklet T, 2012, J VOICE, V26, P390, DOI 10.1016/j.jvoice.2011.04.010 Boersma P., 2001, GLOT INT, V5, P341 Brown J. 
R., 1969, J SPEECH HEAR RES, V12, P249 Chu PS, 2004, J CLIMATE, V17, P4893, DOI 10.1175/JCLI-3248.1 Cmejla R., 2004, EUROSIPCO P WIEN, P245 Cmejla R., 2001, 4 INT C TEXT SPEECH, P291 Conture E., 2001, STUTTERING ITS NATUR Couvreur L., 1999, P ESCA ETRW WORKSH A, P84 DARLEY FL, 1969, J SPEECH HEAR RES, V12, P462 Dejonckere PH, 2001, EUR ARCH OTO-RHINO-L, V258, P77, DOI 10.1007/s004050000299 deRijk MC, 1997, J NEUROL NEUROSUR PS, V62, P10, DOI 10.1136/jnnp.62.1.10 Dobigeon N., 2005, 13 IEEE WORKSH STAT, P335 Duffy JR, 2005, MOTOR SPEECH DISORDE, P592 Eyben F., 2010, P ACM MULT MM FLOR I, P1459, DOI 10.1145/1873951.1874246 Goberman AM, 2010, J NEUROLINGUIST, V23, P470, DOI 10.1016/j.jneuroling.2008.11.001 Godino-Llorente JI, 2004, IEEE T BIO-MED ENG, V51, P380, DOI 10.1109/TBME.2003.820386 Harel BT, 2004, J NEUROLINGUIST, V17, P439, DOI 10.1016/j.jneuroling.2004.06.001 Hariharan M, 2012, J MED SYST, V36, P1821, DOI 10.1007/s10916-010-9641-6 HARTELIUS L, 1994, FOLIA PHONIATR LOGO, V46, P9 Hawkins DM, 2005, TECHNOMETRICS, V47, P164, DOI 10.1198/004017004000000644 Henriquez P, 2009, IEEE T AUDIO SPEECH, V17, P1186, DOI 10.1109/TASL.2009.2016734 Hirano M, 1981, CLIN EXAMINATION VOI Hornykiewicz O, 1998, NEUROLOGY, V51, pS2 Jacobson BH, 1997, AM J SPEECH-LANG PAT, V6, P66 Kay Elemetrics Corp, 2003, MULTIDIMENSIONAL VOI Kent RD, 1999, J COMMUN DISORD, V32, P141, DOI 10.1016/S0021-9924(99)00004-0 KENT RD, 1989, J SPEECH HEAR DISORD, V54, P482 Lastovka M., 1998, 2371998C1LF Lechta V, 2004, STUTTERING Maier A, 2009, SPEECH COMMUN, V51, P425, DOI 10.1016/j.specom.2009.01.004 Mak B., 1996, 4 INT C SPOK LANG PR, V4 Middag C, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P1745 Noth E., 2000, P 4 INT C SPOK LANG, P65 Prochazka A, 2008, 2008 3RD INTERNATIONAL SYMPOSIUM ON COMMUNICATIONS, CONTROL AND SIGNAL PROCESSING, VOLS 1-3, P719, DOI 10.1109/ISCCSP.2008.4537317 Ravikumar K.M., 2009, ICGST INT J DIGITAL, V9, P19 Reeves J, 2007, J APPL METEOROL CLIM, V46, P900, DOI 10.1175/JAM2493.1 RILEY GD, 1972, J SPEECH HEAR DISORD, V37, P314 Robertson S., 1986, WORKING DYSARTHRICS Rosen KM, 2011, INT J SPEECH-LANG PA, V13, P165, DOI 10.3109/17549507.2011.529939 Rosen KM, 2006, J SPEECH LANG HEAR R, V49, P395, DOI 10.1044/1092-4388(2006/031) Ruanaidh J. J., 1996, SERIES STAT COMPUTIN Ruben RJ, 2000, LARYNGOSCOPE, V110, P241, DOI 10.1097/00005537-200002010-00010 Ruggiero C, 1999, J TELEMED TELECARE, V5, P11, DOI 10.1258/1357633991932333 Rusz J, 2011, J ACOUST SOC AM, V129, P350, DOI 10.1121/1.3514381 Rusz J, 2011, MOVEMENT DISORD, V26, P1951, DOI 10.1002/mds.23680 Sapir S, 2010, J SPEECH LANG HEAR R, V53, P114, DOI 10.1044/1092-4388(2009/08-0184) Singh N, 2007, PROG NEUROBIOL, V81, P29, DOI 10.1016/j.pneurobio.2006.11.009 Sooful J. 
J., 2001, P 12 S PATT REC ASS, P99 Su HY, 2008, INT CONF ACOUST SPEE, P4513 Teesson K, 2003, J SPEECH LANG HEAR R, V46, P1009, DOI 10.1044/1092-4388(2003/078) Tsanas A, 2011, J R SOC INTERFACE, V8, P842, DOI 10.1098/rsif.2010.0456 Ureten O., 1999, IEEE EURASIP WORKSH, P830 Van Borsel J, 2003, INT J LANG COMM DIS, V38, P119, DOI 10.1080/1368282021000042902 Western B, 2004, POLIT ANAL, V12, P354, DOI 10.1093/pan/mph023 Wisniewski M., 2007, COMPUTER RECOGNITION, V2, P445 Wong H, 2006, J HYDROL, V324, P323, DOI 10.1016/j.jhydrol.2005.10.007 Yairi E., 1999, J SPEECH LANG HEAR R, V42, P1098 NR 67 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2013 VL 55 IS 1 BP 178 EP 189 DI 10.1016/j.specom.2012.08.003 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 055NF UT WOS:000312422900014 ER PT J AU Zuo, X Sumii, T Iwahashi, N Nakano, M Funakoshi, K Oka, N AF Zuo, Xiang Sumii, Taisuke Iwahashi, Naoto Nakano, Mikio Funakoshi, Kotaro Oka, Natsuki TI Correcting phoneme recognition errors in learning word pronunciation through speech interaction SO SPEECH COMMUNICATION LA English DT Article DE Pronunciation learning; Interactive Phoneme Update; Phoneme recognition ID AUTOMATIC TRANSCRIPTION; UNKNOWN WORDS; SYSTEM; ALGORITHM AB This paper presents a method called Interactive Phoneme Update (IPU) that enables users to teach systems the pronunciation (phoneme sequences) of words in the course of speech interaction. Using the method, users can correct mis-recognized phoneme sequences by repeatedly making correction utterances according to the system responses. The originalities of this method are: (1) word-segment-based correction that allows users to use word segments for locating mis-recognized phonemes based on open-begin-end dynamic programming matching and generalized posterior probability, (2) history-based correction that utilizes the information of phoneme sequences that were recognized and corrected previously in the course of interactive learning of each word. Experimental results show that the proposed IPU method reduces the error rate by a factor of three over a previously proposed maximum-likelihood-based method. (C) 2012 Elsevier B.V. All rights reserved. C1 [Zuo, Xiang; Sumii, Taisuke; Oka, Natsuki] Kyoto Inst Technol, Sakyo Ku, Kyoto 6068585, Japan. [Iwahashi, Naoto] Natl Inst Informat & Commun Technol, Sora Ku, Kyoto 6190289, Japan. [Nakano, Mikio; Funakoshi, Kotaro] Honda Res Inst Japan Co Ltd, Wako, Saitama 3510188, Japan. RP Zuo, X (reprint author), Kyoto Inst Technol, Sakyo Ku, Hashigami Cho, Kyoto 6068585, Japan. 
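The abstract above locates mis-recognized phonemes by open-begin-end dynamic programming matching of a correction segment against the stored phoneme sequence. Below is a simplified sketch of such a matcher; the unit edit costs and the omission of the paper's generalized-posterior-probability weighting are assumptions for illustration.

import numpy as np

def locate_segment(full, segment, sub_cost=1.0, ins_cost=1.0, del_cost=1.0):
    # Open-begin-end DP: the segment may align to any span of the full sequence.
    # Returns (start, end, cost) such that full[start:end] best matches segment.
    n, m = len(full), len(segment)
    D = np.zeros((m + 1, n + 1))
    D[1:, 0] = np.arange(1, m + 1) * del_cost   # the segment itself must be fully consumed
    start = np.zeros((m + 1, n + 1), dtype=int)
    start[0] = np.arange(n + 1)                 # free begin: alignment may start at any column
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            costs = (D[i - 1, j - 1] + (0.0 if segment[i - 1] == full[j - 1] else sub_cost),
                     D[i - 1, j] + del_cost,    # segment phoneme unmatched
                     D[i, j - 1] + ins_cost)    # extra phoneme inside the matched span
            k = int(np.argmin(costs))
            D[i, j] = costs[k]
            start[i, j] = start[i - 1, j - 1] if k == 0 else (start[i - 1, j] if k == 1 else start[i, j - 1])
    end = int(np.argmin(D[m, 1:])) + 1          # free end
    return start[m, end], end, float(D[m, end])

# usage sketch: a correction utterance "m aa r ow" locates the faulty span in "t ah m aa l ow"
print(locate_segment("t ah m aa l ow".split(), "m aa r ow".split()))  # (2, 6, 1.0)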
EM edgarzx@gmail.com CR Bael C., 2007, COMPUT SPEECH LANG, V21, P652 Bansal D, 2009, INT CONF ACOUST SPEE, P4293, DOI 10.1109/ICASSP.2009.4960578 Bazzi I., 2002, P ICSLP, P1613 Chang S., 2000, P 6 INT C SPOK LANG, P330 Chung G., 2003, P HLT NAACL EDM CAN, P32 ELVIRA JM, 1998, ACOUST SPEECH SIG PR, P849 HAEBUMBACH R, 1995, INT CONF ACOUST SPEE, P840, DOI 10.1109/ICASSP.1995.479825 Holzapfel H, 2008, ROBOT AUTON SYST, V56, P1004, DOI 10.1016/j.robot.2008.08.012 KUREMATSU A, 1990, SPEECH COMMUN, V9, P357, DOI 10.1016/0167-6393(90)90011-W LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010 Leitner C., 2010, P 7 C INT LANG RES E, P3278 Mohamed A.-R., 2009, P INT C COMP GRAPH I, P1 Nakagawa S., 2006, P IEICE GEN C, P13 Nakamura S, 2006, IEEE T AUDIO SPEECH, V14, P365, DOI 10.1109/TSA.2005.860774 Nakano M., 2010, P NAT C ART INT AAAI, P74 Parada C., 2010, P N AM CHAPT ASS COM, P216 Rastrow A, 2009, INT CONF ACOUST SPEE, P3953, DOI 10.1109/ICASSP.2009.4960493 SAKOE H, 1979, IEEE T ACOUST SPEECH, V27, P588, DOI 10.1109/TASSP.1979.1163310 Soong F.K., 2004, P SPEC WORKSH MAUI S Sun H., 2003, P EUR, P2713 Svendsen T., 1995, P EUROSPEECH, P783 Waibel A., 2000, VERBMOBIL FDN SPEECH, P33 Wessel F, 2001, IEEE T SPEECH AUDI P, V9, P288, DOI 10.1109/89.906002 WU JX, 1999, ACOUST SPEECH SIG PR, P589 Yazgan A., 2004, IEEE ICASP PROCESS, V1, P745 Zuo X., 2010, P 3 IEEE WORKSH SPOK, P348 NR 26 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2013 VL 55 IS 1 BP 190 EP 203 DI 10.1016/j.specom.2012.08.008 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 055NF UT WOS:000312422900015 ER PT J AU Moattar, MH Homayounpour, MM AF Moattar, M. H. Homayounpour, M. M. TI A review on speaker diarization systems and approaches SO SPEECH COMMUNICATION LA English DT Review DE Speaker indexing; Speaker diarization; Speaker segmentation; Speaker clustering; Speaker tracking ID BAYESIAN INFORMATION CRITERION; LONG-TERM FEATURES; BROADCAST NEWS; SPEECH RECOGNITION; AUTOMATIC SEGMENTATION; TRANSCRIPTION SYSTEM; MICROPHONE MEETINGS; AUDIO SEGMENTATION; MODEL; CLASSIFICATION AB Speaker indexing or diarization is an important task in audio processing and retrieval. Speaker diarization is the process of labeling a speech signal with labels corresponding to the identity of speakers. This paper includes a comprehensive review on the evolution of the technology and different approaches in speaker indexing and tries to offer a fully detailed discussion on these approaches and their contributions. This paper reviews the most common features for speaker diarization in addition to the most important approaches for speech activity detection (SAD) in diarization frameworks. Two main tasks of speaker indexing are speaker segmentation and speaker clustering. This paper includes a separate review on the approaches proposed for these subtasks. However, speaker diarization systems which combine the two tasks in a unified framework are also introduced in this paper. Another discussion concerns the approaches for online speaker indexing which has fundamental differences with traditional offline approaches. Other parts of this paper include an introduction on the most common performance measures and evaluation datasets. 
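[Illustrative note] As an aside to the diarization review excerpted around this point: many of the segmentation approaches it surveys test for a speaker change by comparing one versus two Gaussian models of adjacent feature blocks under the Bayesian information criterion (BIC). The Python sketch below shows that building block in isolation; the feature dimensionality, penalty weight lam and synthetic data are illustrative assumptions, not material from the review.

    import numpy as np

    def delta_bic(X, Y, lam=1.0):
        """Positive value suggests a speaker change between feature blocks X and Y."""
        Z = np.vstack([X, Y])
        n, d = Z.shape
        logdet = lambda A: np.linalg.slogdet(np.cov(A, rowvar=False) + 1e-6 * np.eye(d))[1]
        # likelihood gain of two full-covariance Gaussians over a single one
        gain = 0.5 * (n * logdet(Z) - len(X) * logdet(X) - len(Y) * logdet(Y))
        penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
        return gain - penalty

    rng = np.random.default_rng(0)
    spk_a = rng.normal(0.0, 1.0, size=(200, 12))    # MFCC-like frames, speaker A (toy)
    spk_b = rng.normal(2.0, 1.0, size=(200, 12))    # speaker B, shifted mean (toy)
    print(delta_bic(spk_a, spk_b) > 0)              # expect True: clear change
    print(delta_bic(spk_a[:100], spk_a[100:]) > 0)  # expect False: same speaker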
To conclude this paper, a complete framework for speaker indexing is proposed, which is aimed to be domain independent and parameter free and applicable for both online and offline applications. (c) 2012 Elsevier B.V. All rights reserved. C1 [Moattar, M. H.; Homayounpour, M. M.] Amirkahir Univ Technol, Lab Intelligent Multimedia Proc IMP, Comp Engn & Informat Technol Dept, Tehran, Iran. RP Moattar, MH (reprint author), Amirkahir Univ Technol, Lab Intelligent Multimedia Proc IMP, Comp Engn & Informat Technol Dept, Tehran, Iran. EM moattar@aut.ac.ir; homayoun@aut.ac.ir FU Telecommunication Research Center (ITRC) [T/500/14939] FX The authors would like to thank Iran Telecommunication Research Center (ITRC) for supporting this work under contract No. T/500/14939. CR Ajmera J, 2004, IEEE SIGNAL PROC LET, V11, P649, DOI 10.1109/LSP.2004.831666 Ajmera J., 2002, P INT C SPOK LANG PR, P573 Ajmera J., 2004, P ICASSP, V1, P605 Ajmera J, 2003, P IEEE WORKSH AUT SP, P411 Ajmera J., 2002, IMPROVED UNKNOWN MUL Akita Y., 2003, P EUROSPEECH, P2985 Alabiso J., 1998, ENGLISH BROADCAST NE Anguera X., 2006, P OD, P1 Anguera X., 2006, THESIS U POLITECNICA Anguera X, BEAMFORMIT FAST ROBU Anguera X., 2004, 3 JORN TECN HABL VAL Anguera X., 2011, P ICASSP Anguera X., 2007, LECT NOTES COMPUTER, V4299 Anguera X., 2006, P 2 INT WORKSH MULT Anguera X., 2006, P 9 INT C SPOK LANG Anguera X, 2005, LECT NOTES COMPUT SC, V3869, P402 [Anonymous], OP SOURC PLATF BIOM [Anonymous], 2006, NIST FALL RICH TRANS [Anonymous], 2006, P 2 INT C IM VID RET, P488 [Anonymous], CHAIR COMP SCI 6 Antolin A.G., 2007, IEEE T COMPUT, V56, P1212 Arias J.A., 2005, P 13 EUR SIGN PROC C Bakis R., 1997, P SPEECH REC WORKSH, P67 Barras C., 2004, P FALL RICH TRANSCR Barras C, 2006, IEEE T AUDIO SPEECH, V14, P1505, DOI 10.1109/TASL.2006.878261 BARRAS C, 2003, ACOUST SPEECH SIG PR, P49 Basseville M., 1993, DETECTION ABRUPT CHA BENHARUSH O, 2008, P INTERSPEECH, P24 Biatov K, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P2082 Bijankhan M, 2002, D91 SECURESCM Bimbot F., 1993, P EUROSPEECH, P169 Boakye K, 2008, INT CONF ACOUST SPEE, P4353, DOI 10.1109/ICASSP.2008.4518619 Boakye K., 2008, THESIS U CALIFORNIA Boehm C, 2009, INT CONF ACOUST SPEE, P4081, DOI 10.1109/ICASSP.2009.4960525 Bozonnet S, 2010, INT CONF ACOUST SPEE, P4958, DOI 10.1109/ICASSP.2010.5495088 Brandstein M., 2001, EXPLICIT SPEECH MODE Burger S, 2002, P INT C SPOK LANG PR, P301 Campbell JP, 1997, P IEEE, V85, P1437, DOI 10.1109/5.628714 Campbell N., 2006, P WORKSH PROGR Castaldo F, 2008, INT CONF ACOUST SPEE, P4133, DOI 10.1109/ICASSP.2008.4518564 Cetin O, 2006, INT CONF ACOUST SPEE, P357 Cettolo M., 2005, COMPUT SPEECH LANG, V19, P1004 Cettolo M., 2003, P ICASSP HONG KONG C, P537 Chen SS, 2002, SPEECH COMMUN, V37, P69, DOI 10.1016/S0167-6393(01)00060-7 CHEN SS, 1998, ACOUST SPEECH SIG PR, P645 Chen T., 1996, P ICASSP, V4, P2056 Cheng S., 2004, P 8 INT C SPOK LANG, P1617 Chu SM, 2008, INT CONF ACOUST SPEE, P4329, DOI 10.1109/ICASSP.2008.4518613 Chu SM, 2009, INT CONF ACOUST SPEE, P4089, DOI 10.1109/ICASSP.2009.4960527 Cohen I., 2002, P 22 CONV EL EL ENG Collet M., 2003, P 2003 IEEE INT C AC, P713 COX H, 1987, IEEE T ACOUST SPEECH, V35, P1365, DOI 10.1109/TASSP.1987.1165054 COX H, 1986, IEEE T ACOUST SPEECH, V34, P393, DOI 10.1109/TASSP.1986.1164847 Delacourt P., 1999, P EUR C SPEECH COMM, V3, P1195 Delacourt P, 2000, SPEECH COMMUN, V32, P111, DOI 10.1016/S0167-6393(00)00027-3 Delphine C, 2010, INT CONF ACOUST SPEE, P4966, DOI 
10.1109/ICASSP.2010.5495090 Deshayes J., 1986, ONLINE STAT ANAL CHA DESOBRY F, 2003, ACOUST SPEECH SIG PR, P872 El-Khoury E, 2009, INT CONF ACOUST SPEE, P4097, DOI 10.1109/ICASSP.2009.4960529 Ellis D.P.W., 2004, P NIST M REC WORKSH Evans NWD, 2009, INT CONF ACOUST SPEE, P4061, DOI 10.1109/ICASSP.2009.4960520 Fernandez D., 2009, P INT BRIGHT UK, P849 FIoshuyama O., 1999, IEEE T SIGNAL PROCES Fischer S., 1997, P ICASSP Fiscus J. G., 2004, P FALL 2004 RICH TRA Fiscus JG, 2005, LECT NOTES COMPUT SC, V3869, P369 Fisher JW, 2004, IEEE T MULTIMEDIA, V6, P406, DOI 10.1109/TMM.2004.827503 Fisher J.W., 2000, P NEUR INF PROC SYST, P772 Flanagan J, 1994, J ACOUST SOC AM, V78, P1508 Fredouille C., 2004, P NIST 2004 SPRING R Fredouille C., 2006, P MLMI 06 WASH US Friedland AG, 2009, INT CONF ACOUST SPEE, P4077, DOI 10.1109/ICASSP.2009.4960524 Friedland G, 2009, IEEE T AUDIO SPEECH, V17, P985, DOI 10.1109/TASL.2009.2015089 Gales MJF, 2006, IEEE T AUDIO SPEECH, V14, P1513, DOI 10.1109/TASL.2006.878264 Galliano S, 2006, P LANG EV RES C Garofolo J, 2002, NIST RICH TRANSCRIPT Garofolo J. S., 1993, DARPA TIMIT ACOUSTIC Gauvain J.L., 1998, P INT C SPOK LANG PR, P1335 Ghahramani Z, 1997, MACH LEARN, V29, P245, DOI 10.1023/A:1007425814087 Graff D, 2001, TDT3 MANDARIN AUDIO Gravier G., 2010, AUDIOSEG AUDIO SEGME Griffiths L., 1982, IEEE T ANTENNAS PROP Hain T., 1998, P DARPA BROADC NEWS, P133 Han K. J., 2008, P ICASSP 2008 MAR, P4373 Hansen JH, 2005, IEEE T SPEECH AUDI P, V13, P712, DOI 10.1109/TSA.2005.852088 Harb H., 2006, INT J DIGITAL LIB, V6, P70, DOI 10.1007/s00799-005-0120-5 Heck L., 1997, P EUR RHOD GREEC Huang C.H., 2004, P INT S CHIN SPOK LA, P109 Huang Y., 2007, P IEEE AUT SPEECH RE, P693 Huijbregts M., 2007, LECT NOTES COMPUTER, V4816 Hung H., 2008, P CVPR WORKSH HUM CO Hung J., 2000, P ICASLP BEIJ CHIN Imseng D, 2010, IEEE T AUDIO SPEECH, V18, P2028, DOI 10.1109/TASL.2010.2040796 Istrate D, 2005, LECT NOTES COMPUT SC, V3869, P428 Izmirli O, 2000, P INT S MUS INF RETR Jain A. 
K., 1988, ALGORITHMS CLUSTERIN Janin A., 2003, P ICCASP HONG KONG Jin H., 1997, P DARPA SPEECH REC W, P108 Jin Q., 2004, P NIST M REC WORKSH, P112 Johnson D, 1993, ARRAY SIGNAL PROCESS Johnson S., 1998, P 5 INT C SPOK LANG, P1775 Johnson S, 1999, P EUR BUD HUNG JOHNSON SE, 2000, ACOUST SPEECH SIG PR, P1427 Jothilakshmi S., 2009, ENG APPL ARTIFICIAL, V22 Juang B., 1985, AT T TECHNICAL J Kaneda Y., 1991, Journal of the Acoustical Society of Japan (E), V12 Kataoka A., 1990, Journal of the Acoustical Society of Japan (E), V11 Kemp T, 2000, INT CONF ACOUST SPEE, P1423, DOI 10.1109/ICASSP.2000.861862 Kim HG, 2005, INT CONF ACOUST SPEE, P745 Kingsbury BED, 1998, SPEECH COMMUN, V25, P117, DOI 10.1016/S0167-6393(98)00032-6 Kinnunen T, 2010, SPEECH COMMUN, V52, P12, DOI 10.1016/j.specom.2009.08.009 Kinnunen T., 2008, P SPEAK LANG REC WOR Koshinaka T, 2009, INT CONF ACOUST SPEE, P4093, DOI 10.1109/ICASSP.2009.4960528 Kotti M., 2006, P 2006 IEEE INT C MU, P1101 Kotti M, 2008, SIGNAL PROCESS, V88, P1091, DOI 10.1016/j.sigpro.2007.11.017 Kotti M, 2008, IEEE T AUDIO SPEECH, V16, P920, DOI 10.1109/TASL.2008.925152 Kotti M., 2006, P 2006 IEEE INT S CI Kristjansson T., 2005, P ICSLP LISB PORT Kubala F, 1997, P SPEECH REC WORKSH, P90 Kuhn R, 2000, IEEE T SPEECH AUDI P, V8, P695, DOI 10.1109/89.876308 Kuhn R., 1998, P ICSLP, P1771 Kwon S., 2002, P INT C SPOK LANG PR, P2537 Kwon S., 2004, P ICSLP, P1517 Kwon S., 2004, IEEE T SPEECH AUDIO, V13, P1004 Kwon S., 2003, P IEEE AUT SPEECH RE, P423 Kwon S., 2003, P EUR Lapidot I., 2002, 0260 IDIAP Lapidot I., 2001, P 2001 SPEAK OD SPEA, P169 Lathoud G., 2004, P ICASSP NIST M REC Leeuwen D., 2008, LNCS, V4625, P475 Liu D., 2004, P ICASSP MAY, P333 LIU D, 1999, P ESCA EUR 99 BUD HU, V3, P1031 Liu D., 2003, P IEEE INT C AC SPEE, P572 Lopez J.F., 2000, P ICSLP BEIJ CHIN Lu L., 2002, P ICPR QUEB CIT CAN, V2 Lu L, 2005, MULTIMEDIA SYST, V10, P332, DOI 10.1007/s00530-004-0160-5 Lu L, 2002, IEEE T SPEECH AUDI P, V10, P504, DOI 10.1109/TSA.2002.804546 Lu L., 2002, P 10 ACM INT C MULT, P602 Malegaonkar A, 2006, IEEE SIGNAL PROC LET, V13, P509, DOI 10.1109/LSP.2006.873656 Markov K, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P363 Markov K., 2007, P ASRU, P699 Markov K., 2007, P INTERSPEECH Marro C., 1998, IEEE T SPEECH AUDIO Martin A., 2001, P EUR C SPEECH COMM, V2, P787 McCowan IA, 2000, INT CONF ACOUST SPEE, P1723, DOI 10.1109/ICASSP.2000.862084 McNeill D., 2000, LANGUAGE AND GESTURE Meignier S., 2001, P OD SPEAK LANG REC, P175 Meignier S., 2005, COMPUT SPEECH LA SEP, P303 Meignier S, 2006, COMPUT SPEECH LANG, V20, P303, DOI 10.1016/j.csl.2005.08.002 Meignier Sylvain, 2010, CMU SPUD WORKSH Meinedo H., 2003, P ICASSP HONG KONG C Mesgarani N., 2004, P ICASSP, V1, P601 Mirghafori N., 2006, P ICASSP Moattar M. H., 2009, P EUSIPCO, P2549 Moattar M.H., 2009, P 14 INT COMP SOC IR, P501 Moh Y., 2003, P IEEE INT C AC SPEE, P85 MORARU D, 2003, ACOUST SPEECH SIG PR, P89 Moraru D., 2004, P OD 2004 SPEAK LANG Moraru D., 2003, P HUM COMP DIAL BUCH Moraru D., 2004, P ICASSP MONTR CAN Mori K, 2001, INT CONF ACOUST SPEE, P413, DOI 10.1109/ICASSP.2001.940855 Muthusamy Y. 
K., 1992, P INT C SPOK LANG PR, P895 Neal RM, 1998, NATO ADV SCI I D-BEH, V89, P355 Nguyen P., 2002, P RICH TRANSCR WORKS Nguyen TH, 2009, INT CONF ACOUST SPEE, P4085, DOI 10.1109/ICASSP.2009.4960526 Nishida M, 2003, 2003 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL I, PROCEEDINGS, P172 NIST, 2002, RICH TRANSCR EV PROJ Nock H.J., 2003, LECT NOTES COMPUTER, V2728, P565 Noulas A.K., 2009, P COMP VIS IM UND Noulas AK, 2007, ICMI'07: PROCEEDINGS OF THE NINTH INTERNATIONAL CONFERENCE ON MULTIMODAL INTERFACES, P350 Nwe T.L., 2010, P ICASSP, P4073 Omar M., 2005, P ICASSP Otero P.L., 2010, P ICASSP, P4970 Ouellet P., 2005, P ICASSP LISB PORT Pardo JM, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P2194 Pardo JM, 2006, LECT NOTES COMPUT SC, V4299, P257 Pelecanos J., 2001, P ISCA SPEAK REC WOR Pellom BL, 1998, SPEECH COMMUN, V25, P97, DOI 10.1016/S0167-6393(98)00031-4 Pfau T., 2001, P EUR Rao R., 1996, P INT PICT COD S Reynolds D. A., 2004, P FALL 2004 RICH TRA Roch M., 2004, P SPEAK OD, P349 Rosca J., 2003, P ICASSP Rougui J., 2006, P ICASSP TOUL FRANC Sanchez-Bote J., 2003, P ICASSP Sankar A., 1998, P DARPA BROADC NEWS Sankar A., 1995, P EUR MADR SPAIN SCHWARZ G, 1978, ANN STAT, V6, P461, DOI 10.1214/aos/1176344136 Shriberg E., 2007, SERIES LECT NOTES CO, V4343, P241, DOI 10.1007/978-3-540-74200-5_14 Shriberg Elizabeth, 2001, P EUROSPEECH, P1359 Sian Cheng S., 2003, P EUR GEN SWITZ Siegler M., 1997, P DARPA SPEECH REC W, P97 Sinha R., 2005, P EUR C SPEECH COMM, P2437 Siracusa M., 2007, P ICASSP Sivakumaran P., 2001, P EUR SCAND SOLOMONOFF A, 1998, ACOUST SPEECH SIG PR, P757 Stafylakis T., 2009, P INTERSPEECH Stafylakis T, 2010, INT CONF ACOUST SPEE, P4978, DOI 10.1109/ICASSP.2010.5495076 Stern R., 1997, P DARPA SPEECH REC W Stolcke A, 2010, INT CONF ACOUST SPEE, P4390, DOI 10.1109/ICASSP.2010.5495626 Sturim D., 2001, P ICASSP SALT LAK CI Sun H., 2009, P INTERSPEECH Sun HW, 2010, INT CONF ACOUST SPEE, P4982, DOI 10.1109/ICASSP.2010.5495077 Thyes O., 2000, P INT C SPOK LANG PR, V2, P242 Tranter S, 2005, P ICASSP MONTR CAN Tranter SE, 2006, IEEE T AUDIO SPEECH, V14, P1557, DOI 10.1109/TASL.2006.878256 Tranter S.E., 2004, P 2004 IEEE INT C AC, P433 Tritschler A, 1999, P EUROSPEECH, P679 Trueba-Hornero B., 2008, THESIS U POLITECNICA Tsai W.H., 2004, P ICASLP JEJ ISL KOR Vajaria H, 2006, INT C PATT RECOG, P1150 Valin J., 2004, P ICASSP van Leeuwan D.A, 2005, P MACH LEARN MULT IN, P440 Vandecatseye A., 2004, P LREC LISB PORT Vescovi M., 2003, P 8 EUR C SPEECH COM, P2997 Vijayasenan D., 2007, P IEEE AUT SPEECH RE, P250 Voitovetsky I., 1998, P 1 WORKSH TEXT SPEE, P321 Voitovetsky I, 1997, P IEEE WORKSH NEUR N, P578 Wactlar H., 1996, P ARPA STL WORKSH Wang D, 2003, 2003 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL I, PROCEEDINGS, P468 Wang W, 2007, LECT NOTES COMPUT SC, V4477, P555 WARD JH, 1963, J AM STAT ASSOC, V58, P236, DOI 10.2307/2282967 Wegman S, 1997, P DARPA BROADC NEWS Wolfel M., 2009, P INTERSPEECH Woodland P., 1997, P SPEECH REC WORKSH, P73 Wooters C., 2004, P FALL 2004 RICH TRA Wooters C., 2008, LNCS, V4625 Wu CH, 2006, IEEE T AUDIO SPEECH, V14, P647, DOI 10.1109/TSA.2005.852988 Wu T., 2003, P ICME 03, V2, P721 Wu T., 2003, P INT C MULT MOD WU TY, 2003, ACOUST SPEECH SIG PR, P193 Yamaguchi M., 2005, P ICASLP Yoo IC, 2009, ETRI J, V31, P451, DOI 10.4218/etrij.09.0209.0104 Zamalloa, 2010, P ICASSP, P4962 Zdansky J., 2006, P INTERSPEECH, P2186 Zelinski R., 1988, P IEEE INT C AC 
SPEE, V5, P2578 Zhang C., 2006, P IEEE INT WORKSH MU Zhang X., 2004, P ICASSP Zhou B., 2000, P INT C SPOK LANG PR, P714 Zhou BW, 2005, IEEE T SPEECH AUDI P, V13, P467, DOI 10.1109/TSA.2005.845790 Zhu X, 2006, LECT NOTES COMPUT SC, V4299, P396 Zhu X, 2008, LECT NOTES COMPUT SC, V4625, P533 Zhu X., 2005, P EUR C SPEECH COMM Zhu YM, 2003, IEEE T SYST MAN CY A, V33, P502, DOI 10.1109/TSMCA.2003.809211 Zochova P., 2005, P ICSLP LISB PORT NR 245 TC 6 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD DEC PY 2012 VL 54 IS 10 BP 1065 EP 1103 DI 10.1016/j.specom.2012.05.002 PG 39 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 988LK UT WOS:000307492300001 ER PT J AU Karjigi, V Rao, P AF Karjigi, V. Rao, P. TI Classification of place of articulation in unvoiced stops with spectro-temporal surface modeling SO SPEECH COMMUNICATION LA English DT Article DE Place of articulation; Unvoiced stops; Spectro-temporal features ID SPEECH RECOGNITION; AUDITORY-CORTEX; CONSONANTS; PERCEPTION; FEATURES; UTTERANCES AB Unvoiced stops are rapidly varying sounds with acoustic cues to place identity linked to the temporal dynamics. Neurophysiological studies have indicated the importance of joint spectro-temporal processing in the human perception of stops. In this study, two distinct approaches to modeling the spectra-temporal envelope of unvoiced stop phone segments are investigated with a view to obtaining a low-dimensional feature vector for automatic place classification. Classification accuracies on the TIMIT database and a Marathi words dataset show the overall superiority of classifier combination of polynomial surface coefficients and 2D-DCT. A comparison of performance with published results on the place classification of stops revealed that the proposed spectro-temporal feature systems improve upon the best previous systems' performances. The results indicate that joint spectro-temporal features may be usefully incorporated in hierarchical phone classifiers based on diverse class-specific features. (c) 2012 Elsevier B.V. All rights reserved. C1 [Karjigi, V.; Rao, P.] Indian Inst Technol, Dept Elect Engn, Bombay 400076, Maharashtra, India. RP Karjigi, V (reprint author), Indian Inst Technol, Dept Elect Engn, Bombay 400076, Maharashtra, India. EM veena@ee.iitb.ac.in; prao@ee.iitb.ac.in FU IIT Bombay under the National Programme on Perception Engineering; Department of Information Technology, MCIT, Government of India FX The authors are grateful to the anonymous reviewers for their valuable comments on a previous version of the submission. The research is partly supported by a project grant to IIT Bombay under the National Programme on Perception Engineering, sponsored by the Department of Information Technology, MCIT, Government of India. 
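[Illustrative note] The record above models the spectro-temporal envelope of unvoiced stop segments and compares polynomial surface coefficients with a 2D-DCT representation. The sketch below shows how a low-dimensional 2D-DCT feature of a time-frequency patch can be computed; the patch size, number of retained coefficients and input values are illustrative assumptions rather than the paper's configuration.

    import numpy as np
    from scipy.fft import dctn

    def spectro_temporal_dct(patch, n_time=4, n_freq=4):
        """patch: log-energy array of shape (frames, mel_bands); returns a flat feature vector."""
        coeffs = dctn(patch, norm="ortho")          # separable 2D DCT over time and frequency
        return coeffs[:n_time, :n_freq].ravel()     # keep only the low-order envelope terms

    rng = np.random.default_rng(1)
    patch = rng.random((20, 40))                    # e.g. 20 frames x 40 mel bands around a burst (toy)
    feat = spectro_temporal_dct(patch)
    print(feat.shape)                               # (16,)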
CR AHMED R, 1969, J ACOUST SOC AM, V45, P758, DOI 10.1121/1.1911459 Bonneau A, 1996, J ACOUST SOC AM, V100, P555, DOI 10.1121/1.415866 Bouvrie J, 2008, INT CONF ACOUST SPEE, P4733, DOI 10.1109/ICASSP.2008.4518714 Bunnell H.T., 2004, P INT 04 JEJ ISL KOR, P1313 Chen B., 2004, P ICSLP, P612 Chi T, 2005, J ACOUST SOC AM, V118, P887, DOI 10.1121/1.1945807 DATTA AK, 1980, IEEE T ACOUST SPEECH, V28, P85, DOI 10.1109/TASSP.1980.1163354 Depireux DA, 2001, J NEUROPHYSIOL, V85, P1220 Dyer SA, 2001, IEEE INSTRU MEAS MAG, V4, P4 Ellis D.P.W., 1997, CORRELATION FEATURE Enzinger E., 2010, P AUD ENG SOC INT C, P47 Ezzat T., 2007, P INT C SPOK LANG PR, P506 Fant G., 1973, STOPS CV SYLLABLES S Feijoo S, 1999, SPEECH COMMUN, V27, P1, DOI 10.1016/S0167-6393(98)00064-8 FORREST K, 1988, J ACOUST SOC AM, V84, P115, DOI 10.1121/1.396977 Gish H, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P466 Glass JR, 2003, COMPUT SPEECH LANG, V17, P137, DOI 10.1016/S0885-2308(03)00006-8 Goldenthal W.D., 1994, THESIS MIT CAMBRIDGE Halberstadt A. K., 1998, THESIS MIT CAMBRIDGE Harder D.W., 2010, NUMERICAL ANAL ENG Hazen T., 1998, ACOUST SPEECH SIG PR, P653 Hermansky H., 1999, P AUT SPEECH REC UND, P63 Hou J., 2007, P INT 07 ANTW BELG, P1929 Kajarekar SS, 2001, INT CONF ACOUST SPEE, P137, DOI 10.1109/ICASSP.2001.940786 KEWLEYPORT D, 1983, J ACOUST SOC AM, V73, P322, DOI 10.1121/1.388813 Kleinschmidt M, 2002, ACTA ACUST UNITED AC, V88, P416 Lamel L.F., 1986, DARPA SPEECH REC WOR, P61 Lee DD, 1999, NATURE, V401, P788 Mesgarani N., 2009, P INT 09, P2983 Meyer BT, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P906 Morgan N, 2005, IEEE SIGNAL PROC MAG, V22, P81, DOI 10.1109/MSP.2005.1511826 Neagu A., 1998, P ICSLP, P2127 Niyogi P, 2003, SPEECH COMMUN, V41, P349, DOI 10.1016/S0167-6393(02)00151-6 NOSSAIR ZB, 1991, J ACOUST SOC AM, V89, P2978, DOI 10.1121/1.400735 Obleser J, 2010, FRONT PSYCHOL, V1, DOI 10.3389/fpsyg.2010.00232 Ohala M., 1995, INT C PHON SCI STOCK, P22 Ohala M., 1998, P INT C SPOK LANG PR, P2795 Pandey PC, 2009, IEEE T AUDIO SPEECH, V17, P277, DOI 10.1109/TASL.2008.2010285 Patil V., 2009, P INT 09 BRIGHT UK, P2543 Prasanna SRM, 2001, P SIGN P COMM BANG I, P81 Rifkin R., 2007, MITCSAILTR2007007 Sekhar CC, 2002, IEEE T SPEECH AUDI P, V10, P472, DOI 10.1109/TSA.2002.804298 Smits R, 1996, J ACOUST SOC AM, V100, P3852, DOI 10.1121/1.417241 Sroka JJ, 2005, SPEECH COMMUN, V45, P401, DOI 10.1016/j.specom.2004.11.009 Wang XH, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P1221 NR 45 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD DEC PY 2012 VL 54 IS 10 BP 1104 EP 1120 DI 10.1016/j.specom.2012.04.007 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 988LK UT WOS:000307492300002 ER PT J AU Ozimek, E Kutzner, D Libiszewski, P AF Ozimek, Edward Kutzner, Dariusz Libiszewski, Pawel TI Speech intelligibility tested by the Pediatric Matrix Sentence test in 3-6 year old children SO SPEECH COMMUNICATION LA English DT Article DE Speech intelligibility in children; Sentence matrix test; Speech reception threshold ID RECEPTION THRESHOLD; NOISE AB Objective: The present study was aimed at the development and application of the Polish Pediatric Matrix Sentence Test (PPMST) for testing speech intelligibility in normal-hearing (NH) and hearing-impaired (HI) children aged 3-6. Methods & Procedures: The test was based on sentences of the subject-verb-object pattern. Pictures illustrating PPMST utterances were prepared and the picture-point (PP) method was used for administering the 1-up/1-down adaptive procedure converging the signal to noise ratio (SNR) to the speech reception threshold (SRT). The correctness of verbal responses (VR), preceding PP responses, was also judged. Outcomes & Results: The normative SRT for the PP method was shown to decrease with age. The guessing rate (gamma) turned out to be close to the theoretical value for forced-choice procedures, gamma = 1/n, where n = 6 for the six-alternative PP method (gamma approximate to 0.166) and n = 4 for the four-alternative PP method (gamma approximate to 0.25). Test optimization resulted in minimizing the lapse rate (lambda) (ratio gamma/lambda approximate to 8.0 for n = 4 and gamma/lambda approximate to 5.6 for n = 6, both for NH and HI children). Significantly higher SRTs were observed for HI children than for the NH group. Conclusions & Implications: For children aged 3-6, tested by the developed PPMST, speech intelligibility performance, for both the VR and PP method, increases with age. For each age group, significantly worse intelligibility was observed for HI children than for NH children. The PPMST combined with the PP method is a reliable tool for pediatric speech intelligibility measurements. (c) 2012 Elsevier B.V. All rights reserved. C1 [Ozimek, Edward; Kutzner, Dariusz; Libiszewski, Pawel] Adam Mickiewicz Univ, Inst Acoust, PL-61614 Poznan, Poland. RP Ozimek, E (reprint author), Adam Mickiewicz Univ, Inst Acoust, 85 Umultowska St, PL-61614 Poznan, Poland. EM ozimaku@amu.edu.pl FU Norway through the Norwegian Financial Mechanism [PNRF-167-AI-1/07] FX Supported by a grant from Norway through the Norwegian Financial Mechanism (project no. PNRF-167-AI-1/07). CR Bell T S, 2001, J Am Acad Audiol, V12, P514 Bench J, 1979, Br J Audiol, V13, P108, DOI 10.3109/03005367909078884 Brand T, 2002, J ACOUST SOC AM, V111, P2801, DOI 10.1121/1.1479152 Bulczynska K., 1987, SLOWNICTWO DZIECI WI Cameron S, 2006, INT J AUDIOL, V45, P99, DOI 10.1080/14992020500377931 Elliott L. 
L., 1980, NW U CHILDRENS PERCE ELLIOTT LL, 1979, J ACOUST SOC AM, V66, P651, DOI 10.1121/1.383691 ELLIOTT LL, 1979, J ACOUST SOC AM, V66, P12, DOI 10.1121/1.383065 Fallon M, 2000, J ACOUST SOC AM, V108, P3023, DOI 10.1121/1.1323233 Gescheider George A., 1997, PSYCHOPHYSICS FUNDAM HAGERMAN B, 1982, SCAND AUDIOL, V11, P79, DOI 10.3109/01050398209076203 Jerger S., 1984, PEDIAT SPEECH INTELL Kollmeier B., 1997, J ACOUST SOC AM, V102, P1085 Kollmeier B., 1990, THESIS U GOTTINGEN G LEVITT H, 1971, J ACOUST SOC AM, V49, P467, DOI 10.1121/1.1912375 MACKIE K, 1986, J SPEECH HEAR RES, V29, P275 Mendel LL, 2008, INT J AUDIOL, V47, P546, DOI 10.1080/14992020802252261 Nilsson M. J., 1996, DEV HEARING NOISE TE Ozimek E., 2011, SPEECH INTELLI UNPUB Ozimek E, 2010, INT J AUDIOL, V49, P444, DOI 10.3109/14992021003681030 Ozimek E, 2009, INT J AUDIOL, V48, P433, DOI 10.1080/14992020902725521 PLOMP R, 1979, AUDIOLOGY, V18, P43 ROSS M, 1970, J SPEECH HEAR RES, V13, P44 Smits C, 2006, J ACOUST SOC AM, V120, P1608, DOI 10.1121/1.2221405 Steffens T, 2003, HNO, V51, P1012, DOI 10.1007/s00106-003-0848-4 SZMEJA Z, 1963, Otolaryngol Pol, V17, P367 Szuchnik J., 2002, PROGRAM ROZWIJANIA O Versfeld NJ, 2000, J ACOUST SOC AM, V107, P1671, DOI 10.1121/1.428451 Wagener K, 2005, Z AUDIOL, V44, P134 Wagener K, 2003, INT J AUDIOL, V42, P10, DOI 10.3109/14992020309056080 Wagener K, 1999, Z AUDIOL, V38, P4 ZAKRZEWSKI A, 1971, Otolaryngologia Polska, V25, P297 Zheng Y, 2009, INT J AUDIOL, V48, P718, DOI 10.1080/14992020902902658 NR 33 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD DEC PY 2012 VL 54 IS 10 BP 1121 EP 1131 DI 10.1016/j.specom.2012.06.001 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 988LK UT WOS:000307492300003 ER PT J AU Baum, D AF Baum, Doris TI Recognising speakers from the topics they talk about SO SPEECH COMMUNICATION LA English DT Article DE Speaker recognition; Topic classification; High-level features ID MODELS; RECOGNITION AB We investigate how a speaker's preference for specific topics can be used for speaker identification. In domains like broadcast news or parliamentary speeches, speakers have a field of expertise they are associated with. We explore how topic information for a segment of speech, extracted from an automatic speech recognition transcript, can be employed to identify the speaker. Two methods for modelling topic preferences are compared: implicitly, based on speaker-characteristic keywords, and explicitly, by using automatically derived topic models to assign topics to the speech segments. In the keyword-based approach, the segments' tf-idf vectors are classified with Support Vector Machine speaker models. For the topic-model-based approach, a domain-specific topic model is used to represent each segment as a mixture of topics; the speakers' score is derived from the Kullback-Leibler divergence between the topic mixtures of their training data and of the segment. The methods were tested on political speeches given in German parliament by 235 politicians. We found that topic cues do carry speaker information, as the topic-model-based system yielded an equal error rate (EER) of 16.3%. The topic-based approach combined well with a spectral baseline system, improving the EER from 8.6% for the spectral to 6.2% for the fused system. (c) 2012 Elsevier B.V. All rights reserved. C1 Fraunhofer IAIS, St Augustin, Germany. 
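[Illustrative note] The topic-model-based scoring described in the record above compares a speech segment's topic mixture with each speaker's training-data topic mixture via the Kullback-Leibler divergence. A minimal sketch of that scoring step follows; the toy topic mixtures stand in for automatically derived topic-model output and the speaker names are invented.

    import numpy as np

    def kl(p, q, eps=1e-10):
        p = np.asarray(p, float) + eps
        q = np.asarray(q, float) + eps
        p, q = p / p.sum(), q / q.sum()
        return float(np.sum(p * np.log(p / q)))

    speaker_topics = {                        # per-speaker topic mixtures from training speeches (toy)
        "spk_finance": [0.70, 0.20, 0.10],
        "spk_health":  [0.10, 0.75, 0.15],
    }
    segment_topics = [0.15, 0.70, 0.15]       # topic mixture inferred for the test segment (toy)

    # lower divergence = better match; a full system would fuse this with spectral scores
    scores = {spk: -kl(ref, segment_topics) for spk, ref in speaker_topics.items()}
    print(max(scores, key=scores.get))        # spk_health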
RP Baum, D (reprint author), Fraunhofer IAIS, St Augustin, Germany. EM dorisbaum@gmx.net FU German Federal Ministry of Economics and Technology through the CONTENTUS scenario of the THESEUS project FX Part of this study was funded by the German Federal Ministry of Economics and Technology through the CONTENTUS scenario of the THESEUS project.7 CR Baum D, 2009, IEEE AUT SPEECH REC Biatov K, 2002, P 3 INT C LANG RES E Blei D, 2003, P 26 ANN INT ACM SIG, P127, DOI DOI 10.1145/860435.860460 Blei DM, 2003, J MACH LEARN RES, V3, P993, DOI 10.1162/jmlr.2003.3.4-5.993 Branavan S, 2008, P ACL, P263 Brummer N, 2006, COMPUT SPEECH LANG, V20, P230, DOI 10.1016/j.csl.2005.08.001 Canseco L, 2005, P IEEE WORKSH AUT SP Canseco-Rodriguez L., 2004, ICSLP 2004, P1272 Chang J., 2009, NEURAL INFORM PROCES, V31 Doddington G.R., 2001, 7 EUR C SPEECH COMM, P2521 Feng Y., 2010, HUM LANG TECHN 2010, P831 Ferrer L, 2010, INT CONF ACOUST SPEE, P4414, DOI 10.1109/ICASSP.2010.5495632 Gillick L., 1989, ICASSP 1989 GLASG UK, V1, P532 Griffiths T., 2007, LATENT SEMANTIC ANAL Hatch AO, 2005, INT CONF ACOUST SPEE, P169 Joachims T., 1999, ADV KERNEL METHODS S Joachims T., 1998, P EUR C MACH LEARN Kinnunen T, 2010, SPEECH COMMUN, V52, P12, DOI 10.1016/j.specom.2009.08.009 Kockmann M, 2010, INT CONF ACOUST SPEE, P4418, DOI 10.1109/ICASSP.2010.5495616 Lee A., 2001, 7 EUR C SPEECH COMM Martin A., 1997, EUROSPEECH, P1895 Mauclair J., 2006, IEEE SPEAK LANG REC, P1 McCallum A, 2002, MALLET MACHINE LEARN McNemar Q, 1947, PSYCHOMETRIKA, V12, P153, DOI 10.1007/BF02295996 Mertens T, 2009, INT CONF ACOUST SPEE, P4885, DOI 10.1109/ICASSP.2009.4960726 Morik K, 1999, P 16 INT C MACH LEAR, P268 Porter M, 2001, SNOWBALL LANGUAGE ST REYNOLDS DA, 1995, SPEECH COMMUN, V17, P91, DOI 10.1016/0167-6393(95)00009-D Reynolds DA, 2000, DIGIT SIGNAL PROCESS, V10, P19, DOI 10.1006/dspr.1999.0361 Rosen-Zvi M, 2010, ACM T INFORM SYST, V28, DOI 10.1145/1658377.1658381 Shriberg E, 2005, SPEECH COMMUN, V46, P455, DOI 10.1016/j.specom.2005.02.018 NR 31 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD DEC PY 2012 VL 54 IS 10 BP 1132 EP 1142 DI 10.1016/j.specom.2012.06.003 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 988LK UT WOS:000307492300004 ER PT J AU Rasanen, O AF Rasanen, Okko TI Computational modeling of phonetic and lexical learning in early language acquisition: Existing models and future directions SO SPEECH COMMUNICATION LA English DT Article DE Language acquisition; Distributional learning; Computer simulation; Phonetic learning; Lexical learning ID SPEECH-PERCEPTION; WORD SEGMENTATION; VOCABULARY DEVELOPMENT; CONNECTIONIST MODEL; PATTERN DISCOVERY; INFANT VOCABULARY; RECOGNITION; CUES; CATEGORIES; FEEDBACK AB This work reviews a number of existing computational studies concentrated on the question of how spoken language can be learned from continuous speech in the absence of linguistically or phonetically motivated background knowledge, a situation faced by human infants when they first attempt to learn their native language. Specifically, the focus is on how phonetic categories and word-like units can be acquired purely on the basis of the statistical structure of speech signals, possibly aided by some articulatory or visual constraints. The outcomes and shortcomings of the existing work are reflected onto findings from experimental and theoretical studies. 
Finally, some of the open questions and possible future research directions related to the computational models of language acquisition are discussed. (c) 2012 Elsevier B.V. All rights reserved. C1 Aalto Univ, Sch Elect Engn, Dept Signal Proc & Acoust, FI-00076 Aalto, Finland. RP Rasanen, O (reprint author), Aalto Univ, Sch Elect Engn, Dept Signal Proc & Acoust, POB 13000, FI-00076 Aalto, Finland. EM okko.rasanen@aalto.fi FU Finnish Graduate School of Language Studies (Langnet); Nokia Research Center Tampere FX This research was funded by the Finnish Graduate School of Language Studies (Langnet) and Nokia Research Center Tampere. The author would also like to thank Heikki Rasilo, Roger K. Moore, and the two anonymous reviewers for their invaluable comments on the manuscript. CR Ahissar E, 2005, AUDITORY CORTEX: SYNTHESIS OF HUMAN AND ANIMAL RESEARCH, P295 Aimetti G., 2009, P STUD RES WORKSH EA, P1, DOI 10.3115/1609179.1609180 Almpanidis G, 2008, SPEECH COMMUN, V50, P38, DOI 10.1016/j.specom.2007.06.005 Altosaar T., 2010, P INT C LANG RES EV, P1062 Aversano G, 2001, PROCEEDINGS OF THE 44TH IEEE 2001 MIDWEST SYMPOSIUM ON CIRCUITS AND SYSTEMS, VOLS 1 AND 2, P516, DOI 10.1109/MWSCAS.2001.986241 Beal J., 2009, P 31 ANN C COGN SCI, P99 Best CC, 2003, LANG SPEECH, V46, P183 Blanchard D, 2010, J CHILD LANG, V37, P487, DOI 10.1017/S030500090999050X Brent MR, 1996, COGNITION, V61, P93, DOI 10.1016/S0010-0277(96)00719-6 Brent MR, 1999, MACH LEARN, V34, P71, DOI 10.1023/A:1007541817488 Brent MR, 2001, COGNITION, V81, pB33, DOI 10.1016/S0010-0277(01)00122-6 Brosch M, 2005, AUDITORY CORTEX: SYNTHESIS OF HUMAN AND ANIMAL RESEARCH, P127 Buttery P., 2006, 675 U CAMBR COMP LAB CASELLI MC, 1995, COGNITIVE DEV, V10, P159, DOI 10.1016/0885-2014(95)90008-X Christiansen MH, 2009, DEVELOPMENTAL SCI, V12, P388, DOI 10.1111/j.1467-7687.2009.00824.x Christiansen MH, 1998, LANG COGNITIVE PROC, V13, P221 Coen MH, 2005, P 20 NAT C ART INT A, P932 Coen Michael H., 2006, P 21 NAT C ART INT A, V2, P1451 Curtin S, 2005, COGNITION, V96, P233, DOI 10.1016/j.cognition.2004.08.005 Curtin S, 2001, PROC ANN BUCLD, P190 CUTLER A, 1994, LINGUA, V92, P81, DOI 10.1016/0024-3841(94)90338-7 Daland R, 2011, COGNITIVE SCI, V35, P119, DOI 10.1111/j.1551-6709.2010.01160.x de Marcken C., 1995, 1558 AI MIT de Boer B, 2003, ACOUST RES LETT ONL, V4, P129, DOI 10.1121/1.1613311 DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1 Demuynck K., 2002, P 5 INT C TEXT SPEEC, P277 Driesen J., 2009, P INT 2009, p1731 Duran D., 2010, RES LANGUAGE COMPUTA, V8, P133, DOI 10.1007/s11168-011-9075-4 EIMAS PD, 1971, SCIENCE, V171, P303, DOI 10.1126/science.171.3968.303 ELMAN JL, 1990, COGNITIVE SCI, V14, P179, DOI 10.1207/s15516709cog1402_1 Emmorey K, 2006, ACTION TO LANGUAGE VIA THE MIRROR NEURON SYSTEM, P110, DOI 10.1017/CBO9780511541599.005 Esposito A, 2005, LECT NOTES ARTIF INT, V3445, P261 Estevan YP, 2007, INT CONF ACOUST SPEE, P937 Fant G., 1985, Q PROGR STATUS REPOR, V4, P1 Feldman N. 
H., 2009, P 31 ANN C COGN SCI, P2208 Feldman NH, 2009, PSYCHOL REV, V116, P752, DOI 10.1037/a0017196 Fenson L., 2003, MACARTHUS BATES COMM Gentner D., 1983, LANGUAGE COGNITION C, V2 Gleitman Lila, 1990, LANG ACQUIS, V1, P3, DOI DOI 10.1207/S153278171A0101_2 Goldstein MH, 2008, PSYCHOL SCI, V19, P515, DOI 10.1111/j.1467-9280.2008.02117.x GOLINKOFF RM, 1994, J CHILD LANG, V21, P125 Gros-Louis J, 2006, INT J BEHAV DEV, V30, P509, DOI 10.1177/0165025406071914 Guenther FH, 1996, J ACOUST SOC AM, V100, P1111, DOI 10.1121/1.416296 Hamilton A, 2000, J CHILD LANG, V27, P689, DOI 10.1017/S0305000900004414 HILLENBRAND J, 1995, J ACOUST SOC AM, V97, P3099, DOI 10.1121/1.411872 Hirsch H.-G., 2000, P ISCA ITRW ASR2000, P29 Houston DM, 2003, J EXP PSYCHOL HUMAN, V29, P1143, DOI 10.1037/0096-1523.29.6.1143 Howard IS, 2011, MOTOR CONTROL, V15, P85 Iverson P., 1994, J ACOUST SOC AM, V95, P2976, DOI 10.1121/1.408983 Jones SS, 2007, PSYCHOL SCI, V18, P593, DOI 10.1111/j.1467-9280.2007.01945.x JUSCZYK PW, 1993, PROCEEDINGS OF THE FIFTEENTH ANNUAL CONFERENCE OF THE COGNITIVE SCIENCE SOCIETY, P49 JUSCZYK PW, 1993, J PHONETICS, V21, P3 Kanerva P, 2009, COGN COMPUT, V1, P139, DOI 10.1007/s12559-009-9009-8 Kanerva P., 2000, P 22 ANN C COGN SCI, P103 Kaplan F, 2006, INTERACT STUD, V7, P135, DOI 10.1075/is.7.2.04kap Keshet J., 2005, P INT 05, P2961 Kirchhoff K, 2005, J ACOUST SOC AM, V117, P2238, DOI 10.1121/1.1869172 KOHONEN T, 1990, P IEEE, V78, P1464, DOI 10.1109/5.58325 Kokkinaki T, 2000, J REPROD INFANT PSYC, V18, P173 Kouki M., 2010, P INT 2010, P2914 Kuhl P., 2005, LANGUAGE LEARNING DE, V1, P237, DOI DOI 10.1080/15475441.2005.9671948 Kuhl PK, 2006, DEVELOPMENTAL SCI, V9, pF13, DOI 10.1111/j.1467-7687.2006.00468.x Kuhl PK, 2004, NAT REV NEUROSCI, V5, P831, DOI 10.1038/nrn1533 Kuhl PK, 2008, PHILOS T R SOC B, V363, P979, DOI 10.1098/rstb.2007.2154 KUHL PK, 1986, EXP BIOL, V45, P233 Kuwahara H., 1972, ACOUSTICAL SOC JAPAN, V28, P225 Lake BM, 2009, IEEE T AUTON MENT DE, V1, P35, DOI 10.1109/TAMD.2009.2021703 Lee DD, 1999, NATURE, V401, P788 LEVITT AG, 1992, J CHILD LANG, V19, P19 LIBERMAN AM, 1985, COGNITION, V21, P1, DOI 10.1016/0010-0277(85)90021-6 MacQueen J., 1967, P 5 BERK S MATH STAT, P281 MACWHINNEY B, 1985, J CHILD LANG, V12, P271 MAEDA S, 1990, NATO ADV SCI I D-BEH, V55, P131 Markey K.L., 1994, THESIS U COLORADO CO Marr D., 1982, VISION COMPUTATIONAL Maye J, 2002, COGNITION, V82, pB101, DOI 10.1016/S0010-0277(01)00157-3 MCCLELLAND JL, 1986, COGNITIVE PSYCHOL, V18, P1, DOI 10.1016/0010-0285(86)90015-0 McInnes F., 2011, P 33 ANN M COGN SCI, P2006 McMurray B, 2009, DEVELOPMENTAL SCI, V12, P365, DOI 10.1111/j.1467-7687.2009.00821.x McMurray B, 2009, DEVELOPMENTAL SCI, V12, P369, DOI 10.1111/j.1467-7687.2009.00822.x Mehler J., 1990, COGNITIVE MODELS SPE Meltzoff AN, 2009, SCIENCE, V325, P284, DOI 10.1126/science.1175626 MILLER KD, 1992, NEUROREPORT, V3, P73, DOI 10.1097/00001756-199201000-00019 Miller M., 2009, PERSONALMAGAZIN, P12 Newman M.E.J., 2004, PHYS REV E, V69 Norris D, 2000, BEHAV BRAIN SCI, V23, P299, DOI 10.1017/S0140525X00003241 NORRIS D, 1994, COGNITION, V52, P189, DOI 10.1016/0010-0277(94)90043-4 Nowlan S, 1990, ADV NEURAL INFORMATI, V2, P574 Oates T., 2002, Proceedings 2002 IEEE International Conference on Data Mining. ICDM 2002, DOI 10.1109/ICDM.2002.1183920 Oates T., 2001, THESIS U MASSACHUSET Park A, 2006, INT CONF ACOUST SPEE, P409 Park A., 2005, P ASRU SAN JUAN PUER, P53 PETERSON GE, 1952, J ACOUST SOC AM, V24, P175, DOI 10.1121/1.1906875 Pinker Steven, 1989, LEARNABILITY COGNITI Pisoni D. 
B., 1997, TALKER VARIABILITY S, P9 Port R, 2007, NEW IDEAS PSYCHOL, V25, P143, DOI 10.1016/j.newideapsych.2007.02.001 Quine W.V.O., 1960, WORD OBJECT Rasanen O., P 34 ANN C COGN SCI Rasanen O, 2011, COGNITION, V120, P149, DOI 10.1016/j.cognition.2011.04.001 Rasanen O., STRUCTURE CONT UNPUB RASANEN O, 2008, P INT 08, P1980 Rasanen O, 2012, PATTERN RECOGN, V45, P606, DOI 10.1016/j.patcog.2011.05.005 Rasanen O., 2012, Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2012), DOI 10.1109/ICASSP.2012.6289052 Rasanen Okko, 2011, Speech Technologies Rasanen O., 2009, P 17 NORD C COMP LIN, P255 Roy D, 2003, IEEE T MULTIMEDIA, V5, P197, DOI 10.1109/TMM.2003.811618 Saffran JR, 1996, SCIENCE, V274, P1926, DOI 10.1126/science.274.5294.1926 Saffran JR, 2001, COGNITION, V81, P149, DOI 10.1016/S0010-0277(01)00132-9 Saffran JR, 1996, J MEM LANG, V35, P606, DOI 10.1006/jmla.1996.0032 Scharenborg O., 2007, P INT C SPOK LANG PR, P1953 Scharenborg O, 2010, PRAGMAT COGN, V18, P136, DOI 10.1075/pc.18.1.06sch Smith K, 2006, LECT NOTES ARTIF INT, V4211, P31 Smith K, 2011, COGNITIVE SCI, V35, P480, DOI 10.1111/j.1551-6709.2010.01158.x Smith L, 2008, COGNITION, V106, P1558, DOI 10.1016/j.cognition.2007.06.010 Stager CL, 1997, NATURE, V388, P381, DOI 10.1038/41102 Steels L., 2000, EVOLUTION COMMUNICAT, V4, P3 Steels L, 2003, TRENDS COGN SCI, V7, P308, DOI 10.1016/S1364-6613(03)00129-3 Stouten V., 2007, IEEE SIGNAL PROCESSI, V15, P131 Swingley D, 2005, COGNITIVE PSYCHOL, V50, P86, DOI 10.1016/j.cogpsych.2004.06.001 Ten Bosch L, 2007, P INTERSPEECH2007, P1481 ten Bosch L., 2009, P WORKSH CHILD COMP ten Bosch Louis, 2008, Speech Recognition - Technologies and Applications ten Bosch L, 2009, INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, VOLS 1-5, P704 ten Bosch L, 2007, SPEECH COMMUN, V49, P331, DOI 10.1016/j.specom.2007.03.001 ten Bosch L, 2009, FUND INFORM, V90, P229, DOI 10.3233/FI-2009-0016 Thiessen ED, 2004, PERCEPT PSYCHOPHYS, V66, P779, DOI 10.3758/BF03194972 Thiessen ED, 2003, DEV PSYCHOL, V39, P706, DOI 10.1037/0012-1649.39.4.706 Toledano DT, 2003, IEEE T SPEECH AUDI P, V11, P617, DOI 10.1109/TSA.2003.813579 Tomasello M, 1995, JOINT ATTENTION ITS, V16, P103 Tomasello M., 2000, PRAGMATICS, V10, P401 Toscano JC, 2010, COGNITIVE SCI, V34, P434, DOI 10.1111/j.1551-6709.2009.01077.x TREHUB SE, 1976, CHILD DEV, V47, P466, DOI 10.2307/1128803 Tsao FM, 2004, CHILD DEV, V75, P1067, DOI 10.1111/j.1467-8624.2004.00726.x Unal F.A., 1992, P INT JOINT C NEUR N, P715 Vallabha GK, 2007, P NATL ACAD SCI USA, V104, P13273, DOI 10.1073/pnas.0705369104 Van hamme H., 2008, P INT C SPOK LANG PR, P2554 Venkataraman A, 2001, COMPUT LINGUIST, V27, P351, DOI 10.1162/089120101317066113 Versteegh M., 2010, P INT 10 CHIB JAP, P2930 Villing R., 2006, P ISSC, P521 Warren RM, 2000, BEHAV BRAIN SCI, V23, P350, DOI 10.1017/S0140525X00503240 Waterson N., 1971, J LINGUIST, V7, P179, DOI [10.1017/S0022226700002917S0022226700002917, DOI 10.1017/S0022226700002917] Werker J., 2005, LANGUAGE LEARNING DE, V1, P197, DOI DOI 10.1080/15475441.2005.9684216 WERKER JF, 1984, INFANT BEHAV DEV, V7, P49, DOI 10.1016/S0163-6383(84)80022-3 Witner S., 2010, P 11 INT C COMP LING, P86 NR 144 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
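[Illustrative note] One of the statistical-learning mechanisms discussed in the review above is segmentation of continuous input from syllable-to-syllable transitional probabilities, with word boundaries hypothesised at probability dips (as in the cited Saffran et al. work). The sketch below demonstrates that mechanism on a toy syllable stream; the pseudo-words and the 0.8 boundary threshold are assumptions, and this is not a model proposed in the paper.

    from collections import Counter

    w1, w2 = ["ba", "bi", "go"], ["la", "tu", "pe"]          # two pseudo-words (toy)
    order = [w1, w2, w1, w1, w2, w2, w1, w2, w2, w1, w1, w2]
    syllables = [s for w in order for s in w]

    pairs = list(zip(syllables, syllables[1:]))
    pair_counts = Counter(pairs)
    first_counts = Counter(p[0] for p in pairs)
    tp = {p: c / first_counts[p[0]] for p, c in pair_counts.items()}

    words, cur = [], [syllables[0]]
    for a, b in pairs:
        if tp[(a, b)] < 0.8:                  # transitional-probability dip -> hypothesise a boundary
            words.append("".join(cur))
            cur = []
        cur.append(b)
    words.append("".join(cur))
    print(words[:4])                          # ['babigo', 'latupe', 'babigo', 'babigo']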
PD NOV PY 2012 VL 54 IS 9 BP 975 EP 997 DI 10.1016/j.specom.2012.05.001 PG 23 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 987LV UT WOS:000307419400001 ER PT J AU Irino, T Aoki, Y Kawahara, H Patterson, RD AF Irino, Toshio Aoki, Yoshie Kawahara, Hideki Patterson, Roy D. TI Comparison of performance with voiced and whispered speech in word recognition and mean-formant-frequency discrimination SO SPEECH COMMUNICATION LA English DT Article DE Whispered word recognition; Mean formant frequency discrimination; Modes of vocal excitation ID SPEAKER SIZE; VOCAL-TRACT; BODY-SIZE; VOWELS; IDENTIFICATION; SEX; PITCH; INFORMATION; PERCEPTION; PARAMETERS AB There has recently been a series of studies concerning the interaction of glottal pulse rate (GPR) and mean-formant-frequency (MFF) in the perception of speaker characteristics and speech recognition. This paper extends the research by comparing the recognition and discrimination performance achieved with voiced words to that achieved with whispered words. The recognition experiment shows that performance with whispered words is slightly worse than with voiced words at all MFFs when the GPR of the voiced words is in the middle of the normal range. But, as GPR decreases below this range, voiced-word performance decreases and eventually becomes worse than whispered-word performance. The discrimination experiment shows that the just noticeable difference (JND) for MFF is essentially independent of the mode of vocal excitation; the JND is close to 5% for both voiced and voiceless words for all speaker types. The interaction between GPR and VTL is interpreted in terms of the stability of the internal representation of speech which improves with GPR across the range of values used in these experiments. (c) 2012 Elsevier B.V. All rights reserved. C1 [Irino, Toshio; Aoki, Yoshie; Kawahara, Hideki] Wakayama Univ, Fac Syst Engn, Wakayama 6408510, Japan. [Patterson, Roy D.] Univ Cambridge, Dept Physiol Dev & Neurosci, Ctr Neural Basis Hearing, Cambridge CB2 3EG, England. RP Irino, T (reprint author), Wakayama Univ, Fac Syst Engn, 930 Sakaedani, Wakayama 6408510, Japan. EM irino@sys.wakayama-u.ac.jp; kawahara@sys.wakayama-u.ac.jp; rdp1@cam.ac.uk CR Aoki Y., 2008, ARO 31 MIDW M PHOEN Aoki Y., 2008, J ACOUST SOC AM 2, V123, P3718, DOI 10.1121/1.2935170 Assmann PF, 2005, J ACOUST SOC AM, V117, P886, DOI 10.1121/1.1852549 Assmann PF, 2008, J ACOUST SOC AM, V124, P3203, DOI 10.1121/1.2980456 Boersma P., 2001, GLOT INT, V5, P341 Chiba T., 1942, VOWEL ITS NATURE STR CORNSWEE.TN, 1965, J PHYSIOL-LONDON, V176, P294 Fant G., 1970, ACOUSTIC THEORY SPEE Fitch WT, 1999, J ACOUST SOC AM, V106, P1511, DOI 10.1121/1.427148 FUJISAKI H, 1968, IEEE T ACOUST SPEECH, VAU16, P73, DOI 10.1109/TAU.1968.1161952 Ghazanfar A. 
A., 2008, CURR BIOL, V18, P457 Gonzalez J, 2004, J PHONETICS, V32, P277, DOI 10.1016/S0095-4470(03)00049-4 HILLENBRAND J, 1995, J ACOUST SOC AM, V97, P3099, DOI 10.1121/1.411872 Huber JE, 1999, J ACOUST SOC AM, V106, P1532, DOI 10.1121/1.427150 Irino T, 2002, SPEECH COMMUN, V36, P181, DOI 10.1016/S0167-6393(00)00085-6 Irino T, 2006, IEEE T AUDIO SPEECH, V14, P2222, DOI 10.1109/TASL.2006.874669 Ives DT, 2005, J ACOUST SOC AM, V118, P3816, DOI 10.1121/1.2118427 Kawahara H., 2004, SPEECH SEPARATION HU, P167 Kawahara H., 2006, Acoustical Science and Technology, V27, DOI 10.1250/ast.27.349 Kawahara H, 1999, SPEECH COMMUN, V27, P187, DOI 10.1016/S0167-6393(98)00085-5 LASS NJ, 1976, J ACOUST SOC AM, V59, P675, DOI 10.1121/1.380917 Lee S, 1999, J ACOUST SOC AM, V105, P1455, DOI 10.1121/1.426686 Liu C, 2004, ACOUST RES LETT ONL, V5, P31, DOI 10.1121/1.1635431 MARCUS SM, 1981, PERCEPT PSYCHOPHYS, V30, P247, DOI 10.3758/BF03214280 Markel J., 1975, LINEAR PREDICTION SP MILLER GA, 1947, J ACOUST SOC AM, V19, P609, DOI 10.1121/1.1916528 Nearey T. M., 2002, J ACOUST SOC AM, V112, P2323 PETERSON GE, 1952, J ACOUST SOC AM, V24, P175, DOI 10.1121/1.1906875 Pisanski K, 2011, J ACOUST SOC AM, V129, P2201, DOI 10.1121/1.3552866 Rendall D, 2005, J ACOUST SOC AM, V117, P944, DOI 10.1121/1.1848011 Rendall D, 2007, J EXP PSYCHOL HUMAN, V33, P1208, DOI 10.1037/0096-1523.33.5.1208 Sakamoto S., 2004, ACOUST SCI TECHNOL, V106, P1511 SCHWARTZ MF, 1968, J ACOUST SOC AM, V43, P1178, DOI 10.1121/1.1910954 SCHWARTZ MF, 1970, J SPEECH HEAR RES, V13, P445 SCHWARTZ MF, 1968, J ACOUST SOC AM, V43, P1448, DOI 10.1121/1.1911007 SCHWARTZ MF, 1968, J ACOUST SOC AM, V44, P1736, DOI 10.1121/1.1911324 SINNOTT JM, 1987, J COMP PSYCHOL, V101, P126, DOI 10.1037/0735-7036.101.2.126 Smith DRR, 2005, J ACOUST SOC AM, V117, P305, DOI 10.1121/1.1828637 Smith D.R.R., 2005, M BRIT SOC AUD CARD TARTTER VC, 1989, J ACOUST SOC AM, V86, P1678, DOI 10.1121/1.398598 TARTTER VC, 1991, PERCEPT PSYCHOPHYS, V49, P365, DOI 10.3758/BF03205994 TARTTER VC, 1994, J ACOUST SOC AM, V96, P2101, DOI 10.1121/1.410151 Traunmuller H, 2000, J ACOUST SOC AM, V107, P3438, DOI 10.1121/1.429414 Tsujimura N, 2007, INTRO JAPANESE LINGU Turner RE, 2009, J ACOUST SOC AM, V125, P2374, DOI 10.1121/1.3079772 vanDommelen WA, 1995, LANG SPEECH, V38, P267 Vestergaard MD, 2009, J ACOUST SOC AM, V126, P2860, DOI 10.1121/1.3257582 Wichmann FA, 2001, PERCEPT PSYCHOPHYS, V63, P1293, DOI 10.3758/BF03194544 NR 48 TC 6 Z9 7 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 2012 VL 54 IS 9 BP 998 EP 1013 DI 10.1016/j.specom.2012.04.002 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 987LV UT WOS:000307419400002 ER PT J AU Ogawa, A Nakamura, A AF Ogawa, Atsunori Nakamura, Atsushi TI Joint estimation of confidence and error causes in speech recognition SO SPEECH COMMUNICATION LA English DT Article DE Speech recognition; Confidence estimation; Error cause detection; Joint estimation; Discriminative model ID COMBINATION AB Speech recognition errors are essentially unavoidable under the severe conditions of real fields, and so confidence estimation, which scores the reliability of a recognition result, plays a critical role in the development of speech recognition based real-field application systems. 
However, if we are to develop an application system that provides a high-quality service, in addition to achieving accurate confidence estimation, we also need to extract and exploit further supplementary information from a speech recognition engine. As a first step in this direction, in this paper, we propose a method for estimating the confidence of a recognition result while jointly detecting the causes of recognition errors based on a discriminative model. The confidence of a recognition result and the nonexistence/existence of error causes are naturally correlated. By directly capturing these correlations between the confidence and error causes, the proposed method enhances its estimation performance for the confidence and each error cause complementarily. In the initial speech recognition experiments, the proposed method provided higher confidence estimation accuracy than a discriminative model based state-of-the-art confidence estimation method. Moreover, the effective estimation mechanism of the proposed method was confirmed by the detailed analyses. (c) 2012 Elsevier B.V. All rights reserved. C1 [Ogawa, Atsunori; Nakamura, Atsushi] NTT Corp, NTT Commun Sci Labs, Seika, Kyoto, Japan. RP Ogawa, A (reprint author), NTT Corp, NTT Commun Sci Labs, 2-4 Hikaridai, Seika, Kyoto, Japan. EM ogawa.atsunori@lab.ntt.co.jp; naka-mura.atsushi@lab.ntt.co.jp CR Berger AL, 1996, COMPUT LINGUIST, V22, P39 Burget L, 2008, INT CONF ACOUST SPEE, P4081, DOI 10.1109/ICASSP.2008.4518551 Chase L., 1997, P EUR C SPEECH COMM, P815 Fayolle J., 2010, P INTERSPEECH, P1492 Google Inc, 2011, ANDR DEV HAZEN TJ, 2001, ACOUST SPEECH SIG PR, P397 Hori T, 2007, IEEE T AUDIO SPEECH, V15, P1352, DOI 10.1109/TASL.2006.889790 ICASSP Special Session (SS-6), 2008, P ICASSP PRAG CZECH, P5240 Interspeech Special Highlight Session, 2010, P INTERSPEECH Interspeech Special Session (Wed-Ses-S1), 2009, P INTERSPEECH Jiang H, 2005, SPEECH COMMUN, V45, P455, DOI 10.1016/j.specrom.2004.12.004 Kittler J, 1998, IEEE T PATTERN ANAL, V20, P226, DOI 10.1109/34.667881 Kombrink S., 2009, P INTERSPEECH, P80 Lafferty John D., 2001, ICML, P282 Lee C.-H., 2001, P INT WORKSH HANDS F, P27 Nakagawa S., 1994, J ACOUSTICAL SOC JAP, V50, P849 Nakano T., 2007, P IEEE ASRU, P601 Ogawa A, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 1-2, P242 Ogawa A., 2009, P INT, P1199 Ogawa A, 2010, INT CONF ACOUST SPEE, P4454, DOI 10.1109/ICASSP.2010.5495608 Pearl J., 1988, PROBABILISTIC REASON Schalkwyk J., 2010, ADV SPEECH RECOGNITI, P61, DOI 10.1007/978-1-4419-5951-5_4 Sukkar RA, 1997, SPEECH COMMUN, V22, P333, DOI 10.1016/S0167-6393(97)00031-9 Wang YY, 2008, IEEE SIGNAL PROC MAG, V25, P29, DOI 10.1109/MSP.2008.918411 Wessel F, 2001, IEEE T SPEECH AUDI P, V9, P288, DOI 10.1109/89.906002 White C, 2007, INT CONF ACOUST SPEE, P809 White C, 2008, INT CONF ACOUST SPEE, P4085, DOI 10.1109/ICASSP.2008.4518552 Yoma NB, 2005, IEEE SIGNAL PROC LET, V12, P745, DOI [10.1109/LSP.2005.856888, 10.1109/LSP.2005.856988] YOUNG SR, 1994, INT CONF ACOUST SPEE, P21 Yu D, 2009, PATTERN RECOGN LETT, V30, P1295, DOI 10.1016/j.patrec.2009.06.005 Zhou B., 2011, COMPUTER SPEECH LANG NR 31 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
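[Illustrative note] The record above estimates the confidence of a recognition result with a discriminative model over recognizer-derived features, jointly with error-cause detection. The sketch below shows only the basic discriminative confidence-scoring setup, a hand-rolled logistic regression over toy features; the feature set, labels and learning rate are illustrative assumptions, not the paper's joint model.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # each row: [word posterior, acoustic score margin, LM score]; label 1 = correctly recognised word (toy)
    X = np.array([[0.95, 1.2, -2.1], [0.40, 0.1, -5.0], [0.88, 0.9, -2.8],
                  [0.30, -0.2, -6.1], [0.75, 0.6, -3.3], [0.20, -0.5, -7.0]])
    y = np.array([1, 0, 1, 0, 1, 0])

    w, b, lr = np.zeros(3), 0.0, 0.5
    for _ in range(2000):                     # plain batch gradient ascent on the log-likelihood
        p = sigmoid(X @ w + b)
        w += lr * X.T @ (y - p) / len(y)
        b += lr * np.mean(y - p)

    new_word = np.array([0.85, 0.8, -3.0])    # features of a newly recognised word (toy)
    print(sigmoid(new_word @ w + b))          # high confidence, close to the correct-word examples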
PD NOV PY 2012 VL 54 IS 9 BP 1014 EP 1028 DI 10.1016/j.specom.2012.04.004 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 987LV UT WOS:000307419400003 ER PT J AU Clemente, IA Heckmann, M Wrede, B AF Clemente, Irene Ayllon Heckmann, Martin Wrede, Britta TI Incremental word learning: Efficient HMM initialization and large margin discriminative adaptation SO SPEECH COMMUNICATION LA English DT Article DE Incremental word learning; Learning from few examples; Speech recognition; Multiple sequence alignment; Discriminative training; Bootstrapping ID HIDDEN MARKOV-MODELS; SPEECH RECOGNITION; LANGUAGE; MATRICES AB In this paper we present an incremental word learning system that is able to cope with few training data samples to enable speech acquisition in on-line human robot interaction. As with most automatic speech recognition systems (ASR), our architecture relies on a Hidden Markov Model (HMM) framework where the different word models are sequentially trained and the system has little prior knowledge. To achieve good performance, HMMs depends on the amount of training data, the initialization procedure and the efficiency of the discriminative training algorithms. Thus, we propose different approaches to improve the system. One major problem of using a small amount of training data is over-fitting. Hence we present a novel estimation of the variance floor dependent on the number of available training samples. Next, we propose a bootstrapping approach in order to get a good initialization of the HMM parameters. This method is based on unsupervised training of the parameters and subsequent construction of a new HMM by aligning and merging Viterbi decoded sequences. Finally, we investigate large margin discriminative training techniques to enlarge the generalization performance of the models using several strategies suitable for limited training data. In the evaluation of the results, we examine the contribution of the different stages proposed to the overall system performance. This includes the comparison of different state-of-the-art methods with our presented techniques and the investigation of the possible reduction of the number of training data samples. We compare our algorithms on isolated and continuous digit recognition tasks. To sum up, we show that the proposed algorithms yield significant improvements and are a step towards efficient learning with few examples. (c) 2012 Elsevier B.V. All rights reserved. C1 [Clemente, Irene Ayllon; Wrede, Britta] Univ Bielefeld, Res Inst Cognit & Robot, CoR Lab, D-33615 Bielefeld, Germany. [Clemente, Irene Ayllon; Heckmann, Martin] Honda Res Inst Europe GmbH, D-63073 Offenbach, Germany. RP Clemente, IA (reprint author), Univ Bielefeld, Res Inst Cognit & Robot, CoR Lab, D-33615 Bielefeld, Germany. EM iayllon@cor-lab.uni-bielefeld.de; martin.heckmann@honda-ri.de; bwrede@cor-lab.uni-bielefeld.de CR Ayllon Clemente I., 2010, P INTERSPEECH Ayllon Clemente I., 2010, P IEEE INT C AC SPEE BAKIS R, 1976, P ASA M WASH DC Bertsekas DP, 1999, NONLINEAR PROGRAMMIN Bilmes J., 2002, WHAT HMMS CAN DO Bimbot F., 1998, OVERVIEW CAVE PROJEC Bishop C. 
M., 2006, PATTERN RECOGNITION Bosch L., 2009, FUNDAMENTA INFORM, V90, P229 Bosch L., 2008, P INT C TEXT SPEECH, P261 Boves L, 2007, PROCEEDINGS OF THE SIXTH IEEE INTERNATIONAL CONFERENCE ON COGNITIVE INFORMATICS, P349 Brandl H., 2010, THESIS BIELEFELD U Brandl H, 2008, INT C DEVEL LEARN, P31, DOI 10.1109/DEVLRN.2008.4640801 Chang T.-H., 2008, P IEEE INT C AC SPEE Charles W., 1992, TRULY INTELLIGENT CO Devijver P. A., 1982, PATTERN RECOGNITION Dong Yu, 2008, Computer Speech & Language, V22, DOI 10.1016/j.csl.2008.03.002 Fink G. A., 2008, MARKOV MODELS PATTER Gales MJF, 1999, IEEE T SPEECH AUDI P, V7, P272, DOI 10.1109/89.759034 Ganapathiraju A., 2000, P SPEECH TRANSCR WOR Garofolo J.S., 1993, 20041068 USGS Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278 Geiger J.T., 2011, P INTERSPEECH Ghahramani Z, 2001, INT J PATTERN RECOGN, V15, P9, DOI 10.1142/S0218001401000836 Ghoshal A., 2005, P 28 ANN INT ACM SIG, P544, DOI 10.1145/1076034.1076127 He XD, 2008, IEEE SIGNAL PROC MAG, V25, P14, DOI 10.1109/MSP.2008.926652 Hermansky H., 1994, IEEE T SPEECH AUDIO, V2, P587 Huang X., 2001, SPOKEN LANGUAGE PROC Itaya Y, 2005, IEICE T INF SYST, VE88D, P425, DOI 10.1093/ietisy/e88-d.3.425 Iwahashi N, 2006, LECT NOTES ARTIF INT, V4211, P143 JONES DT, 1992, COMPUT APPL BIOSCI, V8, P275 Juang BH, 1997, IEEE T SPEECH AUDI P, V5, P257 Kadous MW, 2002, THESIS U NEW S WALES Lee C.-H., 1992, Computer Speech and Language, V6, DOI 10.1016/0885-2308(92)90022-V Leonard R., 1984, P IEEE INT C AC SPEE Li X., 2006, P INTERSPEECH Li X., 2005, P IEEE INT C AC SPEE Lin H., 2011, P INTERSPEECH Liu F.-H., 1994, THESIS CARNEGIE MELL Markov K., 2007, P INT McDermott E., 1997, THESIS WASEDA U Melin H., 1998, P INT C SPOK LANG PR Melin H., 1998, OPTIMIZING VARIANCE Melin H., 1999, P EUR C SPEECH COMM, P5 Minematsu N., 2010, P INT C SPEECH PROS Mokbel C., 1999, P IEEE INT C AC SPEE Moore RK, 2007, SPEECH COMMUN, V49, P418, DOI 10.1016/j.specom.2007.01.011 Morgan N., 1993, International Journal of Pattern Recognition and Artificial Intelligence, V7, DOI 10.1142/S0218001493000455 Nabney I. T., 2002, ADV PATTERN RECOGNIT Nathan K, 1996, INT CONF ACOUST SPEE, P3502, DOI 10.1109/ICASSP.1996.550783 NEEDLEMA.SB, 1970, J MOL BIOL, V48, P443, DOI 10.1016/0022-2836(70)90057-4 Neukirchen C., 1998, P ICSLP, P2999 Nocedal J., 1999, NUMERICAL OPTIMIZATI RABINER LR, 1989, P IEEE, V77, P257, DOI 10.1109/5.18626 ROSENBERG A., 1998, P IEEE INT C AC SPEE Roy D, 2003, IEEE T MULTIMEDIA, V5, P197, DOI 10.1109/TMM.2003.811618 Scholkopf B., 2001, LEARNING KERNELS SUP Sha F., 2007, ADV NEURAL INFORM PR Sim KC, 2006, IEEE T AUDIO SPEECH, V14, P882, DOI 10.1109/TSA.2005.858062 SIU MH, 1999, ACOUST SPEECH SIG PR, P105 Smith K., 2002, HIDDEN MARKOV MODELS SMITH TF, 1981, J MOL BIOL, V147, P195, DOI 10.1016/0022-2836(81)90087-5 Stuttle M.N., 2004, THESIS HUGHES HALL C Theodoridis S, 2009, PATTERN RECOGNITION, 4RTH EDITION, P1 Van hamme H., 2008, P INT Van Segbroeck M, 2009, SPEECH COMMUN, V51, P1124, DOI 10.1016/j.specom.2009.05.003 Vapnik V, 1998, STAT LEARNING THEORY Wu Y., 2007, DATA SELECTION SPEEC Young S. J., 2006, HTK BOOK VERSION 3 4 NR 68 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
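[Illustrative note] The record above introduces, among other things, a variance floor whose value depends on the number of available training samples. The abstract does not give the formula, so the sketch below only illustrates variance flooring with an ad hoc sample-count-dependent factor; the 1/sqrt(n) scaling and all constants are assumptions, not the authors' estimator.

    import numpy as np

    def floored_variance(samples, global_var, base_factor=0.01):
        """Per-dimension variance of `samples`, floored more aggressively when n is small."""
        n = len(samples)
        var = np.var(samples, axis=0)
        factor = base_factor * (1.0 + 1.0 / np.sqrt(n))   # stronger floor for small n (assumption)
        floor = factor * global_var
        return np.maximum(var, floor)

    rng = np.random.default_rng(2)
    global_var = np.ones(13)                      # e.g. variance over all training speech, 13 MFCCs (toy)
    few = rng.normal(0.0, 0.05, size=(3, 13))     # only three frames available for this state (toy)
    print(floored_variance(few, global_var))      # all values sit at the floor (~0.016); raw variances were smaller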
PD NOV PY 2012 VL 54 IS 9 BP 1029 EP 1048 DI 10.1016/j.specom.2012.04.005 PG 20 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 987LV UT WOS:000307419400004 ER PT J AU Truong, KP van Leeuwen, DA de Jong, FMG AF Truong, Khiet P. van Leeuwen, David A. de Jong, Franciska M. G. TI Speech-based recognition of self-reported and observed emotion in a dimensional space SO SPEECH COMMUNICATION LA English DT Article DE Affective computing; Automatic emotion recognition; Emotional speech; Emotion database; Audiovisual database; Emotion perception; Emotion annotation; Emotion elicitation; Videogames; Support Vector Regression ID SUPPORT VECTOR REGRESSION; AUTOMATIC RECOGNITION; GAME; FEATURES; MODEL AB The differences between self-reported and observed emotion have only marginally been investigated in the context of speech-based automatic emotion recognition. We address this issue by comparing self-reported emotion ratings to observed emotion ratings and look at how differences between these two types of ratings affect the development and performance of automatic emotion recognizers developed with these ratings. A dimensional approach to emotion modeling is adopted: the ratings are based on continuous arousal and valence scales. We describe the TNO-Gaming Corpus that contains spontaneous vocal and facial expressions elicited via a multiplayer videogame and that includes emotion annotations obtained via self-report and observation by outside observers. Comparisons show that there are discrepancies between self-reported and observed emotion ratings which are also reflected in the performance of the emotion recognizers developed. Using Support Vector Regression in combination with acoustic and textual features, recognizers of arousal and valence are developed that can predict points in a 2-dimensional arousal-valence space. The results of these recognizers show that the self-reported emotion is much harder to recognize than the observed emotion, and that averaging ratings from multiple observers improves performance. (c) 2012 Elsevier B.V. All rights reserved. C1 [Truong, Khiet P.; de Jong, Franciska M. G.] Univ Twente, NL-7500 AE Enschede, Netherlands. [van Leeuwen, David A.] Radboud Univ Nijmegen, Ctr Language & Speech Technol, NL-6500 HD Nijmegen, Netherlands. RP Truong, KP (reprint author), Univ Twente, POB 217, NL-7500 AE Enschede, Netherlands. EM k.p.truong@utwente.nl; d.vanleeuwen@let.ru.nl; f.m.g.dejong@utwente.nl FU MultimediaN; European Community's Seventh Framework Programme (FP7) [231287] FX We would like to thank the anonymous reviewers for their helpful comments. This work was supported by MultimediaN and the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 231287 (SSPNet). 
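To make the regression setup in the abstract above concrete, here is a minimal sketch of two epsilon-SVRs (scikit-learn) that map utterance-level features to points in the arousal-valence plane. The random features, dimensionalities, and hyperparameters are placeholders, not the configuration used in the paper.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data: 200 utterances x 40 features (stand-ins for the acoustic and
# textual features), with rated arousal and valence targets in [-1, 1].
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 40))
arousal = np.tanh(X[:, 0] + 0.1 * rng.normal(size=200))
valence = np.tanh(X[:, 1] + 0.1 * rng.normal(size=200))

# One epsilon-SVR per emotion dimension; together they predict a point
# in the 2-D arousal-valence space for each utterance.
arousal_model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0, epsilon=0.1))
valence_model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0, epsilon=0.1))
arousal_model.fit(X[:150], arousal[:150])
valence_model.fit(X[:150], valence[:150])

points = np.column_stack([arousal_model.predict(X[150:]),
                          valence_model.predict(X[150:])])
print(points[:5])   # predicted (arousal, valence) pairs
```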
CR Ang J, 2002, P INT C SPOK LANG PR, P2037 Auberge V., 2006, P 5 INT C LANG RES E Banse R, 1996, J PERS SOC PSYCHOL, V70, P614, DOI 10.1037/0022-3514.70.3.614 Batliner A., 2006, LANGUAGE TECHNOLOGIE, P240 Batliner A, 2000, P ISCA WORKSH SPEECH, P195 Biersack S., 2005, P ISCA WORKSH PLAST, P211 Boersma P., 2009, PRAAT DOING PHONETIC Busso C, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P257 Chang C.-C., 2001, LIBSVM LIB SUPPORT V Cowie R, 2001, IEEE SIGNAL PROC MAG, V18, P32, DOI 10.1109/79.911197 Cowie R., 2000, P ISCA WORKSH SPEECH, P19 Cowie R, 2003, SPEECH COMMUN, V40, P5, DOI 10.1016/S0167-6393(02)00071-7 Dellaert F, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P1970 den Uyl M.J., 2005, P MEAS BEH, P589 Devillers L, 2005, NEURAL NETWORKS, V18, P407, DOI 10.1016/j.neunet.2005.03.007 Devillers L., 2003, P ICME 2003, P549 Douglas-Cowie E, 2005, P INT 2005 LISB PORT, P813 Ekman P, 1972, NEBRASKA S MOTIVATIO, P207 Ekman P, 1975, UNMASKING FACE GUIDE Eyben F, 2010, J MULTIMODAL USER IN, V3, P7, DOI 10.1007/s12193-009-0032-6 Giannakopoulos T, 2009, INT CONF ACOUST SPEE, P65, DOI 10.1109/ICASSP.2009.4959521 Grimm M, 2007, INT CONF ACOUST SPEE, P1085 Grimm M, 2007, SPEECH COMMUN, V49, P787, DOI 10.1016/j.specom.2007.01.010 Gunes Hatice, 2011, Proceedings 2011 IEEE International Conference on Automatic Face & Gesture Recognition (FG 2011), DOI 10.1109/FG.2011.5771357 Hanjalic A, 2005, IEEE T MULTIMEDIA, V7, P143, DOI 10.1109/TMM.2004.840618 Joachims Thorsten, 1998, P 10 EUR C MACH LEAR, P137 Johnstone T, 2005, EMOTION, V5, P513, DOI 10.1037/1528-3542.5.4.513 Kim J., 2005, P 9 EUR C SPEECH COM, P809 KWON O.W., 2003, P 8 EUR C SPEECH COM, P125 Lang P. J., 1995, AM PSYCHOL, V50, P371, DOI [10.1037/0003-066X.50.5.372, DOI 10.1037/0003-066X.50.5.372] Lazarro N., 2004, WHY WE PLAY GAMES 4 Lee C. M., 2002, P INT C MULT EXP SWI, P737 Liscombe J., 2003, P EUR C SPEECH COMM, P725 Mower E., 2009, P INTERSPEECH, P1583 Nicolaou M., 2010, P LREC INT WORKSH MU, P43 Nicolaou MA, 2011, IEEE T AFFECT COMPUT, V2, P92, DOI 10.1109/T-AFFC.2011.9 Nwe TL, 2003, SPEECH COMMUN, V41, P603, DOI 10.1016/S01167-6393(03)00099-2 Petrushin V.A., 1999, P 1999 C ART NEUR NE POLZIN TS, 1998, P COOP MULT COMM CMC Ravaja N, 2006, PRESENCE-TELEOP VIRT, V15, P381, DOI 10.1162/pres.15.4.381 RUSSELL JA, 1980, J PERS SOC PSYCHOL, V39, P1161, DOI 10.1037/h0077714 SALTON G, 1988, INFORM PROCESS MANAG, V24, P513, DOI 10.1016/0306-4573(88)90021-0 Scherer K. R., 2010, BLUEPRINT AFFECTIVE, P47 SCHLOSBERG H, 1954, PSYCHOL REV, V61, P81, DOI 10.1037/h0054570 SCHULLER B, 2003, ACOUST SPEECH SIG PR, P1 Smola AJ, 2004, STAT COMPUT, V14, P199, DOI 10.1023/B:STCO.0000035301.49549.88 Tato R.S., 2002, P INT C SPOK LANG PR, P2029 Truong K. 
P., 2009, P INTERSPEECH, P2027 Truong K.P., 2008, P INTERSPEECH, P381 Truong KP, 2008, LECT NOTES COMPUT SC, V5237, P161, DOI 10.1007/978-3-540-85853-9_15 Vapnik V., 2002, NATURE STAT LEARNING Ververidis D, 2006, SPEECH COMMUN, V48, P1162, DOI 10.1016/j.specom.2006.04.003 Ververidis D., 2005, P IEEE INT C MULT EX, P1500 Wang N, 2006, LECT NOTES ARTIF INT, V4133, P282 WILLIAMS CE, 1972, J ACOUST SOC AM, V52, P1238, DOI 10.1121/1.1913238 Wollmer M, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P597 Wollmer M., 2009, P INT BRIGHT UK, P1595 Wundt W., 1874, GRUNDZUGE PHYSL PSYC Yildirim S., 2005, P 9 EUR C SPEECH COM, P2209 Yu C., 2004, P 8 INT C SPOK LANG, P1329 Zeng Z., 2005, P IEEE INT C MULT EX, P828 NR 61 TC 7 Z9 8 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 2012 VL 54 IS 9 BP 1049 EP 1063 DI 10.1016/j.specom.2012.04.006 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 987LV UT WOS:000307419400005 ER PT J AU Yunusova, Y Baljko, M Pintilie, G Rudy, K Faloutsos, P Daskalogiannakis, J AF Yunusova, Yana Baljko, Melanie Pintilie, Grigore Rudy, Krista Faloutsos, Petros Daskalogiannakis, John TI Acquisition of the 3D surface of the palate by in-vivo digitization with Wave SO SPEECH COMMUNICATION LA English DT Article DE Hard palate modeling; Thin plate spline (TPS) technique; Palate casts; The Wave system ID ACCURACY; SHAPE AB An accurate characterization of the morphology of the hard palate is essential for understanding its role in human speech. The position of the tongue is adjusted in the oral cavity, of which the hard palate is a key anatomical structure. Methods for modeling the palate are limited at present. This paper evaluated the use of a thin plate spline (TPS) technique for reconstructing the palate surface from a series of in-vivo tracings obtained with electromagnetic articulography using Wave (NDI). Twenty-four individuals (13 females and 11 males) provided upper dental casts and in-vivo tracings. Models of the palate surfaces were derived from data acquired in-vivo and compared to the scanned casts. The optimal value for the smoothness parameter for the TPS technique, which provided the smallest error of fit between the modeled and scanned surfaces, was determined empirically (the value of 0.05). Significant predictors of the quality of the fit were determined and included the individuals' palate characteristics such as palate slope and curvature. The tracing protocol composed of four different traces produced the best palate models for the in-vivo procedure. Evidence demonstrated that the TPS procedure as a whole is suitable for modeling the palate surface using a small number of in-vivo tracings. (C) 2012 Elsevier B.V. All rights reserved. C1 [Yunusova, Yana; Pintilie, Grigore; Rudy, Krista] Univ Toronto, Dept Speech Language Pathol, Toronto, ON M5G 1V7, Canada. [Baljko, Melanie; Faloutsos, Petros] York Univ, Dept Comp Sci & Engn, Toronto, ON M3J 1P3, Canada. [Daskalogiannakis, John] Univ Toronto, Dept Orthodont, Toronto, ON M5G 1G6, Canada. RP Yunusova, Y (reprint author), Univ Toronto, Dept Speech Language Pathol, Rehabil Sci Bldg,160-500 Univ Ave, Toronto, ON M5G 1V7, Canada. 
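A minimal sketch of the thin plate spline reconstruction idea described in the abstract above, using SciPy's RBF interpolator with the smoothness value the paper reports as optimal (0.05). The synthetic palate-like data are invented, and it is an assumption that SciPy's smoothing parameter is scaled like the paper's.

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

# Toy stand-in for EMA palate traces: (x, y) positions with palate height z.
rng = np.random.default_rng(2)
xy = rng.uniform(-1, 1, size=(80, 2))                    # points from a few traces
z = 1.0 - 0.8 * (xy[:, 0] ** 2 + 0.5 * xy[:, 1] ** 2)    # dome-shaped surface

# Thin plate spline fit with smoothing 0.05 (the value reported as optimal).
tps = RBFInterpolator(xy, z, kernel="thin_plate_spline", smoothing=0.05)

# Evaluate the reconstructed surface on a regular grid and report fit error.
gx, gy = np.meshgrid(np.linspace(-1, 1, 25), np.linspace(-1, 1, 25))
grid = np.column_stack([gx.ravel(), gy.ravel()])
surface = tps(grid).reshape(gx.shape)
rmse = np.sqrt(np.mean((tps(xy) - z) ** 2))
print(surface.shape, round(float(rmse), 4))
```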
EM yana.yunusova@utoronto.ca RI Yunusova, Yana/E-3428-2010 OI Yunusova, Yana/0000-0002-2353-2275 FU Natural Sciences and Engineering Research Council of Canada (NSERC); ASHA Foundation FX This research was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant and ASHA Foundation New Investigator Award. CR Berry JJ, 2011, J SPEECH LANG HEAR R, V54, P1295, DOI 10.1044/1092-4388(2011/10-0226) BESL PJ, 1992, IEEE T PATTERN ANAL, V14, P239, DOI 10.1109/34.121791 Bhagyalakshmi G., 2007, SYNDR RES PRACT, V12, P55 Brunner J, 2009, J ACOUST SOC AM, V125, P3936, DOI 10.1121/1.3125313 Brunner J., 2005, ZAS PAPERS LINGUISTI, V42, P43 Ferrario VF, 1998, CLEFT PALATE-CRAN J, V35, P396, DOI 10.1597/1545-1569(1998)035<0396:QDOTMO>2.3.CO;2 Ferrario VF, 2001, CLIN ORTHOD RES, V4, P141, DOI 10.1034/j.1600-0544.2001.040304.x Fuchs S., 2010, TURBULENT SOUNDS INT, P281 Hamilton C., 1993, SYND RES PRACT, V1, P15 HIKI S, 1986, SPEECH COMMUN, V5, P141, DOI 10.1016/0167-6393(86)90004-X Kumar V, 2008, ANGLE ORTHOD, V78, P873, DOI 10.2319/082907-399.1 Mooshammer C., 2004, AIPUK, V36, P47 Perkell J.S., 1998, HDB PHONETIC SCI, P333 PERRIER P, 1992, J SPEECH HEAR RES, V35, P53 Schneider PJ, 2003, GEOMETRIC TOOLS COMP Thompson G.W., 1977, J DENT RES, V56 Vorperian H.K., 2005, J ACOUST SOC AM, V117, P338 Wahba G., 1990, CBMS NSF REG C SER A, V59 Weirich M., 2011, P ISSP, P251 Westbury J.R., 1994, XRAY MICROBEAM SPEEC Yunusova Y, 2009, J SPEECH LANG HEAR R, V52, P547, DOI 10.1044/1092-4388(2008/07-0218) NR 21 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2012 VL 54 IS 8 BP 923 EP 931 DI 10.1016/j.specom.2012.03.006 PG 9 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 961UC UT WOS:000305494000001 ER PT J AU Sun, QH Hirose, K Minematsu, N AF Sun, Qinghua Hirose, Keikichi Minematsu, Nobuaki TI A method for generation of Mandarin F-0 contours based on tone nucleus model and superpositional model SO SPEECH COMMUNICATION LA English DT Article DE Mandarin; Tone; Fundamental frequency contour; Tone nucleus; Phrase component; Tone component ID SPEECH SYSTEM; CHINESE; RULES AB A new method was proposed for synthesizing sentence fundamental frequency (F-0) contours of Mandarin speech. The method is based on representing a sentence logarithmic F-0 contour as a superposition of tone components on phrase components, as in the case of the generation process model (F-0 model). However, the method is not fully depending on the model in that tone components are generated in a corpus-based way by concatenating F-0 patterns predicted for constituting syllables. Furthermore, the prediction is done only for the stable part of syllable tone component, known as tone nucleus. The entire tone components were obtained by concatenating the predicted patterns. Since effect of tone coarticulation is minor for tone nuclei, as compared to conventional methods of handling full syllable F-0 contours, a better prediction is possible especially when the size of training corpus is limited. While tone components are highly language specific, phrase components are assumed to be more language universal: analogy from a control scheme of phrase components developed for a language may applicable for other languages. Also, phrase components covers a wider range (phrase, clause, etc.) 
of speech and is tightly related to higher linguistic information (syntax), and, therefore, concatenation of short F-0 contour fragments predicted in a corpus-based method will not be appropriate. Taking these into consideration, rules similar to Japanese were constructed to control phrase commands, from which phrase components were generated with simple mathematical calculations in the framework of the generation process model. There is a tight relation between phrase and tone components, and, therefore, both components cannot be generated independently. To ensure the correct relation be held in the synthesized F-0 contour, a two-step scheme was developed, where information of generated phrase components was utilized for the prediction of tone components. A listening test was conducted for speech synthesized using F-0 contours generated by the developed method. Synthetic speech sounded highly natural, showing the validity of the method. Furthermore, it was shown through an experiment of word emphasis that flexible F-0 control was possible by the proposed method. (C) 2012 Elsevier B.V. All rights reserved. C1 [Sun, Qinghua] Univ Tokyo, Grad Sch Engn, Tokyo 1138654, Japan. [Hirose, Keikichi; Minematsu, Nobuaki] Univ Tokyo, Grad Sch Informat Sci & Technol, Tokyo 1138654, Japan. RP Sun, QH (reprint author), Univ Tokyo, Grad Sch Engn, Tokyo 1138654, Japan. EM qinghua@gavo.t.u-tokyo.ac.jp; hirose@gavo.t.u-tokyo.ac.jp; mine@gavo.t.u-tokyo.ac.jp CR Chao Y. R., 1968, GRAMMAR APOKEN CHINE, P1 Chen SH, 1998, IEEE T SPEECH AUDI P, V6, P226 Fujisaka H., 1984, Journal of the Acoustical Society of Japan (E), V5 Gu W., 2006, P SPEECH PROSODY, P561 Gu W., 2005, P EUR LISB PORT, P1825 Gu WT, 2004, IEICE T INF SYST, VE87D, P1079 Hirose K., 1986, P IEEE IECEJ ASJ ICA, P2415 HIROSE K, 1993, IEICE T FUND ELECTR, VE76A, P1971 LEE LS, 1989, IEEE T ACOUST SPEECH, V37, P1309 Lee LS, 1993, IEEE T SPEECH AUDI P, V1, P287, DOI 10.1109/89.232612 MOULINES E, 1990, SPEECH COMMUN, V9, P453, DOI 10.1016/0167-6393(90)90021-Z Sun Q., 2008, P INT C SPE IN PRESS Sun Q., 2006, P INT C SPEECH PROS, P561 Sun Q., 2008, P INT WORKSH NONL CI, P112 Sun Q., 2007, P ISCA WORKSH SPEECH, P154 Sun Q., 2005, P INT, P3265 Tao J., 2002, P INT C SPEECH LANG, P2097 Tokuda K., 1997, P IEEE ICASSP, P229 Wang R., 2006, ADV CHINESE SPOKEN L Zhang JS, 2004, SPEECH COMMUN, V42, P447, DOI 10.1016/j.specom.2004.01.001 NR 20 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2012 VL 54 IS 8 BP 932 EP 945 DI 10.1016/j.specom.2012.03.005 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 961UC UT WOS:000305494000002 ER PT J AU Mok, PPK AF Mok, Peggy P. K. TI Effects of consonant cluster syllabification on vowel-to-vowel coarticulation in English SO SPEECH COMMUNICATION LA English DT Article DE Vowel-to-vowel coarticulation; Syllable structure; English; Consonant clusters ID SYLLABLE STRUCTURE; ACOUSTIC ANALYSIS; FINAL CONSONANTS; VCV UTTERANCES; STOP PLACE; SEQUENCES; PERCEPTION; PATTERNS; CATALAN; ARTICULATION AB This paper investigates how different syllable affiliations of intervocalic /st/ cluster affect vowel-to-vowel coarticulation in English. Very few studies have examined the effect of syllable structure on vowel-to-vowel coarticulation. Previous studies show that onset and coda consonants differ acoustically, articulatorily, perceptually and typologically. 
Onsets are stronger, more stable, more common and more distinguishable than codas. Since codas are less constrained, it was hypothesized that coda /st./ would allow more vowel-to-vowel coarticulation than onset /st/. Three vowels (/i a u/) were used to form the target sequences with the /st/ cluster in English: onset /CV.stVC/, heterosyllabic /CVs.tVC/, coda /CVst.VC/. F1 and F2 frequencies at vowel edges and the durations of the first vowel and the intervocalic consonants were measured from six speakers of Standard Southern British English. Factors included in the experiment are: Direction, Syllable Form, Target, Context. Results show that coda /st./ allows more vowel-to-vowel coarticulation than onset /.st/, and heterosyllabic /s.t/ is the most resistant among the Syllable Forms. Vowels in heterosyllabic /s.t/ are more extreme than in the other two Syllable Forms in the carryover direction. These findings suggest that vowel-to-vowel coarticulation is sensitive to different syllable structure with the same segmental composition. Possible factors contributing to the observed patterns are discussed. (C) 2012 Elsevier B.V. All rights reserved. C1 Chinese Univ Hong Kong, Dept Linguist & Modern Languages, Shatin, Hong Kong, Peoples R China. RP Mok, PPK (reprint author), Chinese Univ Hong Kong, Dept Linguist & Modern Languages, Leung Kau Kui Bldg, Shatin, Hong Kong, Peoples R China. EM peggymok@cuhk.edu.hk FU Sir Edward Youde Memorial Fellowship for Overseas Studies from Hong Kong; Overseas Research Studentship from the United Kingdom FX The author would like to thank Sarah Hawkins for help and guidance throughout the project. Thanks also go to Rachel Smith and Francis Nolan for helpful discussion. She thanks the Editor and two anonymous reviewers for their constructive comments. This research was supported by the Sir Edward Youde Memorial Fellowship for Overseas Studies from Hong Kong and an Overseas Research Studentship from the United Kingdom. CR Adank P, 2004, J ACOUST SOC AM, V116, P3099, DOI 10.1121/1.1795335 ANDERSON S, 1994, J PHONETICS, V22, P283 Beckman M. E., 1986, PHONOLOGY YB, V3, P255, DOI 10.1017/S095267570000066X Beddor PS, 2002, J PHONETICS, V30, P591, DOI 10.1006/jpho.2002.0177 Beddor P.S., 1995, 13 INT C PHON SCI IC, P44 Bell A., 1978, SYLLABLES SEGMENTS, P3 BOUCHER VJ, 1988, J PHONETICS, V16, P299 Browman C. 
P., 1995, PRODUCING SPEECH CON, P19 BROWMAN CP, 1988, PHONETICA, V45, P140 Byrd D, 1996, J PHONETICS, V24, P209, DOI 10.1006/jpho.1996.0012 BYRD D, 1995, PHONETICA, V52, P285 Byrd D, 1996, J PHONETICS, V24, P263, DOI 10.1006/jpho.1996.0014 Cho T, 2001, PHONETICA, V58, P129, DOI 10.1159/000056196 Cho TH, 2004, J PHONETICS, V32, P141, DOI 10.1016/S0095-4470(03)00043-3 CHRISTIE WM, 1974, J ACOUST SOC AM, V55, P819, DOI 10.1121/1.1914606 Culter A, 1987, COMPUT SPEECH LANG, V2, P133 De Jong KJ, 2001, LANG SPEECH, V44, P197 de Jong KJ, 2004, LANG SPEECH, V47, P241 Fabricius AH, 2009, LANG VAR CHANGE, V21, P413, DOI 10.1017/S0954394509990160 Ferragne E, 2010, J INT PHON ASSOC, V40, P1, DOI 10.1017/S0025100309990247 Fowler C.A., 1981, J SPEECH HEAR RES, V46, P127 Gick B, 2006, J PHONETICS, V34, P49, DOI 10.1016/j.wocn.2005.03.005 Greenberg J., 1978, PHONOLOGY, V2, P243 Haggard M., 1973, J PHONETICS, V1, P9 Haggard M., 1973, J PHONETICS, V1, P111 Hawkins S., 2005, J INT PHON ASSOC, V35, P183, DOI 10.1017/S0025100305002124 Hertrich I, 1995, LANG SPEECH, V38, P159 Honorof D.N., 1995, 13 INT C PHON SCI IC, P552 Hosung Nam, 2003, 15 INT C PHON SCI IC, P2253 Keating P., 2003, PAPERS LAB PHONOLOGY, P143 Kochetov A., 2006, PAPERS LAB PHONOLOGY, V8, P565 Krakow RA, 1999, J PHONETICS, V27, P23, DOI 10.1006/jpho.1999.0089 Lehiste I., 1970, SUPRASEGMENTALS Low E., 2000, LANG SPEECH, V43, P377 Low E.L., 1999, LANG SPEECH, V42, P39 MACCHI M, 1988, PHONETICA, V45, P109 MacNeilage PF, 2000, SCIENCE, V288, P527, DOI 10.1126/science.288.5465.527 Maddieson I., 1984, PATTERNS SOUNDS Magen HS, 1997, J PHONETICS, V25, P187, DOI 10.1006/jpho.1996.0041 MANUEL SY, 1990, J ACOUST SOC AM, V88, P1286, DOI 10.1121/1.399705 Modarresi G, 2004, J PHONETICS, V32, P291, DOI 10.1016/j.wocn.2003.11.002 Modarresi G, 2004, PHONETICA, V61, P2, DOI 10.1159/000078660 Mok P., LANGUAGE SP IN PRESS Mok PKP, 2010, J ACOUST SOC AM, V128, P1346, DOI 10.1121/1.3466859 Mok PPK, 2011, LANG SPEECH, V54, P527, DOI 10.1177/0023830911404961 OHMAN SEG, 1966, J ACOUST SOC AM, V39, P151 PICKETT JM, 1995, PHONETICA, V52, P1 QUENE H, 1992, J PHONETICS, V20, P331 Recasens D, 2009, J ACOUST SOC AM, V125, P2288, DOI 10.1121/1.3089222 RECASENS D, 1989, SPEECH COMMUN, V8, P293, DOI 10.1016/0167-6393(89)90012-5 Recasens D, 2001, J PHONETICS, V29, P273, DOI 10.1006/jpho.2001.0139 Recasens D, 2002, J ACOUST SOC AM, V111, P2828, DOI 10.1121/1.1479146 RECASENS D, 1987, J PHONETICS, V15, P299 Recasens D, 1997, J ACOUST SOC AM, V102, P544, DOI 10.1121/1.419727 Redford MA, 1999, J ACOUST SOC AM, V106, P1555, DOI 10.1121/1.427152 SAMUEL AG, 1989, PERCEPT PSYCHOPHYS, V45, P485, DOI 10.3758/BF03208055 Schwab S, 2008, PHONETICA, V65, P173, DOI 10.1159/000144078 Smith Caroline L., 1995, PAPERS LAB PHONOLOGY, P205 SPROAT R, 1993, J PHONETICS, V21, P291 Stetson R.H., 1988, RETROSPECTIVE EDITIO, P235 Sussman HM, 1997, J ACOUST SOC AM, V101, P2826, DOI 10.1121/1.418567 Tuller B., 1990, ATTENTION PERFORM, P429 TULLER B, 1991, J SPEECH HEAR RES, V34, P501 Turk AE, 2000, J PHONETICS, V28, P397, DOI 10.1006/jpho.2000.0123 Vihman M. M., 1996, PHONOLOGICAL DEV Watt D., 2010, SOCIOPHONETICS STUDE, P107 Watt D., 2002, LEEDS WORKING PAPERS, V9, P159 Wells J., 1990, STUDIES PRONUNCIATIO, P76 ZSIGA EC, 1994, J PHONETICS, V22, P121 NR 69 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD OCT PY 2012 VL 54 IS 8 BP 946 EP 956 DI 10.1016/j.specom.2012.04.001 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 961UC UT WOS:000305494000003 ER PT J AU Li, ZB Zhao, SH Bruhn, S Wang, J Kuang, JM AF Li, Zhongbo Zhao, Shenghui Bruhn, Stefan Wang, Jing Kuang, Jingming TI Comparison and optimization of packet loss recovery methods based on AMR-WB for VoIP SO SPEECH COMMUNICATION LA English DT Article DE VoIP; AMR-WB; Packet loss; FEC; MDC ID IP; NETWORKS; SPEECH; AUDIO AB The AMR-WB codec, which has been standardized for wideband speech conversational applications, has a broad range of potential applications in the migration of wireless and wireline networks towards a single converged IP network. Forward error control (FEC) and multiple description coding (MDC) are two promising techniques to make the transmission robust against packet loss in Voice over IP (VoIP). However, how to achieve the optimal reconstructed speech quality with these methods for AMR-WB under different packet loss rate conditions is still an open problem. In this paper, we compare the performance of various FEC and MDC schemes for the AMR-WB codec both analytically and experimentally. Based on the comparison results, some advantageous configurations of FEC and MDC for the AMR-WB codec are obtained, and hence an optimization system is proposed by selecting the optimal packet loss recovery scheme in accordance with varying network conditions. Subjective AB test results show that the optimization leads to clear improvements in the perceived speech quality in the IP environment. (C) 2012 Elsevier B.V. All rights reserved. C1 [Li, Zhongbo; Zhao, Shenghui; Wang, Jing; Kuang, Jingming] Beijing Inst Technol, Sch Informat & Elect, Beijing 100081, Peoples R China. [Li, Zhongbo] Inst Chinese Elect Equipment Syst Engn Co, Beijing 100141, Peoples R China. [Bruhn, Stefan] Multimedia Technol, Ericsson Res, Stockholm, Sweden. RP Zhao, SH (reprint author), Beijing Inst Technol, Sch Informat & Elect, Beijing 100081, Peoples R China. EM shzhao@bit.edu.cn FU cooperation project between Ericsson and BIT FX The work in this paper is supported by the cooperation project between Ericsson and BIT.
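As a toy illustration of media-specific FEC of the kind compared in the abstract above, the following sketch piggybacks a redundant copy of the previous frame on each packet and falls back to frame repetition when both copies are lost. It is a generic simulation, not one of the AMR-WB configurations evaluated in the paper.

```python
import random

def transmit(frames, loss_rate=0.1, seed=3):
    """Send each frame with a redundant copy of the previous frame; recover a
    lost frame from the next packet if that packet arrives, otherwise conceal
    by repeating the last decoded frame."""
    rng = random.Random(seed)
    received = {}
    for i, frame in enumerate(frames):
        if rng.random() >= loss_rate:                 # packet i arrives
            received[i] = frame                       # primary copy
            if i > 0 and (i - 1) not in received:
                received[i - 1] = frames[i - 1]       # redundant copy repairs the loss
    out, last = [], None
    for i in range(len(frames)):
        last = received.get(i, last)                  # concealment fallback
        out.append(last)
    return out

frames = [f"frame{i}" for i in range(20)]
print(transmit(frames, loss_rate=0.3))
```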
CR Altman E, 2002, COMPUT NETW, V39, P185, DOI 10.1016/S1389-1286(01)00309-7 [Anonymous], 2006, DUAL RAT SPEECH COD [Anonymous], 2001, 26191 3GPP TS [Anonymous], 2009, 181005 ETSI TS [Anonymous], 2001, 2690 3GPP TS [Anonymous], 1988, PULS COD MOD VOIC FR [Anonymous], 2001, 26190 3GPP TS [Anonymous], 2003, WID COD SPEECH 16 KB [Anonymous], 1992, COD SPEECH 16 KBITS [Anonymous], 1990, 40 32 24 16 KBITS AD Apostolopoulos J, 2002, IEEE INFOCOM SER, P1736 Bastiaan Kleijn W, 2006, IEEE T COMMUN, V54 De Martin J.C, 2001, P IEEE INT C AC SPEE, P753 Degermark M., 1999, 2507 RFC DONG H, 2004, IEEE ICASSP 04 QUEB, P277 ELGAMAL AA, 1982, IEEE T INFORM THEORY, V28, P851 Gunnar Karlsson, IEEE C ICC2006 IST T, P1002 International Telecommunication Union, 2007, COD SPEECH 8 KBITS U ITU- T, 2005, WID EXT REC P 862 AS ITU-T Rec, 1988, 7 KHZ AUD COD 64 KBI ITU-T Rec, 2005, LOW COMPL COD 24 32 Jingming Kuang, 2007, IEEE C ICITA2007 HAR Johansson I., 2002, IEEE SPEECH COD WORK Morinaga T., 2002, ACOUST SPEECH SIG PR OZAROW L, 1980, AT&T TECH J, V59, P1909 Perkins C, 1998, IEEE NETWORK, V12, P40, DOI 10.1109/65.730750 PETR DW, 1989, IEEE J SEL AREA COMM, V7, P644, DOI 10.1109/49.32328 Petracca M., 2004, 1 INT S CONTR COMM S, P587, DOI 10.1109/ISCCSP.2004.1296457 Podolsky M, 1998, IEEE INFOCOM SER, P505 Schulzrinneh, 1996, RFC1889 Wah BW, 2005, IEEE T MULTIMEDIA, V7, P167, DOI 10.1109/TMM.2004.840593 NR 31 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2012 VL 54 IS 8 BP 957 EP 974 DI 10.1016/j.specom.2012.04.003 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 961UC UT WOS:000305494000004 ER PT J AU Wang, L Chen, H Li, S Meng, HM AF Wang, Lan Chen, Hui Li, Sheng Meng, Helen M. TI Phoneme-level articulatory animation in pronunciation training SO SPEECH COMMUNICATION LA English DT Article DE Phoneme-based articulatory models; HMM-based visual synthesis; 3D articulatory animation ID SPEECH SYNTHESIS; VISIBLE SPEECH; MODEL; HMM; MRI AB Speech visualization is extended to use animated talking heads for computer assisted pronunciation training. In this paper, we design a data-driven 3D talking head system for articulatory animations with synthesized articulator dynamics at the phoneme level. A database of AG500 EMA-recordings of three-dimensional articulatory movements is proposed to explore the distinctions of producing the sounds. Visual synthesis methods are then investigated, including a phoneme-based articulatory model with a modified blending method. A commonly used HMM-based synthesis is also performed with a Maximum Likelihood Parameter Generation algorithm for smoothing. The 3D articulators are then controlled by synthesized articulatory movements, to illustrate both internal and external motions. Experimental results have shown the performances of visual synthesis methods by root mean square errors. A perception test is then presented to evaluate the 3D animations, where a word identification accuracy is 91.6% among 286 tests, and an average realism score is 3.5 (1 = bad to 5 = excellent). (C) 2012 Elsevier B.V. All rights reserved. C1 [Wang, Lan; Li, Sheng; Meng, Helen M.] Chinese Acad Sci, Shenzhen Inst Adv Technol, Beijing 100864, Peoples R China. [Meng, Helen M.] Chinese Univ Hong Kong, Hong Kong, Hong Kong, Peoples R China. [Chen, Hui] Chinese Acad Sci, Inst Software, Beijing 100864, Peoples R China. 
RP Wang, L (reprint author), Chinese Acad Sci, Shenzhen Inst Adv Technol, Beijing 100864, Peoples R China. EM lan.wang@siat.ac.cn; chenhui@iscas.ac.cn; sheng.li@siat.ac.cn; hmmeng@se.cuhk.edu.hk FU National Nature Science Foundation of China [NSFC 61135003, NSFC 90920002]; National Fundamental Research Grant of Science and Technology (973 Project) [2009CB320804]; Knowledge Innovation Program of the Chinese Academy of Sciences [KJCXZ-YW-617] FX Our work is supported by National Nature Science Foundation of China (NSFC 61135003, NSFC 90920002), National Fundamental Research Grant of Science and Technology (973 Project: 2009CB320804), and The Knowledge Innovation Program of the Chinese Academy of Sciences (KJCXZ-YW-617). CR Badin P., 2008, P 5 C ART MOT DEF OB, P132 Badin P, 2010, SPEECH COMMUN, V52, P493, DOI 10.1016/j.specom.2010.03.002 Blackburn S.C., 2000, J ACOUST SOC AM, V107, P659 Chen H, 2010, VISUAL COMPUT, V26, P477, DOI 10.1007/s00371-010-0434-1 Cohen M. M., 1993, Models and Techniques in Computer Animation Dang J.W., 2005, P INT LISB, P1025 Deng A., 2008, COMPUTER GRAPHICS, V27, P2096 Engwall O, 2003, SPEECH COMMUN, V41, P303, DOI [10.1016/S0167-6393(02)00132-2, 10.1016/S0167-6393(03)00132-2] Fagel S, 2004, SPEECH COMMUN, V44, P141, DOI 10.1016/j.specom.2004.10.006 Grauwinkel K., 2007, P INT C PHON SCI 200, P2173 GUENTHER FH, 1995, PSYCHOL REV, V102, P594 Hoole P., 2003, P INT C PHONETIC SCI, P265 King S., 2001, FACIAL MODEL ANIMATI Lado R., 1957, LINGUISTICS CULTURES Ling Z.H., 2008, P INT, P573 Ma JY, 2004, COMPUT ANIMAT VIRT W, V15, P485, DOI 10.1002/cav.11 Massaro DW, 2004, J SPEECH LANG HEAR R, V47, P304, DOI 10.1044/1092-4388(2004/025) Murray N., 1993, 4 NCVS, P41 Rathinavelu A., 2007, LECT NOTES COMPUT SC, P786 Serrurier A, 2008, J ACOUST SOC AM, V123, P2335, DOI 10.1121/1.2875111 Tamura M., 1999, P EUR C SPEECH COMM, P959 Tarabalka Y., 2007, P ASSISTH 2007 FRANC, P187 TOKUDA K, 1995, INT CONF ACOUST SPEE, P660, DOI 10.1109/ICASSP.1995.479684 Wang L., 2009, P INTERSPEECH 2009, P2247 Wik P., 2008, P FONETIK 2008, P57 Wik P, 2009, SPEECH COMMUN, V51, P1024, DOI 10.1016/j.specom.2009.05.006 Youssef B., 2009, P INT, P2255 Zhang L, 2008, IEEE SIGNAL PROC LET, V15, P245, DOI 10.1109/LSP.2008.917004 Zierdt A., 2010, SPEECH MOTOR CONTROL, P331 NR 29 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2012 VL 54 IS 7 BP 845 EP 856 DI 10.1016/j.specom.2012.02.003 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 961VA UT WOS:000305496400001 ER PT J AU Hashimoto, K Yamagishi, J Byrne, W King, S Tokuda, K AF Hashimoto, Kei Yamagishi, Junichi Byrne, William King, Simon Tokuda, Keiichi TI Impacts of machine translation and speech synthesis on speech-to-speech translation SO SPEECH COMMUNICATION LA English DT Article DE Speech-to-speech translation; Machine translation; Speech synthesis; Subjective evaluation AB This paper analyzes the impacts of machine translation and speech synthesis on speech-to-speech translation systems. A typical speech-to-speech translation system consists of three components: speech recognition, machine translation and speech synthesis. Many techniques have been proposed for integration of speech recognition and machine translation. However, corresponding techniques have not yet been considered for speech synthesis. 
The focus of the current work is machine translation and speech synthesis, and we present a subjective evaluation designed to analyze their impact on speech-to-speech translation. The results of these analyses show that the naturalness and intelligibility of the synthesized speech are strongly affected by the fluency of the translated sentences. In addition, several features were found to correlate well with the average fluency of the translated sentences and the average naturalness of the synthesized speech. (C) 2012 Elsevier B.V. All rights reserved. C1 [Hashimoto, Kei; Tokuda, Keiichi] Nagoya Inst Technol, Dept Comp Sci & Engn, Nagoya, Aichi, Japan. [Yamagishi, Junichi; King, Simon] Univ Edinburgh, Ctr Speech Technol Res, Edinburgh, Midlothian, Scotland. [Byrne, William] Univ Cambridge, Dept Engn, Cambridge CB2 1PZ, England. RP Hashimoto, K (reprint author), Nagoya Inst Technol, Dept Comp Sci & Engn, Nagoya, Aichi, Japan. EM bonanza@sp.nitech.ac.jp; jyamagis@inf.ed.ac.uk; bill.byrne@eng.cam.ac.uk; Simon.King@ed.ac.uk; tokuda@nitech.ac.jp FU European Community's Seventh Framework Programme [213845]; Strategic Information and Communications R&D Promotion Programme (SCOPE) of the Ministry of Internal Affairs and Communication, Japan; JSPS (Japan Society for the Promotion of Science) FX The research leading to these results was partly funded from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement 213845 (the EMIME project http://www.emime.org) and the Strategic Information and Communications R&D Promotion Programme (SCOPE) of the Ministry of Internal Affairs and Communication, Japan. A part of this research was supported by JSPS (Japan Society for the Promotion of Science) Research Fellowships for Young Scientists. CR Boidin C., 2009, P INT SPEC SESS MACH, P2487 Bulyko I, 2002, COMPUT SPEECH LANG, V16, P533, DOI 10.1016/S0885-2308(02)00023-2 Byrne William, 2009, P HLT NAACL BOULD CO, P433, DOI 10.3115/1620754.1620817 Callison-Burch C., 2010, P NAACL HLT 2010 WOR, P1 Callison-Burch C., 2009, P 2009 C EMP METH NA, P286 Casacuberta F, 2008, IEEE SIGNAL PROC MAG, V25, P80, DOI 10.1109/MSP.2008.917989 Chae J., 2009, P 12 C EUR CHAPT ASS, P139, DOI 10.3115/1609067.1609082 Fort K, 2011, COMPUT LINGUIST, V37, P413, DOI 10.1162/COLI_a_00057 Gispert A., 2009, P NAACL HLT 2009, P73 Heilman M., 2010, P NAACL HLT 2010 WOR, P35 Kawahara H, 1999, SPEECH COMMUN, V27, P187, DOI 10.1016/S0167-6393(98)00085-5 KNESER R, 1995, INT CONF ACOUST SPEE, P181, DOI 10.1109/ICASSP.1995.479394 Koehn P, 2005, P 10 MACH TRANSL SUM, P79 Kunath S.A., 2010, P NAACL HLT 2010 WOR, P168 Malfait L, 2006, IEEE T AUDIO SPEECH, V14, P1924, DOI 10.1109/TASL.2006.883177 Mutton A., 2007, P 45 ANN M ASS COMP, P344 NAKATSU C, 2006, P 44 ANN M ASS COMP, P1113, DOI 10.3115/1220175.1220315 Ney H., 1999, P IEEE INT C AC SPEE, P1149 Noth E, 2000, IEEE T SPEECH AUDI P, V8, P519, DOI 10.1109/89.861370 Parlikar A., 2010, P INT 2010, P194 Snow R., 2008, P C EMP METH NAT LAN, P254, DOI 10.3115/1613715.1613751 Stolcke A., 2002, P INT C SPOK LANG PR, P901 Tokuda K., 2000, P ICASSP, P936 TOKUDA K, 1999, ACOUST SPEECH SIG PR, P229 Vidal E., 1997, P INT C AC SPEECH SI, P111 Wan S., 2005, P 10 EUR WORKSH NAT White J. 
S., 1994, P 1 C ASS MACH TRANS, P193 Wolters M., 2010, P SSW7, P136 Wu Y.J., 2009, P INTERSPEECH, P528 Yamada S., 2005, P MT SUMM, P55 Yoshimura T, 1999, P EUR, P2347 Zen H., 2004, P ICSLP, P1185 NR 32 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2012 VL 54 IS 7 BP 857 EP 866 DI 10.1016/j.specom.2012.02.004 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 961VA UT WOS:000305496400002 ER PT J AU Ikbal, S Misra, H Hermansky, H Magimai-Doss, M AF Ikbal, Shajith Misra, Hemant Hermansky, Hynek Magimai-Doss, Mathew TI Phase Auto Correlation (PAC) features for noise robust speech recognition SO SPEECH COMMUNICATION LA English DT Article DE Noise robust speech recognition; Phase AutoCorrelation (PAC); Energy normalization; Inverse cosine; Inverse entropy; MLP feature combination ID WORD RECOGNITION AB In this paper, we introduce a new class of noise robust features derived from an alternative measure of autocorrelation representing the phase variation of speech signal frame over time. These features, referred to as Phase AutoCorrelation (PAC) features include PAC-spectrum and PAC-MFCC, among others. In traditional autocorrelation, correlation between two time delayed signal vectors is computed as their dot product. Whereas in PAC, angle between the vectors in the signal vector space is used to compute the correlation. PAC features are more noise robust because the angle is typically less affected by noise than the dot product. However, the use of angle as correlation estimate makes the PAC features inferior in clean speech. In this paper, we circumvent this problem by introducing another set of features where complementary information among the PAC features and the traditional features are combined adaptively to retain the best of both. An entropy based feature combination method in a multi-layer perceptron (MLP) based multi-stream framework is used to derive an adaptively combined representation of the component feature streams. An evaluation of the combined features using OGI Numbers95 database and Aurora-2 database under various noise conditions and noise levels show significant improvements in recognition accuracies in clean as well as noisy conditions. (C) 2012 Elsevier B.V. All rights reserved. C1 [Ikbal, Shajith] IBM Res Corp, Bangalore, Karnataka, India. [Misra, Hemant] Philips Res, Bangalore, Karnataka, India. [Hermansky, Hynek] Johns Hopkins Univ, Baltimore, MD USA. [Magimai-Doss, Mathew] Idiap Res Inst, Martigny, Switzerland. RP Ikbal, S (reprint author), IBM Res Corp, Bangalore, Karnataka, India. EM shajmoha@in.ibm.com; hemant.misra@philips.com; hynek@jhu.edu; mathew@idiap.ch FU Swiss National Science Foundation [MULTI: FN 2000-068231.02/1]; National Centre of Competence in Research (NCCR); DARPA FX The authors thank the Swiss National Science Foundation for the support of their work during their stay at Idiap Research Institute, through grant MULTI: FN 2000-068231.02/1 and through National Centre of Competence in Research (NCCR) on "Interactive Multimodal Information Management (IM2)". The authors also thank DARPA for supporting through the EARS (Effective, Affordable, Reusable Speech-to-Text) project. 
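The contrast described in the abstract above, dot-product autocorrelation versus angle-based Phase AutoCorrelation, can be sketched directly. The truncated-vector framing below is an assumption, and the real front-end would go on to derive PAC-spectrum and PAC-MFCC features from these values.

```python
import numpy as np

def correlations(frame, max_lag):
    """For each lag k, compute the ordinary autocorrelation (dot product of
    the frame with its delayed copy) and the PAC value (angle between the
    two vectors in signal space)."""
    acf, pac = [], []
    for k in range(max_lag + 1):
        x0 = frame[: len(frame) - k]
        xk = frame[k:]
        dot = float(np.dot(x0, xk))
        cos = dot / (np.linalg.norm(x0) * np.linalg.norm(xk) + 1e-12)
        acf.append(dot)                                     # conventional autocorrelation
        pac.append(np.arccos(np.clip(cos, -1.0, 1.0)))      # angle-based (PAC)
    return np.array(acf), np.array(pac)

# A noisy sinusoid as a stand-in speech frame.
n = np.arange(256)
frame = np.sin(2 * np.pi * 0.05 * n) + 0.3 * np.random.default_rng(4).normal(size=n.size)
acf, pac = correlations(frame, max_lag=32)
print(acf[:4].round(2), pac[:4].round(3))
```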
CR Alexandre P., 1993, P IEEE ICASSP 93, P99 BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 Bourlard H., 1996, CC AI J SPECIAL ISSU Bourlard H., 1996, P EUR SIGN PROC C TR, P1579 Bourlard H.A., 1993, CONNECTIONIST SPEECH Cole R., 1995, P EUR C SPEECH COMM, P821 Cooke M, 1997, INT CONF ACOUST SPEE, P863, DOI 10.1109/ICASSP.1997.596072 DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 Ellis D., 2001, P IEEE ICASSP 01 SAL FURUI S, 1986, IEEE T ACOUST SPEECH, V34, P52, DOI 10.1109/TASSP.1986.1164788 Furui S., 1992, P ESCA WORKSH SPEECH, P31 Gales MJF, 1996, IEEE T SPEECH AUDI P, V4, P352, DOI 10.1109/89.536929 Hagen A., 2001, THESIS EPFL LAUSANNE Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616 Hermansky H., 2000, P IEEE ICASSP 00 IST HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423 Hernando J., 1997, IEEE T SPEECH AUDIO, V5 Hirsch H. G., 2000, ISCA ITRW ASR2000 AU Ikbal S., 2003, P IEEE ASRU 03 ST TH IKBAL S, 2003, ACOUST SPEECH SIG PR, P133 Ikbal S., 2004, P INT 04 JEJ ISL KOR Ikbal S., 2008, P INT 08 BRISB AUSTR Ikbal S., 2004, P IEEE ICASSP 04 MON Ikbal S., 2004, THESIS EPFL LAUSANNE Kalgoankar K., 2009, P ASRU 09 TRENT IT Kim W, 2011, SPEECH COMMUN, V53, P1, DOI 10.1016/j.specom.2010.08.005 Klatt D., 1986, P IEEE ICASSP 86, P741 Legetter C.J., 1995, P ARPA WORKSH SPOK L, P110 Li J., 2007, P ASRU 07 KYOT JAP LIM JS, 1979, IEEE T ACOUST SPEECH, V27, P223 Lockwood P., 1992, P ICASSP SAN FRANC C, P265, DOI 10.1109/ICASSP.1992.225921 Mansour D., 1988, P ICASSP 88, P36 MISRA H, 2003, P ICASSP HONG KONG, V2, P741 Nolazco-Flores J.A., 1994, P ICASSP AD AUSTR, P409 Okawa S., 1998, P IEEE ICASSP 98 SEA Oppenheim A. V., 1975, DIGITAL SIGNAL PROCE Rabiner L, 1993, FUNDAMENTALS SPEECH Raj B., 1998, P ICSLP 98 SYDN AUST Seltzer ML, 2004, SPEECH COMMUN, V43, P379, DOI 10.1016/j.specom.2004.03.006 Shannon BJ, 2006, SPEECH COMMUN, V48, P1458, DOI 10.1016/j.specom.2006.08.003 Sharma S., 1999, THESIS OGI PORTLAND Stephenson TA, 2004, IEEE T SPEECH AUDI P, V12, P189, DOI 10.1109/TSA.2003.822631 Varga A., 1992, NOISEX 92 STUDY AFFE Varga A., 1989, P EUR 89, P167 Varga A.P., 1990, P ICASSP, P845 Young S., 1992, HTK BOOK VERSION 3 2 Zhu Q., 2004, P INT 04 JEJ ISL KOR NR 47 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2012 VL 54 IS 7 BP 867 EP 880 DI 10.1016/j.specom.2012.02.005 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 961VA UT WOS:000305496400003 ER PT J AU Flynn, R Jones, E AF Flynn, Ronan Jones, Edward TI Reducing bandwidth for robust distributed speech recognition in conditions of packet loss SO SPEECH COMMUNICATION LA English DT Article DE Robust distributed speech recognition; Auditory front-end; Wavelet; Bandwidth reduction; Packet loss ID NETWORKS; CHANNELS AB This paper proposes a method to reduce the bandwidth requirements for a distributed speech recognition (DSR) system, with minimal impact on recognition performance. Bandwidth reduction is achieved by applying a wavelet decomposition to feature vectors extracted from speech using an auditory-based front-end. The resulting vectors undergo vector quantisation and are then combined in pairs for transmission over a statistically modeled channel that is subject to packet burst loss. 
Recognition performance is evaluated in the presence of both background noise and packet loss. When there is no packet loss, results show that the proposed method can reduce the bandwidth required to 50% of the bandwidth required for the system in which the proposed method is not used, without compromising recognition performance. The bandwidth can be further reduced to 25% of the baseline for a slight decrease in recognition performance. Furthermore, in the presence of packet loss, the proposed method for bandwidth reduction, when combined with a suitable redundancy scheme, gives a 29% reduction in bandwidth, when compared to the recognition performance of an established packet loss mitigation technique. (C) 2012 Elsevier B.V. All rights reserved. C1 [Flynn, Ronan] Athlone Inst Technol, Sch Engn, Athlone, Ireland. [Jones, Edward] Natl Univ Ireland, Coll Engn & Informat, Galway, Ireland. RP Flynn, R (reprint author), Athlone Inst Technol, Sch Engn, Athlone, Ireland. EM rflynn@ait.ie; edward.jones@nuigalway.ie CR Agarwal A., 1999, P ASRU, P67 [Anonymous], HTK SPEECH REC TOOLK Bernard A, 2002, IEEE T SPEECH AUDI P, V10, P570, DOI 10.1109/TSA.2002.808141 Boulis C, 2002, IEEE T SPEECH AUDI P, V10, P580, DOI 10.1109/TSA.2002.804532 El Safty S, 2009, INT J ELEC POWER, V31, P604, DOI 10.1016/j.ijepes.2009.06.003 Ephraim Y., 2006, ELECT ENG HDB, P15 ETSI, 2003, 201108 ETSI ES ETSI, 2007, 202050 ETSI ES Flynn R., 2006, P IET IR SIGN SYST C, P111 Flynn R, 2008, SPEECH COMMUN, V50, P797, DOI 10.1016/j.specom.2008.05.004 Gallardo-Antolin G., 2005, IEEE T SPEECH AUDIO, V13, P1186 Gomez AM, 2009, SPEECH COMMUN, V51, P390, DOI 10.1016/j.specom.2008.12.002 Gomez AM, 2006, IEEE T MULTIMEDIA, V8, P1228, DOI 10.1109/TMM.2006.884611 Hirsch H. G., 2000, P ISCA ITRW ASR2000, P181 Hsu W.H., 2004, P IEEE INT C AC SPEE, V1, P69 James A, 2006, SPEECH COMMUN, V48, P1402, DOI 10.1016/j.specom.2006.07.005 James A.B., 2004, P 2 COST278 ISCA TUT James A.B., 2004, P IEEE INT C AC SPEE, V1, P853 James A.B., 2005, P INT 2005 LISB PORT, P2857 Jayasree T, 2009, INT J COMP ELECT ENG, V1, P590 Karapantazis S, 2009, COMPUT NETW, V53, P2050, DOI 10.1016/j.comnet.2009.03.010 Li Q., 2000, P 6 INT C SPOK LANG, V3, P51 LINDE Y, 1980, IEEE T COMMUN, V28, P84, DOI 10.1109/TCOM.1980.1094577 Macho D., 2001, P IEEE INT C AC SPEE, V1, P305 Macho D., 2002, P ICSLP, P17 MALLAT SG, 1989, IEEE T PATTERN ANAL, V11, P674, DOI 10.1109/34.192463 Milner B.P., 2004, P 8 INT C SPOK LANG, P1549 Perkins C, 1998, IEEE NETWORK, V12, P40, DOI 10.1109/65.730750 Quercia D., 2002, P IEEE INT C AC SPEE, V4, P3820 Russo M, 2005, CONSUM COMM NETWORK, P493 Tan ZH, 2010, IEEE J-STSP, V4, P798, DOI 10.1109/JSTSP.2010.2057192 Tan ZH, 2005, SPEECH COMMUN, V47, P220, DOI 10.1016/j.specom.2005.05.007 Tan Z.-H., 2005, P IEEE MMSP NOV, P1 Tan ZH, 2007, IEEE T AUDIO SPEECH, V15, P1391, DOI 10.1109/TASL.2006.889799 Xie Q., 2005, 202050 ETSI ES Xie Q., 2003, 210108 ETSI ES NR 36 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
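A rough sketch of the bandwidth-reduction step described in the abstract above: wavelet-decompose a front-end feature vector and discard small coefficients. The thresholding stands in for the trained vector quantiser used in the actual system, and the wavelet, decomposition level, and vector size are assumptions.

```python
import numpy as np
import pywt  # PyWavelets

def compress_feature_vector(vec, wavelet="db4", level=2, keep=0.5):
    """Wavelet-decompose one feature vector, keep only the largest-magnitude
    coefficients, and reconstruct (illustrative stand-in for VQ coding)."""
    coeffs = pywt.wavedec(vec, wavelet, level=level)
    flat, slices = pywt.coeffs_to_array(coeffs)
    k = max(1, int(keep * flat.size))
    thresh = np.sort(np.abs(flat))[-k]
    flat_kept = np.where(np.abs(flat) >= thresh, flat, 0.0)
    rec = pywt.waverec(pywt.array_to_coeffs(flat_kept, slices, output_format="wavedec"),
                       wavelet)
    return rec[: vec.size]

vec = np.cos(np.linspace(0, 3 * np.pi, 32))      # toy 32-dim auditory feature vector
rec = compress_feature_vector(vec)
print(np.round(rec - vec, 3))                    # reconstruction error after compression
```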
PD SEP PY 2012 VL 54 IS 7 BP 881 EP 892 DI 10.1016/j.specom.2012.03.001 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 961VA UT WOS:000305496400004 ER PT J AU Smit, T Turckheim, F Mores, R AF Smit, Thorsten Tuerckheim, Friedrich Mores, Robert TI Fast and robust formant detection from LP data SO SPEECH COMMUNICATION LA English DT Article DE Speech recognition; Formant tracking; Speech analysis; Male and female voices ID LINEAR-PREDICTION; SPEECH; EXTRACTION; SPECTRA AB This paper introduces a method for real-time selective root finding from linear prediction (LP) coefficients using a combination of spectral peak picking and complex contour integration (Cl). The proposed method locates roots within predefined areas of the complex z-plane, for instance roots which correspond to formants while other roots are ignored. It includes an approach to limit the search area (SEA) as much as possible. For this purpose, peaks of the group delay function (GDF) serve as pointers. A frequency weighted wGDF will be introduced in which a simple modification enables a parametric emphasis of the GDF spikes to separate merged formants. Thus, a nearly zero defected separation of peaks is possible even when these are very closely spaced. The performance and efficiency of the proposed wGDF-CI method is demonstrated by comparative error-analysis evaluated on a subset of the DARPA TIMIT corpus. (C) 2012 Elsevier B.V. All rights reserved. C1 [Smit, Thorsten; Tuerckheim, Friedrich; Mores, Robert] Univ Appl Sci Hamburg, D-20081 Hamburg, Germany. RP Smit, T (reprint author), Univ Appl Sci Hamburg, Finkenau 35, D-20081 Hamburg, Germany. EM thorsten.smit@mt.haw-hamburg.de; friedrich.tuerckheim@haw-hamburg.de; mores@mt.haw-hamburg.de FU German Federal Ministry of Education and Research (AiF) [1767X07] FX The authors wish to thank the German Federal Ministry of Education and Research (AiF Project No. 1767X07). CR Alciatore D.G., 1995, WINDING NUMBER POINT ATAL BS, 1971, J ACOUST SOC AM, V50, P637, DOI 10.1121/1.1912679 Boersma P., 2005, PRAAT DOING PHONETIC Dellar J.R., 1999, DISCRETE TIME PROCES Deng L., 2006, P IEEE INT C AUD SPE, P60 DUNN HK, 1961, J ACOUST SOC AM, V33, P1737, DOI 10.1121/1.1908558 Fant G., 1960, ACOUSTIC THEORY SPEE FLANAGAN JL, 1964, J ACOUST SOC AM, V36, P1030, DOI 10.1121/1.2143268 Garofolo J. S., 1993, DARPA TIMIT ACOUSTIC Hanson HM, 1994, IEEE T SPEECH AUDI P, V2, P436, DOI 10.1109/89.294358 IEEE, P INT C AC SPEECH SI, V3, P1381 International Speech Communication Association (ISCA), EUR 95, V1 Itakura F., 1975, J ACOUST SOC AM, V57, P35 Kim C., 2006, EURASIP J APPL SIG P, V2006, P1 Kim HK, 1999, IEEE T SPEECH AUDI P, V7, P87 Knuth D. E., 1998, ART COMPUTER PROGRAM, V2 Kuwarabara H., 1995, SPEECH COMMUN, V16, P365 Markel JD, 1976, LINEAR PREDICTION SP MCCANDLE.SS, 1974, IEEE T ACOUST SPEECH, VSP22, P135, DOI 10.1109/TASSP.1974.1162559 MURTHY HA, 1989, ELECTRON LETT, V25, P1609, DOI 10.1049/el:19891080 Oppenheim A. 
V., 1975, DIGITAL SIGNAL PROCE Peterson G., 1951, J ACOUST SOC AM, V24, P1441 Pfitzinger H., 2005, ZAS PAPERS LINGUISTI, V40, P133 REDDY NS, 1984, IEEE T ACOUST SPEECH, V32, P1136, DOI 10.1109/TASSP.1984.1164456 Rudin Walter, 1974, REAL COMPLEX ANAL Sandler M, 1991, IEE P, V9 Schafer R.W., 1970, J ACOUST SOC AM, V47, P637 Schleicher D, 2002, ERGOD THEOR DYN SYST, V22, P935, DOI 10.1017/S0143385702000482 Sjolander K., 2000, P INT C SPOK LANG PR Snell RC, 1993, IEEE T SPEECH AUDI P, V1, P129, DOI 10.1109/89.222882 Snell R.C., 1983, INVESTIGATION SPEAKE Stevens K.N., 2000, ACOUSTIC PHONETICS Talkin D., 1987, J ACOUST SOC AM S1, VS1, P55 Ueda Yuichi, 2007, Acoustical Science and Technology, V28, DOI 10.1250/ast.28.271 Welling L, 1998, IEEE T SPEECH AUDI P, V6, P36, DOI 10.1109/89.650308 Williams C. S., 1986, DESIGNING DIGITAL FI Wong D., 1980, IEEE T ACOUST SPEECH, V80, P263 NR 37 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2012 VL 54 IS 7 BP 893 EP 902 DI 10.1016/j.specom.2012.03.002 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 961VA UT WOS:000305496400005 ER PT J AU Hassan, A Damper, RI AF Hassan, A. Damper, R. I. TI Classification of emotional speech using 3DEC hierarchical classifier SO SPEECH COMMUNICATION LA English DT Article DE Speech processing; Emotion recognition; Valence-arousal model; Multiclass support vector machines ID SYNTHETIC SPEECH; RECOGNITION; FEATURES; SIMULATION; DATABASES AB The recognition of emotion from speech acoustics is an important problem in human machine interaction, with many potential applications. In this paper, we first compare four ways to extend binary support vector machines (SVMs) to multiclass classification for recognising emotions from speech-namely two standard SVM schemes (one-versus-one and one-versus-rest) and two other methods (DAG and UDT) that form a hierarchy of classifiers, each making a distinct binary decision about class membership. These are trained and tested using 6552 features per speech sample extracted from three databases of acted emotional speech (DES, Berlin and Serbian) and a database of spontaneous speech (FAU Aibo Emotion Corpus) using the OpenEAR toolkit. Analysis of the errors made by these classifiers leads us to apply non-metric multi-dimensional scaling (NMDS) to produce a compact (two-dimensional) representation of the data suitable for guiding the choice of decision hierarchy. This representation can be interpreted in terms of the well-known valence-arousal model of emotion. We find that this model does not give a particularly good fit to the data: although the arousal dimension can be identified easily, valence is not well represented in the transformed data. We describe a new hierarchical classification technique whose structure is based on NMDS, which we call Data-Driven Dimensional Emotion Classification (3DEC). This new method is compared with the best of the four classifiers studied earlier and a state-of-the-art classification method on all four databases. We find no significant difference between these three approaches with respect to speaker-dependent performance. However, for the much more interesting and important case of speaker-independent emotion classification, 3DEC significantly outperforms the competitors. (C) 2012 Elsevier B.V. All rights reserved. C1 [Hassan, A.; Damper, R. I.] 
Univ Southampton, Syst Res Grp, Sch Elect & Comp Sci, Southampton SO17 1BJ, Hants, England. RP Damper, RI (reprint author), Univ Southampton, Syst Res Grp, Sch Elect & Comp Sci, Southampton SO17 1BJ, Hants, England. EM ah07r@ecs.soton.ac.uk; rid@ecs.soton.ac.uk CR Balentine B, 2002, BUILD SPEECH RECOGNI Batliner A., 2004, P 4 INT C LANG RES E, P171 Batliner A, 2008, USER MODEL USER-ADAP, V18, P175, DOI 10.1007/s11257-007-9039-4 Batliner A, 2006, P IS LTC 2006 LJUBL, P240 Breazeal C, 2002, AUTON ROBOT, V12, P83, DOI 10.1023/A:1013215010749 Burkhardt F., 2005, INTERSPEECH, P1517 Casale S., 2008, IEEE INT C SEM COMP, P158 Chawla NV, 2002, J ARTIF INTELL RES, V16, P321 Cristianini N., 2000, INTRO SUPPORT VECTOR DELLAERT F, 1996, P 4 INT C SPOK LANG, V3, P1970, DOI 10.1109/ICSLP.1996.608022 Douglas-Cowie E, 2003, SPEECH COMMUN, V40, P33, DOI 10.1016/S0167-6393(02)00070-5 El Ayadi M, 2011, PATTERN RECOGN, V44, P572, DOI 10.1016/j.patcog.2010.09.020 Engberg I. S., 1996, DOCUMENTATION DANISH Eyben F., 2010, 7 INT C LANG RES EV, P77 Eyben F., 2009, P 4 INT HUMAINE ASS, P1 Hansen J.H.L., 1997, P EUR C SPEECH COMM, P1743 Hassan A., 2009, INTERSPEECH 09, P2403 Hassan A., 2010, INTERSPEECH 10, P2354 Hsu C., 2001, IEEE T NEURAL NETWOR, V13, P415 Hsu C. W., 2003, PRACTICAL GUIDE SUPP Jovicic S. T, 2004, P 9 C SPEECH COMP SP, P77 KRUSKAL JB, 1964, PSYCHOMETRIKA, V29, P115, DOI 10.1007/BF02289694 Lee CC, 2011, SPEECH COMMUN, V53, P1162, DOI 10.1016/j.specom.2011.06.004 Lee CM, 2005, IEEE T SPEECH AUDI P, V13, P293, DOI 10.1109/TSA.2004.838534 Li Y., 1998, P INT C SPOK LANG PR, P2255 Luengo I., 2009, 1NTERSPEECH 09, P332 Martin A., 1997, P EUR RHOD GREEC, V97, P1895 McTear MF, 2002, ACM COMPUT SURV, V34, P90, DOI 10.1145/505282.505285 Murray IR, 2008, COMPUT SPEECH LANG, V22, P107, DOI 10.1016/j.csl.2007.06.001 MURRAY IR, 1993, J ACOUST SOC AM, V93, P1097, DOI 10.1121/1.405558 ORTONY A, 1990, PSYCHOL REV, V97, P315, DOI 10.1037//0033-295X.97.3.315 Paeschke A., 2004, P SPEECH PROS NAR JA, P671 Picard R. W., 1997, AFFECTIVE COMPUTING Planet S., 2009, INTERSPEECH 09, P316 PLATT J. C., 2000, P NEUR INF PROC SYST, P547 Ramanan A., 2007, INT C IND INF SYST I, P291 RUSSELL JA, 1980, J PERS SOC PSYCHOL, V39, P1161, DOI 10.1037/h0077714 Schiel F., 2002, P 3 LANG RES EV C LR, P200 SCHLOSBERG H, 1954, PSYCHOL REV, V61, P81, DOI 10.1037/h0054570 Schuller B., 2011, P INTERSPEECH, V12, P3201 Schuller B., 2009, INTERSPEECH, P312 Schuller B., 2011, P 1 INT AUD VIS EM C, P415 SCHULLER B, 2009, IEEE WORKSH AUT SPEE, P552 Schuller B, 2010, IEEE T AFFECT COMPUT, V1, P119, DOI 10.1109/T-AFFC.2010.8 Shahid S., 2008, P 4 INT C SPEECH PRO, P669 Shami M, 2007, SPEECH COMMUN, V49, P201, DOI 10.1016/j.specom.2007.01.006 Shaukat A, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P2771 Siegel S., 1988, NONPARAMETRIC STAT B Slaney M, 2003, SPEECH COMMUN, V39, P367, DOI 10.1016/S0167-6393(02)00049-3 Steidl S., 2009, THESIS U ERLANGEN NU Ververidis D, 2006, SPEECH COMMUN, V48, P1162, DOI 10.1016/j.specom.2006.04.003 Ververidis D., 2004, P IEEE INT C AC SPEE, V1, P593 Vidrascu L., 2005, INTERSPEECH 05, P1841 Vogt T., 2005, IEEE INT C MULT EXP, P474 Wilting J., 2006, INTERSPEECH 2006, P1093 Witten I.H., 2005, DATA MINING PRACTICA Yang B, 2010, SIGNAL PROCESS, V90, P1415, DOI 10.1016/j.sigpro.2009.09.009 NR 57 TC 6 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
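To illustrate the NMDS step that motivates the 3DEC hierarchy in the abstract above, the sketch below embeds a hypothetical emotion confusion matrix into two dimensions with non-metric MDS; clusters in such a map can then suggest which classes each binary split should separate first. The confusion counts and the dissimilarity transform are invented for demonstration and are not the paper's data or procedure.

```python
import numpy as np
from sklearn.manifold import MDS

# Hypothetical confusion counts between five emotion classes
# (rows: true class, columns: predicted class).
labels = ["anger", "happiness", "neutral", "sadness", "surprise"]
confusion = np.array([
    [50,  8,  3,  1,  6],
    [ 9, 46,  4,  2,  7],
    [ 4,  5, 52,  5,  2],
    [ 1,  2,  6, 55,  1],
    [ 7,  8,  2,  1, 48],
], dtype=float)

# Symmetrise and turn "often confused" into "close": a simple dissimilarity.
conf_sym = (confusion + confusion.T) / 2.0
dissim = 1.0 - conf_sym / conf_sym.max()
np.fill_diagonal(dissim, 0.0)

# Non-metric MDS gives a compact 2-D representation of the classes.
nmds = MDS(n_components=2, metric=False, dissimilarity="precomputed",
           random_state=0, n_init=4)
coords = nmds.fit_transform(dissim)
for name, (x, y) in zip(labels, coords):
    print(f"{name:10s} {x:+.2f} {y:+.2f}")
```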
PD SEP PY 2012 VL 54 IS 7 BP 903 EP 916 DI 10.1016/j.specom.2012.03.003 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 961VA UT WOS:000305496400006 ER PT J AU Quene, H Semin, GR Foroni, F AF Quene, Hugo Semin, Gun R. Foroni, Francesco TI Audible smiles and frowns affect speech comprehension SO SPEECH COMMUNICATION LA English DT Article DE Smiles; Speech comprehension; Emotion; Affect perception; Motor resonance ID LANGUAGE COMPREHENSION; COMMUNICATING EMOTION; VOCAL COMMUNICATION; EXPRESSIONS; PROSODY; WORD; INTERFERENCE; RECOGNITION; RESPONSES AB Motor resonance processes are involved both in language comprehension and in affect perception. Therefore we predict that listeners understand spoken affective words slower, if the phonetic form of a word is incongruent with its affective meaning. A language comprehension study involving an interference paradigm confirmed this prediction. This interference suggests that affective phonetic cues contribute to language comprehension. A perceived smile or frown affects the listener, and hearing an incongruent smile or frown impedes our comprehension of spoken words. (C) 2012 Elsevier B.V. All rights reserved. C1 [Quene, Hugo] Univ Utrecht, Utrecht Inst Linguist, OTS, NL-3512 JK Utrecht, Netherlands. [Semin, Gun R.; Foroni, Francesco] Univ Utrecht, Fac Social & Behav Sci, NL-3584 CS Utrecht, Netherlands. RP Quene, H (reprint author), Univ Utrecht, Utrecht Inst Linguist, OTS, Trans 10, NL-3512 JK Utrecht, Netherlands. EM h.quene@uu.nl; g.r.semin@uu.nl; f.foroni@uu.nl RI Foroni, Francesco/G-5469-2012 CR Adank P, 2010, PSYCHOL SCI, V21, P1903, DOI 10.1177/0956797610389192 Baayen RH, 2008, J MEM LANG, V59, P390, DOI 10.1016/j.jml.2007.12.005 Batliner A, 2003, SPEECH COMMUN, V40, P117, DOI 10.1016/S0167-6393(02)00079-1 Boersma P., 2011, PRAAT DOING PHONETIC Chuenwattanapranithi S, 2008, PHONETICA, V65, P210, DOI 10.1159/000192793 Dimberg U, 2000, PSYCHOL SCI, V11, P86, DOI 10.1111/1467-9280.00221 Drahota A, 2008, SPEECH COMMUN, V50, P278, DOI 10.1016/j.specom.2007.10.001 esink CMJY, 2008, J COGNITIVE NEUROSCI, V21, P2085 Foroni F, 2009, PSYCHOL SCI, V20, P974, DOI 10.1111/j.1467-9280.2009.02400.x FRICK RW, 1985, PSYCHOL BULL, V97, P412, DOI 10.1037//0033-2909.97.3.412 Galantucci B, 2006, PSYCHON B REV, V13, P361, DOI 10.3758/BF03193857 Gallese V, 2003, PSYCHOPATHOLOGY, V36, P171, DOI 10.1159/000072786 Gallese V, 2009, PSYCHOL RES-PSYCH FO, V73, P486, DOI 10.1007/s00426-009-0232-4 Gallese V, 1996, BRAIN, V119, P593, DOI 10.1093/brain/119.2.593 Grimshaw GM, 1998, BRAIN COGNITION, V36, P108, DOI 10.1006/brcg.1997.0949 Hawk ST, 2012, J PERS SOC PSYCHOL, V102, P796, DOI 10.1037/a0026234 Hietanen JK, 1998, PSYCHOPHYSIOLOGY, V35, P530, DOI 10.1017/S0048577298970445 Kitayama S, 2002, COGNITION EMOTION, V16, P29, DOI 10.1080/0269993943000121 KLATT DH, 1980, J ACOUST SOC AM, V67, P971, DOI 10.1121/1.383940 Klauer KC, 2003, PSYCHOLOGY OF EVALUATION, P7 Kohler E, 2002, SCIENCE, V297, P846, DOI 10.1126/science.1070311 Lasarcyk E., 2008, 8 INT SPEECH PROD SE MEHRABIA.A, 1967, J PERS SOC PSYCHOL, V6, P109, DOI 10.1037/h0024532 Neumann R, 2000, J PERS SOC PSYCHOL, V79, P211, DOI 10.1037//0022-3514.79.2.211 Niedenthal PM, 2010, BEHAV BRAIN SCI, V33, P417, DOI 10.1017/S0140525X10000865 Niedenthal PM, 2007, SCIENCE, V316, P1002, DOI 10.1126/science.1136930 Nygaard LC, 2008, J EXP PSYCHOL HUMAN, V34, P1017, DOI 10.1037/0096-1523.34.4.1017 Ohala J.J., 1980, J ACOUST SOC AM S, pS33 OHALA JJ, 1983, PHONETICA, V40, P1 
Paulmann S, 2012, SPEECH COMMUN, V54, P92, DOI 10.1016/j.specom.2011.07.004 Pell MD, 2009, J PHONETICS, V37, P417, DOI 10.1016/j.wocn.2009.07.005 Quene H, 2008, J MEM LANG, V59, P413, DOI 10.1016/j.jml.2008.02.002 Scherer KR, 2003, SPEECH COMMUN, V40, P227, DOI 10.1016/S0167-6393(02)00084-5 Schirmer A, 2002, COGNITIVE BRAIN RES, V14, P228, DOI 10.1016/S0926-6410(02)00108-8 Schirmer A, 2003, J COGNITIVE NEUROSCI, V15, P1135, DOI 10.1162/089892903322598102 Schroder M, 2006, IEEE T AUDIO SPEECH, V14, P1128, DOI 10.1109/TASL.2006.876118 Scott SK, 1997, NATURE, V385, P254, DOI 10.1038/385254a0 Stroop JR, 1935, J EXP PSYCHOL, V18, P643, DOI 10.1037/0096-3445.121.1.15 TARTTER VC, 1994, J ACOUST SOC AM, V96, P2101, DOI 10.1121/1.410151 Van Berkum J.J.A., 2007, J COGNITIVE NEUROSCI, V20, P580 WILLIAMS CE, 1972, J ACOUST SOC AM, V52, P1238, DOI 10.1121/1.1913238 Wilson SM, 2004, NAT NEUROSCI, V7, P701, DOI 10.1038/nn1263 Xu Y., 2010, PERCEPTION Zwaan RA, 2006, J EXP PSYCHOL GEN, V135, P1, DOI 10.1037/0096-3445.135.1.1 NR 44 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2012 VL 54 IS 7 BP 917 EP 922 DI 10.1016/j.specom.2012.03.004 PG 6 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 961VA UT WOS:000305496400007 ER PT J AU Prieto, P Vanrell, MD Astruc, L Payne, E Post, B AF Prieto, Pilar Vanrell, Maria del Mar Astruc, Lluisa Payne, Elinor Post, Brechtje TI Phonotactic and phrasal properties of speech rhythm. Evidence from Catalan, English, and Spanish SO SPEECH COMMUNICATION LA English DT Article DE Rhythm; Rhythm index measures; Accentual lengthening; Final lengthening; Spanish language; Catalan language; English language ID DURATION; LANGUAGE; STRESS; FRENCH; MUSIC AB The goal of this study is twofold: first, to examine in greater depth the claimed contribution of differences in syllable structure to measures of speech rhythm for three languages that are reported to belong to different rhythmic classes, namely, English, Spanish, and Catalan; and second, to investigate differences in the durational marking of prosodic heads and final edges of prosodic constituents between the three languages and test whether this distinction correlates in any way with the rhythmic distinctions. Data from a total of 24 speakers reading 720 utterances from these three languages show that differences in the rhythm metrics emerge even when syllable structure is controlled for in the experimental materials, at least between English on the one hand and Spanish/Catalan on the other, suggesting that important differences in durational patterns exist between these languages that cannot simply be attributed to differences in phonotactic properties. In particular, the vocalic variability measures nPVI-V, Delta V, and VarcoV are shown to be robust tools for discrimination above and beyond such phonotactic properties. Further analyses of the data indicate that the rhythmic class distinctions under consideration finely correlate with differences in the way these languages instantiate two prosodic timing processes, namely, the durational marking of prosodic heads, and pre-final lengthening at prosodic boundaries. (C) 2011 Elsevier B.V. All rights reserved. C1 [Prieto, Pilar] ICREA Univ Pompeu Fabra, Barcelona 08018, Spain. [Vanrell, Maria del Mar] Univ Pompeu Fabra, Barcelona 08018, Spain. 
[Astruc, Lluisa] Open Univ, Dept Languages, Fac Languages & Language Studies, Milton Keynes MK7 6AA, Bucks, England. [Payne, Elinor] Univ Oxford, Phonet Lab, Oxford OX1 2JF, England. [Post, Brechtje] Dept Theoret & Appl Linguist, Cambridge CB3 4DA, England. RP Prieto, P (reprint author), ICREA Univ Pompeu Fabra, Carrer Roc Boronat 138,Despatx 51-600, Barcelona 08018, Spain. EM pilar.prieto@upf.edu RI Consolider Ingenio 2010, BRAINGLOT/D-1235-2009; Prieto, Pilar/E-7390-2013 OI Prieto, Pilar/0000-0001-8175-1081 FU The acquisition of rhythm in Catalan, Spanish and English [2007 PBR 29]; The acquisition of intonation in Catalan, Spanish and English [2009 PBR 00018]; Generalitat de Catalunya; British Academy [SG-51777]; Spanish Ministerio de Ciencia e Innovacion; SGR 701; [FFI2009-07648/FILO] FX We are grateful to the audience at these three conferences, and especially to M. D'Imperio, S. Frota, J.I. Hualde, F. Nolan, and L. White for fruitful discussions on some of the issues raised in this article. We also thank the action editor Jan van Santen and three anonymous reviewers for their comments on an earlier version of this paper. The idea for conducting the first experiment stems from an informal conversation with J.I. Hualde. We would like to thank N. Argemi, A. Barbera, M. Bell, A. Estrella, and F. Torres-Tamarit for recording the data in the three languages, and to N. Hilton and P. Roseano for carrying out the segmentation and coding of the data. Special thanks are due to F. Ramus et al. for making available to us the sentences used in their 1999 paper. This research has been funded by two Batista i Roca research projects entitled "The acquisition of rhythm in Catalan, Spanish and English" and "The acquisition of intonation in Catalan, Spanish and English" (Refs. 2007 PBR 29 and 2009 PBR 00018, respectively) awarded by the Generalitat de Catalunya, by the project "A cross-linguistic study of intonational development in young infants and children" awarded by the British Academy (Ref. SG-51777), and by the projects FFI2009-07648/FILO and CONSOLIDER-INGENIO 2010 "Bilinguismo y Neurociencia Cognitiva CSD2007-00012" awarded by the Spanish Ministerio de Ciencia e Innovacion, and 2009 SGR 701, awarded by the Generalitat de Catalunya. CR Abercrombie D, 1967, ELEMENTS GEN PHONETI Arvaniti A, 2009, PHONETICA, V66, P46, DOI 10.1159/000208930 Astruc L., 2009, CAMBRIDGE OCCASIONAL, V5, P1 Astruc L., 2006, P SPEECH PROS 2006, P337 Asu E., 2006, P SPEECH PROS 2006, P49 Baayen RH, 2008, J MEM LANG, V59, P390, DOI 10.1016/j.jml.2007.12.005 Barnes J, 2001, NE LING SOC NELS 32 Barry W, 2009, PHONETICA, V66, P78, DOI 10.1159/000208932 Beckman M., 1994, PAPERS LAB PHONOLOGY, P7 Beckman M. E., 1986, PHONOLOGY YB, V3, P255, DOI 10.1017/S095267570000066X Beckman M. 
E., 1992, SPEECH PERCEPTION PR, P457 Beckman Mary, 2002, PROBUS, V14, P9, DOI 10.1515/prbs.2002.008 Boersma P., 2007, PRAAT DOING PHONETIC Bolinger D., 1965, PITCH ACCENT SENTENC Byrd D, 2000, PHONETICA, V57, P3, DOI 10.1159/000028456 Carter PM, 2005, AMST STUD THEORY HIS, V272, P63 Cummins F, 2002, P SPEECH PROS AIR EN, P121 Cummins F, 1998, J PHONETICS, V26, P145, DOI 10.1006/jpho.1998.0070 Dauer R., 1987, P 11 INT C PHON SCI, P447 DAUER RM, 1983, J PHONETICS, V11, P51 DELATTRE P, 1966, IRAL-INT REV APPL LI, V4, P183, DOI 10.1515/iral.1966.4.1-4.183 Dellwo V., 2004, P 8 ICSLP JEJ ISL KO Dellwo V, 2003, P 15 INT C PHON SCI, P471 Dellwo V., 2006, P 38 LING C LANG FRA, P231 DEMANRIQUE AMB, 1983, J PHONETICS, V11, P117 den Os E., 1988, RHYTHM TEMPO DUTCH I Dilley L., 1996, J PHONETICS, V26, P45 Estebas-Vilaplana Eva, 2010, TRANSCRIPTION INTONA, P17 Fant G., 1991, 12 INT C PHON SCI AI, P118 Ferragne E., 2004, P MOD ID LANG PAR Fougeron C., 1996, UCLA WORKING PAPERS, V92, P61 Frota S., 2007, SEGMENTAL PROSODIC I, P131 Frota S., 2001, PROBUS, V13, P247, DOI 10.1515/prbs.2001.005 Gavalda-Ferre N, 2007, THESIS U COLL LONDON Grabe Esther, 2002, LAB PHONOLOGY, V7, P515 Hasegawa-Johnson M., 2007, P 16 INT C PHON SCI, P1264 Hasegawa-Johnson M, 2004, P ICSA INT C SPOK LA, P2729 Hualde J.I., 2005, SOUNDS SPANISH IBM Corporation, 2010, IBM SPSS STAT VERS 1 LLOYD J., 1940, SPEECH SIGNALS TELEP Low E., 2000, LANG SPEECH, V43, P377 MASCARO IGNASI, 2010, P 5 INT C SPEECH PRO, P1 Mascaro J., 2002, GRAMATICA CATALA CON, P89 Nespor M., 1990, LOGICAL ISSUES LANGU, P157 Nolan F, 2009, PHONETICA, V66, P64, DOI 10.1159/000208931 O'Rourke E., 2008, P 4 C SPEECH PROS MA OLLER DK, 1973, J ACOUST SOC AM, V54, P1235, DOI 10.1121/1.1914393 Ortega-Llebaria M., 2010, LANG SPEECH, V54, P1 Ortega-Llebaria M., 2007, CURRENT ISSUES LINGU, P155 FANT G, 1991, J PHONETICS, V19, P351 Patel A., 2008, MUSIC LANGUAGE BRAIN Patel AD, 2003, COGNITION, V87, pB35, DOI 10.1016/S0010-0277(02)00187-7 Patel AD, 2006, J ACOUST SOC AM, V119, P3034, DOI 10.1121/1.2179657 Payne E., LANGUAGE SPEECH, V55 Payne E., 2009, OXFORD U WORKING PAP, V12, P123 Payne E., 2010, ATT CONV PROS GLI UN, P147 PETERSON GE, 1960, J ACOUST SOC AM, V32, P693, DOI 10.1121/1.1908183 Pike K. L., 1945, INTONATION AM ENGLIS Prieto P., 2010, J PHONETICS, V38, P688 Prieto Pilar, PROSODIC TY IN PRESS Ramus F, 1999, COGNITION, V73, P265, DOI 10.1016/S0010-0277(99)00058-X Ramus F., 2003, P 15 INT C PHON SCI, P337 Randolph J. J., 2005, JOENS U LEARN INSTR Randolph JJ, 2008, ONLINE KAPPA CALCULA Roach P., 1982, LINGUISTIC CONTROVER Russo M., 2008, P 4 C SPEECH PROS MA Shen Y., 1962, U BUFFALO STUDIES LI, V9, P1 SYRDAL ANN, 2000, P INT C SPOK LANG PR, P235 Turk AE, 1997, J PHONETICS, V25, P25, DOI 10.1006/jpho.1996.0032 Turk AE, 1999, J PHONETICS, V27, P171, DOI 10.1006/jpho.1999.0093 Warrens MJ, 2010, ADV DATA ANAL CLASSI, V4, P271, DOI 10.1007/s11634-010-0073-4 WENK BJ, 1982, J PHONETICS, V10, P193 White L., 2007, CURRENT ISSUES LINGU, P237 White L., 2009, PHONETICS PHONOLOGY White L, 2007, J PHONETICS, V35, P501, DOI 10.1016/j.wocn.2007.02.003 WIGHTMAN CW, 1992, J ACOUST SOC AM, V91, P1707, DOI 10.1121/1.402450 NR 76 TC 15 Z9 15 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD JUL PY 2012 VL 54 IS 6 BP 681 EP 702 DI 10.1016/j.specom.2011.12.001 PG 22 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 940QP UT WOS:000303908400001 ER PT J AU Oura, K Yamagishi, J Wester, M King, S Tokuda, K AF Oura, Keiichiro Yamagishi, Junichi Wester, Mirjam King, Simon Tokuda, Keiichi TI Analysis of unsupervised cross-lingual speaker adaptation for HMM-based speech synthesis using KLD-based transform mapping SO SPEECH COMMUNICATION LA English DT Article DE HMM-based speech synthesis; Unsupervised speaker adaptation; Cross-lingual speaker adaptation; Speech-to-speech translation ID SYNTHESIS SYSTEM; RECOGNITION; ALGORITHM; MODEL; TTS AB In the EMIME project, we developed a mobile device that performs personalized speech-to-speech translation such that a user's spoken input in one language is used to produce spoken output in another language, while continuing to sound like the user's voice. We integrated two techniques into a single architecture: unsupervised adaptation for HMM-based TTS using word-based large-vocabulary continuous speech recognition, and cross-lingual speaker adaptation (CLSA) for HMM-based TTS. The CLSA is based on a state-level transform mapping learned using minimum Kullback-Leibler divergence between pairs of HMM states in the input and output languages. Thus, an unsupervised cross-lingual speaker adaptation system was developed. End-to-end speech-to-speech translation systems for four languages (English, Finnish, Mandarin, and Japanese) were constructed within this framework. In this paper, the English-to-Japanese adaptation is evaluated. Listening tests demonstrate that adapted voices sound more similar to a target speaker than average voices and that differences between supervised and unsupervised cross-lingual speaker adaptation are small. Calculating the KLD state-mapping on only the first 10 mel-cepstral coefficients leads to huge savings in computational costs, without any detrimental effect on the quality of the synthetic speech. (C) 2012 Elsevier B.V. All rights reserved. C1 [Oura, Keiichiro; Tokuda, Keiichi] Nagoya Inst Technol, Dept Comp Sci & Engn, Showa Ku, Nagoya, Aichi 4668555, Japan. [Yamagishi, Junichi; Wester, Mirjam; King, Simon] Univ Edinburgh, Ctr Speech Technol Res, Edinburgh EH8 9AB, Midlothian, Scotland. RP Oura, K (reprint author), Nagoya Inst Technol, Dept Comp Sci & Engn, Showa Ku, Gokiso Cho, Nagoya, Aichi 4668555, Japan. EM uratec@nitech.ac.jp FU European Community [213845]; Ministry of Internal Affairs and Communication, Japan FX The research leading to these results was partly funded by the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement 213845 (the EMIME project), and the Strategic Information and Communications R&D Promotion Programme (SCOPE), Ministry of Internal Affairs and Communication, Japan. CR Anastasakos T., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. 
No.96TH8206), DOI 10.1109/ICSLP.1996.607807 Baumann O, 2010, PSYCHOL RES-PSYCH FO, V74, P110, DOI 10.1007/s00426-008-0185-z Benoit C, 1996, SPEECH COMMUN, V18, P381, DOI 10.1016/0167-6393(96)00026-X Beskow J., 2000, P ICSLP 2000 BEIJ CH, V4, P464 Dines J., 2009, P INT 2009 BRIGHT UK Dines J, 2010, IEEE J-STSP, V4, P1046, DOI 10.1109/JSTSP.2010.2079315 Fitt S., 2000, DOCUMENTATION USER G Fukada T., 1992, P ICASSP, P137, DOI 10.1109/ICASSP.1992.225953 Gales MJF, 1998, COMPUT SPEECH LANG, V12, P75, DOI 10.1006/csla.1998.0043 Gao Y., 2003, P EUR 2003 GEN, P365 Gibson M., 2009, P INT BRIGHT UK SEP, P1791 Goldberger J, 2003, ICCV 03, P487 Hashimoto K., SPEECH COMM Hashimoto K, 2011, INT CONF ACOUST SPEE, P5108 HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423 Hirsimaki T, 2009, IEEE T AUDIO SPEECH, V17, P724, DOI 10.1109/TASL.2008.2012323 Imai S, 1983, ICASSP 83, P93 Iskra D., 2002, P LREC, P329 Itou K., 1998, P ICSLP, P3261 Kawahara H., 2001, P 2 MAVEBA FIR IT Kawahara H, 1999, SPEECH COMMUN, V27, P187, DOI 10.1016/S0167-6393(98)00085-5 King S., 2008, P INT, P1869 Kneser R, 1995, P IEEE INT C AC SPEE, V1, P181 Ladefoged P., 1996, SOUNDS WORLDS LANGUA Liang H, 2010, INT CONF ACOUST SPEE, P4598, DOI 10.1109/ICASSP.2010.5495559 Liu F., 2003, P ICASSP, P636 MCCREE AV, 1995, IEEE T SPEECH AUDI P, V3, P242, DOI 10.1109/89.397089 Moore D., 2006, 3 JOINT WORKSH MULT NEY H, 1999, ACOUST SPEECH SIG PR, P517 Noth E, 2000, IEEE T SPEECH AUDI P, V8, P519, DOI 10.1109/89.861370 PAUL DB, 1992, P WORKSH SPEECH NAT, P357, DOI 10.3115/1075527.1075614 Qian Y, 2009, IEEE T AUDIO SPEECH, V17, P1231, DOI 10.1109/TASL.2009.2015708 Shinoda K., 2000, Journal of the Acoustical Society of Japan (E), V21, DOI 10.1250/ast.21.79 Sjolander K., 1998, P ICSLP 1998 Toda T, 2007, IEICE T INF SYST, VE90D, P816, DOI 10.1093/ietisy/e90-d.5.816 Tokuda K, 2002, IEICE T INF SYST, VE85D, P455 Tokuda K., 1994, P INT C SPOK LANG PR, P1043 Tsujimura N, 2006, INTRO JAPANESE LINGU Tsuzaki M., 2011, INTERSPEECH 2011 FLO, P157 Wester M, 2011, INT CONF ACOUST SPEE, P5372 Wester M., 2010, P 7 ISCA SPEECH SYNT Wester M, 2010, P INT 2010 TOK JAP Woodland P. C., 2001, P ISCA WORKSH AD MET, P11 Wu Y.J., 2009, P INTERSPEECH, P528 Yamagishi J, 2009, IEEE T AUDIO SPEECH, V17, P66, DOI 10.1109/TASL.2008.2006647 Yamagishi J., 2009, P BLIZZ CHALL WORKSH Yamagishi J, 2010, IEEE T AUDIO SPEECH, V18, P984, DOI 10.1109/TASL.2010.2045237 Yoneyama K, 2004, P INT 2004 JEJ ISL K Yoshimura T., 1999, P EUROSPEECH 99 SEPT, P2374 Young SJ, 1994, P ARPA HUM LANG TECH, P307, DOI 10.3115/1075812.1075885 Yu Z., 2008, P ICSP 2008, P655 Zen H, 2007, IEICE T INF SYST, VE90D, P825, DOI 10.1093/ietisy/e90-d.5.825 Zen HG, 2007, IEICE T INF SYST, VE90D, P325, DOI 10.1093/ietisy/e90-d.1.325 NR 53 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2012 VL 54 IS 6 BP 703 EP 714 DI 10.1016/j.specom.2011.12.004 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 940QP UT WOS:000303908400002 ER PT J AU Kaufmann, T Pfister, B AF Kaufmann, Tobias Pfister, Beat TI Syntactic language modeling with formal grammars SO SPEECH COMMUNICATION LA English DT Article DE Large-vocabulary continuous speech recognition; Language modeling; Formal grammar; Discriminative reranking ID SYSTEM AB It has repeatedly been demonstrated that automatic speech recognition can benefit from syntactic information. 
However, virtually all syntactic language models for large-vocabulary continuous speech recognition are based on statistical parsers. In this paper, we investigate the use of a formal grammar as a source of syntactic information. We describe a novel approach to integrating formal grammars into speech recognition and evaluate it in a series of experiments. For a German broadcast news transcription task, the approach was found to reduce the word error rate by 9.7% (relative) compared to a competitive baseline speech recognizer. We provide an extensive discussion on various aspects of the approach, including the contribution of different kinds of information, the development of a precise formal grammar and the acquisition of lexical information. (C) 2012 Elsevier B.V. All rights reserved. C1 [Kaufmann, Tobias; Pfister, Beat] ETH, Speech Proc Grp, Comp Engn & Networks Lab, CH-8092 Zurich, Switzerland. RP Kaufmann, T (reprint author), ETH, Speech Proc Grp, Comp Engn & Networks Lab, Gloriastr 35, CH-8092 Zurich, Switzerland. EM tobias.kaufmann@stairwell.ch FU Swiss National Science Foundation FX This work was supported by the Swiss National Science Foundation. We cordially thank Jean-Luc Gauvain for providing us with word lattices from the LIMSI German broadcast news transcription system. We further thank Canoo Engineering AG for granting us access to their morphological database. CR Baldwin T., 2005, P 6 M PAC ASS COMP L, P23 Baldwin Timothy, 2005, P ACL SIGLEX WORKSH, P67, DOI 10.3115/1631850.1631858 Beutler R, 2007, THESIS ETH ZURICH Beutler R., 2005, P IEEE ASRU WORKSH S, P104 Brants S., 2002, P WORKSH TREEB LING Brants T., 2000, P 6 C APPL NAT LANG, P224, DOI 10.3115/974147.974178 Carrol J., 2005, P 2 INT JOINT C NAT, P165 Charnak E., 2001, P 39 ANN M ASS COMP, P124, DOI 10.3115/1073012.1073029 Charniak E, 2005, P 43 ANN M ASS COMP, P173, DOI 10.3115/1219840.1219862 Chelba C, 2000, COMPUT SPEECH LANG, V14, P283, DOI 10.1006/csla.2000.0147 Chen S. F., 1999, GAUSSIAN PRIOR SMOOT Collins C., 2004, P ACL Collins M, 2005, COMPUT LINGUIST, V31, P25, DOI 10.1162/0891201053630273 Collins M., 2005, P ACL, P507, DOI 10.3115/1219840.1219903 Crysmann Berhold, 2003, P RANLP Crysmann Berthold, 2005, RES LANGUAGE COMPUTA, V3, P61, DOI 10.1007/s11168-005-1287-z Duden, 1999, GROSSE WORTERBUCH DT ERMAN LD, 1980, COMPUT SURV, V12, P213 Fano R. M., 1961, TRANSMISSION INFORM Gauvain JL, 2002, SPEECH COMMUN, V37, P89, DOI 10.1016/S0167-6393(01)00061-9 Gillick L., 1989, P ICASSP, P532 Goddeau D, 1992, P ICSLP Goodine D., 1991, P 2 EUR C SPEECH COM Hall K., 2003, P IEEE AUT SPEECH RE, P507 Church K. W., 1990, Computational Linguistics, V16 Johnson Mark, 1999, P 37 ANN M ASS COMP, P535, DOI 10.3115/1034678.1034758 Jurafsky D., 1995, P 1995 INT C AC SPEE, VI, P189 Kaplan R. M., 2004, P HUM LANG TECHN C N, P97 Kaufmann T., 2009, P INT 09 BRIGHT ENGL Kaufmann T, 2009, THESIS ETH ZURICH Kiefer B, 2000, ART INTEL, P280 Kita K., 1991, P IEEE INT C AC SPEE, P269, DOI 10.1109/ICASSP.1991.150329 Malouf R., 2004, P IJCNLP 04 WORKSH S Malouf R., 2002, P 6 C NAT LANG LEARN, P1 MANBER U, 1990, PROCEEDINGS OF THE FIRST ANNUAL ACM-SIAM SYMPOSIUM ON DISCRETE ALGORITHMS, P319 McNeill W. P., 2006, P IEEE SPOK LANG TEC, P90 McTait K., 2003, P EUR 03 GEN SWITZ Moore R., 1995, P ARPA SPOK LANG SYS Muller S, 2000, ART INTEL, P238 Muller S, 2007, STAUFFENBURG EINFUHR Muller Stefan, 1999, LINGUISTISCHE ARBEIT, V394 Nederhof M. 
J., 1997, P ACL EACL WORKSH SP, P66 Nicholson J., 2008, P 6 INT C LANG RES E Osborne M., 2000, P 18 INT C COMP LING, P586 Pollard Carl, 1994, HEAD DRIVEN PHRASE S Price P., 1988, P IEEE INT C AC SPEE, V1, P651 Price P., 1990, P DARPA SPEECH NAT L, P91, DOI 10.3115/116580.116612 Prins R., 2003, TRAITEMENT AUTOMATIQ, V44, P121 Rayner M., 2006, PUTTING LINGUISTICS Riezler S., 2002, P 40 ANN M ASS COMP, P271 Roark B, 2001, COMPUT LINGUIST, V27, P249, DOI 10.1162/089120101750300526 Roark B, 2001, P ACL 01 PHIL US, P287, DOI 10.3115/1073083.1073131 Sag I., 2002, LECT NOTES COMPUT SC, V2276, P1 Sag I. A., 1987, CSLI LECT NOTES, VI Sag I. A., 2003, SYNTACTIC THEORY FOR Schone P, 2001, PROCEEDINGS OF THE 2001 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, P100 Tomita M., 1991, GEN LR PARSING Toutanova K., 2005, J RES LANGUAGE COMPU, V3, P83 van Noord G., 2004, P ACL van Noord G, 2001, ROBUSTNESS LANGUAGE van Noord Gertjan, 2006, ACT 13 C TRAIT AUT L, P20 Wahlster W., 2000, VERBMOBIL FDN SPEECH Wang W., 2003, P IEEE AUT SPEECH RE, P519 Wang W., 2002, P C EMP METH NAT LAN, V10, P238, DOI 10.3115/1118693.1118724 Wang W., 2004, P IEEE ICASSP 04 MON, VI, P261 Xu P., 2002, P 40 ANN M ACL PHIL, P191 Yamamoto M, 2001, COMPUT LINGUIST, V27, P1, DOI 10.1162/089120101300346787 Zhang Yi, 2007, P WORKSH DEEP LING P, P128, DOI 10.3115/1608912.1608932 Zue V., 1990, P ICASSP 90, P73 NR 69 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2012 VL 54 IS 6 BP 715 EP 731 DI 10.1016/j.specom.2012.01.001 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 940QP UT WOS:000303908400003 ER PT J AU Zelinka, P Sigmund, M Schimmel, J AF Zelinka, Petr Sigmund, Milan Schimmel, Jiri TI Impact of vocal effort variability on automatic speech recognition SO SPEECH COMMUNICATION LA English DT Article DE Vocal effort level; Robust speech recognition; Machine learning ID STRESS; NOISE AB The impact of changes in a speaker's vocal effort on the performance of automatic speech recognition has largely been overlooked by researchers and virtually no speech resources exist for the development and testing of speech recognizers at all vocal effort levels. This study deals with speech properties in the whole range of vocal modes - whispering, soft speech, normal speech, loud speech, and shouting. Fundamental acoustic and phonetic changes are documented. The impact of vocal effort variability on the performance of an isolated-word recognizer is shown and effective means of improving the system's robustness are tested. The proposed multiple model framework approach reaches a 50% relative reduction of word error rate compared to the baseline system. A new specialized speech database, BUT-VE1, is presented, which contains speech recordings of 13 speakers at 5 vocal effort levels with manual phonetic segmentation and sound pressure level calibration. (C) 2012 Elsevier B.V. All rights reserved. C1 [Zelinka, Petr; Sigmund, Milan] Brno Univ Technol, Dept Radio Elect, Brno 61200, Czech Republic. [Schimmel, Jiri] Brno Univ Technol, Dept Telecommun, Brno 61200, Czech Republic. RP Zelinka, P (reprint author), Brno Univ Technol, Dept Radio Elect, Purkynova 118, Brno 61200, Czech Republic. 
EM xzelin06@stud.feec.vutbr.cz; sigmund@feec.vutbr.cz; schimmel@feec.vutbr.cz FU Czech Grant Agency [102/08/H027]; European Social Fund [CZ.1.07/2.3.00/20.0007, MSM 0021630513] FX The described research was supported by the Czech Grant Agency under Grant No. 102/08/H027. The research leading to the results has also received funding from the European Social Fund under Grant agreement CZ.1.07/2.3.00/20.0007 (the WICOMT project) and under the research program MSM 0021630513 (ELCOM). CR Acero A., 2000, P ICSLP, P869 Bou-Ghazale SE, 1998, IEEE T SPEECH AUDI P, V6, P201, DOI 10.1109/89.668815 Brungart D.S., 2001, P EUR AALB DENM, P747 Chang C.-C., 2011, ACM T INTEL SYST TEC, V2, P1, DOI DOI 10.1145/1961189.1961199 Gales MJF, 1996, IEEE T SPEECH AUDI P, V4, P352, DOI 10.1109/89.536929 Greenberg J. H., 2002, INDO EUROPEAN ITS CL Ikeno A., 2007, P 2007 IEEE AER C Ito T, 2005, SPEECH COMMUN, V45, P139, DOI 10.1016/j.specom.2003.10.005 Itoh T., 2001, P ASRU 01, P429 Junqua JC, 1996, SPEECH COMMUN, V20, P13, DOI 10.1016/S0167-6393(96)00041-6 Kuttruff H., 2000, ROOM ACOUSTICS Lippmann R. P., 1987, Proceedings: ICASSP 87. 1987 International Conference on Acoustics, Speech, and Signal Processing (Cat. No.87CH2396-0) Lu YY, 2009, SPEECH COMMUN, V51, P1253, DOI 10.1016/j.specom.2009.07.002 Meyer BT, 2011, SPEECH COMMUN, V53, P753, DOI 10.1016/j.specom.2010.07.002 Murray-Smith Roderick, 1997, MULTIPLE MODEL APPRO Nouza J., 1997, Radioengineering, V6 Oppenheim A. V, 2002, DISCRETE TIME SPEECH Paliwal K.K, 1996, AUTOMATIC SPEECH SPE Ternstrom S, 2006, J ACOUST SOC AM, V119, P1648, DOI 10.1121/1.2161435 Traunmuller H, 2000, J ACOUST SOC AM, V107, P3438, DOI 10.1121/1.429414 Wu TF, 2004, J MACH LEARN RES, V5, P975 Xu H., 2005, P INT 2005 LISB PORT, P977 Zelinka Petr, 2010, Proceedings of the 2010 IEEE International Workshop on Machine Learning for Signal Processing (MLSP), DOI 10.1109/MLSP.2010.5589174 Zhang BB, 2004, PATTERN RECOGN, V37, P131, DOI 10.1016/S0031-3203(03)00140-7 Zhang C., 2008, P INT 2008 Zhang C, 2007, P INT 2007, P2289 Zhang C., 2009, P INT 2009 Zhang C., 2008, P ITRW 2008 NR 28 TC 12 Z9 13 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2012 VL 54 IS 6 BP 732 EP 742 DI 10.1016/j.specom.2012.01.002 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 940QP UT WOS:000303908400004 ER PT J AU Kotsakis, R Kalliris, G Dimoulas, C AF Kotsakis, R. Kalliris, G. Dimoulas, C. TI Investigation of broadcast-audio semantic analysis scenarios employing radio-programme-adaptive pattern classification SO SPEECH COMMUNICATION LA English DT Article DE Audio-semantics; Radio-programmes; Content-management; Speech/non-speech segmentation; Pattern classification; Neural networks ID NEWS TRANSCRIPTION SYSTEM; SPEECH RECOGNITION; BIOACOUSTICS APPLICATION; LONG-TERM; SEGMENTATION; FEATURES; WAVELETS AB The present paper focuses on the investigation of various audio pattern classifiers in broadcast-audio semantic analysis, using radio-programme-adaptive classification strategies with supervised training. Multiple neural network topologies and training configurations are evaluated and compared in combination with feature-extraction, ranking and feature-selection procedures. Different pattern classification taxonomies are implemented, using programme-adapted multi-class definitions and hierarchical schemes. 
Hierarchical and hybrid classification taxonomies are deployed in speech analysis tasks, facilitating efficient speaker recognition/identification, speech/music discrimination, and generally speech/non-speech detection-segmentation. Exhaustive qualitative and quantitative evaluation is conducted, including indicative comparison with non-neural approaches. Hierarchical approaches offer classification-similarities for easy adaptation to generic radio-broadcast semantic analysis tasks. The proposed strategy exhibits increased efficiency in radio-programme content segmentation and classification, which is one of the most demanding audio semantics tasks. This strategy can be easily adapted in broader audio detection and classification problems, including additional real-world speech-communication demanding scenarios. (C) 2012 Elsevier B.V. All rights reserved. C1 [Kotsakis, R.; Kalliris, G.; Dimoulas, C.] Aristotle Univ Thessaloniki, Dept Journalism & Mass Commun, Lab Elect Media, Thessaloniki, Greece. RP Dimoulas, C (reprint author), Aristotle Univ Thessaloniki, Dept Journalism & Mass Commun, Lab Elect Media, Thessaloniki, Greece. EM babis@eng.auth.gr CR Ajmera J, 2003, SPEECH COMMUN, V40, P351, DOI 10.1016/S0167-6393(02)00087-0 Avdelidis K, 2010, LECT NOTES ARTIF INT, V6086, P100, DOI 10.1007/978-3-642-13529-3_12 Avdelidis K., 2010, P 128 AES CONV Bach JH, 2011, SPEECH COMMUN, V53, P690, DOI 10.1016/j.specom.2010.07.003 Benzeghiba M, 2007, SPEECH COMMUN, V49, P763, DOI 10.1016/j.specom.2007.02.006 Beyerlein P, 2002, SPEECH COMMUN, V37, P109, DOI 10.1016/S0167-6393(01)00062-0 Bishop M, 1995, NEURAL NETWORKS PATT Burnett I.S., 2006, MPEG 21 BOOK Burred JJ, 2004, J AUDIO ENG SOC, V52, P724 Celma O, 2008, J WEB SEMANT, V6, P162, DOI 10.1016/j.websem.2008.01.003 Dhanalakshmi P, 2009, EXPERT SYST APPL, V36, P6069, DOI 10.1016/j.eswa.2008.06.126 Dhanalakshmi P, 2011, ENG APPL ARTIF INTEL, V24, P350, DOI 10.1016/j.engappai.2010.10.011 Dhanalakshmi P., 2010, ASIAN J INFORM TECHN, V9, P323 Dimoulas C, 2008, EXPERT SYST APPL, V34, P26, DOI 10.1016/j.eswa.2006.08.014 Dimoulas C., 2010, 4 PAN HELL C AC ELIN Dimoulas C., 2007, P 122 AES CONV Dimoulas C, 2007, COMPUT BIOL MED, V37, P438, DOI 10.1016/j.compbiomed.2006.08.013 Dimoulas CA, 2011, EXPERT SYST APPL, V38, P13082, DOI 10.1016/j.eswa.2011.04.115 Gauvain JL, 2002, SPEECH COMMUN, V37, P89, DOI 10.1016/S0167-6393(01)00061-9 Hall M., 2009, SIGKDD EXPLORATIONS, V11, P10, DOI DOI 10.1145/1656274.1656278 Hosmer DW, 2000, APPL LOGISTIC REGRES, V2nd Huijbiegts M, 2011, SPEECH COMMUN, V53, P143, DOI 10.1016/j.specom.2010.08.008 Jain AK, 2000, IEEE T PATTERN ANAL, V22, P4, DOI 10.1109/34.824819 Jothilakshmi S, 2009, EXPERT SYST APPL, V36, P9799, DOI 10.1016/j.eswa.2009.02.040 Kakumanu P, 2006, SPEECH COMMUN, V48, P598, DOI 10.1016/j.specom.2005.09.005 Kalliris G., 2009, P INT C NEW MED INF Kalliris G.M., 2002, P 12 AES CONV Kim H.-G., 2006, MPEG 7 AUDIO AUDIO C Kinnunen T, 2010, SPEECH COMMUN, V52, P12, DOI 10.1016/j.specom.2009.08.009 Koenen R, 2000, SIGNAL PROCESS-IMAGE, V15, P463, DOI 10.1016/S0923-5965(99)00058-2 Kotsakis R., 2011, THESIS ARISTOTLE U T Lartillot O., 2007, P INT C MUS INF RETR Lee CC, 2011, SPEECH COMMUN, V53, P1162, DOI 10.1016/j.specom.2011.06.004 Loviscach J., 2010, P 128 AES CONV LOND Markaki M, 2011, SPEECH COMMUN, V53, P726, DOI 10.1016/j.specom.2010.08.007 Matsiola M., 2009, THESIS ARISTOTLE U T MOODY JE, 1992, ADV NEUR IN, V4, P847 Nguyen MN, 2010, EURASIP J AUDIO SPEE, DOI 10.1155/2010/572571 Pater N., 2005, MACH LEARN C PAP ECE 
Platt JC, 1999, ADVANCES IN KERNEL METHODS, P185 QUINLAN JR, 1986, MACH LEARN, V1, P1, DOI DOI 10.1023/A:1022643204877 Rongqing Huang, 2006, IEEE Transactions on Audio, Speech and Language Processing, V14, DOI 10.1109/TSA.2005.858057 Sankar A, 2002, SPEECH COMMUN, V37, P133, DOI 10.1016/S0167-6393(01)00063-2 Seidl T., 1998, P ACM SIGMOD INT C M, P154, DOI DOI 10.1145/276304.276319 Sjolander K., 2000, P INT C SPOK LANG PR Spyridou P.L., INT COMMUNI IN PRESS Spyridou P.L., 2009, THESIS ARISTOTLE U T Stouten F, 2006, SPEECH COMMUN, V48, P1590, DOI 10.1016/j.specom.2006.04.004 Taniguchi T, 2008, SPEECH COMMUN, V50, P547, DOI 10.1016/j.specom.2008.03.007 van Rossum G, 1991, PYTHON LANGUAGE Vegiris C., 2009, P 126 AES CONV AUD E Veglis A., 2008, 1 MONDAY, V13 Veglis A., 2008, JOURNALISM, V9, P52, DOI 10.1177/1464884907084340 Woodland PC, 2002, SPEECH COMMUN, V37, P47, DOI 10.1016/S0167-6393(01)00059-0 Wu CH, 2009, IEEE T AUDIO SPEECH, V17, P1612, DOI 10.1109/TASL.2009.2021304 Wu JD, 2009, EXPERT SYST APPL, V36, P8056, DOI 10.1016/j.eswa.2008.10.051 NR 56 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2012 VL 54 IS 6 BP 743 EP 762 DI 10.1016/j.specom.2012.01.004 PG 20 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 940QP UT WOS:000303908400005 ER PT J AU Moattar, MH Homayounpour, MM AF Moattar, M. H. Homayounpour, M. M. TI Variational conditional random fields for online speaker detection and tracking SO SPEECH COMMUNICATION LA English DT Article DE Conditional random fields; Gaussian mixture model; Variational approximation; Speaker verification; Speaker diarization; Speaker tracking ID SUPPORT VECTOR MACHINES; VERIFICATION; RECOGNITION; MODELS; SYSTEMS; TEXT AB There are many references that concern a specific aspect of speaker tracking. This paper focuses on the speaker modeling issue and proposes conditional random fields (CRF) for this purpose. CRF is a class of undirected graphical models for classifying sequential data. CRF has some interesting characteristics which have encouraged us to use this model in a speaker modeling and tracking task. The main concern of CRF model is its training. Known approaches for CRF training are prone to overfitting and unreliable convergence. To solve this problem, variational approaches are proposed in this paper. The main novelty of this paper is to adapt variational framework for CRF training. The resulted approach is evaluated on three different areas. First, the best CRF model configuration for speaker modeling is evaluated on text independent speaker verification. Next, the selected model is used in a speaker detection task, in which the models of the existing speakers in the conversation are known a priori. Then, the proposed CRF approach is compared with GMM in an online speaker tracking framework. The results show that the proposed CRF model is superior to GMM in speaker detection and tracking, due to its capability for sequence modeling and segmentation. (C) 2012 Elsevier B.V. All rights reserved. C1 [Moattar, M. H.; Homayounpour, M. M.] Amirkabir Univ Technol, Comp Engn & Informat Technol Dept, Lab Intelligent Sound & Speech Proc, Tehran, Iran. RP Homayounpour, MM (reprint author), Amirkabir Univ Technol, Comp Engn & Informat Technol Dept, Lab Intelligent Sound & Speech Proc, Tehran, Iran. 
EM moattar@aut.ac.ir; homayoun@aut.ac.ir FU Iran Telecommunication Research Center (ITRC) [T/500/14939] FX The authors would like to thank Iran Telecommunication Research Center (ITRC) for supporting this work under contract No. T/500/14939. CR Anguera X., 2006, P 2 INT WORKSH MULT Anguera X, 2011, IEEE TASLP IN PRESS [Anonymous], 2009, 2009 RT 09 RICH TRAN [Anonymous], 2009, NIST YEAR 2010 SPEAK Attias H., 1999, P 15 C UNC ART INT, P21 Auckenthaler R, 2000, DIGIT SIGNAL PROCESS, V10, P42, DOI 10.1006/dspr.1999.0360 Beal M.J., 2003, THESIS U CAMBRIDGE U Bijankhan M, 2003, P EUROSPEECH 2003, P1525 Bijankhan M, 2002, GREAT FARSDAT DATABA Bishop C. M., 2006, PATTERN RECOGNITION Blei DM, 2006, BAYESIAN ANAL, V1, P121 Campbell WM, 2006, COMPUT SPEECH LANG, V20, P210, DOI 10.1016/j.csl.2005.06.003 Campbell WM, 2006, IEEE SIGNAL PROC LET, V13, P308, DOI 10.1109/LSP.2006.870086 CASELLA G, 1992, AM STAT, V46, P167, DOI 10.2307/2685208 Cournapeau D, 2010, INT CONF ACOUST SPEE, P4462, DOI 10.1109/ICASSP.2010.5495610 DARROCH JN, 1972, ANN MATH STAT, V43, P1470, DOI 10.1214/aoms/1177692379 Davy M., 2000, ACOUST SPEECH SIG PR, P33 Ding N, 2010, INT CONF ACOUST SPEE, P2098, DOI 10.1109/ICASSP.2010.5495125 Garofolo J., 2002, P LREC MAY 29 31 Garofolo J. S., 1993, DARPA TIMIT ACOUSTIC Gonina E., 2011, P AUT SPEECH REC UND Gunawardana A., 2005, P INTERSPEECH, P1117 Izmirli O, 2000, P INT S MUS INF RETR, P284 Jordan M. I., 1998, LEARNING GRAPHICAL M Jordan M. I., 1999, LEARNING GRAPHICAL M, P105 Kinnunen T, 2010, SPEECH COMMUN, V52, P12, DOI 10.1016/j.specom.2009.08.009 Kotti M, 2008, SIGNAL PROCESS, V88, P1091, DOI 10.1016/j.sigpro.2007.11.017 Kumar S., 2004, ADV NEURAL INFOR PRO, V16 Kwon S., 2004, P ICSLP, P1517 Kwon S., 2004, IEEE T SPEECH AUDIO, V13, P1004 Lafferty John D., 2001, ICML, P282 Li H., 2009, P INT C AC SPEECH SI, P4201 Liao CP, 2010, INT CONF ACOUST SPEE, P2002, DOI 10.1109/ICASSP.2010.5495215 Liu Y., 2005, P 9 ANN INT C COMP B, P408 Markov K., 2007, P INT, P1437 MARKOV K, 2008, P INT, P363 Markov K., 2007, P ASRU, P699 Martin A., 2001, P EUR C SPEECH COMM, V2, P787 McCallum A., 2003, P 19 C UNC ART INT U, P403 Mirghafori N., 2006, P ICASSP Mishra H.K., 2009, P INT C ADV PATT REC, P183 Moattar M. H., 2009, P EUSIPCO, P2549 Morency L. 
P., 2007, MITCSAILTR2007002 Muthusamy Y.K., 1992, P INT C SPOK LANG PR, V2, P895 Nasios N, 2006, IEEE T SYST MAN CY B, V36, P849, DOI 10.1109/TSMCB.2006.872273 NIST, 2002, RICH TRANSCR EV PROJ Parisi G., 1988, STAT FIELD THEORY Prabhavalkar R, 2010, INT CONF ACOUST SPEE, P5534, DOI 10.1109/ICASSP.2010.5495222 REYNOLDS DA, 1995, SPEECH COMMUN, V17, P91, DOI 10.1016/0167-6393(95)00009-D Reynolds D.A., 2000, DIGIT SIGNAL PROCESS, V10, P1 Sahu V.P., 2009, LECT NOTES COMPUT SC, P513 Sato K, 2005, BIOINFORMATICS, V21, P237, DOI 10.1093/bioinformatics/bti1139 Schmidt M., 2008, P IEEE INT S PAR DIS, P1 Schnitzspan P, 2009, PROC CVPR IEEE, P2238 Settles B, 2005, BIOINFORMATICS, V21, P3191, DOI 10.1093/bioinformatics/bti475 Sha F., 2003, P HLT NAACL, P213 Shen YA, 2010, J SIGNAL PROCESS SYS, V61, P51, DOI 10.1007/s11265-008-0299-y Somervuo P., 2002, P ICSLP, P1245 Su D, 2010, INT CONF ACOUST SPEE, P4890, DOI 10.1109/ICASSP.2010.5495122 Sung Y.H., 2007, P ASRU, P347 Sutton C., 2006, INTRO STAT RELATIONA Teh Y.W., 2008, ADV NEURAL INFOR PRO, V20 Tranter SE, 2006, IEEE T AUDIO SPEECH, V14, P1557, DOI 10.1109/TASL.2006.878256 Valente F, 2005, THESIS EURECOM You CH, 2009, IEEE SIGNAL PROC LET, V16, P49, DOI 10.1109/LSP.2008.2006711 Yu D, 2010, INT CONF ACOUST SPEE, P5030, DOI 10.1109/ICASSP.2010.5495072 Zamalloa, 2010, P ICASSP, P4962 Zhao XY, 2009, INT CONF ACOUST SPEE, P4049 Zhu DL, 2009, INT CONF ACOUST SPEE, P4045, DOI 10.1109/ICASSP.2009.4960516 NR 69 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2012 VL 54 IS 6 BP 763 EP 780 DI 10.1016/j.specom.2012.01.005 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 940QP UT WOS:000303908400006 ER PT J AU Wester, M AF Wester, Mirjam TI Talker discrimination across languages SO SPEECH COMMUNICATION LA English DT Article DE Human speech perception; Talker discrimination; Cross-language ID VOICE IDENTIFICATION; SPEECH; RECOGNITION; SPEAKERS AB This study investigated the extent to which listeners are able to discriminate between bilingual talkers in three language pairs - English-German, English-Finnish and English-Mandarin. Native English listeners were presented with two sentences spoken by bilingual talkers and were asked to judge whether they thought the sentences were spoken by the same person. Equal numbers of cross-language and matched-language trials were presented. The results show that native English listeners are able to carry out this task well, achieving percent correct levels at well above chance for all three language pairs. Previous research has shown this for English-German; this research shows listeners also extend this to Finnish and Mandarin, languages that are quite distinct from English from a genetic and phonetic similarity perspective. However, listeners are significantly less accurate on cross-language talker trials (English-foreign) than on matched-language trials (English-English and foreign-foreign). Understanding listeners' behaviour in cross-language talker discrimination using natural speech is the first step in developing principled evaluation techniques for synthesis systems in which the goal is for the synthesised voice to sound like the original speaker, for instance, in speech-to-speech translation systems, voice conversion and reconstruction. (C) 2012 Elsevier B.V. All rights reserved.
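A minimal sketch (not part of the published record) of how such same-different talker judgements can be scored: it computes percent correct, as reported in the abstract, and the d' sensitivity index (after Stanislaw and Todorov, 1999, cited in the reference list above) from hit and false-alarm counts. All counts below are hypothetical.

    # Sketch: scoring a same-different talker discrimination task.
    # Trial counts are hypothetical, not taken from the study.
    from statistics import NormalDist

    hits, misses = 86, 14              # "same talker" trials answered "same"
    false_alarms, corr_rej = 22, 78    # "different talker" trials answered "same"

    def rate(k, n):
        # log-linear correction keeps z-scores finite for proportions of 0 or 1
        return (k + 0.5) / (n + 1.0)

    z = NormalDist().inv_cdf
    d_prime = z(rate(hits, hits + misses)) - z(rate(false_alarms, false_alarms + corr_rej))
    percent_correct = 100.0 * (hits + corr_rej) / (hits + misses + false_alarms + corr_rej)
    print(f"percent correct = {percent_correct:.1f}%, d' = {d_prime:.2f}")

Chance performance corresponds to d' = 0, so a d' well above zero is the "well above chance" pattern the abstract reports.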
C1 Univ Edinburgh, Ctr Speech Technol Res, Edinburgh EH8 9AB, Midlothian, Scotland. RP Wester, M (reprint author), Univ Edinburgh, Ctr Speech Technol Res, 10 Crichton St, Edinburgh EH8 9AB, Midlothian, Scotland. EM mwester@inf.ed.ac.uk FU European Community [213845] FX The research leading to these results was partly funded from the European Community's Seventh Framework Programme (FP7/2007-2013) under the grant agreement 213845 (the EMIME project). The author thanks Vasilis Karaiskos for running the perception experiments. CR ABE M, 1991, J ACOUST SOC AM, V90, P76, DOI 10.1121/1.402284 Bradlow A, 2010, SPEECH COMMUN, V52, P930, DOI 10.1016/j.specom.2010.06.003 GOGGIN JP, 1991, MEM COGNITION, V19, P448, DOI 10.3758/BF03199567 Karhila R., 2011, P INT 11 FLOR IT KREIMAN J, 1991, SPEECH COMMUN, V10, P265, DOI 10.1016/0167-6393(91)90016-M Latorre J, 2006, SPEECH COMMUN, V48, P1227, DOI 10.1016/j.specom.2006.05.003 Lewis M. P., 2009, ETHNOLOGUE LANGUAGES Liang H., 2010, P ICASSP 10 Mashimo M., 2001, P EUR 01 Nygaard LC, 1998, PERCEPT PSYCHOPHYS, V60, P355, DOI 10.3758/BF03206860 Nygaard LC, 2005, BLACKW HBK LINGUIST, P390, DOI 10.1002/9780470757024.ch16 Perrachione TK, 2007, NEUROPSYCHOLOGIA, V45, P1899, DOI 10.1016/j.neuropsychologia.2006.11.015 Perrachione TK, 2009, J EXP PSYCHOL HUMAN, V35, P1950, DOI 10.1037/a0015869 Philippon AC, 2007, APPL COGNITIVE PSYCH, V21, P539, DOI 10.1002/acp.1296 R Development Core Team, 2010, R LANG ENV STAT COMP SAMMON JW, 1969, IEEE T COMPUT, VC 18, P401, DOI 10.1109/T-C.1969.222678 Stanislaw H, 1999, BEHAV RES METH INS C, V31, P137, DOI 10.3758/BF03207704 Stockmal V., 2004, P 17 C LING PRAG Stockmal V, 2000, APPL PSYCHOLINGUIST, V21, P383, DOI 10.1017/S0142716400003052 Sundermann D., 2006, P INT 06 THOMPSON CP, 1987, APPL COGNITIVE PSYCH, V1, P121, DOI 10.1002/acp.2350010205 Tsuzaki M., 2011, P INT 11 FLOR IT VANLANCKER D, 1987, NEUROPSYCHOLOGIA, V25, P829, DOI 10.1016/0028-3932(87)90120-5 Wester M., 2010, P SSW7 Wester M., 2010, EDIINFRR1388 U ED Wester M., 2011, P INT 11 FLOR IT Wester M., 2011, P ICASSP Wester M., 2011, EDIINFRR1396 U ED Wester M., 2010, P INT 10 Winters S., 2005, ENCY LANGUAGE LINGUI, V12, P31 Winters SJ, 2008, J ACOUST SOC AM, V123, P4524, DOI 10.1121/1.2913046 NR 31 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2012 VL 54 IS 6 BP 781 EP 790 DI 10.1016/j.specom.2012.01.006 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 940QP UT WOS:000303908400007 ER PT J AU Oba, T Hori, T Nakamura, A AF Oba, Takanobu Hori, Takaaki Nakamura, Atsushi TI Efficient training of discriminative language models by sample selection SO SPEECH COMMUNICATION LA English DT Article DE Discriminative language model; Error correction; Discriminative training; Sample selection AB This paper focuses on discriminative language models (DLMs) for large vocabulary speech recognition tasks. To train such models, we usually use a large number of hypotheses generated for each utterance by a speech recognizer, namely an n-best list or a lattice. Since the data size is large, we usually need a high-end machine or a large-scale distributed computation system consisting of many computers for model training. However, it is still unclear whether or not such a large number of sentence hypotheses are necessary. Furthermore, we do not know which kinds of sentences are necessary. 
In this paper, we show that we can generate a high performance model using small subsets of the n-best lists by choosing samples properly, i.e., we describe a sample selection method for DLMs. Sample selection reduces the memory footprint needed for holding training samples and allows us to train models in a standard machine. Furthermore, it enables us to generate a highly accurate model using various types of features. Specifically, experimental results show that even training using two samples in each list can provide an accurate model with a small memory footprint. (C) 2012 Elsevier B.V. All rights reserved. C1 [Oba, Takanobu; Hori, Takaaki; Nakamura, Atsushi] NTT Corp, NTT Commun Sci Labs, Seika, Kyoto, Japan. RP Oba, T (reprint author), NTT Corp, NTT Commun Sci Labs, 2-4 Hikaridai, Seika, Kyoto, Japan. EM oba.takanobu@lab.ntt.co.jp; hori.t@lab.ntt.co.jp; nakamura.atsushi@lab.ntt.co.jp CR Arisoy E, 2010, INT CONF ACOUST SPEE, P5538, DOI 10.1109/ICASSP.2010.5495226 Collins M, 2005, COMPUT LINGUIST, V31, P25, DOI 10.1162/0891201053630273 Collins M., 2005, P ACL, P507, DOI 10.3115/1219840.1219903 Collins M, 2002, PROCEEDINGS OF THE 2002 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, P1 Hori T., 2005, P INT, P284 KNESER R, 1995, INT CONF ACOUST SPEE, P181, DOI 10.1109/ICASSP.1995.479394 LIU DC, 1989, MATH PROGRAM, V45, P503, DOI 10.1007/BF01589116 Maekawa K., 2000, P 2 INT C LANG RES E, V2, P947 McDermott E, 2007, IEEE T AUDIO SPEECH, V15, P203, DOI 10.1109/TASL.2006.876778 Oba T, 2010, INT CONF ACOUST SPEE, P5126, DOI 10.1109/ICASSP.2010.5495028 Oba Takanobu, 2007, P INT, P1753 Och F.J, 2003, P 41 ANN M ASS COMP, P160 Roark B., 2004, P ICASSP, V1, P749 Roark B, 2007, COMPUT SPEECH LANG, V21, P373, DOI 10.1016/j.csl.2006.06.006 Roark B., 2004, P ACL, P47, DOI 10.3115/1218955.1218962 Shafran I., 2006, P EMNLP SYDN AUSTR, P390, DOI 10.3115/1610075.1610130 Zhou Z., 2006, P ICASSP, V1, P141 NR 17 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2012 VL 54 IS 6 BP 791 EP 800 DI 10.1016/j.specom.2012.01.007 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 940QP UT WOS:000303908400008 ER PT J AU Kamper, H Mukanya, FJM Niesler, T AF Kamper, Herman Mukanya, Felicien Jeje Muamba Niesler, Thomas TI Multi-accent acoustic modelling of South African English SO SPEECH COMMUNICATION LA English DT Article DE Multi-accent acoustic modelling; Multi-accent speech recognition; Under-resourced languages; South African English accents ID SPEECH RECOGNITION AB Although English is spoken throughout South Africa it is most often used as a second or third language, resulting in several prevalent accents within the same population. When dealing with multiple accents in this under-resourced environment, automatic speech recognition (ASR) is complicated by the need to compile multiple, accent-specific speech corpora. We investigate how best to combine speech data from five South African accents of English in order to improve overall speech recognition performance. Three acoustic modelling approaches are considered: separate accent-specific models, accent-independent models obtained by pooling training data across accents, and multi-accent models. The latter approach extends the decision-tree clustering process normally used to construct tied-state hidden Markov models (HMMs) by allowing questions relating to accent. 
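To make the clustering extension just described concrete, the toy sketch below (an illustration only, not the authors' implementation; phone and accent labels are hypothetical) treats accent membership as just another yes/no question available to the tied-state decision tree, alongside the usual phonetic-context questions.

    # Illustrative only: tied-state clustering questions that may ask about
    # phonetic context or about accent. All labels are hypothetical.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class TriphoneState:
        left: str
        centre: str
        right: str
        accent: str                    # a per-corpus accent tag

    QUESTIONS = {
        "L_Nasal":  lambda s: s.left in {"m", "n", "ng"},
        "R_Vowel":  lambda s: s.right in {"a", "e", "i", "o", "u"},
        "Accent_A": lambda s: s.accent == "A",
        "Accent_B": lambda s: s.accent == "B",
    }

    def split(states, name):
        """Partition a pool of states by the yes/no answer to one question."""
        yes = [s for s in states if QUESTIONS[name](s)]
        no = [s for s in states if not QUESTIONS[name](s)]
        return yes, no

    pool = [TriphoneState("m", "a", "t", "A"), TriphoneState("p", "a", "t", "B")]
    for q in QUESTIONS:
        yes, no = split(pool, q)
        print(q, len(yes), len(no))

In full clustering the question giving the largest likelihood gain is selected at each node, so data are pooled across accents only where the tree finds no benefit in separating them.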
We find that multi-accent modelling outperforms accent-specific and accent-independent modelling in both phone and word recognition experiments, and that these improvements are statistically significant. Furthermore, we find that the relative merits of the accent-independent and accent-specific approaches depend on the particular accents involved. Multi-accent modelling therefore offers a mechanism by which speech recognition performance can be optimised automatically, and for hard decisions regarding which data to pool and which to separate to be avoided. (C) 2012 Elsevier B.V. All rights reserved. C1 [Kamper, Herman; Mukanya, Felicien Jeje Muamba; Niesler, Thomas] Univ Stellenbosch, Dept Elect & Elect Engn, ZA-7600 Stellenbosch, South Africa. RP Niesler, T (reprint author), Univ Stellenbosch, Dept Elect & Elect Engn, ZA-7600 Stellenbosch, South Africa. EM kamperh@sun.ac.za; trn@sun.ac.za FU National Research Foundation (NRF) FX The authors would like to thank Febe de Wet for her helpful comments and suggestions. The financial assistance of the National Research Foundation (NRF) towards this research is hereby acknowledged. Opinions expressed and conclusions arrived at are those of the authors and are not necessarily to be attributed to the NRF. Parts of this work were executed using the High Performance Computer (HPC) facility at Stellenbosch University. CR Badenhorst J.A.C., 2008, P PRASA CAP TOWN S A Beattie V, 1995, P EUR 95, P1123 Bisani M., 2004, P IEEE INT C AC SPEE, P409 Bowerman S., 2004, HDB VARIETIES ENGLIS, P949 Bowerman S, 2004, HDB VARIETIES ENGLIS, P931 Caballero M, 2009, SPEECH COMMUN, V51, P217, DOI 10.1016/j.specom.2008.08.003 Chengalvarayan R, 2001, P EUR AALB DENM, P2733 Crystal D., 1991, DICT LINGUISTICS PHO Despres J., 2009, P INT BRIGHT, P96 Diakoloukas V., 1997, P IEEE INT C AC SPEE, P1455 Finn P., 2004, HDB VARIETIES ENGLIS, P964 Fischer V., 1998, P ICSLP SYDN AUSTR, P787 Fukunaga K., 1990, INTRO STAT PATTERN R, V2nd Hershey JR, 2008, INT CONF ACOUST SPEE, P4557, DOI 10.1109/ICASSP.2008.4518670 Humphries J. J., 1997, P EUR, P2367 KATZ SM, 1987, IEEE T ACOUST SPEECH, V35, P400, DOI 10.1109/TASSP.1987.1165125 Kirchhoff K, 2005, SPEECH COMMUN, V46, P37, DOI 10.1016/j.specom.2005.01.004 Mak B., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607191 McCormick K., 2004, HDB VARIETIES ENGLIS, P993 Mesthrie R., 2004, HDB VARIETIES ENGLIS, P953 Mesthrie R., 2004, HDB VARIETIES ENGLIS, P974 NEY H, 1994, COMPUT SPEECH LANG, V8, P1, DOI 10.1006/csla.1994.1001 Niesler T, 2007, SPEECH COMMUN, V49, P453, DOI 10.1016/j.specom.2007.04.001 Olsen P.A., 2007, P INT C, P46 Roux JC, 2004, P LREC LISB PORT, P93 Schneider EW, 2004, HDB VARIETIES ENGLIS Schultz T, 2001, SPEECH COMMUN, V35, P31, DOI 10.1016/S0167-6393(00)00094-7 Statistics South Africa, 2004, CENS 2001 PRIM TABL Stolcke A., 2002, P INT C SPOK LANG PR, P901 ten Bosch L., 2000, P ICSLP BEIJ CHIN, P1009 Van Compernolle D., 1991, P EUR 91, P723 Van Rooy B., 2004, HDB VARIETIES ENGLIS, V1, P943 Vihola M., 2002, ACOUST SPEECH SIG PR, P933 Wang Z., 2003, P IEEE INT C AC SPEE, P540 Young S., 2009, HTK BOOK VERSION 3 4 Young SJ, 1994, P ARPA HUM LANG TECH, P307, DOI 10.3115/1075812.1075885 NR 36 TC 2 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD JUL PY 2012 VL 54 IS 6 BP 801 EP 813 DI 10.1016/j.specom.2012.01.008 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 940QP UT WOS:000303908400009 ER PT J AU Pavez, E Silva, JF AF Pavez, Eduardo Silva, Jorge F. TI Analysis and design of Wavelet-Packet Cepstral coefficients for automatic speech recognition SO SPEECH COMMUNICATION LA English DT Article DE Wavelet Packets; Filter-bank analysis; Automatic speech recognition; Filter-bank selection; Cepstral coefficients; The Gray code ID HIDDEN MARKOV-MODELS; FEATURE-EXTRACTION; SAMPLING THEOREM; BASIS SELECTION; SIGNAL; CLASSIFICATION; REPRESENTATIONS; SUBSPACES; FRAMEWORK; FILTERS AB This work proposes using Wavelet-Packet Cepstral coefficients (WPPCs) as an alternative way to do filter-bank energy-based feature extraction (FE) for automatic speech recognition (ASR). The rich coverage of time-frequency properties of Wavelet Packets (WPs) is used to obtain new sets of acoustic features, in which competitive and better performances are obtained with respect to the widely adopted Mel-Frequency Cepstral coefficients (MFCCs) in the TIMIT corpus. In the analysis, concrete filter-bank design considerations are stipulated to obtain most of the phone-discriminating information embedded in the speech signal, where the filter-bank frequency selectivity, and better discrimination in the lower frequency range [200 Hz-1 kHz] of the acoustic spectrum are important aspects to consider. (C) 2012 Elsevier B.V. All rights reserved. C1 [Pavez, Eduardo; Silva, Jorge F.] Univ Chile, Dept Elect Engn, Santiago 4123, Chile. RP Silva, JF (reprint author), Univ Chile, Dept Elect Engn, Av Tupper 2007, Santiago 4123, Chile. EM epavez@ing.uchile.cl; josilva@ing.uchile.cl RI Silva, Jorge/M-8750-2013 FU FONDECYT, CONICYT-Chile [1110145] FX The work was supported by funding from FONDECYT Grant 1110145, CONICYT-Chile. We are grateful to the anonymous reviewers for their suggestions and comments that contribute to improve the quality and organization of the work. We thank S. Beckman for proofreading this material. CR Atto A.M., 2010, IEEE T INFORM THEORY, V56, P429 Atto AM, 2007, SIGNAL PROCESS, V87, P2320, DOI 10.1016/j.sigpro.2007.03.014 BOHANEC M, 1994, MACH LEARN, V15, P223, DOI 10.1007/BF00993345 Chang T, 1993, IEEE T IMAGE PROCESS, V2, P429, DOI 10.1109/83.242353 CHOU PA, 1989, IEEE T INFORM THEORY, V35, P299, DOI 10.1109/18.32124 Choueiter GF, 2007, IEEE T AUDIO SPEECH, V15, P939, DOI 10.1109/TASL.2006.889793 Coifman R., 1990, SIGNAL PROCESSING CO Coifman R., 1992, WAVELETS THEIR APPL, P153 COIFMAN RR, 1992, IEEE T INFORM THEORY, V38, P713, DOI 10.1109/18.119732 Cormen T. H., 1990, INTRO ALGORITHMS Cover T M, 1991, ELEMENTS INFORM THEO Crouse MS, 1998, IEEE T SIGNAL PROCES, V46, P886, DOI 10.1109/78.668544 Daubechies I., 1992, 10 LECT WAVELETS DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 Duda R.O., 1983, PATTERN CLASSIFICATI Etemad K, 1998, IEEE T IMAGE PROCESS, V7, P1453, DOI 10.1109/83.718485 Farooq O, 2001, IEEE SIGNAL PROC LET, V8, P196, DOI 10.1109/97.928676 GRAY R. 
M., 1990, ENTROPY INFORM THEOR Kim K, 2000, IEEE SYS MAN CYBERN, P2891 Kullback S., 1958, INFORM THEORY STAT Learned R.E., 1992, WAVELET PACKET BASED, P109 LEE KF, 1989, IEEE T ACOUST SPEECH, V37, P1641, DOI 10.1109/29.46546 Mallat S, 2009, WAVELET TOUR OF SIGNAL PROCESSING: THE SPARSE WAY, P1 MALLAT SG, 1989, IEEE T PATTERN ANAL, V11, P674, DOI 10.1109/34.192463 Olshen R., 1984, CLASSIFICATION REGRE, V1st Padmanabhan M, 2005, IEEE T SPEECH AUDI P, V13, P512, DOI 10.1109/TSA.2005.848876 Quatieri T. F., 2002, DISCRETE TIME SPEECH RABINER LR, 1989, P IEEE, V77, P257, DOI 10.1109/5.18626 Ramchandran K, 1996, P IEEE, V84, P541, DOI 10.1109/5.488699 SAITO N, 1994, P SOC PHOTO-OPT INS, V2303, P2, DOI 10.1117/12.188763 Scott C, 2005, IEEE T SIGNAL PROCES, V53, P4518, DOI 10.1109/TSP.2005.859220 Scott C, 2004, IEEE T SIGNAL PROCES, V52, P2264, DOI 10.1109/TSP.2004.831121 Shen JH, 1998, APPL COMPUT HARMON A, V5, P312, DOI 10.1006/acha.1997.0234 Shen JH, 1996, P AM MATH SOC, V124, P3819, DOI 10.1090/S0002-9939-96-03557-5 Silva J, 2009, IEEE T SIGNAL PROCES, V57, P1796, DOI 10.1109/TSP.2009.2013898 Silva J., 2007, IEEE WORKSH MACH LEA Silva JF, 2012, PATTERN RECOGN, V45, P1853, DOI 10.1016/j.patcog.2011.11.015 Tan BT, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P2431 Vaidyanathan P. P., 1993, MULTIRATE SYSTEMS FI Vasconcelos N, 2004, IEEE T SIGNAL PROCES, V52, P2322, DOI 10.1109/TSP.2004.831125 Vetterli M., 1995, WAVELET SUBBAND CODI WALTER GG, 1992, IEEE T INFORM THEORY, V38, P881, DOI 10.1109/18.119745 Willsky AS, 2002, P IEEE, V90, P1396, DOI 10.1109/JPROC.2002.800717 Young S., 2009, HTK BOOK HTK VERSION Zhou XW, 1999, J FOURIER ANAL APPL, V5, P347, DOI 10.1007/BF01259375 NR 45 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2012 VL 54 IS 6 BP 814 EP 835 DI 10.1016/j.specom.2012.02.002 PG 22 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 940QP UT WOS:000303908400010 ER PT J AU Flynn, R Jones, E AF Flynn, Ronan Jones, Edward TI Feature selection for reduced-bandwidth distributed speech recognition SO SPEECH COMMUNICATION LA English DT Article DE Distributed speech recognition; Feature selection; Bandwidth reduction; HLDA ID MODELS AB The impact on speech recognition performance in a distributed speech recognition (DSR) environment of two methods used to reduce the dimension of the feature vectors is examined in this paper. The motivation behind reducing the dimension of the feature set is to reduce the bandwidth required to send the feature vectors over a channel from the client front-end to the server back-end in a DSR system. In the first approach, the features are empirically chosen to maximise recognition performance. A data-centric transform-based dimensionality-reduction technique is applied in the second case. Test results for the empirical approach show that individual coefficients have different impacts on the speech recognition performance, and that certain coefficients should always be present in an empirically selected reduced feature set for given training and test conditions. Initial results show that for the empirical method, the number of elements in a feature vector produced by an established DSR front-end can be reduced by 23% with low impact on the recognition performance (less than 8% relative performance drop compared to the full bandwidth case). 
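One simple way to realise such empirical selection (a sketch under assumed data, not the authors' procedure) is greedy backward elimination: repeatedly drop the feature-vector element whose removal costs the least accuracy. The logistic-regression scorer below is only a cheap stand-in for the HMM recogniser used in the paper, and the data are synthetic.

    # Sketch: greedy backward selection of feature-vector elements.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 14))     # hypothetical 14-element feature vectors
    y = rng.integers(0, 5, size=500)   # hypothetical word labels

    def score(cols):
        clf = LogisticRegression(max_iter=1000)
        return cross_val_score(clf, X[:, cols], y, cv=3).mean()

    kept = list(range(X.shape[1]))
    while len(kept) > 10:              # e.g. drop roughly 30% of the elements
        # remove the element whose absence hurts recognition accuracy the least
        drop = max(kept, key=lambda c: score([k for k in kept if k != c]))
        kept.remove(drop)
    print("retained elements:", kept)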
Using the transform-based approach, for a similar impact on recognition performance, the number of feature vector elements can be reduced by 30%. Furthermore, for best recognition performance, the results indicate that the SNR of the speech signal should be considered using either approach when selecting the feature vector elements that are to be included in a reduced feature set. (C) 2012 Elsevier B.V. All rights reserved. C1 [Flynn, Ronan] Athlone Inst Technol, Dept Elect Engn, Athlone, Ireland. [Jones, Edward] Natl Univ Ireland, Coll Engn & Informat, Galway, Ireland. RP Flynn, R (reprint author), Athlone Inst Technol, Dept Elect Engn, Athlone, Ireland. EM rflynn@ait.ie; edward.jones@nuigalway.ie CR [Anonymous], HTK SPEECH RECOGNITI [Anonymous], 2007, 202050 ETSI ES [Anonymous], 2003, 201108 ETSI ES Bocchieri E. L., 1993, Computer Speech and Language, V7, DOI 10.1006/csla.1993.1012 Cernak M., 2007, P 12 INT C SPEECH CO, VI, P188 Chakroborty S, 2010, SPEECH COMMUN, V52, P693, DOI 10.1016/j.specom.2010.04.002 Choi E., 2002, P 9 AUSTR INT C SPEE, P166 DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 Gales MJF, 1999, IEEE T SPEECH AUDI P, V7, P272, DOI 10.1109/89.759034 Hirsch H. G., 2000, P ISCA ITRW ASR2000, P181 Jafari A, 2010, SPEECH COMMUN, V52, P725, DOI 10.1016/j.specom.2010.04.005 Koniaris C, 2010, INT CONF ACOUST SPEE, P4342, DOI 10.1109/ICASSP.2010.5495648 Koniaris C, 2010, J ACOUST SOC AM, V127, pEL73, DOI 10.1121/1.3284545 Kumar N, 1998, SPEECH COMMUN, V26, P283, DOI 10.1016/S0167-6393(98)00061-2 Nicholson S., 1997, P EUR SPEECH C SPEEC, P413 Paliwal K. K., 1992, Digital Signal Processing, V2, DOI 10.1016/1051-2004(92)90005-J Peinado A. M., 2006, SPEECH RECOGNITION D Plasberg JH, 2007, IEEE T AUDIO SPEECH, V15, P310, DOI 10.1109/TASL.2006.876722 Tan ZH, 2008, ADV PATTERN RECOGNIT, P1, DOI 10.1007/978-1-84800-143-5 NR 19 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2012 VL 54 IS 6 BP 836 EP 843 DI 10.1016/j.specom.2012.01.003 PG 8 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 940QP UT WOS:000303908400011 ER PT J AU Howard, DM Abberton, E Fourcin, A AF Howard, David M. Abberton, Evelyn Fourcin, Adrian TI Disordered voice measurement and auditory analysis (vol 54, pg 611, 2012) SO SPEECH COMMUNICATION LA English DT Correction C1 [Howard, David M.] Univ York, Dept Elect, Audio Lab, York YO10 5DD, N Yorkshire, England. [Abberton, Evelyn; Fourcin, Adrian] UCL, Dept Phonet & Linguist, London WC1E 6BT, England. RP Howard, DM (reprint author), Univ York, Dept Elect, Audio Lab, York YO10 5DD, N Yorkshire, England. EM dh@ohm.york.ac.uk CR Howard DM, 2012, SPEECH COMMUN, V54, P611, DOI 10.1016/j.specom.2011.03.008 NR 1 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2012 VL 54 IS 6 BP 844 EP 844 DI 10.1016/j.specom.2012.03.007 PG 1 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 940QP UT WOS:000303908400012 ER PT J AU Rodriguez, WR Saz, O Lleida, E AF Rodriguez, William R. 
Saz, Oscar Lleida, Eduardo TI A prelingual tool for the education of altered voices SO SPEECH COMMUNICATION LA English DT Article DE Altered voice; Voice therapy; Speech processing; Formant normalization; Vocal tract length estimation AB This paper addresses the problem of Computer-Aided Voice Therapy for altered voices. The aim of this work is to develop a set of free activities called PreLingua for providing interactive voice therapy to a population of individuals with voice disorders. The interactive tools are designed to train voice skills such as voice production, intensity, blow, vocal onset, phonation time, tone, and vocalic articulation for the Spanish language. The development of these interactive tools along with the underlying speech technologies that support them requires speech processing whose algorithms must be robust with respect to the sources of speech variability that are characteristic of this population of speakers. One of the main problems addressed is how to reliably estimate formant frequencies in high-pitched speech (typical of children and women) and how to normalize these estimates independently of the characteristics of the speakers. Linear predictive coding, homomorphic analysis and modeling of the vocal tract are the core of the speech processing techniques used to achieve such normalization through vocal tract length estimation. This paper also presents the results of an experimental study in which PreLingua was applied to a population with voice disorders and pathologies in special education centers in Spain and Colombia. Promising results were obtained in this preliminary study after 12 weeks of therapy, as it showed improvements in the voice capabilities of a remarkable number of users and demonstrated the ability of the tool to educate users with voice alterations. This improvement was assessed by the evaluation of the educators before and after the study and also by the performance of the subjects in the activities of PreLingua. The results were encouraging enough to continue working in this direction, with the overall aim of providing further functionality and robustness to the system. (C) 2011 Elsevier B.V. All rights reserved. C1 [Rodriguez, William R.; Saz, Oscar; Lleida, Eduardo] Univ Zaragoza, Commun Technol Grp GTC, Aragon Inst Engn Res 13A, Zaragoza 50018, Spain. RP Rodriguez, WR (reprint author), Univ Zaragoza, Commun Technol Grp GTC, Aragon Inst Engn Res 13A, Maria de Luna 1, Zaragoza 50018, Spain. EM wricardo@unizar.es; oskarsaz@unizar.es; lleida@unizar.es RI Lleida, Eduardo/K-8974-2014; Saz Torralba, Oscar/L-7329-2014 OI Lleida, Eduardo/0000-0001-9137-4013; FU MEC of the Spanish government [TIN-2008-06856-C05-04]; Santander Bank FX This work was supported under TIN-2008-06856-C05-04 from MEC of the Spanish government and Santander Bank scholarships. The authors want to acknowledge Center of Special Education "CEDESNID" in Bogota (Colombia), and the Public School for Especial Education "Alborada" in Zaragoza (Spain), for their collaboration applying and testing PreLingua. CR Arias C., 2005, DISFONIA INFANTIL Aronso A., 1993, CLIN VOICE DISORDERS Fant G., 1960, ACOUSTIC THEORY SPEE Gurlekian J., 2000, CARACTERIZACION ARTI Kenneth D., 1966, P ANN CONV AM SPEECH Kirschning I., 2007, IDEA GROUP Kornilov A.-U., 2004, P 9 INT C SPEECH COM Martinez-Celdran E., 1989, FONOLOGIA GEN ESPANO Necioglu B., 2000, ACOUST SPEECH SIGNAL, V3, P1319 Rabiner L., 1978, DIGITAL PROCESSING S Rabiner L.
R., 2007, INTRO DIGITAL SPEECH Rodriguez W.-R., 2009, P 2009 WORKSH SPEECH Sakhnov K., 2009, IAENG INT J COMPUT S, V36 Saz O, 2009, SPEECH COMMUN, V51, P948, DOI 10.1016/j.specom.2009.04.006 Shahidur Rahman M., 2005, Acoustical Science and Technology, V26, DOI 10.1250/ast.26.502 Traunmuller H., 1997, P EUROSPEECH 1997, V1, P477 VERHELST W, 1986, IEEE T ACOUST SPEECH, V34, P43, DOI 10.1109/TASSP.1986.1164787 WAKITA H, 1977, IEEE T ACOUST SPEECH, V25, P183, DOI 10.1109/TASSP.1977.1162929 Watt D., 2002, LEEDS WORKING PAPERS, V9, P159 NR 19 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2012 VL 54 IS 5 SI SI BP 583 EP 600 DI 10.1016/j.specom.2011.05.006 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 925JD UT WOS:000302756600001 ER PT J AU Vaiciukynas, E Verikas, A Gelzinis, A Bacauskiene, M Uloza, V AF Vaiciukynas, Evaldas Verikas, Antanas Gelzinis, Adas Bacauskiene, Marija Uloza, Virgilijus TI Exploring similarity-based classification of larynx disorders from human voice SO SPEECH COMMUNICATION LA English DT Article DE Laryngeal disorder; Pathological voice; Mel-frequency cepstral coefficients; Sequence kernel; Kullback-Leibler divergence; Earth mover's distance; GMM; SVM AB In this paper identification of laryngeal disorders using cepstral parameters of human voice is researched. Mel-frequency cepstral coefficients (MFCCs), extracted from audio recordings of patient's voice, are further approximated, using various strategies (sampling, averaging, and clustering by Gaussian mixture model). The effectiveness of similarity-based classification techniques in categorizing such pre-processed data into normal voice, nodular, and diffuse vocal fold lesion classes is explored and schemes to combine binary decisions of support vector machines (SVMs) are evaluated. Most practiced RBF kernel was compared to several constructed custom kernels: (i) a sequence kernel, defined over a pair of matrices, rather than over a pair of vectors and calculating the kernelized principal angle (KPA) between subspaces; (ii) a simple supervector kernel using only means of patient's GMM; (iii) two distance kernels, specifically tailored to exploit covariance matrices of GMM and using the approximation of the Kullback-Leibler divergence from the Monte-Carlo sampling (KL-MCS), and the Kullback-Leibler divergence combined with the Earth mover's distance (KL-EMD) as similarity metrics. The sequence kernel and the distance kernels both outperformed the popular RBF kernel, but the difference is statistically significant only in the distance kernels case. When tested on voice recordings, collected from 410 subjects (130 normal voice, 140 diffuse, and 140 nodular vocal fold lesions), the KL-MCS kernel, using GMM with full covariance matrices, and the KL-EMD kernel, using GMM with diagonal covariance matrices, provided the best overall performance. In most cases, SVM reached higher accuracy than least squares SVM, except for common binary classification using distance kernels. The results indicate that features, modeled with GMM, and kernel methods, exploiting this information, is an interesting fusion of generative (probabilistic) and discriminative (hyperplane) models for similarity-based classification. (C) 2011 Elsevier B.V. All rights reserved. 
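The Kullback-Leibler similarity sketched below is only an illustration of the KL-MCS idea summarised in the abstract above, written against scikit-learn's GaussianMixture; the number of Monte-Carlo samples and the gamma scaling are hypothetical choices, not values from the study.

import numpy as np
from sklearn.mixture import GaussianMixture

def kl_mc(gmm_p, gmm_q, n_samples=5000):
    # Monte-Carlo estimate of KL(p || q): E_p[log p(x) - log q(x)] over samples drawn from p
    x, _ = gmm_p.sample(n_samples)
    return float(np.mean(gmm_p.score_samples(x) - gmm_q.score_samples(x)))

def kl_kernel(gmm_a, gmm_b, gamma=0.05):
    # Symmetrise the divergence and map it to a similarity usable as an SVM kernel entry
    return np.exp(-gamma * (kl_mc(gmm_a, gmm_b) + kl_mc(gmm_b, gmm_a)))

# Toy usage: one GMM per voice recording, fitted on its MFCC frames (random data stands in here)
gmm1 = GaussianMixture(n_components=4, covariance_type='diag').fit(np.random.randn(500, 12))
gmm2 = GaussianMixture(n_components=4, covariance_type='diag').fit(np.random.randn(500, 12))
print(kl_kernel(gmm1, gmm2))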
C1 [Vaiciukynas, Evaldas; Verikas, Antanas; Gelzinis, Adas; Bacauskiene, Marija] Kaunas Univ Technol, Dept Elect & Control Equipment, LT-51368 Kaunas, Lithuania. [Verikas, Antanas] Halmstad Univ, Intelligent Syst Lab, S-30118 Halmstad, Sweden. [Uloza, Virgilijus] Lithuanian Univ Hlth Sci, Dept Otolaryngol, LT-50009 Kaunas, Lithuania. RP Vaiciukynas, E (reprint author), Kaunas Univ Technol, Dept Elect & Control Equipment, Studentu 50, LT-51368 Kaunas, Lithuania. EM evaldas.vaiciukynas@stud.ktu.lt; antanas.verikas@hh.se; adas.gelzinis@ktu.lt; marija.bacauskiene@ktu.lt; virgilijus.ulozas@kmuk.lt CR Benesty J., 2007, SPRINGER HDB SPEECH Chen YH, 2009, J MACH LEARN RES, V10, P747 Doremalen J., 2007, THESIS RADBOUD U NIJ Dubnov S., 2008, COMPUTER AUDITION TO Gelzinis A, 2008, COMPUT METH PROG BIO, V91, P36, DOI 10.1016/j.cmpb.2008.01.008 Godino-Llorente JI, 2005, LECT NOTES ARTIF INT, V3817, P219 Kuroiwa S, 2006, IEICE T INF SYST, VE89D, P1074, DOI 10.1093/ietisy/e89-d.3.1074 Levina Elizaveta, 2001, P IEEE 8 INT C COMP, P251, DOI 10.1109/ICCV.2001.937632 Markaki M, 2010, INT CONF ACOUST SPEE, P5162, DOI 10.1109/ICASSP.2010.5495020 McLaren M, 2011, COMPUT SPEECH LANG, V25, P327, DOI 10.1016/j.csl.2010.02.004 Pampalk E., 2004, P 5 INT C MUS INF RE Pouchoulin G., 2007, P 8 ANN C INT SPEECH Suykens JAK, 1999, NEURAL PROCESS LETT, V9, P293, DOI 10.1023/A:1018628609742 TVERSKY A, 1982, PSYCHOL REV, V89, P123, DOI 10.1037/0033-295X.89.2.123 Wang X., 2011, J VOICE Weston J., 2006, MATLAB TOOLBOX KERNE Weston J., 1999, P 7 EUR S ART NEUR N Wolf L., 2003, J MACHINE LEARNING R, V4, P913, DOI 10.1162/jmlr.2003.4.6.913 Zheng WM, 2006, NEURAL COMPUT, V18, P979 NR 19 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2012 VL 54 IS 5 SI SI BP 601 EP 610 DI 10.1016/j.specom.2011.04.004 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 925JD UT WOS:000302756600002 ER PT J AU Howard, DM Abberton, E Fourcin, A AF Howard, David M. Abberton, Evelyn Fourcin, Adrian TI Disordered voice measurement and auditory analysis SO SPEECH COMMUNICATION LA English DT Article DE Hearing modelling; Pathological voice; Dysphonia; Temporal analysis; Spectral analysis; Laryngograph; Clinical voice analysis ID FILTER SHAPES; SPEECH; NOISE; IDENTIFICATION; RECOGNITION; FEATURES AB Although voice disorder is ordinarily first detected by listening, hearing is little used in voice measurement. Auditory critical band approaches to the quantitative analysis of dysphonia are compared with the results of applying cycle-by-cycle time based methods and the results from a listening test. The comparisons show that quite large rough/smooth differences, that are readily perceptible, are not as robustly measurable using either peripheral human hearing based GammaTone spectrograms, or a cepstral prominence algorithm, as they may be when using cycle-by-cycle based computations that are linked to temporal criteria. The implications of these tentative observations are discussed for the development of clinically relevant analyses of pathological voice signals with special reference to the analytic advantages of employing appropriate auditory criteria. (C) 2011 Elsevier B.V. All rights reserved. C1 [Howard, David M.] Univ York, Dept Elect, Audio Lab, York YO10 5DD, N Yorkshire, England. [Abberton, Evelyn; Fourcin, Adrian] UCL, Dept Phonet & Linguist, London WC1E 6BT, England. 
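As a pointer to how the peripheral-hearing-based analysis mentioned in the abstract above can be realised, the following sketch builds one channel of a GammaTone-style filterbank from the standard gammatone impulse response and the Glasberg and Moore (1990) ERB formula cited in this record; the filter order, bandwidth factor and centre frequency are illustrative defaults, not the settings used by the authors.

import numpy as np

def erb(f_hz):
    # Equivalent rectangular bandwidth of the auditory filter (Glasberg & Moore, 1990)
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

def gammatone_ir(fc_hz, fs_hz, order=4, duration_s=0.025):
    # Gammatone impulse response: t^(n-1) * exp(-2*pi*1.019*ERB(fc)*t) * cos(2*pi*fc*t)
    t = np.arange(int(duration_s * fs_hz)) / fs_hz
    g = t ** (order - 1) * np.exp(-2 * np.pi * 1.019 * erb(fc_hz) * t) * np.cos(2 * np.pi * fc_hz * t)
    return g / np.max(np.abs(g))

fs = 16000
x = np.random.randn(fs)                                              # stand-in for one second of voice
band_1khz = np.convolve(x, gammatone_ir(1000.0, fs), mode='same')    # one spectrogram channel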
RP Howard, DM (reprint author), Univ York, Dept Elect, Audio Lab, York YO10 5DD, N Yorkshire, England. EM dh@ohm.york.ac.uk CR Abberton Evelyn, 2005, Logoped Phoniatr Vocol, V30, P175 ABBERTON ERM, 1989, CLIN LINGUIST PHONET, V3, P281, DOI 10.3109/02699208908985291 Abdulla Waleed H, 2010, International Journal of Biometrics, V2, DOI 10.1504/IJBM.2010.035448 Baken RJ, 2000, CLIN MEASUREMENT SPE Baken R.J., 1991, READINGS CLIN SPECTR Bridle J., 1974, 1003 JSRU Brookes T, 2000, Logoped Phoniatr Vocol, V25, P72 BROWN JC, 1991, J ACOUST SOC AM, V89, P425, DOI 10.1121/1.400476 Buder EH., 2000, VOICE QUALITY MEASUR, P119 Caeiros A.M., 2010, J APPL RES TECHNOL, V8, P56 Cavalli L, 2010, LOGOP PHONIATR VOCO, V35, P60, DOI 10.3109/14015439.2010.482860 Cooke M, 2010, COMPUT SPEECH LANG, V24, P1, DOI 10.1016/j.csl.2009.02.006 DEBOER E, 1978, J ACOUST SOC AM, V63, P115, DOI 10.1121/1.381704 de Cheveigne A., 2010, OXFORD HDB AUDITORY, P71 Falk TH, 2010, IEEE T AUDIO SPEECH, V18, P90, DOI 10.1109/TASL.2009.2023679 Fletcher H, 1940, REV MOD PHYS, V12, P0047, DOI 10.1103/RevModPhys.12.47 Fourcin A, 2009, FOLIA PHONIATR LOGO, V61, P126, DOI 10.1159/000219948 Fourcin A., 2000, VOICE QUALITY MEASUR Fourcin A, 2008, LOGOP PHONIATR VOCO, V33, P35, DOI 10.1080/14015430701251574 FOURCIN AJ, 1971, MED BIOL ILLUS, V21, P172 GLASBERG BR, 1990, HEARING RES, V47, P103, DOI 10.1016/0378-5955(90)90170-T Hermansky H, 1998, SPEECH COMMUN, V25, P3, DOI 10.1016/S0167-6393(98)00027-2 Hillenbrand J, 1996, J SPEECH HEAR RES, V39, P311 HOWARD D., 1997, ORG SOUND, V2, P65, DOI 10.1017/S1355771897009011 Howard D., 2009, ACOUSTICS PSYCHOACOU Howard D.M., 1995, FORENSIC LINGUIST, V2, P28 Howard D.M., 2009, LARYNX Koike Y., 1973, STUDIA PHONOLOGICA, V7, P17 Li Q, 2010, INT CONF ACOUST SPEE, P4514, DOI 10.1109/ICASSP.2010.5495589 LISKER L, 1964, WORD, V20, P384 Malyska N, 2005, INT CONF ACOUST SPEE, P873 Mermelstein P., 1976, SR47 HASK LAB Moore B. C., 2004, INTRO PSYCHOL HEARIN PATTERSON RD, 1976, J ACOUST SOC AM, V59, P640, DOI 10.1121/1.380914 Ptok M., 2006, Z HNO, V54, P1326 Ramig LA, 1987, J VOICE, V1, P162, DOI 10.1016/S0892-1997(87)80040-1 Sayles M, 2008, J NEUROSCI, V28, P11925, DOI 10.1523/JNEUROSCI.3137-08.2008 Scharf B., 1970, F MODERN AUDITORY TH, V1, P159 Shao Y, 2010, COMPUT SPEECH LANG, V24, P77, DOI 10.1016/j.csl.2008.03.004 Slaney M., 1993, 35 APPL COMP Slaney M., 2003, 45 APPL COMP Sumner CJ, 2002, J ACOUST SOC AM, V111, P2178, DOI 10.1121/1.1453451 Wang D., 2008, P INT C AUD LANG IM, P1340 Watkins AJ, 2007, J ACOUST SOC AM, V121, P257, DOI 10.1121/1.2387134 Weyer E.G., 1936, J PSYCHOL, V3, P101 ZWICKER E, 1961, J ACOUST SOC AM, V33, P248, DOI 10.1121/1.1908630 NR 46 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2012 VL 54 IS 5 SI SI BP 611 EP 621 DI 10.1016/j.specom.2011.03.008 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 925JD UT WOS:000302756600003 ER PT J AU Falk, TH Chan, WY Shein, F AF Falk, Tiago H. 
Chan, Wai-Yip Shein, Fraser TI Characterization of atypical vocal source excitation, temporal dynamics and prosody for objective measurement of dysarthric word intelligibility SO SPEECH COMMUNICATION LA English DT Article DE Dysarthria; Vocal source excitation; Temporal dynamics; Intelligibility; Linear prediction ID SPEECH RECOGNITION; DISORDERED SPEECH; QUALITY; SYSTEM; RECEPTION; VOICE AB Objective measurement of dysarthric speech intelligibility can assist clinicians in the diagnosis of speech disorder severity as well as in the evaluation of dysarthria treatments. In this paper, several objective measures are proposed and tested as correlates of subjective intelligibility. More specifically, the kurtosis of the linear prediction residual is proposed as a measure of vocal source excitation oddity. Additionally, temporal perturbations resultant from imprecise articulation and atypical speech rates are characterized by short- and long-term temporal dynamics measures, which in turn, are based on log-energy dynamics and on an auditory-inspired modulation spectral signal representation, respectively. Motivated by recent insights in the communication disorders literature, a composite measure is developed based on linearly combining a salient subset of the proposed measures with conventional prosodic parameters. Experiments with the publicly-available 'Universal Access' database of spastic dysarthric speech (10 patient speakers; 300 words spoken in isolation, per speaker) show that the proposed composite measure can achieve correlation with subjective intelligibility ratings as high as 0.97; thus the measure can serve as an accurate indicator of dysarthric speech intelligibility. (C) 2011 Elsevier B.V. All rights reserved. C1 [Falk, Tiago H.] INRS EMT, Inst Natl Rech Sci, Montreal, PQ, Canada. [Chan, Wai-Yip] Queens Univ, Dept Elect & Comp Engn, Kingston, ON, Canada. [Shein, Fraser] Holland Bloorview Kids Rehabil Hosp, Bloorview Res Inst, Toronto, ON, Canada. [Shein, Fraser] Univ Toronto, Dept Comp Sci, Toronto, ON, Canada. RP Falk, TH (reprint author), INRS EMT, Inst Natl Rech Sci, Montreal, PQ, Canada. EM tiago.falk@ieee.org FU Natural Sciences and Engineering Research Council of Canada FX The authors wish to acknowledge Dr. Mark Hasegawa-Johnson for making the UA-Speech database available, the Natural Sciences and Engineering Research Council of Canada for their financial support, and the anonymous reviewers for their insightful comments. CR ANANTHAPADMANABHA TV, 1979, IEEE T ACOUST SPEECH, V27, P309, DOI 10.1109/TASSP.1979.1163267 [Anonymous], 2004, P563 ITUT, P563 Arai T., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. 
No.96TH8206), DOI 10.1109/ICSLP.1996.607318 Baken RJ, 2000, CLIN MEASUREMENT SPE Benesty J., 2008, SPRINGER HDB SPEECH Bunton K, 2000, CLIN LINGUIST PHONET, V14, P13, DOI 10.1080/026992000298922 COLCORD RD, 1979, J SPEECH HEAR RES, V22, P468 Constantinescu G, 2010, INT J LANG COMM DIS, V45, P630, DOI 10.3109/13682820903470569 Dau T, 1996, J ACOUST SOC AM, V99, P3615, DOI 10.1121/1.414959 Dau T, 1997, J ACOUST SOC AM, V102, P2892, DOI 10.1121/1.420344 De Bodt MS, 2002, J COMMUN DISORD, V35, P283, DOI 10.1016/S0021-9924(02)00065-5 Doyle PC, 1997, J REHABIL RES DEV, V34, P309 DRULLMAN R, 1994, J ACOUST SOC AM, V95, P1053, DOI 10.1121/1.408467 DRULLMAN R, 1994, J ACOUST SOC AM, V95, P2670, DOI 10.1121/1.409836 Duffy J., 2005, DIFFERENTIAL DIAGNOS Falk TH, 2010, IEEE T AUDIO SPEECH, V18, P1766, DOI 10.1109/TASL.2010.2052247 Falk TH, 2010, IEEE T INSTRUM MEAS, V59, P978, DOI 10.1109/TIM.2009.2024697 Falk TH, 2006, IEEE T AUDIO SPEECH, V14, P1935, DOI 10.1109/TASL.2006.883253 Fant G., 1960, NASAL SOUNDS NASALIZ Ferrier L., 1995, AUGMENTATIVE ALTERNA, V11, P165, DOI 10.1080/07434619512331277289 Gillespie B., 2001, P IEEE INT C AC SPEE, V6 GLASBERG BR, 1990, HEARING RES, V47, P103, DOI 10.1016/0378-5955(90)90170-T Green P., 2003, P 8 EUR C SPEECH COM, P4 Gu LY, 2005, EURASIP J APPL SIG P, V2005, P1400, DOI 10.1155/ASP.2005.1400 HASEGAWAJOHNSON M, 2006, INT CONF ACOUST SPEE, P1060 Hill AJ, 2006, AM J SPEECH-LANG PAT, V15, P45, DOI 10.1044/1058-0360(2006/006) HOUSE AS, 1956, J SPEECH HEAR DISORD, V21, P218 Huang X., 2001, ALGORITHM SYSTEM DEV KENT RD, 1989, CLIN LINGUIST PHONET, V3, P347, DOI 10.3109/02699208908985295 Kim DS, 2004, IEEE SIGNAL PROC LET, V11, P849, DOI 10.1109/LSP.2004.835466 KIM H, 2008, P INT C SPOK LANG PR, P1741 Klopfenstein M., 2009, INT J SPEECH LANGUAG, V11, P326 LeGendre S. J., 2009, J ACOUST SOC AM, V125, P2530, DOI [10.1121/1.4783544, DOI 10.1121/1.4783544] Maier A, 2009, SPEECH COMMUN, V51, P425, DOI 10.1016/j.specom.2009.01.004 Middag C, 2009, EURASIP J ADV SIG PR, DOI 10.1155/2009/629030 O'Shaughnessy D., 2008, HDB SPEECH PROCESSIN, P213 PICONE JW, 1993, P IEEE, V81, P1215, DOI 10.1109/5.237532 Raghavendra P., 2001, AUGMENTATIVE ALTERNA, V17, P265, DOI 10.1080/714043390 Rudzicz F, 2007, P INT ACM SIGACCESS, P256 Saz O., 2008, P 1 WORKSH CHILD COM, P6 SCHLENCK KJ, 1993, CLIN LINGUIST PHONET, V7, P119, DOI 10.3109/02699209308985549 Sharma H., 2009, P 10 ANN C INT SPEEC, P4 Sjolander K., 2000, P INT C SPOK LANG PR, P4 Slaney M, 1993, EFFICIENT IMPLEMENTA Talkin D., 1995, ROBUST ALGORITHM PIT, P495 Talkin D., 1987, J ACOUST SOC AM S, V82, pS55 Van Nuffelen G, 2009, INT J LANG COMM DIS, V44, P716, DOI 10.1080/13682820802342062 Zecevic A., 2002, THESIS U MANNHEIM NR 48 TC 10 Z9 10 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2012 VL 54 IS 5 SI SI BP 622 EP 631 DI 10.1016/j.specom.2011.03.007 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 925JD UT WOS:000302756600004 ER PT J AU de Bruijn, MJ ten Bosch, L Kuik, DJ Witte, BI Langendijk, JA Leemans, CR Verdonck-de Leeuw, IM AF de Bruijn, Marieke J. ten Bosch, Louis Kuik, Dirk J. Witte, Birgit I. Langendijk, Johannes A. Leemans, C. Rene Verdonck-de Leeuw, Irma M. 
TI Acoustic-phonetic and artificial neural network feature analysis to assess speech quality of stop consonants produced by patients treated for oral or oropharyngeal cancer SO SPEECH COMMUNICATION LA English DT Article DE Head and neck cancer; Oral cancer; Reconstructive surgery; Speech quality; Artificial neural network; Voice-onset-time (VOT) ID VOICE-ONSET-TIME; OVERLAPPING ARTICULATORY FEATURES; FOREARM FREE-FLAP; OF-LIFE; LONGITUDINAL ASSESSMENT; EUROPEAN-ORGANIZATION; MAXILLECTOMY PATIENTS; PARTIAL GLOSSECTOMY; MICROVASCULAR FLAP; SURGICAL-TREATMENT AB Speech impairment often occurs in patients after treatment for head and neck cancer. A specific speech characteristic that influences intelligibility and speech quality is voice-onset-time (VOT) in stop consonants. VOT is one of the functionally most relevant parameters that distinguishes voiced and voiceless stops. The goal of the present study is to investigate the role and validity of acoustic-phonetic and artificial neural network analysis (ANN) of stop consonants in a multidimensional speech assessment protocol. Speech recordings of 51 patients 6 months after treatment for oral or oropharyngeal cancer and of 18 control speakers were evaluated by trained speech pathologists regarding intelligibility and articulation. Acoustic-phonetic analyses and artificial neural network analysis of the phonological feature voicing were performed in voiced /b/, /d/ and voiceless /p/ and /t/. Results revealed that objective acoustic-phonetic analysis and feature analysis for /b, d, p/ distinguish between patients and controls. Within patients, /t, d/ distinguish for tumour location and tumour stage. Measurements of the phonological feature voicing in almost all consonants were significantly correlated with articulation and intelligibility, but not with self-evaluations. Overall, objective acoustic-phonetic and feature analyses of stop consonants are feasible and contribute to further development of a multidimensional speech quality assessment protocol. (C) 2011 Elsevier B.V. All rights reserved. C1 [de Bruijn, Marieke J.; Leemans, C. Rene; Verdonck-de Leeuw, Irma M.] Vrije Univ Amsterdam, Dept Otolaryngol Head & Neck Surg, Med Ctr, NL-1007 MB Amsterdam, Netherlands. [ten Bosch, Louis] Univ Nijmegen, Dept Language & Speech, Nijmegen, Netherlands. [Kuik, Dirk J.; Witte, Birgit I.] Vrije Univ Amsterdam, Dept Epidemiol & Biostat, Med Ctr, NL-1007 MB Amsterdam, Netherlands. [Langendijk, Johannes A.] Univ Groningen, Dept Radiat Oncol, Univ Med Ctr Groningen, Groningen, Netherlands. RP Verdonck-de Leeuw, IM (reprint author), Vrije Univ Amsterdam, Dept Otolaryngol Head & Neck Surg, Med Ctr, POB 7057, NL-1007 MB Amsterdam, Netherlands. 
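Since voice-onset-time is the central acoustic-phonetic measure in this record, a minimal sketch of its computation is given below, assuming the burst release and voicing onset have already been annotated (for instance in Praat, which the study's reference list mentions); the token timestamps are hypothetical.

def vot_ms(burst_release_s, voicing_onset_s):
    # Voice-onset time in milliseconds: lag from stop release to onset of voicing
    # (negative values indicate prevoicing, as is common for voiced stops)
    return 1000.0 * (voicing_onset_s - burst_release_s)

tokens_b = [(0.112, 0.096), (0.410, 0.393)]   # hypothetical voiced /b/ tokens: voicing leads the burst
tokens_p = [(0.215, 0.262), (0.530, 0.588)]   # hypothetical voiceless /p/ tokens: long voicing lag
print([round(vot_ms(b, v), 1) for b, v in tokens_b])   # -> [-16.0, -17.0]
print([round(vot_ms(b, v), 1) for b, v in tokens_p])   # -> [47.0, 58.0]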
EM im.verdonck@vumc.nl CR AARONSON NK, 1993, J NATL CANCER I, V85, P365, DOI 10.1093/jnci/85.5.365 Allen JS, 2003, J ACOUST SOC AM, V113, P544, DOI 10.1121/1.1528172 [Anonymous], 2007, PRAAT DOING PHON COM Bjordal K, 1999, J CLIN ONCOL, V17, P1008 Borggreven PA, 2005, HEAD NECK-J SCI SPEC, V27, P785, DOI 10.1002/hed.20236 Borggreven PA, 2007, ORAL ONCOL, V43, P1034, DOI 10.1016/j.oraloncology.2006.11.017 Bressmann T, 2004, J ORAL MAXIL SURG, V62, P298, DOI 10.1016/j.joms.2003.04.017 Bridle J., 1998, CLSP JHU SUMM WORKSH CHRISTENSEN JM, 1978, J SPEECH HEAR RES, V21, P56 de Bruijn MJ, 2009, FOLIA PHONIATR LOGO, V61, P180, DOI 10.1159/000219953 DENG L, 1994, J ACOUST SOC AM, V95, P2702, DOI 10.1121/1.409839 Erler K, 1996, J ACOUST SOC AM, V100, P2500, DOI 10.1121/1.417358 Furia CLB, 2001, ARCH OTOLARYNGOL, V127, P877 Graupe D., 2007, PRINCIPLES ARTIFICIA Haderlein T, 2009, FOLIA PHONIATR LOGO, V61, P12, DOI 10.1159/000187620 Hara I, 2003, BRIT J ORAL MAX SURG, V41, P161, DOI 10.1016/S0266-4356(03)00068-8 Houde J., 1998, SCIENCE, P1213 Karnell LH, 2000, HEAD NECK-J SCI SPEC, V22, P6, DOI 10.1002/(SICI)1097-0347(200001)22:1<6::AID-HED2>3.0.CO;2-P Kazi R, 2007, INT J LANG COMM DIS, V42, P521, DOI 10.1080/13682820601056566 Kent RA., 1992, INTELLIGIBILITY SPEE King S, 2000, COMPUT SPEECH LANG, V14, P333, DOI 10.1006/csla.2000.0148 KLATT DH, 1975, J SPEECH HEAR RES, V18, P686 Ladefoged P., 1996, SOUNDS WORLDS LANGUA Markkanen-Leppanen M, 2005, J CRANIOFAC SURG, V16, P990, DOI 10.1097/01.scs.0000179753.14037.7a McConnel FMS, 1998, ARCH OTOLARYNGOL, V124, P625 MICHI K, 1989, J CRANIO MAXILL SURG, V17, P162, DOI 10.1016/S1010-5182(89)80015-0 Nabil N., 1995, SIGNAL REPRESENTATIO, P310 Nabil N., 1996, KNOWLEDGE BASED SIGN, P29 Ng ML, 2009, J SPEECH LANG HEAR R, V52, P780, DOI 10.1044/1092-4388(2008/07-0182) Pauloski BR, 1998, OTOLARYNG HEAD NECK, V118, P616, DOI 10.1177/019459989811800509 Rinkel RN, 2008, HEAD NECK-J SCI SPEC, V30, P868, DOI 10.1002/hed.20795 ROBBINS J, 1986, J SPEECH HEAR RES, V29, P499 Robinson T., 1996, AUTOMATIC SPEECH SPE, P233 Savariaux C., 2001, SPEECH PRODUCTION GL Schuster M, 2006, EUR ARCH OTO-RHINO-L, V263, P188, DOI 10.1007/s00405-005-0974-6 Seikaly H, 2003, LARYNGOSCOPE, V113, P897, DOI 10.1097/00005537-200305000-00023 Su W.F., 2003, ARCH OTOLARYNGOL, P412 Sumita YI, 2002, J ORAL REHABIL, V29, P649, DOI 10.1046/j.1365-2842.2002.00911.x Terai H, 2004, BRIT J ORAL MAX SURG, V42, P190, DOI 10.1016/j.bjoms.2004.02.007 van der Molen L, 2009, EUR ARCH OTO-RHINO-L, V266, P901 Whitehill TL, 2006, CLIN LINGUIST PHONET, V20, P135, DOI 10.1080/02699200400026694 Windrich M, 2008, FOLIA PHONIATR LOGO, V60, P151, DOI 10.1159/000121004 Yoshida H, 2000, J ORAL REHABIL, V27, P723, DOI 10.1046/j.1365-2842.2000.00537.x NR 43 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD JUN PY 2012 VL 54 IS 5 SI SI BP 632 EP 640 DI 10.1016/j.specom.2011.06.005 PG 9 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 925JD UT WOS:000302756600005 ER PT J AU Karakozoglou, SZ Henrich, N d'Alessandro, C Stylianou, Y AF Karakozoglou, Sevasti-Zoi Henrich, Nathalie d'Alessandro, Christophe Stylianou, Yannis TI Automatic glottal segmentation using local-based active contours and application to glottovibrography SO SPEECH COMMUNICATION LA English DT Article DE High-speed videoendoscopy; Vocal-fold vibration; Active contours; Representation; Glottovibrogram; Electroglottography ID HIGH-SPEED VIDEOENDOSCOPY; VOCAL FOLD VIBRATIONS; ELECTROGLOTTOGRAPHY; MECHANISMS; ALGORITHM AB The use of high-speed videoendoscopy (HSV) for the assessment of vocal-fold vibrations dictates the development of efficient techniques for glottal image segmentation. We present a new glottal segmentation method using a local-based active contour framework. The use of local-based features and the exploitation of the vibratory pattern allows for dealing effectively with image noise and cases where the glottal area consists of multiple regions. A scheme for precise glottis localization is introduced, which facilitates the segmentation procedure. The method has been tested on a database of 60 HSV recordings. Comparisons with manual verification resulted in less than 1% difference on the average glottal area. These errors mainly come from detection failure in the posterior or anterior parts of the glottal area. Comparisons with automatic threshold-based glottal detection point out the necessity of complete frameworks for automatic detection. The glottovibrogram (GVG), a representation of glottal vibration is also presented. This easily readable representation depicts the time-varying distance of the vocal-fold edges. (C) 2011 Elsevier B.V. All rights reserved. C1 [Karakozoglou, Sevasti-Zoi; d'Alessandro, Christophe] LIMSI CNRS, Orsay, France. [Henrich, Nathalie] Univ Grenoble 3, Dept Speech & Cognit, GIPSA Lab, UMR 5216,CNRS,INPG, Grenoble, France. [Karakozoglou, Sevasti-Zoi; Stylianou, Yannis] Univ Crete, Dept Comp Sci, Iraklion, Greece. [Karakozoglou, Sevasti-Zoi] Univ Paris 11, Dept Comp Sci, Orsay, France. RP Karakozoglou, SZ (reprint author), LIMSI CNRS, Orsay, France. EM skarako@csd.uoc.gr; Nathalie.Henrich@gipsa-lab.grenoble-inp.fr; Christophe.D'Alessandro@limsi.fr; yannis@csd.uoc.gr RI d'Alessandro, Christophe/I-6991-2013 CR ADAMS R, 1994, IEEE T PATTERN ANAL, V16, P641, DOI 10.1109/34.295913 Allin S., 2004, P IEEE INT S BIOM IM, P812 Bailly L., 2009, THESIS U MAINE BEZIER P., 1972, NUMERICAL CONTROL MA Chan TF, 2001, IEEE T IMAGE PROCESS, V10, P266, DOI 10.1109/83.902291 CHILDERS DG, 1985, CRIT REV BIOMED ENG, V12, P131 CHILDERS DG, 1995, SPEECH COMMUN, V16, P127, DOI 10.1016/0167-6393(94)00050-K Deliyski D, 2003, P 6 INT C ADV QUANT, P1 Deliyski DD, 2008, FOLIA PHONIATR LOGO, V60, P33, DOI 10.1159/000111802 Demeyer J., 2009, 3 ADV VOIC FUNCT ASS Dollinger Michael, 2011, Advances in Vibration Analysis Research Einig D., 2010, THESIS TRIER U APPL GILBERT HR, 1984, J SPEECH HEAR RES, V27, P178 Glasbey C. 
A., 1993, GRAPH MODEL IM PROC, V55, P532, DOI 10.1006/gmip.1993.1040 HARALICK RM, 1985, COMPUT VISION GRAPH, V29, P100, DOI 10.1016/S0734-189X(85)90153-7 Henrich N., 2001, THESIS U P M CURIE P Henrich N, 2004, J ACOUST SOC AM, V115, P1321, DOI 10.1121/1.1646401 CHILDERS DG, 1990, J SPEECH HEAR RES, V33, P245 KARAKOZOGLOU SZ, 2010, THESIS U PARIS SUD 1 Kass M., 1988, INT J COMPUT VISION, V1, P321, DOI DOI 10.1007/BF00133570 KOHLER R, 1981, COMPUT VISION GRAPH, V15, P319, DOI 10.1016/S0146-664X(81)80015-9 Lankton S, 2008, IEEE T IMAGE PROCESS, V17, P2029, DOI 10.1109/TIP.2008.2004611 Lohscheller J, 2004, IEEE T BIO-MED ENG, V51, P1394, DOI [10.1109/TBME.2004.827938, 10.1109/TMBE.2004.827938] Lohscheller J, 2008, IEEE T MED IMAGING, V27, P300, DOI 10.1109/TMI.2007.903690 Lohscheller J, 2007, MED IMAGE ANAL, V11, P400, DOI 10.1016/j.media.2007.04.005 Marendic B, 2001, IEEE IMAGE PROC, P397 Mehnert A, 1997, PATTERN RECOGN LETT, V18, P1065, DOI 10.1016/S0167-8655(97)00131-1 Mehta DD, 2010, ANN OTO RHINOL LARYN, V119, P1 Mehta DD, 2011, J SPEECH LANG HEAR R, V54, P47, DOI 10.1044/1092-4388(2010/10-0026) Moukalled H., 2009, MAVEBA, V1, P137 Neubauer J, 2001, J ACOUST SOC AM, V110, P3179, DOI 10.1121/1.1406498 ROTHENBERG M, 1992, J VOICE, V6, P36, DOI 10.1016/S0892-1997(05)80007-4 Roubeau B, 2009, J VOICE, V23, P425, DOI 10.1016/j.jvoice.2007.10.014 Samet H., 1988, IEEE T PATTERN ANAL, V10, P586 Scherer R. C., 1988, VOCAL PHYSL VOICE PR, P279 Sethian J. A., 1999, LEVEL SET METHODS FA WESTPHAL LC, 1983, IEEE T ACOUST SPEECH, V31, P766, DOI 10.1109/TASSP.1983.1164104 Yan Y., 2006, IEEE T BIOMED ENG, V53 Zuiderveld K., 1994, GRAPHICS GEMS, VIV, P474 NR 39 TC 7 Z9 7 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2012 VL 54 IS 5 SI SI BP 641 EP 654 DI 10.1016/j.specom.2011.07.010 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 925JD UT WOS:000302756600006 ER PT J AU Alpan, A Schoentgen, J Maryn, Y Grenez, F Murphy, P AF Alpan, A. Schoentgen, J. Maryn, Y. Grenez, F. Murphy, P. TI Assessment of disordered voice via the first rahmonic SO SPEECH COMMUNICATION LA English DT Article DE Disordered voice analysis; Cepstrum; First rahmonic; Correlation analysis; Sustained vowel; Connected speech ID CEPSTRAL PEAK PROMINENCE; BREATHY VOCAL QUALITY; DYSPHONIA SEVERITY; SIGNALS; SPEECH; PREDICTION; PARAMETERS; INDEX; NOISE AB A number of studies have shown that the amplitude of the first rahmonic peak (R1) in the cepstrum can be usefully employed to indicate hoarse voice quality. The cepstrum is obtained by taking the inverse Fourier transform of the log-magnitude spectrum. In the present study, a number of spectral pre-processing steps are investigated prior to computing the cepstrum; the pre-processing steps include period-synchronous, period-asynchronous, harmonic-synchronous and harmonic-asynchronous spectral band-limitation analysis. The analysis is applied on both sustained vowels [a] and connected speech signals. The correlation between R1 (the amplitude of the first rahmonic) and perceptual ratings is examined for a corpus comprising 251 speakers. It is observed that the correlation between R1 and perceptual ratings increases when the spectrum is band-limited prior to computing the cepstrum. In addition, comparisons are made with a previously reported cepstral cue, cepstral peak prominence (CPP). (C) 2011 Elsevier B.V. All rights reserved. 
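The cepstral quantities discussed in the abstract above can be made concrete with a short numpy sketch, assuming the definition given there (inverse Fourier transform of the log-magnitude spectrum) and a plain Hann-windowed frame; the quefrency search range and the pulse-train test signal are illustrative, and none of the period- or harmonic-synchronous band-limitation variants studied in the paper are reproduced here.

import numpy as np

def first_rahmonic(frame, fs, f0_range=(60.0, 400.0)):
    # Real cepstrum of one frame and the amplitude R1 of its first rahmonic: the largest
    # cepstral peak among quefrencies corresponding to plausible pitch periods
    windowed = frame * np.hanning(len(frame))
    log_mag = np.log(np.abs(np.fft.rfft(windowed)) + 1e-12)
    cepstrum = np.fft.irfft(log_mag, n=len(frame))
    q_lo, q_hi = int(fs / f0_range[1]), int(fs / f0_range[0])
    peak = q_lo + int(np.argmax(cepstrum[q_lo:q_hi]))
    return cepstrum[peak], peak / fs           # R1 amplitude and its quefrency in seconds

fs = 16000
frame = np.zeros(2048)
frame[::107] = 1.0                             # crude pulse-train stand-in for a voiced frame (~150 Hz)
print(first_rahmonic(frame, fs))               # quefrency should come out near 107/16000 s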
C1 [Alpan, A.; Schoentgen, J.; Grenez, F.] Univ Libre Bruxelles, Lab Images Signals & Telecommun Devices, Brussels, Belgium. [Maryn, Y.] Sint Jan Gen Hosp, Dept Otorhinolaryngol & Head & Neck Surg, Dept Speech Language Pathol & Audiol, Brugge, Belgium. [Murphy, P.] Univ Limerick, Dept Phys, Limerick, Ireland. RP Alpan, A (reprint author), ULB LIST CP 165-51Av,F Roosevelt 50, B-1050 Brussels, Belgium. EM aalpan@ulb.ac.be; jschoent@ulb.ac.be; Youri.Maryn@azbrugge.be; fgrenez@ulb.ac.be; peter.murphy@ul.ie FU COST ACTION at the University of Limerick [2103]; "Region Wallonne", Belgium FX This research has been supported by COST ACTION 2103 "Advanced Voice Function Assessment" in the framework of a short-term scientific mission at the University of Limerick, and by the "Region Wallonne", Belgium, in the framework of the "WALEO II" programme. CR Alpan A, 2011, SPEECH COMMUN, V53, P131, DOI 10.1016/j.specom.2010.06.010 Awan SN, 2009, J SPEECH LANG HEAR R, V52, P482, DOI 10.1044/1092-4388(2009/08-0034) Awan SN, 2006, CLIN LINGUIST PHONET, V20, P35, DOI 10.1080/02699200400008353 Awan SN, 2005, J VOICE, V19, P268, DOI 10.1016/j.jvoice.2004.03.005 Balasubramanium R.K., 2010, J VOICE, V24, P651 Balasubramanium R.K., J VOICE IN PRESS Boersma P., 1993, IFA P, V17, P97 Boersma P., 2007, PRAAT DOING PHONETIC DEJONCKERE PH, 1994, CLIN LINGUIST PHONET, V8, P161, DOI 10.3109/02699209408985304 DEKROM G, 1993, J SPEECH HEAR RES, V36, P254 DUNN OJ, 1969, J AM STAT ASSOC, V64, P366, DOI 10.2307/2283746 Eadie TL, 2006, J VOICE, V20, P527, DOI 10.1016/j.jvoice.2005.08.007 Heman-Ackah YD, 2003, ANN OTO RHINOL LARYN, V112, P324 Heman-Ackah YD, 2002, J VOICE, V16, P20, DOI 10.1016/S0892-1997(02)00067-X HILLENBRAND J, 1994, J SPEECH HEAR RES, V37, P769 Hillenbrand J, 1996, J SPEECH HEAR RES, V39, P311 Hotelling H, 1940, ANN MATH STAT, V11, P271, DOI 10.1214/aoms/1177731867 Maryn Y, 2010, J VOICE, V24, P540, DOI 10.1016/j.jvoice.2008.12.014 Murphy PJ, 2006, J ACOUST SOC AM, V120, P2896, DOI 10.1121/1.2355483 Oppenheim A. V., 1975, DIGITAL SIGNAL PROCE Rabiner L.R., 1978, DIGITAL PROCESSING S Schoentgen J, 2003, J ACOUST SOC AM, V113, P553, DOI 10.1121/1.1523384 Wolfe V, 1997, J COMMUN DISORD, V30, P403, DOI 10.1016/S0021-9924(96)00112-8 NR 23 TC 6 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2012 VL 54 IS 5 SI SI BP 655 EP 663 DI 10.1016/j.specom.2011.04.001 PG 9 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 925JD UT WOS:000302756600007 ER PT J AU Ghio, A Pouchoulin, G Teston, B Pinto, S Fredouille, C De Looze, C Robert, D Viallet, F Giovanni, A AF Ghio, A. Pouchoulin, G. Teston, B. Pinto, S. Fredouille, C. De Looze, C. Robert, D. Viallet, F. Giovanni, A. TI How to manage sound, physiological and clinical data of 2500 dysphonic and dysarthric speakers? SO SPEECH COMMUNICATION LA English DT Article DE Voice/speech disorders; Dysphonia; Dysarthria; Database; Clinical phonetics ID OBJECTIVE VOICE ANALYSIS; AERODYNAMIC MEASUREMENTS; SPEECH; DISEASE AB The aim of this contribution is to propose a database model designed for the storage and accessibility of various speech disorder data including signals, clinical evaluations and patients' information. This model is the result of 15 years of experience in the management and the analysis of this type of data. 
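Purely as an illustration of the kind of relational layout that the rest of this abstract goes on to motivate, here is a minimal schema sketch; the table and column names are hypothetical, sqlite3 merely stands in for the MySQL system discussed in the paper, and the actual MTO/AHN database design is not reproduced.

import sqlite3

schema = '''
CREATE TABLE speaker (speaker_id INTEGER PRIMARY KEY, status TEXT, birth_year INTEGER, sex TEXT);
CREATE TABLE session (session_id INTEGER PRIMARY KEY, speaker_id INTEGER REFERENCES speaker,
                      recorded_on TEXT, site TEXT);
CREATE TABLE signal  (signal_id INTEGER PRIMARY KEY, session_id INTEGER REFERENCES session,
                      kind TEXT,          -- 'audio', 'oral_airflow', 'subglottal_pressure', ...
                      file_path TEXT, sample_rate_hz INTEGER);
CREATE TABLE rating  (rating_id INTEGER PRIMARY KEY, session_id INTEGER REFERENCES session,
                      scale TEXT, item TEXT, value REAL);    -- clinical/perceptual evaluations
'''
con = sqlite3.connect(':memory:')
con.executescript(schema)                     # signals stay on disk; the table stores their paths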
We present two important French corpora of voice and speech disorders that we have been recording in hospitals in Marseilles (MTO corpus) and Aix-en-Provence (AHN corpus). The population consists of 2500 dysphonic, dysarthric and control subjects, a number of speakers which, as far as we know, constitutes currently one of the largest corpora of "pathological" speech. The originality of this data lies in the presence of physiological data (such as oral airflow or estimated sub-glottal pressure) associated with acoustic recordings. This activity led us to raise the question of how we can manage the sound, physiological and clinical data of such a large quantity of data. Consequently, we developed a database model that we present here. Recommendations and technical solutions based on MySQL, a relational database management system, are discussed. (C) 2011 Elsevier B.V. All rights reserved. C1 [Ghio, A.; Pouchoulin, G.; Teston, B.; Pinto, S.; De Looze, C.; Robert, D.; Viallet, F.; Giovanni, A.] Aix Marseille Univ, LPL, CNRS, UMR 6057, Marseille, France. [Pouchoulin, G.; Fredouille, C.] Avignon Univ, LIA, Avignon, France. [Viallet, F.] Ctr Hosp Pays Aix, Serv Neurol, Aix En Provence, France. [Robert, D.; Giovanni, A.] Ctr Hosp Univ Timone, Serv ORL, Marseille, France. RP Ghio, A (reprint author), Univ Aix Marseille 1, CNRS, Lab Parole & Langage, 5 Ave Pasteur,BP 80975, F-13604 Aix En Provence 1, France. EM alain.ghio@lpl-aix.fr FU PHRC (projet hospitalier de recherche clinique); French National Research Agency [ANR BLAN08-0125]; COST Action [2103]; France Parkinson Association FX The authors would like to thank the financial supports: PHRC (projet hospitalier de recherche clinique), ANR BLAN08-0125 of the French National Research Agency, COST Action 2103 "Advanced Voice Function Assessment" and "France Parkinson Association". CR Auzou P, 2006, BATTERIE EVALUATION Baken RJ, 2000, CLIN MEASUREMENT SPE Bombien L., 2006, P 11 AUSTR INT C SPE, P313 Bonastre J., 2007, P INT ICSLP ANTW BEL, P1194 CARDEBAT D, 1990, ACTA NEUROL BELG, V90, P207 Carre R., 1984, P INT C AC SPEECH SI, P324 CNIL, 2006, METH REF TRAIT DONN Darley F.L, 1975, MOTOR SPEECH DISORDE DELLER JR, 1993, J ACOUST SOC AM, V93, P3516, DOI 10.1121/1.405684 DENT H, 1995, EUR J DISORDER COMM, V30, P264 Descout R., 1986, P 12 INT C AC TOR CA, VA, P4 Duez D., 2006, J MULTILINGUAL COMMU, V4, P45, DOI 10.1080/14769670500485513 Duez D, 2009, CLIN LINGUIST PHONET, V23, P781, DOI 10.3109/02699200903144788 Durand J., 2002, B PFC, V1, P1 Enderby P. 
M., 1983, FRENCHAY DYSARTHRIA FABRE P, 1957, Bull Acad Natl Med, V141, P66 Fahn S, 1987, RECENT DEV PARKINSON, P153 FOLSTEIN MF, 1975, J PSYCHIAT RES, V12, P189, DOI 10.1016/0022-3956(75)90026-6 Fougeron C., 2010, P LANG RES EV LRC VA, P2831 Fourcin A., 1989, SPEECH INPUT OUTPUT Fredouille C., 2005, P 9 EUR C SPEECH COM, P149 Fredouille C., 2009, EUR J ADV SIGNAL PRO, P1 Ghio A, 2004, P INT C VOIC PHYSL B, P55 Gibbon D, 1997, HDB STANDARDS RESOUR Gibbon F, 1998, Int J Lang Commun Disord, V33 Suppl, P44 Giovanni A, 1999, LARYNGOSCOPE, V109, P656, DOI 10.1097/00005537-199904000-00026 Giovanni A, 2002, FOLIA PHONIATR LOGO, V54, P304, DOI 10.1159/000066152 Giovanni A, 1999, J VOICE, V13, P341, DOI 10.1016/S0892-1997(99)80040-X HAMMARBERG B, 1980, ACTA OTO-LARYNGOL, V90, P441, DOI 10.3109/00016488009131746 Heaton RK, 1993, WISCONSIN CARD SORTI Helsinki, 2004, DECLARATION HELSINKI Hirano M, 1981, CLIN EXAMINATION VOI Ketelslagers K., 2006, EUR ARCH OTO-RHINO-L, V264, P519 KIM H, 2008, P INT C SPOK LANG PR, P1741 Kim Y, 2011, J SPEECH LANG HEAR R, V54, P417, DOI 10.1044/1092-4388(2010/10-0020) Klatt D., 1980, TRENDS SPEECH RECOGN, P49 Laver J, 1980, PHONETIC DESCRIPTION MARCHAL A, 1993, LANG SPEECH, V36, P137 MARCHAL A, 1993, J ACOUST SOC AM, V93, P2990, DOI 10.1121/1.405820 Mattis S., 1988, DEMENTIA RATING SCAL McVeigh A., 1992, Proceedings of the Fourth Australian International Conference on Speech Science and Technology Menendez-Pidal X., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.608020 Nixon T., 2008, PRECLINICAL SPEECH S Parsa V, 2001, J SPEECH LANG HEAR R, V44, P327, DOI 10.1044/1092-4388(2001/027) Pinto S, 2010, REV NEUROL-FRANCE, V166, P800, DOI 10.1016/j.neurol.2010.07.005 Pouchoulin G., 2007, P 8 INTERSPEECH C IN, P1198 Revis J, 2002, FOLIA PHONIATR LOGO, V54, P19, DOI 10.1159/000048593 Robert D, 1999, ACTA OTO-LARYNGOL, V119, P724 Saenz-Lechon N, 2006, BIOMED SIGNAL PROCES, V1, P120, DOI 10.1016/j.bspc.2006.06.003 Sarr MM, 2009, REV NEUROL-FRANCE, V165, P1055, DOI 10.1016/j.neurol.2009.03.012 SMITHERAN JR, 1981, J SPEECH HEAR DISORD, V46, P138 Teston B, 1995, P EUR C SPEECH COMM, P1883 Viallet F., 2002, P SPEECH PROS, P679 VIALLET F., 2003, P DYSPH DYS DYSPH, P53 Viallet F, 2004, MOVEMENT DISORD, V19, pS237 Wester M., 1998, P S DAT VOIC QUAL RE, P92 Wilson D. K., 1987, VOICE PROBLEMS CHILD Yu P, 2007, FOLIA PHONIATR LOGO, V59, P20, DOI 10.1159/000096547 Yu P, 2001, J VOICE, V15, P529, DOI 10.1016/S0892-1997(01)00053-4 YUMOTO E, 1982, J ACOUST SOC AM, V71, P1544, DOI 10.1121/1.387808 Zeiliger J., 1992, P JOURN ET PAR JEP B, P213 Zeiliger J., 1994, P 10 JOURN ET PAR, P287 NR 62 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD JUN PY 2012 VL 54 IS 5 SI SI BP 664 EP 679 DI 10.1016/j.specom.2011.04.002 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 925JD UT WOS:000302756600008 ER PT J AU Ben Aicha, A Ben Jebara, S AF Ben Aicha, Anis Ben Jebara, Sofia TI Perceptual speech quality measures separating speech distortion and additive noise degradations SO SPEECH COMMUNICATION LA English DT Article DE Upper bound of perceptual equivalence; Lower bound of perceptual equivalence; Class of perceptual equivalence; Objective criteria ID ENHANCEMENT; AUDIO AB In this paper, novel perceptual criteria measuring speech distortion, additive noise and the overall quality are presented. Based on the masking concept, they are built to measure only the audible degradations perceived by the human ear. The class of perceptual equivalence (CPE) is introduced, which makes it possible to specify the nature of the degradations affecting denoised speech. The CPE is defined in the frequency domain using perceptual tools and is bounded by two curves: the upper bound of perceptual equivalence (UBPE) and the lower bound of perceptual equivalence (LBPE). Denoised speech components belonging to this class are perceptually equivalent to the clean speech components; otherwise, audible degradations are perceived. Based on this concept, new perceptual criteria are developed to assess denoised speech signals. After the criteria are introduced and explained, they are validated by examining their relationship, in terms of scatter plots and Pearson correlation, with ITU-T recommendation P.835, which specifies three subjective tests that independently evaluate the speech distortion (SIG), the residual background noise (BAK) and the overall quality (MOS). Moreover, the proposed criteria are compared with conventional criteria, indicating an improved ability to predict subjective test results. (C) 2011 Elsevier B.V. All rights reserved. C1 [Ben Aicha, Anis; Ben Jebara, Sofia] Univ Carthage, Ecole Super Commun Tunis, Res Unit TECHTRA, Ariana 2083, Tunisia. RP Ben Aicha, A (reprint author), Univ Carthage, Ecole Super Commun Tunis, Res Unit TECHTRA, Route Raoued 3-5 Km, Ariana 2083, Tunisia. EM anis_ben_aicha@yahoo.fr; sofia.benjebara@supcom.rnu.tn CR [Anonymous], 2000, PERC EV SPEECH QUAL [Anonymous], 1996, METH SUBJ DET TRANSM, P800 [Anonymous], 2003, SUBJ TEST METH EV SP, P835 Benesty J., 2005, SPEECH ENHANCEMENT Benesty J., 2008, HDB SPEECH PROCESSIN, P843 Beruoti M., 1979, P IEEE INT C AC SPEE, P208 Chetouani M., 2007, ADV NONLINEAR SPEECH, P230 Dimolitsas S, 1984, P IEEE, V136 Dreiseitel P, 2001, P INT WORKSH AC ECH Garofolo J., 1988, GETTING STARTED DARP Gustafsson S, 2002, IEEE T SPEECH AUDI P, V10, P245, DOI 10.1109/TSA.2002.800553 Hansen J.H.L., 1998, P INT C SPOK LANG PR Hensen J.H.L., 1998, P INT C SPOK LANG PR, V7, P2819 Hu Y, 2007, SPEECH COMMUN, V49, P588, DOI 10.1016/j.specom.2006.12.006 Hu Y, 2006, P IEEE INT C AC SPEE, V1, P153 Hu Y, 2008, IEEE T AUDIO SPEECH, V16, P229, DOI 10.1109/TASL.2007.911054 HU Y, 2006, P INT, P1447 JOHNSTON JD, 1988, IEEE J SEL AREA COMM, V6, P314, DOI 10.1109/49.608 Klatt D., 1982, P IEEE INT C AC SPEE, V7, P1278 Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst Painter T, 2000, P IEEE, V88, P451, DOI 10.1109/5.842996 Quanckenbush S., 1988, OBJECTIVE MEASURES S Rix A., 2001, P IEEE INT C AC SPEE, P749 Rix AW, 2006, IEEE T AUDIO SPEECH, V14, P1890, DOI 10.1109/TASL.2006.883260 Scalart P., 1996, P IEEE INT C AC SPEE Tribolet J.
M., 1978, Proceedings of the 1978 IEEE International Conference on Acoustics, Speech and Signal Processing Tsoukalas DE, 1997, IEEE T SPEECH AUDI P, V5, P497, DOI 10.1109/89.641296 VARGA A, 1993, SPEECH COMMUN, V12, P247, DOI 10.1016/0167-6393(93)90095-3 Yang W., 1998, P IEEE INT C AC SPEE, V1, P541 Zwicker E., 1990, PSYCHOACOUSTICS FACT NR 30 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2012 VL 54 IS 4 BP 517 EP 528 DI 10.1016/j.specom.2011.11.005 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 902DA UT WOS:000301017600001 ER PT J AU Wu, MH Li, HH Hong, ZL Xian, XC Li, JY Wu, XH Li, L AF Wu, Meihong Li, Huahui Hong, Zhiling Xian, Xinchi Li, Jingyu Wu, Xihong Li, Liang TI Effects of aging on the ability to benefit from prior knowledge of message content in masked speech recognition SO SPEECH COMMUNICATION LA English DT Article DE Auditory aging; "Cocktail-party" problem; Content priming; Energetic masking; Informational masking; Speech recognition; Working memory ID PERCEIVED SPATIAL SEPARATION; INFORMATIONAL MASKING; OLDER-ADULTS; COMPETING SPEECH; HEARING-LOSS; AUDITORY ATTENTION; ENERGETIC MASKING; CHINESE SPEECH; NOISE; RELEASE AB Under conditions in the presence of competing talkers, presenting the early part of a target sentence in quiet improves recognition of the last keyword of the sentence. This content-priming effect depends on a working-memory resource holding the information of the early presented part of the target speech (the content prime). Older adults usually exhibit declined working memory and experience more difficulties in speech recognition under "cocktail-party" conditions. This study investigated whether speech masking also affects recall of the content prime and whether the content-priming effect declines in older adults. The results show that in both younger adults and older adults, although the content prime was heard in quiet, recall of keywords in the prime was significantly affected by the signal-to-masker ratio of the target/masker presentation. The vulnerability of prime recall to speech masking was larger in older adults than that in younger adults. Also, the content-priming effect disappeared in older adults, even though older adults are able to use the content prime to determine the target speech in the presence of competing talkers. Thus, a speech masker affects not only recognition but also recall of speech, and there is an age-related decline in both content-priming-based unmasking of the target speech and recall of the prime. (C) 2011 Elsevier B.V. All rights reserved. C1 [Li, Liang] Peking Univ, Dept Psychol, Speech & Hearing Res Ctr, Key Lab Machine Percept,Minist Educ, Beijing 100871, Peoples R China. Peking Univ, Dept Machine Intelligence, Speech & Hearing Res Ctr, Key Lab Machine Percept,Minist Educ, Beijing 100871, Peoples R China. RP Li, L (reprint author), Peking Univ, Dept Psychol, Speech & Hearing Res Ctr, Key Lab Machine Percept,Minist Educ, Beijing 100871, Peoples R China. 
EM liangli@pku.edu.cn FU "973" National Basic Research Program of China [2009CB320901, 2010DFA31520, 2011CB707805]; National Natural Science Foundation of China [31170985, 30711120563, 90920302, 60811140086]; Chinese Ministry of Education [20090001110050]; Peking University FX This work was supported by the "973" National Basic Research Program of China (2009CB320901; 2010DFA31520; 2011CB707805), the National Natural Science Foundation of China (31170985; 30711120563, 90920302, 60811140086), the Chinese Ministry of Education (20090001110050), and "985" grants from Peking University. CR Agus TR, 2009, J ACOUST SOC AM, V126, P1926, DOI 10.1121/1.3205403 Arbogast TL, 2002, J ACOUST SOC AM, V112, P2086, DOI 10.1121/1.1510141 Baddeley A. D., 1986, WORKING MEMORY Bell R, 2008, PSYCHOL AGING, V23, P377, DOI 10.1037/0882-7974.23.2.377 Best V, 2008, P NATL ACAD SCI USA, V105, P13174, DOI 10.1073/pnas.0803718105 Best V, 2007, JARO-J ASSOC RES OTO, V8, P294, DOI 10.1007/s10162-007-0073-z Brungart DS, 2001, J ACOUST SOC AM, V110, P2527, DOI 10.1121/1.1408946 Cao SY, 2011, J ACOUST SOC AM, V129, P2227, DOI 10.1121/1.3559707 Cheesman MF, 1995, AUDIOLOGY, V34, P321 Cherry CE, 1953, J ACOUST SOC AM, V25, P975, DOI DOI 10.1121/1.1907229 Ezzatian P, 2011, EAR HEARING, V32, P84, DOI 10.1097/AUD.0b013e3181ee6b8a DUQUESNOY AJ, 1983, J ACOUST SOC AM, V74, P739, DOI 10.1121/1.389859 Freyman RL, 1999, J ACOUST SOC AM, V106, P3578, DOI 10.1121/1.428211 Freyman RL, 2001, J ACOUST SOC AM, V109, P2112, DOI 10.1121/1.1354984 Freyman RL, 2004, J ACOUST SOC AM, V115, P2246, DOI 10.1121/1.689343 Frisina DR, 1997, HEARING RES, V106, P95, DOI 10.1016/S0378-5955(97)00006-3 Fukada T., 1992, P ICASSP, P137, DOI 10.1109/ICASSP.1992.225953 GELFAND SA, 1988, J ACOUST SOC AM, V83, P248, DOI 10.1121/1.396426 Grant KW, 2000, J ACOUST SOC AM, V108, P1197, DOI 10.1121/1.1288668 Hasher L., 1988, PSYCHOL LEARN MOTIV, V22, P193, DOI DOI 10.1016/S0079-7421(08)60041-9 HELFER KS, 1990, J SPEECH HEAR RES, V33, P149 Helfer KS, 2009, J ACOUST SOC AM, V125, P447, DOI 10.1121/1.3035837 Helfer KS, 1997, J SPEECH LANG HEAR R, V40, P432 Helfer KS, 2005, J ACOUST SOC AM, V117, P842, DOI [10.1121/1.1836832, 10.1121/1.183682] Helfer KS, 2008, EAR HEARING, V29, P87 Helfer KS, 2010, J ACOUST SOC AM, V128, P3625, DOI 10.1121/1.3502462 Huang Y, 2010, EAR HEARING, V31, P579, DOI 10.1097/AUD.0b013e3181db6dc2 Huang Y, 2009, J EXP PSYCHOL HUMAN, V35, P1618, DOI 10.1037/a0015791 Huang Y, 2008, HEARING RES, V244, P51, DOI 10.1016/j.heares.2008.07.006 HUMES LE, 1990, J SPEECH HEAR RES, V33, P726 Humes LE, 2007, J AM ACAD AUDIOL, V18, P590, DOI 10.3766/jaaa.18.7.6 JERGER J, 1991, EAR HEARING, V12, P103 KIDD G, 1994, J ACOUST SOC AM, V95, P3475, DOI 10.1121/1.410023 Kidd G, 1998, J ACOUST SOC AM, V104, P422, DOI 10.1121/1.423246 Kidd G, 2005, J ACOUST SOC AM, V118, P3804, DOI 10.1121/1.2109187 King S., 2009, P BLIZZ CHALL WORKSH LEEK MR, 1991, PERCEPT PSYCHOPHYS, V50, P205, DOI 10.3758/BF03206743 Li L, 2004, J EXP PSYCHOL HUMAN, V30, P1077, DOI 10.1037/0096-1523.30.6.1077 Newman RS, 2007, J PHONETICS, V35, P85, DOI 10.1016/j.wocn.2005.10.004 Rakerd B, 2006, J ACOUST SOC AM, V119, P1597, DOI 10.1121/1.2161438 Rosenblum LD, 1996, J SPEECH HEAR RES, V39, P1159 Rossi-Katz J, 2009, J SPEECH LANG HEAR R, V52, P435, DOI 10.1044/1092-4388(2008/07-0243) Rudmann DS, 2003, HUM FACTORS, V45, P329, DOI 10.1518/hfes.45.2.329.27237 SALTHOUSE TA, 1991, PSYCHOL SCI, V2, P179, DOI 10.1111/j.1467-9280.1991.tb00127.x Schneider B. 
A., 1997, J SPEECH LANGUAGE PA, V21, P111 Schneider B. A., 2007, J AM ACAD AUDIOL, V18, P578 Schneider BA, 2000, PSYCHOL AGING, V15, P110, DOI 10.1037//0882-7974.15.1.110 Shinoda K., 1997, P EUR SUMBY WH, 1954, J ACOUST SOC AM, V26, P212, DOI 10.1121/1.1907309 Summerfleld A. Q., 1979, PHONETICA, V36, P314 Tun PA, 2002, PSYCHOL AGING, V17, P453, DOI 10.1037//0882-7974.17.3.453 Verhaeghen P., 1993, J GERONTOL, V48, P157 Wolfram S., 1991, MATH SYSTEM DOING MA Wu X.-H., 2007, EFFECT NUMBER MASKIN, P390 Wu XH, 2005, HEARING RES, V199, P1, DOI 10.1016/j.heares.2004.03.010 Yang ZG, 2007, SPEECH COMMUN, V49, P892, DOI 10.1016/j.specom.2007.05.005 Yoshimura T, 1999, P EUR, P2347 Zen H., 2007, P 6 ISCA WORKSH SPEE Zen HG, 2007, IEICE T INF SYST, VE90D, P325, DOI 10.1093/ietisy/e90-d.1.325 NR 59 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2012 VL 54 IS 4 BP 529 EP 542 DI 10.1016/j.specom.2011.11.003 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 902DA UT WOS:000301017600002 ER PT J AU Sahidullah, M Saha, G AF Sahidullah, Md. Saha, Goutam TI Design, analysis and experimental evaluation of block based transformation in MFCC computation for speaker recognition SO SPEECH COMMUNICATION LA English DT Article DE Speaker recognition; MFCC; DCT; Correlation matrix; Decorrelation technique; Linear transformation; Block transform; Narrow-band noise; Missing feature theory ID SUBBAND DCT; NOISE; IDENTIFICATION; VERIFICATION; ALGORITHM AB The standard Mel frequency cepstrum coefficient (MFCC) computation technique uses the discrete cosine transform (DCT) to decorrelate the log energies of the filter bank outputs. The use of the DCT is reasonable here because the covariance matrix of the Mel filter bank log energies (MFLE) can be compared with that of a highly correlated Markov-I process. This full-band MFCC computation technique, in which every filter bank output contributes to all coefficients, has two main disadvantages. First, the covariance matrix of the log energies does not exactly follow the Markov-I property. Second, full-band MFCC features are severely degraded when the speech signal is corrupted by narrow-band channel noise, even though some filter bank outputs may remain unaffected. In this work, we have studied a class of linear transformation techniques based on block-wise transformation of the MFLE which effectively decorrelate the filter bank log energies and also capture speech information in an efficient manner. A thorough study has been carried out on the block-based transformation approach by investigating a new partitioning technique that highlights the associated advantages. This article also reports a novel feature extraction scheme which captures information complementary to the wide-band information, information that otherwise remains undetected by the standard MFCC and the proposed block transform (BT) techniques. The proposed features are evaluated on NIST SRE databases using a Gaussian mixture model-universal background model (GMM-UBM) based speaker recognition system. We have obtained significant performance improvements over the baseline features for both matched and mismatched conditions, and for both standard and narrow-band noises. The proposed method achieves a significant performance improvement in the presence of narrow-band noise when combined with a missing-feature-theory-based score computation scheme. Crown Copyright (C) 2011 Published by Elsevier B.V. All rights reserved.
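To make the block-wise transformation concrete, a minimal sketch follows, assuming a matrix of Mel filter bank log energies with 20 filters split into two contiguous blocks; the block boundaries and the number of retained coefficients per block are illustrative and are not the partitioning proposed in the article.

import numpy as np
from scipy.fftpack import dct

def block_transform_cepstra(mfle, blocks=((0, 10), (10, 20)), n_keep=6):
    # Apply a DCT separately to each contiguous band of filter outputs and keep the leading
    # coefficients of every block, instead of one full-band DCT over all outputs
    feats = []
    for lo, hi in blocks:
        feats.append(dct(mfle[:, lo:hi], type=2, norm='ortho', axis=1)[:, :n_keep])
    return np.hstack(feats)

def fullband_cepstra(mfle, n_keep=13):
    # Standard MFCC-style baseline: one DCT spanning every filter bank output
    return dct(mfle, type=2, norm='ortho', axis=1)[:, :n_keep]

mfle = np.random.randn(200, 20)               # stand-in for (frames x filters) log energies
print(block_transform_cepstra(mfle).shape)    # (200, 12)

One practical consequence of the block structure, in the spirit of the missing-feature scheme mentioned above, is that narrow-band noise corrupts only the coefficients of the blocks whose filters overlap the noise band, leaving the coefficients of the other blocks usable.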
C1 [Sahidullah, Md.; Saha, Goutam] Indian Inst Technol, Dept Elect & Elect Commun Engn, Kharagpur 721302, W Bengal, India. RP Sahidullah, M (reprint author), Indian Inst Technol, Dept Elect & Elect Commun Engn, Kharagpur 721302, W Bengal, India. EM sahidullahmd@gmail.com; gsaha@ece.iitkgp.ernet.in RI Sahidullah, Md/E-2953-2013 CR AHMED N, 1974, IEEE T COMPUT, VC 23, P90, DOI 10.1109/T-C.1974.223784 Akansu A. N., 1992, MULTIRESOLUTION SIGN Benesty J., 2007, SPRINGER HDB SPEECH Besacier L., 1997, LECT NOTES COMPUT SC, V1206, P193 Besacier L, 2000, SIGNAL PROCESS, V80, P1245, DOI 10.1016/S0165-1684(00)00033-5 Campbell JP, 1997, P IEEE, V85, P1437, DOI 10.1109/5.628714 Campbell JP, 2009, IEEE SIGNAL PROC MAG, V26, P95, DOI 10.1109/MSP.2008.931100 Chakroborty S., 2008, THESIS INDIAN I TECH Chakroborty S, 2010, SPEECH COMMUN, V52, P693, DOI 10.1016/j.specom.2010.04.002 Chetouani M, 2009, PATTERN RECOGN, V42, P487, DOI 10.1016/j.patcog.2008.08.008 Damper RI, 2003, PATTERN RECOGN LETT, V24, P2167, DOI 10.1016/S0167-8655(03)00082-5 DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1 Douglas O, 2009, SPEECH COMMUNICATION Finan R. A., 2001, International Journal of Speech Technology, V4, DOI 10.1023/A:1009652732313 Garreton C, 2010, IEEE T AUDIO SPEECH, V18, P1082, DOI 10.1109/TASL.2010.2049671 Hung WW, 2001, IEEE SIGNAL PROC LET, V8, P70 Jain A.K., 2010, FUNDAMENTALS DIGITAL Jingdong C., 2000, P INT C SPOK LANG PR, VIV, P117 Jingdong Chen, 2004, IEEE Signal Processing Letters, V11, DOI 10.1109/LSP.2003.821689 Jung SH, 1996, IEEE T CIRC SYST VID, V6, P273 Kajarekar S., 2001, P ICASSP, V1, P137 Kim S, 2008, ETRI J, V30, P89, DOI 10.4218/etrij.08.0107.0108 Kinnunen T, 2010, SPEECH COMMUN, V52, P12, DOI 10.1016/j.specom.2009.08.009 Kinnunen T., 2004, THESIS U JOENSUU Kwon OW, 2004, SIGNAL PROCESS, V84, P1005, DOI 10.1016/j.sigpro.2004.03.004 LINDE Y, 1980, IEEE T COMMUN, V28, P84, DOI 10.1109/TCOM.1980.1094577 Lippmann R., 1997, EUROSPEECH, pKN37 Mak B, 2002, IEEE SIGNAL PROC LET, V9, P241, DOI 10.1109/LSP.2002.803007 Martin A., 2006, 2004 NIST SPEAKER RE Ming J, 2007, IEEE T AUDIO SPEECH, V15, P1711, DOI 10.1109/TASL.2007.899278 Mukherjee J, 2002, IEEE T CIRC SYST VID, V12, P620, DOI 10.1109/TCSVT.2002.800509 Nasersharif B, 2007, PATTERN RECOGN LETT, V28, P1320, DOI 10.1016/j.patrec.2006.11.019 Nitta T., 2000, ICSLP, V1, P385 Oppenheim A. V., 1979, DIGITAL SIGNAL PROCE Przybocki M., 2002, 2001 NIST SPEAKER RE Quatieri T, 2006, DISCRETE TIME SPEECH Reynolds DA, 2000, DIGIT SIGNAL PROCESS, V10, P19, DOI 10.1006/dspr.1999.0361 Sahidullah Md, 2010, International Journal of Biometrics, V2, DOI 10.1504/IJBM.2010.035450 Sahidullah M., 2009, IEEE VTC OCT, P1 Sivakumaran P, 2003, SPEECH COMMUN, V41, P485, DOI 10.1016/S0167-6393(03)00017-7 MALVAR HS, 1989, IEEE T ACOUST SPEECH, V37, P553, DOI 10.1109/29.17536 Takiguchi Tetsuya, 2007, Journal of Multimedia, V2, DOI 10.4304/jmm.2.5.13-18 Vale EE, 2008, ELECTRON LETT, V44, P1280, DOI 10.1049/el:20082455 NR 43 TC 13 Z9 15 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD MAY PY 2012 VL 54 IS 4 BP 543 EP 565 DI 10.1016/j.specom.2011.11.004 PG 23 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 902DA UT WOS:000301017600003 ER PT J AU Escudero, D Aguilar, L Vanrell, MD Prieto, P AF Escudero, David Aguilar, Lourdes del Mar Vanrell, Maria Prieto, Pilar TI Analysis of inter-transcriber consistency in the Cat_ToBI prosodic labeling system SO SPEECH COMMUNICATION LA English DT Article DE Prosody; Prosodic labeling; Inter-transcriber consistency; ToBI ID RELIABILITY; AGREEMENT; SPEECH; CORPUS AB A set of tools to analyze inconsistencies observed in a Cat_ToBI labeling experiment are presented. We formalize and use the metrics that are commonly used in inconsistency tests. The metrics are systematically applied to analyze the robustness of every symbol and every pair of transcribers. The results reveal agreement rates for this study that are comparable to previous ToBI inter-reliability tests. The inter-transcriber confusion rates are transformed into distance matrices to use multidimensional scaling for visualizing the confusion between the different ToBI symbols and the disagreement between the raters. Potential different labeling criteria are identified and subsets of symbols that are candidates to be fused are proposed. (C) 2011 Elsevier B.V. All rights reserved. C1 [Escudero, David] Univ Valladolid, Dpt Comp Sci, E-47002 Valladolid, Spain. [Aguilar, Lourdes] Univ Autonoma Barcelona, Dpt Spanish Philol, Barcelona, Spain. [del Mar Vanrell, Maria] Univ Autonoma Barcelona, Dpt Catalan Philol, Barcelona, Spain. [Prieto, Pilar] Univ Pompeu Fabra, Dpt Translat & Language Sci, ICREA, Barcelona, Spain. RP Escudero, D (reprint author), Univ Valladolid, Dpt Comp Sci, E-47002 Valladolid, Spain. EM descuder@infor.uva.es RI Consolider Ingenio 2010, BRAINGLOT/D-1235-2009; Prieto, Pilar/E-7390-2013; Escudero, David/K-7905-2014 OI Prieto, Pilar/0000-0001-8175-1081; Escudero, David/0000-0003-0849-8803 FU Spanish Ministerio de Ciencia e Innovacion [FFI2008-04982-C003-02, FFI2008-04982-C003-03, FFI2011-29559-C02-01, FFI2011-29559-C02-02, FFI2009-07648/FILO, CSD2007-00012]; Generalitat de Catalunya [2009SGR-701] FX This research has been funded by six research grants awarded by the Spanish Ministerio de Ciencia e Innovacion, namely the Glissando project FFI2008-04982-C003-02, FFI2008-04982-C003-03, FFI2011-29559-C02-01, FFI2011-29559-C02-02, FFI2009-07648/FILO and CONSOLIDER-INGENIO 2010 Programme CSD2007-00012, and by a grant awarded by the Generalitat de Catalunya to the Grup d'Estudis de Prosodia (2009SGR-701) CR Aguilar L., 2009, CAT TOBI TRAINING MA Ananthakrishnan S, 2008, IEEE T AUDIO SPEECH, V16, P216, DOI 10.1109/TASL.2007.907570 Arvaniti A., 2005, PROSODIC TYPOLOGY PH, P84 Beckman M., 2000, KOREAN J SPEECH SCI, V7, P143 Beckman M., 2005, PROSODIC TYPOLOGY PH, P9 Beckman Mary, 2002, PROBUS, V14, P9, DOI 10.1515/prbs.2002.008 Beckman M.E., 2000, INTONATION SPANISH T Boersma P., 2011, PRAAT DOING PHONETIC Bonafonte A., 2008, P LREC MARR, P3325 Borg I., 2005, MODERN MULTIMENSIONA BRUGOS ALEJNA, 2008, P SPEECH PROS 2008, P273 Buhmann J., 2002, P 3 INT C LANG RES E, P779 Cabre T., 2007, INTERACTIVE ATLAS CA COHEN J, 1960, EDUC PSYCHOL MEAS, V20, P37, DOI 10.1177/001316446002000104 Estebas E., 2009, ESTUDIOS FONETICA EX, VXVIII, P263 Estebas Vilaplana E., 2010, TRANSCRIPTION INTONA, P17 FLEISS JL, 1971, PSYCHOL BULL, V76, P378, DOI 10.1037/h0031619 Godfrey J. 
J., 1992, P ICASSP, V1, P517 Gonzalez C., 2010, P INT 2010, P142 Grice M., 1995, PHONUS, V1, P33 Grice M., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607958 Gwet K. L., 2010, HDB INTERRATER RELIA Gwet K.L, 2008, BRIT J MATH STAT PSY, V61, P26 Hasegawa-Johnson M, 2005, SPEECH COMMUN, V46, P418, DOI 10.1016/j.specom.2005.01.009 Hasegawa-Johnson M, 2004, P ICSA INT C SPOK LA, P2729 Herman R, 2002, LANG SPEECH, V45, P1 Hirst DJ, 2005, SPEECH COMMUN, V46, P334, DOI 10.1016/j.specom.2005.02.020 Ihaka R., 1996, J COMPUTATIONAL GRAP, V5, P299, DOI DOI 10.2307/1390807 Jun S., 2000, P ICSLP, V3, P211 Kruskal J. B., 1978, SAGE U PAPER SERIES LANDIS JR, 1977, BIOMETRICS, V33, P159, DOI 10.2307/2529310 Mayo C., 1997, P ESCA WORKSH INT TH, P231 Ohio-State-University, 2006, WHAT IS TOBI Pena D, 1999, ESTADISTICA MODELOS Pierrehumbert J, 1980, THESIS MIT Pitrelli J., 1994, P 3 INT C SPOK LANG, P123 Pitt MA, 2005, SPEECH COMMUN, V45, P89, DOI 10.1016/j.specom.2004.09.001 Prieto P., 2009, ESTUDIOS FONETICA EX, VXVIII, P287 Prieto P, 2012, PROSODIC TYPOLOGY Prom-on S, 2009, J ACOUST SOC AM, V125, P405, DOI 10.1121/1.3037222 Rosenberg A., 2010, HLT NAACL, P721 Rosenberg A, 2009, THESIS COLUMBIA U US Scott William, 1955, PUBLIC OPIN QUART, P321, DOI DOI 10.1086/266577 Silverman K., 1992, P INT C SPOK LANG PR, P867 Sim J, 2005, PHYS THER, V85, P257 Sridhar VKR, 2008, IEEE T AUDIO SPEECH, V16, P797, DOI 10.1109/TASL.2008.917071 Syrdal A. K., 2000, P INT C SPOK LANG PR, V3, P235 Syrdal AK, 2001, SPEECH COMMUN, V33, P135, DOI 10.1016/S0167-6393(00)00073-X UEBERSAX JS, 1987, PSYCHOL BULL, V101, P140, DOI 10.1037/0033-2909.101.1.140 Venditti Jennifer J., 2005, PROSODIC TYPOLOGY PH, P172 Wightman Colin W, 2002, P SPEECH PROS NR 51 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2012 VL 54 IS 4 BP 566 EP 582 DI 10.1016/j.specom.2011.12.002 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 902DA UT WOS:000301017600004 ER PT J AU Reveil, B Martens, JP van den Heuvel, H AF Reveil, Bert Martens, Jean-Pierre van den Heuvel, Henk TI Improving proper name recognition by means of automatically learned pronunciation variants SO SPEECH COMMUNICATION LA English DT Article DE Proper name recognition; Pronunciation variation modeling; Cross-linguality ID SPEECH RECOGNITION; CORPORA; MODELS; UNITS AB This paper introduces a novel lexical modeling approach that aims to improve large vocabulary proper name recognition for native and non-native speakers. The method uses one or more so-called phoneme-to-phoneme (P2P) converters to add useful pronunciation variants to a baseline lexicon. Each P2P converter is a stochastic automaton that applies context-dependent transformation rules to a baseline transcription that is generated by a standard grapheme-to-phoneme (G2P) converter. The paper focuses on the inclusion of different types of features to describe the rule context ranging from the identities of neighboring phonemes to morphological and even semantic features such as the language of origin of the name and on the development and assessment of methods that can cope with cross-lingual issues. 
Another aim is to ensure that the proposed solutions are applicable to new names (not seen during system development) and useful in the hands of product developers with good knowledge of their application domain but little expertise in automatic speech recognition (ASR) and speech corpus acquisition. The proposed method was evaluated on person name and geographical name recognition, two economically interesting domains in which non-native speakers as well as non-native names occur very frequently. For the recognition experiments a state-of-the-art commercial ASR engine was employed. The experimental results demonstrate that significant improvements of the recognition accuracy can be achieved: large gains (up to 40% relative) in case prior knowledge of the speaker tongue and the name origin is available, and still significant gains in case no such prior information is available. (C) 2011 Elsevier B.V. All rights reserved. C1 [Reveil, Bert; Martens, Jean-Pierre] UGent, ELIS, DSSP Grp, B-9000 Ghent, Belgium. [van den Heuvel, Henk] Radboud Univ Nijmegen, Fac Arts, CLST, Nijmegen, Netherlands. RP Reveil, B (reprint author), UGent, ELIS, DSSP Grp, Sint Pietersnieuwstr 41, B-9000 Ghent, Belgium. EM breveil@elis.ugent.be; martens@elis.ugent.be; h.vandenheuvel@let.ru.nl FU Flanders FWO FX The presented work was carried out in the context of two research projects: the Autonomata Too project, granted under the Dutch-Flemish STEVIN program, and the TELEX project, granted by Flanders FWO. CR Adda-Decker M, 1999, SPEECH COMMUN, V29, P83, DOI 10.1016/S0167-6393(99)00032-1 Amdal I., 2000, P ISCA ITRW ASR2000, P85 Amdal I., 2000, P ICSLP, P622 Cremelie N, 1999, SPEECH COMMUN, V29, P115, DOI 10.1016/S0167-6393(99)00034-5 Bartkova K, 2006, INT CONF ACOUST SPEE, P1037 Bartkova K, 2007, SPEECH COMMUN, V49, P836, DOI 10.1016/j.specom.2006.12.009 Bisani M, 2008, SPEECH COMMUN, V50, P434, DOI 10.1016/j.specom.2008.01.002 Bonaventura P., 1998, P ESCA WORKSH MOD PR, P17 Bouselmi G, 2006, INT CONF ACOUST SPEE, P345 CMU, 2010, CARN MELL U PRON DIC Conover W., 1999, PRACTICAL NONPARAMET, V3 Cremelie N., 2001, P ISCA ITRW AD METH, P151 Daelemans W., 2005, MEMORY BASED LANGUAG, VI Fosler-Lussier E, 2005, SPEECH COMMUN, V46, P153, DOI 10.1016/j.specom.2005.03.003 Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278 Goronzy S., 2004, SPEECH COMM, P42 Humphries J., 1997, P EUR, P317 Jurafsky D, 2001, INT CONF ACOUST SPEE, P577, DOI 10.1109/ICASSP.2001.940897 Lawson A., 2003, P EUR GEN SWITZ, P1505 LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010 Li X, 2007, P ASRU, P130 Loots L, 2011, SPEECH COMMUN, V53, P75, DOI 10.1016/j.specom.2010.07.006 Maison B., 2003, P ASRU VIRG ISL US, P429 Mayfield-Tomokiyo L., 2001, P WORKSH MULT SPOK L PMLA, 2002, P ITRW PRON MOD LEX Raux A., 2004, P ICSLP 04 INT C SPO, P613 Reveil B., 2010, P LREC, P2149 Revell B., 2009, P INT, P2995 Riley M, 1999, SPEECH COMMUN, V29, P209, DOI 10.1016/S0167-6393(99)00037-0 Schaden S., 2003, P 15 INT C PHON SCI, P2545 Schaden S., 2003, P 10 EACL C, P159 Schraagen M., 2010, P LREC, P612 Stemmer G., 2001, P EUR AALB DENM, P2745 Strik H, 1999, SPEECH COMMUN, V29, P225, DOI 10.1016/S0167-6393(99)00038-2 Van den Heuvel H., 2009, P INT BRIGHT UK, P2991 van den Heuvel H., 2008, P LREC, P140 Van Bael C, 2007, COMPUT SPEECH LANG, V21, P652, DOI 10.1016/j.csl.2007.03.003 Van Compernolle D, 2001, SPEECH COMMUN, V35, P71, DOI 10.1016/S0167-6393(00)00096-0 Wester M., 2000, P INT C SPOK LANG PR, P488 Yang Q., 2002, P PMLA, P123 
You H., 2005, P EUR LISB PORT, P749 NR 41 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2012 VL 54 IS 3 BP 321 EP 340 DI 10.1016/j.specom.2011.10.007 PG 20 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 899IH UT WOS:000300809100001 ER PT J AU Kulkarni, PN Pandey, PC Jangamashetti, DS AF Kulkarni, Pandurangarao N. Pandey, Prem C. Jangamashetti, Dakshayani S. TI Multi-band frequency compression for improving speech perception by listeners with moderate sensorineural hearing loss SO SPEECH COMMUNICATION LA English DT Article DE Frequency compression; Sensorineural hearing loss; Spectral masking ID SPECTRAL CONTRAST ENHANCEMENT; IMPAIRED LISTENERS; LOUDNESS RECRUITMENT; THRESHOLD ELEVATION; RESPONSE-TIMES; INTELLIGIBILITY; TRANSPOSITION; SIMULATION; NOISE; DISCRIMINATION AB In multi-band frequency compression, the speech spectrum is divided into a number of analysis bands, and the spectral samples in each band are compressed towards the band center by a constant compression factor, resulting in presentation of the speech energy in relatively narrow bands, for reducing the effect of increased intraspeech spectral masking associated with sensorineural hearing loss. Earlier investigation assessing the quality of the processed speech showed best results for auditory critical bandwidth based compression using spectral segment mapping and pitch-synchronous analysis-synthesis. The objective of the present investigation is to evaluate the effectiveness of the technique in improving speech perception by listeners with moderate to severe sensorineural loss and to optimize the technique with respect to the compression factor. The listening tests showed maximum improvement in speech perception for a compression factor of 0.6, with an improvement of 9%-21% in the recognition scores for consonants and a significant reduction in response times. (V) 2011 Elsevier B.V. All rights reserved. C1 [Kulkarni, Pandurangarao N.; Pandey, Prem C.] Indian Inst Technol, Dept Elect Engn, Bombay 400076, Maharashtra, India. [Jangamashetti, Dakshayani S.] Basaveshwar Engn Coll, Dept Elect & Elect Engn, Bagalkot 587102, Karnataka, India. RP Pandey, PC (reprint author), Indian Inst Technol, Dept Elect Engn, Bombay 400076, Maharashtra, India. EM pnkulkarni@ee.iitb.ac.in; pcpandey@ee.iitb.ac.in; dsj1869@rediffmail.com FU Department of Information Technology, MCIT, Government of India FX The authors are grateful to Dr. Kiran Kalburgi and Dr. S. S. Doddamani for providing support in conducting listening tests on hearing-impaired listeners. The research is partly supported by a project grant to IIT Bombay under the National Programme on Perception Engineering, sponsored by the Department of Information Technology, MCIT, Government of India. CR ANSI, 1989, S321989 ANSI AM STAN Apoux F, 2001, HEARING RES, V153, P123, DOI 10.1016/S0378-5955(00)00265-3 Arai T., 2004, P 18 INT C AC ICA, P1389 BAER T, 1993, J REHABIL RES DEV, V30, P49 Baskent D, 2006, J ACOUST SOC AM, V119, P1156, DOI 10.1121/1.2151825 BUNNELL HT, 1990, J ACOUST SOC AM, V88, P2546, DOI 10.1121/1.399976 CARNEY AE, 1983, J ACOUST SOC AM, V73, P268, DOI 10.1121/1.388860 Chaudhari D.S., 1998, ACOUST SPEECH SIG PR, P3601 Cheeran A. 
N., 2004, P ICASSP MONTR QUEB, VIV, P17 CHILDERS DG, 1994, J ACOUST SOC AM, V96, P2026, DOI 10.1121/1.411319 Cohen I, 2006, SIGNAL PROCESS, V86, P698, DOI 10.1016/j.sigpro.2005.06.005 Delogu C., 1991, P EUROSPEECH 91 GENO, P353 DUBNO JR, 1989, J ACOUST SOC AM, V85, P1666, DOI 10.1121/1.397955 Fraga F. J., 2008, P INT BRISB AUSTR, P2238 GATEHOUSE S J G, 1990, British Journal of Audiology, V24, P63, DOI 10.3109/03005369009077843 GLASBERG BR, 1986, J ACOUST SOC AM, V79, P1020, DOI 10.1121/1.393374 HOUSE AS, 1965, J ACOUST SOC AM, V37, P158, DOI 10.1121/1.1909295 Jangamashetti D. S., 2010, P 20 INT C AC ICA 20 Kreul K.J., 1968, J SPEECH HEAR RES, V11, P536 Kulkarni P. N., 2009, INT J SPEECH TECH, V10, P219 Kulkarni P.N., 2009, P 16 INT C DIG SIGN Kulkarni P.N, 2010, THESIS INDIAN I TECH Kulkarni P.N., 2006, J ACOUST SOC AM, V120, P3253 Lunner T, 1993, Scand Audiol Suppl, V38, P75 Lyregaard P E, 1982, Scand Audiol Suppl, V15, P113 Mackersie C, 1999, EAR HEARING, V20, P140, DOI 10.1097/00003446-199904000-00005 McDermott HJ, 2000, BRIT J AUDIOL, V34, P353 Meftah M., 1996, P ICSLP, V1, P74, DOI 10.1109/ICSLP.1996.607033 Miller RL, 1999, J ACOUST SOC AM, V106, P2693, DOI 10.1121/1.428135 Mitra S.K., 1998, COMPUTER BASED APPRO MOORE BCJ, 1993, J ACOUST SOC AM, V94, P2050, DOI 10.1121/1.407478 Moore BCJ, 1997, INTRO PSYCHOL HEARIN Murase A., 2004, P 18 INT C AC KYOT J, VII, P1519 Nejime Y, 1997, J ACOUST SOC AM, V102, P603, DOI 10.1121/1.419733 Proakis J. G., 1992, DIGITAL SIGNAL PROCE Rabiner L.R., 1978, DIGITAL PROCESSING S REED CM, 1983, J ACOUST SOC AM, V74, P409, DOI 10.1121/1.389834 Robinson JD, 2007, INT J AUDIOL, V46, P293, DOI 10.1080/14992020601188591 Sakamoto S, 2000, Auris Nasus Larynx, V27, P327, DOI 10.1016/S0385-8146(00)00066-3 Simpson A, 2006, INT J AUDIOL, V45, P619, DOI 10.1080/14992020600825508 Simpson A, 2005, INT J AUDIOL, V44, P281, DOI 10.1080/14992020500060636 STONE MA, 1992, J REHABIL RES DEV, V29, P39, DOI 10.1682/JRRD.1992.04.0039 TERKEURS M, 1992, J ACOUST SOC AM, V91, P2872, DOI 10.1121/1.402950 Turner Christopher W., 1999, Journal of the Acoustical Society of America, V106, P877, DOI 10.1121/1.427103 VILLCHUR E, 1974, J ACOUST SOC AM, V56, P1601, DOI 10.1121/1.1903484 VILLCHUR E, 1977, J ACOUST SOC AM, V62, P665, DOI 10.1121/1.381579 Yang J, 2003, SPEECH COMMUN, V39, P33, DOI 10.1016/S0167-6393(02)00057-2 Yang WY, 2006, J ACOUST SOC AM, V120, P801, DOI 10.1121/1.2216768 Yasu K., 2002, P CHIN JAP JOINT C A, P159 Yoo SD, 2007, J ACOUST SOC AM, V122, P1138, DOI 10.1121/1.2751257 ZWICKER E, 1961, J ACOUST SOC AM, V33, P248, DOI 10.1121/1.1908630 NR 51 TC 4 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2012 VL 54 IS 3 BP 341 EP 350 DI 10.1016/j.specom.2011.09.005 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 899IH UT WOS:000300809100002 ER PT J AU Moreno-Daniel, A Wilpon, J Juang, BH AF Moreno-Daniel, Antonio Wilpon, Jay Juang, B. H. 
TI Index-based incremental language model for scalable directory assistance SO SPEECH COMMUNICATION LA English DT Article DE Voice search; Directory assistance; Listings search; Spoken query processing; Incremental language model ID CONTINUOUS SPEECH RECOGNITION; FINITE-STATE TRANSDUCERS AB As the ubiquitous access to vast and remote information sources from portable devices becomes commonplace, the need from users to perform searches in keyboard-unfriendly situations grows substantially, thus triggering the increased demand of voice search sessions. This paper proposes a methodology that addresses different dimensions of scalability of mixed-initiative voice search in automatic spoken dialog systems. The strategy is based on splitting the complexity of the fully-constrained grammar (one that tightly covers the entire hypothesis space) into a fixed/low complexity phonotactic grammar followed by an index mechanism that dynamically assembles a second-pass grammar that consists of only a handful of hypotheses. The experimental analysis demonstrates different dimensions of scalability achieved by the proposed method using actual WHITEPAGEs-residential data. (C) 2011 Elsevier B.V. All rights reserved. C1 [Moreno-Daniel, Antonio; Juang, B. H.] Georgia Inst Technol, Atlanta, GA 30332 USA. [Wilpon, Jay] AT&T Labs Res, Florham Pk, NJ USA. RP Moreno-Daniel, A (reprint author), Georgia Inst Technol, Atlanta, GA 30332 USA. EM amoreno@gmail.com CR Allauzen C., 2005, LECT NOTES COMPUT SC, P23 Allauzen C., 2004, P ICASSP2003, V1, P352 Allen J. E., 1999, IEEE Intelligent Systems, V14, DOI 10.1109/5254.796083 Aubert XL, 2002, COMPUT SPEECH LANG, V16, P89, DOI 10.1006/csla.2001.0185 Bangalore S., 2006, P C EUR CHAPT ASS CO, P361 Bangalore S., 2003, P IEEE WORKSH AUT SP, P221 Bayer R., 1972, Acta Informatica, V1, DOI 10.1007/BF00289509 Brin S, 1998, COMPUT NETWORKS ISDN, V30, P107, DOI 10.1016/S0169-7552(98)00110-X Chang W., 2002, IEEE T SPEECH AUDIO, V10, P531 Crestani F, 2002, DATA KNOWL ENG, V41, P105, DOI 10.1016/S0169-023X(02)00024-1 David R., 1995, STL TUTORIAL REFEREN Dolfing H.J.G.A., 2001, P IEEE AUT SPEECH RE, P194 FREDKIN E, 1960, COMMUN ACM, V3, P490, DOI 10.1145/367390.367400 Goffin V., 2005, P IEEE INT C AC SPEE, VI, P1033, DOI 10.1109/ICASSP.2005.1415293 Gorin AL, 1997, SPEECH COMMUN, V23, P113, DOI 10.1016/S0167-6393(97)00040-X HARRIS R. A., 2004, VOICE INTERACTION DE Hori T., 2004, P ICSLP, V1, P289 Hori T, 2007, IEEE T AUDIO SPEECH, V15, P1352, DOI 10.1109/TASL.2006.889790 Johnston M., 2006, P IEEE INT C AC SPEE, VI, P617 Kanthak S., 2002, P INT C SPOK LANG PR, P1309 KATZ SM, 1987, IEEE T ACOUST SPEECH, V35, P400, DOI 10.1109/TASSP.1987.1165125 Mohri M, 2002, COMPUT SPEECH LANG, V16, P69, DOI 10.1006/csla.2001.0184 Mohri M, 1997, COMPUT LINGUIST, V23, P269 Mohri M., 2008, SPEECH RECOGNITION W, P559, DOI 10.1007/978-3-540-49127-9_28 Mohri M, 1998, LECT NOTES COMPUT SC, V1436, P144 Moreno-Daniel A., 2009, P IEEE INT C AC SPEE, VI, P3945 Moreno-Daniel A., 2007, P IEEE INT C AC SPEE, VIV, P121 Natarajan P., 2002, P ICASSP 2002 MAY, VI, P21 Pack T., 2008, P ISCA INT SEPT, P53 Pack T., 2008, P 21 ANN S US INT SO, P141, DOI 10.1145/1449715.1449738 Parthasarathy S., 2005, P INTERSPEECH, P2493 Parthasarathy S., 2007, P IEEE INT C AC SPEE, VIV, P161 Pereira F.C., 1996, SPEECH RECOGNITION C, P431 Rabiner L. 
R., 1986, IEEE ASSP Magazine, V3, DOI 10.1109/MASSP.1986.1165342 Rose R.C., 2001, P ICASSP, VI, P17 Salomaa A., 1978, AUTOMATA THEORETIC A Sproat R., 1999, TEXT TO SPEECH SYNTH Wang Y.-Y., 2008, SIGNAL PROCESSING MA, V25, P28 Willet D., 2002, P ICASSP, VI, P713 Wilpon J.G., 1994, P ISCA INT C SPOK LA, P667 Yu D., 2007, P INT, P2709 NR 41 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2012 VL 54 IS 3 BP 351 EP 367 DI 10.1016/j.specom.2011.09.006 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 899IH UT WOS:000300809100003 ER PT J AU Recasens, D AF Recasens, Daniel TI A cross-language acoustic study of initial and final allophones of /l/ SO SPEECH COMMUNICATION LA English DT Article DE Clear and dark /l/; Intrinsic and extrinsic allophones; Darkness degree; Vowel coarticulation; Spectral analysis AB Formant frequency data for /l/ in 23 languages/dialects where the consonant may be typically clear or dark show that the two varieties of /l/ are set in contrast mostly in the context of /i/ but also next to /a/, and that a few languages/dialects may exhibit intermediate degrees of darkness in the consonant. F2 for /l/ is higher utterance initially than utterance finally, more so if the lateral is clear than if it is dark; moreover, the initial and final allophones may be characterized as intrinsic (in most languages/dialects) or extrinsic (in several English dialects, Czech and Dutch) depending on whether the position-dependent frequency difference in question is below or above 200/300 Hz. The paper also reports a larger degree of vowel coarticulation for clear /l/ than for dark /l/ and in initial than in final position. These results are interpreted in terms of the production mechanisms involved in the realization of the two /l/ varieties in the different positional and vowel context conditions subjected to investigation. (C) 2011 Elsevier B.V. All rights reserved. C1 [Recasens, Daniel] Univ Autonoma Barcelona, Dept Filologia Catalana, E-08193 Barcelona, Spain. [Recasens, Daniel] Inst Estudis Catalans, Lab Fonet, Barcelona 08001, Spain. RP Recasens, D (reprint author), Univ Autonoma Barcelona, Dept Filologia Catalana, E-08193 Barcelona, Spain. EM daniel.recasens@uab.es FU Ministry of Innovation and Science of Spain [FFI2009-09339]; Catalan Government [2009SGR003] FX This research was funded by the Project FFI2009-09339 of the Ministry of Innovation and Science of Spain and by the research group 2009SGR003 of the Catalan Government. We would like to acknowledge three reviewers for comments on a previous version of the manuscript, and several scholars for providing acoustic recordings or formant frequency data: (Alguerese) Francesco Ballone; (Czech) Jan Volin; (Danish) John Tondering; (Dutch) Louis Pols; (Finnish) Olli Aaltonen; (German) Marzena Zygis and Micaela Mertins; (Hungarian) Maria Gosy; (Italian) Silvia Calamai; (Norwegian) Hanne Gram Simonsen and Inger Moen; (Occitan) Daniela Muller; (Portuguese) Antonio Texeira; (Romanian) Ioana Chitoran; (Russian) Alexei Kochetov; (Swedish) Francisco Lacerda. Most of these scholars read a preliminary paper version and made remarks for improvement. We are also grateful to the PETRA lab (Plateau d'Etudes Techniques et de Recherche en Audition; http://petra.univ-tlse2.fr/) where the Occitan recordings were carried out. CR Bladon R. A. W., 1979, CURRENT TRENDS PHONE, P501 Bladon R. A.
W., 1976, J PHONETICS, V3, P137 Bladon R.A.W., 1978, J ITALIAN LINGUISTIC, V3, P43 Browman C.P., 1995, FESTSCHRIFT KS HARRI, P19 Charcouloff M., 1985, TRAVAUX I PHONETIQUE, V10, P63 Cruz-Ferreira M., 1995, J INT PHON ASSOC, V25, P90 Dankovicova J., 1999, HDB INT PHONETIC ASS, P70 Delattre P, 1965, COMP PHONETIC FEATUR Espinosa A., 2005, J INT PHON ASSOC, V35, P1, DOI DOI 10.1017/S0025100305001878 Fant G., 1960, ACOUSTIC THEORY SPEE Fougeron C, 1997, J ACOUST SOC AM, V101, P3728, DOI 10.1121/1.418332 Gosy M., 2004, FONETIKA BESZED TUDO Gronnum Nina, 2005, FONETIK FONOLOGI ALM Iivonen A., 2000, SUOMEN FONETIIKKAA P Jones Daniel, 1969, PHONETICS RUSSIAN Kohler KJ, 1977, EINFUHRUNG PHONETIK Kristoffersen G., 2007, PHONOLOGY NORWEGIAN Lacerda F., 2006, ACOUSTIC ANAL CENTRA Ladefoged P., 1968, NATURE GEN PHONETIC, P283 LADEFOGED P, 1965, LANGUAGE, V41, P332, DOI 10.2307/411884 Lavoie L, 2001, CONSONANT STRENGTH P Lehiste I., 1964, ACOUSTICAL CHARACTER Lindblad P., 2003, P 15 ICPHS, P1899 Lindblom B., 2004, SOUND SENSE, P86 Local J. K., 2002, STRUCTURAL VARIATION Marques I., 2010, THESIS U AVEIRO PORT Martinez Celdran E., 2007, MANUAL FONETICA ESPA Narayanan SS, 1997, J ACOUST SOC AM, V101, P1064, DOI 10.1121/1.418030 Newton D.E., 1996, YORK PAPERS LINGUIST, V17, P167 Quilis A., 1979, LINGUISTICA ESPANOLA, V1, P233 Recasens D, 2004, CLIN LINGUIST PHONET, V18, P593, DOI 10.1080/02699200410001703556 Recasens D., 1996, FONETICA DESCRIPTIVA Recasens D., 1986, ESTUDIS FONETICA EXP Recasens D., 1994, PHONOLOGICA 1992, P195 Recasens D, 2004, J PHONETICS, V32, P435, DOI 10.1016/j.wocn.2004.02.001 SPROAT R, 1993, J PHONETICS, V21, P291 Stevens K.N., 1998, ACOUSTIC PHONETICS Tomas Tomas Navarro, 1972, MANUAL PRONUNCIACION Tranel B., 1987, SOUND FRENCH INTRO Warner Natasha, 2001, PHONOLOGY, V18, P387 Wells John, 1982, ACCENTS ENGLISH Wheeler M., 1988, ROMANCE LANGUAGES Wiik K., 1966, PUBLICATIONS PHONETI Zhou X., 2009, THESIS U MARYLAND NR 44 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2012 VL 54 IS 3 BP 368 EP 383 DI 10.1016/j.specom.2011.10.001 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 899IH UT WOS:000300809100004 ER PT J AU Nose, T Kobayashi, T AF Nose, Takashi Kobayashi, Takao TI Very low bit-rate F0 coding for phonetic vocoders using MSD-HMM with quantized F0 symbols SO SPEECH COMMUNICATION LA English DT Article DE Phonetic vocoder; HMM-based speech synthesis; Very low bit-rate speech coding; MSD-HMM; MSD-VQ ID SPEECH PARAMETER GENERATION AB This paper presents a technique of very low bit-rate F0 coding for phonetic vocoders based on a hidden Markov model (HMM) using phone-level quantized F0 symbols. In the proposed technique, an input F0 sequence is converted into an F0 symbol sequence at the phone level using scalar quantization. The quantized F0 symbols represent the rough shape of the original F0 contour and are used as the prosodic context for the HMM in the decoding process. To model the F0 that has voiced and unvoiced regions, we use multi-space probability distribution HMM (MSD-HMM). Synthetic speech is generated from the context-dependent labels and pre-trained MSD-HMMs by using the HMM-based parameter generation algorithm. 
By taking into account the preceding and succeeding contexts as well as the current one in the modeling and synthesis, we can generate a smooth F0 trajectory similar to that of the original with only a small number of quantization bits. The experimental results reveal that the proposed F0 coding outperforms the conventional segment-based F0 coding technique using MSD-VQ. We also demonstrate that the decoded speech of the proposed vocoder has acceptable quality even when the F0 bit-rate is less than 50 bps. (C) 2011 Elsevier B.V. All rights reserved. C1 [Nose, Takashi; Kobayashi, Takao] Tokyo Inst Technol, Interdisciplinary Grad Sch Sci & Engn, Yokohama, Kanagawa 2268502, Japan. RP Nose, T (reprint author), Tokyo Inst Technol, Interdisciplinary Grad Sch Sci & Engn, Yokohama, Kanagawa 2268502, Japan. EM takashi.nose@ip.titech.ac.jp; takao.kobayashi@ip.titech.ac.jp FU JSPS [21300063, 21800020] FX Part of this work was supported by JSPS Grant-in-Aid for Scientific Research 21300063 and 21800020. CR DUDLEY H, 1958, J ACOUST SOC AM, V30, P733, DOI 10.1121/1.1909744 Hoshiya T., 2003, P ICASSP 2003, P800 Katsaggelos A., 2002, P IEEE 2, V86, P1126 Kawahara H, 1999, SPEECH COMMUN, V27, P187, DOI 10.1016/S0167-6393(98)00085-5 Lee KS, 2001, IEEE T SPEECH AUDI P, V9, P482 Nose T., 2010, IEICE T INF SYSTEMS, P2483 Nose T, 2010, INT CONF ACOUST SPEE, P4622, DOI 10.1109/ICASSP.2010.5495548 Picone J., 1987, Proceedings: ICASSP 87. 1987 International Conference on Acoustics, Speech, and Signal Processing (Cat. No.87CH2396-0) Picone J., 1989, P ICASSP 89, P580 Scheffers M., 1988, P 7 FASE S, P981 Schwartz R., 1980, P ICASSP 80 IEEE, P32 Shinoda K., 2000, Journal of the Acoustical Society of Japan (E), V21, DOI 10.1250/ast.21.79 Soong F., 1989, P ICASSP, P584 Toda T, 2007, IEICE T INF SYST, VE90D, P816, DOI 10.1093/ietisy/e90-d.5.816 Tokuda K., 1995, P EUROSPEECH, P757 TOKUDA K, 1995, INT CONF ACOUST SPEE, P660, DOI 10.1109/ICASSP.1995.479684 TOKUDA K, 1999, ACOUST SPEECH SIG PR, P229 TOKUDA K, 1998, ACOUST SPEECH SIG PR, P609 NR 18 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2012 VL 54 IS 3 BP 384 EP 392 DI 10.1016/j.specom.2011.10.002 PG 9 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 899IH UT WOS:000300809100005 ER PT J AU de Lima, AA Prego, TD Netto, SL Lee, B Said, A Schafer, RW Kalker, T Fozunbal, M AF de Lima, Amaro A. Prego, Thiago de M. Netto, Sergio L. Lee, Bowon Said, Amir Schafer, Ronald W. Kalker, Ton Fozunbal, Majid TI On the quality-assessment of reverberated speech SO SPEECH COMMUNICATION LA English DT Article DE Speech quality evaluation; Reverberation assessment; Intrusive approach ID ENERGY RATIO; ACOUSTICS AB This paper addresses the problem of quantifying the reverberation effect in speech signals. The perception of reverberation is assessed based on a new measure combining the characteristics of reverberation time, room spectral variance, and direct-to-reverberant energy ratio, which are estimated from the associated room impulse response (RIR). The practical aspects behind a robust RIR estimation are underlined, allowing an effective feature extraction for reverberation evaluation. The resulting objective metric achieves a correlation factor of about 90% with the subjective scores of two distinct speech databases, illustrating the system's ability to assess the reverberation effect in a reliable manner. 
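Two of the room-impulse-response (RIR) features named in the abstract above can be illustrated with a short sketch (assumptions, not the authors' implementation: the Schroeder decay is fitted between -5 dB and -35 dB, and the direct-path window of 2.5 ms is a common but arbitrary choice):

```python
import numpy as np

def schroeder_rt60(h, fs):
    # Reverberation time: fit the -5 dB..-35 dB portion of the backward-integrated
    # energy decay curve and extrapolate the slope to -60 dB.
    edc = np.cumsum(h[::-1] ** 2)[::-1]
    edc_db = 10.0 * np.log10(edc / edc[0] + 1e-12)
    t = np.arange(len(h)) / fs
    mask = (edc_db <= -5.0) & (edc_db >= -35.0)
    slope, _ = np.polyfit(t[mask], edc_db[mask], 1)  # dB per second (negative)
    return -60.0 / slope

def direct_to_reverberant_ratio(h, fs, direct_ms=2.5):
    # DRR: energy in a short window around the direct-path peak versus the rest.
    peak = int(np.argmax(np.abs(h)))
    half = int(direct_ms * 1e-3 * fs)
    lo, hi = max(0, peak - half), peak + half + 1
    direct = np.sum(h[lo:hi] ** 2)
    reverb = np.sum(h[:lo] ** 2) + np.sum(h[hi:] ** 2)
    return 10.0 * np.log10(direct / (reverb + 1e-12))
```

The paper's objective metric combines these with a room spectral-variance term; how the three features are weighted is not shown here.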
(C) 2011 Elsevier B.V. All rights reserved. C1 [de Lima, Amaro A.] Fed Ctr Technol Educ Celso Suckow da Fonseca CEFE, Nova Iguacu, RJ, Brazil. [de Lima, Amaro A.; Prego, Thiago de M.; Netto, Sergio L.] Univ Fed Rio de Janeiro, COPPE, Program Elect Engn, BR-21945 Rio De Janeiro, RJ, Brazil. [Lee, Bowon; Said, Amir; Schafer, Ronald W.; Kalker, Ton; Fozunbal, Majid] Hewlett Packard Labs, Palo Alto, CA 94304 USA. RP de Lima, AA (reprint author), Fed Ctr Technol Educ Celso Suckow da Fonseca CEFE, Nova Iguacu, RJ, Brazil. EM amaro@lps.ufrj.br; thprego@lps.ufrj.br; sergioln@lps.ufrj.br; bowon.lee@hp.com; amir_said@hp.com; ron.schafer@hp.com CR Allen J. B., 1982, J ACOUSTIC SOC AM, V71 ALLEN JB, 1979, J ACOUST SOC AM, V65, P943, DOI 10.1121/1.382599 [Anonymous], 2004, P563 ITUT, P563 [Anonymous], 2001, P862 ITUT Berkley D. A., 1993, ACOUSTICAL FACTORS A Cole D., 1994, P INT S SPEECH IM PR, P241, DOI 10.1109/SIPNN.1994.344922 de Lima A. A., 2009, P INT WORKSH MULT SI de Lima A. A., 2008, P INT C SIGN PROC MU, P257 Falk T., 2010, IEEE T AUDIO SPEECH, V18 Figueiredo F.L., 2005, P INT C EXP NOIS CON Gardner WG, 1998, SPRING INT SER ENG C, P85 Goetze S., 2010, P IEEE INT C AC SPEE Griesinger D., 2009, 157 M AC SOC AM PORT ITU- T Rec, 1996, P800 ITUT ITU-T, 2005, P8622 ITUT JETZT JJ, 1979, J ACOUST SOC AM, V65, P1204, DOI 10.1121/1.382786 Jeub M., 2009, P 16 INT C DIG SIGN Jot J.-M., 1991, P 90 CONV AM ENG SOC Karjalainen M., 2001, P CONV AUD ENG SOC A, P867 Kay S. M., 1993, FUNDAMENTALS STAT SI Kuster M, 2008, J ACOUST SOC AM, V124, P982, DOI 10.1121/1.2940585 Kuttruff H., 2000, ROOM ACOUSTICS Kuttruff H., 2007, ACOUSTICS INTRO Larsen E, 2008, J ACOUST SOC AM, V124, P450, DOI 10.1121/1.2936368 SCHROEDE.MR, 1965, J ACOUST SOC AM, V37, P409, DOI 10.1121/1.1909343 Wen J. Y. C., 2006, P IEEE INT WORKSH AC Wen J.Y.C., 2006, P EUR SIGN PROC C FL Zahorik P, 2002, J ACOUST SOC AM, V112, P2110, DOI 10.1121/1.1506692 Zahorik P, 2002, J ACOUST SOC AM, V111, P1832, DOI 10.1121/1.1458027 Zielinski S, 2008, J AUDIO ENG SOC, V56, P427 NR 30 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2012 VL 54 IS 3 BP 393 EP 401 DI 10.1016/j.specom.2011.10.003 PG 9 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 899IH UT WOS:000300809100006 ER PT J AU Dai, P Soon, IY AF Dai, Peng Soon, Ing Yann TI A temporal frequency warped (TFW) 2D psychoacoustic filter for robust speech recognition system SO SPEECH COMMUNICATION LA English DT Article DE Automatic speech recognition; 2D mask; Simultaneous masking; Temporal masking; Temporal frequency warping; Temporal integration ID LATERAL INHIBITION; MASKING AB In this paper, a novel hybrid feature extraction algorithm is proposed, which implements forward masking, lateral inhibition, and temporal integration with a simple 2D psychoacoustic filter. The proposed algorithm consists of two key parts, the 2D psychoacoustic filter and cepstral mean variance normalization (CMVN). Mathematical derivation is provided to show the correctness of the 2D psychoacoustic filter based on the characteristic functions of masking effects. The effectiveness of the proposed algorithm is tested on the AURORA2 database. Extensive comparison is made against lateral inhibition (LI), forward masking (FM), CMVN, RASTA filter, the ETSI standard advanced front-end feature extraction algorithm (AFE), and the temporal warped 2D psychoacoustic filter. 
Experimental results show significant improvements from the proposed algorithm, a relative improvement of nearly 46.78% over the baseline mel-frequency cepstral coefficients (MFCC) system in noisy conditions. (C) 2011 Elsevier B.V. All rights reserved. C1 [Dai, Peng; Soon, Ing Yann] Nanyang Technol Univ, Sch Elect & Elect Engn, Singapore 639798, Singapore. RP Dai, P (reprint author), Nanyang Technol Univ, Sch Elect & Elect Engn, Singapore 639798, Singapore. EM daip0001@e.ntu.edu.sg; eiysoon@ntu.edu.sg CR Brookes M., 1997, VOICEBOX CHENG YM, 1991, IEEE T SIGNAL PROCES, V39, P1943, DOI 10.1109/78.134427 Dai P., 2010, SPEECH COMMUN, V53, P229 Dai P., 2009, P ICICS DEC DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 ETSI, 2007, 202050 ETSI ES Gold B., 2000, SPEECH AUDIO SIGNAL Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616 Hermansky H, 1998, SPEECH COMMUN, V25, P3, DOI 10.1016/S0167-6393(98)00027-2 Hirsch H., 2000, P ISCA ITRW ASR, V181, P188 HOUTGAST T, 1972, J ACOUST SOC AM, V51, P1885, DOI 10.1121/1.1913048 Ishizuka K, 2010, SPEECH COMMUN, V52, P41, DOI 10.1016/j.specom.2009.08.003 JESTEADT W, 1982, J ACOUST SOC AM, V71, P950, DOI 10.1121/1.387576 Luo X., 2008, P ICALIP, V1105, P1109 Oxenham AJ, 2001, J ACOUST SOC AM, V109, P732, DOI 10.1121/1.1336501 Oxenham AJ, 2000, HEARING RES, V150, P258, DOI 10.1016/S0378-5955(00)00206-9 Park KY, 2003, NEUROCOMPUTING, V52-4, P615, DOI 10.1016/S0925-2312(02)00791-9 SHAMMA SA, 1985, J ACOUST SOC AM, V78, P1622, DOI 10.1121/1.392800 NR 18 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2012 VL 54 IS 3 BP 402 EP 413 DI 10.1016/j.specom.2011.10.004 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 899IH UT WOS:000300809100007 ER PT J AU Grichkovtsova, I Morel, M Lacheret, A AF Grichkovtsova, Ioulia Morel, Michel Lacheret, Anne TI The role of voice quality and prosodic contour in affective speech perception SO SPEECH COMMUNICATION LA English DT Article DE Speech perception; Affective prosody; Voice quality; Prosodic contour; Speech synthesis; Prosody transplantation paradigm; Attitudes; Emotions; French ID RECOGNIZING EMOTIONS; FOREIGN-LANGUAGE; VOCAL EXPRESSION; SYNTHETIC SPEECH; SIMULATION AB We explore the usage of voice quality and prosodic contour in the identification of emotions and attitudes in French. For this purpose, we develop a corpus of affective speech based on one lexically neutral utterance and apply prosody transplantation method in our perception experiment. We apply logistic regression to analyze our categorical data and we observe differences in the identification of these two affective categories. Listeners primarily use prosodic contour in the identification of studied attitudes. Emotions are identified on the basis of voice quality and prosodic contour. However, their usage is not homogeneous within individual emotions. Depending on the stimuli, listeners may use both voice quality and prosodic contour, or privilege just one of them for the successful identification of emotions. The results of our study are discussed in view of their importance for speech synthesis. (C) 2011 Elsevier B.V. All rights reserved. C1 [Grichkovtsova, Ioulia; Morel, Michel] Univ Caen Basse Normandie, Lab CRISCO, EA 4255, F-14032 Caen, France. 
[Lacheret, Anne] Univ Paris Ouest Nanterre Def, CNRS, Lab MODYCO, UFR LLPHI,Dept Sci Langage,UMR 7114, F-92001 Nanterre, France. [Lacheret, Anne] Inst Univ France, F-75005 Paris, France. RP Grichkovtsova, I (reprint author), Univ Caen Basse Normandie, Lab CRISCO, EA 4255, F-14032 Caen, France. EM grichkovtsova@hotmail.com FU Conseil Regional de Basse-Normandie FX We acknowledge the financial support from the Conseil Regional de Basse-Normandie, under the research funding program "Transfer de Technologic". I.G. is pleased to thank Dr. Henni Ouerdane for careful reading of the manuscript, useful discussions and assistance with statistics. CR Alessandro C., 2006, METHOD EMPIRICAL PRO, P63 Auberge V, 2003, SPEECH COMMUN, V40, P87, DOI 10.1016/S0167-6393(02)00077-8 Baayen R. H., 2004, MENTAL LEXICON WORKI, V1, P1 Bachorowski JA, 1999, CURR DIR PSYCHOL SCI, V8, P53, DOI 10.1111/1467-8721.00013 Banse R, 1996, J PERS SOC PSYCHOL, V70, P614, DOI 10.1037/0022-3514.70.3.614 Banziger T, 2005, SPEECH COMMUN, V46, P252, DOI 10.1016/j.specom.2005.02.016 Barkhuysen P, 2010, LANG SPEECH, V53, P3, DOI 10.1177/0023830909348993 Beck S, 2009, J SEMANT, V26, P159, DOI 10.1093/jos/ffp001 Campbell N., 2006, P 15 INT C PHON SCI, P2417 Chen A.J., 2005, THESIS Dromey C, 2005, SPEECH COMMUN, V47, P351, DOI 10.1016/j.specom.2004.09.010 Dutoit T, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P1393 Ekman P., 1999, HDB COGNITION EMOTIO Elfenbein HA, 2003, CURR DIR PSYCHOL SCI, V12, P159, DOI 10.1111/1467-8721.01252 Erickson D., 2010, SPEECH PROSODY 2010 Garcia M.N., 2006, P INT C LANG RES EV, P307 Ghio A., 2007, TIPA, V22, P115 Gobl C, 2003, SPEECH COMMUN, V40, P189, DOI 10.1016/S0167-6393(02)00082-1 Grichkovtsova I., 2007, P INT Grichkovtsova I., 2009, ROLE PROSODY AFFECTI, P371 Hancil Sylvie, 2009, ROLE PROSODY AFFECTI Hox J., 2002, MULTILEVEL ANAL TECH Izard C.E., 1971, FACE EMOTION Jaeger TF, 2008, J MEM LANG, V59, P434, DOI 10.1016/j.jml.2007.11.007 Johnson-Laird P. 
N., 1987, COGNITION EMOTION, V1, P29, DOI [10.1080/02699938708408362, DOI 10.1080/02699938708408362] Johnstone T., 2000, HDB EMOTIONS, V2nd, P220 Hammerschmidt Kurt, 2007, J Voice, V21, P531, DOI 10.1016/j.jvoice.2006.03.002 Lakshminarayanan K, 2003, BRAIN LANG, V84, P250, DOI 10.1016/S0093-934X(02)00516-3 Laver J., 1994, P INT C SPEECH PROS Laver J, 1980, PHONETIC DESCRIPTION Laver John, 1994, PRINCIPLES PHONETICS Moineddin R., 2007, CAHIERS I LINGUISTIQ, V7, P1, DOI 10.2143/CILL.30.1.519219 Moineddin R., 2007, BMC MED RES METHODOL, V7, P1, DOI DOI 10.1186/1471-2288-7-34 Morel M., 2004, TRAITEMENT AUTOMATIQ, V30, P207 Murray IR, 2008, COMPUT SPEECH LANG, V22, P107, DOI 10.1016/j.csl.2007.06.001 MURRAY IR, 1993, J ACOUST SOC AM, V93, P1097, DOI 10.1121/1.405558 ORTONY A, 1990, PSYCHOL REV, V97, P315, DOI 10.1037//0033-295X.97.3.315 Pell MD, 2009, J PHONETICS, V37, P417, DOI 10.1016/j.wocn.2009.07.005 Pell MD, 2009, J NONVERBAL BEHAV, V33, P107, DOI 10.1007/s10919-008-0065-7 PLUTCHIK R, 1993, HDB EMOTIONS, DOI DOI 10.1016/J.WOCN.2009.07.005 Power M., 2008, COGNITION EMOTION OR Prudon R., 2004, TEXT TO SPEECH SYNTH, P203 R Development Core Team, 2008, R LANG ENV STAT COMP Rilliard A, 2009, LANG SPEECH, V52, P223, DOI 10.1177/0023830909103171 RODERO E, 2011, J VOICE, V25, P25, DOI DOI 10.1177/0023830909103171 Scherer KR, 2001, J CROSS CULT PSYCHOL, V32, P76, DOI 10.1177/0022022101032001009 Scherer KR, 2003, SPEECH COMMUN, V40, P227, DOI 10.1016/S0167-6393(02)00084-5 SCHRODER M, 2008, EMOTIONS HUMAN VOICE, P307 Schroder M., 2008, P INT C SPEECH PROS, P307 Shochi T., 2009, ROLE PROSODY AFFECTI, P31 Thompson WF, 2004, EMOTION, V4, P46, DOI 10.1037/1528-3542.4.1.46 WILLIAMS CE, 1972, J ACOUST SOC AM, V52, P1238, DOI 10.1121/1.1913238 YANUSHEVSKAYA I, 2006, P 3 INT C SPEECH PRO NR 53 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2012 VL 54 IS 3 BP 414 EP 429 DI 10.1016/j.specom.2011.10.005 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 899IH UT WOS:000300809100008 ER PT J AU Rudzicz, F AF Rudzicz, Frank TI Using articulatory likelihoods in the recognition of dysarthric speech SO SPEECH COMMUNICATION LA English DT Article DE Dysarthria; Speech recognition; Acoustic-articulatory inversion; Task-dynamics ID GAUSSIAN MIXTURE MODEL; RECOVERING ARTICULATION; ACOUSTICS AB Millions of individuals have congenital or acquired neuro-motor conditions that limit control of their muscles, including those that manipulate the vocal tract. These conditions, collectively called dysarthria, result in speech that is very difficult to understand both by human listeners and by traditional automatic speech recognition (ASR), which in some cases can be rendered completely unusable. In this work we first introduce a new method for acoustic-to-articulatory inversion which estimates positions of the vocal tract given acoustics using a nonlinear Hammerstein system. This is accomplished based on the theory of task-dynamics using the TORGO database of dysarthric articulation. Our approach uses adaptive kernel canonical correlation analysis and is found to be significantly more accurate than mixture density networks, at or above the 95% level of confidence for most vocal tract variables. Next, we introduce a new method for ASR in which acoustic-based hypotheses are re-evaluated according to the likelihoods of their articulatory realizations in task-dynamics. 
This approach incorporates high-level, long-term aspects of speech production and is found to be significantly more accurate than hidden Markov models, dynamic Bayesian networks, and switching Kalman filters. (C) 2011 Elsevier B.V. All rights reserved. C1 Univ Toronto, Dept Comp Sci, Toronto, ON, Canada. RP Rudzicz, F (reprint author), Univ Toronto, Dept Comp Sci, Toronto, ON, Canada. EM frank@cs.toronto.edu CR Ananthakrishnan G., 2009, P INT 2009 BRIGHT UK Aschbacher E., 2005, P 13 STAT SIGN PROC Bahr RH, 2005, TOP LANG DISORD, V25, P254 Browman Catherine, 1986, PHONOLOGY YB, V3, P219 Deng J., 2005, AC SPEECH SIGN PROC, V1, P1121 Dogil G., 1998, PHONOLOGY, V15 Enderby P. M., 1983, FRENCHAY DYSARTHRIA Friedland B., 2005, CONTROL SYSTEM DESIG Fukuda T., 2003, P ICASSP 03, V2, P25 Goldstein L, 2006, ACTION TO LANGUAGE VIA THE MIRROR NEURON SYSTEM, P215, DOI 10.1017/CBO9780511541599.008 Goldstein L.M., 2003, ARTICULATORY PHONOLO Hasegawa-Johnson M., 2007, INT SPEECH LEXICON P Hawley MS, 2007, MED ENG PHYS, V29, P586, DOI 10.1016/j.medengphy.2006.06.009 Hogden J, 2007, SPEECH COMMUN, V49, P361, DOI 10.1016/j.specom.2007.02.008 Havstam Christina, 2003, Logoped Phoniatr Vocol, V28, P81, DOI 10.1080/14015430310015372 King S, 2007, J ACOUST SOC AM, V121, P723, DOI 10.1121/1.2404622 Kirchhoff K., 1999, THESIS U BIELEFELD G Lai P L, 2000, Int J Neural Syst, V10, P365, DOI 10.1016/S0129-0657(00)00034-X LEE LJ, 2001, ACOUST SPEECH SIG PR, P797 Levelt WJM, 1999, BEHAV BRAIN SCI, V22, P1 Livescu Karen, 2007, P INT C AC SPEECH SI Matsumasa Hironori, 2009, Journal of Multimedia, V4, DOI 10.4304/jmm.4.4.254-261 Menendez-Pidal X., 1996, P 4 INT C SPOK LANG Metze F, 2007, SPEECH COMMUN, V49, P348, DOI 10.1016/j.specom.2007.02.009 Morales S. O. C., 2009, EURASIP J ADV SIGNAL Murphy Kevin P., 1998, SWITCHING KALMAN FIL Murphy K.P., 2002, THESIS U CALIFORNIA Nam H., 2006, TADA TASK DYNAMICS A Nam H, 2003, P 15 INT C PHON SCI, P2253 Ozbek IY, 2011, IEEE T AUDIO SPEECH, V19, P1180, DOI 10.1109/TASL.2010.2087751 Polur PD, 2006, MED ENG PHYS, V28, P741, DOI 10.1016/j.medengphy.2005.11.002 Raghavendra P., 2001, AUGMENTATIVE ALTERNA, V17, P265, DOI 10.1080/714043390 Ramsay J. O., 2005, FITTING DIFFERENTIAL, P327 Reimer M., 2010, P INT 2010 MAK JAP, P1608 Richardson M., 2000, P ICSLP BEIJ CHIN, P131 Richmond K, 2003, COMPUT SPEECH LANG, V17, P153, DOI 10.1016/S0885-2308(03)00005-6 Rosen K., 2000, AUGMENTATIVE ALTERNA, V16, P48, DOI DOI 10.1080/07434610012331278904 Rudzicz F., 2010, P 48 ANN M ASS COMP Rudzicz F., 2008, P 8 INT SEM SPEECH P RUDZICZ F, 2007, P 9 INT ACM SIGACCES, P255, DOI DOI 10.1145/1296843.1296899 Rudzicz F., 2011, THESIS U TORONTO TOR Rudzicz Frank, 2010, P 2010 IEEE INT C AC SAKOE H, 1978, IEEE T ACOUST SPEECH, V26, P43, DOI 10.1109/TASSP.1978.1163055 Saltzman E. L., 1989, ECOL PSYCHOL, V1, P333, DOI 10.1207/s15326969eco0104_2 Saltzman E.M., 1986, TASK DYNAMIC COORDIN, P129 Smith A., 2004, SPEECH MOTOR CONTROL, P227 Stevens KN, 2010, J PHONETICS, V38, P10, DOI 10.1016/j.wocn.2008.10.004 Sun JP, 2002, J ACOUST SOC AM, V111, P1086, DOI 10.1121/1.1420380 Toda T, 2008, SPEECH COMMUN, V50, P215, DOI 10.1016/j.specom.2007.09.001 Vaerenbergh S.V., 2008, EURASIP J ADV SIG PR, V8, P1 Vaerenbergh S.V., 2006, P 2006 IEEE INT C AC Vaerenbergh S.V., 2006, P 2006 INT JOINT C N, P1198 Van Lieshout P., 2008, ASIA PACIFIC J SPEEC, V11, P283 Wrench A., 1999, MOCHA TIMIT ARTICULA Wrench Alan, 2000, P INT C SPOK LANG PR Yorkston K. 
M., 1981, ASSESSMENT INTELLIGI Yunusova Y, 2009, J SPEECH LANG HEAR R, V52, P547, DOI 10.1044/1092-4388(2008/07-0218) Zheng WM, 2006, IEEE T NEURAL NETWOR, V17, P233, DOI 10.1109/TNN.2005.860849 NR 58 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2012 VL 54 IS 3 BP 430 EP 444 DI 10.1016/j.specom.2011.10.006 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 899IH UT WOS:000300809100009 ER PT J AU Jeon, JH Liu, Y AF Jeon, Je Hun Liu, Yang TI Automatic prosodic event detection using a novel labeling and selection method in co-training SO SPEECH COMMUNICATION LA English DT Article DE Prosodic event; ToBI; Pitch accent; Break index; Intonational phrase boundary; Co-training ID FEATURES AB Most previous approaches to automatic prosodic event detection are based on supervised learning, relying on the availability of a corpus that is annotated with the prosodic labels of interest in order to train the classification models. However, creating such resources is an expensive and time-consuming task. In this paper, we exploit semi-supervised learning with the co-training algorithm for automatic detection of coarse-level representation of prosodic events such as pitch accent, intonational phrase boundaries, and break indices. Since co-training works on the condition that the views are compatible and uncorrelated, and real data often do not satisfy these conditions, we propose a method to label and select examples in co-training. In our experiments on the Boston University radio news corpus, when using only a small amount of the labeled data as the initial training set, our proposed labeling method can effectively use unlabeled data to improve performance and finally reach performance close to the results of the supervised method using more labeled data. We perform a thorough analysis of various factors impacting the learning curves, including labeling error rate and informativeness of added examples, performance of the individual classifiers and their difference, and the initial and added data size. (C) 2011 Elsevier B.V. All rights reserved. C1 [Jeon, Je Hun; Liu, Yang] Univ Texas Dallas, Dept Comp Sci, Richardson, TX 75083 USA. RP Jeon, JH (reprint author), Univ Texas Dallas, Dept Comp Sci, Richardson, TX 75083 USA. EM jhjeon@hlt.utdallas.edu; yangl@hlt.utdallas.edu FU US Air Force Office of Scientific Research [FA9550-10-1-0388] FX This work is partly supported by an award from the US Air Force Office of Scientific Research, FA9550-10-1-0388. 
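A generic co-training loop, given as a sketch of the semi-supervised setting used in the abstract above (the paper's own labeling and selection method differs; agreement between the two view classifiers plus a confidence threshold is used here only as a stand-in selection rule, and the logistic-regression classifiers are an assumption):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def cotrain(Xa_l, Xb_l, y_l, Xa_u, Xb_u, rounds=10, add_per_round=50, thresh=0.9):
    # Two classifiers trained on different feature views (e.g., acoustic vs.
    # lexical/syntactic) label unlabeled examples for each other.
    clf_a, clf_b = LogisticRegression(max_iter=1000), LogisticRegression(max_iter=1000)
    for _ in range(rounds):
        clf_a.fit(Xa_l, y_l)
        clf_b.fit(Xb_l, y_l)
        if len(Xa_u) == 0:
            break
        pa, pb = clf_a.predict_proba(Xa_u), clf_b.predict_proba(Xb_u)
        ya, yb = pa.argmax(1), pb.argmax(1)
        conf = np.minimum(pa.max(1), pb.max(1))
        # Keep only examples where both views agree and are confident,
        # then add the most confident ones to the labeled pool.
        cand = np.where((ya == yb) & (conf >= thresh))[0]
        pick = cand[np.argsort(-conf[cand])][:add_per_round]
        if len(pick) == 0:
            break
        Xa_l = np.vstack([Xa_l, Xa_u[pick]])
        Xb_l = np.vstack([Xb_l, Xb_u[pick]])
        y_l = np.concatenate([y_l, ya[pick]])
        keep = np.setdiff1d(np.arange(len(Xa_u)), pick)
        Xa_u, Xb_u = Xa_u[keep], Xb_u[keep]
    return clf_a, clf_b
```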
CR Ananthakrishnan S., 2007, P ICASSP, P65 Ananthakrishnan S, 2008, IEEE T AUDIO SPEECH, V16, P216, DOI 10.1109/TASL.2007.907570 Ananthakrishnan S., 2006, P INT C SPOK LANG PR, P297 Balcan MF, 2005, ADV NEURAL INFORM PR, V17, P89 Bartlett S., 2009, P NAACL HLT, P308, DOI 10.3115/1620754.1620799 Blum A., 1998, Proceedings of the Eleventh Annual Conference on Computational Learning Theory, DOI 10.1145/279943.279962 Boersma P., 2001, GLOT INT, V5, P341 Brants T., 2000, P 6 C APPL NAT LANG, P224, DOI 10.3115/974147.974178 Chen K., 2004, P ICASSP, P1509 Chen K, 2006, IEEE T AUDIO SPEECH, V14, P232, DOI 10.1109/TSA.2005.853208 Clark S., 2003, P CONLL EDM CAN, P49 Dasgupta S, 2002, ADV NEUR IN, V14, P375 Dehak N, 2007, IEEE T AUDIO SPEECH, V15, P2095, DOI 10.1109/TASL.2007.902758 Goldman S.A., 2000, P 17 INT C MACH LEAR, P327 Grabe E., 2003, P SASRTLM, P45 Gregory Michelle L., 2004, P 42 M ASS COMP LING, P677, DOI 10.3115/1218955.1219041 Gus U., 2007, P INT, P2597 HIRSCHBERG J, 1993, ARTIF INTELL, V63, P305, DOI 10.1016/0004-3702(93)90020-C Jeon J. H., 2009, P ACL IJCNLP, P540, DOI DOI 10.3115/1690219.1690222] Jeon J. H., 2010, P INT, P1772 Jeon J.H., 2011, ACL HLT, P732 Jeon J.-H., 2009, P ICASSP, P4565 Kudoh T., 2000, P CONLL 2000 LLL 200, P142 Levow G. A., 2008, P IJCNLP, P217 Levow G.-A., 2006, P HLT NAACL, P224, DOI 10.3115/1220835.1220864 Lin HT, 2007, MACH LEARN, V68, P267, DOI 10.1007/s10994-007-5018-6 Muslea I, 2002, P 19 INT C MACH LEAR, P435 Muslea I., 2000, Proceedings Seventeenth National Conference on Artificial Intelligence (AAAI-2000). Twelfth Innovative Applications of Artificial Intelligence Conference (IAAI-2000) Nakatani C.H., 1995, WORK NOT AAAI 95 SPR, P106 Nenkova A, 2007, P HLT ACL ROCH NY AP, P9 Nigam K., 2000, Proceedings of the Ninth International Conference on Information and Knowledge Management. CIKM 2000, DOI 10.1145/354756.354805 Price P., 1991, P HLT, P372, DOI 10.3115/112405.112738 Rosenberg A, 2007, P INT, P2777 Shriberg E., 1998, LANG SPEECH, V41, P439 Silverman K., 1992, P INT C SPOK LANG PR, P867 Sridhar VKR, 2008, IEEE T AUDIO SPEECH, V16, P797, DOI 10.1109/TASL.2008.917071 Steedman M., 2003, CLSP WS 02 FINAL REP Tur G., 2003, P EUR, P2793 Wang W., 2007, P ICASSP, V4, pIV137 Wightman CW, 1994, IEEE T SPEECH AUDI P, V2, P469, DOI 10.1109/89.326607 Xie S., 2011, P INT, P2522 NR 41 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2012 VL 54 IS 3 BP 445 EP 458 DI 10.1016/j.specom.2011.10.008 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 899IH UT WOS:000300809100010 ER PT J AU Adell, J Escudero, D Bonafonte, A AF Adell, Jordi Escudero, David Bonafonte, Antonio TI Production of filled pauses in concatenative speech synthesis based on the underlying fluent sentence SO SPEECH COMMUNICATION LA English DT Article DE Speech synthesis; Conversational speech; Talking speech synthesiser; Filled pause; Disfluency; Underlying fluent sentence; Prosody; Ogmios; Perceptual evaluation ID UM; UH; DISFLUENCIES; COMPUTERS; SPEAKING; REPAIR; CORPUS AB Until now, speech synthesis has mainly involved reading-style speech. Today, however, text-to-speech systems must provide a variety of styles because users expect these interfaces to do more than just read information. If synthetic voices must be integrated into future technology, they must simulate the way people talk instead of the way people read. 
Existing knowledge about how disfluencies occur has made it possible to propose a general framework for synthesising disfluencies. We propose a model based on the definition of disfluency and the concept of underlying fluent sentences. The model incorporates the parameters of standard prosodic models for fluent speech with local modifications of prosodic parameters near the interruption point. The constituents of the local models for filled pauses are derived from the analysis corpus, and constituent's prosodic parameters are predicted via linear regression analysis. We also discuss the implementation details of the model when used in a real speech synthesis system. Objective and perceptual evaluations showed that the proposed models outperformed the baseline model. Perceptual evaluations of the system showed that it is possible to synthesise filled pauses without decreasing the overall naturalness of the system, and users stated that the speech produced is even more natural than the one produced without filled pauses. (C) 2011 Elsevier B.V. All rights reserved. C1 [Adell, Jordi; Bonafonte, Antonio] Univ Politecn Cataluna, Barcelona, Spain. [Escudero, David] Univ Valladolid, Valladolid, Spain. RP Adell, J (reprint author), Univ Politecn Cataluna, Barcelona, Spain. EM jordi.adell@upc.edu RI Escudero, David/K-7905-2014 OI Escudero, David/0000-0003-0849-8803 CR Aaron A, 2005, SCI AM, V292, P64 Adell J., 2006, P 3 INT C SPEECH PRO Adell J., 2010, P ISCA SPEECH PROS C Adell J., 2007, LECT NOTES ARTIF INT, V1, P358 Adell J., 2010, P ICASSP DALL US Adell J., 2008, P INT BRISB AUSTR, P2278 Adell J., 2005, P INT C AC SPEECH SI Agiiero P.D., 2004, P 5 ISCA SPEECH SYNT Andersson S., 2010, P INT C SPEECH PROS Campbell N, 2007, TEXT SPEECH LANG TEC, V37, P29 Arslan L., 1998, P ICASSP SEATL US Batliner A., 1995, P 13 INT C PHON SCI, V3, P472 Bennett C. L., 2005, P INT EUR, P105 Bernstein J., 1991, HLT 91, P423 Boersma P., 2005, PRAAT DOING PHONETIC Bonafonte A., 2006, P TC STAR WORKSH SPE Bonafonte A., 2005, TTS PROGR REPORT DEL Bonafonte A., 2008, P 6 LANG RES EV C LR Bonafonte A., 1998, P 8 JORN TEL I D TEL Campbell N., 1998, P 3 ESCA COCOSDA WOR, P17 Carlson R., 2006, P INT, P1300 Chung H., 2002, P INT C SPEECH PROS Clark H. H., 1996, USING LANGUAGE Clark HH, 2002, COGNITION, V84, P73, DOI 10.1016/S0010-0277(02)00017-3 Clark HH, 2002, SPEECH COMMUN, V36, P5, DOI 10.1016/S0167-6393(01)00022-X Clark R.A.J., 2007, P 3 BLIZZ CHALL BONN Clark RAJ, 2007, SPEECH COMMUN, V49, P317, DOI 10.1016/j.specom.2007.01.014 Clinton Hillary Rodham, 2003, LIVING HIST Conejero D., 2003, P EUR GEN SWITZ Corder G. 
W., 2009, NONPARAMETRIC STAT N Dahl D.A., 1994, P ARPA WORKSH HUM LA, P43, DOI 10.3115/1075812.1075823 Dusterhoff K.E., 1999, P EUR Eide E, 2002, PROCEEDINGS OF THE 2002 IEEE WORKSHOP ON SPEECH SYNTHESIS, P127, DOI 10.1109/WSS.2002.1224388 Erro D, 2010, IEEE T AUDIO SPEECH, V18, P974, DOI 10.1109/TASL.2009.2038658 Escudero Mancebo D., 2007, SPEECH COMMUN, V49, P213 Tree JEF, 1995, J MEM LANG, V34, P709, DOI 10.1006/jmla.1995.1032 Fraser M., 2007, P 3 BLIZZ CHALL BONN Gabrea M., 2000, P ICSLP BEIJ, V3, P678 GARNHAM A, 1981, LINGUISTICS, V19, P805, DOI 10.1515/ling.1981.19.7-8.805 Godfrey J., 1992, P ICASSP, P517, DOI 10.1109/ICASSP.1992.225858 Goedertier W., 2000, P INT C LANG RES EV, P909 Hamza W., 2004, P INT C SPOK LANG PR Hirschberg J, 2002, SPEECH COMMUN, V36, P31, DOI 10.1016/S0167-6393(01)00024-3 Iriondo I., 2007, INT C AC SPEECH SIGN, V4, P821 ITU-T.P85, 1994, ITUTP85 Kowtko J.C., 1989, P DARPA SPEECH NAT L Krishna N. S., 2004, P ISCA WORKSH SPEECH Lee EJ, 2010, COMPUT HUM BEHAV, V26, P665, DOI 10.1016/j.chb.2010.01.003 Levelt W. J., 1993, SPEAKING INTENTION A Levelt W.J., 1983, J SEMANT, P205 LEVELT WJM, 1983, COGNITION, V14, P41, DOI 10.1016/0010-0277(83)90026-4 Likert R., 1932, ARCH PSYCHOL, V140, P1, DOI DOI 10.1111/J.1540-5834.2010.00585 MARCUSROBERTS HM, 1987, J EDUC STAT, V12, P383, DOI 10.3102/10769986012004383 NAKATANI CH, 1994, J ACOUST SOC AM, V95, P1603, DOI 10.1121/1.408547 O'Shaughnessy D., 1992, P ICASSP, P521, DOI 10.1109/ICASSP.1992.225857 O'Connell DC, 2004, J PSYCHOLINGUIST RES, V33, P459, DOI 10.1007/s10936-004-2666-6 O'Connell DC, 2005, J PSYCHOLINGUIST RES, V34, P555, DOI 10.1007/s10936-005-9164-3 Pakhomov S.V., 1999, P 37 ANN M ASS COMP Perez J., 2006, P INT C LANG RES EV Quimbo F.C.M., 1998, P INT C SPEECH LANG ROCHESTE.SR, 1973, J PSYCHOLINGUIST RES, V2, P51, DOI 10.1007/BF01067111 Rose R., 1998, THESIS U BIRMINGHAM Savino M., 2000, LECT NOTES ARTIF INT, P421 Shriberg E., 1997, P EUR RHOD GREEC Shriberg E. E., 1994, THESIS U CALIFORNIA SHRIBERG EE, 1993, PHONETICA, V50, P172 Shriberg Elizabeth E., 1999, P INT C PHON SCI ICP, V1, P619 Stolcke A, 1996, INT CONF ACOUST SPEE, P405, DOI 10.1109/ICASSP.1996.541118 Stouten F, 2006, SPEECH COMMUN, V48, P1590, DOI 10.1016/j.specom.2006.04.004 Sundaram S, 2002, PROCEEDINGS OF THE 2002 IEEE WORKSHOP ON SPEECH SYNTHESIS, P203, DOI 10.1109/WSS.2002.1224409 Sundaram S., 2003, P EUROSPEECH GEN SWI, P1221 SVARTVIK J, 1980, LUND STUDIES ENGLISH, V56 Svartvik Jan, 1990, LONDON LUND CORPUS S Taylor P, 2009, TEXT TO SPEECH SYNTH Taylor P., 1998, P INT C SPEECH LANG Tree JEF, 2001, MEM COGNITION, V29, P320 Tseng S.-C., 1999, THESIS U BIELEFELD Umbert M., 2006, P INT C LANG RES EV van Santen J., 1997, PROGR SPEECH SYNTHES Vazquez Y., 2002, P INT C SPEECH LANG, P329 Watanabe M., 2005, P EUR SEPT LISB PORT, P37 Zen H, 2009, SPEECH COMMUN, V51, P1039, DOI 10.1016/j.specom.2009.04.004 Zhao Y., 2005, P 13 IEEE INT C NETW, P179 NR 83 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD MAR PY 2012 VL 54 IS 3 BP 459 EP 476 DI 10.1016/j.specom.2011.10.010 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 899IH UT WOS:000300809100011 ER PT J AU Choi, JH Chang, JH AF Choi, Jae-Hun Chang, Joon-Hyuk TI On using acoustic environment classification for statistical model-based speech enhancement SO SPEECH COMMUNICATION LA English DT Article DE Speech enhancement; Noise classification; Gaussian mixture model; DFT ID SPECTRAL AMPLITUDE ESTIMATOR; SOFT-DECISION; NOISE; SUPPRESSION; SUBTRACTION AB In this paper, we present a statistical model-based speech enhancement technique using acoustic environment classification supported by a Gaussian mixture model (GMM). In the data training stage, the principal parameters of the statistical model-based speech enhancement algorithm such as the weighting parameter in the decision-directed (DD) method, the long-term smoothing parameter of the noise estimation, and the control parameter of the minimum gain value are uniquely set as optimal operating points according to the given noise information to ensure the best performance for each noise. These optimal operating points, which are specific to the different background noises, are estimated based on the composite measures, which are the objective quality measures representing the highest correlation with the actual speech quality processed by noise suppression algorithms. In the on-line environment-aware speech enhancement step, the noise classification is performed on a frame-by-frame basis using the maximum likelihood (ML)-based Gaussian mixture model (GMM). The speech absence probability (SAP) is used to detect the speech absence periods and to update the likelihood of the GMM. According to the classified noise information for each frame, we assign the optimal values to the aforementioned three parameters for speech enhancement. We evaluated the performances of the proposed methods using objective speech quality measures and subjective listening tests under various noise environments. Our experimental results showed that the proposed method yields better performances than does a conventional algorithm with fixed parameters. (C) 2011 Elsevier B.V. All rights reserved. C1 [Choi, Jae-Hun; Chang, Joon-Hyuk] Hanyang Univ, Sch Elect Engn, Seoul 133791, South Korea. RP Chang, JH (reprint author), Hanyang Univ, Sch Elect Engn, Seoul 133791, South Korea. EM jchang@hanyang.ac.kr FU MKE/KEIT [2009-S-036-01]; National Research Foundation of Korea (NRF); Korean Government (MEST) [NRF-2011-0009182]; Hanyang University [HY-2011-201100000000] FX This work was supported by the IT R&D program of MKE/KEIT [2009-S-036-01, Development of New Virtual Machine Specification and Technology]. And, this work was supported by National Research Foundation of Korea (NRF) grant funded by the Korean Government (MEST) (NRF-2011-0009182). 
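A minimal sketch, with assumed features and parameter values, of the frame-wise noise-classification step described in the Choi and Chang abstract above: one GMM per noise environment, maximum-likelihood selection per frame, and a lookup of per-noise operating points (decision-directed weight, noise-smoothing factor, minimum gain). The feature vectors and operating-point numbers are illustrative stand-ins, not the paper's trained values.

```python
# Sketch: GMM-based acoustic environment classification driving parameter selection.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
training = {                      # stand-ins for log-spectral feature vectors
    "babble": rng.normal(0.0, 1.0, size=(500, 12)),
    "car":    rng.normal(2.0, 1.0, size=(500, 12)),
}
gmms = {noise: GaussianMixture(n_components=4, random_state=0).fit(feats)
        for noise, feats in training.items()}

# Hypothetical operating points per noise type: (DD weight, smoothing, min gain)
operating_points = {"babble": (0.98, 0.98, 0.10), "car": (0.95, 0.99, 0.05)}

def classify_frame(frame):
    """Pick the noise model with the highest log-likelihood for this frame."""
    scores = {noise: g.score_samples(frame[None, :])[0] for noise, g in gmms.items()}
    return max(scores, key=scores.get)

frame = rng.normal(1.8, 1.0, size=12)
noise = classify_frame(frame)
dd_weight, smoothing, min_gain = operating_points[noise]
print(noise, dd_weight, smoothing, min_gain)
```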
This work was supported by the research fund of Hanyang University (HY-2011-201100000000) CR Akbacak M, 2007, IEEE T AUDIO SPEECH, V15, P465, DOI 10.1109/TASL.2006.881694 [Anonymous], 2002, TMS320C55X DSP LIB P [Anonymous], 1996, TIAEIAIS127 [Anonymous], 2005, 3GPP2CR00300 [Anonymous], 2000, P862 ITUT BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 Cappe O, 1994, IEEE T SPEECH AUDI P, V2, P345, DOI 10.1109/89.279283 Chang JH, 2005, IEEE T CIRCUITS-II, V52, P535, DOI 10.1109/TCSII.2005.850448 Chang JH, 2001, IEICE T INF SYST, VE84D, P1231 Cohen I, 2002, IEEE SIGNAL PROC LET, V9, P12, DOI 10.1109/97.988717 EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 Hu Y., 2008, IEEE T AUDIO SPEECH HU Y, 2006, P INT, P1447 Kim NS, 2000, IEEE SIGNAL PROC LET, V7, P108 Kraft F., 2005, P IEEE INT 2005 LISB, P2689 Krishnamurthy N., 2006, P INT, P1431 Ma L., 2006, ACM T SPEECH LANGUAG, V3, P1, DOI 10.1145/1149290.1149292 Martin R, 2001, IEEE T SPEECH AUDI P, V9, P504, DOI 10.1109/89.928915 MCAULAY RJ, 1980, IEEE T ACOUST SPEECH, V28, P137, DOI 10.1109/TASSP.1980.1163394 Park YS, 2007, IEICE T COMMUN, VE90B, P2182, DOI 10.1093/ietcom/e90-b.8.2182 Quackenbush S. R., 1988, OBJECTIVE MEASURES S Sangwan A., 2007, P INT 2007 ANTW BELG, P2929 Sim BL, 1998, IEEE T SPEECH AUDI P, V6, P328 Sohn J., 1998, P INT C AC SPEECH SI, V1, P365 Song JH, 2008, IEEE SIGNAL PROC LET, V15, P103, DOI 10.1109/LSP.2007.911184 NR 26 TC 4 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2012 VL 54 IS 3 BP 477 EP 490 DI 10.1016/j.specom.2011.10.009 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 899IH UT WOS:000300809100012 ER PT J AU Kurata, G Itoh, N Nishimura, M Sethy, A Ramabhadran, B AF Kurata, Gakuto Itoh, Nobuyasu Nishimura, Masafumi Sethy, Abhinav Ramabhadran, Bhuvana TI Leveraging word confusion networks for named entity modeling and detection from conversational telephone speech SO SPEECH COMMUNICATION LA English DT Article DE Named entity detection; Conversational telephone speech; Word confusion networks; Maximum entropy model AB Named Entity (NE) detection from Conversational Telephone Speech (CTS) is important from business aspects. However, results of Automatic Speech Recognition (ASR) inevitably contain errors and this makes NE detection from CTS more difficult than from written text. One of the options to detect NEs is to use a statistical NE model. In order to capture the nature of ASR errors, the NE model is usually trained with the ASR one-best results instead of manually transcribed text and then is applied to the ASR one-best results of speech that contain NEs. To make NE detection more robust to ASR errors, we propose using Word Confusion Networks (WCNs), sequences of bundled words, for both NE modeling and detection by regarding the word bundles as units instead of the independent words. We realize this by clustering similar word bundles that may originate from the same word. We trained the NE models that predict the NE tag sequences from the sequence of the word bundles with the maximum entropy principle. Note that clustering of word bundles is conducted in advance of NE modeling and thus our proposed method can combine with any NE modeling method. We conducted experiments using real-life call-center data. 
The experimental results showed that by using the WCNs, the accuracy of NE detection improved regardless of the NE modeling method. (C) 2011 Elsevier B.V. All rights reserved. C1 [Kurata, Gakuto; Itoh, Nobuyasu; Nishimura, Masafumi] IBM Japan Ltd, IBM Res Tokyo, Yamato, Kanagawa, Japan. [Sethy, Abhinav; Ramabhadran, Bhuvana] IBM Corp, IBM Res TJ Watson Res Ctr, Yorktown Hts, NY USA. RP Kurata, G (reprint author), IBM Japan Ltd, IBM Res Tokyo, Yamato, Kanagawa, Japan. EM gakuto@jp.ibm.com FU IBM T.J. Watson Research Center FX We thank Stanley Chen of IBM T.J. Watson Research Center for his support. CR Bechet F., 2002, P INT C SPOK LANG PR, P597 Bender O., 2003, P CONLL 2003 EDM CAN, P148 Bunescu R., 2006, P EACL, V6, P9 Chen S. F., 2009, P HLT NAACL, P450, DOI 10.3115/1620754.1620820 Chen S. F., 2010, P INT, P1037 Chen SF, 1999, COMPUT SPEECH LANG, V13, P359, DOI 10.1006/csla.1999.0128 Chen SF, 2006, IEEE T AUDIO SPEECH, V14, P1596, DOI 10.1109/TASL.2006.879814 Chen S.F, 2009, P HLT NAACL, P468, DOI 10.3115/1620754.1620822 CHIEU H., 2002, P 19 INT C COMP LING, P190 Chieu H.L., 2003, P CONLL 2003, P160 Chiticariu L., 2010, P 2010 C EMP METH NA, P1002 Doddington George, 2004, P 4 INT C LANG RES E, P837 Favre B., 2005, P HLT EMNLP, P491, DOI 10.3115/1220575.1220637 Feng J., 2009, P EACL, P238, DOI 10.3115/1609067.1609093 Finkel J.R., 2005, P 43 ANN M ASS COMP, P363, DOI 10.3115/1219840.1219885 Grishman Ralph, 1996, P 16 INT C COMP LING, P466 Hakkani-Tfir D., 2003, P ICASSP, V1, P596 Hakkani-Tur D, 2006, COMPUT SPEECH LANG, V20, P495, DOI 10.1016/j.csl.2005.07.005 Hori T., 2007, P ICASSP, V4, P73 Huang J., 2001, P ACL, P298, DOI 10.3115/1073012.1073051 Huang Liang, 2009, P ACL, P522 KAZAWA H., 2002, P 19 INT C COMP LING, P1 Kubala F., 1998, P DARPA BROADC NEWS, P287 Kudo T., 2001, P 2 M N AM CHAPT ASS, P1 Kudo T., 2004, P 2004 C EMP METH NA, P230 Kurata G., 2011, P ICASSP, P5576 Mamou J., 2006, Proceedings of the Twenty-Ninth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, DOI 10.1145/1148170.1148183 Mangu L, 2000, COMPUT SPEECH LANG, V14, P373, DOI 10.1006/csla.2000.0152 Minkov E., 2005, P C HUM LANG TECHN E, P443, DOI 10.3115/1220575.1220631 Nagata M., 1994, P 15 INT C COMP LING, P201 NIST, 2008, ACE08 EV PLAN Palmer D.D., 2001, P HUM LANG TECHN WOR, P1, DOI 10.3115/1072133.1072186 Papageorgiou C.P., 1994, P HLT WORKSH, P283, DOI 10.3115/1075812.1075875 Povey D, 2008, INT CONF ACOUST SPEE, P4057, DOI 10.1109/ICASSP.2008.4518545 Ramshaw L. A., 1995, P 3 WORKSH VER LARG, P88 Sang E.F.T.K., 1999, P 9 C EUR CHAPT ASS, P173, DOI 10.3115/977035.977059 Saraclar M., 2004, P HLT NAACL, P129 Sarikaya R., 2010, P INT, P1804 Shao J., 2007, P INT, P2405 Shen D, 2004, P 42 ANN M ASS COMP, P589, DOI 10.3115/1218955.1219030 SUDOH K, 2006, P 21 INT C COMP LING, P617, DOI 10.3115/1220175.1220253 Tjong Kim Sang E.F, 2003, P 7 C NAT LANG LEARN, V4, P142 Tjong Kim Sang E.F., 2002, P CONLL, V20, P1 Tsujii J., 2000, P 18 INT C COMP LING, P201 Xue N, 2005, NAT LANG ENG, V11, P207, DOI 10.1017/S135132490400364X Yu S., 2001, PROCESSING NORMS MOD Zhai L., 2004, P HLT NAACL, P37, DOI 10.3115/1613984.1613994 Zhao Y, 2005, DATA MIN KNOWL DISC, V10, P141, DOI 10.1007/s10618-005-0361-3 Zhao Y., 2002, 02014 U MINN Zhou G., 2002, P 40 ANN M ASS COMP, P473 NR 50 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
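A small sketch, on a toy utterance, of the idea in the Kurata et al. abstract above of treating word-confusion-network bundles as units: each bundle of (word, posterior) pairs is reduced to a representative unit, and the unit sequence rather than the one-best word sequence feeds the NE tagger. The "same top word" grouping used here is a naive stand-in for the similarity-based clustering of bundles described in the paper.

```python
# Sketch: word confusion network bundles as units for NE detection features.
from collections import defaultdict

wcn = [  # one utterance as a sequence of word bundles (word, posterior)
    [("flights", 0.6), ("flight", 0.3), ("lights", 0.1)],
    [("to", 0.9), ("two", 0.1)],
    [("boston", 0.7), ("austin", 0.3)],
]

clusters = defaultdict(list)
for bundle in wcn:
    top_word = max(bundle, key=lambda wp: wp[1])[0]
    clusters[top_word].append(bundle)   # bundles sharing a top word form one unit

unit_sequence = [max(bundle, key=lambda wp: wp[1])[0] for bundle in wcn]
print(unit_sequence)   # ['flights', 'to', 'boston'] -> input features for a MaxEnt NE model
```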
PD MAR PY 2012 VL 54 IS 3 BP 491 EP 502 DI 10.1016/j.specom.2011.11.002 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 899IH UT WOS:000300809100013 ER PT J AU Gomez, AM Schwerin, B Paliwal, K AF Gomez, Angel M. Schwerin, Belinda Paliwal, Kuldip TI Improving objective intelligibility prediction by combining correlation and coherence based methods with a measure based on the negative distortion ratio SO SPEECH COMMUNICATION LA English DT Article DE Speech intelligibility; Objective measures; Speech enhancement; Distance-based methods; Correlation-based methods; Coherence-based methods; Negative spectral distortion ID SPEECH-INTELLIGIBILITY; ENHANCEMENT; ALGORITHMS; QUALITY; INDEX AB In this paper we propose a novel objective method for intelligibility prediction of enhanced speech which is based on the negative distortion ratio (NDR) - that is, the amount of power spectra that has been removed in comparison to the original clean speech signal, likely due to a bad noise estimate during the speech enhancement procedure. While negative spectral distortions can have a significant importance in subjective intelligibility assessment of processed speech, most of the objective measures in the literature do not well account for this type of distortion. The proposed method focuses on a very specific type of noise, so it is not intended to be used alone but in combination with other techniques, to jointly achieve a better intelligibility prediction. In order to find an appropriate technique to be combined with, in this paper we also review a number of recently proposed methods based on correlation and coherence measures. These methods have already shown a high correlation with human recognition scores, as they effectively detect the presence of nonlinearities, frequently found in noise-suppressed speech. However, when these techniques are jointly applied with the proposed method, significantly higher correlations (above r = 0.9) are shown to be achieved. (C) 2011 Elsevier B.V. All rights reserved. C1 [Gomez, Angel M.] Univ Granada, Fac Ciencias, Dept Teoria Senal Telemat & Comunicac, E-18071 Granada, Spain. [Gomez, Angel M.; Schwerin, Belinda; Paliwal, Kuldip] Griffith Univ, Sch Engn, Signal Proc Lab, Nathan, Qld 4111, Australia. RP Gomez, AM (reprint author), Univ Granada, Fac Ciencias, Dept Teoria Senal Telemat & Comunicac, Campus Fuentenueva S-N, E-18071 Granada, Spain. EM amgg@ugr.es; b.schwerin@griffith.edu.au; k.paliwal@griffith.edu.au RI Gomez Garcia, Angel Manuel/C-6856-2012 OI Gomez Garcia, Angel Manuel/0000-0002-9995-3068 FU Spanish Government [JC2010-0194, CEB09-0010] FX This work has been supported by the Spanish Government Grant JC2010-0194 and project CEI BioTIC GENIL (CEB09-0010). CR [Anonymous], 2001, P862 ITUT ANSI, 1997, S351997 ANSI Balakrishnan N., 1992, HDB LOGISTIC DISTRIB Boldt J., 2009, P EUSIPCO, P1849 CARTER GC, 1983, IEEE T AUDIO ELECTRO, V21, P337, DOI DOI 10.1109/TAU.1973.1162496 Christiansen C, 2010, SPEECH COMMUN, V52, P678, DOI 10.1016/j.specom.2010.03.004 DRULLMAN R, 1994, J ACOUST SOC AM, V95, P1053, DOI 10.1121/1.408467 Falk T., 2008, P INT WORKSH AC ECH FRENCH NR, 1947, J ACOUST SOC AM, V19, P90, DOI 10.1121/1.1916407 Gibbons J. 
D., 1985, NONPARAMETRIC STAT I Goldsworthy RL, 2004, J ACOUST SOC AM, V116, P3679, DOI 10.1121/1.1804628 Gomez A., 2011, P ISCA EUR C SPEECH Hu Y, 2007, J ACOUST SOC AM, V122, P1777, DOI 10.1121/1.2766778 Hu Y, 2008, IEEE T AUDIO SPEECH, V16, P229, DOI 10.1109/TASL.2007.911054 KATES JM, 1992, J ACOUST SOC AM, V91, P2236, DOI 10.1121/1.403657 Kates JM, 2005, J ACOUST SOC AM, V117, P2224, DOI 10.1121/1.1862575 Kim G., 2010, P IEEE INT C AC SPEE, V1, P4738 Loizou PC, 2011, IEEE T AUDIO SPEECH, V19, P47, DOI 10.1109/TASL.2010.2045180 Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst Ludvigsen C, 1993, Scand Audiol Suppl, V38, P50 Ma JF, 2009, J ACOUST SOC AM, V125, P3387, DOI 10.1121/1.3097493 Ma JF, 2011, SPEECH COMMUN, V53, P340, DOI 10.1016/j.specom.2010.10.005 MOORE BCJ, 1983, J ACOUST SOC AM, V74, P750, DOI 10.1121/1.389861 Pearce D., 2000, P ICSLP, V4, P29 Rothauser E. H., 1969, IEEE T AUDIO ELECTRO, V17, P225, DOI DOI 10.1109/TAU.1969.1162058 Scalart P., 1996, P ICASSP, V2, P629 SILVERMAN HF, 1976, IEEE T ACOUST SPEECH, V24, P289, DOI 10.1109/TASSP.1976.1162814 STEENEKEN HJM, 1980, J ACOUST SOC AM, V67, P318, DOI 10.1121/1.384464 Taal CH, 2011, IEEE T AUDIO SPEECH, V19, P2125, DOI 10.1109/TASL.2011.2114881 NR 29 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2012 VL 54 IS 3 BP 503 EP 515 DI 10.1016/j.specom.2011.11.001 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 899IH UT WOS:000300809100014 ER PT J AU Ward, NG Vega, A Baumann, T AF Ward, Nigel G. Vega, Alejandro Baumann, Timo TI Prosodic and temporal features for language modeling for dialog SO SPEECH COMMUNICATION LA English DT Article DE Dialog dynamics; Dialog state; Prosody; Interlocutor behavior; Word probabilities; Prediction; Perplexity; Speech recognition; Switchboard corpus; Verbmobil corpus ID SPEECH RECOGNITION; COMMUNICATION; CONVERSATION; FRAMEWORK; SPEAKING; ENGLISH AB If we can model the cognitive and communicative processes underlying speech, we should be able to better predict what a speaker will do. With this idea as inspiration, we examine a number of prosodic and timing features as potential sources of information on what words the speaker is likely to say next. In spontaneous dialog we find that word probabilities do vary with such features. Using perplexity as the metric, the most informative of these included recent speaking rate, volume, and pitch, and time until end of utterance. Using simple combinations of such features to augment trigram language models gave up to a 8.4% perplexity benefit on the Switchboard corpus, and up to a 1.0% relative reduction in word error rate (0.3% absolute) on the Verbmobil II corpus. (C) 2011 Elsevier B.V. All rights reserved. C1 [Ward, Nigel G.; Vega, Alejandro] Univ Texas El Paso, El Paso, TX 79968 USA. [Baumann, Timo] Univ Potsdam, Dept Linguist, D-14476 Potsdam, Germany. RP Ward, NG (reprint author), Univ Texas El Paso, 500 W Univ Ave, El Paso, TX 79968 USA. 
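A minimal sketch of the evaluation metric used in the Ward et al. abstract above: perplexity of a word sequence, comparing a plain trigram probability stream with one interpolated with a hypothetical prosody-conditioned component. All probabilities and the interpolation weight are made-up illustrative numbers, not the paper's models.

```python
# Sketch: perplexity of a word stream under a trigram LM vs. a prosody-augmented LM.
import math

def perplexity(word_probs):
    """exp of the average negative log-probability per word."""
    return math.exp(-sum(math.log(p) for p in word_probs) / len(word_probs))

trigram_probs = [0.12, 0.08, 0.20, 0.05]   # P(w_i | w_{i-2}, w_{i-1})
prosody_probs = [0.10, 0.15, 0.18, 0.09]   # P(w_i | recent rate, volume, pitch, ...)
lam = 0.9                                   # interpolation weight (assumed)

interpolated = [lam * t + (1.0 - lam) * p
                for t, p in zip(trigram_probs, prosody_probs)]
print(round(perplexity(trigram_probs), 2), round(perplexity(interpolated), 2))
```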
EM nigelward@acm.org; avega5@miners.utep.edu; mail@timobaumann.de FU NSF [IIS-0415150, IIS-0914868]; US Army Research, Development and Engineering Command; USC Institute for Creative Technologies; DFG FX We thank Shreyas Karkhedkar, Nisha Kiran, Shubhra Datta, Gary Beverungen, and Justin McManus for contributing to the code and analysis; David Novick, Olac Fuentes and many thoughtful reviewers for Speech Communication, Interspeech, ASRU, and the National Science Foundation for comments; and Joe Picone and Andreas Stolcke for making available the Switchboard labeling and the SRILM toolkit, respectively. This work was supported in part by NSF Grants IIS-0415150 and IIS-0914868 and REU supplements thereto, by the US Army Research, Development and Engineering Command via a subcontract to the USC Institute for Creative Technologies, and by a DFG grant in the Emmy Noether program. CR Ananthakrishnan S, 2007, INT CONF ACOUST SPEE, P873 Bard E. G., 2002, EDILOG 2002, P29 Barsalou Lawrence W, 2007, Cogn Process, V8, P79, DOI 10.1007/s10339-007-0163-1 Batliner A., 1995, 88 VERBM PROJ Batliner A., 2001, ISCA WORKSH PROS SPE Beebe B, 2008, J PSYCHOLINGUIST RES, V37, P293, DOI 10.1007/s10936-008-9078-y Bell A, 2009, J MEM LANG, V60, P92, DOI 10.1016/j.jml.2008.06.003 Bellegarda JR, 2004, SPEECH COMMUN, V42, P93, DOI 10.1016/j.specom.2003.08.002 Bengio Y, 2003, J MACH LEARN RES, V3, P1137, DOI 10.1162/153244303322533223 Bradlow A. R., 2009, LANG SPEECH, V52, P391 Braun B, 2011, LANG COGNITIVE PROC, V26, P350, DOI 10.1080/01690965.2010.492641 BRENNAN SE, 1995, KNOWL-BASED SYST, V8, P143, DOI 10.1016/0950-7051(95)98376-H Campbell N, 2007, LECT NOTES COMPUT SC, V4775, P117 Chen K., 2007, ROBUST SPEECH RECOGN, P319 Chen S., 1998, DARPA BROADC NEWS TR Clark H. H., 1996, USING LANGUAGE Clark HH, 2002, COGNITION, V84, P73, DOI 10.1016/S0010-0277(02)00017-3 Clark HH, 2002, SPEECH COMMUN, V36, P5, DOI 10.1016/S0167-6393(01)00022-X Ferrer L., 2003, ICASSP Fujisaki H, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P1 Garrod S, 2004, TRENDS COGN SCI, V8, P8, DOI 10.1016/j.tics.2003.10.016 Godfrey J., 1992, P ICASSP, P517, DOI 10.1109/ICASSP.1992.225858 GOFFMAN E, 1978, LANGUAGE, V54, P787, DOI 10.2307/413235 Gratch J., 2006, 6 INT C INT VIRT AG, P14 Hamaker J., 1998, RULES GUIDELINES TRA Huang S., 2007, LNCS, V4892, P191 ISIP, 2003, MAN CORR SWITCHB WOR Jaffe J., 1978, NONVERBAL BEHAV COMM, P55 Jahr E., 2007, J SPEECH LANG PATHOL, V2, P190 Jekat S., 1997, 62 VM LMU MUNCH U HA Jelinek F., 1997, STAT METHODS SPEECH Ji G., 2010, ICASSP 2010 Ji G., 2004, C HUM LANG TECHN Levinson Stephen C, 2006, ROOTS HUMAN SOCIALIT, P39 Ma KW, 2000, SPEECH COMMUN, V31, P51, DOI 10.1016/S0167-6393(99)00060-6 Macrae CN, 2008, COGNITION, V109, P152, DOI 10.1016/j.cognition.2008.07.007 Morgan N., 1998, ICASSP, P721 O'Connell DC, 2008, COGN LANG A SER PSYC, P3, DOI 10.1007/978-0-387-77632-3_1 Petukhova V., 2009, NAACL HLT Pickering MJ, 2004, BEHAV BRAIN SCI, V27, P169 Raux A., 2009, HUMAN LANGUAGE TECHN SACKS H, 1974, LANGUAGE, V50, P696, DOI 10.2307/412243 Sebanz N, 2006, TRENDS COGN SCI, V10, P70, DOI 10.1016/j.tics.2005.12.009 Shriberg E, 2004, IMA V MATH, V138, P105 Shriberg E., 2004, P INT C SPEECH PROS, P575 Shriberg E., 2001, J INT PHON ASSOC, V31, P153 Stolcke A., 1999, P 6 EUR C SPEECH COM Stolcke A., 2002, P INT C SPOK LANG PR Streeck J, 2009, DISCOURSE PROCESS, V46, P93, DOI 10.1080/01638530902728777 Truong KP, 2007, SPEECH COMMUN, V49, P144, DOI 
10.1016/j.specom.2007.01.001 Vicsi K, 2010, SPEECH COMMUN, V52, P413, DOI 10.1016/j.specom.2010.01.003 Ward NG, 2009, INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, VOLS 1-5, P160 Walker W., 2004, TR20040811 SMLI SUN Ward N., 2006, PRAGMAT COGN, V14, P113 Ward N, 2000, J PRAGMATICS, V32, P1177, DOI 10.1016/S0378-2166(99)00109-5 Ward N. G., 2010, INTERSPEECH Ward N. G., 1999, ESCA WORKSH DIAL PRO, P83 Ward NG, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P1606 Ward N. G., 2010, SPEECH PROS Ward N. G., 2010, UTEPCS22 DEP COMP SC Ward N. G., 2009, 11 IEEE WORKSH AUT S, P323 Xu P, 2007, COMPUT SPEECH LANG, V21, P105, DOI 10.1016/j.csl.2006.01.003 Yngve Victor, 1970, 6 REG M CHIC LING SO, P567 NR 63 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2012 VL 54 IS 2 BP 161 EP 174 DI 10.1016/j.specom.2011.07.009 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 853RY UT WOS:000297444600001 ER PT J AU Andersson, S Yamagishi, J Clark, RAJ AF Andersson, Sebastian Yamagishi, Junichi Clark, Robert A. J. TI Synthesis and evaluation of conversational characteristics in HMM-based speech synthesis SO SPEECH COMMUNICATION LA English DT Article DE Speech synthesis; HMM; Conversation; Spontaneous speech; Filled pauses; Discourse marker ID SPEAKING; READ AB Spontaneous conversational speech has many characteristics that are currently not modelled well by HMM-based speech synthesis and in order to build synthetic voices that can give an impression of someone partaking in a conversation, we need to utilise data that exhibits more of the speech phenomena associated with conversations than the more generally used carefully read aloud sentences. In this paper we show that synthetic voices built with HMM-based speech synthesis techniques from conversational speech data, preserved segmental and prosodic characteristics of frequent conversational speech phenomena. An analysis of an evaluation investigating the perception of quality and speaking style of HMM-based voices confirms that speech with conversational characteristics are instrumental for listeners to perceive successful integration of conversational speech phenomena in synthetic speech. The achieved synthetic speech quality provides an encouraging start for the continued use of conversational speech in HMM-based speech synthesis. (C) 2011 Elsevier B.V. All rights reserved. C1 [Andersson, Sebastian; Yamagishi, Junichi; Clark, Robert A. J.] Univ Edinburgh, Ctr Speech Technol Res, Edinburgh EH8 9AB, Midlothian, Scotland. RP Andersson, S (reprint author), Univ Edinburgh, Ctr Speech Technol Res, 10 Crichton St, Edinburgh EH8 9AB, Midlothian, Scotland. EM J.S.Andersson@sms.ed.ac.uk; jyamagis@inf.ed.ac.uk; robert@cstr.ed.ac.uk FU Marie Curie Early Stage Training Site EdSST [MEST-CT-2005-020568] FX The authors are grateful to David Traum and Kallirroi Georgila at the USC Institute for Creative Technologies (http://www.ict.usc.edu) for making the speech data available to us. The first author is supported by Marie Curie Early Stage Training Site EdSST (MEST-CT-2005-020568). CR Adell J., 2008, P INT BRISB AUSTR, P2278 Adell J., 2010, P SPEECH PROS CHIC U, V100624, P1 Anderson S. 
H., 2010, Proceedings of the 19th World Congress of Soil Science: Soil solutions for a changing world, Brisbane, Australia, 1-6 August 2010. Symposium 4.4.2 Attracting (young) people to a soils career, P1, DOI 10.1109/NEBC.2010.5458172 Andersson S., 2010, P SSW7 KYOT JAP, P173 Aylett M, 2006, J ACOUST SOC AM, V119, P3048, DOI 10.1121/1.2188331 Aylett M. P., 2007, P AISB 2007 NEWC UK, P174 Badino L., 2009, P INT BRIGHT UK, P520 Bell A, 2003, J ACOUST SOC AM, V113, P1001, DOI 10.1121/1.1534836 BLAAUW E, 1994, SPEECH COMMUN, V14, P359, DOI 10.1016/0167-6393(94)90028-0 CADIC D, 2008, P INT BRISB AUSTR, P1861 Campbell N, 2007, P SSW6 BONN GERM, P22 Campbell N, 2006, P SPEECH PROS DRESD Clark HH, 2002, COGNITION, V84, P73, DOI 10.1016/S0010-0277(02)00017-3 Gravano A., 2007, P 45 ANN M ASS COMP, P800 Gustafsson K., 2004, P SSW5 PITTSB US, P145 Jurafsky D., 1998, COLING ACL 98 WORKSH Kawahara H., 2001, P 2 MAVEBA FIR IT Kawahara H, 1999, SPEECH COMMUN, V27, P187, DOI 10.1016/S0167-6393(98)00085-5 King S., 2009, P BLIZZ CHALL WORKSH Laan GPM, 1997, SPEECH COMMUN, V22, P43, DOI 10.1016/S0167-6393(97)00012-5 Lasarcyk E., 2010, P SSW7 KYOT JAP, P230 Lee CH, 2010, INT CONF ACOUST SPEE, P4826, DOI 10.1109/ICASSP.2010.5495140 Nakamura M, 2008, COMPUT SPEECH LANG, V22, P171, DOI 10.1016/j.csl.2007.07.003 O'Shaughnessy D., 1992, P ICASSP, P521, DOI 10.1109/ICASSP.1992.225857 Odell JJ, 1995, THESIS CAMBRIDGE U Romportl J., 2010, P SSW7 KYOT JAP, P120 Schiffrin Deborah, 1987, DISCOURSE MARKERS Shriberg E., 1999, P INT C PHON SCI SAN, P619 Shriberg E., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607996 Sundaram S, 2002, PROCEEDINGS OF THE 2002 IEEE WORKSHOP ON SPEECH SYNTHESIS, P203, DOI 10.1109/WSS.2002.1224409 Tokuda K, 2002, PROCEEDINGS OF THE 2002 IEEE WORKSHOP ON SPEECH SYNTHESIS, P227 Traum D. R., 2008, RECENT TRENDS DISCOU, P45, DOI [10.1007/978-1-4020-6821-8_3, DOI 10.1007/978-1-4020-6821-8_3] Wahlster W., 2000, VERBMOBIL FDN SPEECH Yamagishi J, 2005, IEICE T INF SYST, VE88D, P502, DOI 10.1093/ietisy/e88-d.3.502 Young S., 2006, HTK BOOK Zen H, 2009, SPEECH COMMUN, V51, P1039, DOI 10.1016/j.specom.2009.04.004 Zen HG, 2007, IEICE T INF SYST, VE90D, P325, DOI 10.1093/ietisy/e90-d.1.325 NR 37 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2012 VL 54 IS 2 BP 175 EP 188 DI 10.1016/j.specom.2011.08.001 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 853RY UT WOS:000297444600002 ER PT J AU Bouton, S Cole, P Serniclaes, W AF Bouton, Sophie Cole, Pascale Serniclaes, Willy TI The influence of lexical knowledge on phoneme discrimination in deaf children with cochlear implants SO SPEECH COMMUNICATION LA English DT Article DE Speech perception; Phonological features; Cochlear implant; Speech development; Lexicality effect ID AUDIOVISUAL SPEECH-PERCEPTION; WORD RECOGNITION; FINE-STRUCTURE; TEMPORAL CUES; USERS; FRENCH; CATEGORIZATION; INFORMATION; ACCESS; SKILLS AB This paper addresses the questions of whether lexical information influences phoneme discrimination in children with cochlear implants (CI) and whether this influence is similar to what occurs in normal-hearing (NH) children. Previous research with CI children evidenced poor accuracy in phonemic perception, which might have an incidence on the use of lexical information in phoneme discrimination. 
A discrimination task with French vowels and consonants in minimal pairs of words (e.g., mouche/bouche) or pseudowords (e.g., moute/boute) was used to search for possible differences in the use of lexical knowledge between CI children and NH children matched for listening age. Minimal pairs differed in a single consonant or vowel feature (e.g., nasality, vocalic aperture, voicing) to unveil possible interactions between phonological/acoustic and lexical processing. The results showed that both the word and pseudoword discrimination of CI children are inferior to those of NH children, with the magnitude of the deficit depending on the feature. However, word discrimination was better than pseudoword discrimination, and this lexicality effect was equivalent for both CI and NH children. Further, this lexicality effect did not depend on the feature in either group. Our results support the idea that a period of hearing deprivation may not have consequences for the lexical processes involved in speech perception. (C) 2011 Elsevier B.V. All rights reserved. C1 [Bouton, Sophie; Cole, Pascale] Aix Marseille Univ, Lab Cognit Psychol, CNRS, UMR 6146, Aix En Provence, France. [Bouton, Sophie; Serniclaes, Willy] Paris Descartes Univ, Lab Psychol Percept, CNRS, UMR 8158, Paris, France. RP Bouton, S (reprint author), Univ Aix Marseille 1, Lab Psychol Cognit, Batiment 9,Case D,3 Pl Victor Hugo, F-13331 Marseille 3, France. EM Sophie.Bouton@univ-provence.fr RI Imhof, Margarete/F-8471-2011 CR Abramson A. S., 1985, PHONETIC LINGUISTICS, P25 ANDRUSKI JE, 1994, COGNITION, V52, P163, DOI 10.1016/0010-0277(94)90042-6 Bergeson TR, 2005, EAR HEARING, V26, P149, DOI 10.1097/00003446-200504000-00004 Bergeson TR, 2003, VOLTA REV, V103, P347 Bertoncini J, 2009, J SPEECH LANG HEAR R, V52, P682, DOI 10.1044/1092-4388(2008/07-0273) Bouton S., J SPEECH LA IN PRESS Brancazio L, 2004, J EXP PSYCHOL HUMAN, V30, P445, DOI 10.1037/0096-1523.30.3.445 Burns TC, 2007, APPL PSYCHOLINGUIST, V28, P455, DOI 10.1017/S0142716407070257 Chiappe P, 2001, J EXP CHILD PSYCHOL, V80, P58, DOI 10.1006/jecp.2000.2624 CONNINE CM, 1987, J EXP PSYCHOL HUMAN, V13, P291, DOI 10.1037/0096-1523.13.2.291 CONTENT A, 1988, CAH PSYCHOL COGN, V8, P399 Dorman Michael F, 2002, Am J Audiol, V11, P119, DOI 10.1044/1059-0889(2002/014) EILERS RE, 1979, CHILD DEV, V50, P14 Fort M, 2010, SPEECH COMMUN, V52, P525, DOI 10.1016/j.specom.2010.02.005 FOX RA, 1984, J EXP PSYCHOL HUMAN, V10, P526, DOI 10.1037//0096-1523.10.4.526 GANONG WF, 1980, J EXP PSYCHOL HUMAN, V6, P110, DOI 10.1037/0096-1523.6.1.110 Gaskell MG, 1997, LANG COGNITIVE PROC, V12, P613 Geers A., 2003, EAR HEARING, V24, P24 Harnsberger JD, 2001, J ACOUST SOC AM, V109, P2135, DOI 10.1121/1.1350403 Hoonhorst I, 2009, J EXP CHILD PSYCHOL, V104, P353, DOI 10.1016/j.jecp.2009.07.005 Hoonhorst I, 2009, CLIN NEUROPHYSIOL, V120, P897, DOI 10.1016/j.clinph.2009.02.174 Kaiser AR, 2003, J SPEECH LANG HEAR R, V46, P390, DOI 10.1044/1092-4388(2003/032) KIRK KI, 1995, EAR HEARING, V16, P470, DOI 10.1097/00003446-199510000-00004 Knudsen EI, 2004, J COGNITIVE NEUROSCI, V16, P1412, DOI 10.1162/0898929042304796 Lachs L, 2001, EAR HEARING, V22, P236, DOI 10.1097/00003446-200106000-00007 Lete B, 2004, BEHAV RES METH INS C, V36, P156, DOI 10.3758/BF03195560 Leybaert J., 2007, ENFANCE, V59, P245, DOI [10.3917/enf.593.0245, DOI 10.3917/ENF.593.0245] MacMillan N.
A., 2005, DETECTION THEORY USE MCCLELLAND JL, 1986, COGNITIVE PSYCHOL, V18, P1, DOI 10.1016/0010-0285(86)90015-0 MCGURK H, 1976, NATURE, V264, P746, DOI 10.1038/264746a0 McMurray B, 2003, J PSYCHOLINGUIST RES, V32, P77, DOI 10.1023/A:1021937116271 MCQUEEN JM, 1991, J EXP PSYCHOL HUMAN, V17, P433, DOI 10.1037/0096-1523.17.2.433 Medina V., 2009, REV LOGOPEDIA FONIAT, V29, P186, DOI DOI 10.1016/S0214-4603(09)70027-0 Misiurski C, 2005, BRAIN LANG, V93, P64, DOI 10.1016/j.bandl.2004.08.001 Norris D, 2003, COGNITIVE PSYCHOL, V47, P204, DOI 10.1016/S0010-0285(03)00006-9 O'Donoghue GM, 2000, LANCET, V356, P466, DOI 10.1016/S0140-6736(00)02555-1 Raven JC, 1947, COLOURED PROGR MATRI REPP BH, 1982, PSYCHOL BULL, V92, P81, DOI 10.1037//0033-2909.92.1.81 Repp B.H, 1984, SPEECH LANGUAGE ADV, V10, P234 Rivera-Gaxiola M, 2005, DEVELOPMENTAL SCI, V8, P162, DOI 10.1111/j.1467-7687.2005.00403.x ROSEN S, 1992, PHILOS T ROY SOC B, V336, P367, DOI 10.1098/rstb.1992.0070 Rouger J, 2008, BRAIN RES, V1188, P87, DOI 10.1016/j.brainres.2007.10.049 SAERENS M, 1989, LANG SPEECH, V32, P291 Schatzer R, 2010, ACTA OTO-LARYNGOL, V130, P1031, DOI 10.3109/00016481003591731 Serniclaes W., 2010, PORTUGUESE J LINGUIS, V9 SHANNON RV, 1995, SCIENCE, V270, P303, DOI 10.1126/science.270.5234.303 Smith ZM, 2002, NATURE, V416, P87, DOI 10.1038/416087a Spencer LJ, 2009, J DEAF STUD DEAF EDU, V14, P1, DOI 10.1093/deafed/enn013 Stevens KN, 2000, PHONETICA, V57, P139, DOI 10.1159/000028468 TYEMURRAY N, 1995, J ACOUST SOC AM, V98, P2454, DOI 10.1121/1.413278 Tyler RS, 1997, OTOLARYNG HEAD NECK, V117, P180, DOI 10.1016/S0194-5998(97)70172-4 van Linden S, 2007, NEUROSCI LETT, V420, P49, DOI 10.1016/j.neulet.2007.04.006 Verschuur C, 2001, BRIT J AUDIOL, V35, P209 WHALEN DH, 1991, PERCEPT PSYCHOPHYS, V50, P351, DOI 10.3758/BF03212227 Won JH, 2010, EAR HEARING, V31, P796, DOI 10.1097/AUD.0b013e3181e8b7bd Xu L, 2007, J ACOUST SOC AM, V122, P1758, DOI 10.1121/1.2767000 NR 56 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2012 VL 54 IS 2 BP 189 EP 198 DI 10.1016/j.specom.2011.08.002 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 853RY UT WOS:000297444600003 ER PT J AU Gudnason, J Thomas, MRP Ellis, DPW Naylor, PA AF Gudnason, Jon Thomas, Mark R. P. Ellis, Daniel P. W. Naylor, Patrick A. TI Data-driven voice source waveform analysis and synthesis SO SPEECH COMMUNICATION LA English DT Article DE Voice source signal; Inverse filtering; Vocal tract modeling; Principal component analysis; Gaussian mixture model; Segmental signal to reconstruction ratio ID VOCAL-FOLD VIBRATION; LINEAR PREDICTION; GLOTTAL CLOSURE; SPEAKER IDENTIFICATION; SPEECH ANALYSIS; TUTORIAL; MODEL; ALGORITHM; QUALITY; PHASE AB A data-driven approach is introduced for studying, analyzing and processing the voice source signal. Existing approaches parameterize the voice source signal by using models that are motivated, for example, by a physical model or function-fitting. Such parameterization is often difficult to achieve and it produces a poor approximation to a large variety of real voice source waveforms of the human voice. This paper presents a novel data-driven approach to analyze different types of voice source waveforms using principal component analysis and Gaussian mixture modeling. This approach models certain voice source features that many other approaches fail to model. 
Prototype voice source waveforms are obtained from each mixture component and analyzed with respect to speaker, phone and pitch. An analysis/synthesis scheme was set up to demonstrate the effectiveness of the method. Compression of the proposed voice source by discarding 75% of the features yields a segmental signal-to-reconstruction error ratio of 13 dB and a Bark spectral distortion of 0.14. (C) 2011 Elsevier B.V. All rights reserved. C1 [Gudnason, Jon] Reykjavik Univ, Sch Sci & Engn, Reykjavik, Iceland. [Thomas, Mark R. P.; Naylor, Patrick A.] Univ London Imperial Coll Sci Technol & Med, Dept Elect & Elect Engn, London SW7 2AZ, England. [Naylor, Patrick A.] Columbia Univ, LabROSA, New York, NY 10027 USA. RP Gudnason, J (reprint author), Reykjavik Univ, Sch Sci & Engn, Reykjavik, Iceland. EM jg@ru.is; mark.r.thomas02@imperial.ac.uk; dpwe@ee.columbia.edu; p.naylor@imperial.ac.uk CR ABBERTON ERM, 1989, CLIN LINGUIST PHONET, V3, P281, DOI 10.3109/02699208908985291 Alipour F, 2000, J ACOUST SOC AM, V108, P3003, DOI 10.1121/1.1324678 ALKU P, 1992, SPEECH COMMUN, V11, P109, DOI 10.1016/0167-6393(92)90005-R Alku P, 2009, J ACOUST SOC AM, V125, P3289, DOI 10.1121/1.3095801 Ananthapadmanabha T., 1984, 2 ROYAL I TECHN SPEE, P1 ATAL BS, 1971, J ACOUST SOC AM, V50, P637, DOI 10.1121/1.1912679 Backstrom T, 2002, IEEE T SPEECH AUDI P, V10, P186, DOI 10.1109/TSA.2002.1001983 BLACK AW, 2007, STAT PARAMETRIC SPEE, V4, P1229 Brookes D.M., 1994, P I ACOUSTICS, V15, P501 Carlson R., 1989, P INT C AC SPEECH SI, P223 Cataldo E., 2006, J BRAZILIAN SOC MECH, P28 CHAN DSF, 1989, P EUR C SPEECH COMM, V33 CHILDERS DG, 1995, IEEE T SPEECH AUDI P, V3, P209, DOI 10.1109/89.388148 CUMMINGS KE, 1995, DIGIT SIGNAL PROCESS, V5, P21, DOI 10.1006/dspr.1995.1003 DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1 Ding C, 2004, P INT C MACH LEARN I, P225, DOI 10.1145/1015330.1015408 Drugman T., 2009, P IEEE INT C AC SPEE Duda R. O., 2001, PATTERN CLASSIFICATI Eysholdt U, 1996, FOLIA PHONIATR LOGO, V48, P163 Fant G., 1960, ACOUSTIC THEORY SPEE Fant G., 1985, STL QPSR, V26, P1 Flanagan J., 1972, SPEECH ANAL SYNTHESI FLANAGAN JL, 1968, IEEE T ACOUST SPEECH, VAU16, P57, DOI 10.1109/TAU.1968.1161949 FUJISAKI H, 1987, P IEEE INT C AC SPEE, V12, P637 Gudnason J., 2009, P INT C BRIGHT UK Gudnason J, 2008, INT CONF ACOUST SPEE, P4821, DOI 10.1109/ICASSP.2008.4518736 Hartigan J. A., 1979, Applied Statistics, V28, DOI 10.2307/2346830 Hirano M, 1981, CLIN EXAMINATION VOI IEC, 2003, 616722003 IEC ISHIZAKA K, 1972, AT&T TECH J, V51, P1233 Jurafsky Daniel, 2000, SPEECH LANGUAGE PROC KLATT DH, 1990, J ACOUST SOC AM, V87, P820, DOI 10.1121/1.398894 Kumar A., 1997, P IEEE WORKSH SPEECH, P3 LECLUSE F, 1975, FOLIA PHONIATR, V17, P215 Lindsey G., 1987, SPARS ARCH ACTUAL WO Ma C, 1994, IEEE T SPEECH AUDI P, V2, P258 MAKHOUL J, 1975, P IEEE, V63, P561, DOI 10.1109/PROC.1975.9792 Markel JD, 1976, LINEAR PREDICTION SP MOULINES E, 1990, SPEECH COMMUN, V9, P453, DOI 10.1016/0167-6393(90)90021-Z Naylor PA, 2007, IEEE T AUDIO SPEECH, V15, P34, DOI 10.1109/TASL.2006.876878 Plumpe MD, 1999, IEEE T SPEECH AUDI P, V7, P569, DOI 10.1109/89.784109 ROSENBER.AE, 1971, J ACOUST SOC AM, V49, P583, DOI 10.1121/1.1912389 SPANIAS AS, 1994, P IEEE, V82, P1541, DOI 10.1109/5.326413 STORY BH, 1995, J ACOUST SOC AM, V97, P1249, DOI 10.1121/1.412234 STRUBE HW, 1974, J ACOUST SOC AM, V56, P1625, DOI 10.1121/1.1903487 Stylianou Y, 1998, IEEE T SPEECH AUDI P, V6, P131, DOI 10.1109/89.661472 Svec JG, 1996, J VOICE, V10, P201, DOI 10.1016/S0892-1997(96)80047-6 Thomas M. R. 
P., 2008, P EUR SIGN PROC C EU Thomas M.R.P., 2010, P IEEE INT C AC SPEE Thomas M.R.P., 2010, DETECTION GLOT UNPUB Titze IR, 2000, J ACOUST SOC AM, V107, P581, DOI 10.1121/1.428324 WONG DY, 1979, IEEE T ACOUST SPEECH, V27, P350, DOI 10.1109/TASSP.1979.1163260 Zhu W., 1996, P INT C SPOK LANG PR, P1413 NR 53 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2012 VL 54 IS 2 BP 199 EP 211 DI 10.1016/j.specom.2011.08.003 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 853RY UT WOS:000297444600004 ER PT J AU Saon, G Soltau, H AF Saon, George Soltau, Hagen TI Boosting systems for large vocabulary continuous speech recognition SO SPEECH COMMUNICATION LA English DT Article DE Speech recognition; Boosting; Acoustic modeling AB We employ a variant of the popular Adaboost algorithm to train multiple acoustic models such that the aggregate system exhibits improved performance over the individual recognizers. Each model is trained sequentially on re-weighted versions of the training data. At each iteration, the weights are decreased for the frames that are correctly decoded by the current system. These weights are then multiplied with the frame-level statistics for the decision trees and Gaussian mixture components of the next iteration system. The composite system uses a log-linear combination of HMM state observation likelihoods. We report experimental results on several broadcast news transcription setups which differ in the language being spoken (English and Arabic) and amounts of training data. Additionally, we study the impact of boosting on maximum likelihood (ML) and discriminatively trained acoustic models. Our findings suggest that significant gains can be obtained for small amounts of training data even after feature and model-space discriminative training. (C) 2011 Elsevier B.V. All rights reserved. C1 [Saon, George; Soltau, Hagen] IBM TJ Watson Res Ctr, Yorktown Hts, NY 10598 USA. RP Saon, G (reprint author), IBM TJ Watson Res Ctr, Yorktown Hts, NY 10598 USA. EM gsaon@us.ibm.com FU DARPA [HR0011-06-2-0001] FX The authors acknowledge the support of DARPA under Grant HR0011-06-2-0001 for funding part of this work. The views, opinions, and/or findings contained in this article/presentation are those of the author/presenter and should not be interpreted as representing the official views or policies, either expressed or implied, of the Defense Advanced Research Projects Agency or the Department of Defense. CR Breslin C., 2007, INT 07, P1441 Dimitrakakis C., 2004, ICASSP 04, P621 Du J., 2010, INT 10, P2942 Eibl G., 2002, ECML 02, P72 Freund Y, 1997, J COMPUT SYST SCI, V55, P119, DOI 10.1006/jcss.1997.1504 Povey D, 2008, INT CONF ACOUST SPEE, P4057, DOI 10.1109/ICASSP.2008.4518545 Saon G, 2010, INT CONF ACOUST SPEE, P4378, DOI 10.1109/ICASSP.2010.5495640 Saon G, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P920 Schapire RE, 1999, MACH LEARN, V37, P297, DOI 10.1023/A:1007614523901 Siohan O, 2005, INT CONF ACOUST SPEE, P197 Soltau H., 2010, WORKSH SPEECH LANG T, P97 Tang H., 2010, ICASSP 10, P2274 Zhang R., 2004, ICSLP 04 Zhu J, 2009, STAT INTERFACE, V2, P349 Zweig G., 2000, ACOUST SPEECH SIG PR, P1527 NR 15 TC 9 Z9 10 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
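A schematic sketch of the reweighting loop described in the Saon and Soltau abstract above: frames the current system decodes correctly have their weights reduced before the next model is built, and the final score is a weighted (log-linear) combination of per-model state log-likelihoods. The 0.5 down-weighting factor, the simulated log-likelihoods, and the combination weights are assumptions for illustration, not the paper's exact recipe.

```python
# Sketch: Adaboost-style frame reweighting across sequentially trained acoustic models.
import numpy as np

rng = np.random.default_rng(1)
n_frames, n_models = 10, 3
frame_weights = np.ones(n_frames)        # multiplied into frame-level statistics
model_logliks = []

for m in range(n_models):
    # Stand-in for training an acoustic model on the weighted statistics and
    # scoring each frame with it.
    logliks = rng.normal(-5.0, 1.0, n_frames)
    model_logliks.append(logliks)

    correctly_decoded = rng.random(n_frames) > 0.4   # stand-in for frame accuracy
    frame_weights[correctly_decoded] *= 0.5          # de-emphasise easy frames
    frame_weights *= n_frames / frame_weights.sum()  # keep total weight mass fixed

# Log-linear combination of the per-model state log-likelihoods.
combination_weights = np.array([0.5, 0.3, 0.2])
combined = combination_weights @ np.vstack(model_logliks)
print(combined.shape)                                # (10,) combined frame scores
```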
PD FEB PY 2012 VL 54 IS 2 BP 212 EP 218 DI 10.1016/j.specom.2011.07.011 PG 7 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 853RY UT WOS:000297444600005 ER PT J AU Kurata, G Sethy, A Ramabhadran, B Rastrow, A Itoh, N Nishimura, M AF Kurata, Gakuto Sethy, Abhinav Ramabhadran, Bhuvana Rastrow, Ariya Itoh, Nobuyasu Nishimura, Masafumi TI Acoustically discriminative language model training with pseudo-hypothesis SO SPEECH COMMUNICATION LA English DT Article DE Discriminative training; Language model; Phonetic confusability; Finite state transducer ID HIDDEN MARKOV-MODELS; SPEECH RECOGNITION; DIVERGENCE AB Recently proposed methods for discriminative language modeling require alternate hypotheses in the form of lattices or N-best lists. These are usually generated by an Automatic Speech Recognition (ASR) system on the same speech data used to train the system. This requirement restricts the scope of these methods to corpora where both the acoustic material and the corresponding true transcripts are available. Typically, the text data available for language model (LM) training is an order of magnitude larger than manually transcribed speech. This paper provides a general framework to take advantage of this volume of textual data in the discriminative training of language models. We propose to generate probable N-best lists directly from the text material, which resemble the N-best lists produced by an ASR system by incorporating phonetic confusability estimated from the acoustic model of the ASR system. We present experiments with Japanese spontaneous lecture speech data, which demonstrate that discriminative LM training with the proposed framework is effective and provides modest gains in ASR accuracy. (C) 2011 Elsevier B.V. All rights reserved. C1 [Kurata, Gakuto; Itoh, Nobuyasu; Nishimura, Masafumi] IBM Japan Ltd, IBM Res Tokyo, Yamato, Kanagawa, Japan. [Sethy, Abhinav; Ramabhadran, Bhuvana] IBM Corp, IBM Res TJ Watson Res Ctr, Yorktown Hts, NY USA. [Rastrow, Ariya] Johns Hopkins Univ, Ctr Language & Speech Proc, Baltimore, MD USA. RP Kurata, G (reprint author), IBM Japan Ltd, IBM Res Tokyo, Yamato, Kanagawa, Japan. EM gakuto@jp.ibm.com CR Bahl L.R., 1977, J ACOUST SOC AM 1, V62, pS63 Berger A., 1998, P ICASSP, VII, P705, DOI 10.1109/ICASSP.1998.675362 Bhattacharyya A., 1943, Bulletin of the Calcutta Mathematical Society, V35 Chen J.-Y., 2007, P INTERSPEECH 2007 A, P2089 Chen S. F., 2009, P HLT NAACL, P450, DOI 10.3115/1620754.1620820 Chen SF, 1999, COMPUT SPEECH LANG, V13, P359, DOI 10.1006/csla.1999.0128 Chen Z., 2000, P ICSLP, V1, P493 Collins M, 2002, PROCEEDINGS OF THE 2002 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, P1 Gauvain J., 2003, P RICH TRANSCR WORKS Hershey J. R., 2007, P ICASSP, V4, P317 Hershey John R., 2007, P ASRU, P323 Hershey JR, 2008, INT CONF ACOUST SPEE, P4557, DOI 10.1109/ICASSP.2008.4518670 Kuo H., 2007, P ICASSP, V4, P45 Kuo H. K. 
J., 2002, P ICASSP, V1, P325 Kuo J.-W., 2005, P INTERSPEECH, P1277 Kurata G, 2009, INT CONF ACOUST SPEE, P4717, DOI 10.1109/ICASSP.2009.4960684 Lamel L, 2002, COMPUT SPEECH LANG, V16, P115, DOI 10.1006/csla.2001.0186 Lin S., 2005, P INTERSPEECH, P733 Minematsu N., 2002, P ICSLP, P529 Mohanty B, 2008, INT CONF ACOUST SPEE, P4953, DOI 10.1109/ICASSP.2008.4518769 Mohri M, 2002, COMPUT SPEECH LANG, V16, P69, DOI 10.1006/csla.2001.0184 Nguyen L., 2003, P RICH TRANSCR WORKS Nishimura R., 2001, P EUROSPEECH, P2127 Oba Takanobu, 2007, P INT, P1753 Okanohara D., 2007, P ACL, P73 Pallet David S, 1990, P ICASSP, P97 Povey D., 2007, P ICASSP, V4, P321 Printz H, 2002, COMPUT SPEECH LANG, V16, P131, DOI 10.1006/csla.2001.0188 Rastrow Ariya, 2009, Proceedings of the 2009 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU 2009), DOI 10.1109/ASRU.2009.5373338 Roark B, 2007, COMPUT SPEECH LANG, V21, P373, DOI 10.1016/j.csl.2006.06.006 Roark B., 2004, P ACL, P47, DOI 10.3115/1218955.1218962 Sandbank B, 2008, P EMNLP, P51, DOI 10.3115/1613715.1613723 Schwenk H., 2005, P C HUM LANG TECHN E, P201, DOI 10.3115/1220575.1220601 Silva J, 2006, IEEE T AUDIO SPEECH, V14, P890, DOI 10.1109/TSA.2005.858059 Silva J., 2006, P IEEE INT S INF THE, P2299 Smith N A, 2005, P 43 ANN M ASS COMP, P354, DOI 10.3115/1219840.1219884 Woodland PC, 2002, COMPUT SPEECH LANG, V16, P25, DOI 10.1006/csla.2001.0182 Xu P., 2009, P ASRU, P317 NR 38 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2012 VL 54 IS 2 BP 219 EP 228 DI 10.1016/j.specom.2011.08.004 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 853RY UT WOS:000297444600006 ER PT J AU Fujimoto, M Watanabe, S Nakatani, T AF Fujimoto, Masakiyo Watanabe, Shinji Nakatani, Tomohiro TI Frame-wise model re-estimation method based on Gaussian pruning with weight normalization for noise robust voice activity detection SO SPEECH COMMUNICATION LA English DT Article DE Voice activity detection; Switching Kalman filter; Gaussian pruning; Posterior probability; Gaussian weight normalization ID HIGHER-ORDER STATISTICS; SPEECH RECOGNITION AB This paper proposes a robust voice activity detection (VAD) method that operates in the presence of noise. For noise robust VAD, we have already proposed statistical models and a switching Kalman filter (SKF)-based technique. In this paper, we focus on a model re-estimation method using Gaussian pruning with weight normalization. The statistical model for SKF-based VAD is constructed using Gaussian mixture models (GMMs), and consists of pre-trained silence and clean speech GMMs and a sequentially estimated noise GMM. However, the composed model is not optimal in that it does not fully reflect the characteristics of the observed signal. Thus, to ensure the optimality of the composed model, we investigate a method for its re-estimation that reflects the characteristics of the observed signal sequence. Since our VAD method works through the use of frame-wise sequential processing, processing with the smallest latency is very important. In this case, there are insufficient re-training data for a re-estimation of all the Gaussian parameters. To solve this problem, we propose a model re-estimation method that involves the extraction of reliable characteristics using Gaussian pruning with weight normalization. 
Namely, the proposed method re-estimates the model by pruning non-dominant Gaussian distributions that express the local characteristics of each frame and by normalizing the Gaussian weights of the remaining distributions. In an experiment using a speech corpus for VAD evaluation, CENSREC-1-C, the proposed method significantly improved the VAD performance compared with that of the original SKF-based VAD. This result confirmed that the proposed Gaussian pruning contributes to an improvement in VAD accuracy. (C) 2011 Elsevier B.V. All rights reserved. C1 [Fujimoto, Masakiyo; Watanabe, Shinji; Nakatani, Tomohiro] NTT Corp 2 4, NTT Commun Sci Labs, Seika, Kyoto 6190237, Japan. RP Fujimoto, M (reprint author), NTT Corp 2 4, NTT Commun Sci Labs, Seika, Kyoto 6190237, Japan. EM fujimoto.masakiyo@lab.ntt.co.jp CR [Anonymous], 1999, 301708 ETSI EN [Anonymous], 2006, 202050 ETSI ES Chang JH, 2006, IEEE T SIGNAL PROCES, V54, P1965, DOI 10.1109/TSP.2006.874403 Cournapeau D., 2007, P INT, P2945 Fischer V., 1999, P EUROSPEECH99 SEPT, V3, P1099 Fujimoto M., 2009, P INT 09 SEPT, P1235 Fujimoto M., 2007, P INT 07 AUG, P2933 Fujimoto M, 2008, INT CONF ACOUST SPEE, P4441 Hirsch H. G., 2000, P ISCA ITRW ASR2000, P18 Hori T, 2007, IEEE T AUDIO SPEECH, V15, P1352, DOI 10.1109/TASL.2006.889790 Ishizuka K, 2010, SPEECH COMMUN, V52, P41, DOI 10.1016/j.specom.2009.08.003 ITU-T, 1996, G729 ITUT Kato H., 2008, SPEECH COMMUN, V50, P476 Kitaoka N., 2007, P IEEE WORKSH AUT SP, P607 KRISTIANSSON T., 2005, P INTERSPEECH, P369 Li K, 2005, IEEE T SPEECH AUDI P, V13, P965, DOI 10.1109/TSA.2005.851955 Mohri M., 2000, P ASR2000, P97 Nakamura S, 2005, IEICE T INF SYST, VE88D, P535, DOI 10.1093/ietisy/e88-d.3.535 Nemer E, 2001, IEEE T SPEECH AUDI P, V9, P217, DOI 10.1109/89.905996 Ogawa A, 2008, INT CONF ACOUST SPEE, P4173 RABINER LR, 1975, AT&T TECH J, V54, P297 Ramirez J, 2007, INT CONF ACOUST SPEE, P801 Ramirez J, 2004, SPEECH COMMUN, V42, P271, DOI 10.1016/j.specom.2003.10.002 Ramirez J, 2005, IEEE SIGNAL PROC LET, V12, P689, DOI 10.1109/LSP.2005.855551 Shinoda K., 2002, P ICASSP2002, VI, P869 Sohn J, 1999, IEEE SIGNAL PROC LET, V6, P1 Tahmasbi R, 2007, IEEE T AUDIO SPEECH, V15, P1129, DOI 10.1109/TASL.2007.894521 Weiss R.J., 2008, P INT 08, P127 NR 28 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2012 VL 54 IS 2 BP 229 EP 244 DI 10.1016/j.specom.2011.08.005 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 853RY UT WOS:000297444600007 ER PT J AU Chunwijitra, V Nose, T Kobayashi, T AF Chunwijitra, Vataya Nose, Takashi Kobayashi, Takao TI A tone-modeling technique using a quantized F0 context to improve tone correctness in average-voice-based speech synthesis SO SPEECH COMMUNICATION LA English DT Article DE HMM-based speech synthesis; Average voice model; F0 modeling; F0 quantization ID SYNTHESIS SYSTEM; PITCH; ADAPTATION AB This paper proposes a technique for improving tone correctness in speech synthesis of a tonal language based on an average-voice model trained with a corpus from nonprofessional speakers' speech. We focused on reducing tone disagreements in speech data acquired from nonprofessional speakers without manually modifying the labels. To reduce the distortion in tone caused by inconsistent tonal labeling, quantized F0 symbols were utilized as the context for F0 to obtain an appropriate F0 model.
With this technique, the tonal context could be directly extracted from the original speech and this prevented inconsistency between speech data and F0 labels generated from transcriptions, which affect naturalness and the tone correctness in synthetic speech. We examined two types of labeling for the tonal context using phone-based and sub-phone-based quantized F0 symbols. Subjective and objective evaluations of the synthetic voice were carried out in terms of the intelligibility of tone and its naturalness. The experimental results from both the objective and subjective tests revealed that the proposed technique could improve not only naturalness but also the tone correctness of synthetic speech under conditions where a small amount of speech data from nonprofessional target speakers was used. (C) 2011 Elsevier B.V. All rights reserved. C1 [Chunwijitra, Vataya; Nose, Takashi; Kobayashi, Takao] Tokyo Inst Technol, Interdisciplinary Grad Sch Sci & Engn, Yokohama, Kanagawa 2268502, Japan. RP Chunwijitra, V (reprint author), Tokyo Inst Technol, Interdisciplinary Grad Sch Sci & Engn, Yokohama, Kanagawa 2268502, Japan. EM chunwijitra.v.aa@m.titech.ac.jp; takashi.nose@ip.titech.ac.jp; takao.kobayashi@ip.titech.ac.jp FU JSPS [21300063]; Thai government FX This work was supported in part by the JSPS Grant-in-Aid for Scientific Research 21300063. The first author was supported by a Science and Technology Scholarship from the Thai government. We would like to thank NECTEC, Thailand, for providing us with the LOTUS and the TSynC-1 speech corpora. CR Benoit C, 1996, SPEECH COMMUN, V18, P381, DOI 10.1016/0167-6393(96)00026-X Chomphan S, 2009, SPEECH COMMUN, V51, P330, DOI 10.1016/j.specom.2008.10.003 Chomphan S, 2008, SPEECH COMMUN, V50, P392, DOI 10.1016/j.specom.2007.12.002 CHOU PA, 1991, IEEE T PATTERN ANAL, V13, P340, DOI 10.1109/34.88569 Fujisaki H., 1984, J ACOUST SOC JPN ASJ, V5, P133 Hansakunbuntheung C., 2005, P SNLP 2005, P127 Hunt AJ, 1996, INT CONF ACOUST SPEE, P373, DOI 10.1109/ICASSP.1996.541110 Kasuriya S., 2003, Proceedings of the Oriental COCOSDA 2003. International Coordinating Committee on Speech Databases and Speech I/O System Assessment Kawahara H., 1997, P IEEE INT C AC SPEE, V1, P1303 Kawahara H, 1999, SPEECH COMMUN, V27, P187, DOI 10.1016/S0167-6393(98)00085-5 Li Y., 2004, P SPEECH PROS MARCH, P169 Mittrapiyanuruk P., 2000, NECTEC ANN C BANGK, P483 Nose T, 2010, INT CONF ACOUST SPEE, P4622, DOI 10.1109/ICASSP.2010.5495548 Ogata K., 2006, P INTERSPEECH 2006 I, P1328 Raux A., 2003, P WORKSH AUT SPEECH, P700 Sornlertlamvanich V., 1998, P OR COCOSDA WORKSH TAMURA M, 2001, ACOUST SPEECH SIG PR, P805 Tamura M., 2001, P EUROSPEECH 2001 SE, P345 Taylor P, 2000, J ACOUST SOC AM, V107, P1697, DOI 10.1121/1.428453 Thangthai A., 2008, P INT C SPEECH SCI T, P2270 TOKUDA K, 1995, P ICASSP, P660 TOKUDA K, 1999, ACOUST SPEECH SIG PR, P229 Yamagishi J, 2003, IEICE T FUND ELECTR, VE86A, P1956 Yamagishi J, 2003, IEICE T INF SYST, VE86D, P534 Yamagishi J, 2010, IEEE T AUDIO SPEECH, V18, P984, DOI 10.1109/TASL.2010.2045237 Yamagishi J, 2007, IEICE T INF SYST, VE90D, P533, DOI 10.1093/ietisy/e90-d.2.533 Yoshimura T, 1999, P EUR, P2347 Zen H, 2007, IEICE T INF SYST, VE90D, P825, DOI 10.1093/ietisy/e90-d.5.825 NR 28 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
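A minimal sketch, under assumed settings, of the phone-based quantized-F0 labeling described in the Chunwijitra et al. abstract above: each phone's tonal context symbol is obtained by quantizing its mean log-F0, taken directly from the original speech, into a few levels rather than relying on transcription-derived tone labels. The number of levels, the phone segmentation, and the voiced-only F0 track are illustrative assumptions.

```python
# Sketch: phone-based quantized F0 symbols as tonal context labels.
import numpy as np

def quantized_f0_symbols(f0_hz, phone_bounds, n_levels=4):
    """Return one quantized F0 symbol (Q0..Q{n_levels-1}) per phone."""
    logf0 = np.log(np.asarray(f0_hz, dtype=float))            # assumes voiced frames only
    edges = np.linspace(logf0.min(), logf0.max(), n_levels + 1)[1:-1]
    symbols = []
    for start, end in phone_bounds:                           # frame index ranges per phone
        mean_logf0 = logf0[start:end].mean()
        symbols.append(f"Q{np.searchsorted(edges, mean_logf0)}")
    return symbols

f0_track = [110, 115, 130, 150, 160, 155, 120, 100, 95, 105]  # Hz, one value per frame
print(quantized_f0_symbols(f0_track, [(0, 3), (3, 6), (6, 10)]))
```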
PD FEB PY 2012 VL 54 IS 2 BP 245 EP 255 DI 10.1016/j.specom.2011.08.006 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 853RY UT WOS:000297444600008 ER PT J AU Tohidypour, HR Seyyedsalehi, SA Behbood, H Roshandel, H AF Tohidypour, Hamid Reza Seyyedsalehi, Seyyed Ali Behbood, Hossein Roshandel, Hossein TI A new representation for speech frame recognition based on redundant wavelet filter banks SO SPEECH COMMUNICATION LA English DT Article DE Redundant wavelet filter-bank (RWFB); Wavelet transform (WT); Speech frame recognition; Representation; Frame wavelet; Zero moments; Four-channel higher density discrete wavelet; Time delay neural network (TDNN) ID TRANSFORM; MODEL AB Although the conventional wavelet transform possesses multi-resolution properties, it is not optimized for speech recognition systems. It suffers from lower performance compared with Mel Frequency Cepstral Coefficients (MFCCs) in which Mel scale is based on human auditory perception. In this paper, some new speech representations based on redundant wavelet filter-banks (RWFB) are proposed. RWFB parameters are much less shift-sensitive than those of critically sampled discrete wavelet transform (DWT), so they seem to feature better performance in speech recognition tasks because of having better time-frequency localization ability. However, the improvement is at the expense of higher redundancy. In this paper, some types of wavelet representations are introduced, including a combination of critically sampled DWT and some different multi-channel redundant filter-banks down-sampled by 2. In order to find appropriate filter values for multi-channel filter-banks, effects of changing the zero moments of proposed wavelet are discussed. The corresponding method performances are compared in a phoneme recognition task using time delay neural networks. It is revealed that redundant multi-channel wavelet filter-banks work better than conventional DWT in speech recognition systems. The proposed four-channel higher density discrete wavelet filter-bank results in up to approximately 8.95% recognition rate increase, compared with critically sampled two-channel wavelet filter-bank. (C) 2011 Elsevier B.V. All rights reserved. C1 [Tohidypour, Hamid Reza; Seyyedsalehi, Seyyed Ali; Behbood, Hossein] Amirkabir Univ Technol, Dept Biomed Engn, Tehran 158754413, Iran. [Roshandel, Hossein] Amirkabir Univ Technol, Dept Elect Engn, Tehran Polytech, Tehran 158754413, Iran. RP Tohidypour, HR (reprint author), Amirkabir Univ Technol, Dept Biomed Engn, 424 Hafez Ave, Tehran 158754413, Iran. EM hamidto86@aut.ac.ir CR Abdelnour A.F, 2005, P SOC PHOTO-OPT INS, V5914, P133 Abdelnour AF, 2005, IEEE T SIGNAL PROCES, V53, P231, DOI 10.1109/TSP.2004.838959 Bijankhan M., 1994, P SPEECH SCI TECHN C, P826 Bresolin AD, 2008, INT CONF ACOUST SPEE, P1545 Farooq O, 2001, IEEE SIGNAL PROC LET, V8, P196, DOI 10.1109/97.928676 Favero R. F., 1994, Proceedings of the IEEE-SP International Symposium on Time-Frequency and Time-Scale Analysis (Cat. No.94TH8007), DOI 10.1109/TFSA.1994.467280 Gillick L., 1989, P ICASSP, V1, P532 GOWDY JN, 2000, P IEEE INT C AC SPEE, V3, P1351 Karam J.R., 2000, P CAN C EL COMP ENG, V1, P331 Karami Sh., 2001, THESIS AMIRKABIR U T Lebrun J, 2004, J SYMB COMPUT, V37, P227, DOI 10.1016/j.jsc.2002.06.002 Nejadgholi I, 2009, NEURAL COMPUT APPL, V18, P45, DOI 10.1007/s00521-007-0151-5 Rahiminejad M., 2002, THESIS AMIRKABIR U T Selesnick I. 
W., 2001, WAVELETS SIGNAL IMAG Selesnick IW, 2004, APPL COMPUT HARMON A, V17, P211, DOI 10.1016/j.acha.2004.05.003 Selesnick IW, 2005, IEEE SIGNAL PROC MAG, V22, P123, DOI 10.1109/MSP.2005.1550194 Selesnick IW, 2006, IEEE T SIGNAL PROCES, V54, P3039, DOI 10.1109/TSP.2006.875388 Shao Y., 2010, IEEE T SYST MAN CY A, V41, P284 Tufekei Z, 2006, SPEECH COMMUN, V48, P1294, DOI 10.1016/j.specom.2006.06.006 Wu JD, 2009, EXPERT SYST APPL, V36, P3136, DOI 10.1016/j.eswa.2008.01.038 Xueying Z., 2004, P 7 INT C SIGN PROC, VI, P695, DOI 10.1109/ICOSP.2004.1452758 Yao J, 2001, IEEE T BIO-MED ENG, V48, P856 NR 22 TC 2 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2012 VL 54 IS 2 BP 256 EP 271 DI 10.1016/j.specom.2011.09.001 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 853RY UT WOS:000297444600009 ER PT J AU Chen, F Loizou, PC AF Chen, Fei Loizou, Philipos C. TI Impact of SNR and gain-function over- and under-estimation on speech intelligibility SO SPEECH COMMUNICATION LA English DT Article DE Speech enhancement; Speech intelligibility; SNR estimation ID SPECTRAL AMPLITUDE ESTIMATOR; ENHANCEMENT ALGORITHMS; MINIMUM STATISTICS; DENSITY-ESTIMATION; NOISE-ESTIMATION AB Most noise reduction algorithms rely on obtaining reliable estimates of the SNR of each frequency bin. For that reason, much work has been done in analyzing the behavior and performance of SNR estimation algorithms in the context of improving speech quality and reducing speech distortions (e.g., musical noise). Comparatively little work has been reported, however, regarding the analysis and investigation of the effect of errors in SNR estimation on speech intelligibility. It is not known, for instance, whether it is the errors in SNR overestimation, errors in SNR underestimation, or both that are harmful to speech intelligibility. Errors in SNR estimation produce concomitant errors in the computation of the gain (suppression) function, and the impact of gain estimation errors on speech intelligibility is unclear. The present study assesses the effect of SNR estimation errors on gain function estimation via sensitivity analysis. Intelligibility listening studies were conducted to validate the sensitivity analysis. Results indicated that speech intelligibility is severely compromised when SNR and gain over-estimation errors are introduced in spectral components with negative SNR. A theoretical upper bound on the gain function is derived that can be used to constrain the values of the gain function so as to ensure that SNR overestimation errors are minimized. Speech enhancement algorithms that can limit the values of the gain function to fall within this upper bound can improve speech intelligibility. (C) 2011 Elsevier B.V. All rights reserved. C1 [Chen, Fei; Loizou, Philipos C.] Univ Texas Dallas, Dept Elect Engn, Richardson, TX 75083 USA. RP Loizou, PC (reprint author), Univ Texas Dallas, Dept Elect Engn, 800 W Campbell Rd,EC33, Richardson, TX 75080 USA. EM loizou@utdallas.edu FU National Institute of Deafness and other Communication Disorders, NIH [R01 DC010494] FX This research was supported by Grant No. R01 DC010494 from the National Institute of Deafness and other Communication Disorders, NIH. 
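The over- versus under-estimation asymmetry discussed in the abstract above can be made concrete numerically: the short sketch below computes a suppression gain from a deliberately mis-estimated SNR and optionally clamps it. The plain Wiener rule and the 0.3 cap are placeholders chosen for illustration; they are not the suppression function or the theoretical upper bound derived in the paper.

```python
import numpy as np

def wiener_gain(snr_linear):
    """Wiener suppression gain G = SNR / (1 + SNR) for an a priori SNR estimate."""
    return snr_linear / (1.0 + snr_linear)

def gain_with_snr_error(true_snr_db, error_db, gain_cap=None):
    """Gain computed from a mis-estimated SNR (true SNR shifted by error_db in dB).

    gain_cap optionally clamps the gain, mimicking the idea of constraining the
    suppression function so that the effect of SNR over-estimation is limited.
    """
    est_snr = 10.0 ** ((true_snr_db + error_db) / 10.0)
    g = wiener_gain(est_snr)
    if gain_cap is not None:
        g = min(g, gain_cap)
    return g

# Example: a spectral component whose true SNR is -10 dB (noise dominated).
for err in (-10, 0, +10, +20):   # under-estimation, exact, over-estimation
    print('error %+3d dB -> gain %.3f (capped %.3f)'
          % (err, gain_with_snr_error(-10, err),
             gain_with_snr_error(-10, err, gain_cap=0.3)))
```

Running this shows the pattern the abstract describes: over-estimating the SNR of a negative-SNR component raises its gain and lets noise through, while a cap on the gain bounds that damage at the cost of some suppression.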
CR Berouti M., 1979, P IEEE INT C AC SPEE, P208 Breithaupt C, 2011, IEEE T AUDIO SPEECH, V19, P277, DOI 10.1109/TASL.2010.2047681 Brungart DS, 2006, J ACOUST SOC AM, V120, P4007, DOI 10.1121/1.2363929 Cappe O., 1994, IEEE T SPEECH AUDIO, V2, P346 Cohen I, 2005, IEEE T SPEECH AUDI P, V13, P870, DOI 10.1109/TSA.2005.851940 EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 Erkelens J, 2007, SPEECH COMMUN, V49, P530, DOI 10.1016/j.specom.2006.06.012 Hu Y, 2007, SPEECH COMMUN, V49, P588, DOI 10.1016/j.specom.2006.12.006 Hu Y, 2008, IEEE T AUDIO SPEECH, V16, P229, DOI 10.1109/TASL.2007.911054 IEEE Subcommittee, 1969, IEEE T AUDIO ELECTRO, V3, P225, DOI DOI 10.1109/TAU.1969.1162058 Kim G, 2009, J ACOUST SOC AM, V126, P1486, DOI 10.1121/1.3184603 Kim G, 2011, J ACOUST SOC AM, V130, P1581, DOI 10.1121/1.3619790 Kim G., 2010, P INT, P1632 KRYTER KD, 1962, J ACOUST SOC AM, V34, P1698, DOI 10.1121/1.1909096 Li N., 2009, J ACOUST SOC AM, V123, P1673 Li YP, 2009, SPEECH COMMUN, V51, P230, DOI 10.1016/j.specom.2008.09.001 Loizou PC, 2011, IEEE T AUDIO SPEECH, V19, P47, DOI 10.1109/TASL.2010.2045180 Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst Lu Y, 2010, INT CONF ACOUST SPEE, P4754, DOI 10.1109/ICASSP.2010.5495156 Ma JF, 2009, J ACOUST SOC AM, V125, P3387, DOI 10.1121/1.3097493 Martin R, 2001, IEEE T SPEECH AUDI P, V9, P504, DOI 10.1109/89.928915 Martin R, 2006, SIGNAL PROCESS, V86, P1215, DOI 10.1016/j.sigpro.2005.07.037 Martin R, 2005, SIG COM TEC, P43, DOI 10.1007/3-540-27489-8_3 Papoulis A., 2002, PROBABILITY RANDOM V Plapous C, 2006, IEEE T AUDIO SPEECH, V14, P2098, DOI 10.1109/TASL.2006.872621 Quackenbush S. R., 1988, OBJECTIVE MEASURES S Rangachari S, 2006, SPEECH COMMUN, V48, P220, DOI 10.1016/j.specom.2005.08.005 Scalart P, 1996, INT CONF ACOUST SPEE, P629, DOI 10.1109/ICASSP.1996.543199 Wang D., 2006, COMPUTATIONAL AUDITO Whitehead PS, 2011, INT CONF ACOUST SPEE, P5080 NR 31 TC 4 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2012 VL 54 IS 2 BP 272 EP 281 DI 10.1016/j.specom.2011.09.002 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 853RY UT WOS:000297444600010 ER PT J AU Paliwal, K Schwerin, B Wojcicki, K AF Paliwal, Kuldip Schwerin, Belinda Wojcicki, Kamil TI Speech enhancement using a minimum mean-square error short-time spectral modulation magnitude estimator SO SPEECH COMMUNICATION LA English DT Article DE Modulation domain; Analysis-modification-synthesis (AMS); Speech enhancement; MMSE short-time spectral magnitude estimator (AME); Modulation spectrum; Modulation magnitude spectrum; MMSE short-time modulation magnitude estimator (MME) ID AMPLITUDE ESTIMATOR; QUALITY ESTIMATION; STATISTICAL-MODEL; NOISE; INTELLIGIBILITY; RECOGNITION; SUPPRESSION; SUBTRACTION; SNR AB In this paper we investigate the enhancement of speech by applying MMSE short-time spectral magnitude estimation in the modulation domain. For this purpose, the traditional analysis-modification-synthesis framework is extended to include modulation domain processing. We compensate the noisy modulation spectrum for additive noise distortion by applying the MMSE short-time spectral magnitude estimation algorithm in the modulation domain. A number of subjective experiments were conducted. 
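Before the experiments are described, the two-stage processing chain above can be pictured with a bare-bones sketch: the acoustic magnitude trajectories from a first STFT are transformed a second time, a suppression rule modifies the modulation magnitudes, and the signal is resynthesised with the noisy phases. The single whole-utterance modulation transform, the window settings, and the crude placeholder gain below are illustrative simplifications; the paper's method frames the magnitude trajectories as well and applies its MMSE modulation magnitude estimator at that point.

```python
import numpy as np
from scipy.signal import stft, istft

def modulation_domain_ams(noisy, fs, mod_gain):
    """Two-stage AMS sketch: acoustic STFT, then a transform across time of each
    bin's magnitude trajectory; `mod_gain` modifies the modulation magnitudes."""
    _, _, X = stft(noisy, fs=fs, nperseg=256, noverlap=192)    # (freq bins, frames)
    ac_mag, ac_phase = np.abs(X), np.angle(X)

    # Modulation analysis: one DFT over each bin's magnitude trajectory
    # (a single transform keeps the sketch short; the paper frames these too).
    M = np.fft.rfft(ac_mag, axis=1)
    mod_mag, mod_phase = np.abs(M), np.angle(M)

    # Modification: any magnitude-domain suppression rule can be plugged in here.
    mod_mag_hat = mod_gain(mod_mag)

    # Synthesis: back to magnitude trajectories, then to the time domain,
    # reusing the noisy acoustic phase.
    ac_mag_hat = np.fft.irfft(mod_mag_hat * np.exp(1j * mod_phase),
                              n=ac_mag.shape[1], axis=1)
    ac_mag_hat = np.maximum(ac_mag_hat, 0.0)
    _, y = istft(ac_mag_hat * np.exp(1j * ac_phase), fs=fs, nperseg=256, noverlap=192)
    return y

# Toy usage with a crude placeholder gain that attenuates weak modulation components.
fs = 8000
x = np.sin(2 * np.pi * 440 * np.arange(fs) / fs) + 0.3 * np.random.randn(fs)
y = modulation_domain_ams(x, fs, mod_gain=lambda m: np.where(m < np.median(m), 0.3 * m, m))
```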
Initially, we determine the parameter values that maximise the subjective quality of stimuli enhanced using the MMSE modulation magnitude estimator. Next, we compare the quality of stimuli processed by the MMSE modulation magnitude estimator to those processed using the MMSE acoustic magnitude estimator and the modulation spectral subtraction method, and show that good improvement in speech quality is achieved through use of the proposed approach. Then we evaluate the effect of including speech presence uncertainty and log-domain processing on the quality of enhanced speech, and find that this method works better with speech uncertainty. Finally we compare the quality of speech enhanced using the MMSE modulation magnitude estimator (when used with speech presence uncertainty) with that enhanced using different acoustic domain MMSE magnitude estimator formulations, and those enhanced using different modulation domain based enhancement algorithms. Results of these tests show that the MMSE modulation magnitude estimator improves the quality of processed stimuli, without introducing musical noise or spectral smearing distortion. The proposed method is shown to have better noise suppression than MMSE acoustic magnitude estimation, and improved speech quality compared to other modulation domain based enhancement methods considered. (C) 2011 Elsevier B.V. All rights reserved. C1 [Paliwal, Kuldip; Schwerin, Belinda; Wojcicki, Kamil] Griffith Univ, Signal Proc Lab, Griffith Sch Engn, Nathan, Qld 4111, Australia. RP Schwerin, B (reprint author), Griffith Univ, Signal Proc Lab, Griffith Sch Engn, Nathan, Qld 4111, Australia. EM belinda.schwerin@griffithuni.edu.au CR Atlas L, 2003, EURASIP J APPL SIG P, V2003, P668, DOI 10.1155/S1110865703305013 Atlas L., 2004, P IEEE INT C AC SPEE, V2, P761 Berouti M., 1979, P IEEE INT C AC SPEE, V4, P208 BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 Breithaupt C, 2011, IEEE T AUDIO SPEECH, V19, P277, DOI 10.1109/TASL.2010.2047681 Cappe O, 1994, IEEE T SPEECH AUDI P, V2, P345, DOI 10.1109/89.279283 Cohen I, 2005, IEEE T SPEECH AUDI P, V13, P870, DOI 10.1109/TSA.2005.851940 Cohen I, 2002, IEEE SIGNAL PROC LET, V9, P12, DOI 10.1109/97.988717 DRULLMAN R, 1994, J ACOUST SOC AM, V95, P2670, DOI 10.1121/1.409836 EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 Falk T., 2008, P INT WORKSH AC ECH Falk T. H., 2007, P ISCA C INT SPEECH, P970 Falk TH, 2010, IEEE T AUDIO SPEECH, V18, P90, DOI 10.1109/TASL.2009.2023679 Falk TH, 2010, IEEE T AUDIO SPEECH, V18, P1766, DOI 10.1109/TASL.2010.2052247 GRAY RM, 1980, IEEE T ACOUST SPEECH, V28, P367, DOI 10.1109/TASSP.1980.1163421 GREENBERG S, 1997, P ICASSP, V3, P1647 Hermansky H., 1995, P IEEE INT C AC SPEE, V1, P405 Hu Y, 2007, SPEECH COMMUN, V49, P588, DOI 10.1016/j.specom.2006.12.006 Huang X., 2001, SPOKEN LANGUAGE PROC ITU-T P. 
835, 2007, P835 ITUT Kim DS, 2004, IEEE SIGNAL PROC LET, V11, P849, DOI 10.1109/LSP.2004.835466 Kim DS, 2005, IEEE T SPEECH AUDI P, V13, P821, DOI 10.1109/TSA.2005.851924 Kingsbury BED, 1998, SPEECH COMMUN, V25, P117, DOI 10.1016/S0167-6393(98)00032-6 LIM JS, 1979, P IEEE, V67, P1586, DOI 10.1109/PROC.1979.11540 Loizou PC, 2005, IEEE T SPEECH AUDI P, V13, P857, DOI 10.1109/TSA.2005.851929 Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst Lyons J., 2008, P ISCA C INT SPEECH, P387 Martin R, 2001, IEEE T SPEECH AUDI P, V9, P504, DOI 10.1109/89.928915 Martin R., 1994, P 7 EUR SIGN PROC C, P1182 MCAULAY RJ, 1980, IEEE T ACOUST SPEECH, V28, P137, DOI 10.1109/TASSP.1980.1163394 Paliwal K, 2011, SPEECH COMMUN, V53, P327, DOI 10.1016/j.specom.2010.10.004 Paliwal K, 2008, IEEE SIGNAL PROC LET, V15, P785, DOI 10.1109/LSP.2008.2005755 Paliwal K, 2010, SPEECH COMMUN, V52, P450, DOI 10.1016/j.specom.2010.02.004 PICONE JW, 1993, P IEEE, V81, P1215, DOI 10.1109/5.237532 Quackenbush S. R., 1988, OBJECTIVE MEASURES S Quatieri T. F., 2002, DISCRETE TIME SPEECH Rabiner L. R., 2010, THEORY APPL DIGITAL Rix A., 2001, P862 ITUT Scalart P., 1996, P ICASSP, V2, P629 Shannon B., 2006, P INT C SPOK LANG PR, P1423 Sim BL, 1998, IEEE T SPEECH AUDI P, V6, P328 So S, 2011, SPEECH COMMUN, V53, P818, DOI 10.1016/j.specom.2011.02.001 Sohn J, 1999, IEEE SIGNAL PROC LET, V6, P1 Thompson JK, 2003, P IEEE INT C AC SPEE, V5, P397 Tyagi V., 2003, P ISCA EUR C SPEECH, P981 Vary P, 2006, DIGITAL SPEECH TRANS Virag N, 1999, IEEE T SPEECH AUDI P, V7, P126, DOI 10.1109/89.748118 WANG DL, 1982, IEEE T ACOUST SPEECH, V30, P679, DOI 10.1109/TASSP.1982.1163920 Wiener N., 1949, EXTRAPOLATION INTERP Wu S., 2009, INT C DIG SIGN PROC ZADEH LA, 1950, P IRE, V38, P291, DOI 10.1109/JRPROC.1950.231083 NR 52 TC 10 Z9 10 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2012 VL 54 IS 2 BP 282 EP 305 DI 10.1016/j.specom.2011.09.003 PG 24 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 853RY UT WOS:000297444600011 ER PT J AU Hines, A Harte, N AF Hines, Andrew Harte, Naomi TI Speech intelligibility prediction using a Neurogram Similarity Index Measure SO SPEECH COMMUNICATION LA English DT Article DE Auditory periphery model; Simulated performance intensity function; NSIM; SSIM; Speech Intelligibility ID AUDITORY-NERVE RESPONSES; QUALITY ASSESSMENT; PHENOMENOLOGICAL MODEL; STRUCTURAL SIMILARITY; TEMPORAL INFORMATION; NORMAL-HEARING; RECOGNITION; PERCEPTION; PERIPHERY; LOUDNESS AB Discharge patterns produced by fibres from normal and impaired auditory nerves in response to speech and other complex sounds can be discriminated subjectively through visual inspection. Similarly, responses from auditory nerves where speech is presented at diminishing sound levels progressively deteriorate from those at normal listening levels. This paper presents a Neurogram Similarity Index Measure (NSIM) that automates this inspection process, and translates the response pattern differences into a bounded discrimination metric. Performance intensity functions can be used to provide additional information over measurement of speech reception threshold and maximum phoneme recognition by plotting a test subject's recognition probability over a range of sound intensities. A computational model of the auditory periphery was used to replace the human subject and develop a methodology that simulates a real listener test. 
The newly developed NSIM is used to evaluate the model outputs in response to Consonant-Vowel-Consonant (CVC) word lists and produce phoneme discrimination scores. The simulated results are rigorously compared to those from normal hearing subjects in both quiet and noise conditions. The accuracy of the tests and the minimum number of word lists necessary for repeatable results is established and the results are compared to predictions using the speech intelligibility index (SII). The experiments demonstrate that the proposed simulated performance intensity function (SPIF) produces results with confidence intervals within the human error bounds expected with real listener tests. This work represents an important step in validating the use of auditory nerve models to predict speech intelligibility. (C) 2011 Elsevier B.V. All rights reserved. C1 [Hines, Andrew; Harte, Naomi] Trinity Coll Dublin, Sigmedia Grp, Dept Elect & Elect Engn, Dublin, Ireland. RP Hines, A (reprint author), Trinity Coll Dublin, Sigmedia Grp, Dept Elect & Elect Engn, Dublin, Ireland. EM hinesa@tcd.ie CR American National Standards Institute (ANSI), 1997, S351997R2007 ANSI Bondy J, 2004, ADV NEUR IN, V16, P1409 Boothroyd A, 1968, SOUND, V2, P3 Boothroyd A., 2006, COMPUTER AIDED SPEEC Boothroyd A, 2008, EAR HEARING, V29, P479, DOI 10.1097/AUD.0b013e318174f067 BOOTHROY.A, 1968, J ACOUST SOC AM, V43, P362, DOI 10.1121/1.1910787 Bruce IC, 2003, J ACOUST SOC AM, V113, P369, DOI 10.1121/1.1519544 Dau T, 1996, J ACOUST SOC AM, V99, P3615, DOI 10.1121/1.414959 Dillon H., 2001, HEARING AIDS Elhilali M, 2003, SPEECH COMMUN, V41, P331, DOI 10.1016/S0167-6393(02)00134-6 Gallun F, 2008, EAR HEARING, V29, P800, DOI 10.1097/AUD.0b013e31817e73ef Gelfand SA, 1998, J SPEECH LANG HEAR R, V41, P1088 Hines A, 2010, SPEECH COMMUN, V52, P736, DOI 10.1016/j.specom.2010.04.006 HOCHBERG I, 1975, AUDIOLOGY, V14, P27 Huber R, 2006, IEEE T AUDIO SPEECH, V14, P1902, DOI 10.1109/TASL.2006.883259 Ibrahim RA, 2010, NEUROPHYSIOLOGICAL BASES OF AUDITORY PERCEPTION, P429, DOI 10.1007/978-1-4419-5686-6_40 JERGER J, 1971, ARCHIV OTOLARYNGOL, V93, P573 Jurgens T, 2009, J ACOUST SOC AM, V126, P2635, DOI 10.1121/1.3224721 Jurgens T., 2010, INTERSPEECH 2010 MAK, P2478 Kandadai S, 2008, INT CONF ACOUST SPEE, P221, DOI 10.1109/ICASSP.2008.4517586 Mackersie C L, 2001, J Am Acad Audiol, V12, P390 Markides A, 1978, Br J Audiol, V12, P40, DOI 10.3109/03005367809078852 McCreery R, 2010, EAR HEARING, V31, P95, DOI 10.1097/AUD.0b013e3181bc7702 Moore B.C.J, 2007, PSYCHOL TECHNICAL IS Olsen WO, 1997, EAR HEARING, V18, P175, DOI 10.1097/00003446-199706000-00001 PREMINGER JE, 1995, J SPEECH HEAR RES, V38, P714 ROSEN S, 1992, PHILOS T ROY SOC B, V336, P367, DOI 10.1098/rstb.1992.0070 Sachs MB, 2002, ANN BIOMED ENG, V30, P157, DOI 10.1114/1.1458592 SAMMETH CA, 1989, EAR HEARING, V10, P94 Smith ZM, 2002, NATURE, V416, P87, DOI 10.1038/416087a STUDEBAKER GA, 1993, J SPEECH HEAR RES, V36, P799 Wang Z, 2004, IEEE T IMAGE PROCESS, V13, P600, DOI 10.1109/TIP.2003.819861 Young ED, 2008, PHILOS T R SOC B, V363, P923, DOI 10.1098/rstb.2007.2151 Zhang XD, 2001, J ACOUST SOC AM, V109, P648, DOI 10.1121/1.1336503 Zilany MSA, 2009, J ACOUST SOC AM, V126, P2390, DOI 10.1121/1.3238250 Zilany MSA, 2007, 3 INT IEEE EMBS C NE, P481 Zilany MSA, 2006, J ACOUST SOC AM, V120, P1446, DOI 10.1121/1.2225512 NR 37 TC 7 Z9 7 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
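For readers unfamiliar with SSIM-style measures, the sketch below computes a windowed "luminance times structure" similarity between a reference and a degraded neurogram (time-frequency matrices of simulated auditory-nerve activity), which is the general family NSIM belongs to. The 3x3 window, the stabilising constants, and the unweighted product of the two components are illustrative assumptions; the exact definition and weighting of NSIM are those given in the paper.

```python
import numpy as np

def ssim_style_similarity(ref, deg, win=3, c1=0.01, c2=0.03):
    """Mean windowed similarity between two equally sized neurograms.

    Combines a 'luminance' term (local means) and a 'structure' term (local
    cross-correlation), in the spirit of SSIM/NSIM; c1 and c2 stabilise the
    ratios and are illustrative values.
    """
    ref = np.asarray(ref, float)
    deg = np.asarray(deg, float)
    rows, cols = ref.shape
    scores = []
    for i in range(rows - win + 1):
        for j in range(cols - win + 1):
            r = ref[i:i + win, j:j + win].ravel()
            d = deg[i:i + win, j:j + win].ravel()
            mu_r, mu_d = r.mean(), d.mean()
            sig_r, sig_d = r.std(), d.std()
            sig_rd = ((r - mu_r) * (d - mu_d)).mean()
            luminance = (2 * mu_r * mu_d + c1) / (mu_r ** 2 + mu_d ** 2 + c1)
            structure = (sig_rd + c2) / (sig_r * sig_d + c2)
            scores.append(luminance * structure)
    return float(np.mean(scores))

# Toy usage: the 'degraded' neurogram is the reference plus noise.
rng = np.random.default_rng(0)
reference = rng.random((16, 40))
print(ssim_style_similarity(reference, reference))                        # identical -> 1.0
print(ssim_style_similarity(reference, reference + 0.3 * rng.random((16, 40))))
```

A bounded score of this kind is what allows the discrimination between responses at normal and diminished presentation levels to be automated rather than judged by visual inspection.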
PD FEB PY 2012 VL 54 IS 2 BP 306 EP 320 DI 10.1016/j.specom.2011.09.004 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 853RY UT WOS:000297444600012 ER PT J AU Jaywant, A Pell, MD AF Jaywant, Abhishek Pell, Marc D. TI Categorical processing of negative emotions from speech prosody SO SPEECH COMMUNICATION LA English DT Article DE Prosody; Facial expressions; Emotion; Nonverbal cues; Priming; Category-specific processing ID AFFECT DECISION TASK; FACIAL EXPRESSION; VOCAL EXPRESSION; AUTOMATIC ACTIVATION; NONVERBAL EMOTION; CIRCUMPLEX MODEL; PERCEPTION; FACE; VOICE; RECOGNITION AB Everyday communication involves processing nonverbal emotional cues from auditory and visual stimuli. To characterize whether emotional meanings are processed with category-specificity from speech prosody and facial expressions, we employed a cross-modal priming task (the Facial Affect Decision Task; Pell, 2005a) using emotional stimuli with the same valence but that differed by emotion category. After listening to angry, sad, disgusted, or neutral vocal primes, subjects rendered a facial affect decision about an emotionally congruent or incongruent face target. Our results revealed that participants made fewer errors when judging face targets that conveyed the same emotion as the vocal prime, and responded significantly faster for most emotions (anger and sadness). Surprisingly, participants responded slower when the prime and target both conveyed disgust, perhaps due to attention biases for disgust-related stimuli. Our findings suggest that vocal emotional expressions with similar valence are processed with category specificity, and that discrete emotion knowledge implicitly affects the processing of emotional faces between sensory modalities. (C) 2011 Elsevier B.V. All rights reserved. C1 [Jaywant, Abhishek; Pell, Marc D.] McGill Univ, Sch Commun Sci & Disorders, Montreal, PQ H3G 1A8, Canada. RP Pell, MD (reprint author), McGill Univ, Sch Commun Sci & Disorders, 1266 Ave Pins Ouest, Montreal, PQ H3G 1A8, Canada. EM marc.pell@mcgill.ca FU Natural Sciences and Engineering Research Council of Canada; McGill University FX This research was financially supported by the Natural Sciences and Engineering Research Council of Canada (Discovery grants competition) and by McGill University (William Dawson Scholar Award to MDP). We thank Catherine Knowles, Shoshana Gal, Laura Monetta, Pan Liu, and Hope Valeriote for their input and help with data collection and manuscript preparation. 
CR Banse R, 1996, J PERS SOC PSYCHOL, V70, P614, DOI 10.1037/0022-3514.70.3.614 Batty M, 2003, COGNITIVE BRAIN RES, V17, P613, DOI 10.1016/S0926-6410(03)00174-5 Bimler D, 2001, COGNITION EMOTION, V15, P633, DOI 10.1080/02699930143000077 Borod JC, 2000, COGNITION EMOTION, V14, P193 Bowers D., 1993, NEUROPSYCHOLOGY, V7, P433, DOI 10.1037//0894-4105.7.4.433 Carroll NC, 2005, Q J EXP PSYCHOL-A, V58, P1173, DOI 10.1080/02724980443000539 Charash M, 2002, J ANXIETY DISORD, V16, P529, DOI 10.1016/S0887-6185(02)00171-8 Cisler JM, 2009, COGNITION EMOTION, V23, P675, DOI 10.1080/02699930802051599 de Gelder B, 2000, COGNITION EMOTION, V14, P289 de Gelder B, 1999, NEUROSCI LETT, V260, P133, DOI 10.1016/S0304-3940(98)00963-X EKMAN P, 1992, COGNITION EMOTION, V6, P169, DOI 10.1080/02699939208411068 ETCOFF NL, 1992, COGNITION, V44, P227, DOI 10.1016/0010-0277(92)90002-Y Fazio RH, 2001, COGNITION EMOTION, V15, P115, DOI 10.1080/0269993004200024 Gerber AJ, 2008, NEUROPSYCHOLOGIA, V46, P2129, DOI 10.1016/j.neuropsychologia.2008.02.032 Goldstone RL, 2010, WIRES COGN SCI, V1, P69, DOI 10.1002/wcs.26 HERMANS D, 1994, COGNITION EMOTION, V8, P515, DOI 10.1080/02699939408408957 Hietanen J, 2004, EUR J COGN PSYCHOL, V16, P769, DOI 10.1080/09541440340000330 Hinojosa JA, 2009, EMOTION, V9, P164, DOI 10.1037/a0014680 Juslin PN, 2003, PSYCHOL BULL, V129, P770, DOI 10.1037/0033-2909.129.5.770 Juslin PN, 2001, EMOTION, V1, P381, DOI 10.1037//1528-3542.1.4.381 Kreifelts B, 2007, NEUROIMAGE, V37, P1445, DOI 10.1016/j.neuroimage.2007.06.020 Kreifelts B, 2009, NEUROPSYCHOLOGIA, V47, P3059, DOI 10.1016/j.neuropsychologia.2009.07.001 Krolak-Salmon P, 2001, EUR J NEUROSCI, V13, P987, DOI 10.1046/j.0953-816x.2001.01454.x Laukka P, 2005, COGNITION EMOTION, V19, P633, DOI 10.1080/02699930441000445 Laukka P, 2005, EMOTION, V5, P277, DOI 10.1037/1528-3542.5.3.277 LEVENSON RW, 1990, PSYCHOPHYSIOLOGY, V27, P363, DOI 10.1111/j.1469-8986.1990.tb02330.x Massaro DW, 1996, PSYCHON B REV, V3, P215, DOI 10.3758/BF03212421 NIEDENTHAL PM, 1994, PERS SOC PSYCHOL B, V20, P401, DOI 10.1177/0146167294204007 Palermo R, 2004, BEHAV RES METH INS C, V36, P634, DOI 10.3758/BF03206544 Paulmann S, 2009, J PSYCHOPHYSIOL, V23, P63, DOI 10.1027/0269-8803.23.2.63 PAULMANN S, SPEECH COMM IN PRESS Paulmann S, 2010, COGN AFFECT BEHAV NE, V10, P230, DOI 10.3758/CABN.10.2.230 Paulmann S, 2009, NEUROREPORT, V20, P1603, DOI 10.1097/WNR.0b013e3283320e3f Pell MD, 2009, J PHONETICS, V37, P417, DOI 10.1016/j.wocn.2009.07.005 PELL MD, TIMECOURSE IN PRESS Pell MD, 2009, J NONVERBAL BEHAV, V33, P107, DOI 10.1007/s10919-008-0065-7 Pell MD, 2005, J NONVERBAL BEHAV, V29, P193, DOI 10.1007/s10919-005-7720-z Pell MD, 2008, SPEECH COMMUN, V50, P519, DOI 10.1016/j.specom.2008.03.006 Pell MD, 2011, COGNITION EMOTION, V25, P834, DOI 10.1080/02699931.2010.516915 Pell MD, 2002, BRAIN COGNITION, V48, P499, DOI 10.1006/brxg.2001.1406 Pell MD, 2005, J NONVERBAL BEHAV, V29, P45, DOI 10.1007/s10919-004-0889-8 Posner J, 2005, DEV PSYCHOPATHOL, V17, P715, DOI 10.1017/S0954579405050340 Pourtois G, 2000, NEUROREPORT, V11, P1329, DOI 10.1097/00001756-200004270-00036 SCHERER KR, 1991, MOTIV EMOTION, V15, P123, DOI 10.1007/BF00995674 Schirmer A, 2002, COGNITIVE BRAIN RES, V14, P228, DOI 10.1016/S0926-6410(02)00108-8 Schirmer A, 2005, COGNITIVE BRAIN RES, V24, P442, DOI 10.1016/j.cogbrainres.2005.02.022 Schrober M, 2003, SPEECH COMMUN, V40, P99, DOI 10.1016/S0167-6393(02)00078-X Scott S. 
K., 2009, HDB MAMMALIAN VOCALI, P187 Simon-Thomas ER, 2009, EMOTION, V9, P838, DOI 10.1037/a0017810 Spruyt A, 2007, EXP PSYCHOL, V54, P44, DOI 10.1027/1618-3169.54.1.44 Tracy JL, 2008, EMOTION, V8, P81, DOI 10.1037/1528-3542.8.1.81 Vroomen J, 2001, COGN AFFECT BEHAV NE, V1, P382, DOI 10.3758/CABN.1.4.382 Williams LM, 2009, J CLIN EXP NEUROPSYC, V31, P257, DOI 10.1080/13803390802255635 Young AW, 1997, COGNITION, V63, P271, DOI 10.1016/S0010-0277(97)00003-6 Zhang Q, 2006, BRAIN RES BULL, V71, P316, DOI 10.1016/j.brainresbull.2006.09.023 NR 55 TC 7 Z9 7 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2012 VL 54 IS 1 BP 1 EP 10 DI 10.1016/j.specom.2011.05.011 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 831QI UT WOS:000295745800001 ER PT J AU Fersini, E Messina, E Archetti, F AF Fersini, E. Messina, E. Archetti, F. TI Emotional states in judicial courtrooms: An experimental investigation SO SPEECH COMMUNICATION LA English DT Article DE Emotion recognition; Pattern recognition; Application ID SPEECH; CLASSIFICATION; RECOGNITION AB Thanks to the recent progress in the judicial proceedings management, especially related to the introduction of audio/video recording facilities, the challenge of identification of emotional states can be tackled. Discovering affective states embedded into speech signals could help in semantic retrieval of multimedia clips, and therefore in a deep understanding of mechanisms behind courtroom debates and judges/jurors decision making processes. In this paper two main contributions are given: (1) the collection of real-world human emotions coming from courtroom audio recordings; (2) the investigation of a hierarchical classification system, based on a risk minimization method, able to recognize emotional states from speech signatures. The accuracy of the proposed classification approach - named Multilayer Support Vector Machines - has been evaluated by comparing its performance with traditional machine learning approaches, by using both benchmark datasets and real courtroom recordings. Results in recognition obtained by the proposed technique outperform the prediction power achieved by traditional approaches like SVM, k-Nearest Neighbors, Naive Bayes, Decision Trees and Bayesian Networks. (C) 2011 Elsevier B.V. All rights reserved. C1 [Fersini, E.; Messina, E.; Archetti, F.] Univ Milano Bicocca, I-20126 Milan, Italy. RP Fersini, E (reprint author), Univ Milano Bicocca, Viale Sarca 336, I-20126 Milan, Italy. EM fersini@disco.unimib.it FU European Community [214306] FX This work has been supported by the European Community FP-7 under the JUMAS Project (ref.: 214306). The authors would like to thank Gaia Arosio for the development of the Multi-layer SVM software. 
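As a rough illustration of a layered SVM cascade of the general kind named above, the sketch below trains a first layer that separates neutral from emotional speech and a second layer that discriminates among the emotional classes. The specific two-layer structure, the synthetic feature vectors, and the class set are assumptions made for the example; the actual layer design, risk-minimisation criterion, and acoustic features of the Multilayer SVM are those reported in the paper.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def make_toy_data(n_per_class=50, n_feats=12):
    """Placeholder acoustic feature vectors for four emotional states."""
    labels = ['neutral', 'anger', 'sadness', 'fear']
    X, y = [], []
    for k, lab in enumerate(labels):
        X.append(rng.normal(loc=k, scale=1.0, size=(n_per_class, n_feats)))
        y += [lab] * n_per_class
    return np.vstack(X), np.array(y)

X, y = make_toy_data()

# Layer 1: neutral versus emotional.
layer1 = SVC(kernel='rbf', gamma='scale')
layer1.fit(X, np.where(y == 'neutral', 'neutral', 'emotional'))

# Layer 2: which emotion, trained only on the emotional subset.
emotional = y != 'neutral'
layer2 = SVC(kernel='rbf', gamma='scale')
layer2.fit(X[emotional], y[emotional])

def predict(samples):
    """Route each sample through the cascade: coarse decision, then refinement."""
    coarse = layer1.predict(samples)
    out = coarse.copy()
    refine = coarse == 'emotional'
    if refine.any():
        out[refine] = layer2.predict(samples[refine])
    return out

print(predict(X[:5]), y[:5])
```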
CR AHA DW, 1991, MACH LEARN, V6, P37, DOI 10.1007/BF00153759 Albornoz EM, 2010, LECT NOTES COMPUT SC, V5967, P242 ARCHETTI F, 2008, 1 INT C ICT SOL JUST BARRACHICOTE R, 2009, 10 ANN C INT SPEECH, P336 BATLINER A, 2004, 4 INT C LANG RES EV, P171 Batliner A, 2003, SPEECH COMMUN, V40, P117, DOI 10.1016/S0167-6393(02)00079-1 Bouckaert RR, 2010, J MACH LEARN RES, V11, P2533 Burkhardt F., 2005, INTERSPEECH, P1517 CAMPBELL N, 2002, 3 INT C LANG RES EV Chavhan Y., 2010, INT J COMPUTER APPL, V1, P6 Cichosz J., 2004, Proceedings of International Conference on Signals and Electronic Systems ICSES'04 COOPER GF, 1992, MACH LEARN, V9, P309, DOI 10.1007/BF00994110 Dellaert F., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.608022 DEVILLERS L, 2007, REAL LIFE EMOTION RE, P34 Engbert I. S., 2007, DOCUMENTATION DANISH Fernandez R, 2003, SPEECH COMMUN, V40, P145, DOI 10.1016/S0167-6393(02)00080-8 Fersini E, 2009, LECT NOTES ARTIF INT, V5632, P594, DOI 10.1007/978-3-642-03070-3_45 France DJ, 2000, IEEE T BIO-MED ENG, V47, P829, DOI 10.1109/10.846676 GRIMM M, 2008, IEEE INT C MULT EXP, P865 Guyon I., 2003, Journal of Machine Learning Research, V3, DOI 10.1162/153244303322753616 Hansen J.H.L., 1997, P EUR C SPEECH COMM, P1743 Hastie T, 1998, ADV NEUR IN, V10, P507 JOVICIC S, 2004, P 9 C SPEECH COMP Karstedt S, 2002, THEOR CRIMINOL, V6, P299 KIM D, 2009, P 16 INT C NEUR INF, P649 Kohavi R., 1996, THESIS STANFORD LAZARUS R, 2001, RELATIONAL MEANING D, P37 Lee C., 2009, P INTERSPEECH MAO X, 2009, CSIE 09 P 2009 WRI W, P225 Martin O., 2006, P 22 INT C DAT ENG W, P8, DOI 10.1109/ICDEW.2006.145 Mozziconacci S., 1999, P 14 INT C PHON SCI, P2001 Olshen R., 1984, CLASSIFICATION REGRE, V1st Oudeyer PY, 2003, INT J HUMAN COMPUTER, V59, P157, DOI DOI 10.1016/S1071-581(02)00141-6 PAO TL, 2007, ICIC 07 P INT COMP 3, P997 Pereira C., 2000, ITRW SPEECH EMOTION, P25 Petrushin V.A., 2000, P 6 INT C SPOK LANG, P222 REDDY P, 2010, ABS10064548 CORR Roach P., 1998, J INT PHON ASSOC, V28, P83 Schuller Bjorn, 2009, Proceedings of the 2009 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU 2009), DOI 10.1109/ASRU.2009.5372886 Schuller B, 2006, SPEECH PROSODY Schuller B, 2006, P INT C MULT EXP ICM, P5 Schuller B, 2009, IMAGE VISION COMPUT, V27, P1760, DOI 10.1016/j.imavis.2009.02.013 SCHULLER B, 2007, ICASSP, V2, P733 SEDAAGHI MH, 2007, P 15 EUR SIGN PROC C, P2209 Sethu V, 2008, INT CONF ACOUST SPEE, P5017, DOI 10.1109/ICASSP.2008.4518785 Steininger S., 2002, P WORKSH MULT RES MU, P33 Tato R.S., 2002, P INT C SPOK LANG PR, P2029 VERVERIDIS D, 2004, SIGNAL PROCESS, V1, P593 Vogt Thurid, 2006, P LANG RES EV C LREC WALKER MA, 2001, EUR C SPEECH LANG PR, P1371 Wollmer M., 2008, P INT BRISB AUSTR, P597 Xiao ZZ, 2010, MULTIMED TOOLS APPL, V46, P119, DOI 10.1007/s11042-009-0319-3 Yacoub S, 2003, P EUR GEN, p[1, 729] Yang B, 2010, SIGNAL PROCESS, V90, P1415, DOI 10.1016/j.sigpro.2009.09.009 ZHOU Y, 2009, INT C RES CHALL COMP, P73 NR 55 TC 6 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD JAN PY 2012 VL 54 IS 1 BP 11 EP 22 DI 10.1016/j.specom.2011.06.001 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 831QI UT WOS:000295745800002 ER PT J AU Djamah, M O'Shaughnessy, D AF Djamah, Mouloud O'Shaughnessy, Douglas TI Fine granularity scalable speech coding using embedded tree-structured vector quantization SO SPEECH COMMUNICATION LA English DT Article DE Embedded quantization; Fast search; Scalable speech coding; Tree-structured vector quantization ID LPC PARAMETERS; ALGORITHM; SEARCH; DESIGN AB This paper proposes an efficient codebook design for tree-structured vector quantization (TSVQ) that is embedded in nature. We modify two speech coding standards by replacing their original quantizers for line spectral frequencies (LSF's) and/or Fourier magnitudes quantization with TSVQ-based quantizers. The modified coders are fine-granular bit-rate scalable with gradual change in quality for the synthetic speech. A fast search encoding algorithm using multistage tree-structured vector quantization (MTVQ) is proposed for quantization of LSF's. The proposed method is compared to the multipath sequential tree-assisted search (MSTS) and to the well known multipath sequential search (MSS) or M-L search algorithms. (C) 2011 Elsevier B.V. All rights reserved. C1 [Djamah, Mouloud] Univ Quebec, INRS EMT, Bur 6900, Montreal, PQ H5A 1K6, Canada. RP Djamah, M (reprint author), Univ Quebec, INRS EMT, Bur 6900, 800 Gauchetiere Ouest, Montreal, PQ H5A 1K6, Canada. EM djamah@emt.inrs.ca; dougo@emt.inrs.ca CR [Anonymous], 2006, G7291 ITUT [Anonymous], 1993, P56 ITUT [Anonymous], 2001, P862 ITUT Bao K, 2000, IEEE T CIRC SYST VID, V10, P833 BHATTACHARYA B, 1992, IEEE INT C AC SPEECH CHAN W, 1991, IEEE T COMMUN JAN, P11 CHAN W, 1994, IEEE ICASSP, P521 Chan W.-Y., 1992, ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech and Signal Processing (Cat. 
No.92CH3103-9), DOI 10.1109/ICASSP.1992.226194 CHAN WY, 1990, INT CONF ACOUST SPEE, P1109, DOI 10.1109/ICASSP.1990.116130 CHAN WY, 1991, INT CONF ACOUST SPEE, P3597, DOI 10.1109/ICASSP.1991.151052 CHANG RF, 1991, INT CONF ACOUST SPEE, P2281 BEI CD, 1985, IEEE T COMMUN, V33, P1132 CHEMLA D, 1993, P IEEE SPEECH COD WO, P71, DOI 10.1109/SCFT.1993.762344 Chen FC, 2003, INT CONF ACOUST SPEE, P145 Chu WC, 2003, SPEECH CODING ALGORI CHU WC, 2004, SIGNALS SYSTEMS COMP, V1, P425 Chu WC, 2006, IEEE T AUDIO SPEECH, V14, P1205, DOI 10.1109/TSA.2005.860831 DARPA TIMIT, 1993, AC PHON CONT SPEECH Djamah M, 2010, INT CONF ACOUST SPEE, P4686, DOI 10.1109/ICASSP.2010.5495190 DJAMAH M, 2009, INT C SIGN IM PROC H, P42 DJAMAH M, 2009, 10 ANN C INT SPEECH, P2603 DONG H, 2002, P IEEE ISCAS, P859 Gersho A., 1992, VECTOR QUANTIZATION HIWASAKI Y, 2004, NTT TECH REV, V2 *ITU, 1990, G727 ITU *ITU T, 2007, G729 ITUT CSACELP *ITU T, 2005, G191 ITUT SOFTW TOOL *ITU T STUD GROUP, 1995, SQ4695R3 ITUT STUD G JAFARKHANI H, 1995, P IEEE INT C IM PROC, V2, P81 JUNG S, 2004, P IEEE ICASSP, P285 KABAL P, 1986, IEEE T ACOUST SPEECH, V34, P1419, DOI 10.1109/TASSP.1986.1164983 Katsavounidis I, 1996, IEEE T IMAGE PROCESS, V5, P398, DOI 10.1109/83.480778 LeBlanc WP, 1993, IEEE T SPEECH AUDI P, V1, P373, DOI 10.1109/89.242483 Li WP, 2001, IEEE T CIRC SYST VID, V11, P301 LINDE Y, 1980, IEEE T COMMUN, V28, P84, DOI 10.1109/TCOM.1980.1094577 LYONS DF, 1993, IEEE ICASSP, V5, P602, DOI 10.1109/ICASSP.1993.319883 McCree AV, 1997, IEEE ICASSP, P1591 MCCREE AV, 1995, IEEE T SPEECH AUDI P, V3, P242, DOI 10.1109/89.397089 NISHIGUCHI M, 1999, IEEE SPEECH COD WORK, P84 Paliwal KK, 1993, IEEE T SPEECH AUDI P, V1, P3, DOI 10.1109/89.221363 RISKIN EA, 1994, IEEE T IMAGE PROCESS, V3, P307, DOI 10.1109/83.287025 SUGAMURA N, 1986, ELSEVIER SPEECH COMM, P199 TAI HM, 1999, IEEE IND ELECT SOC, P762 Tsou SL, 2003, ICICS-PCM 2003, VOLS 1-3, PROCEEDINGS, P1389 NR 44 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2012 VL 54 IS 1 BP 23 EP 39 DI 10.1016/j.specom.2011.06.002 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 831QI UT WOS:000295745800003 ER PT J AU Sangwan, A Hansen, JHL AF Sangwan, Abhijeet Hansen, John H. L. TI Automatic analysis of Mandarin accented English using phonological features SO SPEECH COMMUNICATION LA English DT Article DE Phonological features; Accent analysis; Non-native speaker traits ID ARTICULATORY FEATURES; SPEECH RECOGNITION; AMERICAN ENGLISH; FOREIGN ACCENT; CLASSIFICATION; PRONUNCIATION; PERCEPTION; NETWORKS; SPEAKERS; VOWELS AB The problem of accent analysis and modeling has been considered from a variety of domains, including linguistic structure, statistical analysis of speech production features, and HMM/GMM (hidden Markov model/Gaussian mixture model) model classification. These studies however fail to connect speech production from a temporal perspective through a final classification strategy. Here, a novel accent analysis system and methodology which exploits the power of phonological features (PFs) is presented. The proposed system exploits the knowledge of articulation embedded in phonology by building Markov models (MMs) of PFs extracted from accented speech. The Markov models capture information in the PF space along two dimensions of articulation: PF state-transitions and state-durations. 
Furthermore, by utilizing MMs of native and non-native accents, a new statistical measure of "accentedness" is developed which rates the articulation of a word by a speaker on a scale of native-like (+1) to non-native like (-1). The proposed methodology is then used to perform an automatic cross-sectional study of accented English spoken by native speakers of Mandarin Chinese (N-MC). The experimental results demonstrate the capability of the proposed system to perform quantitative as well as qualitative analysis of foreign accents. The work developed in this study can be easily expanded into language learning systems, and has potential impact in the areas of speaker recognition and ASR (automatic speech recognition). (C) 2011 Elsevier B.V. All rights reserved. C1 [Sangwan, Abhijeet; Hansen, John H. L.] Univ Texas Dallas, CRSS, Richardson, TX 75083 USA. RP Hansen, JHL (reprint author), Univ Texas Dallas, CRSS, Richardson, TX 75083 USA. EM john.hansen@utdallas.edu FU USAF [FA8750-09-C-0067] FX This work was supported by the USAF under a subcontract to RADC, Inc., Contract FA8750-09-C-0067. (Approved for public release. Distribution unlimited.) CR Angkititrakul P, 2006, IEEE T AUDIO SPEECH, V14, P634, DOI 10.1109/TSA.2005.851980 Arslan LM, 1996, SPEECH COMMUN, V18, P353, DOI 10.1016/0167-6393(96)00024-6 Arslan LM, 1997, J ACOUST SOC AM, V102, P28, DOI 10.1121/1.419608 Choueiter G, 2008, INT CONF ACOUST SPEE, P4265, DOI 10.1109/ICASSP.2008.4518597 Chreist F., 1964, FOREIGN ACCENT DAS S, 2004, IEEE NORSIG, P344 FLEGE JE, 1988, J ACOUST SOC AM, V84, P70, DOI 10.1121/1.396876 Flege JE, 1997, J PHONETICS, V25, P437, DOI 10.1006/jpho.1997.0052 Frankel J, 2007, COMPUT SPEECH LANG, V21, P620, DOI 10.1016/j.csl.2007.03.002 FRANKEL J, 2007, INTERSPEECH Garofolo JS, 1993, TIMIT ACOUSTIC PHONE Hansen JHL, 2010, SPEECH COMMUN, V52, P777, DOI 10.1016/j.specom.2010.05.004 Jia G, 2006, J ACOUST SOC AM, V119, P1118, DOI 10.1121/1.2151806 Jou SC, 2005, INT CONF ACOUST SPEE, P1009 King S, 2007, J ACOUST SOC AM, V121, P723, DOI 10.1121/1.2404622 King S, 2000, COMPUT SPEECH LANG, V14, P333, DOI 10.1006/csla.2000.0148 Leung KY, 2006, SPEECH COMMUN, V48, P71, DOI 10.1016/j.specom.2005.05.013 MAK B, 2003, HUMAN LANGUAGE TECHN, V2, P217 MANGAYYAGARI S, 2008, INT C PATT REC Markov K, 2006, SPEECH COMMUN, V48, P161, DOI 10.1016/j.specom.2005.07.003 METZE F, 2002, ICSLP Metze F, 2007, SPEECH COMMUN, V49, P348, DOI 10.1016/j.specom.2007.02.009 Morris J, 2008, IEEE T AUDIO SPEECH, V16, P617, DOI 10.1109/TASL.2008.916057 NERI A, 2006, INTERSPEECH PEDERSEN C, 2007, 6 INT C COMP INF SCI SALVI G, 2003, EUROSPEECH, P2677 SANGWAN A, 2007, IEEE AUT SPEECH REC, P582 Sangwan A, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P1525 Scharenborg O, 2007, SPEECH COMMUN, V49, P811, DOI 10.1016/j.specom.2007.01.005 Tepperman J, 2008, IEEE T AUDIO SPEECH, V16, P8, DOI 10.1109/TASL.2007.909330 WEI S, 2006, INTERSPEECH 06 Zheng Y., 2005, INTERSPEECH, P217 NR 32 TC 3 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
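The Markov modelling of phonological-feature state sequences and the word-level accentedness score on a native-like (+1) to non-native-like (-1) scale described above can be sketched as follows: first-order transition matrices are estimated from PF state sequences, and a word is scored by a length-normalised log-likelihood ratio between the native and non-native models, squashed into [-1, +1] with tanh. The ratio-plus-tanh mapping and the tiny three-state feature inventory are assumptions made for illustration, not the measure derived in the paper, and the state durations that the paper also models are omitted here.

```python
import numpy as np

def fit_markov(sequences, n_states, smooth=1.0):
    """Maximum-likelihood first-order transition matrix with additive smoothing."""
    counts = np.full((n_states, n_states), smooth)
    for seq in sequences:
        for a, b in zip(seq[:-1], seq[1:]):
            counts[a, b] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def log_likelihood(seq, trans):
    return float(sum(np.log(trans[a, b]) for a, b in zip(seq[:-1], seq[1:])))

def accentedness(seq, native_trans, nonnative_trans):
    """Length-normalised LLR mapped to [-1, +1]: +1 ~ native-like, -1 ~ non-native-like."""
    n_transitions = max(len(seq) - 1, 1)
    llr = (log_likelihood(seq, native_trans) -
           log_likelihood(seq, nonnative_trans)) / n_transitions
    return float(np.tanh(llr))

# Toy usage with three phonological-feature states.
native_seqs    = [[0, 0, 1, 1, 2, 2], [0, 1, 1, 2, 2, 2]]
nonnative_seqs = [[0, 2, 0, 2, 1, 0], [2, 0, 2, 1, 0, 2]]
A_nat = fit_markov(native_seqs, 3)
A_non = fit_markov(nonnative_seqs, 3)
print(accentedness([0, 0, 1, 2, 2], A_nat, A_non))   # closer to +1 (native-like)
print(accentedness([2, 0, 2, 0, 1], A_nat, A_non))   # closer to -1 (non-native-like)
```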
PD JAN PY 2012 VL 54 IS 1 BP 40 EP 54 DI 10.1016/j.specom.2011.06.003 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 831QI UT WOS:000295745800004 ER PT J AU Vijayasenan, D Valente, F Bourlard, H AF Vijayasenan, Deepu Valente, Fabio Bourlard, Herve TI Multistream speaker diarization of meetings recordings beyond MFCC and TDOA features SO SPEECH COMMUNICATION LA English DT Article DE Speaker diarization; Meeting recordings; Multi-stream modeling; NIST rich transcription; Information bottleneck diarization ID INFORMATION; SYSTEM AB Many state-of-the-art diarization systems for meeting recordings are based on the HMM/GMM framework and the combination of spectral (MFCC) and time delay of arrivals (TDOA) features. This paper presents an extensive study on how multistream diarization can be improved beyond these two sets of features. While several other features have been proven effective for speaker diarization, little efforts have been devoted to integrate them into the MFCC + TDOA state-of-the-art baseline and to the authors' best knowledge, no positive results have been reported so far. The first contribution of this paper consists in analyzing the reasons of this, investigating through a set of oracle experiments the robustness of the HMM/GMM diarization when also other features (the modulation spectrum features and the frequency domain linear prediction features) are integrated. The second contribution of the paper consists in introducing a non-parametric multistream diarization method based on the information bottleneck (IB) approach. In contrary to the HMM/GMM which makes use of log-likelihood combination, it combines the feature streams in a normalized space of relevance variables. The previous analysis is repeated revealing that the proposed approach is more robust and can actually benefit from other sources of information beyond the conventional MFCC and TDOA features. Experiments based on the rich transcription data (heterogeneous meetings data recorded in several different rooms) show that it achieves a very competitive error of only 6.3% when four feature streams are used, compared to the 14.9% of the HMM/GMM system. Those results are analyzed in terms of error sensitivity to the stream weightings. To the authors' best knowledge this is the first successful attempt to reduce the speaker error combining other features with the MFCC and the TDOA and the first study to show the shortcomings of the HMM/GMM in going beyond this baseline. As last contribution, the paper also addresses issues related to the computational complexity of multistream approaches. (C) 2011 Elsevier B.V. All rights reserved. C1 [Vijayasenan, Deepu; Valente, Fabio; Bourlard, Herve] Idiap Res Inst, CH-1920 Martigny, Switzerland. RP Valente, F (reprint author), Idiap Res Inst, CH-1920 Martigny, Switzerland. EM deepu.vijayasenan@idiap.ch; fabio.valente@idiap.ch; herve.bourlard@idiap.ch FU Swiss Science Foundation; EU; Hasler Foundation FX The authors would like to thank colleagues involved in the AMI and IM2 projects, Dr. John Dines (IDIAP) and Samuel Thomas (Johns Hopkins University) for their help with this work as well as the anonymous reviewers for their comments. This work was funded by the Swiss Science Foundation through IM2 grant, by the EU through SSPnet grant and by the Hasler Foundation through the SESAME grant. 
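The contrast drawn above between weighted log-likelihood combination and combination in a normalised space of relevance variables can be made concrete with a toy example. The weights, the relevance variables, and the numbers below are placeholders rather than the information bottleneck formulation detailed in the paper; the point is only that per-stream posteriors already live on a common, normalised scale, whereas raw log-likelihoods from different feature streams do not.

```python
import numpy as np

def combine_loglik(stream_logliks, weights):
    """HMM/GMM-style combination: weighted sum of per-stream log-likelihoods."""
    return sum(w * ll for w, ll in zip(weights, stream_logliks))

def combine_posteriors(stream_posteriors, weights):
    """IB-style combination: weighted average of per-stream posterior
    distributions over the relevance variables (each row sums to one)."""
    combined = sum(w * p for w, p in zip(weights, stream_posteriors))
    return combined / combined.sum(axis=-1, keepdims=True)

# Toy example: two feature streams, one segment, four relevance variables.
mfcc_post = np.array([[0.70, 0.20, 0.05, 0.05]])
tdoa_post = np.array([[0.40, 0.40, 0.10, 0.10]])
weights = (0.8, 0.2)
print(combine_posteriors((mfcc_post, tdoa_post), weights))

# The same weights applied to raw per-cluster log-likelihoods on very different
# scales: the large-magnitude stream dominates the sum unless the weights are
# retuned, which is one reason weight sensitivity matters in the HMM/GMM case.
mfcc_ll = np.array([-1200.0, -1250.0])
tdoa_ll = np.array([-15.0, -5.0])
print(combine_loglik((mfcc_ll, tdoa_ll), weights))
```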
CR Ajmera J, 2004, IEEE SIGNAL PROC LET, V11, P649, DOI 10.1109/LSP.2004.831666 Ajmera J., 2004, THESIS ECOLE POLYTEC Anguera X., 2006, THESIS U POLITECNICA Anguera X., 2006, BEAMFORMIT FAST ROBU Anguera X, 2005, LECT NOTES COMPUT SC, V3869, P402 ANGUERA X, 2006, P AUT SPEECH REC UND, P426 Athineos M., 2003, P IEEE WORKSH AUT SP Chen S. S., 1998, P DARPA BROADC NEWS, P127 Friedland G, 2009, IEEE T AUDIO SPEECH, V17, P985, DOI 10.1109/TASL.2009.2015089 GANAPATHY S, 2008, P INTERSPEECH BRISB GUILLERMO A, 2008, THESIS ECOLE POLYTEC HARREMOES P, 2007, IEEE INT S INF THEOR, P566 KINGSBURY B, 1998, SPEECH COMMUN, V25, P17132 Kinnunen T., 2008, P OD SPEAK LANG REC NOULAS A, 2007, P INT C MULT INT ICM, P350, DOI 10.1145/1322192.1322254 Pardo JM, 2007, IEEE T COMPUT, V56, P1212, DOI 10.1109/TC.2007.1077 PARDO JM, 2006, INT C SPEECH LANG PR Slonim N, 2002, THESIS HEBREW U JERU Slonim N., 1999, P ADV NEUR INF PROC, P617 Thomas S, 2008, IEEE SIGNAL PROC LET, V15, P681, DOI 10.1109/LSP.2008.2002708 TISHBY N, 1998, NEC RES I TR Tranter SE, 2006, IEEE T AUDIO SPEECH, V14, P1557, DOI 10.1109/TASL.2006.878256 van Leeuwen DA, 2008, LECT NOTES COMPUT SC, V4625, P475 Vijayasenan D., 2009, 10 ANN C INT SPEECH VIJAYASENAN D, 2008, INTERSPEECH 2008 Vijayasenan D, 2009, IEEE T AUDIO SPEECH, V17, P1382, DOI 10.1109/TASL.2009.2015698 VINYALS O, 2008, P INT 2008 Wooters C, 2008, LECT NOTES COMPUT SC, V4625, P509 NR 28 TC 0 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2012 VL 54 IS 1 BP 55 EP 67 DI 10.1016/j.specom.2011.07.001 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 831QI UT WOS:000295745800005 ER PT J AU Chiosain, MN Welby, P Espesser, R AF Chiosain, Maire Ni Welby, Pauline Espesser, Robert TI Is the syllabification of Irish a typological exception? An experimental study SO SPEECH COMMUNICATION LA English DT Article DE Syllabification; Ambisyllabicity; Irish; Speech perception; Pattern frequency ID INTERVOCALIC CONSONANTS; ENGLISH SYLLABICATION; SYLLABLE STRUCTURE; SEGMENTATION; LANGUAGE; FRENCH; DURATION; VOWEL; SYLLABIFICATION; PERCEPTION AB We examined whether Irish speakers syllabify intervocalic consonants as codas (e.g., poca 'pocket' 'po:k.(sic)/ CVC.V), as claimed by many authors, but contrary to claims in phonological theory of a universal preference for syllables with onsets. We conducted a perception experiment using a part-repetition task and presented auditory stimuli consisting of VCV items with a single medial consonant (Cm), varying in the length of VI and the manner of articulation of Cm (e.g., poca /po:k(sic)/ 'pocket', lofa /lof(sic)/ 'rotten'), as well as VCCV items varying in the length of VI and consonant sequence type (e.g., masla /masl(sic) 'insult', canta /ka:nt(sic)/ 'hunk'). Response patterns were in line with many, though not all, of the findings in the literature for other languages: Listeners preferred syllables with onsets, often treated Cm as ambisyllabic, syllabified Cm as a coda more often when VI was short, and dispreferred stops as codas. The results, however, did not completely support the Syllable Onset Segmentation Hypothesis (SOSH), which proposes differing roles for syllable onsets and offsets in word segmentation. 
For VCCV items, listeners show a great deal of variability in decisions not only about where the first syllable ends, but also about where the second syllable begins, a variability that could not be explained by the number of legal onsets possible for a given consonant sequence. We examined the hypotheses that variability in perception can be accounted for by (I) variability in production (in the signal) and (2) phoneme pattern frequency. We searched pattern frequencies in an electronic dictionary of Irish, but found no support for an account in which language-specific syllabification patterns reflect patterns of word-initial phoneme sequences. Our investigation of potential acoustic cues to syllable boundaries showed a gradient effect of vowel length on syllabification judgments: the longer the VI duration, the less likely a closed syllable, in line with results for other languages. For Irish, though, this pattern interestingly holds for all consonant manners except stops. The phonetic analyses point to other language-specific differences in phonetic patterns that cue syllable boundaries. For Irish, unlike English, consonant duration was not a more important cue to syllable boundaries than vowel duration, and there was no evidence that relative duration between the two consonants of a medial sequence signals syllable boundaries. The findings have implications not only for the syllable structure of Irish and theories of syllabification more generally. They are relevant to all theoretical and applied work on Irish that makes reference to the syllable. (C) 2011 Elsevier B.V. All rights reserved. C1 [Chiosain, Maire Ni; Welby, Pauline] Univ Coll Dublin, Sch Irish Celt Studies Irish Folklore & Linguist, Dublin 4, Ireland. [Welby, Pauline; Espesser, Robert] Univ Aix Marseille 1, CNRS, Lab Parole & Langage, F-13100 Aix En Provence, France. [Welby, Pauline; Espesser, Robert] Univ Aix Marseille 2, F-13100 Aix En Provence, France. RP Chiosain, MN (reprint author), Univ Coll Dublin, Sch Irish Celt Studies Irish Folklore & Linguist, John Henry Newman Bldg, Dublin 4, Ireland. EM maire.nichiosain@ucd.ie; pauline.welby@lpl-aix.fr; robert.espesser@lpl-aix.fr FU Foras na Gaeilge FX We thank Cliona Ni Chiosain, Niall O Ciosain, Anna Ni Ghallachair, Kayla Reed, and John Walsh for their help in recruiting participants and providing space to run the experiments, Brian O Raghallaigh and Michelle Tooher for technical assistance, Michal Boleslav Mechura and Kevin Scannell for their help with the frequency analyses, Christine "Ni Mhuilleoir" Meunier for her help with the acoustic analyses, the audiences at our presentations at the Formal Approaches to Celtic Linguistics Conference and the Laboratoire Parole et Langage (LPL) for their helpful feedback, and Foras na Gaeilge for financial support. We thank two anonymous reviewers for their valuable comments on an earlier version of the manuscript. We also thank our participants. CR Arnason Kristjan, 1980, QUANTITY HIST PHONOL Baayen R. H., 1995, CELEX LEXICAL DATABA Baayen R. 
Harald, 2008, ANAL LINGUISTIC DATA Barry William, 1999, PHONUS I PHONETICS U, V4, P87 Bates D., 2005, R NEWS, V5, P27, DOI DOI 10.1111/J.1523-1739.2005.00280.X Bell Alan, 1978, SYLLABLES SEGMENTS Berg T, 2000, J PHONETICS, V28, P187, DOI 10.1006/jpho.2000.0112 Berg Thomas, 2001, NORD J LINGUIST, V24, P71, DOI 10.1080/033258601750266196 Bertinetto Pier Marco, 2004, ITALIAN J LINGUISTIC, V16, P349 BERTINETTO PM, 1994, QUADERNI LAB LINGUIS, V8, P1 Blevins J., 1995, HDB PHONOLOGICAL THE, P206 Boersma P., 2008, PRAAT DOING PHONETIC BORGSTROM C, 1937, NORSK TIDSSKRIFT SPO, V7, P71 BOSCH A, 1998, TEXAS LINGUISTIC FOR, V41, P1 Bosch Anna, 1998, SCOTTISH GAELIC STUD, V18, P1 BOUCHER VJ, 1988, J PHONETICS, V16, P299 Breatnach Risteard B., 1947, IRISH RING CO WATERF Breen G, 1999, LINGUIST INQ, V30, P1, DOI 10.1162/002438999553940 BREEN G, 1990, SYLLABLE ARRER UNPUB CAIRNS CE, 2011, BRILLS HDB LINGUISTI CHRISTIE WM, 1974, J ACOUST SOC AM, V55, P819, DOI 10.1121/1.1914606 Clements G. N., 1990, PAPERS LAB PHONOLOGY, P283 Clements George N., 1983, LINGUISTIC INQUIRY M, V9 Clements GN, 2009, CURR STUD LINGUIST, P165 Content A, 2001, J MEM LANG, V45, P177, DOI 10.1006/jmla.2000.2775 Content A, 2001, LANG COGNITIVE PROC, V16, P609 COTE MH, OXFORD HDB IN PRESS Cote MH, 2011, BRILL HANDB LINGUIST, V1, P273 CRYSTAL TH, 1988, J PHONETICS, V16, P285 Cutler A, 2001, LANG SPEECH, V44, P171 Dalton M., 2008, THESIS TRINITY COLL Dalton M, 2005, LANG SPEECH, V48, P441 DAVIDSEN-NIELSEN N., 1974, J PHONETICS, V2, P15 De Bhaldraithe Tomas, 1945, IRISH COIS FHAIRRGE de Burca Sean, 1958, IRISH TOURMAKEADY CO DERWING BL, 1992, LANG SPEECH, V35, P219 DILWORTH A, 1972, MAINLAND DIALECTS SC Dixon R., 2002, AUSTR LANGUAGES THEI Dubach Green Antony, 1997, THESIS CORNELL U Dumay N, 2002, BRAIN LANG, V81, P144, DOI 10.1006/brln.2001.2513 ELLISON TM, 1998, INSTRUMENTAL S UNPUB Evans N, 2009, BEHAV BRAIN SCI, V32, P429, DOI 10.1017/S0140525X0999094X FALLOWS D, 1981, J LINGUIST, V17, P309, DOI 10.1017/S0022226700007027 Fery Caroline, 2003, SYLLABLE OPTIMALITY FHAILLIGH EMA, 1968, IRISH ERRIS Forster K. 
I., 2008, J MEMORY LANGUAGE, V59 Gillies William, 1993, CELTIC LANGUAGES, P145 Gillis S, 1996, J CHILD LANG, V23, P487 Giollagain C, 2007, STAIDEAR CUIMSITHEAC GORDEEVA OB, 2007, WP12 QUEEN MARGARET Goslin J, 2008, LANG SPEECH, V51, P199, DOI 10.1177/0023830908098540 Goslin J, 2001, LANG SPEECH, V44, P409 GUOMUNDSSON V, 1922, ISLANDSK GRAMMATIK Harrington J, 2010, PHONETIC ANAL SPEECH Hay J., 2004, PAPERS LAB PHONOLOGY, VVI, P58 Holmer Nils, 1962, GAELIC KINTYRE HOOPER JB, 1972, LANGUAGE, V48, P525, DOI 10.2307/412031 Hothorn T, 2008, BIOMETRICAL J, V50, P346, DOI 10.1002/bimj.200810425 Ishikawa K, 2002, LANG SPEECH, V45, P355 ITO J, 1988, THESIS U MASSACHUSET Jaeger TF, 2008, J MEM LANG, V59, P434, DOI 10.1016/j.jml.2007.11.007 Jakobson Roman, 1956, FUNDAMENTALS LANGUAG JOANISSE M, 1999, INT C PHON SCI SAN F, P731 Jones D, 1936, OUTLINE ENGLISH PHON KAHN D, 1980, THESIS MASSACHUSETTS KEATING PA, 1984, LANGUAGE, V60, P286, DOI 10.2307/413642 KHARLAMOV V, 2009, ANN C CAN LING ASS, P12 KNOTT E, 1974, INTRO IRISH SYLLABIC Kochetav A, 2004, LANG SPEECH, V47, P351 Ladefoged P., 1996, SOUNDS WORLDS LANGUA Levin J., 1985, THESIS MIT LUCE PA, 1985, J ACOUST SOC AM, V78, P1949, DOI 10.1121/1.392651 MACKEN MA, 1990, PAR SYLL PHON PHON, P273 MADDIESON IAN, 1985, PHONETIC LINGUISTICS, P203 McQueen JM, 1998, J MEM LANG, V39, P21, DOI 10.1006/jmla.1998.2568 MEYNADIER Y, 2001, TRAVAUX INTERDISCIPL, V20, P91 NEW B, 2006, TRAITEMENT AUTOMATIQ Ni Chasaide A., 1999, HDB INT PHONETIC ASS, P111 Ni Chiosain Maire, 1991, THESIS U MASSACHUSET NICHASAIDE A, 1987, INT C PHON SCI TALL, P28 NICHIOSAIN M, 2007, LANG VAR CHANGE, V19, P51 NICHIOSAIN M, 1994, PHONOLOGICA 1992, P157 O Baoill Donall, 1986, LARCHANUINT DON GHAE O Cuiv Brian, 1944, IRISH W MUSKERRY CO O Murchu Mairtin, 1989, E PERTHSHIRE GAELIC O Searcaigh Seamus, 1925, FOGHRAIDHEACHT GHAED O Siadhail Micheal, 1975, CORAS FUAIMEANNA GAE OBAOILL D, 1986, FOCLOIR POCA O'Connor JD, 1953, WORD, V9, P103 OFTEDAL M, 1956, NORSK TIDSSKRIFT S S, V4 Ohala M, 1999, STUD GENERA GRAMMAR, V45, P93 ORAGHALLAIGH B, 2010, THESIS TRINITY COLL PETERSON GE, 1960, J ACOUST SOC AM, V32, P693, DOI 10.1121/1.1908183 PRINCE A, 2004, 2 RUTG U CTR COGN SC PULGRAM E, 1970, JANUA LINGUARUM SERI, V81 Quene H, 2004, SPEECH COMMUN, V43, P103, DOI 10.1016/j.specom 2004.02.004 Quiggin Edmund Crosby, 1906, DIALECT DONEGAL BEIN Redford MA, 2005, J PHONETICS, V33, P27, DOI 10.1016/j.wocn.2004.05.003 Redford MA, 1999, J ACOUST SOC AM, V106, P1555, DOI 10.1121/1.427152 Richtsmeier PT, 2011, LAB PHONOLOGY, V2, P157 RUBACH J, 1999, RIV LINGUISTICA, V11, P273 Schiller NO, 1997, LANG SPEECH, V40, P103 SELKIRK L, 1982, STRUCTURE PHONOLOGIC, V2, P337 Shatzman KB, 2006, PERCEPT PSYCHOPHYS, V68, P1, DOI 10.3758/BF03193651 Siadhail Micheal, 1989, MODERN IRISH GRAMMAT Sjoestedt Marie-Louise, 1931, PHONETIQUE PARLER IR SOMMER BA, 1970, INT J AM LINGUIST, V36, P57, DOI 10.1086/465090 SOMMER B, 1981, PHONOLOGY 1980S, P231 Spinelli E, 2003, J MEM LANG, V48, P233, DOI 10.1016/S0749-596X(02)00513-2 Spinelli E, 2010, ATTEN PERCEPT PSYCHO, V72, P775, DOI 10.3758/APP.72.3.775 Spinelli E, 2007, LANG COGNITIVE PROC, V22, P828, DOI 10.1080/01690960601076472 Steriade Donca, 1982, THESIS MIT Summerfield Q, 1987, HEARING EYE PSYCHOL, P3 Suomi K, 1997, J MEM LANG, V36, P422, DOI 10.1006/jmla.1996.2495 Tabain M., 2004, J INT PHON ASSOC, V34, P175, DOI 10.1017/S0025100304001719 Ternes Elmar, 1973, PHONEMIC ANAL SCOTTI TREIMAN R, 1990, J MEM LANG, V29, P66, DOI 10.1016/0749-596X(90)90010-W TREIMAN R, 1992, J PHONETICS, V20, 
P383 TREIMAN R, 1988, J MEM LANG, V27, P87, DOI 10.1016/0749-596X(88)90050-2 Tuller B., 1990, ATTENTION PERFORM, P429 TULLER B, 1991, J SPEECH HEAR RES, V34, P501 van der Hulst Harry, 1999, SYLLABLE VIEWS FACTS VANDERLUGT A, 1999, THESIS KATHOLIEKE U Vennemann Th, 1988, PREFERENCE LAWS SYLL Zec D, 2007, CAMBRIDGE HANDBOOK OF PHONOLOGY, P161 ZIOLKOWSKI MS, 1990, PAR SYLL PHON PHON ZUE VW, 1976, THESIS MIT NR 127 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2012 VL 54 IS 1 BP 68 EP 91 DI 10.1016/j.specom.2011.07.002 PG 24 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 831QI UT WOS:000295745800006 ER PT J AU Paulmann, S Titone, D Pell, MD AF Paulmann, Silke Titone, Debra Pell, Marc D. TI How emotional prosody guides your way: Evidence from eye movements SO SPEECH COMMUNICATION LA English DT Article DE Eye-tracking; Gaze; Speech processing; Affective prosody; Semantics ID EVENT-RELATED POTENTIALS; SPOKEN-WORD RECOGNITION; AFFECT DECISION TASK; VISUAL-SEARCH; FACIAL EXPRESSION; SEX-DIFFERENCES; ERP EVIDENCE; TIME-COURSE; BASIC EMOTIONS; FACE AB This study investigated cross-modal effects of emotional voice tone (prosody) on face processing during instructed visual search. Specifically, we evaluated whether emotional prosodic cues in speech have a rapid, mandatory influence on eye movements to an emotionally-related face, and whether these effects persist as semantic information unfolds. Participants viewed an array of six emotional faces while listening to instructions spoken in an emotionally congruent or incongruent prosody (e.g., "Click on the happy face" spoken in a happy or angry voice). The duration and frequency of eye fixations were analyzed when only prosodic cues were emotionally meaningful (pre-emotional label window: "Click on the/ ... "), and after emotional semantic information was available (post-emotional label window: " ... /happy face"). In the pre-emotional label window, results showed that participants made immediate use of emotional prosody, as reflected in significantly longer frequent fixations to emotionally congruent versus incongruent faces. However, when explicit semantic information in the instructions became available (post-emotional label window), the influence of prosody on measures of eye gaze was relatively minimal. Our data show that emotional prosody has a rapid impact on gaze behavior during social information processing, but that prosodic meanings can be overridden by semantic cues when linguistic information is task relevant. (C) 2011 Elsevier B.V. All rights reserved. C1 [Paulmann, Silke] Univ Essex, Dept Psychol, Colchester C04 3SQ, Essex, England. [Titone, Debra] McGill Univ, Dept Psychol, Montreal, PQ, Canada. [Pell, Marc D.] McGill Univ, Sch Commun Sci & Disorders, Montreal, PQ, Canada. [Titone, Debra; Pell, Marc D.] McGill Ctr Res Language Mind & Brain, Montreal, PQ, Canada. RP Paulmann, S (reprint author), Univ Essex, Dept Psychol, Wivenhoe Pk, Colchester C04 3SQ, Essex, England. 
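As a rough illustration of the kind of gaze summary implied by this design, the sketch below tags invented fixation records with the analysis window and prosody-face congruence and aggregates mean fixation duration and fixation count per condition. All column names and values are made up for the example; the study's actual dependent measures and statistical analyses are those reported in the paper.

```python
import pandas as pd

# Invented fixation records: one row per fixation on a face in the display.
fixations = pd.DataFrame({
    'participant': [1, 1, 1, 2, 2, 2],
    'window':      ['pre', 'pre', 'post', 'pre', 'post', 'post'],  # relative to the emotion word
    'congruent':   [True, False, True, True, False, True],         # face matches the prosody
    'duration_ms': [310, 190, 420, 280, 210, 390],
})

summary = (fixations
           .groupby(['window', 'congruent'])['duration_ms']
           .agg(mean_duration='mean', n_fixations='count')
           .reset_index())
print(summary)
```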
EM paulmann@essex.ac.uk FU Center for Research on Language, Mind and Brain (CRLMB); German Academic Exchange Service (DAAD); McGill University FX The authors wish to thank Matthieu Couturier for help with programming the experiment, Abhishek Jaywant for help with data acquisition, Stephen Hopkins, Moritz Dannhauer, and Cord Plasse for help with data analysis, and Catherine Knowles for help with tables and figures. This work was supported by a new initiative fund awarded to the authors by the Center for Research on Language, Mind and Brain (CRLMB). Support received from the German Academic Exchange Service (DAAD) to the first author and McGill University (William Dawson Scholar award) to the third author is gratefully acknowledged. CR Allopenna PD, 1998, J MEM LANG, V38, P419, DOI 10.1006/jmla.1997.2558 Ashley V, 2004, NEUROREPORT, V15, P211, DOI 10.1097/01.wnr.0000091411.19795.f5 Banse R, 1996, J PERS SOC PSYCHOL, V70, P614, DOI 10.1037/0022-3514.70.3.614 Batty M, 2003, COGNITIVE BRAIN RES, V17, P613, DOI 10.1016/S0926-6410(03)00174-5 Berman JMJ, 2010, J EXP CHILD PSYCHOL, V107, P87, DOI 10.1016/j.jecp.2010.04.012 Besson M, 2002, TRENDS COGN SCI, V6, P405, DOI 10.1016/S1364-6613(02)01975-7 Boersma P., 2009, PRAAT DOING PHONETIC Borod JC, 2000, COGNITION EMOTION, V14, P193 Bostanov V, 2004, PSYCHOPHYSIOLOGY, V41, P259, DOI 10.1111/j.1469-8986.2003.00142.x BOWER GH, 1981, AM PSYCHOL, V36, P129, DOI 10.1037//0003-066X.36.2.129 Bowers D., 1993, NEUROPSYCHOLOGY, V7, P433, DOI 10.1037//0894-4105.7.4.433 BOWERS D, 1987, NEUROPSYCHOLOGIA, V25, P317, DOI 10.1016/0028-3932(87)90021-2 Brosch T., 2008, J COGNITIVE NEUROSCI, V21, P1670 Calvo MG, 2008, EXP PSYCHOL, V55, P359, DOI 10.1027/1618-3169.55.6.359 Calvo MG, 2008, J EXP PSYCHOL GEN, V137, P471, DOI 10.1037/a0012771 Carroll JM, 1996, J PERS SOC PSYCHOL, V70, P205, DOI 10.1037/0022-3514.70.2.205 Carroll NC, 2005, Q J EXP PSYCHOL-A, V58, P1173, DOI 10.1080/02724980443000539 Dahan D, 2005, PSYCHON B REV, V12, P453, DOI 10.3758/BF03193787 de Gelder B, 2000, COGNITION EMOTION, V14, P289 Eastwood JD, 2001, PERCEPT PSYCHOPHYS, V63, P1004, DOI 10.3758/BF03194519 Eimer M, 2002, NEUROREPORT, V13, P427, DOI 10.1097/00001756-200203250-00013 Eimer M, 2003, COGN AFFECT BEHAV NE, V3, P97, DOI 10.3758/CABN.3.2.97 EKMAN P, 1992, COGNITION EMOTION, V6, P169, DOI 10.1080/02699939208411068 EKMAN P, 1969, SCIENCE, V164, P86, DOI 10.1126/science.164.3875.86 Fazio RH, 2001, COGNITION EMOTION, V15, P115, DOI 10.1080/0269993004200024 Frischen A, 2008, PSYCHOL BULL, V134, P662, DOI 10.1037/0033-2909.134.5.662 Grimshaw GM, 1998, BRAIN COGNITION, V36, P108, DOI 10.1006/brcg.1997.0949 HANSEN CH, 1988, J PERS SOC PSYCHOL, V54, P917, DOI 10.1037/0022-3514.54.6.917 HANSEN CH, 1995, PERS SOC PSYCHOL B, V21, P548, DOI 10.1177/0146167295216001 Henderson J., 2004, INTERFACE LANGUAGE V HESS U, 1988, MULTICHANNEL COMMUNI Hietanen J, 2004, EUR J COGN PSYCHOL, V16, P769, DOI 10.1080/09541440340000330 Horstmann G, 2007, VIS COGN, V15, P799, DOI 10.1080/13506280600892798 Innes-Ker A, 2002, J PERS SOC PSYCHOL, V83, P804, DOI 10.1037//0022-3514.83.4.804 Isaacowitz D. 
M., 2008, PSYCHOL SCI, V19, P843 ITO K, 2008, J MEM LANG, P541 Johnstone T., 2000, HDB EMOTIONS, V2nd, P220 Kissler J, 2008, EXP BRAIN RES, V188, P215, DOI 10.1007/s00221-008-1358-0 Kitayama S, 2002, COGNITION EMOTION, V16, P29, DOI 10.1080/0269993943000121 Koelsch S, 2005, TRENDS COGN SCI, V9, P578, DOI 10.1016/j.tics.2005.10.001 Kotz SA, 2007, BRAIN RES, V1151, P107, DOI 10.1016/j.brainres.2007.03.015 Massaro DW, 1996, PSYCHON B REV, V3, P215, DOI 10.3758/BF03212421 MATIN E, 1993, PERCEPT PSYCHOPHYS, V53, P372, DOI 10.3758/BF03206780 MUMANUS MS, 2009, GAZE FIXATION PERCEP NIEDENTHAL PM, 1995, PSYCHOL LEARN MOTIV, V33, P23, DOI 10.1016/S0079-7421(08)60371-0 Nummenmaa L, 2009, J EXP PSYCHOL HUMAN, V35, P305, DOI 10.1037/a0013626 Nygaard LC, 2002, MEM COGNITION, V30, P583, DOI 10.3758/BF03194959 Ohman A, 2001, J EXP PSYCHOL GEN, V130, P466, DOI 10.1037/0096-3445.130.3.466 Paulmann S, 2011, MOTIV EMOTION, V35, P192, DOI 10.1007/s11031-011-9206-0 PAULMANN S, 2006, P ARCH MECH LANG PRO, P37 Paulmann S, 2008, BRAIN LANG, V105, P59, DOI 10.1016/j.bandl.2007.11.005 Paulmann S, 2009, J PSYCHOPHYSIOL, V23, P63, DOI 10.1027/0269-8803.23.2.63 Paulmann S, 2008, NEUROREPORT, V19, P209, DOI 10.1097/WNR.0b013e3282f454db Paulmann S, 2010, COGN AFFECT BEHAV NE, V10, P230, DOI 10.3758/CABN.10.2.230 Paulmann S, 2008, BRAIN LANG, V104, P262, DOI 10.1016/j.bandl.2007.03.002 Paulmann S, 2009, NEUROREPORT, V20, P1603, DOI 10.1097/WNR.0b013e3283320e3f Pell MD, 2009, J PHONETICS, V37, P417, DOI 10.1016/j.wocn.2009.07.005 Pell MD, 2006, BRAIN LANG, V96, P221, DOI 10.1016/j.bandl.2005.04.007 Pell MD, 2009, J NONVERBAL BEHAV, V33, P107, DOI 10.1007/s10919-008-0065-7 Pell MD, 2005, J NONVERBAL BEHAV, V29, P193, DOI 10.1007/s10919-005-7720-z Pell MD, 2008, SPEECH COMMUN, V50, P519, DOI 10.1016/j.specom.2008.03.006 PELL MD, COGNITION E IN PRESS Pell MD, 2005, J NONVERBAL BEHAV, V29, P45, DOI 10.1007/s10919-004-0889-8 Pittman J., 1993, HDB EMOTIONS, P185 Russell JA, 2000, HDB EMOTION Sauter DA, 2010, J COGNITIVE NEUROSCI, V22, P474, DOI 10.1162/jocn.2009.21215 Sauter DA, 2010, Q J EXP PSYCHOL, V63, P2251, DOI 10.1080/17470211003721642 Scherer K. 
R., 1989, VOCAL MEASUREMENT EM Scherer KR, 2001, J CROSS CULT PSYCHOL, V32, P76, DOI 10.1177/0022022101032001009 Schirmer A, 2005, NEUROREPORT, V16, P635, DOI 10.1097/00001756-200504250-00024 Schirmer A, 2006, TRENDS COGN SCI, V10, P24, DOI 10.1016/j.tics.2005.11.009 Schirmer A, 2002, COGNITIVE BRAIN RES, V14, P228, DOI 10.1016/S0926-6410(02)00108-8 Schirmer A, 2005, COGNITIVE BRAIN RES, V24, P442, DOI 10.1016/j.cogbrainres.2005.02.022 Schirmer A, 2003, J COGNITIVE NEUROSCI, V15, P1135, DOI 10.1162/089892903322598102 Spivey MJ, 2001, PSYCHOL SCI, V12, P282, DOI 10.1111/1467-9280.00352 TANENHAUS MK, 1995, SCIENCE, V268, P1632, DOI 10.1126/science.7777863 Thompson WF, 2008, COGNITION EMOTION, V22, P1457, DOI 10.1080/02699930701813974 Vroomen J, 2001, COGN AFFECT BEHAV NE, V1, P382, DOI 10.3758/CABN.1.4.382 Vroomen J, 2000, J EXP PSYCHOL HUMAN, V26, P1583, DOI 10.1037/0096-1523.26.5.1583 Wambacq IJA, 2004, NEUROREPORT, V15, P555, DOI 10.1097/01.wnr.0000109989.85243.8f Weber A, 2006, COGNITION, V99, pB63, DOI 10.1016/j.cognition.2005.07.001 Wilson D, 2006, J PRAGMATICS, V38, P1559, DOI 10.1016/j.pragma.2005.04.012 Young AW, 1997, COGNITION, V63, P271, DOI 10.1016/S0010-0277(97)00003-6 Zeelenberg R, 2010, COGNITION, V115, P202, DOI 10.1016/j.cognition.2009.12.004 NR 84 TC 10 Z9 10 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2012 VL 54 IS 1 BP 92 EP 107 DI 10.1016/j.specom.2011.07.004 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 831QI UT WOS:000295745800007 ER PT J AU Jancovic, P Zou, X Kokuer, M AF Jancovic, Peter Zou, Xin Koekueer, Muenevver TI Speech enhancement based on Sparse Code Shrinkage employing multiple speech models SO SPEECH COMMUNICATION LA English DT Article DE Speech enhancement; Sparse Code Shrinkage; Independent Component Analysis; Multiple models; Clustering; Super-Gaussian distribution; Gaussian mixture model (GMM) ID NONSTATIONARY NOISE; SIGNAL ENHANCEMENT AB This paper presents a single-channel speech enhancement system based on the Sparse Code Shrinkage (SCS) algorithm and employment of multiple speech models. The enhancement system consists of two stages: training and enhancement. In the training stage, the Gaussian mixture modelling (GMM) is employed to cluster speech signals in ICA-based transform domain into several categories, and for each category a super-Gaussian model is estimated that is used during the enhancement stage. In the enhancement stage, the estimate of each signal frame is obtained as a weighted average of estimates obtained by using each speech category model. The weights are calculated according to the probability of each category, given the signal enhanced using the conventional SCS algorithm. During the enhancement, the individual speech category models are further adapted at each signal frame. Experimental evaluations are performed on speech signals from the TIMIT database, corrupted by Gaussian noise and three real-world noises, Subway, Street, and Railway noise, from the NOISEX-92 database. Evaluations are performed in terms of segmental SNR, spectral distortion and PESQ measure. Experimental results show that the proposed multi-model SCS enhancement algorithm significantly outperforms the conventional WF, SCS and multi-model WF algorithms. (C) 2011 Elsevier B.V. All rights reserved. 
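A minimal sketch of the core operation behind such multi-model Sparse Code Shrinkage, assuming a Laplacian (super-Gaussian) coefficient prior per speech category and GMM category posteriors already computed for the current frame; the shrinkage nonlinearity, noise variance and scales below are illustrative stand-ins, whereas the actual system trains its category models on ICA-transformed clean speech.

import numpy as np

def laplacian_shrink(y, noise_var, scale):
    """Shrinkage rule for a Laplacian (super-Gaussian) coefficient prior:
    soft-thresholding of noisy transform-domain coefficients y."""
    threshold = noise_var / scale          # stronger shrinkage for noisier input
    return np.sign(y) * np.maximum(np.abs(y) - threshold, 0.0)

def multi_model_scs(y, noise_var, posteriors, scales):
    """Combine per-category shrinkage estimates, weighted by the GMM posterior
    probability of each speech category given the current frame."""
    estimates = np.stack([laplacian_shrink(y, noise_var, s) for s in scales])
    weights = np.asarray(posteriors)[:, None]
    return (weights * estimates).sum(axis=0)

# toy frame of transform-domain coefficients and two speech categories
y = np.array([0.05, -1.20, 0.40, 2.00])
print(multi_model_scs(y, noise_var=0.10, posteriors=[0.7, 0.3], scales=[0.5, 1.5]))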
C1 [Jancovic, Peter; Zou, Xin; Koekueer, Muenevver] Univ Birmingham, Sch Elect Elect & Comp Engn, Birmingham B15 2TT, W Midlands, England. RP Jancovic, P (reprint author), Univ Birmingham, Sch Elect Elect & Comp Engn, Pritchatts Rd, Birmingham B15 2TT, W Midlands, England. EM p.jancovic@bham.ac.uk; x.zou@ul-ster.ac.uk; m.kokuer@bham.ac.uk FU UK EPSRC [EP/F036132/1] FX This work was supported by UK EPSRC Grant EP/F036132/1. CR BOLL S, 1979, SIGNAL PROCESS, V27, P120 CHOI C, 2001, P INT C IND COMP AN EPHRAIM Y, 1992, P IEEE, V80, P1524 EPHRAIM Y, 1992, IEEE T SIGNAL PROCES, V40, P725, DOI 10.1109/78.127947 Ephraim Y., 1985, IEEE T ACOUST SPEECH, V33, P251 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 Gazor S, 2005, IEEE T SPEECH AUDI P, V13, P896, DOI 10.1109/TSA.2005.851943 Hu Y, 2003, IEEE T SPEECH AUDI P, V11, P457, DOI 10.1109/TSA.2003.815936 Hyvarinen A, 1999, NEURAL COMPUT, V11, P1739, DOI 10.1162/089976699300016214 Hyvarinen A, 2001, INDEPENDENT COMPONEN Jancovic P, 2007, IEEE SIGNAL PROC LET, V14, P66, DOI 10.1109/LSP.2006.881517 JANCOVIC P, 2011, RECENT ADV ROBUST SP, P103 Kundu A, 2008, INT CONF ACOUST SPEE, P4893, DOI 10.1109/ICASSP.2008.4518754 Lee JH, 2000, ELECTRON LETT, V36, P1506, DOI 10.1049/el:20001028 LIM JS, 1979, P IEEE, V67, P1586, DOI 10.1109/PROC.1979.11540 Lotter T, 2005, EURASIP J APPL SIG P, V2005, P1110, DOI 10.1155/ASP.2005.1110 Martin R, 2002, INT CONF ACOUST SPEE, P253 MCAULAY RJ, 1980, IEEE T ACOUST SPEECH, V28, P137, DOI 10.1109/TASSP.1980.1163394 Ming J, 2011, IEEE T AUDIO SPEECH, V19, P822, DOI 10.1109/TASL.2010.2064312 Potamitis I, 2001, INT CONF ACOUST SPEE, P621, DOI 10.1109/ICASSP.2001.940908 Sameti H, 1998, IEEE T SPEECH AUDI P, V6, P445, DOI 10.1109/89.709670 Srinivasan S, 2006, IEEE T AUDIO SPEECH, V14, P163, DOI 10.1109/TSA.2005.854113 Vaseghi Saeed V., 2005, ADV DIGITAL SIGNAL P Virag N, 1999, IEEE T SPEECH AUDI P, V7, P126, DOI 10.1109/89.748118 Wolfe PJ, 2003, EURASIP J APPL SIG P, V2003, P1043, DOI 10.1155/S1110865703304111 YOU C, 2006, SPEECH COMMUN, V48, P50 YOUNG S, 2000, HTK BOOK V3 1 Zhao DY, 2007, IEEE T AUDIO SPEECH, V15, P882, DOI 10.1109/TASL.2006.885256 Zou X, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P415 Zou X, 2008, IEEE T SIGNAL PROCES, V56, P1812, DOI 10.1109/TSP.2007.910555 NR 30 TC 3 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2012 VL 54 IS 1 BP 108 EP 118 DI 10.1016/j.specom.2011.07.005 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 831QI UT WOS:000295745800008 ER PT J AU Do, CT Pastor, D Goalic, A AF Cong-Thanh Do Pastor, Dominique Goalic, Andre TI A novel framework for noise robust ASR using cochlear implant-like spectrally reduced speech SO SPEECH COMMUNICATION LA English DT Article DE Aurora 2; Cochlear implant; Kullback-Leibler divergence; HMM-based ASR; Noise robust ASR; Spectrally reduced speech ID HIDDEN MARKOV-MODELS; MAXIMUM-LIKELIHOOD; AMPLITUDE ESTIMATOR; SUBSPACE APPROACH; RECOGNITION; ENHANCEMENT; ENVIRONMENTS; SUBTRACTION; DIVERGENCE; ALGORITHM AB We propose a novel framework for noise robust automatic speech recognition (ASR) based on cochlear implant-like spectrally reduced speech (SRS). Two experimental protocols (EPs) are proposed in order to clarify the advantage of using SRS for noise robust ASR. 
These two EPs assess the SRS in both the training and testing environments. Speech enhancement was used in one of two EPs to improve the quality of testing speech. In training, SRS is synthesized from original clean speech whereas in testing, SRS is synthesized directly from noisy speech or from enhanced speech signals. The synthesized SRS is recognized with the ASR systems trained on SRS signals, with the same synthesis parameters. Experiments show that the ASR results, in terms of word accuracy, calculated with ASR systems using SRS, are significantly improved compared to the baseline non-SRS ASR systems. We propose also a measure of the training and testing mismatch based on the Kullback-Leibler divergence. The numerical results show that using the SRS in ASR systems helps in reducing significantly the training and testing mismatch due to environmental noise. The training of the HMM-based ASR systems and the recognition tests were performed by using the HTK toolkit and the Aurora 2 speech database. (C) 2011 Elsevier B.V. All rights reserved. C1 [Cong-Thanh Do] Idiap Res Inst, Ctr Parc, CH-1920 Martigny, Switzerland. [Pastor, Dominique; Goalic, Andre] Telecom Bretagne, UMR CNRS Lab STICC 3192, F-29238 Brest 3, France. RP Do, CT (reprint author), Idiap Res Inst, Ctr Parc, Rue Marconi 19,POB 592, CH-1920 Martigny, Switzerland. EM cong-thanh.do@idiap.ch FU Bretagne Regional Council, Bretagne, France FX This work was initially performed when Cong-Thanh Do was with Telecom Bretagne, UMR CNRS 3192 Lab-STICC. It was supported by the Bretagne Regional Council, Bretagne, France. CR Bhattacharyya A., 1943, Bulletin of the Calcutta Mathematical Society, V35 BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 Chen J., 2008, SPRINGER HDB SPEECH, P843, DOI 10.1007/978-3-540-49127-9_43 Cohen I, 2002, IEEE SIGNAL PROC LET, V9, P12, DOI 10.1109/97.988717 Cooke M, 2001, SPEECH COMMUN, V34, P267, DOI 10.1016/S0167-6393(00)00034-0 DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1 Do CT, 2010, IEEE T AUDIO SPEECH, V18, P1065, DOI 10.1109/TASL.2009.2032945 DO CT, 2010, P JEP 2010 JOURN ET, P49 EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 FURUI S, 1980, IEEE T ACOUST SPEECH, V32, P357 Gales M., 1996, THESIS CAMBRIDGE U Gales MJF, 1998, SPEECH COMMUN, V25, P49, DOI 10.1016/S0167-6393(98)00029-6 Gauvain JL, 2000, P IEEE, V88, P1181, DOI 10.1109/5.880079 GONG YF, 1995, SPEECH COMMUN, V16, P261, DOI 10.1016/0167-6393(94)00059-J Gunawan T. S., 2004, P 10 INT C SPEECH SC, P420 Gustafsson H, 2001, IEEE T SPEECH AUDI P, V9, P799, DOI 10.1109/89.966083 Hansen JHL, 2006, IEEE T AUDIO SPEECH, V14, P2049, DOI 10.1109/TASL.2006.876883 Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616 Hershey J. 
R., 2007, P ICASSP, V4, P317 HIRSCH HG, 2000, P ISCA ASR2000 AUT S Hu Y, 2003, IEEE T SPEECH AUDI P, V11, P334, DOI 10.1109/TSA.2003.814458 Jabloun F, 2003, IEEE T SPEECH AUDI P, V11, P700, DOI 10.1109/TSA.2003.818031 KUBIN G, 1999, P IEEE INT C AC SPEE, V1, P205 KULLBACK S, 1951, ANN MATH STAT, V22, P79, DOI 10.1214/aoms/1177729694 LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010 Leonard R., 1984, P ICASSP, V9, P328 Loizou PC, 1999, IEEE ENG MED BIOL, V18, P32, DOI 10.1109/51.740962 Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst MANSOUR D, 1989, IEEE T ACOUST SPEECH, V37, P1659, DOI 10.1109/29.46548 NADAS A, 1983, IEEE T ACOUST SPEECH, V31, P814, DOI 10.1109/TASSP.1983.1164173 Rangachari S, 2006, SPEECH COMMUN, V48, P220, DOI 10.1016/j.specom.2005.08.005 Shannon BJ, 2006, SPEECH COMMUN, V48, P1458, DOI 10.1016/j.specom.2006.08.003 SHANNON RV, 1995, SCIENCE, V270, P303, DOI 10.1126/science.270.5234.303 Silva J, 2008, IEEE T SIGNAL PROCES, V56, P4176, DOI 10.1109/TSP.2008.924137 Silva J, 2006, IEEE T AUDIO SPEECH, V14, P890, DOI 10.1109/TSA.2005.858059 Young S., 2008, SPRINGER HDB SPEECH, P539, DOI 10.1007/978-3-540-49127-9_27 Young S., 2006, HTK BOOK HTK VERSION NR 38 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2012 VL 54 IS 1 BP 119 EP 133 DI 10.1016/j.specom.2011.07.006 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 831QI UT WOS:000295745800009 ER PT J AU Nakamura, K Toda, T Saruwatari, H Shikano, K AF Nakamura, Keigo Toda, Tomoki Saruwatari, Hiroshi Shikano, Kiyohiro TI Speaking-aid systems using GMM-based voice conversion for electrolaryngeal speech SO SPEECH COMMUNICATION LA English DT Article DE Electrolaryngeal speech; Voice conversion; Speaking-aid system; Speech enhancement; Airpressure sensor; Silence excitation; Non-audible murmur; Laryngectomee ID MAXIMUM-LIKELIHOOD; LARYNGECTOMY; CANCER AB An electrolarynx (EL) is a medical device that generates sound source signals to provide laryngectomees with a voice. In this article we focus on two problems of speech produced with an EL (EL speech). One problem is that EL speech is extremely unnatural and the other is that sound source signals with high energy are generated by an EL, and therefore, the signals often annoy surrounding people. To address these two problems, in this article we propose three speaking-aid systems that enhance three different types of EL speech signals: EL speech, EL speech using an air-pressure sensor (EL-air speech), and silent EL speech. The air-pressure sensor enables a laryngectomee to manipulate the F-0 contours of EL speech using exhaled air that flows from the tracheostoma. Silent EL speech is produced with a new sound source unit that generates signals with extremely low energy. Our speaking-aid systems address the poor quality of EL speech using voice conversion (VC), which transforms acoustic features so that it appears as if the speech is uttered by another person. Our systems estimate spectral parameters, F-0 and aperiodic components independently. The result of experimental evaluations demonstrates that the use of an air-pressure sensor dramatically improves F-0 estimation accuracy. Moreover, it is revealed that the converted speech signals are preferred to source EL speech. (C) 2011 Elsevier B.V. All rights reserved. 
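The spectral mapping step common to GMM-based voice conversion, as used in the speaking-aid systems above, can be sketched as a frame-wise minimum mean-square-error mapping under a joint source-target GMM. The data and model sizes below are synthetic and purely illustrative; the actual systems estimate spectra, F0 and aperiodicity separately and use maximum-likelihood trajectory conversion rather than this simplified per-frame rule.

import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

# Synthetic "parallel" training data: source (EL-like) and target spectral
# features aligned frame by frame (real systems align mel-cepstra with DTW).
rng = np.random.default_rng(0)
d = 4
src = rng.normal(size=(500, d))
tgt = 0.5 * src @ rng.normal(size=(d, d)) + 0.1 * rng.normal(size=(500, d))

gmm = GaussianMixture(n_components=4, covariance_type="full",
                      random_state=0).fit(np.hstack([src, tgt]))

def convert_frame(x):
    """MMSE mapping: posterior-weighted sum of conditional means E[target | x, mixture m]."""
    resp = np.zeros(gmm.n_components)
    cond_means = np.zeros((gmm.n_components, d))
    for m in range(gmm.n_components):
        mu_x, mu_y = gmm.means_[m][:d], gmm.means_[m][d:]
        cov = gmm.covariances_[m]
        cov_xx, cov_yx = cov[:d, :d], cov[d:, :d]
        resp[m] = gmm.weights_[m] * multivariate_normal.pdf(x, mu_x, cov_xx)
        cond_means[m] = mu_y + cov_yx @ np.linalg.solve(cov_xx, x - mu_x)
    return (resp / resp.sum()) @ cond_means

print(convert_frame(src[0]))      # converted spectral features for one frame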
C1 [Nakamura, Keigo; Toda, Tomoki; Saruwatari, Hiroshi; Shikano, Kiyohiro] Nara Inst Sci & Technol, Grad Sch Informat Sci, Ikoma, Nara 6300192, Japan. RP Nakamura, K (reprint author), Nara Inst Sci & Technol, Grad Sch Informat Sci, 8916-5 Takayama Cho, Ikoma, Nara 6300192, Japan. EM kei_go@nifty.com FU MIC SCOPE; JSPS FX The authors are grateful to Professor Hideki Kawahara of Wakayama University, Japan, for permission to use the STRAIGHT analysis-synthesis method. This research was also supported in part by MIC SCOPE and Grant-in-Aid for JSPS Fellows. CR Carr MM, 2000, OTOLARYNG HEAD NECK, V122, P39, DOI 10.1016/S0194-5998(00)70141-0 DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1 Forastiere AA, 2003, NEW ENGL J MED, V349, P2091, DOI 10.1056/NEJMoa031317 Fukada T., 1992, P ICASSP 92, V1, P137 Goldstein E.A, 2004, IEEE T BIOMED ENG, V51 HASHIBA M, 2001, IEICE T D 2, V94, P1240 Hatamura Y, 2001, J ARTIF ORGANS, V4, P288, DOI 10.1007/BF02480019 HAYES M, 1951, CANC J CLIN, V1, P147 Hocevar-Boltezar I, 2001, RADIOL ONCOL, V35, P249 HOSOI Y, 2003, SP2003105 IEICE, P13 IFUKUBE T, 2003, SOUND BASED ASSISTIV Imai S., 1983, ELECTR COMMUN JPN, V66, P10, DOI 10.1002/ecja.4400660203 Jemal A, 2008, CA-CANCER J CLIN, V58, P71, DOI 10.3322/CA.2007.0010 KAIN A, 1998, ACOUST SPEECH SIG PR, P285 KAWAHARA H, 2001, 2 MODELS ANAL VOCAL Kawahara H, 1999, SPEECH COMMUN, V27, P187, DOI 10.1016/S0167-6393(98)00085-5 Laccourreye O, 1996, LARYNGOSCOPE, V106, P495, DOI 10.1097/00005537-199604000-00019 Liu HJ, 2006, IEEE T BIO-MED ENG, V53, P865, DOI 10.1109/TBME.2006.872821 Miyamoto D, 2009, INT CONF ACOUST SPEE, P3901, DOI 10.1109/ICASSP.2009.4960480 Murakami K., 2004, IEICE T D 1, VJ87-D-I, P1030 NAKAGIRI M, 2006, P INTERSPEECH PITTSB, P2270 Nakajima Y., 2005, P INT 2005, P293 Nakajima Y, 2006, IEICE T INF SYST, VE89D, P1, DOI 10.1093/ietisy/e89-d.1.1 Nakamura K., 2007, IEICE T INF SYST, VJ90-D, P780 OHTANI Y, 2006, P ICSLP SEPT, P2266 Saikachi Y, 2009, J SPEECH LANG HEAR R, V52, P1360, DOI 10.1044/1092-4388(2009/08-0167) SINGER MI, 1980, ANN OTO RHINOL LARYN, V89, P529 Talkin D., 1995, SPEECH CODING SYNTHE, P495 Toda T., 2009, P INTERSPEECH, P632 Toda T., 2005, P INTERSPEECH LISB P, P1957 Toda T, 2007, IEEE T AUDIO SPEECH, V15, P2222, DOI 10.1109/TASL.2007.907344 Uemi N., 1994, Proceedings. 3rd IEEE International Workshop on Robot and Human Communication. RO-MAN '94 Nagoya (Cat. No.94TH0679-1), DOI 10.1109/ROMAN.1994.365931 WILLIAMS SE, 1985, ARCH OTOLARYNGOL, V111, P216 NR 33 TC 15 Z9 15 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2012 VL 54 IS 1 BP 134 EP 146 DI 10.1016/j.specom.2011.07.007 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 831QI UT WOS:000295745800010 ER PT J AU Kong, YY Mullangi, A AF Kong, Ying-Yee Mullangi, Ala TI On the development of a frequency-lowering system that enhances place-of-articulation perception SO SPEECH COMMUNICATION LA English DT Article DE Frequency lowering; Speech perception; Place of articulation; Fricatives ID HEARING-LOSS; ENGLISH FRICATIVES; ACOUSTIC CHARACTERISTICS; CONSONANT IDENTIFICATION; SPEECH-PERCEPTION; TRANSPOSITION; COMPRESSION; LISTENERS; CHILDREN; DISCRIMINATION AB Frequency lowering is a form of signal processing designed to deliver high-frequency speech cues to the residual hearing region of a listener with a high-frequency hearing loss. 
While this processing technique has been shown to improve the intelligibility of fricative and affricate consonants, perception of place of articulation has remained a challenge for hearing-impaired listeners, especially when the bandwidth of the speech signal is reduced during the frequency-lowering processing. This paper describes a modified vocoder-based frequency-lowering system similar to one reported by Posen et al. (1993), with the goal of improving place-of-articulation perception by enhancing the spectral differences of fricative consonants. In this system, frequency lowering is conditional; it suppresses the processing whenever the high-frequency portion (>400 Hz) of the speech signal is a periodic signal. In addition, the system separates non-sonorant consonants into three classes based on the spectral information (slope and peak location) of fricative consonants. Results from a group of normal-hearing listeners with our modified system show improved perception of frication and affrication features, as well as place-of-articulation distinction, without degrading the perception of nasals and semivowels compared to low-pass filtering and Posen et al's system. (C) 2011 Elsevier B.V. All rights reserved. C1 [Kong, Ying-Yee] Northeastern Univ, Dept Speech Language Pathol & Audiol, Boston, MA 02115 USA. [Kong, Ying-Yee; Mullangi, Ala] Northeastern Univ, Bioengn Program, Boston, MA 02115 USA. RP Kong, YY (reprint author), Northeastern Univ, Dept Speech Language Pathol & Audiol, 106A Forsyth Bldg, Boston, MA 02115 USA. EM yykong@neu.edu FU NIH/NIDCD [R03 DC009684-03] FX We would like to thank Professor Louis Braida for helpful suggestions and Ikaro Silva for technical support. We also thank Dr. Qian-Jie Fu for allowing us to use his Matlab programs for performing information transmission analysis. This work was supported by NIH/NIDCD (R03 DC009684-03 and ARRA supplement, PI: YYK). CR Ali AMA, 2001, J ACOUST SOC AM, V109, P2217, DOI 10.1121/1.1357814 Baer T, 2002, J ACOUST SOC AM, V112, P1133, DOI 10.1121/1.1498853 BEASLEY DS, 1976, AUDIOLOGY, V15, P395 BEHRENS S, 1988, J ACOUST SOC AM, V84, P861, DOI 10.1121/1.396655 BEHRENS SJ, 1988, J PHONETICS, V16, P295 Boersma P., 2009, PRAAT DOING PHONETIC BRAIDA LD, 1979, ASHA MONOGRAPH, V19 Dudley H., 1939, J ACOUST SOC AM, V11, P165, DOI 10.1121/1.1902137 Fox RA, 2005, J SPEECH LANG HEAR R, V48, P753, DOI 10.1044/1092-4388(2005/052) Fullgrabe C, 2010, INT J AUDIOL, V49, P420, DOI 10.3109/14992020903505521 Glista D, 2009, INT J AUDIOL, V48, P632, DOI 10.1080/14992020902971349 HARRIS KS, 1958, LANG SPEECH, V1, P1 Henry Belinda A., 1998, Australian Journal of Audiology, V20, P79 Hogan CA, 1998, J ACOUST SOC AM, V104, P432, DOI 10.1121/1.423247 HUGHES GW, 1956, J ACOUST SOC AM, V28, P303, DOI 10.1121/1.1908271 Jongman A, 2000, J ACOUST SOC AM, V108, P1252, DOI 10.1121/1.1288413 Kong YY, 2011, J SPEECH LANG HEAR R, V54, P959, DOI 10.1044/1092-4388(2010/10-0197) Korhonen P, 2008, J AM ACAD AUDIOL, V19, P639, DOI 10.3766/jaaa.19.8.7 Kuk F, 2009, J AM ACAD AUDIOL, V20, P465, DOI 10.3766/jaaa.20.8.2 Lippmann R. P., 1980, J ACOUST SOC AM, V67, pS78, DOI 10.1121/1.2018401 Maniwa K, 2009, J ACOUST SOC AM, V125, P3962, DOI 10.1121/1.2990715 McDermotet H, 2010, J AM ACAD AUDIOL, V21, P380, DOI 10.3766/jaaa.21.6.3 McDermott HJ, 2000, BRIT J AUDIOL, V34, P353 MILLER GA, 1955, J ACOUST SOC AM, V27, P338, DOI 10.1121/1.1907526 Moore F. 
R., 1990, ELEMENTS COMPUTER MU Nissen SL, 2005, J ACOUST SOC AM, V118, P2570, DOI 10.1121/1.2010407 NITTROUER S, 1989, J SPEECH HEAR RES, V32, P120 Onaka A., 2000, 8 AUST INT C SPEECH, P134 POSEN MP, 1993, J REHABIL RES DEV, V30, P26 REED CM, 1991, J REHABIL RES DEV, V28, P6782 REED CM, 1983, J ACOUST SOC AM, V74, P409, DOI 10.1121/1.389834 Robinson JD, 2007, INT J AUDIOL, V46, P293, DOI 10.1080/14992020601188591 Shannon RV, 1999, J ACOUST SOC AM, V106, pL71, DOI 10.1121/1.428150 Simpson A, 2006, INT J AUDIOL, V45, P619, DOI 10.1080/14992020600825508 Simpson A, 2005, INT J AUDIOL, V44, P281, DOI 10.1080/14992020500060636 Simpson Andrea, 2009, Trends Amplif, V13, P87, DOI 10.1177/1084713809336421 Stelmachowicz PG, 2004, ARCH OTOLARYNGOL, V130, P556, DOI 10.1001/archotol.130.5.556 Turner Christopher W., 1999, Journal of the Acoustical Society of America, V106, P877, DOI 10.1121/1.427103 VELMANS M, 1973, LANG SPEECH, V16, P224 Velmans M., 1974, BRIT J AUDIOL, V8, P1, DOI 10.3109/03005367409086943 Wolfe J, 2010, J AM ACAD AUDIOL, V21, P618, DOI 10.3766/jaaa.21.10.2 Wolfe J, 2011, INT J AUDIOL, V50, P396, DOI 10.3109/14992027.2010.551788 NR 42 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2012 VL 54 IS 1 BP 147 EP 160 DI 10.1016/j.specom.2011.07.008 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 831QI UT WOS:000295745800011 ER PT J AU Schuller, B Batliner, A Steidl, S AF Schuller, Bjorn Batliner, Anton Steidl, Stefan TI Introduction to the special issue on sensing emotion and affect - Facing realism in speech processing SO SPEECH COMMUNICATION LA English DT Editorial Material C1 [Schuller, Bjorn] Tech Univ Munich, Inst Human Machine Commun, D-8000 Munich, Germany. [Batliner, Anton; Steidl, Stefan] Univ Erlangen Nurnberg, Pattern Recognit Lab, Nurnberg, Germany. RP Schuller, B (reprint author), Tech Univ Munich, Inst Human Machine Commun, D-8000 Munich, Germany. EM schuller@tum.de NR 0 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV-DEC PY 2011 VL 53 IS 9-10 SI SI BP 1059 EP 1061 DI 10.1016/j.specom.2011.07.003 PG 3 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 810CW UT WOS:000294104000001 ER PT J AU Schuller, B Batliner, A Steidl, S Seppi, D AF Schuller, Bjorn Batliner, Anton Steidl, Stefan Seppi, Dino TI Recognising realistic emotions and affect in speech: State of the art and lessons learnt from the first challenge SO SPEECH COMMUNICATION LA English DT Article DE Emotion; Affect; Automatic classification; Feature types; Feature selection; Noise robustness; Adaptation; Standardisation; Usability; Evaluation ID HIDDEN MARKOV-MODELS; AFFECT RECOGNITION; LINEAR PREDICTION; ALGORITHM; MEMORY; PERFORMANCE; CONTROVERSY; PERCEPTION; FRAMEWORK; SELECTION AB More than a decade has passed since research on automatic recognition of emotion from speech has become a new field of research in line with its 'big brothers' speech and speaker recognition. This article attempts to provide a short overview on where we are today, how we got there and what this can reveal us on where to go next and how we could arrive there. 
In a first part, we address the basic phenomenon reflecting the last fifteen years, commenting on databases, modelling and annotation, the unit of analysis and prototypicality. We then shift to automatic processing including discussions on features, classification, robustness, evaluation, and implementation and system integration. From there we go to the first comparative challenge on emotion recognition from speech-the INTERSPEECH 2009 Emotion Challenge, organised by (part of) the authors, including the description of the Challenge's database, Sub-Challenges, participants and their approaches, the winners, and the fusion of results to the actual learnt lessons before we finally address the ever-lasting problems and future promising attempts. (C) 2011 Elsevier B.V. All rights reserved. C1 [Schuller, Bjorn] Tech Univ Munich, Inst Human Machine Commun, D-8000 Munich, Germany. [Batliner, Anton; Steidl, Stefan] Univ Erlangen Nurnberg, Pattern Recognit Lab, Nurnberg, Germany. [Seppi, Dino] Katholieke Univ Leuven, ESAT, Louvain, Belgium. RP Schuller, B (reprint author), Tech Univ Munich, Inst Human Machine Commun, D-8000 Munich, Germany. EM schuller@tum.de FU European Union [211486, IST-2001-37599, IST-2002-50742, RTN-CT-2006-035561]; HUMAINE Association; Deutsche Telekom Laboratories FX This work was partly funded by the European Union under Grant Agreement No. 211486 (FP7/2007-2013, SEMAINE), IST-2001-37599 (PF-STAR), IST-2002-50742 (HUMAINE), and RTN-CT-2006-035561 (S2S). The authors would further like to thank the sponsors of the challenge, the HUMAINE Association and Deutsche Telekom Laboratories. The responsibility lies with the authors. CR AI H, 2006, P INT PITTSB PA, P797 ALHAMES M, 2006, P ICASSP TOUL FRANC, P757 Altun H, 2009, EXPERT SYST APPL, V36, P8197, DOI 10.1016/j.eswa.2008.10.005 Ang J, 2002, P INT C SPOK LANG PR, P2037 Armstrong JS, 2007, INT J FORECASTING, V23, P321, DOI 10.1016/j.ijforecast.2007.03.004 Arunachalam S., 2001, P EUR AALB DENM, P2675 ATAL BS, 1971, J ACOUST SOC AM, V50, P637, DOI 10.1121/1.1912679 Athanaselis T, 2005, NEURAL NETWORKS, V18, P437, DOI 10.1016/j.neunet.2005.03.008 Ayadi M. M. H. 
E., 2007, P ICASSP HON HY, P957 Baggia P., 2007, EMMA EXTENSIBLE MULT BARRACHICOTE R, 2009, ACOUSTIC EMOTION REC, P336 Batliner A, 2001, P EUR 2001 AALB DENM, P2781 Batliner A, 2003, SPEECH COMMUN, V40, P117, DOI 10.1016/S0167-6393(02)00079-1 Batliner A., 2010, ADV HUMAN COMPUTER I, P15, DOI 10.1155/2010/782802 BATLINER A, 2004, P TUT RES WORKSH AFF, P1 Batliner A., 2005, P INT LISB, P489 Batliner A, 2008, USER MODEL USER-ADAP, V18, P175, DOI 10.1007/s11257-007-9039-4 Batliner A., 2003, P EUROSPEECH, P733 Batliner A, 2006, P IS LTC 2006 LJUBL, P240 Batliner A., 2008, P ICASSP 2008 LAS VE, P4497 Batliner A, 2011, COMPUT SPEECH LANG, V25, P4, DOI 10.1016/j.csl.2009.12.003 Batliner A, 2000, P ISCA WORKSH SPEECH, P195 Batliner A., 2007, P INT WORKSH PAR SPE, P17 Batliner A., 2007, P 16 INT C PHONETICS, P2201 Batliner Anton, 2006, P IS LTC 2006 LJ SLO, P246 Bellman RE, 1961, ADAPTIVE CONTROL PRO Bengio S., 2003, ADV NIPS Bengio Y, 1995, ADV NEURAL INFORMATI, V7, P427 BODA PP, 2004, P COLING 2004 SAT WO, P22 Boersma P., 2005, PRAAT DOING PHONETIC Boersma P., 1993, P I PHONETIC SCI, V17, P97 Bogert B, 1963, S TIM SER AN, P209 Bozkurt E., 2009, P INT BRIGHT, P324 BREESE J, 1998, MSTR9841 Brendel M., 2010, P 3 INT WORKSH EMOTI, P58 Burkhardt F., 2005, P INT, P1517 Burkhardt F, 2009, P ACII AMST NETH, P1 Busso C., 2004, P 6 INT C MULT INT, P205, DOI 10.1145/1027933.1027968 Campbell N., 2005, P INT LISB PORT, P465 Chen L. S., 1998, P IEEE WORKSH MULT S, P83 CHENG YM, 2006, P INT WORKSH MULT SI, P238 Cheveigne A, 2002, J ACOUST SOC AM, V111, P1917 Chuang ZJ, 2004, P IEEE INT C MULT EX, V1, P53 Cohen J., 1988, STAT POWER ANAL BEHA, V2nd COWIE R, 2005, NEURAL NETWORKS, V18, P3388 Cowie R., 2000, P ISCA WORKSH SPEECH, P19 Cowie R., 1999, COMPUT INTELL, P109 COWIE R, 2010, HUMAINE HDB DAUBECHIES I, 1990, IEEE T INFORM THEORY, V36, P961, DOI 10.1109/18.57199 DAVIS S, 1980, IEEE T ACOUST SPEECH, V29, P917 de Gelder B, 2000, COGNITION EMOTION, V14, P289 de Gelder B, 1999, NEUROSCI LETT, V260, P133, DOI 10.1016/S0304-3940(98)00963-X Dellaert F., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.608022 Devillers L, 2005, NEURAL NETWORKS, V18, P407, DOI 10.1016/j.neunet.2005.03.007 DEVILLERS L, 2005, P 1 INT C AFF COMP I, P519 Devillers L., 2003, P ICME 2003, P549 Devillers L., 2007, LECT NOTES COMPUTER, V4441, P34 Ding Hui, 2006, P 2006 INT C INT INF, P537 Dumouche P, 2009, P INT C INT 2009 BRI, P344 Elliott C., 1992, THESIS NW U Engberg I., 1997, P EUR RHOD GREEC, P1695 ERICKSON D, 2004, PHONETICA, V63, P1 Eyben F, 2010, P 3 INT WORKSH EM SA, P77 Eyben F, 2010, J MULTIMODAL USER IN, V3, P7, DOI 10.1007/s12193-009-0032-6 Eyben F., 2010, P ACM MULT MM FLOR I, P1459, DOI 10.1145/1873951.1874246 Eyben F., 2009, P ACII AMST NETH, P576 EYSENCK HJ, 1960, PSYCHOL REV, V67, P269, DOI 10.1037/h0048412 FATTAH SA, 2008, INT C NEUR NETW SIGN, P114 FEHR B, 1984, J EXP PSYCHOL GEN, V113, P464, DOI 10.1037/0096-3445.113.3.464 Fei ZC, 2006, PACLIC 20: Proceedings of the 20th Pacific Asia Conference on Language, Information and Computation, P257 Ferguson CJ, 2009, PROF PSYCHOL-RES PR, V40, P532, DOI 10.1037/a0015808 Fernandez R, 2003, SPEECH COMMUN, V40, P145, DOI 10.1016/S0167-6393(02)00080-8 FILLENBA.S, 1966, LANG SPEECH, V9, P217 Nasoz F., 2004, Cognition, Technology & Work, V6, DOI 10.1007/s10111-003-0143-x Fiscus J. 
G., 1997, P IEEE WORKSH AUT SP, P347 FLEISS JL, 1969, PSYCHOL BULL, V72, P323, DOI 10.1037/h0028106 Forbes-Riley K., 2004, P HUM LANG TECHN C N FRICK RW, 1985, PSYCHOL BULL, V97, P412, DOI 10.1037//0033-2909.97.3.412 Fukunaga K., 1990, INTRO STAT PATTERN R, V2nd Gaussier E., 2005, P 28 ANN INT ACM SIG, P601, DOI DOI 10.1145/1076034.1076148 Gigerenzer G., 2004, J SOCIO-ECON, V33, P587, DOI [DOI 10.1016/J.SOCEC.2004.09.033, DOI 10.1016/J.S0CEC.2004.09.033] Godbole N., 2007, P INT C WEBL SOC MED Goertzel B., 2000, P ANN C SOC STUD ART Grimm M, 2008, P IEEE INT C MULT EX, P865 Grimm M, 2007, LECT NOTES COMPUT SC, V4738, P126 GUNES H, 2005, IEEE INT C SYST MAN, V4, P3437 Hall M., 1998, THESIS WAIKATO U HAM Hansen J., 1997, P EUROSPEECH 97 RHOD, V4, P1743 HARNAD S, 1987, GROUNDWORK COGNITION HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423 HESS W, 1996, COMPUTING PROSODY, P363 HIRSCHBERG J, 2003, P ISCA IEEE WORKSH S, P1 HUI L, 2006, P ICASSP TOUL FRANC, P1 Hyvarinen A, 2001, INDEPENDENT COMPONEN Inanoglu Z., 2005, P 10 INT C INT US IN, P251, DOI 10.1145/1040830.1040885 Joachims Thorsten, 1998, P 10 EUR C MACH LEAR, P137 Johnstone T., 2000, HDB EMOTIONS, V2nd, P220 Jolliffe I. T., 2002, PRINCIPAL COMPONENT, V2nd KHARAT GU, 2008, WSEAS T COMPUT, V7 KIESSLING A, 1997, BERICHTE INFORM Kim E., 2007, P INT C ADV INT MECH, P1 Kim J., 2005, P 9 EUR C SPEECH COM, P809 Kim KH, 2004, MED BIOL ENG COMPUT, V42, P419, DOI 10.1007/BF02344719 Kim S.-M., 2005, P INT JOINT C NAT LA, P61 Kockmann M, 2009, P INT BRIGHT, P348 KWON O.W., 2003, P 8 EUR C SPEECH COM, P125 LASKOWSKI K, 2009, P ICASSP IEEE TAIP T, P4765 Lee C, 2009, P INT, P320 Lee CM, 2005, IEEE T SPEECH AUDI P, V13, P293, DOI 10.1109/TSA.2004.838534 Lee CM, 2002, P INT C SPOK LANG PR, P873 Lee DD, 1999, NATURE, V401, P788 LEFTER I, 2010, P 11 INT C COMP SYST, P287, DOI 10.1145/1839379.1839430 Liscombe J., 2003, P EUR C SPEECH COMM, P725 Liscombe J, 2005, P INT, P1837 Liscombe J., 2005, P INT LISB PORT, P1845 Litman D., 2003, P ASRU VIRG ISL, P25 Liu H., 2003, P 8 INT C INT US INT, P125 Lizhong Wu, 1999, IEEE Transactions on Multimedia, V1, DOI 10.1109/6046.807953 LOVINS JB, 1968, MECH TRANSL, V11, P22 LUENGO I, 2009, P INT BRIGHT UK SEP, P332 Lugger M., 2008, SPEECH RECOGNITION I, P1 LUGGER M, 2006, P ICASSP TOUL FRANC, P1097 MAKHOUL J, 1975, P IEEE, V63, P561, DOI 10.1109/PROC.1975.9792 MARTIN JC, 2006, INT J HUM ROBOT, V3, P1 Martinez C.A., 2005, IEEE INT WORKSH ROB, P19 Matos S, 2006, IEEE T BIO-MED ENG, V53, P1078, DOI 10.1109/TBME.2006.873548 McGilloway S., 2000, P ISCA WORKSH SPEECH, P207 MEYER D, 2002, REPORT SERIES ADAPTI, V78 MISSEN M, 2009, SCIENCE, V2009, P729 Morrison D, 2007, SPEECH COMMUN, V49, P98, DOI 10.1016/j.specom.2006.11.004 Morrison D, 2007, J NETW COMPUT APPL, V30, P1356, DOI 10.1016/j.jnca.2006.09.005 Morrison D., 2007, International Journal of Intelligent Systems Technologies and Applications, V2, DOI 10.1504/IJISTA.2007.012486 Mower E., 2009, P ACII AMST NETH, P662 Nefian A. 
V., 2002, P IEEE ICASSP 2002, P2013 Neiberg D, 2006, P INT C SPOK LANG PR, P809 Nickerson RS, 2000, PSYCHOL METHODS, V5, P241, DOI 10.1037//1082-989X.5.2.241 NOGUEIRAS A, 2001, P EUROSPEECH 2001, P2267 NOLL AM, 1967, J ACOUST SOC AM, V41, P293, DOI 10.1121/1.1910339 Nose T., 2007, P INTERSPEECH 2007 A, P2285 Noth E, 2002, SPEECH COMMUN, V36, P45, DOI 10.1016/S0167-6393(01)00025-5 Nwe TL, 2003, SPEECH COMMUN, V41, P603, DOI 10.1016/S01167-6393(03)00099-2 PACHET F, 2009, EURASIP J AUDIO SPEE Pal P., 2006, P ICASSP TOUL FRANC, P809 Pang B, 2002, PROCEEDINGS OF THE 2002 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, P79 Pantic M, 2003, P IEEE, V91, P1370, DOI 10.1109/JPROC.2003.817122 Perneger TV, 1998, BRIT MED J, V316, P1236 Petrushin V., 1999, P ART NEUR NETW ENG, P7 Picard RW, 2001, IEEE T PATTERN ANAL, V23, P1175, DOI 10.1109/34.954607 Planet S, 2009, P 10 ANN C INT SPEEC, P316 Polzehl T., 2009, P INT BRIGHT, P340 Polzin T., 2000, P ISCA WORKSH SPEECH, P201 Popescu A.-M., 2005, P C HUM LANG TECHN E, P339, DOI 10.3115/1220575.1220618 PORTER MF, 1980, PROGRAM-AUTOM LIBR, V14, P130, DOI 10.1108/eb046814 Potthast M., 2012, HLT NAACL 2004 ASS C, V3, P68 PUDIL P, 1994, PATTERN RECOGN LETT, V15, P1119, DOI 10.1016/0167-8655(94)90127-9 RABINER LR, 1977, IEEE T ACOUST SPEECH, V25, P24, DOI 10.1109/TASSP.1977.1162905 RAHURKAR MA, 2003, P 4 INT S IND COMP A, P1017 Rong J, 2007, P 6 INT C COMP INF S, P419 ROSCH E, 1975, J EXP PSYCHOL GEN, V104, P192, DOI 10.1037//0096-3445.104.3.192 ROZEBOOM WW, 1960, PSYCHOL BULL, V57, P416, DOI 10.1037/h0042040 Russell James A., 2003, VVolume 54, P329 SACHS JS, 1967, PERCEPT PSYCHOPHYS, V2, P437, DOI 10.3758/BF03208784 Said Christopher P, 2010, Front Syst Neurosci, V4, P6, DOI 10.3389/fnsys.2010.00006 Salzberg SL, 1997, DATA MIN KNOWL DISC, V1, P317, DOI 10.1023/A:1009752403260 Sato N., 2007, INFORM MEDIA TECHNOL, V2, P835 Scherer K. 
R., 2003, HDB AFFECTIVE SCI, P433 Schiel F, 1999, P 14 INT C PHON SCI, P607 Schroder M, 2007, LECT NOTES COMPUT SC, V4738, P440 SchrOder M., 2008, P 4 INT WORKSH HUM C Schroder M, 2006, P LREC 06 WORKSH COR, P88 SCHULLER B., 2005, P INT LISB PORT, P805 SCHULLER B, 2006, P INTERSPEECH 2006 I, P1818 Schuller B, 2009, P INT BRIGHT UK, P1999 Schuller B., 2004, P IEEE INT C AC SPEE, V1 Schuller Bjorn, 2009, Proceedings of the 2009 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU 2009), DOI 10.1109/ASRU.2009.5372886 SCHULLER B, 2007, P INT C MULT INT ACM, P30, DOI 10.1145/1322192.1322201 Schuller B., 2007, P INT, P2253 SCHULLER B, 2008, P 1 WORKSH CHILD COM Schuller B., 2009, P ICASSP TAIP TAIW, P4585 Schuller B., 2005, P IEEE INT C AC SPEE, VI, P325, DOI 10.1109/ICASSP.2005.1415116 SCHULLER B, 2006, P SPEECH PROS 2006 D SCHULLER B, 2008, P ICASSP LAS VEG NV, P4501 Schuller B, 2009, EURASIP J AUDIO SPEE, DOI 10.1155/2009/942617 Schuller B., 2009, STUDIES LANGUAGE COM, V97, P285 SCHULLER B, 2006, P INTERSPEECH 2006 I, P793 Schuller B., 2007, P ICASSP 2007 HON, P941 Schuller B, 2003, P ICASSP HONG KONG, P1 SCHULLER B, 2008, P 9 INT 2008 INC 12, P265 Schuller Bjorn, 2010, Proceedings 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2010, DOI 10.1109/ICASSP.2010.5495017 Schuller B, 2006, P INT C MULT EXP ICM, P5 Schuller B, 2009, IMAGE VISION COMPUT, V27, P1760, DOI 10.1016/j.imavis.2009.02.013 Schuller B., 2005, P IEEE INT C MULT EX, P864 SCHULLER B., 2010, P INTERSPEECH 2010 M, P2794 Schuller Bjorn, 2010, Proceedings 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2010, DOI 10.1109/ICASSP.2010.5495061 Schuller B, 2009, P INT C DOC AN REC B, P858 Schuller B, 2008, LECT NOTES ARTIF INT, V5078, P99, DOI 10.1007/978-3-540-69369-7_12 Schuller Bjorn, 2010, Proceedings 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2010, DOI 10.1109/ICASSP.2010.5494986 SCHULLER B, 2008, P ICME HANN GERM, P1333 Schuller Bjorn, 2009, P INTERSPEECH, P312 Schulte B, 2010, Proceedings of the 2010 40th European Microwave Conference (EuMC) Seppi D., 2010, P SPEECH PROS 2010 C SEPPI D, 2008, P 1 WORKSH CHILD COM Seppi D., 2008, P INT BRISB AUSTR, P601 Sethu V, 2007, Proceedings of the 2007 15th International Conference on Digital Signal Processing, P611 Shami M., 2007, LECT NOTES COMPUTER, V4441, P43, DOI DOI 10.1007/978-3-540-74122-05 Shaver P. 
R., 1992, EMOTION, P175 SOOD S, 2004, P 12 ANN ACM INT C M, P280, DOI 10.1145/1027527.1027591 Steidl S., 2009, P INT C AFF COMP INT, P690 STEIDL S, 2008, 11 INT C TEXT SPEECH, P525 Steidl S, 2010, EURASIP J AUDIO SPEE, DOI 10.1155/2010/783954 Steidl S., 2009, THESIS FAU ERLANGEN STEIDL S, 2004, 7 INT C TEXT SPEECH, P629 Takahashi K., 2004, P 2 INT C AUT ROB AG, P186 ten Bosch L, 2003, SPEECH COMMUN, V40, P213, DOI 10.1016/S0167-6393(02)00083-3 TOMLINSON MJ, 1996, P ICASSP ATL GA US, P812 Truong K.P., 2005, P EUR C SPEECH COMM, P485 Ververidis D., 2003, PCI 2003 9 PANH C IN, P560 Ververidis D., 2006, P EUR SIGN PROC C EU VIDRASCU L, 2007, P INT WORKSH PAR SPE, P11 Vinciarelli A, 2008, P 10 INT C MULT INT, P61, DOI DOI 10.1145/1452392.1452405 Vlasenko B., 2009, P INT 2009, P2039 Vlasenko B, 2007, P INT, P2249 VLASENKO B, 2008, P INT BRISB AUSTR, P805 Vlasenko B, 2007, LECT NOTES COMPUT SC, V4738, P139 Vogt T, 2008, LECT NOTES ARTIF INT, V5078, P188, DOI 10.1007/978-3-540-69369-7_21 VOGT T, 2009, P ACII AMST NETH, P670 Vogt T, 2009, P INT SPEECH COMM AS, P328 Vogt T., 2005, P MULT EXP AMST, P474 Wagner J., 2005, P IEEE INT C MULT EX, P940 Wagner J, 2007, LECT NOTES COMPUT SC, V4738, P114 WANG Y, 2005, P IEEE C AC SPEECH S, V2, P1125, DOI DOI 10.1109/ICASSP20051415607 WILSON T, 2004, P C AM ASS ART INT A WIMMER M, 2008, P 3 INT C COMP VIS T, P145 Witten I.H., 2005, DATA MINING PRACTICA Wollmer M, 2009, NEUROCOMPUTING, V73, P366, DOI 10.1016/j.neucom.2009.08.005 Wollmer M, 2010, IEEE J-STSP, V4, P867, DOI 10.1109/JSTSP.2010.2057200 WOLLMER M, 2009, P ICASSP TAIP TAIW, P3949 Wollmer M., 2008, P INT BRISB AUSTR, P597 WOLPERT DH, 1992, NEURAL NETWORKS, V5, P241, DOI 10.1016/S0893-6080(05)80023-1 WU CH, 2008, AFFECTIVE INFORM PRO, V2, P93 WU S, 2008, P INTERSPEECH BRISB, P638 Wu T., 2005, FDN DATA MINING KNOW, P319 Yi J, 2003, P 3 IEEE INT C DAT M, P427 You M, 2006, PROC IEEE INTL CONF, P1653 Young S., 2006, HTK BOOK HTK VERSION YU C, 2004, P ICSLP, V1, P1329 Zeng ZH, 2007, LECT NOTES COMPUT SC, V4451, P72 Zeng ZH, 2009, IEEE T PATTERN ANAL, V31, P39, DOI 10.1109/TPAMI.2008.52 Zeng ZH, 2007, IEEE T MULTIMEDIA, V9, P424, DOI 10.1109/TMM.2006.886310 Zhe X., 2002, P INT S COMM SYST NE, P164 Zwicker E, 1999, PSYCHOACOUSTICS FACT NR 251 TC 75 Z9 76 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV-DEC PY 2011 VL 53 IS 9-10 SI SI BP 1062 EP 1087 DI 10.1016/j.specom.2011.01.011 PG 26 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 810CW UT WOS:000294104000002 ER PT J AU Fernandez, R Picard, R AF Fernandez, Raul Picard, Rosalind TI Recognizing affect from speech prosody using hierarchical graphical models SO SPEECH COMMUNICATION LA English DT Article DE Affective speech; Prosodic modeling; Graphical models; Paralinguistics ID EMOTION AB In this work we develop and apply a class of hierarchical directed graphical models on the task of recognizing affective categories from prosody in both acted and natural speech. A strength of this new approach is the integration and summarization of information using both local (e.g., syllable level) and global prosodic phenomena (e.g., utterance level). In this framework speech is structurally modeled as a dynamically evolving hierarchical model in which levels of the hierarchy are determined by prosodic constituency and contain parameters that evolve according to dynamical systems. 
The acoustic parameters have been chosen to reflect four main components of speech thought to reflect paralinguistic and affect-specific information: intonation, loudness, rhythm and voice quality. The work is first evaluated on a database of acted emotions and compared to human perceptual recognition of five affective categories where it achieves rates within nearly 10% of human recognition accuracy despite only focusing on prosody. The model is then evaluated on two different corpora of fully spontaneous, affectively-colored, naturally occurring speech between people: Call Home English and BT Call Center. Here the ground truth labels are obtained from examining the agreement of 29 human coders labeling arousal and valence. The best discrimination performance on the natural spontaneous speech, using only the prosody features, obtains a 70% detection rate with 30% false alarms when detecting high arousal negative valence speech in call centers. (C) 2011 Elsevier B.V. All rights reserved. C1 [Fernandez, Raul] IBM Corp, Thomas J Watson Res Ctr, Yorktown Hts, NY 10598 USA. [Picard, Rosalind] MIT, Media Lab, Cambridge, MA 02139 USA. RP Fernandez, R (reprint author), IBM Corp, Thomas J Watson Res Ctr, Yorktown Hts, NY 10598 USA. EM fernanra@us.ibm.com; picard@media.mit.edu CR Barra-Chicote R., 2009, P INT, P336 BATLINER A, 2010, ADV HUMAN COMPUT INT, V15 BECKMAN ME, 1996, PROSODY PARSING SPEC, P17 BLACK M, 2010, P INT CHIB JAP, P2030 Burkhardt F., 2005, P INT, P1517 Campbell N., 2003, P 15 INT C PHON SCI, P2417 Chen DQ, 2009, PROCEEDINGS OF THE FIFTH INTERNATIONAL CONFERENCE ON IMAGE AND GRAPHICS (ICIG 2009), P912, DOI 10.1109/ICIG.2009.120 Cowie R, 2003, SPEECH COMMUN, V40, P5, DOI 10.1016/S0167-6393(02)00071-7 Cummins F, 1998, J PHONETICS, V26, P145, DOI 10.1006/jpho.1998.0070 DOUGLASCOWIE E, 2005, P INT LISB PORT, P88 DURSTON PJ, 2001, P EUR AALB DENM, P1323 Eyben F, 2010, J MULTIMODAL USER IN, V3, P7, DOI 10.1007/s12193-009-0032-6 Fernandez R., 2004, THESIS MIT Fernandez R., 2005, P INT 2005 LISB PORT, P473 Gobl C, 2003, SPEECH COMMUN, V40, P189, DOI 10.1016/S0167-6393(02)00082-1 Hayes Bruce, 1989, PHONETICS PHONOLOGY, V1, P201 HIRSCHBERG J, 1998, P AAAI SPRING S APPL, P52 KOLLER D, 2009, PRINCIPLES TECHNIQUE KOMPE R, 1994, P INT C AC SPEECH SI, V2, P173 LADD DR, 1996, CAMBRIDGE STUDIES LI, V79 Laver John, 1994, PRINCIPLES PHONETICS Lehiste I., 1970, SUPRASEGMENTALS Mao X, 2009, P WRI WORLD C COMP S, P225 Murphy K, 2007, BAYES NET TOOLBOX MA Noth E, 2000, IEEE T SPEECH AUDI P, V8, P519, DOI 10.1109/89.861370 Pereira C., 2000, P ISCA WORKSH SPEECH, P25 POLZIN T, 2000, THESIS CARNEGIE MELL SCHERER K, 2005, PROPOSAL EXEMPLARS W Schroder M, 2006, P LREC 06 WORKSH COR, P88 Schuller B., 2003, P IEEE INT C AC SPEE, V2, P1 SCHULLER B., 2010, P INTERSPEECH 2010 M, P2794 Schuller Bjorn, 2009, P INTERSPEECH, P312 SEBE N, 2006, P 18 INT C PATT REC, P1136 Selkirk E. O., 1984, PHONOLOGY SYNTAX REL ShattuckHufnagel S, 1996, J PSYCHOLINGUIST RES, V25, P193, DOI 10.1007/BF01708572 VLASENKO B, 2008, P INT BRISB AUSTR, P805 Wang M. 
Q., 1992, Computer Speech and Language, V6, DOI 10.1016/0885-2308(92)90025-Y WIGHTMAN CW, 1992, P ICASSP 92, V1, P221 Wightman CW, 1994, IEEE T SPEECH AUDI P, V2, P469, DOI 10.1109/89.326607 WIGHTMAN CW, 1991, P INT C ACOUST SPEEC, V1, P321 WIGHTMAN CW, 1992, J ACOUST SOC AM, V91, P1707, DOI 10.1121/1.402450 Yanushevskaya I., 2005, P INT LISB PORT, P1849 YANUSHEVSKAYA I, 2008, P SPEECH PROS 2008 I, P709 Zwicker E., 1999, SPRINGER SERIES INFO, V22 NR 44 TC 6 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV-DEC PY 2011 VL 53 IS 9-10 SI SI BP 1088 EP 1103 DI 10.1016/j.specom.2011.05.003 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 810CW UT WOS:000294104000003 ER PT J AU Worgan, SF Moore, RK AF Worgan, Simon F. Moore, Roger K. TI Towards the detection of social dominance in dialogue SO SPEECH COMMUNICATION LA English DT Article DE Social dominance; Rapport; Affordances; Social kinesthesis ID SPEECH-PERCEPTION; VOCAL COMMUNICATION; ACCOMMODATION; AFFORDANCES; RECOGNITION; PSYCHOLOGY AB When developing human computer conversational systems the complex co-acting processes of human human dialogue present a significant challenge. Para-linguistic features establish rapport between individuals and direct the conversation in directions that cannot be captured by semantic analysis alone. This paper attempts to address part of this challenge by considering the role of para-linguistic features in establishing and manipulating social dominance. We propose that social dominance can be understood as an interaction affordance, revealing action potentials for each signalling participant, and can be detected as a feature of rapport not of the individual. An analysis of F0 and long-term averaged spectra (LTAS) correlation values for conversational pairs reveals a high degree of accommodation. The nature of this accommodation demonstrates that others will adjust their speech to match the current dominant individual. We conclude by exploring the implications of these results on the role of rapport and outline potential advances for the detection of emotion in speech by encompassing the entirety of pleasure-arousal-dominance emotional space. (C) 2011 Elsevier B.V. All rights reserved. C1 [Worgan, Simon F.; Moore, Roger K.] Univ Sheffield, Speech & Hearing Res Grp, Dept Comp Sci, Sheffield S1 4DP, S Yorkshire, England. RP Worgan, SF (reprint author), Univ Sheffield, Speech & Hearing Res Grp, Dept Comp Sci, Sheffield S1 4DP, S Yorkshire, England. EM S.Worgan@dcs.shef.ac.uk CR ANDERSON AH, 1991, LANG SPEECH, V34, P351 [Anonymous], SNACK SOUND TOOLKIT Bock J. K., 1990, COGNITION, V35, P1 Burkhardt F., 2005, P INT, P1517 CATIZONE R, 2010, ARTIFICIAL COMPANION, P157 Crystal D, 1980, 1 DICT LINGUISTICS P DAWKINS MS, 1991, ANIM BEHAV, V41, P865, DOI 10.1016/S0003-3472(05)80353-7 FOWLER CA, 1986, J PHONETICS, V14, P3 Gibson J. J., 1986, ECOLOGICAL APPROACH Giles Howard, 1991, LANGUAGE CONTEXTS CO Good JMM, 2007, THEOR PSYCHOL, V17, P265, DOI 10.1177/0959354307075046 Gratch J., 2006, 6 INT C INT VIRT AG Gregory SW, 1997, J NONVERBAL BEHAV, V21, P23 Gregory SW, 2002, SOC PSYCHOL QUART, V65, P298 Gregory SW, 2001, LANG COMMUN, V21, P37, DOI 10.1016/S0271-5309(00)00011-2 Hockett C. 
F., 1955, MANUAL PHONOLOGY Hodges BH, 2006, PERS SOC PSYCHOL REV, V10, P2, DOI 10.1207/s15327957pspr1001_1 JUNQUA JC, 1993, J ACOUST SOC AM, V93, P510, DOI 10.1121/1.405631 KANG S, 2009, 8 INT C IND COMP AN Lauria S, 2007, CIRC SYST SIGNAL PR, V26, P513, DOI 10.1007/s00034-007-4005-9 Levelt W. J. M., 1983, J SEMANT, V2, P205 LINDBLOM B, 1990, NATO ADV SCI I D-BEH, V55, P403 Lu YY, 2009, SPEECH COMMUN, V51, P1253, DOI 10.1016/j.specom.2009.07.002 MEHRABIAN A, 1996, BEHAV SCI, V14, P261 *MIT, 2010, MIT AM ENGL MAP TASK Moore RK, 2007, IEEE T COMPUT, V56, P1176, DOI 10.1109/TC.2007.1080 Nearey TM, 1997, J ACOUST SOC AM, V101, P3241, DOI 10.1121/1.418290 Nittrouer S, 2006, J ACOUST SOC AM, V120, P1799, DOI 10.1121/1.2335273 OHALA JJ, 1986, J PHONETICS, V14, P75 OHALA JJ, 1982, J ACOUST SOC AM, V72, P66 Oudeyer PY, 2003, INT J HUMAN COMPUTER, V59, P157, DOI DOI 10.1016/S1071-581(02)00141-6 Parrill F, 2006, J NONVERBAL BEHAV, V30, P157, DOI 10.1007/s10919-006-0014-2 Pickering MJ, 2004, BEHAV BRAIN SCI, V27, P169 Port RF, 2005, LANGUAGE, V81, P927, DOI 10.1353/lan.2005.0195 PORTER RJ, 1986, J PHONETICS, V14, P83 REMEZ RE, 1981, SCIENCE, V212, P947, DOI 10.1126/science.7233191 Schafer AJ, 2000, J PSYCHOLINGUIST RES, V29, P169, DOI 10.1023/A:1005192911512 Scherer K. R., 2003, HDB AFFECTIVE SCI, P433 Scherer KR, 2003, SPEECH COMMUN, V40, P227, DOI 10.1016/S0167-6393(02)00084-5 Thorisson KR, 1999, APPL ARTIF INTELL, V13, P449, DOI 10.1080/088395199117342 Vogt T, 2008, LECT NOTES ARTIF INT, V5078, P188, DOI 10.1007/978-3-540-69369-7_21 Ward N., 2004, International Journal of Speech Technology, V7, DOI 10.1023/B:IJST.0000037070.31146.f9 WARD N, 2009, INT 2009 BRIGHT UK, P2431 Ward N, 2000, J PRAGMATICS, V32, P1177, DOI 10.1016/S0378-2166(99)00109-5 WARD NG, 2005, UTEPCS0523 WORGAN SF, 2010, THESIS U SOUTHAMPTON Worgan SF, 2010, ECOL PSYCHOL, V22, P327, DOI 10.1080/10407413.2010.517125 NR 47 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV-DEC PY 2011 VL 53 IS 9-10 SI SI BP 1104 EP 1114 DI 10.1016/j.specom.2010.12.004 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 810CW UT WOS:000294104000004 ER PT J AU Forbes-Riley, K Litman, D AF Forbes-Riley, Kate Litman, Diane TI Benefits and challenges of real-time uncertainty detection and adaptation in a spoken dialogue computer tutor SO SPEECH COMMUNICATION LA English DT Article DE Fully automated spoken dialogue tutoring system; Automatic affect detection and adaptation; Accompanying manual annotation; Controlled experimental evaluation; Error analysis ID EMOTIONS AB We evaluate the performance of a spoken dialogue system that provides substantive dynamic responses to automatically detected user affective states. We then present a detailed system error analysis that reveals challenges for real-time affect detection and adaptation. This research is situated in the tutoring domain, where the user is a student and the spoken dialogue system is a tutor. Our adaptive system detects uncertainty in each student turn via a model that combines a machine learning approach with hedging phrase heuristics; the learned model uses acoustic-prosodic and lexical features extracted from the speech signal, as well as dialogue features. The adaptive system varies its content based on the automatic uncertainty and correctness labels for each turn. 
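A toy illustration of how a turn-level adaptation policy keyed on the automatic correctness and uncertainty labels might be organised; the label names and tutor moves below are invented placeholders, not the system's actual dialogue content.

# Hypothetical policy table: one tutor move per (correctness, uncertainty) pair.
ADAPTATIONS = {
    ("correct",   "certain"):   "move_on",
    ("correct",   "uncertain"): "affirm_then_reexplain",     # shore up fragile knowledge
    ("incorrect", "certain"):   "remediate_misconception",
    ("incorrect", "uncertain"): "remediate_with_hint",
}

def select_tutor_move(correctness: str, uncertainty: str) -> str:
    return ADAPTATIONS[(correctness, uncertainty)]

print(select_tutor_move("correct", "uncertain"))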
Our controlled experimental evaluation shows that the adaptive system yields higher global performance than two non-adaptive control systems, but the difference is only significant for a subset of students. Our system error analysis indicates that noisy affect labeling is a major performance bottleneck, yielding fewer than expected adaptations thus lower than expected performance. However, the percentage of received adaptation correlates with higher performance over all students. Moreover, when uncertainty is accurately recognized and adapted to, local performance is significantly improved. (C) 2011 Elsevier B.V. All rights reserved. C1 [Forbes-Riley, Kate; Litman, Diane] Univ Pittsburgh, Ctr Learning Res & Dev, Dept Comp Sci, Pittsburgh, PA 15260 USA. RP Forbes-Riley, K (reprint author), Univ Pittsburgh, Ctr Learning Res & Dev, Dept Comp Sci, Pittsburgh, PA 15260 USA. EM forbesk@pitt.edu FU National Science Foundation (NSF) [0914615, 0631930] FX This work is funded by National Science Foundation (NSF) awards #0914615 and #0631930. We thank Pam Jordan and the ITSPOKE Group for their help with this research. CR AI H, 2006, P INT PITTSB PA, P797 AIST G, 2002, P INT TUT SYST WORKS Ang J, 2002, P INT C SPOK LANG PR, P2037 Batliner A, 2003, SPEECH COMMUN, V40, P117, DOI 10.1016/S0167-6393(02)00079-1 Batliner A, 2008, USER MODEL USER-ADAP, V18, P175, DOI 10.1007/s11257-007-9039-4 Bhatt K, 2004, PROCEEDINGS OF THE TWENTY-SIXTH ANNUAL CONFERENCE OF THE COGNITIVE SCIENCE SOCIETY, P114 BLACK A, 1997, 83 U ED HUM COMM RES Burleson W, 2007, IEEE INTELL SYST, V22, P62, DOI 10.1109/MIS.2007.69 Chawla NV, 2002, J ARTIF INTELL RES, V16, P321 Conati C, 2009, USER MODEL USER-ADAP, V19, P267, DOI 10.1007/s11257-009-9062-8 CONATI C, 2004, P INT TUT SYST C ITS, P55 D'Mello S., 2007, P 29 ANN M COGN SCI, P203 De Vicente A., 2002, P 6 INT C INT TUT SY, P933 Devillers L., 2005, P INTERSPEECH LISB P DEVILLERS L, 2003, P IEEE INT C MULT EX D'Mello SK, 2008, USER MODEL USER-ADAP, V18, P45, DOI 10.1007/s11257-007-9037-6 Ekman P., 1978, FACIAL ACTION CODING FORBESRILEY K, 2006, P FLOR ART INT RES S, P509 Forbes-Riley K., 2009, P 14 INT C ART INT E FORBESRILEY K, 2005, P 9 EUR C SPEECH COM FORBESRILEY K, 2010, P INT INT TUT SYST C Forbes-Riley K., 2004, P HUM LANG TECHN C N, P201 Forbes-Riley K., 2010, COMPUTER SPEECH LANG, V25, P105 FORBESRILEY K, 2007, P AFF COMP INT INT A, P678 Forbes-Riley K., 2008, P 9 INT C INT TUT SY Gholson J., 2004, J ED MEDIA, V29, P241, DOI DOI 10.1080/1358165042000283101 Gratch J., 2003, BROWN J WORLD AFFAIR, VX, P63 Hake R.R., 2002, P UNESCO ASPEN WORKS HALL L, 2004, P INT TUT SYST C ITS, P604 Hall M., 2009, SIGKDD EXPLORATIONS, V11, P10, DOI DOI 10.1145/1656274.1656278 HUANG X, 1993, COMPUT SPEECH LA FEB, P137 JORDAN PW, 2007, P AIED 2007, P43 Klein J, 2002, INTERACT COMPUT, V14, P119 Kort B., 2001, Proceedings IEEE International Conference on Advanced Learning Technologies, DOI 10.1109/ICALT.2001.943850 Lee CM, 2005, IEEE T SPEECH AUDI P, V13, P293, DOI 10.1109/TSA.2004.838534 Lee CM, 2002, P INT C SPOK LANG PR, P873 LITMAN D, 2009, COGN MET ED SYST AAA Litman D., 2005, INT J ARTIFICIAL INT, V16, P145 LITMAN D, 2004, P SIGDIAL WORKSH DIS, P144 LITMAN D, 2009, P INT BRIGHT UK Litman D. J., 2004, P 42 ANN M ASS COMP, P352 Litman DJ, 2006, SPEECH COMMUN, V48, P559, DOI 10.1016/j.specom.2005.09.008 LIU K, 2005, CHI WORKSH HCI CHALL Mairesse F., 2008, P 46 ANN M ASS COMP McQuiggan S. 
W., 2008, P 9 INT INT TUT SYST McQuiggan SW, 2008, USER MODEL USER-ADAP, V18, P81, DOI 10.1007/s11257-007-9040-y OUDEYER P, 2002, P I INT C PROS AIX E, P551 Paiva A., 2007, AFFECTIVE COMPUTING Pon-Barry Heather, 2006, INT J ARTIFICIAL INT, V16, P171 Porayska-Pomsta K, 2008, USER MODEL USER-ADAP, V18, P125, DOI 10.1007/s11257-007-9041-x Prendinger H, 2005, APPL ARTIF INTELL, V19, P267, DOI 10.1080/08839510590910174 Schuller B., 2010, P 11 ANN C INT SPEEC Schuller B., 2009, P 10 ANN C INT SPEEC Shafran I., 2003, P IEEE AUT SPEECH RE, P31 Talkin D., 1996, GET F0 ONLINE DOCUME Talkin D., 1995, SPEECH CODING SYNTHE TSUKAHARA W, 2001, P SIG CHI HUM FACT C VanLehn K, 2003, COGNITION INSTRUCT, V21, P209, DOI 10.1207/S1532690XCI2103_01 VANLEHN K, 2002, P INT TUT SYST Wang N, 2005, P 2005 INT C INT US, P12, DOI 10.1145/1040830.1040845 Wang N, 2008, INT J HUM-COMPUT ST, V66, P98, DOI 10.1016/j.ijhcs.2007.09.003 NR 61 TC 16 Z9 16 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV-DEC PY 2011 VL 53 IS 9-10 SI SI BP 1115 EP 1136 DI 10.1016/j.specom.2011.02.006 PG 22 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 810CW UT WOS:000294104000005 ER PT J AU Acosta, JC Ward, NG AF Acosta, Jaime C. Ward, Nigel G. TI Achieving rapport with turn-by-turn, user-responsive emotional coloring SO SPEECH COMMUNICATION LA English DT Article DE Affective computing; Dimensional emotions; Prosody; Persuasion; Prediction; Immediate response patterns; Responsiveness; User modeling; Social interaction in dialog; Interpersonal adaptation ID HUMAN-COMPUTER INTERACTION; AGENTS; ALIGNMENT AB People in dialog use a rich set of nonverbal behaviors, including variations in the prosody of their utterances. Such behaviors, often emotion-related, call for appropriate responses, but today's spoken dialog systems lack the ability to do this. Recent work has shown how to recognize user emotions from prosody and how to express system-side emotions with prosody, but demonstrations of how to combine these functions to improve the user experience have been lacking. Working with a corpus of conversations with students about graduate school, we analyzed the emotional states of the interlocutors, utterance-by-utterance, using three dimensions: activation, evaluation, and power. We found that the emotional coloring of the speaker's utterance could be largely predicted from the emotion shown by her interlocutor in the immediately previous utterance. This finding enabled us to build Gracie, the first spoken dialog system that recognizes a user's emotional state from his or her speech and gives a response with appropriate emotional coloring. Evaluation with 36 subjects showed that they felt significantly more rapport with Gracie than with either of two controls. This shows that dialog systems can tap into this important level of interpersonal interaction using today's technology. (C) 2010 Elsevier B.V. All rights reserved. C1 [Acosta, Jaime C.; Ward, Nigel G.] Univ Texas El Paso, El Paso, TX 79968 USA. RP Ward, NG (reprint author), Univ Texas El Paso, 500 W Univ Ave, El Paso, TX 79968 USA. 
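The turn-by-turn prediction described above, in which the emotional colouring of an utterance is largely predictable from the interlocutor's immediately previous utterance, can be sketched as a multi-output regression over the activation-evaluation-power dimensions. The data below are synthetic and the 0.6 "mirroring" coefficient is an assumption for illustration, not a corpus estimate.

import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in: predict the (activation, evaluation, power) colouring the
# system's next utterance should carry from the colouring of the user's previous one.
rng = np.random.default_rng(1)
prev_turn = rng.uniform(-1, 1, size=(200, 3))
next_turn = 0.6 * prev_turn + 0.1 * rng.normal(size=(200, 3))

model = LinearRegression().fit(prev_turn, next_turn)

user_turn = np.array([[0.8, -0.4, 0.2]])           # activated, mildly negative user
print("target response colouring (a, e, p):", model.predict(user_turn)[0].round(2))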
EM jcacosta@miners.utep.edu; nigelward@acm.org FU ACTEDS Scholarship; NSF [IIS-0415150, IIS-0914868]; US Army FX We thank Anais Rivera and Sue Walker for the persuasive dialog corpus, Rafael Escalante-Ruiz for the persuasive letter generator, Alejandro Vega, Shreyas Karkhedkar, and Tam Acosta for their help during testing, Ben Walker and Josh McCartney for running subjects, and David Novick, Steven Crites and Anton Batliner for discussion and comments. This work was supported in part by an ACTEDS Scholarship, by NSF Awards IIS-0415150 and IIS-0914868, and by the US Army Research, Development and Engineering Command via a contract to the USC Institute for Creative Technologies. CR Acosta J. C., 2009, THESIS U TEXAS EL PA Batliner A, 2011, COMPUT SPEECH LANG, V25, P4, DOI 10.1016/j.csl.2009.12.003 BATLINER A, 2009, ADV HUMAN COMPUTER I Beale R, 2009, INT J HUM-COMPUT ST, V67, P755, DOI 10.1016/j.ijhcs.2009.05.001 Becker C, 2004, LECT NOTES COMPUT SC, V3068, P154 Berry DC, 2005, INT J HUM-COMPUT ST, V63, P304, DOI 10.1016/j.ijhcs.2005.03.006 BEVACQUA E, 2008, P 8 INT C INT VIRT A Branigan HP, 2010, J PRAGMATICS, V42, P2355, DOI 10.1016/j.pragma.2009.12.012 BRAVE S, 2007, HUMAN COMPUTER INTER, P77 Burgoon J., 1995, INTERPERSONAL ADAPTA CACIOPPO JT, 1982, ATTITUDES LANGUAGE V, P189 Calvo RA, 2010, IEEE T AFFECT COMPUT, V1, P18, DOI 10.1109/T-AFFC.2010.1 Cappella J. N., 1991, COMMUN THEORY, V1, P4, DOI 10.1111/j.1468-2885.1991.tb00002.x Cassell J, 2003, USER MODEL USER-ADAP, V13, P89, DOI 10.1023/A:1024026532471 Chartrand TL, 1999, J PERS SOC PSYCHOL, V76, P893, DOI 10.1037//0022-3514.76.6.893 Cowie R, 2001, IEEE SIGNAL PROC MAG, V18, P32, DOI 10.1109/79.911197 Devillers L, 2005, NEURAL NETWORKS, V18, P407, DOI 10.1016/j.neunet.2005.03.007 D'Mello S, 2009, LECT NOTES COMPUT SC, V5612, P595, DOI 10.1007/978-3-642-02580-8_65 Fogg B. J., 2003, PERSUASIVE TECHNOLOG Forbes-Riley K, 2011, COMPUT SPEECH LANG, V25, P105, DOI 10.1016/j.csl.2009.12.002 Frank E, 1998, MACH LEARN, V32, P63, DOI 10.1023/A:1007421302149 Grahe JE, 1999, J NONVERBAL BEHAV, V23, P253, DOI 10.1023/A:1021698725361 Gratch J., 2006, 6 INT C INT VIRT AG, P14 Gratch J, 2007, LECT NOTES ARTIF INT, V4722, P125 HOLTGRAVES TM, 2008, LANGUAGE MEANING SOC Huggins-Daines D., 2006, IEEE INT C AC SPEECH Klein J, 2002, INTERACT COMPUT, V14, P119 Komatani K, 2005, USER MODEL USER-ADAP, V15, P169, DOI 10.1007/s11257-004-5659-0 Kopp S, 2010, SPEECH COMMUN, V52, P587, DOI 10.1016/j.specom.2010.02.007 Lee C.-C., 2009, INTERSPEECH, P320 Mazzotta I, 2009, LECT NOTES ARTIF INT, V5773, P527 Pittermann Johannes, 2010, International Journal of Speech Technology, V13, DOI 10.1007/s10772-010-9068-y PONBARRY H, 2008, INTERSPEECH, V12, P74 Saerbeck M, 2010, CHI2010: PROCEEDINGS OF THE 28TH ANNUAL CHI CONFERENCE ON HUMAN FACTORS IN COMPUTING SYSTEMS, VOLS 1-4, P1613 Scherer KR, 2003, SPEECH COMMUN, V40, P227, DOI 10.1016/S0167-6393(02)00084-5 SCHRODER M, 2004, SPEECH EMOTION RES A Schroder M., 2003, International Journal of Speech Technology, V6, DOI 10.1023/A:1025708916924 SCHROEDER M, 2009, EMOTION MARKUP LANGU Schuller Bjorn, 2009, Proceedings of the 2009 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU 2009), DOI 10.1109/ASRU.2009.5372886 Shepard C.
A., 2001, NEW HDB LANGUAGE SOC, P33 Suzuki N, 2007, CONNECT SCI, V19, P131, DOI 10.1080/09540090701369125 WARD N, 2009, INT 2009 BRIGHT UK, P2431 Ward N, 2003, INT J HUM-COMPUT ST, V59, P603, DOI 10.1016/S1071-5819(03)00085-5 Ward NG, 2010, J CROSS CULT PSYCHOL, V41, P270, DOI 10.1177/0022022109354644 Winton Ward M., 1990, HDB LANGUAGE SOCIAL, P33 Witten I. H., 2002, ACM SIGMOD RECORD, V31, P76, DOI 10.1145/507338.507355 Witten I.H., 2005, DATA MINING PRACTICA NR 47 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV-DEC PY 2011 VL 53 IS 9-10 SI SI BP 1137 EP 1148 DI 10.1016/j.specom.2010.11.006 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 810CW UT WOS:000294104000006 ER PT J AU Mahdhaoui, A Chetouani, M AF Mahdhaoui, Ammar Chetouani, Mohamed TI Supervised and semi-supervised infant-directed speech classification for parent-infant interaction analysis SO SPEECH COMMUNICATION LA English DT Article DE Infant-directed speech; Emotion recognition; Face-to-face interaction; Data fusion; Semi-supervised learning ID MOTHERESE; PREFERENCE; LANGUAGE AB This paper describes the development of an infant-directed speech discrimination system for parent infant interaction analysis. Different feature sets for emotion recognition were investigated using two classification techniques: supervised and semi-supervised. The classification experiments were carried out with short pre-segmented adult-directed speech and infant-directed speech segments extracted from real-life family home movies (with durations typically between 0.5 s and 4 s). The experimental results show that in the case of supervised learning, spectral features play a major role in the infant-directed speech discrimination. However, a major difficulty of using natural corpora is that the annotation process is time-consuming, and the expression of emotion is much more complex than in acted speech. Furthermore, interlabeler agreement and annotation label confidences are important issues to address. To overcome these problems, we propose a new semi-supervised approach based on the standard co-training algorithm exploiting labelled and unlabelled data. It offers a framework to take advantage of supervised classifiers trained by different features. The proposed dynamic weighted co-training approach combines various features and classifiers usually used in emotion recognition in order to learn from different views. Our experiments demonstrate the validity and effectiveness of this method for a real-life corpus such as home movies. (C) 2011 Elsevier B.V. All rights reserved. C1 [Mahdhaoui, Ammar] Univ Paris 06, F-75005 Paris, France. CNRS, ISIR, Inst Syst Intelligents & Robot, UMR 7222, F-75005 Paris, France. RP Mahdhaoui, A (reprint author), Univ Paris 06, F-75005 Paris, France. EM Ammar.Mahdhaoui@isir.upmc.fr; Mohamed.Chetouani@upmc.fr RI CHETOUANI, Mohamed/F-5854-2010 FU La Fondation de France FX The authors would like to thank Filippo Muratori and Fabio Apicella from Scientific Institute Stella Mans of University of Pisa, Italy, who have provided data; family home movies. 
We would also like to extend our thanks to David Cohen and his staff, Raquel Sofia Cassel and Catherine Saint-Georges, from the Department of Child and Adolescent Psychiatry, AP-HP, Groupe Hospitalier Pitie-Salpetriere, Universite Pierre et Marie Curie, Paris France, for their collaboration and the manual database annotation and data analysis. Finally, this work has been partially funded by La Fondation de France. CR Association A. P., 1994, DIAGNOSTIC STAT MANU, VIV Bishop C. M., 1995, NEURAL NETWORKS PATT Blum A., 1998, C COMP LEARN THEOR Boersma P., 2005, PRAAT DOING PHONETIC BREFELD U, 2006, EFFICIENT COREGULARI BURNHAM C, 2002, WHATS NEW PUSSYCAT T, V296, P1435 CARLSON A, 2009, THESIS CARNEGIE MELL Chang C.-C., 2001, LIBSVM LIB SUPPORT V Chetouani M, 2009, COGN COMPUT, V1, P194, DOI 10.1007/s12559-009-9016-9 Cohen J., 1960, EDUC PSYCHOL MEAS, V20, P3746 COOPER RP, 1990, CHILD DEV, V61, P1584, DOI 10.1111/j.1467-8624.1990.tb02885.x Cooper RP, 1997, INFANT BEHAV DEV, V20, P477, DOI 10.1016/S0163-6383(97)90037-0 Duda R., 2000, PATTERN CLASSIFICATI EIBE F, 1999, MORGAN KAUFMANN SERI Esposito A, 2005, LECT NOTES ARTIF INT, V3445, P1 FERNALD A, 1985, INFANT BEHAV DEV, V8, P181, DOI 10.1016/S0163-6383(85)80005-9 FERNALD A, 1987, INFANT BEHAV DEV, V10, P279, DOI 10.1016/0163-6383(87)90017-8 Goldman S., 2000, INT C MACH LEARN, P327 GRIESER DL, 1988, DEV PSYCHOL, V24, P14, DOI 10.1037/0012-1649.24.1.14 INOUEA T, 2011, NEUROSCI RES, P1 KESSOUS L, 2007, PARALING Kuhl PK, 2004, NAT REV NEUROSCI, V5, P831, DOI 10.1038/nrn1533 LAZNIK M, 2005, COMMENCEMENT TAIT VO, P81 MAESTRO S, 2005, CHILD PSYCHIAT HUMAN, V35, P83 MANDHAOUI A, 2008, INT C PATT REC ICPR, P8 MANDHAOUI A, 2011, INT J METH PSYCH RES, V20, pE6 MANDHAOUI A, 2009, MULTIMODAL SIGNALS C, P248 Muratori F., 2007, INT J DIALOGICAL SCI, V2, P93 Muslea I., 2000, Proceedings Seventeenth National Conference on Artificial Intelligence (AAAI-2000). Twelfth Innovative Applications of Artificial Intelligence Conference (IAAI-2000) NIGAM K, 2000, INT C MACH LEARN Nigam K., 2000, 9 INT C INF KNOWL MA, P86 Platt J. C., 1999, ADV LARGE MARGIN CLA REYNOLDS DA, 1995, SPEECH COMMUN, V17, P91, DOI 10.1016/0167-6393(95)00009-D Saint-Georges C, 2010, RES AUTISM SPECT DIS, V4, P355, DOI 10.1016/j.rasd.2009.10.017 Schapire RE, 1999, MACH LEARN, V37, P297, DOI 10.1023/A:1007614523901 Schuller B., 2007, INTERSPEECH, P2253 Shami M, 2007, SPEECH COMMUN, V49, P201, DOI 10.1016/j.specom.2007.01.006 SHAMI M, 2005, IEEE MULTIMEDIA EXPO Slaney M, 2003, SPEECH COMMUN, V39, P367, DOI 10.1016/S0167-6393(02)00049-3 Truong KP, 2007, SPEECH COMMUN, V49, P144, DOI 10.1016/j.specom.2007.01.001 Vapnik V., 1995, NATURE STAT LEARNING Vapnik V, 1998, STAT LEARNING THEORY Zhang QJ, 2010, PATTERN RECOGN, V43, P3113, DOI 10.1016/j.patcog.2010.04.004 Zhou D., 2005, ADV NEURAL INFORM PR, V17, P1633 Zhu X., 2003, ICML, P912 Zwicker E, 1999, PSYCHOACOUSTICS FACT Zwicker E., 1961, ACOUSTICAL SOC AM, V33, P248 NR 47 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
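The Mahdhaoui and Chetouani record above extends the standard co-training algorithm so that classifiers trained on different feature views label data for each other, with the views weighted dynamically. The sketch below shows only plain two-view co-training on synthetic data; the dynamic weighting scheme, the real spectral and prosodic feature sets, and the confidence handling of the paper are not reproduced, and scikit-learn classifiers stand in for theirs.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_lab, n_unlab = 40, 400
# Two synthetic feature "views" (e.g., spectral vs. prosodic descriptors).
y = rng.integers(0, 2, n_lab + n_unlab)          # 1 = infant-directed, 0 = adult-directed
view1 = rng.normal(y[:, None], 1.0, (n_lab + n_unlab, 5))
view2 = rng.normal(y[:, None], 1.5, (n_lab + n_unlab, 3))

labeled = list(range(n_lab))
unlabeled = list(range(n_lab, n_lab + n_unlab))
labels = {i: y[i] for i in labeled}              # only these labels are "known"

for it in range(5):                              # a few co-training rounds
    idx = np.array(labeled)
    targets = [labels[i] for i in labeled]
    c1 = LogisticRegression().fit(view1[idx], targets)
    c2 = LogisticRegression().fit(view2[idx], targets)
    newly = []
    for i in unlabeled:
        p1 = c1.predict_proba(view1[[i]])[0]
        p2 = c2.predict_proba(view2[[i]])[0]
        # Each view teaches the other when it is confident enough.
        if p1.max() > 0.95 or p2.max() > 0.95:
            labels[i] = int(np.argmax(p1 + p2))  # simple combination of the two views
            newly.append(i)
    labeled += newly
    unlabeled = [i for i in unlabeled if i not in labels]
    if not newly:
        break

print(f"{len(labeled)} segments labeled after co-training")
```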
PD NOV-DEC PY 2011 VL 53 IS 9-10 SI SI BP 1149 EP 1161 DI 10.1016/j.specom.2011.05.005 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 810CW UT WOS:000294104000007 ER PT J AU Lee, CC Mower, E Busso, C Lee, S Narayanan, S AF Lee, Chi-Chun Mower, Emily Busso, Carlos Lee, Sungbok Narayanan, Shrikanth TI Emotion recognition using a hierarchical binary decision tree approach SO SPEECH COMMUNICATION LA English DT Article DE Emotion recognition; Hierarchical structure; Support Vector Machine; Bayesian Logistic Regression AB Automated emotion state tracking is a crucial element in the computational study of human communication behaviors. It is important to design robust and reliable emotion recognition systems that are suitable for real-world applications both to enhance analytical abilities to support human decision making and to design human machine interfaces that facilitate efficient communication. We introduce a hierarchical computational structure to recognize emotions. The proposed structure maps an input speech utterance into one of the multiple emotion classes through subsequent layers of binary classifications. The key idea is that the levels in the tree are designed to solve the easiest classification tasks first, allowing us to mitigate error propagation. We evaluated the classification framework on two different emotional databases using acoustic features, the AIBO database and the USC IEMOCAP database. In the case of the AIBO database, we obtain a balanced recall on each of the individual emotion classes using this hierarchical structure. The performance measure of the average unweighted recall on the evaluation data set improves by 3.37% absolute (8.82% relative) over a Support Vector Machine baseline model. In the USC IEMOCAP database, we obtain an absolute improvement of 7.44% (14.58%) over a baseline Support Vector Machine modeling. The results demonstrate that the presented hierarchical approach is effective for classifying emotional utterances in multiple database contexts. (C) 2011 Elsevier B.V. All rights reserved. C1 [Lee, Chi-Chun; Mower, Emily; Lee, Sungbok; Narayanan, Shrikanth] Univ So Calif, Signal Anal & Interpretat Lab SAIL, Dept Elect Engn, Los Angeles, CA 90089 USA. [Busso, Carlos] Univ Texas Dallas, Dept Elect Engn, Dallas, TX 75080 USA. RP Lee, CC (reprint author), Univ So Calif, Signal Anal & Interpretat Lab SAIL, Dept Elect Engn, Los Angeles, CA 90089 USA. 
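The Lee et al. record above maps an utterance to an emotion class through layers of binary classifiers arranged so that the easiest discriminations come first, limiting error propagation. A minimal sketch of such a cascade follows, with an invented class grouping and scikit-learn SVMs on placeholder features standing in for the paper's classifiers and databases.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
CLASSES = ["angry", "sad", "happy", "neutral"]
X = rng.normal(size=(200, 10))                  # placeholder acoustic features
y = rng.choice(CLASSES, size=200)

def fit_binary(X, y, group_a, group_b):
    """Train one node of the tree: group_a vs group_b."""
    mask = np.isin(y, group_a + group_b)
    target = np.isin(y[mask], group_a).astype(int)
    return SVC().fit(X[mask], target)

# Hypothetical layout: first separate high-arousal from low-arousal classes
# (assumed here to be the "easiest" split), then resolve each side.
top = fit_binary(X, y, ["angry", "happy"], ["sad", "neutral"])
left = fit_binary(X, y, ["angry"], ["happy"])
right = fit_binary(X, y, ["sad"], ["neutral"])

def classify(x):
    x = x.reshape(1, -1)
    if top.predict(x)[0] == 1:                  # high-arousal branch
        return "angry" if left.predict(x)[0] == 1 else "happy"
    return "sad" if right.predict(x)[0] == 1 else "neutral"

print(classify(X[0]))
```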
EM chiclee@usc.edu RI Narayanan, Shrikanth/D-5676-2012 CR Agresti A., 1990, WILEY SERIES PROBABI ALBORNOZ EM, 2011, COMPUT SPEE IN PRESS, V25 AMIR N, 2010, SPEECH PROSODY BLACK M, 2010, P INT MAK JAP 2010 Black M., 2008, P WORKSH CHILD COMP Brave S, 2005, INT J HUM-COMPUT ST, V62, P161, DOI 10.1016/j.ijhcs.2004.11.002 BRODY L, 2008, GENDER EMOTION CONTE Busso C, 2009, IEEE T AUDIO SPEECH, V17, P582, DOI 10.1109/TASL.2008.2009578 Busso C, 2008, LANG RESOUR EVAL, V42, P335, DOI 10.1007/s10579-008-9076-6 Chawla NV, 2002, J ARTIF INTELL RES, V16, P321 Eyben F, 2009, SPEECH MUSIC INTERPR Genkin A, 2007, TECHNOMETRICS, V49, P291, DOI 10.1198/004017007000000245 HASSAN A, 2010, P INT, P2354 HERM O, 2008, P INT Hosmer DW, 2000, WILEY SERIES PROBABI, V2nd Kanda T, 2004, HUM-COMPUT INTERACT, V19, P61, DOI 10.1207/s15327051hci1901&2_4 Kapoor A., 2005, P 13 ANN ACM INT C M, P677, DOI DOI 10.1145/1101149.1101300 LAZARUS R, 2001, EMOTION THEOR METHOD, P37 LEE CC, 2010, P INT MAK JAP Lee C.-C., 2009, P INT Lee CM, 2005, IEEE T SPEECH AUDI P, V13, P293, DOI 10.1109/TSA.2004.838534 Mao QR, 2010, COMPUT SCI INF SYST, V7, P211, DOI 10.2298/CSIS1001211Q Metallinou A., 2010, P INT C AC SPEECH SI Mower E, 2011, IEEE T AUDIO SPEECH, V19, P1057, DOI 10.1109/TASL.2010.2076804 Pantic M, 2005, P 13 ANN ACM INT C M, P669, DOI 10.1145/1101149.1101299 Prendinger H, 2005, INT J HUM-COMPUT ST, V62, P231, DOI 10.1016/j.ijhcs.2004.11.009 Schuller B., 2007, P INT SCHULLER B, 2009, P INT BRIGHT UK Steidl S., 2009, AUTOMATIC CLASSIFICA Vapnik V., 1995, NATURE STAT LEARNING WANGER HL, 1993, J NONVERBAL BEHAV, V17, P3 XIAO Z, 2007, P INT S MULT WORKSH, P291 YILDIRIM S, 2005, P EUR LISB PORT Yildirim S, 2011, COMPUT SPEECH LANG, V25, P29, DOI 10.1016/j.csl.2009.12.004 NR 34 TC 26 Z9 29 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV-DEC PY 2011 VL 53 IS 9-10 SI SI BP 1162 EP 1171 DI 10.1016/j.specom.2011.06.004 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 810CW UT WOS:000294104000008 ER PT J AU Kockmann, M Burget, L Cernocky, J AF Kockmann, Marcel Burget, Lukas Cernocky, Jan Honza TI Application of speaker- and language identification state-of-the-art techniques for emotion recognition SO SPEECH COMMUNICATION LA English DT Article DE Emotion recognition; Gaussian mixture models; Maximum-mutual-information; Intersession variability compensation; Score-level fusion ID VERIFICATION AB This paper describes our efforts of transferring feature extraction and statistical modeling techniques from the fields of speaker and language identification to the related field of emotion recognition. We give detailed insight to our acoustic and prosodic feature extraction and show how to apply Gaussian Mixture Modeling techniques on top of it. We focus on different flavors of Gaussian Mixture Models (GMMs), including more sophisticated approaches like discriminative training using Maximum-Mutual-Information (MMI) criterion and InterSession Variability (ISV) compensation. Both techniques show superior performance in language and speaker identification. Furthermore, we combine multiple system outputs by score-level fusion to exploit the complementary information in diverse systems. Our proposal is evaluated with several experiments on the FAU Aibo Emotion Corpus containing non-acted spontaneous emotional speech. 
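The Kockmann et al. record above transfers speaker- and language-identification modelling to emotion recognition: acoustic and prosodic features are modelled with Gaussian Mixture Models, refined with discriminative MMI training and intersession variability compensation, and multiple systems are fused at the score level. The sketch below shows only the plain per-class GMM scoring at the base of that pipeline (MMI, ISV and fusion are not illustrated); scikit-learn mixtures and synthetic features stand in for the real front-end, and the class labels follow the five cover classes of the Interspeech 2009 Emotion Challenge task mentioned in this record.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
EMOTIONS = ["anger", "emphatic", "neutral", "positive", "rest"]

# Placeholder frame-level features (e.g., MFCC plus prosodic contours) per class.
train = {e: rng.normal(loc=i, scale=1.0, size=(500, 13))
         for i, e in enumerate(EMOTIONS)}

# One maximum-likelihood GMM per emotion class.
models = {e: GaussianMixture(n_components=8, covariance_type="diag",
                             random_state=0).fit(feats)
          for e, feats in train.items()}

def classify(utterance_frames):
    """Score an utterance (frames x dims) against every class model and
    return the emotion with the highest average log-likelihood."""
    scores = {e: m.score(utterance_frames) for e, m in models.items()}
    return max(scores, key=scores.get)

test_utt = rng.normal(loc=0, scale=1.0, size=(120, 13))
print(classify(test_utt))   # expected to score closest to the class at loc=0
```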
Within the Interspeech 2009 Emotion Challenge we could achieve the best results for the 5-class task of the Open Performance Sub-Challenge with an unweighted average recall of 41.7%. Further additional experiments on the acted Berlin Database of Emotional Speech show the capability of intersession variability compensation for emotion recognition. (C) 2011 Elsevier B.V. All rights reserved. C1 [Kockmann, Marcel; Burget, Lukas; Cernocky, Jan Honza] Brno Univ Technol, Speech FIT, Brno, Czech Republic. RP Kockmann, M (reprint author), Brno Univ Technol, Speech FIT, Brno, Czech Republic. EM kockmann@fit.vutbr.cz FU European project MOBIO [FP7-214324]; Grant Agency of Czech Republic [102/08/0707]; Czech Ministry of Education [MSM0021630528]; SVOX Deutschland GmbH, Munich, Germany FX This work was partly supported by European project MOBIO (FP7-214324), by Grant Agency of Czech Republic project No. 102/08/0707, and by the Czech Ministry of Education project No. MSM0021630528. Marcel Kockmann is supported by SVOX Deutschland GmbH, Munich, Germany. CR Batliner A, 2006, P IS LTC 2006 LJUBL, P240 Bishop C. M., 2006, PATTERN RECOGNITION Brummer N, 2007, IEEE T AUDIO SPEECH, V15, P2072, DOI 10.1109/TASL.2007.902870 Brummer N., 2004, P NIST SPEAK REC EV Brummer N, 2006, COMPUT SPEECH LANG, V20, P230, DOI 10.1016/j.csl.2005.08.001 Burget L, 2007, IEEE T AUDIO SPEECH, V15, P1979, DOI 10.1109/TASL.2007.902499 Burkhardt F., 2005, 9 EUR C SPEECH COMM Cohen J, 1995, J ACOUST SOC AM, V97, P3246, DOI 10.1121/1.411700 DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 Dehak N., 2009, IEEE T AUDIO SPEECH, P1 GAURAV M, 2008, SPOK LANG TECHN WORK, P313 Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616 HUBEIKA V, 2008, P INT, P1990 Kenny P, 2008, IEEE T AUDIO SPEECH, V16, P980, DOI 10.1109/TASL.2008.925147 Kinnunen T, 2010, SPEECH COMMUN, V52, P12, DOI 10.1016/j.specom.2009.08.009 KOCKMANN M, 2008, SPOK LANG TECHN WORK, P45 Kockmann M, 2009, P INT BRIGHT, P348 Matejka P., 2008, P INT Matejka P., 2006, P OD *NIST, 2005, 2005 NIST LANG REC E, P1 POVEY D, 2003, THESIS CAMBRIDGE U E, P1 Reynolds DA, 2000, DIGIT SIGNAL PROCESS, V10, P19, DOI 10.1006/dspr.1999.0361 SCHULLER B, 2006, SPEECH PROSODY DRESD SCHULLER B, 2007, INTERSPEECH 2007 1 4 Schuller B, 2009, P INT BRIGHT, P1, DOI 10.1007/978-3-0346-0198-6_1 Schwarz P., 2006, P ICASSP, P325 SEPPI D, 2008, P INT SHRIBERG E, 2009, INTERSPEECH BRIGHTON STEIDL S, 2009, STUDIEN MUSTERERKENN, V28, P1 Torres-Carrasquillo P.A., 2002, 7 INT C SPOK LANG PR Ververidis D., 2003, P 1 RICHM C, P109 Vlasenko B, 2007, P INT, P2249 Young S. J., 2006, HTK BOOK VERSION 3 4 NR 33 TC 9 Z9 11 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV-DEC PY 2011 VL 53 IS 9-10 SI SI BP 1172 EP 1185 DI 10.1016/j.specom.2011.01.007 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 810CW UT WOS:000294104000009 ER PT J AU Bozkurt, E Erzin, E Erdem, CE Erdem, AT AF Bozkurt, Elif Erzin, Engin Erdem, Cigdem Eroglu Erdem, A. 
Tanju TI Formant position based weighted spectral features for emotion recognition SO SPEECH COMMUNICATION LA English DT Article DE Emotion recognition; Emotional speech classification; Spectral features; Formant frequency; Line spectral frequency; Decision fusion AB In this paper, we propose novel spectrally weighted mel-frequency cepstral coefficient (WMFCC) features for emotion recognition from speech. The idea is based on the fact that formant locations carry emotion-related information, and therefore critical spectral bands around formant locations can be emphasized during the calculation of MFCC features. The spectral weighting is derived from the normalized inverse harmonic mean function of the line spectral frequency (LSF) features, which are known to be localized around formant frequencies. The above approach can be considered as an early data fusion of spectral content and formant location information. We also investigate methods for late decision fusion of unimodal classifiers. We evaluate the proposed WMFCC features together with the standard spectral and prosody features using HMM based classifiers on the spontaneous FAU Aibo emotional speech corpus. The results show that unimodal classifiers with the WMFCC features perform significantly better than the classifiers with standard spectral features. Late decision fusion of classifiers provide further significant performance improvements. (C) 2011 Elsevier B.V. All rights reserved. C1 [Bozkurt, Elif; Erzin, Engin] Koc Univ, Multimedia Vis & Graph Lab, Coll Engn, TR-34450 Istanbul, Turkey. [Erdem, Cigdem Eroglu] Bahcesehir Univ, Dept Elect & Elect Engn, TR-34353 Istanbul, Turkey. [Erdem, A. Tanju] Ozyegin Univ, Dept Elect & Elect Engn, TR-34662 Istanbul, Turkey. RP Erzin, E (reprint author), Koc Univ, Multimedia Vis & Graph Lab, Coll Engn, TR-34450 Istanbul, Turkey. EM ebozkurt@ku.edu.tr; eerzin@ku.edu.tr; cigdem.eroglu@bahcesehir.edu.tr; tanju.erdem@ozyegin.edu.tr RI Erzin, Engin/H-1716-2011; Eroglu Erdem, Cigdem/J-4216-2012 OI Eroglu Erdem, Cigdem/0000-0002-9264-5652 FU Turkish Scientific and Technical Research Council (TUBITAK) [106E201, COST2102, 110E056] FX This work was supported in part by the Turkish Scientific and Technical Research Council (TUBITAK) under projects 106E201 (COST2102 action) and 110E056. The authors would like to acknowledge and thank the anonymous referees for their valuable comments that have significantly improved the quality of the paper. CR Batliner A, 2003, SPEECH COMMUN, V40, P117, DOI 10.1016/S0167-6393(02)00079-1 Boersma P., 2010, PRAAT DOING PHONETIC Deller J. 
R., 1993, DISCRETE TIME PROCES Dietterich TG, 1998, NEURAL COMPUT, V10, P1895, DOI 10.1162/089976698300017197 DUMOUCHEL P, 2009, INT 2009 10 ANN C IN, P1 Erzin E, 2005, IEEE T MULTIMEDIA, V7, P840, DOI 10.1109/TMM.2005.854464 GOUDBEEK MB, 2009, P INT BRIGHT UK GRIMM M, 2006, P 14 EUR SIGN PROC C ITAKURA F, 1975, J ACOUST SOC AM, V57, pS35, DOI 10.1121/1.1995189 Kittler J, 1998, IEEE T PATTERN ANAL, V20, P226, DOI 10.1109/34.667881 KOCKMANN M, 2009, INT 2009 10 ANN C IN, V1, P316 Laroia R., 1991, P IEEE INT C AC SPEE, P641, DOI 10.1109/ICASSP.1991.150421 LEE C, 2004, INT C SPOK LANG PROC Lee CM, 2005, IEEE T SPEECH AUDI P, V13, P293, DOI 10.1109/TSA.2004.838534 Morris RW, 2002, IEEE SIGNAL PROC LET, V9, P19, DOI 10.1109/97.988719 Morrison D, 2007, SPEECH COMMUN, V49, P98, DOI 10.1016/j.specom.2006.11.004 Nakatsu R, 2000, KNOWL-BASED SYST, V13, P497, DOI 10.1016/S0950-7051(00)00070-8 Neiberg D, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P2755 POLZIN T, 2000, INTERSPEECH 2008 Sargin ME, 2007, IEEE T MULTIMEDIA, V9, P1396, DOI 10.1109/TMM.2007.906583 SCHERER KR, 1995, P 13 INT C PHON SCI, P90 SCHULLER B, 2006, DAGA, P57 SCHULLER B, 2009, INTERSPEECH 2009 Schuller B., 2003, P INT C AC SPEECH SI Steidl S., 2009, AUTOMATIC CLASSIFICA Ververidis D, 2006, SPEECH COMMUN, V48, P1162, DOI 10.1016/j.specom.2006.04.003 VLASENKO B, 2007, P 2 INT C AFF COMP I, P139 Zeng ZH, 2009, IEEE T PATTERN ANAL, V31, P39, DOI 10.1109/TPAMI.2008.52 NR 28 TC 8 Z9 9 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV-DEC PY 2011 VL 53 IS 9-10 SI SI BP 1186 EP 1197 DI 10.1016/j.specom.2011.04.003 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 810CW UT WOS:000294104000010 ER PT J AU Polzehl, T Schmitt, A Metze, F Wagner, M AF Polzehl, Tim Schmitt, Alexander Metze, Florian Wagner, Michael TI Anger recognition in speech using acoustic and linguistic cues SO SPEECH COMMUNICATION LA English DT Article DE Emotion detection; Anger classification; Linguistic and prosodic acoustic modeling; IGR ranking; Decision fusion; IVR speech ID EMOTIONS AB The present study elaborates on the exploitation of both linguistic and acoustic feature modeling for anger classification. In terms of acoustic modeling we generate statistics from acoustic audio descriptors, e.g. pitch, loudness, spectral characteristics. Ranking our features we see that loudness and MFCC seem most promising for all databases. For the English database also pitch features are important. In terms of linguistic modeling we apply probabilistic and entropy-based models of words and phrases, e.g. Bag-of-Words (BOW), Term Frequency (TF), Term Frequency - Inverse Document Frequency (TF.IDF) and the Self-Referential Information (SRI). SRI clearly outperforms vector space models. Modeling phrases slightly improves the scores. After classification of both acoustic and linguistic information on separated levels we fuse information on decision level adding confidences. We compare the obtained scores on three different databases. Two databases are taken from the IVR customer care domain, another database accounts for a WoZ data collection. All corpora are of realistic speech condition. We observe promising results for the IVR databases while the WoZ database shows lower scores overall. 
In order to provide comparability between the results we evaluate classification success using the f1 measurement in addition to overall accuracy figures. As a result, acoustic modeling clearly outperforms linguistic modeling. Fusion slightly improves overall scores. With a baseline of approximately 60% accuracy and .40 f1-measurement by constant majority class voting we obtain an accuracy of 75% with respective .70 f1 for the WoZ database. For the IVR databases we obtain approximately 79% accuracy with respective .78 f1 over a baseline of 60% accuracy with respective .38 f1. (C) 2011 Elsevier B.V. All rights reserved. C1 [Polzehl, Tim] Tech Univ Berlin, Qual & Usabil Lab, D-10587 Berlin, Germany. [Polzehl, Tim] Deutsch Telekom Labs, Qual & Usabil Lab, D-10587 Berlin, Germany. [Schmitt, Alexander] Univ Ulm, Dialogue Syst Grp, D-89081 Ulm, Germany. [Schmitt, Alexander] Univ Ulm, Inst Informat Technol, D-89081 Ulm, Germany. [Metze, Florian] Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA 15213 USA. [Wagner, Michael] Univ Canberra, Natl Ctr Biometr Studies, Canberra, ACT 2601, Australia. RP Polzehl, T (reprint author), Tech Univ Berlin, Qual & Usabil Lab, Ernst Reuter Pl 7, D-10587 Berlin, Germany. EM tim.polzehl@gmail.com; alexander.schmitt@uni-ulm.de; fmetze@cs.cmu.edu; michael.wagner@canberra.edu.au RI Metze, Florian/N-4661-2014 OI Metze, Florian/0000-0002-6663-8600 CR Batliner A, 2000, ISCA WORKSH SPEECH E Bitouk D, 2010, SPEECH COMMUN, V52, P613, DOI 10.1016/j.specom.2010.02.010 Boersma P., 2009, PRAAT DOING PHONETIC BURKHARDT F, 2005, P ANN C INT SPEECH C Burkhardt F., 2005, P EL SPEECH SIGN PRO BURKHARDT F, 2009, P IEEE ICASSP, P4761 CORTES C, 1995, MACH LEARN, V20, P273, DOI 10.1023/A:1022627411411 DAVIES M, 1982, MEASURING AGREEMENT Devillers L, 2005, NEURAL NETWORKS, V18, P407, DOI 10.1016/j.neunet.2005.03.007 Duda R., 2000, PATTERN CLASSIFICATI DUMOUCHEL P, 2009, P ANN C INT SPEECH C ENBERG IS, 1996, DOCUMENTATION DANISH Fastl H, 2005, PSYCHOACOUSTICS FACT HOZJAN V, 2003, INT J SPEECH TECHNOL, V6, P11 Huang X., 2001, SPOKEN LANGUAGE PROC Lee CM, 2005, IEEE T SPEECH AUDI P, V13, P293, DOI 10.1109/TSA.2004.838534 LEE FM, 2008, INT C SIGN PROC ROB, P171 METZE F, 2008, GETTING CLOSER TAILO Metze F., 2009, P INT C SEM COMP ICS METZE F, 2009, P ANN C INT SPEECH C, P1 Polzehl T, 2010, SPOKEN DIALOGUE SYST, P81 Polzehl T., 2009, P INT BRIGHT, P340 POLZEHL T, 2009, P INT WORKSH SPOK DI Schmitt A., 2010, 6 INT C INT ENV IE 1 Schuller B, 2009, INT C AC SPEECH SIGN SCHULLER B, 2006, THESIS TU MUNCHEN MU SCHULLER B, 2009, P ANN C INT SPEECH C Schuller B., 2004, IEEE INT C AC SPEECH SHAFRAN I, 2003, IEEE WORKSH AUT SPEE, P31 SHAFRAN I, 2005, IEEE INT C AC SPEECH Steidl S, 2005, INT CONF ACOUST SPEE, P317 Steidl S., 2009, THESIS VIDRASCU L, 2007, PARALING VLASENKO B, 2008, P INT BRISB AUSTR, P805 Vlasenko B., 2007, P INTERSPEECH 2007, P2225 VLASENKO B, 2009, P ANN C INT SPEECH C WANG Y, 2009, EL MEAS INSTR ICEMI Yacoub S., 2003, EUROSPEECH, P1 NR 38 TC 11 Z9 11 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
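For the linguistic modelling described in the Polzehl et al. record above (Bag-of-Words, TF and TF.IDF term weighting over words and phrases, fused with the acoustic system at decision level by adding confidences), a minimal scikit-learn sketch follows. The Self-Referential Information model that the record reports as the strongest linguistic variant is not reproduced, and the transcripts and labels below are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical transcribed IVR turns with anger labels (1 = angry).
turns = ["i want to talk to a real person right now",
         "this is the third time i have called",
         "yes please check my account balance",
         "thank you that works fine",
         "no no no that is not what i said",
         "could you repeat the last option please"]
labels = [1, 1, 0, 0, 1, 0]

# TF.IDF over unigrams and bigrams approximates the word-and-phrase modelling.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                      LogisticRegression())
model.fit(turns, labels)

probs = model.predict_proba(["i said a real person"])[:, 1]
print(f"P(anger) from the linguistic model alone: {probs[0]:.2f}")
# In the setup described above, this confidence would be added to the
# acoustic classifier's confidence for decision-level fusion.
```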
PD NOV-DEC PY 2011 VL 53 IS 9-10 SI SI BP 1198 EP 1209 DI 10.1016/j.specom.2011.05.002 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 810CW UT WOS:000294104000011 ER PT J AU Lopez-Cozar, R Silovsky, J Kroul, M AF Lopez-Cozar, Ramon Silovsky, Jan Kroul, Martin TI Enhancement of emotion detection in spoken dialogue systems by combining several information sources SO SPEECH COMMUNICATION LA English DT Article DE Adaptive spoken dialogue systems; Combination of classifiers; Information fusion; Emotion detection; Human computer interaction ID RECOGNITION; AGREEMENT; COMPUTER; CLASSIFIERS; SPEECH; TUTORS; USER AB This paper proposes a technique to enhance emotion detection in spoken dialogue systems by means of two modules that combine different information sources. The first one, called Fusion-0, combines emotion predictions generated by a set of classifiers that deal with different kinds of information about each sentence uttered by the user. To do this, the module employs several methods for information fusion that produce other predictions about the emotional state of the user. The predictions are the input to the second information fusion module, called Fusion-1, where they are combined to deduce the emotional state of the user. Fusion-0 represents a method employed in previous studies to enhance classification rates, whereas Fusion-1 represents the novelty of the technique, which is the combination of emotion predictions generated by Fusion-0. One advantage of the technique is that it can be applied as a posterior processing stage to any other methods that combine information from different information sources at the decision level. This is so because the technique works on the predictions (outputs) of the methods, without interfering in the procedure used to obtain these predictions. Another advantage is that the technique can be implemented as a modular architecture, which facilitates the setting up within a spoken dialogue system as well as the deduction of the emotional state of the user in real time. Experiments have been carried out considering classifiers to deal with prosodic, acoustic, lexical, and dialogue acts information, and three methods to combine information: multiplication of probabilities, average of probabilities, and unweighted vote. The results show that the technique enhances the classification rates of the standard fusion by 2.27% and 3.38% absolute in experiments carried out considering two and three emotion categories, respectively. (C) 2011 Elsevier B.V. All rights reserved. C1 [Lopez-Cozar, Ramon] Univ Granada, Dept Languages & Comp Syst, Fac Comp Sci, E-18071 Granada, Spain. [Silovsky, Jan; Kroul, Martin] Tech Univ Liberec, Inst Informat Technol & Elect, Fac Mechatron, Liberec, Czech Republic. RP Lopez-Cozar, R (reprint author), Univ Granada, Dept Languages & Comp Syst, Fac Comp Sci, E-18071 Granada, Spain. EM rlopezc@ugr.es; jan.silovsky@tul.cz; martin.kroul@tul.cz RI Prieto, Ignacio/B-5361-2013; Lopez-Cozar, Ramon/A-7686-2012 OI Lopez-Cozar, Ramon/0000-0003-2078-495X FU Spanish project HADA [TIN2007-64718]; Czech Grant Agency [102/08/0707]; Technical University of Liberec FX This research has been funded by the Spanish project HADA TIN2007-64718, the Czech Grant Agency Project No. 102/08/0707 and the Student Grant Scheme (SGS) at the Technical University of Liberec. 
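The Lopez-Cozar et al. record above chains two fusion modules: Fusion-0 combines the predictions of classifiers for prosodic, acoustic, lexical and dialogue-act information by multiplication of probabilities, average of probabilities and unweighted vote, and Fusion-1 then combines the Fusion-0 predictions to deduce the emotional state. The sketch below illustrates that two-stage idea on invented posteriors; the record does not specify how Fusion-1 combines its inputs, so a simple average is assumed here.

```python
import numpy as np

EMOTIONS = ["neutral", "angry", "doubtful"]

# Hypothetical posteriors for one utterance from four base classifiers
# (prosodic, acoustic, lexical, dialogue acts), one row per classifier.
posteriors = np.array([[0.5, 0.4, 0.1],
                       [0.3, 0.6, 0.1],
                       [0.6, 0.3, 0.1],
                       [0.4, 0.4, 0.2]])

def fusion0(p):
    """Combine base-classifier posteriors by product, average and unweighted vote."""
    prod = p.prod(axis=0)
    prod = prod / prod.sum()
    avg = p.mean(axis=0)
    votes = np.bincount(p.argmax(axis=1), minlength=p.shape[1]) / p.shape[0]
    return np.vstack([prod, avg, votes])

def fusion1(f0_outputs):
    """Combine the Fusion-0 predictions (assumed: simple average) and decide."""
    combined = f0_outputs.mean(axis=0)
    return EMOTIONS[int(np.argmax(combined))], combined

label, dist = fusion1(fusion0(posteriors))
print(label, dist.round(3))
```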
The authors would like to thank the reviewers and the guest editors for their comments, suggestions and corrections that significantly improved the quality of this paper. CR AI H, 2006, P INT PITTSB PA, P797 Ang J, 2002, P INT C SPOK LANG PR, P2037 Barra-Chicote R., 2009, P INT, P336 Batliner A, 2003, SPEECH COMMUN, V40, P117, DOI 10.1016/S0167-6393(02)00079-1 Bozkurt E., 2009, P INT BRIGHT, P324 Carletta J, 1996, COMPUT LINGUIST, V22, P249 COHEN J, 1960, EDUC PSYCHOL MEAS, V20, P37, DOI 10.1177/001316446002000104 Cover T M, 1991, ELEMENTS INFORM THEO Dellaert F., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.608022 Devillers L., 2006, P INT C SPOK LANG PR, P801 HASTIE HW, 2002, P M ASS COMP LING, P384 HUBER R, 2000, P INT C SPOK LANG PR, V1, P665 Kanda T, 2004, HUM-COMPUT INTERACT, V19, P61, DOI 10.1207/s15327051hci1901&2_4 Kittler J, 1998, IEEE T PATTERN ANAL, V20, P226, DOI 10.1109/34.667881 Klein J, 2002, INTERACT COMPUT, V14, P119 Kuncheva LI, 2001, PATTERN RECOGN, V34, P299, DOI 10.1016/S0031-3203(99)00223-X LANDIS JR, 1977, BIOMETRICS, V33, P159, DOI 10.2307/2529310 Le CA, 2005, LECT NOTES ARTIF INT, V3518, P262 Lee C, 2003, P EUR, P157 Lee C, 2009, P INT, P320 Lee CM, 2005, IEEE T SPEECH AUDI P, V13, P293, DOI 10.1109/TSA.2004.838534 Lee C-M, 2001, P IEEE WORKSH AUT SP, P240 Lee CM, 2002, P INT C SPOK LANG PR, P873 Liscombe J., 2005, P INT LISB PORT, P1845 Litman DJ, 2006, SPEECH COMMUN, V48, P559, DOI 10.1016/j.specom.2005.09.008 Lopez-Cozar R., 2005, SPOKEN MULTILINGUAL LOPEZCOZAR R, 2005, COMPUTER SPEECH LANG, V20, P420, DOI DOI 10.1016/J.CSL.2005.05.003 Luengo I, 2005, P INTERSPEECH, P493 LUENGO I, 2009, P INT BRIGHT UK SEP, P332 LUGGER M, 2009, P INT, P1995 MOLLER S, 2004, QUALITY TELEPHONE BA Morrison D, 2007, SPEECH COMMUN, V49, P98, DOI 10.1016/j.specom.2006.11.004 Nakatsu R., 1999, P INT C MULT COMP SY Neiberg D, 2006, P INT C SPOK LANG PR, P809 Nwe TL, 2003, SPEECH COMMUN, V41, P603, DOI 10.1016/S01167-6393(03)00099-2 Ortony A., 1990, COGNITIVE STRUCTURE PETRUSHIN V, 2000, P ICSLP BEIJ CHIN Plutchik R, 1994, PSYCHOL BIOL EMOTION, V1st Polzehl T., 2009, P INT BRIGHT, P340 ROLI F, 2004, P 5 INT WORKSH MSC 2, V3077 Scheirer J, 2002, INTERACT COMPUT, V14, P93, DOI 10.1016/S0953-5438(01)00059-5 Schuller Bjorn, 2009, P INTERSPEECH, P312 Steidl S., 2009, AUTOMATIC CLASSIFICA Stolcke A, 2000, COMPUT LINGUIST, V26, P339, DOI 10.1162/089120100561737 TAYLOR P, 1998, LANG SPEECH, V41, P489 XU L, 2009, P INT, P2035 Yacoub S., 2003, P EUROSPEECH, P729 NR 47 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV-DEC PY 2011 VL 53 IS 9-10 SI SI BP 1210 EP 1228 DI 10.1016/j.specom.2011.01.006 PG 19 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 810CW UT WOS:000294104000012 ER PT J AU Garner, PN AF Garner, Philip N. TI Cepstral normalisation and the signal to noise ratio spectrum in automatic speech recognition SO SPEECH COMMUNICATION LA English DT Article DE Automatic speech recognition; Cepstral normalisation; Noise robustness; Aurora ID ENHANCEMENT; SUPPRESSION AB Cepstral normalisation in automatic speech recognition is investigated in the context of robustness to additive noise. In this paper, it is argued that such normalisation leads naturally to a speech feature based on signal to noise ratio rather than absolute energy (or power). 
Explicit calculation of this SNR-cepstrum by means of a noise estimate is shown to have theoretical and practical advantages over the usual (energy based) cepstrum. The relationship between the SNR-cepstrum and the articulation index, known in psycho-acoustics, is discussed. Experiments are presented suggesting that the combination of the SNR-cepstrum with the well known perceptual linear prediction method can be beneficial in noisy environments. (C) 2011 Elsevier B.V. All rights reserved. C1 Idiap Res Inst, Ctr Parc, CH-1920 Martigny, Switzerland. RP Garner, PN (reprint author), Idiap Res Inst, Ctr Parc, Rue Marconi 19,POB 592, CH-1920 Martigny, Switzerland. EM pgarner@idiap.ch FU Swiss National Science Foundation under the National Center of Competence in Research (NCCR) on Interactive Multi-modal Information Management (IM2) FX This work was supported by the Swiss National Science Foundation under the National Center of Competence in Research (NCCR) on Interactive Multi-modal Information Management (IM2). This paper only reflects the authors' views and funding agencies are not liable for any use that may be made of the information contained herein. CR ACERO A, 1990, THESIS CARNEGIE MELL, P15213 Acero A., 1990, P IEEE INT C AC SPEE, V2, P849 Allen JB, 1994, IEEE T SPEECH AUDI P, V2, P567, DOI 10.1109/89.326615 Allen JB, 2005, J ACOUST SOC AM, V117, P2212, DOI 10.1121/1.1856231 Au Yeung S.-K., 2004, P INT C SPOK LANG PR BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 Cohen I, 2005, IEEE T SPEECH AUDI P, V13, P870, DOI 10.1109/TSA.2005.851940 de la Torre A, 2005, IEEE T SPEECH AUDI P, V13, P355, DOI 10.1109/TSA.2005.845805 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 *ETSI, 2002, 202050 ETSI FURUI S, 1981, IEEE T ACOUST SPEECH, V29, P254, DOI 10.1109/TASSP.1981.1163530 GARNER PN, 2009, P IEEE WORKSH AUT SP Garner P.N., 2010, P INT MAK JAP HAIN T, 2006, P NIST RT06 SPRING W Hain T., 2010, P INT MAK JAP Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616 HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423 Hirsch H. G., 2000, ISCA ITRW ASR2000 AU LATHOUD G, 2006, 0609 IDIAPRR LATHOUD G, 2005, P IEEE WORKSH AUT SP LI J, 2007, P IEEE WORKSH AUT SP LINDBERG B, 2001, DANISH SPEECHDAT CAR LOBDELL BE, 2008, P INT BRISB AUSTR MAKHOUL J, 1975, P IEEE, V63, P561, DOI 10.1109/PROC.1975.9792 Martin R, 2001, IEEE T SPEECH AUDI P, V9, P504, DOI 10.1109/89.928915 MCAULAY RJ, 1980, IEEE T ACOUST SPEECH, V28, P137, DOI 10.1109/TASSP.1980.1163394 MORENO PJ, 1996, THESIS CARNEGIE MELL, P15213 Moreno PJ, 1996, P IEEE INT C AC SPEE, V2, P733 NETSCH L, 2001, AU27300 STQ AUR DSR PARIHAR N, 2004, P 12 EUR SIGN PROC C Plapous C., 2004, P IEEE INT C AC SPEE, V1, P289 Ris C, 2001, SPEECH COMMUN, V34, P141, DOI 10.1016/S0167-6393(00)00051-0 Segura J.C., 2002, P ICSLP 02, P225 STEVENS SS, 1957, PSYCHOL REV, V64, P153, DOI 10.1037/h0046162 Van Compernolle D., 1989, Computer Speech and Language, V3, DOI 10.1016/0885-2308(89)90027-2 Viikki O, 1998, SPEECH COMMUN, V25, P133, DOI 10.1016/S0167-6393(98)00033-8 VIKKI O, 1997, ROBUST SPEECH RECOGN, P107 NR 37 TC 7 Z9 7 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
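The Garner record above argues that cepstral normalisation leads naturally to a feature based on the per-band signal-to-noise ratio rather than absolute energy, computed explicitly from a noise estimate. A minimal sketch of that idea follows; the record does not give the exact compression or analysis settings, so the log(1 + SNR) mapping, the leading-frame noise estimate and all parameters here are assumptions, with librosa standing in for the front-end.

```python
import numpy as np
import librosa
from scipy.fft import dct

def snr_cepstrum(y, sr, n_mels=24, n_ceps=13, noise_frames=10):
    """MFCC-like features computed on an SNR spectrum instead of raw energy."""
    # Mel filterbank energies, one column per frame.
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400, hop_length=160,
                                       n_mels=n_mels)
    # Crude noise estimate: average the first few (assumed speech-free) frames.
    noise = S[:, :noise_frames].mean(axis=1, keepdims=True) + 1e-10
    snr = S / noise                          # per-band signal-to-noise ratio
    compressed = np.log1p(snr)               # assumed compression: log(1 + SNR)
    ceps = dct(compressed, type=2, axis=0, norm="ortho")[:n_ceps]
    return ceps                              # shape: (n_ceps, n_frames)

# Toy usage with a synthetic tone in white noise.
sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
y = 0.1 * np.random.randn(sr) + 0.5 * np.sin(2 * np.pi * 220 * t)
print(snr_cepstrum(y, sr).shape)
```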
PD OCT PY 2011 VL 53 IS 8 BP 991 EP 1001 DI 10.1016/j.specom.2011.05.007 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 804RJ UT WOS:000293671400001 ER PT J AU D'Haro, LF de Cordoba, R San-Segundo, R Ferreiros, J Pardo, JM AF Fernando D'Haro, Luis de Cordoba, Ricardo San-Segundo, Ruben Ferreiros, Javier Manuel Pardo, Jose TI Design and evaluation of acceleration strategies for speeding up the development of dialog applications SO SPEECH COMMUNICATION LA English DT Article DE Development tools; Automatic design; VoiceXML; Data mining; Speech-based dialogs ID PLATFORM; SYSTEMS AB In this paper, we describe a complete development platform that features different innovative acceleration strategies, not included in any other current platform, that simplify and speed up the definition of the different elements required to design a spoken dialog service. The proposed accelerations are mainly based on using the information from the backend database schema and contents, as well as cumulative information produced throughout the different steps in the design. Thanks to these accelerations, the interaction between the designer and the platform is improved, and in most cases the design is reduced to simple confirmations of the "proposals" that the platform dynamically provides at each step. In addition, the platform provides several other accelerations such as configurable templates that can be used to define the different tasks in the service or the dialogs to obtain or show information to the user, automatic proposals for the best way to request slot contents from the user (i.e. using mixed-initiative forms or directed forms), an assistant that offers the set of more probable actions required to complete the definition of the different tasks in the application, or another assistant for solving specific modality details such as confirmations of user answers or how to present them the lists of retrieved results after querying the backend database. Additionally, the platform also allows the creation of speech grammars and prompts, database access functions, and the possibility of using mixed initiative and over-answering dialogs. In the paper we also describe in detail each assistant in the platform, emphasizing the different kind of methodologies followed to facilitate the design process at each one. Finally, we describe the results obtained in both a subjective and an objective evaluation with different designers that confirm the viability, usefulness, and functionality of the proposed accelerations. Thanks to the accelerations, the design time is reduced in more than 56% and the number of keystrokes by 84%. (C) 2011 Elsevier B.V. All rights reserved. C1 [Fernando D'Haro, Luis; de Cordoba, Ricardo; San-Segundo, Ruben; Ferreiros, Javier; Manuel Pardo, Jose] Univ Politecn Madrid, Grp Tecnol Habla, Madrid, Spain. RP D'Haro, LF (reprint author), ETSI Telecomunicac, Ciudad Univ S-N, Madrid 28040, Spain. EM lfdharo@die.upm.es; cordoba@die.upm.es; lapiz@die.upm.es; jfl@die.upm.es; pardo@die.upm.es RI Pardo, Jose/H-3745-2013; Cordoba, Ricardo/B-5861-2008 OI Cordoba, Ricardo/0000-0002-7136-9636 FU ROBONAUTA [DPI2007-66846-c02-02]; SD-TEAM [TIN2008-06856-C05-03] FX This work has been supported by ROBONAUTA (DPI2007-66846-c02-02) and SD-TEAM (TIN2008-06856-C05-03). 
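The D'Haro et al. record above speeds up dialog design by deriving proposals from the backend database schema and contents, including whether slots should be requested with mixed-initiative or directed forms and what VoiceXML to generate. The sketch below is a hypothetical illustration of that kind of schema-driven proposal only; the slot list, the few-slots heuristic and the skeletal VoiceXML are invented, not the platform's actual templates or markup.

```python
# Hypothetical schema for one task, e.g. a flight-booking backend table.
SLOTS = [("origin", "city of departure"),
         ("destination", "city of arrival"),
         ("date", "travel date")]

def propose_form(task, slots, mixed_initiative_max=3):
    """Propose a VoiceXML form: few slots -> one mixed-initiative form,
    otherwise one directed field per slot (a made-up design heuristic)."""
    mixed = len(slots) <= mixed_initiative_max
    lines = [f'<form id="{task}">']
    if mixed:
        prompt = " and ".join(desc for _, desc in slots)
        lines.append(f"  <initial><prompt>Please say the {prompt}.</prompt></initial>")
    for name, desc in slots:
        lines.append(f'  <field name="{name}">')
        lines.append(f"    <prompt>What is the {desc}?</prompt>")
        lines.append("  </field>")
    lines.append("</form>")
    return "\n".join(lines)

print(propose_form("book_flight", SLOTS))
```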
We want to thank the following people for their contribution in the coding of the platform and runtime system: to Rosalia Ramos, Jose Ramon Jimenez, Javier Morante, Ignacio Ibarz, and Ruben Martin from the Universidad Politecnica de Madrid, and to all the members of the GEMINI project for making possible the creation of the platform described in this paper. CR Agah A, 2000, INTERACT COMPUT, V12, P529, DOI 10.1016/S0953-5438(99)00022-3 BALENTINE B, 2001, BUILD SPEECH RECOGNI, P414 Bohus D, 2009, COMPUT SPEECH LANG, V23, P332, DOI 10.1016/j.csl.2008.10.001 CHUNG G, 2004, ACL, P63 CORDOBA R, 2004, INT C SPOK LANG PROC, P257 D'Haro L. F., 2009, THESIS U POLITECNICA DHARO LF, 2004, INT C SPOK LANG PROC, P3057 D'Haro LF, 2006, SPEECH COMMUN, V48, P863, DOI 10.1016/j.specom.2005.11.001 EBERMAN B, 2002, 11 INT C WWW, P713 Feng J., 2003, WORKSH AUT SPEECH RE, P168 GEORGILA K, 2004, 4 INT C LANG RES EV Hamerich S. W., 2008, 9 SIGDIAL WORKSH DIS, P92 HAMERICH SW, 2003, XML BASED DIALOG DES, P404 Jung S., 2008, SPEECH COMMUN, V50, P683 Lopez-Cozar R., 2005, SPOKEN MULTILINGUAL McGlashan S., 2004, VOICE EXTENSIBLE MAR McTear M, 2005, SPEECH COMMUN, V45, P249, DOI 10.1016/j.specom.2004.11.006 McTear M., 1998, INT C SPOK LANG PROC, P1223 Pargellis AN, 2004, SPEECH COMMUN, V42, P329, DOI 10.1016/j.specom.2003.10.003 Polifroni J., 2006, INT C LANG RES EV LR, P143 Schubert V., 2005, EUR C SPEECH COMM TE, P789 Tsai MJ, 2006, EXPERT SYST APPL, V31, P684, DOI 10.1016/j.eswa.2006.01.010 Wang YY, 2006, SPEECH COMMUN, V48, P390, DOI 10.1016/j.specom.2005.07.001 Wolters M, 2009, INTERACT COMPUT, V21, P276, DOI 10.1016/j.intcom.2009.05.009 NR 24 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2011 VL 53 IS 8 BP 1002 EP 1025 DI 10.1016/j.specom.2011.05.008 PG 24 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 804RJ UT WOS:000293671400002 ER PT J AU Polyakova, T Bonafonte, A AF Polyakova, Tatyana Bonafonte, Antonio TI Introducing nativization to Spanish TTS systems SO SPEECH COMMUNICATION LA English DT Article DE Nativization; Pronunciation by analogy; Multilingual TTS; Grapheme-to-phoneme conversion; Phoneme-to-phoneme conversion ID TO-PHONEME CONVERSION; ANALOGY AB In the modern world, speech technologies must be flexible and adaptable to any framework. Mass media globalization introduces multilingualism as a challenge for the most popular speech applications such as text-to-speech synthesis and automatic speech recognition. Mixed-language texts vary in their nature and when processed, some essential characteristics must be considered. In Spain and other Spanish-speaking countries, the use of Anglicisms and other words of foreign origin is constantly growing. A particularity of peninsular Spanish is that there is a tendency to nativize the pronunciation of non-Spanish words so that they fit properly into Spanish phonetic patterns. In our previous work, we proposed to use hand-crafted nativization tables that were capable of nativizing correctly 24% of words from the test data. In this work, our goal was to approach the nativization challenge by data-driven methods, because they are transferable to other languages and do not drop in performance in comparison with explicit rules manually written by experts. Training and test corpora for nativization consisted of 1000 and 100 words respectively and were crafted manually. 
Different specifications of nativization by analogy and learning from errors focused on finding the best nativized pronunciation of foreign words. The best obtained objective nativization results showed an improvement from 24% to 64% in word accuracy in comparison to our previous work. Furthermore, a subjective evaluation of the synthesized speech allowed for the conclusion that nativization by analogy is clearly the preferred method among listeners of different backgrounds when comparing to previously proposed methods. These results were quite encouraging and proved that even a small training corpus is sufficient for achieving significant improvements in naturalness for English inclusions of variable length in Spanish utterances. (C) 2011 Elsevier B.V. All rights reserved. C1 [Polyakova, Tatyana; Bonafonte, Antonio] Univ Politecn Cataluna, Signal Theory Dept, ES-08034 Barcelona, Spain. RP Polyakova, T (reprint author), Univ Politecn Cataluna, Signal Theory Dept, Jordi Girona 1-3, ES-08034 Barcelona, Spain. EM tatyana.polyakova@upc.edu; antonio.bonafonte@upc.edu CR [Anonymous], 1999, HDB INT PHONETIC ASS Bear D., 2003, ENGLISH LEARNERS REA, P71 Bellegarda JR, 2005, SPEECH COMMUN, V46, P140, DOI 10.1016/j.specom.2005.03.002 Bisani M, 2008, SPEECH COMMUN, V50, P434, DOI 10.1016/j.specom.2008.01.002 BLACK A, 1998, SSW3, P77 BLACK AW, 2004, P ICASSP MAY, V3, P761 BONAFONTE A, 2008, UPC TTS SYSTEM DESCR Bonafonte Antonio, 2006, P INT C LANG RES EV, P311 Brill E, 1995, COMPUT LINGUIST, V21, P543 Canellada Maria Josefa, 1987, PRONUNCIACION ESPANO CONDE X, 2001, IANUA ROMANCE PHI S4 DAMPER RI, 2004, P 5 INT SPEECH COMM, P209 Dedina M. J., 1991, Computer Speech and Language, V5, DOI 10.1016/0885-2308(91)90017-K *ED 1, 2011, ENGL PROF IND FOX RA, 1995, J ACOUST SOC AM, V97, P2540, DOI 10.1121/1.411974 GLUSHKO R, 1981, INTERACTIVE PROCESSE, P1 Hammond Robert M., 2001, SOUNDS SPANISH ANAL Hartikainen E., 2003, P EUR C SPEECH COMM, P1529 LADEFOGED P, 2003, VOWELS CONSONANTS LINDSTROM A, 2004, THESIS U LINKOPING L Llitjos Ariadna Font, 2001, P EUROSPEECH, P1919 LLORENTE J, 2004, LIBRO ESTILO CANAL T Marchand Y, 2000, COMPUT LINGUIST, V26, P195, DOI 10.1162/089120100561674 Pfister Beat, 2003, P EUR, P2037 POLYAKOVA T, 2006, P INT 2006 PITTSB US, P2442 POLYAKOVA T, 2009, P IEEE INT C AC SPEE, P4261 POLYAKOVA T, 2008, ACT 5 JORN TECN HABL, P207 RAYNOLDS L, 2009, READ WRIT, P1 Real Academia Espanola, 1992, DICC LENG ESP SEJNOWSKI T, 1993, NETTALK CORPUS SOONKLANG T, 2008, NAT LANG ENG, V14, P527 Swan M., 2001, LEARNER ENGLISH TEAC, V2nd Taylor P., 2005, P INT, P1973 TRANCOSO I, 1999, P EUR 1999 5 9 SEPT, P195 TRANCOSO I, 1995, P WORKSH INT LANG SP, P193 Van den Heuvel H., 2009, P INT BRIGHT UK, P2991 VANDENBOSCH A, 1993, P 6 C EUR CHAPT ASS, P45, DOI 10.3115/976744.976751 Wells J. C., 1982, ACCENTS ENGLISH INTR Yavas M., 2006, APPL ENGLISH PHONOLO Zheng M, 2005, LECT NOTES ARTIF INT, V3614, P600 NR 40 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
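The Polyakova and Bonafonte record above compares data-driven nativization against their earlier hand-crafted nativization tables, which map non-Spanish phones onto the Spanish inventory. The sketch below illustrates only that table-lookup baseline with a few invented SAMPA-style mappings; the pronunciation-by-analogy and learning-from-errors methods the record evaluates are considerably more involved.

```python
# Invented (illustrative) English-to-Spanish phone mappings; real nativization
# tables are larger and context-dependent.
NATIVIZE = {
    "v": "b",     # Spanish has no /v/: "video" -> [b]ideo
    "z": "s",     # voiced alveolar fricative folded into /s/
    "S": "tS",    # "sh" typically rendered as the "ch" affricate
    "@": "a",     # schwa replaced by a full vowel
    "I": "i",
    "h": "x",     # English /h/ realised as the Spanish velar fricative
}

def nativize(english_phones):
    """Map an English phone sequence onto Spanish phones by table lookup,
    leaving shared phones unchanged."""
    return [NATIVIZE.get(p, p) for p in english_phones]

print(nativize(["v", "I", "d", "e", "o"]))  # -> ['b', 'i', 'd', 'e', 'o']
```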
PD OCT PY 2011 VL 53 IS 8 BP 1026 EP 1041 DI 10.1016/j.specom.2011.05.009 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 804RJ UT WOS:000293671400003 ER PT J AU Davidson, L AF Davidson, Lisa TI Characteristics of stop releases in American English spontaneous speech SO SPEECH COMMUNICATION LA English DT Article DE Stop releases; Stop-consonant sequences; Spontaneous speech; Articulatory coordination ID CONSONANT SEQUENCES; GLOTTALIZATION; DURATION; OVERLAP AB This study examines the factors affecting the production of stop releases in American English spontaneous speech. Previous research has shown that releases are conditioned by phonetic and social factors. However, previous studies either rely exclusively on read speech, or for sociolinguistic studies, focus on phrase-final stops. In this study, spontaneous speech is collected from two sources: interviews from the non-profit StoryCorps project and from sentences spontaneously generated in a picture description task. Stop releases were examined before obstruents and nasals in word-medial position (e.g. rugby), word-final, phrase-medial position (e.g. They crack nuts), and prepausally (e.g. I look up). Phonetic factors taken into account include identity of the stop, directionality of place of articulation in the consonant cluster (front-to-back vs. back-to-front) and manner of C2. For the StoryCorps data, race of the speaker was also found to be an important predictor. Results showed that approximately a quarter of the stops followed by a consonant were released, but release was strongly affected by the place of the stop and the manner of the following consonant. Release of pre-pausal stops differed between black and white speakers; the latter had double the amount of final release. Other realizations of the stops, such as deletion, lenition, and glottalization are also analyzed. (C) 2011 Elsevier B.V. All rights reserved. C1 NYU, Dept Linguist, New York, NY 10003 USA. RP Davidson, L (reprint author), NYU, Dept Linguist, 10 Washington Pl, New York, NY 10003 USA. EM lisa.davidson@nyu.edu FU NSF [BCS-0449560] FX The author would like to thank Marcos Rohena-Madrazo and Vincent Chanethom for their assistance in data collection and analysis, and Jon Brennan for statistical consulting. Thanks also to the members of the NYU Phonetics and Experimental Phonology Lab, Cohn Wilson, and to the audience at the Acoustical Society of America in Cancun, Mexico for valuable feedback. This research was supported by NSF CAREER Grant BCS-0449560. CR Balota DA, 2007, BEHAV RES METHODS, V39, P445, DOI 10.3758/BF03193014 Batliner A., 1993, P ESCA WORKSH PROS L, P176 BENOR S, 2001, SELECTED PAPERS NWAV, V29, P1 BOERSMA P, 2011, PRAAT 5 2 DOING PHON Browman C. 
P., 1990, PAPERS LABORATORY PH, P341 Bucholtz M., 1998, P 4 BERK WOM LANG C, P119 Byrd D, 1996, J PHONETICS, V24, P209, DOI 10.1006/jpho.1996.0012 Byrd D., 1993, UCLA WORKING PAPERS, V83, P97 Catford John C., 1977, FUNDAMENTAL PROBLEMS Chitoran I., 2002, PAPERS LAB PHONOLOGY, V7 Cho T, 1999, J PHONETICS, V27, P207, DOI 10.1006/jpho.1999.0094 CRYSTAL TH, 1988, J ACOUST SOC AM, V83, P1553, DOI 10.1121/1.395911 CRYSTAL TH, 1988, J PHONETICS, V16, P285 Davidson L, 2008, J INT PHON ASSOC, V38, P137, DOI 10.1017/S0025100308003447 Dilley L, 1996, J PHONETICS, V24, P423, DOI 10.1006/jpho.1996.0023 Eckert P, 2008, J SOCIOLING, V12, P453, DOI 10.1111/j.1467-9841.2008.00374.x Eddington D, 2010, AM SPEECH, V85, P338, DOI 10.1215/00031283-2010-019 Francis W.N., 1958, STRUCTURE AM ENGLISH Gafos A. I., 2010, LAB PHONOLOGY, V10, P657 Ghosh PK, 2009, J ACOUST SOC AM, V126, pEL1, DOI 10.1121/1.3141876 Guy Gregory, 1997, LANG VAR CHANGE, V9, P149 Guy Gregory R, 1980, LOCATING LANGUAGE TI, P1 Hardcastle W. J., 1979, CURRENT ISSUES PHONE, P531 Harris J., 1994, ENGLISH SOUND STRUCT HENDERSON JB, 1982, PHONETICA, V39, P71 HENTON CG, 1987, LANGUAGE SPEECH MIND, P3 Kahn D., 1980, SYLLABLE BASED GEN E KIPARSKY P, 1979, LINGUIST INQ, V10, P421 Kochetov A., 2002, PRODUCTION PERCEPTIO Labov William, 1972, SOCIOLINGUISTIC PATT Ladefoged Peter, 2005, COURSE PHONETICS Lamothe Peter, 2006, J AM HIST, V93, P171 Nespor M., 1986, PROSODIC PHONOLOGY Neu H., 1980, LOCATING LANGUAGE TI, P37 Niyogi P, 2002, J ACOUST SOC AM, V111, P1063, DOI 10.1121/1.1427666 Ortega-Llebaria M, 2004, LAB APPROACHES SPANI, P237 R Development Core Team, 2011, R LANG ENV STAT COMP Raymond William D., 2006, LANG VAR CHANGE, V18, P55 Redi L, 2001, J PHONETICS, V29, P407, DOI 10.1006/jpho.2001.0145 Roberts J, 2006, AM SPEECH, V81, P227, DOI 10.1215/00031283-2006-016 Selkirk E., 1982, STRUCTURE PHONOLOG 2, P337 SMITH DB, 2009, REF REV, V23, P57 Sumner M, 2005, J MEM LANG, V52, P322, DOI 10.1016/j.jml.2004.11.004 SURPRENANT A, 1988, J ACOUST SOC AM, V104, P518 Tiede M. K., 2001, J ACOUST SOC AM, V110, P2657 Trager George L., 1951, OUTLINE ENGLISH STRU Tsukada K, 2004, PHONETICA, V61, P67, DOI 10.1159/000082557 Wolfram Walt, 1969, SOCIOLINGUISTIC DESC Podesva RJ, 2002, LANGUAGE AND SEXUALITY: CONTESTING MEANING IN THEORY AND PRACTICE, P175 Yuan J., 2008, P AC 08, P5687 Zsiga E., 2003, STUDIES 2 LANGUAGE A, V25, P399 Zsiga EC, 2000, J PHONETICS, V28, P69, DOI 10.1006/jpho.2000.0109 ZSIGA EC, 1994, J PHONETICS, V22, P121 NR 53 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2011 VL 53 IS 8 BP 1042 EP 1058 DI 10.1016/j.specom.2011.05.010 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 804RJ UT WOS:000293671400004 ER PT J AU Chen, TH Massaro, DW AF Chen, Trevor H. Massaro, Dominic W. TI Evaluation of synthetic and natural Mandarin visual speech: Initial consonants, single vowels, and syllables SO SPEECH COMMUNICATION LA English DT Article DE Mandarin; Visual speech; Visemes; Speechreading; Synthetic; Natural; Talking head ID VISIBLE SPEECH; HEARING-LOSS; PERCEPTION; INDIVIDUALS; VOCABULARY; CHILDREN AB Although the auditory aspects of Mandarin speech are relatively more heavily-researched and well-known in the field, this study addresses its visual aspects by examining the perception of both Mandarin natural and synthetic visual speech. 
In perceptual experiments, the synthetic visual speech of a computer-animated Mandarin talking head was evaluated and subsequently improved. Also, the basic (or "minimum") units of Mandarin visual speech were determined for initial consonants and final single-vowels. Overall, the current study achieved solid improvements of synthetic visual speech, and this was one step towards a Mandarin synthetic talking head with realistic speech. (C) 2011 Elsevier B.V. All rights reserved. C1 [Chen, Trevor H.; Massaro, Dominic W.] Univ Calif Santa Cruz, Dept Psychol, Santa Cruz, CA 95064 USA. RP Chen, TH (reprint author), Univ Calif Santa Cruz, Dept Psychol, 1156 High St, Santa Cruz, CA 95064 USA. EM river_rover_t@yahoo.com FU Federico and Rena Perlino Scholarship Award; Eugene Cota-Robles Fellowship; Psychology Department at the University of California, Santa Cruz FX The research and writing of this article were supported by the Federico and Rena Perlino Scholarship Award, the Eugene Cota-Robles Fellowship, and the Psychology Department at the University of California, Santa Cruz (the Doctoral Student Sabbatical Fellowship and the Mini-Grant Research Fellowship). The authors thank Michael M. Cohen for offering expert technical assistance. CR Andersson U., 2001, J DEAF STUD DEAF EDU, V6, P103, DOI DOI 10.1093/DEAFED/6.2.103 BAILLY G, 2000, COST254 WORKSH FRIEN Bosseler A, 2003, J AUTISM DEV DISORD, V33, P653, DOI 10.1023/B:JADD.0000006002.82367.4f Campbell CS, 1997, PERCEPTION, V26, P627, DOI 10.1068/p260627 Caplier A, 2007, EURASIP J IMAGE VIDE, DOI 10.1155/2007/45641 CHEN F, 2005, P 2005 INT C CYB CHEN HC, 1991, MANDARIN CONSONANT V, P149 CHEN HC, 1992, MANDARIN VOWEL DIPHT, P179 Chen TH, 2008, J ACOUST SOC AM, V123, P2356, DOI 10.1121/1.2839004 COHEN MM, 1990, BEHAV RES METH INSTR, V22, P260 COLE R, 1998, STILL ESCA WORKSH SP, P163 COSI P, 2002, ICSLP 2002 7 INT C S FISHER CG, 1968, J SPEECH HEAR RES, V11, P796 GAILEY L, 1987, HEARING EYE PSYCHOL, P115 JACKSON PL, 1988, VOLTA REV, V90, P99 LADEFOGED PETER, 2001, COURSE PHONETICS Lee W-S., 2003, J INT PHON ASSOC, V33, P109, DOI 10.1017/S0025100303001208 MacMillan N. A., 2005, DETECTION THEORY USE Massaro D. W., 1998, PERCEIVING TALKING F Massaro D. W., 1989, EXPT PSYCHOL INFORM Massaro D. W., 1987, SPEECH PERCEPTION EA Massaro D. W., 2003, P EUR INT 8 EUR C SP Massaro DW, 2004, J SPEECH LANG HEAR R, V47, P304, DOI 10.1044/1092-4388(2004/025) MASSARO DW, 2005, P 38 ANN HAW INT C S MASSARO DW, 2006, P 9 INT C SPOK LANG, P825 MASSARO DW, 2008, P INT 2008 BRISB QUE, P2623 Massaro DW, 2004, VOLTA REV, V104, P141 MING O, 1999, ICAT 99 Mohammed T, 2006, CLIN LINGUIST PHONET, V20, P621, DOI 10.1080/02699200500266745 OUNI S, 2003, P 15 INT C PHON SCI Ouni S, 2005, SPEECH COMMUN, V45, P115, DOI 10.1016/j.specom.2004.11.008 Pei Y, 2007, IEEE T VIS COMPUT GR, V13, P58, DOI 10.1109/TVCG.2007.22 Pei YR, 2006, LECT NOTES COMPUT SC, V3851, P591 PULLUM Geoffrey, 1996, PHONETIC SYMBOL GUID WALDEN BE, 1977, J SPEECH HEAR RES, V20, P130 WANG AH, 2000, P INT S CHIN SPOK LA, P215 WANG ZM, 2003, J SOFTWARE, V16, P1054 WU Z, 2006, P 9 INT C SPOK LANG ZHOU W, 2007, 6 IEEE ACIS INT C CO, P924 NR 39 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
PD SEP PY 2011 VL 53 IS 7 BP 955 EP 972 DI 10.1016/j.specom.2011.03.009 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 791OZ UT WOS:000292675800001 ER PT J AU Nose, T Kobayashi, T AF Nose, Takashi Kobayashi, Takao TI Speaker-independent HMM-based voice conversion using adaptive quantization of the fundamental frequency SO SPEECH COMMUNICATION LA English DT Article DE Voice conversion; Hidden Markov model (HMM); HMM-based speech synthesis; Speaker-independent model; Fundamental frequency quantization; Prosody conversion ID SPEECH SYNTHESIS AB This paper describes a speaker-independent HMM-based voice conversion technique that incorporates context-dependent prosodic symbols obtained using adaptive quantization of the fundamental frequency (F0). In the HMM-based conversion of our previous study, the input utterance of a source speaker is decoded into phonetic and prosodic symbol sequences, and the converted speech is generated using the decoded information from the pre-trained target speaker's phonetically and prosodically context-dependent HMM. In our previous work, we generated the F0 symbol by quantizing the average log F0 value of each phone using the global mean and variance calculated from the training data. In the current study, these statistical parameters are obtained from each utterance itself, and this adaptive method improves on the F0 conversion performance of the conventional approach. We also introduce a speaker-independent model for decoding the input speech and model adaptation for training the target speaker's model in order to reduce the required amount of training data under a condition where the phonetic transcription is available for the input speech. Objective and subjective experimental results for Japanese speech demonstrate that the adaptive quantization method gives better F0 conversion performance than the conventional one. Moreover, our technique with only ten sentences of the target speaker's adaptation data outperforms the conventional GMM-based one using parallel data of 200 sentences. (C) 2011 Elsevier B.V. All rights reserved. C1 [Nose, Takashi; Kobayashi, Takao] Tokyo Inst Technol, Interdisciplinary Grad Sch Sci & Engn, Yokohama, Kanagawa 2268502, Japan. RP Nose, T (reprint author), Tokyo Inst Technol, Interdisciplinary Grad Sch Sci & Engn, Yokohama, Kanagawa 2268502, Japan.
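The adaptive F0 quantization summarized in the Nose and Kobayashi abstract above replaces global mean and variance statistics with statistics estimated from the current utterance before the average log-F0 of each phone is mapped to a discrete prosodic symbol. The sketch below illustrates only that substitution; the number of quantization levels, the bin edges at plus/minus one standard deviation, and the example values are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def quantize_f0(phone_logf0_means, n_levels=4, mean=None, std=None):
    """Map each phone's average log-F0 to one of `n_levels` prosodic symbols.

    If `mean`/`std` are supplied they play the role of global training-set
    statistics (the conventional scheme); otherwise they are estimated from
    the utterance itself (the adaptive scheme described in the abstract)."""
    x = np.asarray(phone_logf0_means, dtype=float)
    mu = np.mean(x) if mean is None else mean
    sigma = np.std(x) if std is None else std
    z = (x - mu) / max(sigma, 1e-6)
    edges = np.linspace(-1.0, 1.0, n_levels - 1)     # e.g. splits at -1, 0, +1 sigma
    return np.digitize(z, edges)                     # symbol index 0 .. n_levels-1

# Average log-F0 (log Hz) of each phone in one hypothetical utterance
utt = [4.7, 4.9, 5.1, 5.0, 4.6]

print(quantize_f0(utt))                          # adaptive: per-utterance statistics
print(quantize_f0(utt, mean=4.95, std=0.25))     # conventional: fixed global statistics
```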
EM takashi.nose@ip.titech.ac.jp; takao.ko-bayashi@ip.titech.ac.jp CR ABE M, 1991, INT CONF ACOUST SPEE, P765, DOI 10.1109/ICASSP.1991.150451 Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278 Imai S., 1983, IECE T A, P122 Kain A, 1998, INT CONF ACOUST SPEE, P285, DOI 10.1109/ICASSP.1998.674423 Kawahara H, 1999, SPEECH COMMUN, V27, P187, DOI 10.1016/S0167-6393(98)00085-5 MASHIMO M, 2002, P ICSLP, P293 NAKANO Y, 2006, P INTERSPEECH 2006 I, P2286 Nose T., 2010, IEICE T INF SYSTEMS, P2483 NOSE T, 2010, P 7 ISCA SPEECH SYNT, P80 NOSE T, 2010, P INT 2010, P1724 Nose Takashi, 2010, Proceedings 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2010, DOI 10.1109/ICASSP.2010.5495548 Rabiner L, 1993, FUNDAMENTALS SPEECH Shinoda K., 2000, Journal of the Acoustical Society of Japan (E), V21, DOI 10.1250/ast.21.79 Stylianou Y, 1998, IEEE T SPEECH AUDI P, V6, P131, DOI 10.1109/89.661472 STYLIANOU Y, 2009, P ICASSP 2009, P3585 Tamura M., 2001, P EUROSPEECH 2001 SE, P345 TANAKA T, 2002, P ICASSP 2002, P329 Toda T, 2007, IEEE T AUDIO SPEECH, V15, P2222, DOI 10.1109/TASL.2007.907344 Toda T, 2007, IEICE T INF SYST, VE90D, P816, DOI 10.1093/ietisy/e90-d.5.816 Tokuda K, 1998, INT CONF ACOUST SPEE, P609, DOI 10.1109/ICASSP.1998.675338 TOKUDA K, 1995, P ICASSP, P660 Tokuda K, 2002, IEICE T INF SYST, VE85D, P455 TURK O, 2009, P ICASSP 2009, P3597 TURK O, 2002, P INT C SPOK LANG PR, P289 VERMA A, 2005, ACM T SPEECH LANGUAG, V2, P1, DOI 10.1145/1075389.1075393 Ye H, 2006, IEEE T AUDIO SPEECH, V14, P1301, DOI 10.1109/TSA.2005.860839 YOKOMIZO S, 2010, P INT 2010, P430 Yoshimura T, 1999, P EUR, P2347 YUTANI K, 2009, P ICASSP 2009, P3897 Zen H, 2007, IEICE T INF SYST, VE90D, P825, DOI 10.1093/ietisy/e90-d.5.825 NR 30 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2011 VL 53 IS 7 BP 973 EP 985 DI 10.1016/j.specom.2011.05.001 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 791OZ UT WOS:000292675800002 ER PT J AU Peng, JX Bei, CX Sun, HT AF Peng, Jianxin Bei, Chengxun Sun, Haitao TI Relationship between Chinese speech intelligibility and speech transmission index in rooms based on auralization SO SPEECH COMMUNICATION LA English DT Article DE Chinese speech intelligibility; Speech transmission index; Room impulse response; Phonetically balanced test; Diotic listening; Dichotic listening ID OCTAVE-BAND WEIGHTS; RASTI-METHOD; VALIDATION; ENGLISH AB Based on simulated monaural and binaural room impulse responses, the relationship between Chinese speech intelligibility scores and speech transmission index (STI) including the effect of noise is investigated using a phonetically balanced test in virtual rooms. The results show that Chinese speech intelligibility scores increase monotonically with STI values. The correlation coefficients are 0.95, 0.90 and the standard deviation is 5.6%, 6.7% under diotic and dichotic listening conditions, respectively. Compared with diotic listening based on monaural room impulse responses, dichotic listening based on binaural room impulse responses can improve by 2.7 dB signal-to-noise ratio for Chinese speech intelligibility. The STI method can better predict and evaluate Chinese speech intelligibility in rooms. (C) 2011 Elsevier B.V. All rights reserved. C1 [Peng, Jianxin; Bei, Chengxun] S China Univ Technol, Dept Phys, Sch Sci, Guangzhou 510640, Peoples R China. 
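The Peng, Bei and Sun study above relates Chinese intelligibility scores to the speech transmission index. As a reminder of what an STI value summarizes, the sketch below walks through the usual skeleton of the computation from a matrix of modulation transfer ratios: apparent SNR per cell, clipping to plus/minus 15 dB, conversion to a transmission index, and band averaging. The equal octave-band weighting and the example MTF values are simplified placeholders; the standard and the paper use specific weights and measured room responses.

```python
import numpy as np

def sti_from_mtf(m, band_weights=None):
    """Simplified STI from modulation transfer ratios m[band, mod_freq]
    (7 octave bands x 14 modulation frequencies)."""
    m = np.clip(np.asarray(m, dtype=float), 1e-4, 1 - 1e-4)
    snr_app = 10.0 * np.log10(m / (1.0 - m))      # apparent SNR per cell
    snr_app = np.clip(snr_app, -15.0, 15.0)       # limit to +/- 15 dB
    ti = (snr_app + 15.0) / 30.0                  # transmission index in [0, 1]
    mti = ti.mean(axis=1)                         # one index per octave band
    if band_weights is None:
        band_weights = np.full(len(mti), 1.0 / len(mti))   # placeholder weighting
    return float(np.dot(band_weights, mti))

# A purely hypothetical MTF degraded by reverberation and noise
rng = np.random.default_rng(0)
mtf = np.clip(0.6 + 0.2 * rng.standard_normal((7, 14)), 0.05, 0.95)
print(round(sti_from_mtf(mtf), 2))
```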
[Peng, Jianxin; Sun, Haitao] S China Univ Technol, State Key Lab Subtrop Bldg Sci, Guangzhou 510640, Peoples R China. RP Peng, JX (reprint author), S China Univ Technol, Dept Phys, Sch Sci, Guangzhou 510640, Peoples R China. EM phjxpeng@163.com FU National Natural Science Foundation of China [10774048]; Science and Technology Planning Project of Guangdong Province, China [2008B080701020]; State Key Laboratory of Subtropical Building Science, South China University of Technology, China [2008KB32] FX The authors thank the students who participated in subjective evaluation of Chinese speech intelligibility. This work was support by National Natural Science Foundation of China (Grant No. 10774048), Science and Technology Planning Project of Guangdong Province, China (Grant No. 2008B080701020) and Opening Project of the State Key Laboratory of Subtropical Building Science, South China University of Technology, China (Grant No. 2008KB32). CR ANDERSON BW, 1987, J ACOUST SOC AM, V81, P1982, DOI 10.1121/1.394764 [Anonymous], 2003, 6026816 IEC [Anonymous], 1995, 15508 GBT BRADLEY JS, 1986, J ACOUST SOC AM, V80, P837, DOI 10.1121/1.393907 Christensen C.L., 2009, ODEON ROOM ACOUSTICS Diaz C, 1995, APPL ACOUST, V46, P363, DOI 10.1016/0003-682X(95)00016-3 HOUTGAST T, 1984, ACUSTICA, V54, P185 HOUTGAST T, 1973, ACUSTICA, V28, P66 JACOB KD, 1991, J AUDIO ENG SOC, V39, P232 Kang J, 1998, J ACOUST SOC AM, V103, P1213, DOI 10.1121/1.421253 Kruger K., 1991, Canadian Acoustics, V19 MIJIC M, 1991, ACUSTICA, V74, P143 Peng JX, 2007, SPEECH COMMUN, V49, P933, DOI 10.1016/j.specom.2007.06.001 Peng JX, 2005, APPL ACOUST, V66, P591, DOI 10.1016/j.apacoust.2004.08.006 Peng JX, 2006, ACTA ACUST UNITED AC, V92, P79 Peng JX, 2008, CHINESE SCI BULL, V53, P2748, DOI 10.1007/s11434-008-0383-5 SHAO L, 1989, P 5 ARCH PHYS BEIJ SHEN H, 1993, AUDIO ENG, P2 Steeneken HJM, 1999, SPEECH COMMUN, V28, P109, DOI 10.1016/S0167-6393(99)00007-2 STEENEKEN HJM, 1980, J ACOUST SOC AM, V67, P318, DOI 10.1121/1.384464 Steeneken HJM, 2002, SPEECH COMMUN, V38, P413, DOI 10.1016/S0167-6393(02)00010-9 Steeneken HJM, 2002, SPEECH COMMUN, V38, P399, DOI 10.1016/S0167-6393(02)00011-0 Wijngaarden S., 2008, J ACOUST SOC AM, V123, P4514 Yang W, 2007, ACTA ACUST UNITED AC, V93, P991 ZHANG JL, 1981, ACTA ACUST, V7, P237 NR 25 TC 5 Z9 8 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2011 VL 53 IS 7 BP 986 EP 990 DI 10.1016/j.specom.2011.05.004 PG 5 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 791OZ UT WOS:000292675800003 ER PT J AU Beukelman, DR Childes, J Carrell, T Funk, T Ball, LJ Pattee, GL AF Beukelman, David R. Childes, Jana Carrell, Tom Funk, Trisha Ball, Laura J. Pattee, Gary L. TI Perceived attention allocation of listeners who transcribe the speech of speakers with amyotrophic lateral sclerosis SO SPEECH COMMUNICATION LA English DT Article DE Amyotrophic lateral sclerosis; Dysarthria; Perception AB The purpose of this study was to investigate the self-perceived attention allocation of listeners as they transcribed the speech samples of speakers with mild to severe dysarthria as a result of amyotrophic lateral sclerosis. Listeners reported that their perceived attention allocation increased consistently as speech intelligibility for sentences decreased from 100% to 75%. In this study, self-perceptions of attention allocation peaked between 75% and 80% intelligibility. 
These results support the conclusion that listeners experience a considerable perceptual load as they attempt to comprehend the messages of persons whose speech has relatively high intelligibility but is distorted due to dysarthria. (C) 2011 Elsevier B.V. All rights reserved. C1 [Beukelman, David R.; Childes, Jana; Carrell, Tom; Funk, Trisha] Univ Nebraska, Barkley Mem Ctr 301, Lincoln, NE 68583 USA. [Beukelman, David R.; Pattee, Gary L.] Univ Nebraska, Med Ctr, Omaha, NE 68198 USA. [Beukelman, David R.] Madonna Rehabil Hosp, Inst Rehabil Sci & Engn, Lincoln, NE 68506 USA. [Ball, Laura J.] E Carolina Univ, Greenville, NC 27858 USA. RP Beukelman, DR (reprint author), Univ Nebraska, Barkley Mem Ctr 202, POB 830732, Lincoln, NE 68583 USA. EM dbeukelman1@unl.edu FU Barkley Trust FX This research was partially funded by the Barkley Trust. The authors wish to thank the staff and patients of the Muscular Dystrophy Clinic and ALS Clinic for their support of the Nebraska ALS Database Project. The authors report no conflicts of interest. The authors alone are responsible for the content and writing of the article. CR Ball L., 2004, AUGMENTATIVE ALTERNA, V20, P113, DOI 10.1080/0743461042000216596 Ball L., 2007, AUGMENTATIVE COMMUNI, P287 Ball LJ, 2002, J MED SPEECH-LANG PA, V10, P231 BECKER CA, 1976, J EXP PSYCHOL HUMAN, V2, P556, DOI 10.1037//0096-1523.2.4.556 BEUKELMAN D, 1979, J COMMUN DISORD, V12, P89 Broadbent D.E., 1958, PERCEPTION COMMUNICA CRANDALL J, 2007, ATTENTION ALLOCATION, P1 Duffy J.R, 2005, MOTOR SPEECH DISORDE Green M., 2000, TRANSPORTATION HUMAN, V2, P195, DOI DOI 10.1207/STHF0203_1 HENDY K, 1993, HUM FACTORS, V23, P579 Kahneman D., 1973, ATTENTION EFFORT Pisoni D. B., 1982, Speech Technology, V1 SITVER M, 1982, ASHA, V24, P783 Yorkston K., 2007, SENTENCE INTELLIGIBI Yorkston K., 2010, CLIN MANAGEMENT SPEA YORKSTON K, 1991, J MED SPEECH-LANG PA, V1, P35 Yorkston K. M., 2004, MANAGEMENT SPEECH SW NR 17 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2011 VL 53 IS 6 BP 801 EP 806 DI 10.1016/j.specom.2010.12.005 PG 6 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 767BE UT WOS:000290829500001 ER PT J AU Irwin, A Pilling, M Thomas, SM AF Irwin, Amy Pilling, Michael Thomas, Sharon M. TI An analysis of British regional accent and contextual cue effects on speechreading performance SO SPEECH COMMUNICATION LA English DT Article DE Speechreading; Accent; Speech perception; Talker speechreadability ID AUDIOVISUAL SPEECH-PERCEPTION; NORMAL-HEARING; SPOKEN WORDS; VARIABILITY; INTELLIGIBILITY; TALKER; RECOGNITION; SENTENCES; ADULTS AB The aim of this paper was to examine the effect of regional accent on speechreading accuracy and the utility of contextual cues in reducing accent effects. Study 1: Participants were recruited from Nottingham (n = 24) and Glasgow (n = 17). Their task was to speechread 240 visually presented sentences spoken by 12 talkers, half with a Glaswegian accent, half a Nottingham accent. Both participant groups found the Glaswegian talkers less intelligible (p < 0.05). A significant interaction between participant location and accent type (p < 0.05) indicated that both participant groups showed an advantage for speechreading talkers with their own accent over the opposite group. Study 2: Participants were recruited from Nottingham (n = 15).
The same visual sentences were used, but each one was presented with a contextual cue. The results showed that speechreading performance was significantly improved when a contextual cue was used (p < 0.05). However the Nottingham observers still found the Glaswegian talkers less intelligible than the Nottingham talkers (p < 0.05). The findings of this paper suggest that accent type may have an influence upon visual speech intelligibility and as such may impact upon the design, and results, of tests of speechreading ability. (C) 2011 Elsevier B.V. All rights reserved. C1 [Irwin, Amy; Pilling, Michael; Thomas, Sharon M.] Nottingham Univ Sect, MRC Inst Hearing Res, Nottingham NG7 2RD, England. RP Irwin, A (reprint author), Univ Aberdeen, Sch Psychol, Aberdeen AB24 2UB, Scotland. EM a.irwin@abdn.ac.uk CR Abercrombie D, 1967, ELEMENTS GEN PHONETI Arnold P, 2001, BRIT J PSYCHOL, V92, P339, DOI 10.1348/000712601162220 Arnold P., 1997, J DEAF STUD DEAF EDU, V2, P199 Auer ET, 1997, J ACOUST SOC AM, V102, P3704, DOI 10.1121/1.420402 Bench J, 1979, Br J Audiol, V13, P108, DOI 10.3109/03005367909078884 Bernstein LE, 2001, J SPEECH LANG HEAR R, V44, P5, DOI 10.1044/1092-4388(2001/001) Beskow J, 2004, LECT NOTES COMPUT SC, V3118, P1178 BOOTHROYD A, 1988, VOLTA REV, V90, P77 Clopper C.G., 2006, J ACOUST SOC AM, V119, P3424 Conrey B, 2006, VISION RES, V46, P3243, DOI 10.1016/j.visres.2006.03.020 COX RM, 1987, J ACOUST SOC AM, V81, P1598, DOI 10.1121/1.394512 DEMOREST ME, 1992, J SPEECH HEAR RES, V35, P876 ELLIS T, 2001, P INT C AUD VIS SPEE, P13 ERBER NP, 1992, J ACAD R, V25, P113 Ferguson SH, 2002, J ACOUST SOC AM, V112, P259, DOI 10.1121/1.1482078 Floccia C, 2006, J EXP PSYCHOL HUMAN, V32, P1276, DOI 10.1037/0096-1523.32.5.1276 Flynn MC, 1999, J SPEECH LANG HEAR R, V42, P540 Gagne J. P., 1994, J ACAD REHABIL AUDIO, V27, P135 Grant KW, 2000, J ACOUST SOC AM, V107, P1000, DOI 10.1121/1.428280 Jordan TR, 2000, PERCEPT PSYCHOPHYS, V62, P1394, DOI 10.3758/BF03212141 KRICOS PB, 1982, VOLTA REV, V84, P219 KRICOS PB, 1985, VOLTA REV, V87, P5 Labov W., 1997, LANGUAGE VARIETY S R, P508 LABOV W, 1989, CHICAGO LINGUISTIC S, V25, P171 Lander K, 2008, Q J EXP PSYCHOL, V61, P961, DOI 10.1080/17470210801908476 LANSING CR, 1995, J SPEECH HEAR RES, V38, P1377 LESNER SA, 1988, VOLTA REV, V90, P89 Lidestam B, 2001, SCAND AUDIOL, V30, P89, DOI 10.1080/010503901300112194 MACLEOD A, 1987, British Journal of Audiology, V21, P131, DOI 10.3109/03005368709077786 Massaro D. W., 1998, PERCEIVING TALKING F MASSARO DW, 1993, PERCEPT PSYCHOPHYS, V53, P549, DOI 10.3758/BF03205203 MCGURK H, 1976, NATURE, V264, P746, DOI 10.1038/264746a0 Most T, 2001, EAR HEARING, V22, P252, DOI 10.1097/00003446-200106000-00008 Munro MJ, 1995, LANG SPEECH, V38, P289 Nathan L, 1998, J CHILD LANG, V25, P343, DOI 10.1017/S0305000998003444 Nathan L, 2001, APPL PSYCHOLINGUIST, V22, P343, DOI 10.1017/S0142716401003046 OWENS E, 1985, J SPEECH HEAR RES, V28, P381 Reisberg D., 1987, HEARING EYE PSYCHOL, P97 Ronnberg J, 1999, J SPEECH LANG HEAR R, V42, P5 SHEFFERT SM, 1995, J MEM LANG, V34, P665, DOI 10.1006/jmla.1995.1030 SUMBY WH, 1954, J ACOUST SOC AM, V26, P212, DOI 10.1121/1.1907309 SUMMERFIELD Q, 1992, PHILOS T ROY SOC B, V335, P71, DOI 10.1098/rstb.1992.0009 SUMMERFIELD Q, 1987, HEARING EYE PSYCHOL SUMMERFIELD Q, 1984, Q J EXP PSYCHOL-A, V36, P51 Valentine G, 2008, SOC CULT GEOGR, V9, P469, DOI 10.1080/14649360802175691 Wells J. C., 1982, ACCENTS ENGLISH Wells J. 
C, 1982, ACCENTS ENGLISH, V1 Yakel DA, 2000, PERCEPT PSYCHOPHYS, V62, P1405, DOI 10.3758/BF03212142 NR 48 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2011 VL 53 IS 6 BP 807 EP 817 DI 10.1016/j.specom.2011.01.010 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 767BE UT WOS:000290829500002 ER PT J AU So, S Paliwal, KK AF So, Stephen Paliwal, Kuldip K. TI Modulation-domain Kalman filtering for single-channel speech enhancement SO SPEECH COMMUNICATION LA English DT Article DE Modulation domain; Kalman filtering; Speech enhancement ID SPECTRAL AMPLITUDE ESTIMATOR; NOISE; RECEPTION AB In this paper, we investigate the modulation-domain Kalman filter (MDKF) and compare its performance with other time-domain and acoustic-domain speech enhancement methods. In contrast to previously reported modulation domain-enhancement methods based on fixed bandpass filtering, the MDKF is an adaptive and linear MMSE estimator that uses models of the temporal changes of the magnitude spectrum for both speech and noise. Also, because the Kalman filter is a joint magnitude and phase spectrum estimator, under non-stationarity assumptions, it is highly suited for modulation-domain processing, as phase information has been shown to play an important role in the modulation domain. We have found that the Kalman filter is better suited for processing in the modulation-domain, rather than in the time-domain, since the low order linear predictor is sufficient at modelling the dynamics of slow changes in the modulation domain, while being insufficient at modelling the long-term correlation speech information in the time domain. As a result, the MDKF method produces enhanced speech that has very minimal distortion and residual noise, in the ideal case. The results from objective experiments and blind subjective listening tests using the NOIZEUS corpus show that the MDKF (with clean speech parameters) outperforms all the acoustic and time-domain enhancement methods that were evaluated, including the time-domain Kalman filter with clean speech parameters. A practical MDKF that uses the MMSE-STSA method to enhance noisy speech in the acoustic domain prior to LPC analysis was also evaluated and showed promising results. (C) 2011 Elsevier B.V. All rights reserved. C1 [So, Stephen; Paliwal, Kuldip K.] Griffith Univ, Signal Proc Lab, Griffith Sch Engn, Brisbane, Qld 4111, Australia. RP So, S (reprint author), Griffith Univ, Signal Proc Lab, Griffith Sch Engn, Brisbane, Qld 4111, Australia. EM s.so@griffith.edu.au; k.paliwal@griffith.edu.au RI So, Stephen/D-6649-2011 CR Arai T, 1999, J ACOUST SOC AM, V105, P2783, DOI 10.1121/1.426895 Atlas L, 2003, EURASIP J APPL SIG P, V2003, P668, DOI 10.1155/S1110865703305013 BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 Chen JD, 2006, IEEE T AUDIO SPEECH, V14, P1218, DOI 10.1109/TSA.2005.860851 DRULLMAN R, 1994, J ACOUST SOC AM, V95, P1053, DOI 10.1121/1.408467 DRULLMAN R, 1994, J ACOUST SOC AM, V95, P2670, DOI 10.1121/1.409836 EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 Falk T. 
H., 2007, P ISCA C INT SPEECH, P970 Gannot S, 1998, IEEE T SPEECH AUDI P, V6, P373, DOI 10.1109/89.701367 GIBSON JD, 1991, IEEE T SIGNAL PROCES, V39, P1732, DOI 10.1109/78.91144 Greenberg S., 2001, P 7 EUR C SPEECH COM, P473 Greenberg S., 1998, P INT C SPOK LANG PR, P2803 Hermansky H., 1995, P IEEE INT C AC SPEE, V1, P405 Kalman R.E, 1960, J BASIC ENG, V82, P35, DOI DOI 10.1115/1.3662552 KANEDERA N, 1998, P IEEE INT C AC SPEE, V2, P613, DOI 10.1109/ICASSP.1998.675339 Li C. J., 2006, THESIS AARLBORG U DE Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst Lyons J., 2008, P ISCA C INT SPEECH, P387 Mesgarani N., 2005, P IEEE INT C AC SPEE, P1105 Paliwal K, 2011, SPEECH COMMUN, V53, P327, DOI 10.1016/j.specom.2010.10.004 Paliwal K, 2010, SPEECH COMMUN, V52, P450, DOI 10.1016/j.specom.2010.02.004 Paliwal K., 1987, P IEEE INT C AC SPEE, V12, P177 Quatieri T. F., 2002, DISCRETE TIME SPEECH Rix A., 2001, P862 ITUT SAMBUR MR, 1976, IEEE T ACOUST SPEECH, V24, P488, DOI 10.1109/TASSP.1976.1162870 SO S, 2009, P IEEE INT C AC SPEE, P4405 So S, 2011, SPEECH COMMUN, V53, P355, DOI 10.1016/j.specom.2010.10.006 SORQVIST P, 1997, P IEEE INT C AC SPEE, V2, P1219 Virag N, 1999, IEEE T SPEECH AUDI P, V7, P126, DOI 10.1109/89.748118 Wiener N., 1949, EXTRAPOLATION INTERP Wu WR, 1998, IEEE T CIRCUITS-II, V45, P1072 NR 32 TC 8 Z9 8 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2011 VL 53 IS 6 BP 818 EP 829 DI 10.1016/j.specom.2011.02.001 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 767BE UT WOS:000290829500003 ER PT J AU Muller, F Mertins, A AF Mueller, Florian Mertins, Alfred TI Contextual invariant-integration features for improved speaker-independent speech recognition SO SPEECH COMMUNICATION LA English DT Article DE Speech recognition; Speaker-independency; Invariant-integration ID VOCAL-TRACT NORMALIZATION; HIDDEN MARKOV-MODELS; PATTERN-RECOGNITION; TRANSFORM AB This work presents a feature-extraction method that is based on the theory of invariant integration. The invariant-integration features are derived from an extended time period, and their computation has a very low complexity. Recognition experiments show a superior performance of the presented feature type compared to cepstral coefficients using a mel filterbank (MFCCs) or a gammatone filterbank (GTCCs) in matching as well as in mismatching training-testing conditions. Even without any speaker adaptation, the presented features yield accuracies that are larger than for MFCCs combined with vocal tract length normalization (VTLN) in matching training-test conditions. Also, it is shown that the invariant-integration features (IIFs) can be successfully combined with additional speaker-adaptation methods to further increase the accuracy. In addition to standard MFCCs also contextual MFCCs are introduced. Their performance lies between the one of MFCCs and IIFs. (C) 2011 Elsevier B.V. All rights reserved. C1 [Mueller, Florian; Mertins, Alfred] Med Univ Lubeck, Inst Signal Proc, D-23538 Lubeck, Germany. RP Muller, F (reprint author), Med Univ Lubeck, Inst Signal Proc, Ratzeburger Allee 160, D-23538 Lubeck, Germany. EM mueller@isip.uni-luebeck.de; mertins@isip.uni-luebeck.de FU German Research Foundation [ME1170/2-1] FX This work has been supported by the German Research Foundation under Grant No. ME1170/2-1. 
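Returning to the So and Paliwal record above: the modulation-domain Kalman filter treats the time trajectory of each sub-band's spectral magnitude as the quantity to be estimated. The sketch below is a deliberately minimal first-order version of that idea, a scalar Kalman filter with an AR(1) model applied to one noisy envelope trajectory; the paper uses higher-order linear predictors for both speech and noise within an analysis-modification-synthesis framework, none of which is reproduced here. All parameter values are illustrative.

```python
import numpy as np

def kalman_ar1_envelope(noisy_env, a=0.95, q=0.01, r=0.1):
    """Scalar Kalman filter for one sub-band magnitude trajectory.
    a: AR(1) coefficient, q: process-noise variance, r: observation-noise variance."""
    x_est, p_est = noisy_env[0], 1.0
    out = np.empty_like(noisy_env)
    for t, y in enumerate(noisy_env):
        # predict from the AR(1) model of the modulation-domain signal
        x_pred = a * x_est
        p_pred = a * a * p_est + q
        # correct with the observed (noisy) envelope sample
        k = p_pred / (p_pred + r)
        x_est = x_pred + k * (y - x_pred)
        p_est = (1.0 - k) * p_pred
        out[t] = x_est
    return out

# Hypothetical slowly varying clean envelope plus observation noise
t = np.arange(200)
clean = 1.0 + 0.5 * np.sin(2 * np.pi * 4 * t / 100.0)       # slow modulation
rng = np.random.default_rng(1)
noisy = clean + 0.3 * rng.standard_normal(t.size)
enhanced = kalman_ar1_envelope(noisy)
print(round(np.mean((noisy - clean) ** 2), 3),
      round(np.mean((enhanced - clean) ** 2), 3))            # filtered error is lower
```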
CR Benzeghiba M, 2007, SPEECH COMMUN, V49, P763, DOI 10.1016/j.specom.2007.02.006 Boe L.-J., 2006, P 7 INT SEM SPEECH P, P75 BURKHARDT H, 1980, IEEE T ACOUST SPEECH, V28, P517, DOI 10.1109/TASSP.1980.1163439 Burkhardt H., 2001, NONLINEAR MODEL BASE, P269 COHEN L, 1993, IEEE T SIGNAL PROCES, V41, P3275, DOI 10.1109/78.258073 DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 Deller J. R., 1993, DISCRETE TIME PROCES Ellis DPW, 2009, GAMMATONE LIKE SPECT FANG M, 1989, APPL OPTICS, V28, P1257, DOI 10.1364/AO.28.001257 Gales MJF, 1998, COMPUT SPEECH LANG, V12, P75, DOI 10.1006/csla.1998.0043 Garofolo J. S., 1993, DARPA TIMIT ACOUSTIC Gramss Tino, 1991, P IEEE WORKSH NEUR N, P289 Haeb-Umbach R., 1992, P IEEE INT C AC SPEE, V1, P13 Halberstadt A., 1998, THESIS MIT Huang X., 2001, SPOKEN LANGUAGE PROC HURWITZ A, 1897, UEBER ERZEUGUNG INVA, P71 Irino T, 2002, SPEECH COMMUN, V36, P181, DOI 10.1016/S0167-6393(00)00085-6 Ishizuka K, 2006, J ACOUST SOC AM, V120, P443, DOI 10.1121/1.2205131 Kleinschmidt M., 2002, P INT C SPOK LANG PR, P25 Kleinschmidt Michael, 2002, THESIS U OLDENBURG LEE KF, 1989, IEEE T ACOUST SPEECH, V37, P1641, DOI 10.1109/29.46546 Lee L, 1998, IEEE T SPEECH AUDI P, V6, P49, DOI 10.1109/89.650310 LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010 LEONARD RG, 1993, TIDIGITS LINGUISTIC Lohweg V, 2004, EURASIP J APPL SIG P, V2004, P1912, DOI 10.1155/S1110865704404247 MERTINS A, 2006, P IEEE INT C AC SPEE, V5, P1025 Mertins A., 2005, P 2005 IEEE AUT SPEE, P308 Monaghan J. J., 2008, J ACOUST SOC AM, V123, P3066, DOI 10.1121/1.2932824 MOORE BCJ, 1996, ACTA ACUST UNITED AC, V82, P245 Muller Florian, 2009, Proceedings of the 2009 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU 2009), DOI 10.1109/ASRU.2009.5373284 Muller F., 2009, P INT C SPOK LANG PR, P2975 Muller F, 2010, LECT NOTES ARTIF INT, V5933, P111 NOETHER E, 1916, MATH ANN, V77, P89 Patterson R. D., 2000, Journal of the Acoustical Society of Japan (E), V21, DOI 10.1250/ast.21.183 PATTERSON RD, 1992, ADV BIOSCI, V83, P429 Pitz M, 2005, IEEE T SPEECH AUDI P, V13, P930, DOI 10.1109/TSA.2005.848881 RADEMACHER J, 2006, P INT C SPOK LANG PR, P1499 REITBOEC.H, 1969, INFORM CONTROL, V15, P130, DOI 10.1016/S0019-9958(69)90387-8 Saon G., 2000, P INT C AC SPEECH SI, V2, P1129 SCHLUTER R, 2006, P INT C SPOK LANG PR, P345 SCHULZMIRBACH H, 1995, MUSTERERKENNUNG 1995, V17, P1 SCHULZMIRBACH H, 1992, P 11 INT C PATT REC, V2, P178 SCHULZMIRBACH H, 1995, TR40295018 U HAMB SENA AD, 2005, P SOUND MUS C SAL JU SIGGELKOW S, 2002, THESIS ALBERTLUDWIGS SINHA R, 2002, P IEEE INT C AC SPEE, V1 Umesh S, 1999, IEEE T SPEECH AUDI P, V7, P40, DOI 10.1109/89.736329 UMESH S, 2002, P INT C AC SPEECH SI, V1 Welling L, 2002, IEEE T SPEECH AUDI P, V10, P415, DOI 10.1109/TSA.2002.803435 Young S., 2009, HTK BOOK HTK VERSION NR 50 TC 7 Z9 7 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2011 VL 53 IS 6 BP 830 EP 841 DI 10.1016/j.specom.2011.02.002 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 767BE UT WOS:000290829500004 ER PT J AU Feng, YQ Hao, GJ Xue, SA Max, L AF Feng, Yongqiang Hao, Grace J. Xue, Steve A. 
Max, Ludo TI Detecting anticipatory effects in speech articulation by means of spectral coefficient analyses SO SPEECH COMMUNICATION LA English DT Article DE Speech production; Articulation; Anticipatory coarticulation; Acoustic analysis; Spectral coefficients ID TO-VOWEL COARTICULATION; FRICATIVE-STOP COARTICULATION; CATALAN VCV SEQUENCES; LOCUS EQUATIONS; STATISTICAL-ANALYSIS; PARKINSONS-DISEASE; MULTIPLE-SCLEROSIS; ACOUSTIC ANALYSIS; CO-ARTICULATION; CONSONANTS AB Few acoustic studies have attempted to examine anticipatory effects in the earliest part of the release of stop consonants. We investigated the ability of spectral coefficients to reveal anticipatory coarticulation in the burst and early aspiration of stops in monosyllables. Twenty American English speakers produced stop (/k,t,p/) - vowel (/ae,i,o/) stop (/k,t,p/) sequences in two phrase positions. The first four spectral coefficients (mean, standard deviation, skewness, kurtosis) were calculated for one window centered on the burst of the onset consonant and two subsequent, non-overlapping windows. All coefficients showed an influence of vowel-to-consonant anticipatory coarticulation. Which onset consonant showed the strongest vowel effect depended on the specific coefficient under consideration. A context-dependent consonant-to-consonant anticipatory effect was observed for onset /p/. Findings demonstrate that spectral coefficients can reveal subtle anticipatory adjustments as early as the burst of stop consonants. Different results for the four coefficients suggest that comprehensive spectral analyses offer advantages over other approaches. Studies using these techniques may expose previously unobserved articulatory adjustments among phonetic contexts or speaker populations. (C) 2011 Elsevier B.V. All rights reserved. C1 [Max, Ludo] Univ Washington, Dept Speech & Hearing Sci, Seattle, WA 98105 USA. [Feng, Yongqiang] Chinese Acad Sci, Inst Acoust, Beijing 100190, Peoples R China. [Hao, Grace J.] N Carolina Cent Univ, Dept Commun Disorders, Durham, NC 27707 USA. [Xue, Steve A.] Univ Hong Kong, Div Speech & Hearing Sci, Hong Kong, Hong Kong, Peoples R China. [Max, Ludo] Haskins Labs Inc, New Haven, CT 06511 USA. RP Max, L (reprint author), Univ Washington, Dept Speech & Hearing Sci, 1417 NE 42nd St, Seattle, WA 98105 USA. EM LudoMax@uw.edu FU National Institute on Deafness and Other Communication Disorders [R01DC007603] FX This work was supported by Grant R01DC007603 from the National Institute on Deafness and Other Communication Disorders (L. Max PI). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institute on Deafness and Other Communication Disorders or the National Institutes of Health. 
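The spectral coefficient analysis described in the Feng, Hao, Xue and Max abstract above characterizes each analysis window of the stop release by the first four moments of its spectrum (mean, standard deviation, skewness, kurtosis). The sketch below computes those four values for a single frame by treating the normalized power spectrum as a distribution over frequency, which is the usual formulation of spectral moments; the window placement relative to the burst and the example frame are assumptions, not the paper's settings.

```python
import numpy as np

def spectral_moments(frame, fs, nfft=512):
    """First four spectral moments of one analysis window."""
    w = np.hanning(len(frame))
    spec = np.abs(np.fft.rfft(frame * w, nfft)) ** 2
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)
    p = spec / spec.sum()                                   # normalise to a distribution
    mean = np.sum(freqs * p)                                # 1st: spectral mean (Hz)
    sd = np.sqrt(np.sum(((freqs - mean) ** 2) * p))         # 2nd: standard deviation (Hz)
    skew = np.sum(((freqs - mean) ** 3) * p) / sd ** 3      # 3rd: skewness
    kurt = np.sum(((freqs - mean) ** 4) * p) / sd ** 4 - 3  # 4th: excess kurtosis
    return mean, sd, skew, kurt

# Hypothetical 20 ms "burst" frame: decaying noise plus energy near 4 kHz
fs = 16000
rng = np.random.default_rng(2)
t = np.arange(int(0.02 * fs)) / fs
frame = rng.standard_normal(t.size) * np.exp(-t * 200) + 0.5 * np.sin(2 * np.pi * 4000 * t)
print([round(v, 2) for v in spectral_moments(frame, fs)])
```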
CR BLUMSTEIN SE, 1979, J ACOUST SOC AM, V66, P1001, DOI 10.1121/1.383319 BROAD DJ, 1970, J ACOUST SOC AM, V47, P1572, DOI 10.1121/1.1912090 Cho T, 1999, J PHONETICS, V27, P207, DOI 10.1006/jpho.1999.0094 DANILOFF R, 1968, J SPEECH HEAR RES, V11, P707 FARNETANI E, 1993, LANG SPEECH, V36, P279 FORREST K, 1988, J ACOUST SOC AM, V84, P115, DOI 10.1121/1.396977 Fowler CA, 2000, LANG SPEECH, V43, P1 GOLDMANEISLER F, 1961, LANG SPEECH, V4, P220 Gordon M., 2002, J INT PHON ASSOC, V32, P141, DOI 10.1017/S0025100302001020 Hawkins S, 2004, J PHONETICS, V32, P199, DOI 10.1016/S0095-4470(03)00031-7 HAWKINS S, 2000, SWAP, P167 Hertrich I, 1999, J SPEECH LANG HEAR R, V42, P367 HOUSE AS, 1953, J ACOUST SOC AM, V25, P105, DOI 10.1121/1.1906982 Joos M., 1948, LANGUAGE SUPPL, V24, P1, DOI DOI 10.2307/522229 Kent R. D., 1977, J PHONETICS, V15, P115 KEWLEYPORT D, 1983, J ACOUST SOC AM, V73, P322, DOI 10.1121/1.388813 Kirk RR, 1995, EXPT DESIGN PROCEDUR LISKER L, 1964, WORD, V20, P384 LOFQVIST A, 1994, PHONETICA, V51, P52 MACNEILA.PF, 1969, J ACOUST SOC AM, V45, P1217, DOI 10.1121/1.1911593 Max L, 1999, J SPEECH LANG HEAR R, V42, P261 Max L, 1998, J SPEECH LANG HEAR R, V41, P1265 MENZERATH P, 1933, KOARTIKULATION STEUC MOLL KL, 1971, J ACOUST SOC AM, V50, P678, DOI 10.1121/1.1912683 NEWELL KM, 1984, J MOTOR BEHAV, V16, P320 NITTROUER S, 1995, J ACOUST SOC AM, V97, P520, DOI 10.1121/1.412278 OHMAN SEG, 1967, J ACOUST SOC AM, V41, P310 OHMAN SEG, 1966, J ACOUST SOC AM, V39, P151 Ostry DJ, 1996, J NEUROSCI, V16, P1570 PARUSH A, 1983, J ACOUST SOC AM, V74, P1115, DOI 10.1121/1.390035 PLOMP R, 1967, J ACOUST SOC AM, V41, P707, DOI 10.1121/1.1910398 PURCELL ET, 1979, J ACOUST SOC AM, V66, P1691, DOI 10.1121/1.383641 RABINER LR, 1975, AT&T TECH J, V54, P297 RECASENS D, 1984, J PHONETICS, V12, P61 RECASENS D, 1984, J ACOUST SOC AM, V76, P1624, DOI 10.1121/1.391609 RECASENS D, 1987, J PHONETICS, V15, P299 RECASENS D, 1985, LANG SPEECH, V28, P97 Recasens D, 1997, J ACOUST SOC AM, V102, P544, DOI 10.1121/1.419727 REPP BH, 1983, J ACOUST SOC AM, V74, P420, DOI 10.1121/1.389835 REPP BH, 1986, J ACOUST SOC AM, V79, P1616, DOI 10.1121/1.393298 REPP BH, 1982, J ACOUST SOC AM, V71, P1562, DOI 10.1121/1.387810 REPP BH, 1981, J ACOUST SOC AM, V69, P1154, DOI 10.1121/1.385695 SERENO JA, 1987, J ACOUST SOC AM, V81, P512, DOI 10.1121/1.394917 SIREN KA, 1995, J SPEECH HEAR RES, V38, P351 SKIPPER JK, 1972, STAT ISSUES READER B, P141 Stevens K.N., 1998, ACOUSTIC PHONETICS STEVENS KN, 1978, J ACOUST SOC AM, V64, P1358, DOI 10.1121/1.382102 STEVENS KN, 1963, J SPEECH HEAR RES, V6, P111 Sussman HM, 1997, J ACOUST SOC AM, V101, P2826, DOI 10.1121/1.418567 SUSSMAN HM, 1991, J ACOUST SOC AM, V90, P1309, DOI 10.1121/1.401923 SUSSMAN HM, 1992, J SPEECH HEAR RES, V35, P769 SUSSMAN HM, 1994, PHONETICA, V51, P119 Tjaden K, 2000, J SPEECH LANG HEAR R, V43, P1466 Tjaden K, 2005, J SPEECH LANG HEAR R, V48, P261, DOI 10.1044/1092-4388(2005/018) Tjaden K, 2003, J SPEECH LANG HEAR R, V46, P990, DOI 10.1044/1092-4388(2003/077) WHALEN DH, 1990, J PHONETICS, V18, P3 WINITZ H, 1972, J ACOUST SOC AM, V51, P1309, DOI 10.1121/1.1912976 NR 57 TC 1 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD JUL PY 2011 VL 53 IS 6 BP 842 EP 854 DI 10.1016/j.specom.2011.02.003 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 767BE UT WOS:000290829500005 ER PT J AU Drugman, T Bozkurt, B Dutoit, T AF Drugman, Thomas Bozkurt, Bans Dutoit, Thierry TI Causal-anticausal decomposition of speech using complex cepstrum for glottal source estimation SO SPEECH COMMUNICATION LA English DT Article DE Complex cepstrum; Homomorphic analysis; Glottal source estimation; Source-tract separation ID VOICED SPEECH; FLOW; PERCEPTION; SIGNALS AB Complex cepstrum is known in the literature for linearly separating causal and anticausal components. Relying on advances achieved by the Zeros of the Z-Transform (ZZT) technique, we here investigate the possibility of using complex cepstrum for glottal flow estimation on a large-scale database. Via a systematic study of the windowing effects on the deconvolution quality, we show that the complex cepstrum causal-anticausal decomposition can be effectively used for glottal flow estimation when specific windowing criteria are met. It is also shown that this complex cepstral decomposition gives similar glottal estimates as obtained with the ZZT method. However, as complex cepstrum uses FFT operations instead of requiring the factoring of high-degree polynomials, the method benefits from a much higher speed. Finally in our tests on a large corpus of real expressive speech, we show that the proposed method has the potential to be used for voice quality analysis. (C) 2011 Elsevier B.V. All rights reserved. C1 [Drugman, Thomas; Dutoit, Thierry] Univ Mons, TCTS Lab, B-7000 Mons, Belgium. [Bozkurt, Bans] Izmir Inst Technol, Dept Elect & Elect Engn, Izmir, Turkey. RP Drugman, T (reprint author), Univ Mons, TCTS Lab, B-7000 Mons, Belgium. EM thomas.drugman@umons.ac.be FU Fonds National de la Recherche Scientifique (FNRS); Scientific and Technological Research Council of Turkey (TUBITAK) FX Thomas Drugman is supported by the Fonds National de la Recherche Scientifique (FNRS). Bans Bozkurt is supported by the Scientific and Technological Research Council of Turkey (TUBITAK). The authors also would like to thank N. Henrich and B. Doval for providing us the speech recording used to create Fig. 10 and M. Schroeder for the De7 database (Schroeder and Grice, 2003) used in the second experiment on real speech. Authors also would like to thank reviewers for their fruitful feedback. CR ALKU P, 1992, SPEECH COMMUN, V11, P109, DOI 10.1016/0167-6393(92)90005-R Alku P, 2002, J ACOUST SOC AM, V112, P701, DOI 10.1121/1.1490365 ALKU P, 1994, 3 INT C SPOK LANG PR, P1619 Alku P, 2009, J ACOUST SOC AM, V125, P3289, DOI 10.1121/1.3095801 Ananthapadmanabha T. V., 1982, SPEECH COMMUN, V1, P167, DOI 10.1016/0167-6393(82)90015-2 [Anonymous], SNACK SOUND TOOLKIT Bozkurt B, 2007, SPEECH COMMUN, V49, P159, DOI 10.1016/j.specom.2006.12.004 BOZKURT B, 2005, IEEE SIGNAL PROCESS, V12 Bozkurt B., 2003, VOQUAL 03, P21 BOZKURT B, 2004, P INT Childers D. 
G., 1999, SPEECH PROCESSING SY CHILDERS DG, 1991, J ACOUST SOC AM, V90, P2394, DOI 10.1121/1.402044 DALESSANDRO C, 2008, LNCS, V4885, P1 Deng HQ, 2006, IEEE T AUDIO SPEECH, V14, P445, DOI 10.1109/TSA.2005.857811 Doval B., 2003, P ISCA ITRW VOQUAL G, P15 Doval B, 2006, ACTA ACUST UNITED AC, V92, P1026 Drugman T., 2009, P INT Drugman T., 2009, ISCA WORKSH NONL SPE FANT G, 1997, 4 PARAMETER MODEL GL, P1 Fant G., 1995, STL QPSR, V36, P119 Fant Gunnar, 1985, STL QPSR, V4, P1 HANSON HM, 1995, P IEEE ICASSP, P772 KLATT DH, 1990, J ACOUST SOC AM, V87, P820, DOI 10.1121/1.398894 Naylor PA, 2007, IEEE T AUDIO SPEECH, V15, P34, DOI 10.1109/TASL.2006.876878 NORDEN F, 2001, ICASSP, V2, P717, DOI DOI 10.1109/TASL.2006.876878 Oppenheim A., 1989, DISCRETE TIME SIGNAL Oppenheim A.V., 1983, SIGNALS SYSTEMS Paliwal KK, 1993, IEEE T SPEECH AUDI P, V1, P3, DOI 10.1109/89.221363 PEDERSEN C, 2009, P INT Plumpe MD, 1999, IEEE T SPEECH AUDI P, V7, P569, DOI 10.1109/89.784109 QUATIERI TF, 1979, IEEE T ACOUST SPEECH, V27, P328, DOI 10.1109/TASSP.1979.1163252 QUATIERI T, 2002, DISCRETE TIME SIGNAL, pCH6 Schroder M., 2003, P 15 INT C PHON SCI, P2589 SITTON G, 2003, IEEE SIGNAL PROCESS, P27 STEIGLITZ K, 1982, IEEE T ACOUST SPEECH, V30, P984, DOI 10.1109/TASSP.1982.1163975 STEIGLITZ K, 1977, P ICASSP77, V2, P723 STURMEL N, 2007, P INT TITZE IR, 1992, J ACOUST SOC AM, V91, P2936, DOI 10.1121/1.402929 TRIBOLET J, 1977, P ICASSP77, V2, P716 VEENEMAN DE, 1985, IEEE T ACOUST SPEECH, V33, P369, DOI 10.1109/TASSP.1985.1164544 VERHELST W, 1986, IEEE T ACOUST SPEECH, V34, P43, DOI 10.1109/TASSP.1986.1164787 WALKER J, 2007, IEEE T ACOUST SPEECH, P1, DOI 10.1049/ic:20070799 WANGRAE J, 2005, SCIENCE WANGRAE J, 2005, LECT NOTES COMPUTER NR 44 TC 13 Z9 13 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2011 VL 53 IS 6 BP 855 EP 866 DI 10.1016/j.specom.2011.02.004 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 767BE UT WOS:000290829500006 ER PT J AU Goberman, AM Hughes, S Haydock, T AF Goberman, Alexander M. Hughes, Stephanie Haydock, Todd TI Acoustic characteristics of public speaking: Anxiety and practice effects SO SPEECH COMMUNICATION LA English DT Article DE Acoustic analysis; Public speaking; Anxiety; Practice; Illusion of transparency ID COMMUNICATION APPREHENSION; SPEECH ANXIETY; TRANSPARENCY; PERFORMANCE; ILLUSION; OTHERS; FEAR; DISTURBANCES; ASSESSMENTS; SELF AB This study describes the relationship between acoustic characteristics, self-ratings, and listener-ratings of public speaking. The specific purpose of this study was to examine the effects of anxiety and practice on speech and voice during public speaking. Further examination of the data was completed to examine the illusion of transparency, which hypothesizes that public speakers think their anxiety is more noticeable to listeners than it really is. Self-rating and acoustic speech data were reported on two separate speeches produced by 16 college-aged individuals completing coursework in interpersonal communication. Results indicated that there were significant relationships between acoustic characteristics of speech and both self- and listener-ratings of anxiety in public speaking. However, self-ratings of anxiety were higher than listener ratings, indicating possible confirmation of the illusion of transparency. 
Finally, data indicate that practice patterns have a significant effect on the fluency characteristics of public speaking performance, as speakers who started practicing earlier were less disfluent than those who started later. Data are also discussed relative to rehabilitation for individuals with communication disorders that can be associated with public speaking anxiety. (C) 2011 Elsevier B.V. All rights reserved. C1 [Goberman, Alexander M.] Bowling Green State Univ, Dept Commun Sci & Disorders, Bowling Green, OH 43403 USA. [Hughes, Stephanie] Governors State Univ, Conununicat Disorders Dept, University Pk, IL 60484 USA. [Haydock, Todd] Encore Rehabil Serv, Whitehouse, OH 43571 USA. RP Goberman, AM (reprint author), Bowling Green State Univ, Dept Commun Sci & Disorders, 200 Hlth Ctr Bldg, Bowling Green, OH 43403 USA. EM goberma@bgsu.edu; s-hughes@-govst.edu; THaydock@encorerehabilitation.com CR BALDWIN HJ, 1983, PATIENT COUNS COMMUN, V2, P8 BLOTE AW, 2009, J ANXIETY DISORD, V23, P5 Bodie G. D., 2010, COMMUN EDUC, V59, P70, DOI DOI 10.1080/03634520903443849 Boersma P., 2007, PRAAT DOING PHONETIC Bortfeld H, 2001, LANG SPEECH, V44, P123 BRANIGAN HP, 1999, P 14 C PHON SCI SAN Breakey Lisa K., 2005, Seminars in Speech and Language, V26, P107, DOI 10.1055/s-2005-871206 Cho YR, 2004, BEHAV RES THER, V42, P13, DOI 10.1016/S0005-7967(03)00067-6 Christenfeld N, 1996, J PERS SOC PSYCHOL, V70, P451 DALY JA, 1975, J COUNS PSYCHOL, V22, P309, DOI 10.1037/h0076748 DRISKELL JE, 1994, J APPL PSYCHOL, V79, P481, DOI 10.1037/0021-9010.79.4.481 Ezrati-Vinacour R, 2004, J FLUENCY DISORD, V29, P135, DOI 10.1016/j.jfludis.2004.02.003 Gilovich T, 1999, CURR DIR PSYCHOL SCI, V8, P165, DOI 10.1111/1467-8721.00039 Gilovich T, 1998, J PERS SOC PSYCHOL, V75, P332, DOI 10.1037/0022-3514.75.2.332 Gobl C, 2003, SPEECH COMMUN, V40, P189, DOI 10.1016/S0167-6393(02)00082-1 GOLDMANEISLER F, 1968, PSYCHOLINGUSITCS EXP Guitar B, 2006, STUTTERING INTEGRATE GUNEY F, 2010, J CLIN NEUROSCI, V16, P1311 Hagenaars MA, 2005, J ANXIETY DISORD, V19, P521, DOI 10.1016/j.janxdix.2004.04.008 Hancock AB, 2010, J VOICE, V24, P302, DOI 10.1016/j.jvoice.2008.09.007 Harris SR, 2002, CYBERPSYCHOL BEHAV, V5, P543, DOI 10.1089/109493102321018187 HEISHMAN SJ, 2010, PSYCHOPHARMACOLOGY, V2, P453 Hofmann SG, 1997, J ANXIETY DISORD, V11, P573, DOI 10.1016/S0887-6185(97)00040-6 Lee JM, 2002, CYBERPSYCHOL BEHAV, V5, P191, DOI 10.1089/109493102760147169 MAHL GF, 1956, J ABNORM SOC PSYCH, V53, P1, DOI 10.1037/h0047552 Mansell W, 1999, BEHAV RES THER, V37, P419, DOI 10.1016/S0005-7967(98)00148-X McCroskey J. C., 1976, W SPEECH COMMUNICATI, V40, P14, DOI 10.1080/10570317609373881 McCroskey J. C., 1989, COMMUNICATION Q, V37, P100 McCroskey J. 
C., 1977, HUMAN COMMUNICATION, V4, P78, DOI DOI 10.1111/J.1468-2958.1977.TB00599.X MCCROSKEY JC, 1970, SPEECH MONOGR, V37, P269 McCroskey J.C., 1975, HUMAN COMMUNICATION, V2, P51, DOI 10.1111/j.1468-2958.1975.tb00468.x McCroskey J.C., 1976, HUMAN COMMUNICATION, V2, P376, DOI 10.1111/j.1468-2958.1976.tb00498.x MCCROSKEY JC, 1976, FLA SPEECH COMMUN J, V4, P1 McCroskey JC, 1976, HUMAN COMMUNICATION, V3, P73, DOI 10.1111/j.1468-2958.1976.tb00506.x Merritt L, 2001, J VOICE, V15, P257, DOI 10.1016/S0892-1997(01)00026-1 Protopapas A, 1997, J ACOUST SOC AM, V101, P2267, DOI 10.1121/1.418247 RAPEE RM, 1992, J ABNORM PSYCHOL, V101, P728, DOI 10.1037/0021-843X.101.4.728 ROCHESTE.SR, 1973, J PSYCHOLINGUIST RES, V2, P51, DOI 10.1007/BF01067111 Ruiz R, 1996, SPEECH COMMUN, V20, P111, DOI 10.1016/S0167-6393(96)00048-9 Savitsky K, 2003, J EXP SOC PSYCHOL, V39, P618, DOI 10.1016/S0022-1031(03)00056-8 SCOTT MD, 1978, J COMMUN, V28, P104, DOI 10.1111/j.1460-2466.1978.tb01571.x Slater M, 2006, CYBERPSYCHOL BEHAV, V9, P627, DOI 10.1089/cpb.2006.9.627 NR 42 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2011 VL 53 IS 6 BP 867 EP 876 DI 10.1016/j.specom.2011.02.005 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 767BE UT WOS:000290829500007 ER PT J AU Stephens, JDW Holt, LL AF Stephens, Joseph D. W. Holt, Lori L. TI A standard set of American-English voiced stop-consonant stimuli from morphed natural speech SO SPEECH COMMUNICATION LA English DT Article DE Speech stimuli; Consonants; Linear predictive coding ID PHONETIC CATEGORIZATION; PRECEDING LIQUID; PERCEPTION; COARTICULATION; INVARIANCE; PLACE; IDENTIFICATION; COMPENSATION; ARTICULATION; LANGUAGE AB Linear predictive coding (LPC) analysis was used to create morphed natural tokens of English voiced stop consonants ranging from /b/ to /d/ and /d/ to /g/ in four vowel contexts (/i/, /ae/, /a/, /u/). Both vowel consonant vowel (VCV) and consonant vowel (CV) stimuli were created. A total of 320 natural-sounding acoustic speech stimuli were created, comprising 16 stimulus series. A behavioral experiment demonstrated that the stimuli varied perceptually from /b/ to /d/ to /g/, and provided useful reference data for the ambiguity of each token. Acoustic analyses indicated that the stimuli compared favorably to standard characteristics of naturally-produced consonants, and that the LPC morphing procedure successfully modulated multiple acoustic parameters associated with place of articulation. The entire set of stimuli is freely available on the Internet (http://www.psy.cmu.edu/similar to lholt/php/StephensHoltStimuli.php) for use in research applications. (C) 2011 Elsevier B.V. All rights reserved. C1 Carnegie Mellon Univ, Dept Psychol, Pittsburgh, PA 15213 USA. Carnegie Mellon Univ, Ctr Neural Basis Cognit, Pittsburgh, PA 15213 USA. RP Stephens, JDW (reprint author), N Carolina Agr & Tech State Univ, Dept Psychol, 1601 E Market St, Greensboro, NC 27411 USA. EM jdstephe@ncat.edu; lholt@andrew.cmu.edu FU National Institute on Deafness and Other Communication Disorders [1 F31 DC007284-01]; National Science Foundation [BCS-0345773]; Center for the Neural Basis of Cognition FX The authors thank Christi Adams Gomez and Tony Kelly for help with data collection and preparation of online materials, respectively. 
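The Stephens and Holt abstract above describes using LPC analysis to morph natural stop-consonant tokens between category endpoints. The exact procedure is not given in the abstract, so the sketch below shows one simple stand-in for the idea: estimate autocorrelation sequences for two endpoint frames, interpolate them, and re-derive the all-pole (LPC) envelope at intermediate morph values. The synthetic endpoint frames, peak frequencies and model order are placeholders.

```python
import numpy as np

def autocorr(frame, order):
    """Autocorrelation sequence r[0..order] of a Hamming-windowed frame."""
    x = frame * np.hamming(len(frame))
    return np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]

def lpc_envelope_db(r, order=12, nfft=512):
    """All-pole envelope (dB) from an autocorrelation sequence via the normal equations."""
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R + 1e-6 * r[0] * np.eye(order), r[1:order + 1])
    A = np.fft.rfft(np.concatenate(([1.0], -a)), nfft)      # A(z) = 1 - sum a_k z^-k
    return -20.0 * np.log10(np.abs(A) + 1e-12)

def morphed_envelope_db(frame_b, frame_d, alpha, order=12):
    """Interpolate the endpoints' autocorrelations, then refit the envelope."""
    r = (1.0 - alpha) * autocorr(frame_b, order) + alpha * autocorr(frame_d, order)
    return lpc_envelope_db(r, order)

# Hypothetical endpoint frames: shared 700 Hz peak, differing 1100 Hz vs 1700 Hz peak
fs, n = 10000, 256
rng = np.random.default_rng(4)
t = np.arange(n) / fs
frame_b = 0.6 * np.sin(2 * np.pi * 700 * t) + np.sin(2 * np.pi * 1100 * t) + 0.05 * rng.standard_normal(n)
frame_d = 0.6 * np.sin(2 * np.pi * 700 * t) + np.sin(2 * np.pi * 1700 * t) + 0.05 * rng.standard_normal(n)
freqs = np.fft.rfftfreq(512, d=1.0 / fs)
i1100, i1700 = np.argmin(np.abs(freqs - 1100)), np.argmin(np.abs(freqs - 1700))
for alpha in (0.0, 0.5, 1.0):
    env = morphed_envelope_db(frame_b, frame_d, alpha)
    print(alpha, round(env[i1100], 1), round(env[i1700], 1))  # 1100 Hz peak fades, 1700 Hz grows
```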
This work was supported by a National Research Service Award (1 F31 DC007284-01) from the National Institute on Deafness and Other Communication Disorders to J.D.W.S., by a grant from the National Science Foundation (BCS-0345773) to L.L.H., and by the Center for the Neural Basis of Cognition. CR ATAL BS, 1971, J ACOUST SOC AM, V50, P637, DOI 10.1121/1.1912679 BLUMSTEIN SE, 1979, J ACOUST SOC AM, V66, P1001, DOI 10.1121/1.383319 BLUMSTEIN SE, 1980, J ACOUST SOC AM, V67, P648, DOI 10.1121/1.383890 Boersma P., 2001, GLOT INT, V5, P341 ELMAN JL, 1988, J MEM LANG, V27, P143, DOI 10.1016/0749-596X(88)90071-X Engstrand O., 2000, P 13 SWED PHON C FON, P53 Fant G., 1960, ACOUSTIC THEORY SPEE GANONG WF, 1980, J EXP PSYCHOL HUMAN, V6, P110, DOI 10.1037/0096-1523.6.1.110 HOLT LL, 1999, THESIS Holt LL, 2005, PSYCHOL SCI, V16, P305, DOI 10.1111/j.0956-7976.2005.01532.x Holt LL, 2002, HEARING RES, V167, P156, DOI 10.1016/S0378-5955(02)00383-0 KEWLEYPORT D, 1982, J ACOUST SOC AM, V72, P379, DOI 10.1121/1.388081 KLATT DH, 1980, J ACOUST SOC AM, V67, P971, DOI 10.1121/1.383940 KLATT DH, 1990, J ACOUST SOC AM, V87, P820, DOI 10.1121/1.398894 Krull D., 1988, PHONETIC EXPT RES I, VVII, P66 LIBERMAN AM, 1957, J EXP PSYCHOL, V54, P358, DOI 10.1037/h0044417 Lindblom B., 1963, 29 ROYAL I TECHN SPE LISKER L, 1964, WORD, V20, P384 Lotto AJ, 2006, PERCEPT PSYCHOPHYS, V68, P178, DOI 10.3758/BF03193667 Lotto AJ, 1998, PERCEPT PSYCHOPHYS, V60, P602, DOI 10.3758/BF03206049 MANN VA, 1980, PERCEPT PSYCHOPHYS, V28, P407, DOI 10.3758/BF03204884 Markel JD, 1976, LINEAR PREDICTION SP Massaro D. W., 1998, PERCEIVING TALKING F Massaro D. W., 1987, SPEECH PERCEPTION EA McCandliss BD, 2002, COGN AFFECT BEHAV NE, V2, P89, DOI 10.3758/CABN.2.2.89 MCQUEEN JM, 1991, J EXP PSYCHOL HUMAN, V17, P433, DOI 10.1037/0096-1523.17.2.433 Newman RS, 1997, J EXP PSYCHOL HUMAN, V23, P873, DOI 10.1037/0096-1523.23.3.873 OHMAN SEG, 1966, J ACOUST SOC AM, V39, P151 Pfitzinger H. R., 2004, P 10 AUSTR INT C SPE, P545 Pitt MA, 1998, J MEM LANG, V39, P347, DOI 10.1006/jmla.1998.2571 SLANEY M, 1996, P 1996 IEEE ICASSP A, P1001 Stephens JDW, 2010, J ACOUST SOC AM, V128, P2138, DOI 10.1121/1.3479537 Stevens K.N., 1998, ACOUSTIC PHONETICS STEVENS KN, 1978, J ACOUST SOC AM, V64, P1358, DOI 10.1121/1.382102 Sussman HM, 1997, J ACOUST SOC AM, V101, P2826, DOI 10.1121/1.418567 SUSSMAN HM, 1991, J ACOUST SOC AM, V90, P1309, DOI 10.1121/1.401923 NR 36 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2011 VL 53 IS 6 BP 877 EP 888 DI 10.1016/j.specom.2011.02.007 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 767BE UT WOS:000290829500008 ER PT J AU Akdemir, E Ciloglu, T AF Akdemir, Eren Ciloglu, Tolga TI Bimodal automatic speech segmentation based on audio and visual information fusion SO SPEECH COMMUNICATION LA English DT Article DE Automatic speech segmentation; Audiovisual; Lip motion; Text-to-speech ID SENSORY INTEGRATION; AUDIOVISUAL SPEECH; RECOGNITION; MODELS; PERCEPTION AB Bimodal automatic speech segmentation using visual information together with audio data is introduced. The accuracy of automatic segmentation directly affects the quality of speech processing systems using the segmented database. The collaboration of audio and visual data results in lower average absolute boundary error between the manual segmentation and automatic segmentation results. 
The information from two modalities are fused at the feature level and used in a HMM based speech segmentation system. A Turkish audiovisual speech database has been prepared and used in the experiments. The average absolute boundary error decreases up to 18% by using different audiovisual feature vectors. The benefits of incorporating visual information are discussed for different phoneme boundary types. Each audiovisual feature vector results in a different performance at different types of phoneme boundaries. The average absolute boundary error decreases by approximately 25% by using audiovisual feature vectors selectively for different boundary classes. Visual data is collected using an ordinary webcam. The proposed method is very convenient to be used in practice. (C) 2011 Elsevier B.V. All rights reserved. C1 [Akdemir, Eren; Ciloglu, Tolga] Middle E Tech Univ, Elect & Elect Engn Dept, TR-06531 Ankara, Turkey. RP Akdemir, E (reprint author), Middle E Tech Univ, Elect & Elect Engn Dept, TR-06531 Ankara, Turkey. EM erenakdemir@gmail.com FU Scientific and Technological Research Council of Turkey (TUBITAK) [107e101] FX This work is supported by Scientific and Technological Research Council of Turkey (TUBITAK) Project no: 107e101. CR Akdemir E, 2008, SPEECH COMMUN, V50, P594, DOI 10.1016/j.specom.2008.04.005 Bayati M., 2006, INF THEOR 2006 IEEE, V2006, P557 Bonafonte A., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607841 BRUGNARA F, 1993, SPEECH COMMUN, V12, P357, DOI 10.1016/0167-6393(93)90083-W Chen T, 1998, P IEEE, V86, P837 Cosi P., 1991, P EUROSPEECH 91, P693 Dodd B., 1998, HEARING EYE ENGWALL A, 2003, 6 INT SEM SPEECH PRO, P43 Hall DL, 1997, P IEEE, V85, P6, DOI 10.1109/5.554205 ITAKURA F, 1975, J ACOUST SOC AM, V57, pS35, DOI 10.1121/1.1995189 Jarifi S., 2008, SPEECH COMMUN, V50, P67, DOI 10.1016/j.specom.2007.07.001 Kawai H., 2004, P IEEE INT C AC SPEE Kaynak MN, 2004, IEEE T SYST MAN CY A, V34, P564, DOI 10.1109/TSMCA.2004.826274 MAK MW, 1994, SPEECH COMMUN, V14, P279, DOI 10.1016/0167-6393(94)90067-1 MAKASHAY MJ, 2000, P INT C SPOK LANG PR, P431 Malfrere F, 2003, SPEECH COMMUN, V40, P503, DOI 10.1016/S0167-6393(02)00131-0 Massaro DW, 1998, AM SCI, V86, P236, DOI 10.1511/1998.25.861 Munhall KG, 2004, PSYCHOL SCI, V15, P133, DOI 10.1111/j.0963-7214.2004.01502010.x MUNHALL KG, 1995, J ACOUST SOC AM, V98, P1222, DOI 10.1121/1.413621 NETI C, 2000, WORKSH 2000 FIN REP Park SS, 2007, IEEE T AUDIO SPEECH, V15, P2202, DOI 10.1109/TASL.2007.903933 Potamianos G, 2003, P IEEE, V91, P1306, DOI 10.1109/JPROC.2003.817150 Prasad VK, 2004, SPEECH COMMUN, V42, P429, DOI 10.1016/j.specom.2003.12.002 Smeele PMT, 1998, J EXP PSYCHOL HUMAN, V24, P1232, DOI 10.1037//0096-1523.24.4.1232 STORK DG, 1996, P 2 INT C AUT FAC GE, V16, P14 SUMBY WH, 1954, J ACOUST SOC AM, V26, P212, DOI 10.1121/1.1907309 SUMMERFIELD AQ, 1989, HDB RES FACE PROCESS Summerfield Q, 1987, HEARING EYE PSYCHOL, P3 Toledano DT, 2003, IEEE T SPEECH AUDI P, V11, P617, DOI 10.1109/TSA.2003.813579 VEPA J, 2003, P 8 EUR C SPEECH COM, P293 Wells J., 1997, HDB STANDARDS RE 4 B WOUTERS J, 1998, P ICSLP, V6, P2747, DOI DOI 10.1109/ICASSP.2001.941045 Young S., 2002, HTK BOOK HTK VERSION YUHAS BP, 1990, P IEEE, V78, P1658, DOI 10.1109/5.58349 Zhang J, 1997, P IEEE, V85, P1423 NR 35 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
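Feature-level fusion, as used in the Akdemir and Ciloglu segmentation system above, amounts to building one observation vector per audio frame that contains both acoustic and visual measurements; because video is sampled more slowly than the acoustic frame rate, the visual stream has to be brought to the audio rate first. The sketch below shows that step with linear interpolation followed by concatenation. The frame rates and feature dimensionalities are plausible placeholders, not the paper's configuration.

```python
import numpy as np

def fuse_features(audio_feats, audio_rate, visual_feats, visual_rate):
    """Feature-level fusion: resample the (slower) visual stream to the audio
    frame rate by linear interpolation and concatenate per frame.
    Feature contents (e.g. MFCCs, lip-opening measures) are placeholders."""
    t_audio = np.arange(audio_feats.shape[0]) / audio_rate
    t_visual = np.arange(visual_feats.shape[0]) / visual_rate
    visual_up = np.column_stack([
        np.interp(t_audio, t_visual, visual_feats[:, d])
        for d in range(visual_feats.shape[1])
    ])
    return np.hstack([audio_feats, visual_up])       # one fused vector per audio frame

# Hypothetical streams: 100 Hz acoustic frames (13-dim), 25 Hz webcam lip features (4-dim)
rng = np.random.default_rng(3)
audio = rng.standard_normal((300, 13))               # 3 s of 10 ms frames
visual = rng.standard_normal((75, 4))                # 3 s of 40 ms video frames
fused = fuse_features(audio, 100.0, visual, 25.0)
print(fused.shape)                                   # (300, 17) -> input to the HMM aligner
```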
PD JUL PY 2011 VL 53 IS 6 BP 889 EP 902 DI 10.1016/j.specom.2011.03.001 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 767BE UT WOS:000290829500009 ER PT J AU Prendergast, G Johnson, SR Green, GGR AF Prendergast, Garreth Johnson, Sam R. Green, Gary G. R. TI Extracting amplitude modulations from speech in the time domain SO SPEECH COMMUNICATION LA English DT Article DE Speech; Amplitude modulation; Vocoder; Intelligibility ID AUDITORY-SYSTEM; QUANTITATIVE MODEL; NATURAL SOUNDS; FREQUENCY; INTELLIGIBILITY; RECOGNITION; RECEPTION; CUES AB Natural sounds can be characterised by patterns of changes in loudness (amplitude modulations), and human speech perception studies have focused on the low frequencies contained in the gross temporal structure of speech. Low-pass filtering the temporal envelopes of sub-band filtered speech maintains intelligibility, but it remains unclear how the human auditory system could perform such a modulation domain analysis or even if it does so at all. It is difficult to further manipulate amplitude modulations through frequency-domain filtering to investigate cues the system may use. The current work focuses on a time-domain decomposition of filter output envelopes into pulses of amplitude modulation. The technique demonstrates that signals low-pass filtered in the modulation domain maintain bursts of energy which are comparable to those that can be extracted entirely within the time-domain. This paper presents preliminary work that suggests a time-domain approach, which focuses on the instantaneous features of transient changes in loudness, can be used to study the content of human speech. This approach should be pursued as it allows human speech intelligibility mechanisms to be investigated from a new perspective. (C) 2011 Elsevier B.V. All rights reserved. C1 [Prendergast, Garreth; Johnson, Sam R.; Green, Gary G. R.] Univ York, York Neuroimaging Ctr, York YO10 5DG, N Yorkshire, England. [Prendergast, Garreth; Green, Gary G. R.] Univ York, Hull York Med Sch, York YO10 5DG, N Yorkshire, England. RP Prendergast, G (reprint author), Univ York, York Neuroimaging Ctr, York YO10 5DG, N Yorkshire, England. 
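The Prendergast, Johnson and Green abstract above starts from the classical observation that low-pass-filtered temporal envelopes of sub-band filtered speech preserve intelligibility, and then proposes a time-domain pulse decomposition of those envelopes. The sketch below illustrates only the conventional starting point, extracting one channel's envelope and low-pass filtering it in the modulation domain; it is not the authors' pulse decomposition. The band edges and modulation cut-off are illustrative choices.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def subband_envelope(x, fs, band=(1000.0, 2000.0), env_cutoff=16.0):
    """One channel's temporal envelope: band-pass the signal, take the
    Hilbert-envelope magnitude, then low-pass in the modulation domain."""
    b, a = butter(4, [band[0] / (fs / 2), band[1] / (fs / 2)], btype="band")
    sub = filtfilt(b, a, x)
    env = np.abs(hilbert(sub))                            # instantaneous amplitude
    b_lp, a_lp = butter(2, env_cutoff / (fs / 2), btype="low")
    return filtfilt(b_lp, a_lp, env)                      # smoothed modulation pattern

# Hypothetical input: a 1.5 kHz carrier whose loudness fluctuates at 4 Hz
fs = 16000
t = np.arange(int(0.5 * fs)) / fs
x = (1.0 + 0.8 * np.sin(2 * np.pi * 4 * t)) * np.sin(2 * np.pi * 1500 * t)
env = subband_envelope(x, fs)
print(round(env.min(), 2), round(env.max(), 2))           # roughly 0.2 .. 1.8
```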
EM garreth.prendergast@ynic.york.ac.uk RI Green, Gary/D-3543-2009 CR BACON SP, 1989, J ACOUST SOC AM, V85, P2575, DOI 10.1121/1.397751 CAPRANICA R R, 1972, Physiologist, V15, P55 Chi TS, 1999, J ACOUST SOC AM, V106, P2719, DOI 10.1121/1.428100 Clark P, 2009, IEEE T SIGNAL PROCES, V57, P4323, DOI 10.1109/TSP.2009.2025107 Dau T, 1996, J ACOUST SOC AM, V99, P3615, DOI 10.1121/1.414959 Dau T, 1996, J ACOUST SOC AM, V99, P3623, DOI 10.1121/1.414960 DRULLMAN R, 1994, J ACOUST SOC AM, V95, P1053, DOI 10.1121/1.408467 DRULLMAN R, 1994, J ACOUST SOC AM, V95, P2670, DOI 10.1121/1.409836 Dudley H, 1939, J ACOUST SOC AM, V11, P169, DOI 10.1121/1.1916020 Elliott TM, 2009, PLOS COMPUT BIOL, V5, DOI 10.1371/journal.pcbi.1000302 GLASBERG BR, 1990, HEARING RES, V47, P103, DOI 10.1016/0378-5955(90)90170-T GREEN GGR, 1974, J PHYSIOL-LONDON, V241, pP29 Greenberg S, 2004, IEICE T INF SYST, VE87D, P1059 Greenberg S., 1998, INT C SPOK LANG PROC HOUTGAST T, 1985, J ACOUST SOC AM, V77, P1069, DOI 10.1121/1.392224 Joris PX, 2004, PHYSIOL REV, V84, P541, DOI 10.1152/physrev.00029.2003 KAY RH, 1982, PHYSIOL REV, V62, P894 Kingsbury BED, 1998, SPEECH COMMUN, V25, P117, DOI 10.1016/S0167-6393(98)00032-6 Krebs B, 2008, J NEUROPHYSIOL, V100, P1602, DOI 10.1152/jn.90374.2008 Lewicki MS, 2002, NAT NEUROSCI, V5, P356, DOI 10.1038/nn831 Patterson R. D., 1988, 2341 MRC APPL PSYCH PRENDERGAST G, 2010, EUR J NEUROSCI OCT SHANNON RV, 1995, SCIENCE, V270, P303, DOI 10.1126/science.270.5234.303 Singh NC, 2003, J ACOUST SOC AM, V114, P3394, DOI 10.1121/1.1624067 Slaney M., 1993, 35 APPL COMP INC PER Slaney M., 1994, 45 APPL COMP INC VIEMEISTER NF, 2002, GENETICS FUNCTION AU, P273 NR 27 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2011 VL 53 IS 6 BP 903 EP 913 DI 10.1016/j.specom.2011.03.002 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 767BE UT WOS:000290829500010 ER PT J AU Yu, K Zen, H Mairesse, F Young, S AF Yu, Kai Zen, Heiga Mairesse, Francois Young, Steve TI Context adaptive training with factorized decision trees for HMM-based statistical parametric speech synthesis SO SPEECH COMMUNICATION LA English DT Article DE HMM-based speech synthesis; Context adaptive training; Factorized decision tree; State clustering ID HIDDEN MARKOV-MODELS; REGRESSION AB To achieve natural high quality synthesized speech in HMM-based speech synthesis, the effective modelling of complex acoustic and linguistic contexts is critical. Traditional approaches use context-dependent HMMs with decision tree based parameter clustering to model the full combinatorial of contexts. However, weak contexts, such as word-level emphasis in natural speech, are difficult to capture using this approach. Also, due to combinatorial explosion, incorporating new contexts within the traditional framework may easily lead to the problem of insufficient data coverage. To effectively model weak contexts and reduce the data sparsity problem, different types of contexts should be treated independently. Context adaptive training provides a structured framework for this whereby standard HMMs represent normal contexts and transforms represent the additional effects of weak contexts. In contrast to speaker adaptive training in speech recognition, separate decision trees have to be built for different types of context factors. 
This paper describes the general framework of context adaptive training and investigates three concrete forms: MLLR, CMLLR and CAT based systems. Experiments on a word-level emphasis synthesis task show that all context adaptive training approaches can outperform the standard full-context-dependent HMM approach. However, the MLLR based system achieved the best performance. (C) 2011 Elsevier B.V. All rights reserved. C1 [Yu, Kai; Mairesse, Francois; Young, Steve] Univ Cambridge, Dept Engn, Machine Intelligence Lab, Cambridge CB2 1PZ, England. [Zen, Heiga] Toshiba Res Europe Ltd, Cambridge Res Lab, Cambridge CB4 0GZ, England. RP Yu, K (reprint author), Univ Cambridge, Dept Engn, Machine Intelligence Lab, Cambridge CB2 1PZ, England. EM ky219@cam.ac.uk RI Yu, Kai/B-1772-2012 OI Yu, Kai/0000-0002-7102-9826 FU UK EPSRC [EP/F013930/1]; EU [216594] FX This research was partly funded by the UK EPSRC under grant agreement EP/F013930/1 and by the EU FP7 Programme under grant agreement 216594 (CLASSIC project: www.classic-project.org). The original version of this paper was selected as one of the best papers from Interspeech 2010. It is presented here in revised form following additional peer review. CR Anastasakos T., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607807 Chou W., 1999, P ICASSP 99, V1, P345 Fukada T., 1992, P ICASSP, P137, DOI 10.1109/ICASSP.1992.225953 GALES M, 2010, P INT, P58 Gales M. J. F., 1996, CUEDFINFENGTR263 Gales MJF, 1998, COMPUT SPEECH LANG, V12, P75, DOI 10.1006/csla.1998.0043 Gales MJF, 2000, IEEE T SPEECH AUDI P, V8, P417, DOI 10.1109/89.848223 Imai S., 1983, Proceedings of ICASSP 83. IEEE International Conference on Acoustics, Speech and Signal Processing Iwahashi N, 2000, IEICE T INF SYST, VE83D, P1550 Kawahara H, 1999, SPEECH COMMUN, V27, P187, DOI 10.1016/S0167-6393(98)00085-5 Kominek J., 2003, CMULTI03177 LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010 Nankaku Y., 2008, P ICASSP, P4469 Povey Daniel, 2010, Proceedings 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2010, DOI 10.1109/ICASSP.2010.5495662 Saino K., 2008, THESIS NAGOYA I TECH Shinoda K., 1997, P EUR Toda T, 2007, IEICE T INF SYST, VE90D, P816, DOI 10.1093/ietisy/e90-d.5.816 Tokuda K., HMM BASED SPEECH SYN TOKUDA K, 2000, P ICASSP, V3, P1315 Yoshimura T, 1999, P EUR, P2347 Young S., 2009, HTK BOOK HTK VERSION Young S.J., 1994, ARPA WORKSH HUM LANG, P307 Yu K, 2010, Proceedings 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2010, DOI 10.1109/ICASSP.2010.5495690 YU K, 2009, P ICASSP, P3773 Zen H., 2009, P INT, P2091 Zen H., 2010, P INT, P410 Zen H, 2009, SPEECH COMMUN, V51, P1039, DOI 10.1016/j.specom.2009.04.004 Zen HG, 2007, IEICE T INF SYST, VE90D, P325, DOI 10.1093/ietisy/e90-d.1.325 NR 28 TC 13 Z9 13 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2011 VL 53 IS 6 BP 914 EP 923 DI 10.1016/j.specom.2011.03.003 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 767BE UT WOS:000290829500011 ER PT J AU Palomaki, KJ Brown, GJ AF Palomaki, Kalle J. Brown, Guy J. TI A computational model of binaural speech recognition: Role of across-frequency vs. 
within-frequency processing and internal noise SO SPEECH COMMUNICATION LA English DT Article DE Binaural model; Speech recognition; Equalization-cancellation model; Missing data ID INTERAURAL TIME-DELAY; AUDITORY-NERVE DATA; PERCEPTUAL SEGREGATION; SOUND LOCALIZATION; SPATIAL UNMASKING; LEVEL DIFFERENCES; INTELLIGIBILITY; REVERBERATION; EQUALIZATION; RESTORATION AB This study describes a model of binaural speech recognition that is tested against psychoacoustic findings on binaural speech intelligibility in noise. It consists of models of the auditory periphery, binaural pathway and recognition of speech from glimpses based on the missing data approach, which allows the speech reception threshold (SRT) of the model and listeners to be compared. The binaural advantage based on differences between the interaural time differences (ITD) of the target and masker is modelled using the equalization cancellation (EC) mechanism, either independently within each frequency channel or across all channels. The model is tested using a stimulus paradigm in which the target speech and noise interference are split into low- and high-frequency bands, so that the ITD in each band can be varied independently. The match between the model and listener data is quantified by a normalized SRT distance and a correlation metric, which demonstrate a slightly better match for the within-channel model (SRT: 0.5 dB, correlation: 0.94), than for the across-channel model (SRT: 0.7 dB, correlation: 0.90). However, as the differences between the approaches are small and non-significant, our results suggest that listeners exploit ITD via a mechanism that is neither fully frequency-dependent nor fully frequency-independent. (C) 2011 Elsevier B.V. All rights reserved. C1 [Palomaki, Kalle J.] Aalto Univ, Sch Sci & Technol, Dept Comp & Informat Sci, Adapt Informat Res Ctr, FI-00076 Aalto, Finland. [Brown, Guy J.] Univ Sheffield, Dept Comp Sci, Sheffield S1 4DP, S Yorkshire, England. RP Palomaki, KJ (reprint author), Aalto Univ, Sch Sci & Technol, Dept Comp & Informat Sci, Adapt Informat Res Ctr, POB 15400, FI-00076 Aalto, Finland. EM kalle.palomaki@tkk.fi; g.brown@dcs.shef.ac.uk CR Akeroyd MA, 2004, J ACOUST SOC AM, V116, P1135, DOI 10.1121/1.1768959 Akeroyd MA, 2001, J ACOUST SOC AM, V110, P1498, DOI 10.1121/1.1390336 Barker J., 2006, COMPUTATIONAL AUDITO, P297 Beutelmann R, 2006, J ACOUST SOC AM, V120, P331, DOI 10.1121/1.2202888 BREEBAART DJ, 2001, THESIS TU EINDHOVEN BRONKHORST AW, 1988, J ACOUST SOC AM, V83, P1508, DOI 10.1121/1.395906 BROWN GJ, 2005, P INT LISB SEPT 4 8, P1753 COLBURN HS, 1973, J ACOUST SOC AM, V54, P1458, DOI 10.1121/1.1914445 COLBURN HS, 1977, J ACOUST SOC AM, V61, P525, DOI 10.1121/1.381294 Cooke M, 2001, SPEECH COMMUN, V34, P267, DOI 10.1016/S0167-6393(00)00034-0 Cooke M., 1994, ICSLP 94. 
1994 International Conference on Spoken Language Processing Cooke M, 2006, J ACOUST SOC AM, V119, P1562, DOI 10.1121/1.2166600 Culling JF, 1998, J ACOUST SOC AM, V103, P3509, DOI 10.1121/1.423059 CULLING JF, 1995, J ACOUST SOC AM, V98, P785, DOI 10.1121/1.413571 Darwin CJ, 1997, J ACOUST SOC AM, V102, P2316, DOI 10.1121/1.419641 Drennan WR, 2003, J ACOUST SOC AM, V114, P2178, DOI 10.1121/1.1609994 DURLACH NI, 1963, J ACOUST SOC AM, V35, P1206, DOI 10.1121/1.1918675 Durlach N.I., 1972, F MODERN AUDITORY TH, P371 EDMONDS B, 2004, THESIS CARDIFF U Edmonds BA, 2006, J ACOUST SOC AM, V120, P1539, DOI 10.1121/1.2228573 Edmonds BA, 2005, J ACOUST SOC AM, V117, P3069, DOI 10.1121/1.1880752 GLASBERG BR, 1990, HEARING RES, V47, P103, DOI 10.1016/0378-5955(90)90170-T Hawley ML, 1999, J ACOUST SOC AM, V105, P3436, DOI 10.1121/1.424670 HIRSH IJ, 1950, J ACOUST SOC AM, V22, P196, DOI 10.1121/1.1906588 JEFFRESS LA, 1948, J COMP PHYSIOL PSYCH, V41, P35, DOI 10.1037/h0061495 KOCK WE, 1950, J ACOUST SOC AM, V22, P801, DOI 10.1121/1.1906692 Leonard R. G., 1984, P ICASSP 84, P111 Liu C, 2001, J ACOUST SOC AM, V110, P3218, DOI 10.1121/1.1419090 Lyon R. F., 1983, Proceedings of ICASSP 83. IEEE International Conference on Acoustics, Speech and Signal Processing Mathworks, 2008, MATLAB McArdle Rachel A, 2005, J Am Acad Audiol, V16, P726, DOI 10.3766/jaaa.16.9.9 Palomaki KJ, 2004, SPEECH COMMUN, V43, P361, DOI 10.1016/j.specom.2004.03.005 Palomaki KJ, 2004, SPEECH COMMUN, V43, P123, DOI 10.1016/j.specom.2004.02.005 PALOMAKI KJ, 2008, J ACOUST SOC AM, V123, P3715 PLOMP R, 1979, AUDIOLOGY, V18, P43 Ramkissoon Ishara, 2002, Am J Audiol, V11, P23, DOI 10.1044/1059-0889(2002/005) Roman N, 2003, J ACOUST SOC AM, V114, P2236, DOI 10.1121/1.1610463 SCHUBERT ED, 1956, J ACOUST SOC AM, V28, P895, DOI 10.1121/1.1908508 Seltzer ML, 2004, SPEECH COMMUN, V43, P379, DOI 10.1016/j.specom.2004.03.006 SPIETH W, 1954, J ACOUST SOC AM, V26, P391, DOI 10.1121/1.1907347 STEVENS SS, 1957, PSYCHOL REV, V64, P153, DOI 10.1037/h0046162 van der Heijden M, 1999, J ACOUST SOC AM, V105, P388, DOI 10.1121/1.424628 VOMHOVEL H, 1984, THESIS RHEINISCH WES WARREN RM, 1970, SCIENCE, V167, P392, DOI 10.1126/science.167.3917.392 Warren RM, 1997, PERCEPT PSYCHOPHYS, V59, P275, DOI 10.3758/BF03211895 Wilson Richard H., 2004, Seminars in Hearing, V25, P93 Wilson RH, 2005, J REHABIL RES DEV, V42, P499, DOI 10.1682/JRRD.2004.10.0134 NR 47 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2011 VL 53 IS 6 BP 924 EP 940 DI 10.1016/j.specom.2011.03.005 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 767BE UT WOS:000290829500012 ER PT J AU Zimmerer, F Scharinger, M Reetz, H AF Zimmerer, Frank Scharinger, Mathias Reetz, Henning TI When BEAT becomes HOUSE: Factors of word final /t/-deletion in German SO SPEECH COMMUNICATION LA English DT Article DE Segment deletion; Segment reduction; Natural speech; Production; Phonology ID AMERICAN ENGLISH; SPEECH; PERCEPTION; SPEAKING; INTELLIGIBILITY; FREQUENCY; REDUCTION; EXCERPTS AB The deletion and reduction of alveolar /t/ is a phenomenon that has been given considerable attention in the research on speech production and perception. Data have mainly been drawn from spoken language corpora, where a tight control over contributing factors of /t/-deletion is hardly possible.
Here, we present a new way of creating a spoken language corpus adhering to some crucial factors we wanted to hold constant for the investigation of word-final /t/-deletion in German. German is especially interesting with regard to /t/-deletion due to its rich suffixal morphology, attributing morphological status to word-final /t/ in many paradigms. We focused on verb inflection and employed a verb form production task for creating a concise corpus of naturally spoken language in which we could control for factors previously established to affect /t/-deletion. We then determined the best estimators for /t/-productions (i.e. canonical, deleted, or reduced) in our corpus. The influence of extra-linguistic factors was comparable to previous studies. We suggest that our method of constructing a natural language corpus with carefully selected characteristics is a viable way for the examination of deletions and reductions during speech production. Furthermore, we found that the best predictor for non-canonical productions and deletions was the following phonological context. (C) 2011 Elsevier B.V. All rights reserved. C1 [Zimmerer, Frank; Reetz, Henning] Goethe Univ Frankfurt, Inst Phonet, D-60054 Frankfurt, Germany. [Scharinger, Mathias] Univ Maryland, Dept Linguist, College Pk, MD 20742 USA. [Zimmerer, Frank; Scharinger, Mathias] Univ Konstanz, Dept Linguist, D-78457 Constance, Germany. RP Zimmerer, F (reprint author), Goethe Univ Frankfurt, Inst Phonet, Box 170, D-60054 Frankfurt, Germany. EM zimmerer@em.uni-frankfurt.de FU Deutsche Forschungs Gesellschaft, DFG [SPP 1234, SFB 471] FX This work was supported by the Deutsche Forschungs Gesellschaft, DFG (SPP 1234 and SFB 471). We also wish to thank our reviewers for their very useful comments and suggestions. CR Agresti A., 2002, CATEGORICAL DATA ANA ARVANITI A, 2007, P 16 INT C PHON SCI, P19 Baayen H., 2008, ANAL LINGUISTIC DATA Baayen H. R., 1995, CELEX LEXICAL DATABA Bates D., 2010, IME4 LINEAR MIXED EF Boersma P., 2007, PRAAT DOING PHONETIC BRESLOW NE, 1993, J AM STAT ASSOC, V88, P9, DOI 10.2307/2290687 Byrd D, 1996, J PHONETICS, V24, P263, DOI 10.1006/jpho.1996.0014 BYRD D, 1994, SPEECH COMMUN, V15, P39, DOI 10.1016/0167-6393(94)90039-6 Fasold Ralph W., 1972, TENSE MARKING BLACK Fosler-Lussier E, 1999, SPEECH COMMUN, V29, P137, DOI 10.1016/S0167-6393(99)00035-7 Fougeron C, 1997, J ACOUST SOC AM, V101, P3728, DOI 10.1121/1.418332 GREENBERG S, 2002, P 2 INT C HUM LANG T, P36, DOI 10.3115/1289189.1289251 Greenberg S, 1999, SPEECH COMMUN, V29, P159, DOI 10.1016/S0167-6393(99)00050-3 GUY GR, 1992, CHANGE, V3, P223 Guy Gregory R, 1980, LOCATING LANGUAGE TI, P1 Hay J, 2001, LINGUISTICS, V39, P1041, DOI 10.1515/ling.2001.041 Hume E., 2007, BUCKEYE CORPUS CONVE IPDS, 1994, KIEL CORP SPONT SPEE Johnson Keith, 2004, P 1 SESS 10 INT S Jurafsky Daniel, 2001, FREQUENCY EMERGENCE, P229, DOI 10.1075/tsl.45.13jur KINGSTON J, 2006, P 3 C LAB APPR SPAN Kohler Klaus, 1995, EINFUHRUNG PHONETIK, VSecond Koreman J, 2006, J ACOUST SOC AM, V119, P582, DOI 10.1121/1.2133436 LABOV W, 1967, NEW DIRECTIONS ELEME, P1 LAHIRI A, 2007, P 16 INT C PHON SCI, P19 LIEBERMAN P, 1963, LANG SPEECH, V6, P172 Mitterer H, 2008, J MEM LANG, V59, P133, DOI 10.1016/j.jml.2008.02.004 Mitterer H, 2006, J PHONETICS, V34, P73, DOI 10.1016/j.wocn.2005.03.003 Neu H., 1980, LOCATING LANGUAGE TI, P37 Nolan F., 1992, PAPERS LABORATORY PH, P261 PICKETT JM, 1963, LANG SPEECH, V6, P151 Pinheiro J.
C., 2000, MIXED EFFECTS MODELS POLLACK I, 1963, LANG SPEECH, V6, P165 PYCHA A, 2010, J INT PHON ASSOC, V39, P1 R Development Core Team, 2010, R LANG ENV STAT COMP Raymond William D., 2006, LANG VAR CHANGE, V18, P55 Shriberg E., 1999, P INT C PHON SCI SAN, P619 Sumner M, 2005, J MEM LANG, V52, P322, DOI 10.1016/j.jml.2004.11.004 Tree Jean E. Fox, 1997, Cognition, V62, P151 Turk AE, 2007, J PHONETICS, V35, P445, DOI 10.1016/j.wocn.2006.12.001 Turk AE, 1999, J PHONETICS, V27, P171, DOI 10.1006/jpho.1999.0093 Wolfram Walt, 1969, SOCIOLINGUISTIC DESC ZIMMERER F, 2009, THESIS U FRANKFURT Zimmerer F, 2009, J ACOUST SOC AM, V125, P2307, DOI 10.1121/1.3021438 ZUE VW, 1979, J ACOUST SOC AM, V66, P1039, DOI 10.1121/1.383323 NR 46 TC 6 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2011 VL 53 IS 6 BP 941 EP 954 DI 10.1016/j.specom.2011.03.006 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 767BE UT WOS:000290829500013 ER PT J AU Heckmann, M Raj, B Smaragdis, P AF Heckmann, Martin Raj, Bhiksha Smaragdis, Paris TI Special Issue: Perceptual and Statistical Audition Preface SO SPEECH COMMUNICATION LA English DT Editorial Material NR 0 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY-JUN PY 2011 VL 53 IS 5 SI SI BP 591 EP 591 DI 10.1016/j.specom.2011.03.004 PG 1 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 757DV UT WOS:000290065900001 ER PT J AU Dietz, M Ewert, SD Hohmann, V AF Dietz, Mathias Ewert, Stephan D. Hohmann, Volker TI Auditory model based direction estimation of concurrent speakers from binaural signals SO SPEECH COMMUNICATION LA English DT Article DE Binaural processing; Auditory modeling; Direction estimation ID TIME-FREQUENCY MASKS; SOURCE LOCALIZATION; NOISE-REDUCTION; CONTRALATERAL INHIBITION; INTERAURAL PARAMETERS; ROOM REVERBERATION; SPEECH RECOGNITION; SOUND LOCALIZATION; CROSS-CORRELATION; HEARING-AIDS AB Humans show a very robust ability to localize sounds in adverse conditions. Computational models of binaural sound localization and technical approaches of direction-of-arrival (DOA) estimation also show good performance, however, both their binaural feature extraction and the strategies for further analysis partly differ from what is currently known about the human auditory system. This study investigates auditory model based DOA estimation emphasizing known features and limitations of the auditory binaural processing such as (i) high temporal resolution, (ii) restricted frequency range to exploit temporal fine-structure, (iii) use of temporal envelope disparities, and (iv) a limited range to compensate for interaural time delay. DOA estimation performance was investigated for up to five concurrent speakers in free field and for up to three speakers in the presence of noise. The DOA errors in these conditions were always smaller than 5 degrees. A condition with moving speakers was also tested and up to three moving speakers could be tracked simultaneously. Analysis of DOA performance as a function of the binaural temporal resolution showed that short time constants of about 5 ms employed by the auditory model were crucial for robustness against concurrent sources. (C) 2010 Elsevier B.V. All rights reserved. 
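The Dietz et al. record above turns on extracting interaural time differences from the temporal fine structure of low-frequency bands with a short (about 5 ms) time constant. The following sketch illustrates only that general idea in the STFT domain (frame sizes, the 1400 Hz fine-structure limit and the smoothing scheme are assumptions, not the authors' auditory model):

```python
# Illustrative sketch (not the authors' model): per-band interaural phase differences,
# exponentially smoothed with a short time constant and converted to ITD estimates.
import numpy as np
from scipy.signal import stft

def itd_tracks(left, right, fs, tau=0.005, fmax=1400.0):
    f, t, L = stft(left, fs, nperseg=256, noverlap=192)
    _, _, R = stft(right, fs, nperseg=256, noverlap=192)
    keep = (f > 0) & (f <= fmax)                 # fine-structure ITD is only usable at low frequencies
    cross = L[keep] * np.conj(R[keep])           # interaural cross-spectrum per band and frame
    hop = 256 - 192
    alpha = np.exp(-hop / (tau * fs))            # ~5 ms exponential smoothing
    smoothed = np.zeros_like(cross)
    acc = np.zeros(cross.shape[0], dtype=complex)
    for n in range(cross.shape[1]):
        acc = alpha * acc + (1 - alpha) * cross[:, n]
        smoothed[:, n] = acc
    ipd = np.angle(smoothed)
    itd = ipd / (2 * np.pi * f[keep][:, None])   # IPD -> ITD in seconds; phase ambiguity ignored here
    return itd                                    # e.g. histogram over bands/frames to find source directions
```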
C1 [Dietz, Mathias; Ewert, Stephan D.; Hohmann, Volker] Carl von Ossietzky Univ Oldenburg, D-26111 Oldenburg, Germany. RP Dietz, M (reprint author), Carl von Ossietzky Univ Oldenburg, D-26111 Oldenburg, Germany. EM mathias.dietz@uni-oldenburg.de FU DFG [SFB/TRR31]; International Graduate School FX This study was supported by the DFG (SFB/TRR31 'The Active Auditory System') and the International Graduate School "Neurosensory Science, Systems and Applications". We would like to thank the members of the Medical Physics group and Birger Kollmeier for continuous support and fruitful discussions. We are grateful to Hendrik Kayser for his valuable help with the head-related impulse responses. CR Ajmera J., 2004, P ICASSP, V1, P605 ALLEN JB, 1977, J ACOUST SOC AM, V62, P912, DOI 10.1121/1.381621 BERNSTEIN LR, 1994, J ACOUST SOC AM, V95, P3561, DOI 10.1121/1.409973 Bernstein LR, 2002, J ACOUST SOC AM, V112, P1026, DOI 10.1121/1.1497620 BLAUERT J, 1986, J ACOUST SOC AM, V80, P533, DOI 10.1121/1.394048 Braasch J, 2002, ACTA ACUST UNITED AC, V88, P956 Brand A, 2002, NATURE, V417, P543, DOI 10.1038/417543a Breebaart J, 2001, J ACOUST SOC AM, V110, P1105, DOI 10.1121/1.1383299 Cherry CE, 1953, J ACOUST SOC AM, V25, P975, DOI DOI 10.1121/1.1907229 Cooke M, 2003, J PHONETICS, V31, P579, DOI 10.1016/S0095-4470(03)00013-5 COOKE MP, 1993, SPEECH COMMUN, V13, P391, DOI 10.1016/0167-6393(93)90037-L Culling JF, 2000, J ACOUST SOC AM, V107, P517, DOI 10.1121/1.428320 de Cheveigne A, 1999, SPEECH COMMUN, V27, P175, DOI 10.1016/S0167-6393(98)00074-0 de Cheveigne A, 2002, J ACOUST SOC AM, V111, P1917, DOI 10.1121/1.1458024 Dietz M, 2009, J ACOUST SOC AM, V125, P1622, DOI 10.1121/1.3076045 Dietz M, 2008, BRAIN RES, V1220, P234, DOI 10.1016/j.brainres.2007.09.026 Drullman R, 2000, J ACOUST SOC AM, V107, P2224, DOI 10.1121/1.428503 Ewert SD, 2000, J ACOUST SOC AM, V108, P1181, DOI 10.1121/1.1288665 Faller C, 2004, J ACOUST SOC AM, V116, P3075, DOI 10.1121/1.1791872 Garofolo J., 1990, DARPA TIMIT ACOUSTIC Goupell MJ, 2006, J ACOUST SOC AM, V119, P3971, DOI 10.1121/1.2200147 Grimm G, 2009, IEEE T AUDIO SPEECH, V17, P1408, DOI 10.1109/TASL.2009.2020531 Hartikainen J., 2008, OPTIMAL FILTERING KA Haykin S, 2005, NEURAL COMPUT, V17, P1875, DOI 10.1162/0899766054322964 Heil P, 2003, SPEECH COMMUN, V41, P123, DOI 10.1016/S0167-6393(02)00099-7 Hohmann V, 2002, ACTA ACUST UNITED AC, V88, P433 Joris PX, 2006, J NEUROSCI, V26, P279, DOI 10.1523/JNEUROSCI.2285-05.2006 Kayser H, 2009, EURASIP J ADV SIG PR, DOI 10.1155/2009/298605 KNAPP CH, 1976, IEEE T ACOUST SPEECH, V24, P320, DOI 10.1109/TASSP.1976.1162830 KOEHNKE J, 1986, J ACOUST SOC AM, V79, P1558, DOI 10.1121/1.393682 KOLLMEIER B, 1990, J ACOUST SOC AM, V87, P1709, DOI 10.1121/1.399419 KUHN GF, 1977, J ACOUST SOC AM, V62, P157, DOI 10.1121/1.381498 Li YP, 2009, SPEECH COMMUN, V51, P230, DOI 10.1016/j.specom.2008.09.001 LINDEMANN W, 1986, J ACOUST SOC AM, V80, P1608, DOI 10.1121/1.394325 Liu C, 2000, J ACOUST SOC AM, V108, P1888, DOI 10.1121/1.1290516 Louage DHG, 2006, J NEUROSCI, V26, P96, DOI 10.1523/JNEUROSCI.2339-05.2006 MARQUARDT T, 2007, HEARING SENSORY PROC, P312 May T, 2011, IEEE T AUDIO SPEECH, V19, P1, DOI 10.1109/TASL.2010.2042128 McAlpine D, 2001, NAT NEUROSCI, V4, P396, DOI 10.1038/86049 MOORE BCJ, 1983, J ACOUST SOC AM, V74, P750, DOI 10.1121/1.389861 Nix J, 2007, IEEE T AUDIO SPEECH, V15, P995, DOI 10.1109/TASL.2006.889788 Nix J, 2006, J ACOUST SOC AM, V119, P463, DOI 10.1121/1.2139619 PALMER AR, 1986, HEARING RES, V24, P1, DOI 10.1016/0378-5955(86)90002-X 
Palomaki KJ, 2004, SPEECH COMMUN, V43, P361, DOI 10.1016/j.specom.2004.03.005 Park HM, 2009, SPEECH COMMUN, V51, P15, DOI 10.1016/j.specom.2008.05.012 Patterson RD, 1987, M IOC SPEECH GROUP A Peissig J, 1997, J ACOUST SOC AM, V101, P1660, DOI 10.1121/1.418150 POLLACK I, 1959, J ACOUST SOC AM, V31, P1250, DOI 10.1121/1.1907852 Puria S, 1997, J ACOUST SOC AM, V101, P2754, DOI 10.1121/1.418563 Rohdenburg T, 2008, INT CONF ACOUST SPEE, P2449, DOI 10.1109/ICASSP.2008.4518143 Roman N, 2008, IEEE T AUDIO SPEECH, V16, P728, DOI 10.1109/TASL.2008.918978 Roman N, 2003, J ACOUST SOC AM, V114, P2236, DOI 10.1121/1.1610463 RUGGERO MA, 1991, J NEUROSCI, V11, P1057 Sarkka S, 2007, INFORM FUSION, V8, P2, DOI 10.1016/j.inffus.2005.09.009 SAYERS BM, 1964, J ACOUST SOC AM, V36, P923, DOI 10.1121/1.1919121 Siveke I, 2008, J NEUROSCI, V28, P2043, DOI 10.1523/JNEUROSCI.4488-07.2008 Srinivasan S, 2006, SPEECH COMMUN, V48, P1486, DOI 10.1016/j.specom.2006.09.003 Supper B, 2006, IEEE T AUDIO SPEECH, V14, P1008, DOI 10.1109/TSA.2005.857787 Thompson S. P., 1882, PHILOS MAG, V13, P406 van de Par S, 1999, J ACOUST SOC AM, V106, P1940, DOI 10.1121/1.427942 Wang D., 2006, COMPUTATIONAL AUDITO Wittkop T, 2003, SPEECH COMMUN, V39, P111, DOI 10.1016/S0167-6393(02)00062-6 Wu MY, 2003, IEEE T SPEECH AUDI P, V11, P229, DOI 10.1109/TSA.2003.811539 NR 63 TC 11 Z9 11 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY-JUN PY 2011 VL 53 IS 5 SI SI BP 592 EP 605 DI 10.1016/j.specom.2010.05.006 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 757DV UT WOS:000290065900002 ER PT J AU Weiss, RJ Mandel, MI Ellis, DPW AF Weiss, Ron J. Mandel, Michael I. Ellis, Daniel P. W. TI Combining localization cues and source model constraints for binaural source separation SO SPEECH COMMUNICATION LA English DT Article DE Source separation; Binaural; Source models; Eigenvoices; EM ID DATA SPEECH RECOGNITION; STATISTICS AB We describe a system for separating multiple sources from a two-channel recording based on interaural cues and prior knowledge of the statistics of the underlying source signals. The proposed algorithm effectively combines information derived from low level perceptual cues, similar to those used by the human auditory system, with higher level information related to speaker identity. We combine a probabilistic model of the observed interaural level and phase differences with a prior model of the source statistics and derive an EM algorithm for finding the maximum likelihood parameters of the joint model. The system is able to separate more sound sources than there are observed channels in the presence of reverberation. In simulated mixtures of speech from two and three speakers the proposed algorithm gives a signal-to-noise ratio improvement of 1.7 dB over a baseline algorithm which uses only interaural cues. Further improvement is obtained by incorporating eigenvoice speaker adaptation to enable the source model to better match the sources present in the signal. This improves performance over the baseline by 2.7 dB when the speakers used for training and testing are matched. However, the improvement is minimal when the test data is very different from that used in training. (C) 2011 Elsevier B.V. All rights reserved. C1 [Weiss, Ron J.; Mandel, Michael I.; Ellis, Daniel P. W.] Columbia Univ, Dept Elect Engn, LabROSA, New York, NY 10027 USA. 
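The Weiss, Mandel and Ellis record above combines interaural level and phase cues with source priors in an EM framework. The sketch below is a greatly simplified stand-in showing only the first ingredient, soft assignment of time-frequency cells to a few candidate directions from interaural phase; the candidate ITDs, concentration parameter and window settings are assumptions, and no source model or EM refinement is included:

```python
# Greatly simplified illustration (not the authors' EM/eigenvoice model): soft
# time-frequency masks derived from how well each cell's IPD matches candidate ITDs.
import numpy as np
from scipy.signal import stft

def direction_masks(left, right, fs, candidate_itds=(-4e-4, 0.0, 4e-4), kappa=2.0):
    f, _, L = stft(left, fs, nperseg=1024, noverlap=768)
    _, _, R = stft(right, fs, nperseg=1024, noverlap=768)
    ipd = np.angle(L * np.conj(R))                       # observed interaural phase difference
    scores = []
    for itd in candidate_itds:
        expected = 2 * np.pi * f[:, None] * itd          # IPD predicted by this candidate ITD
        scores.append(kappa * np.cos(ipd - expected))    # circular (von Mises-like) phase match
    scores = np.stack(scores)                            # (n_candidates, n_freq, n_frames)
    scores -= scores.max(axis=0, keepdims=True)
    masks = np.exp(scores) / np.exp(scores).sum(axis=0)  # soft T-F masks, one per candidate
    return masks                                         # e.g. apply masks[i] * L to isolate source i
```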
RP Weiss, RJ (reprint author), Columbia Univ, Dept Elect Engn, LabROSA, New York, NY 10027 USA. EM ronw@ee.columbia.edu; mim@ee.columbia.edu; dpwe@ee.columbia.edu FU NSF [IIS-0238301, IIS-0535168]; EU FX This work was supported by the NSF under Grants No. IIS-0238301 and IIS-0535168, and by EU project AMIDA. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the Sponsors. CR AARABI P, 2002, IEEE T SYSTEMS MAN C, V32 Algazi V. R., 2001, IEEE WORKSH APPL SIG, P99 BLAUERT J, 1997, SPATIAL HEARNING PSY CHERRY EC, 1953, J ACOUST SOC AM, V25, P975, DOI 10.1121/1.1907229 Cooke M, 2006, J ACOUST SOC AM, V120, P2421, DOI 10.1121/1.2229005 Cooke M, 2010, COMPUT SPEECH LANG, V24, P1, DOI 10.1016/j.csl.2009.02.006 Garofolo J. S., 1993, DARPA TIMIT ACOUSTIC Harding S, 2006, IEEE T AUDIO SPEECH, V14, P58, DOI 10.1109/TSA.2005.860354 JOURJINE A, 2000, P IEEE INT C AC SPEE, V5, P2985 Kuhn R, 2000, IEEE T SPEECH AUDI P, V8, P695, DOI 10.1109/89.876308 Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst Mandel M. I., 2007, P IEEE WORKSH APPL S, P275 Mandel MI, 2010, IEEE T AUDIO SPEECH, V18, P382, DOI 10.1109/TASL.2009.2029711 Nix J, 2006, J ACOUST SOC AM, V119, P463, DOI 10.1121/1.2139619 Palomaki KJ, 2004, SPEECH COMMUN, V43, P361, DOI 10.1016/j.specom.2004.03.005 RENNIE S, 2003, P IEEE INT C AC SPEE, V1, P88 RENNIE SI, 2005, P 10 INT WORKSH ART, P293 ROMAN N, 2004, ADV NEURAL INFORM PR Roman N, 2006, J ACOUST SOC AM, V120, P458, DOI 10.1121/1.2204590 Roweis S. T., 2003, P EUR, P1009 Sawada H., 2007, P IEEE WORKSH APPL S, P139 Shinn-Cunningham BG, 2005, J ACOUST SOC AM, V117, P3100, DOI 10.1121/1.1872572 WANG D, 2005, IDEAL BINARY MASK CO, P181 Weiss RJ, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P419 Weiss RJ, 2010, COMPUT SPEECH LANG, V24, P16, DOI 10.1016/j.csl.2008.03.003 Weiss Ron J, 2009, THESIS COLUMBIA U WIGHTMAN FL, 1992, J ACOUST SOC AM, V91, P1648, DOI 10.1121/1.402445 Wilson K, 2007, INT CONF ACOUST SPEE, P33 Yilmaz O, 2004, IEEE T SIGNAL PROCES, V52, P1830, DOI [10.1109/TSP.2004.828896, 10.1109/TSP.2004.V8896] NR 29 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY-JUN PY 2011 VL 53 IS 5 SI SI BP 606 EP 621 DI 10.1016/j.specom.2011.01.003 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 757DV UT WOS:000290065900003 ER PT J AU Lu, YC Cooke, M AF Lu, Yan-Chen Cooke, Martin TI Motion strategies for binaural localisation of speech sources in azimuth and distance by artificial listeners SO SPEECH COMMUNICATION LA English DT Article DE Active hearing; Sound source localisation; Interaural time difference; Motion parallax; Particle filtering ID TIME-DELAY ESTIMATION; SOUND LOCALIZATION; PARTICLE FILTERS; HEAD MOVEMENTS; REVERBERATION; PERCEPTION; SIMULATION; TRACKING; CUES AB Localisation in azimuth and distance of sound sources such as speech is an important ability for both human and artificial listeners. While progress has been made, particularly for azimuth estimation, most work has been directed at the special case of static listeners and static sound sources. Although dynamic sound sources create their own localisation challenges such as motion blur, moving listeners have the potential to exploit additional cues not available in the static situation. 
An example is motion parallax, based on a sequence of azimuth estimates, which can be used to triangulate sound source location. The current study examines what types of listener (or sensor) motion are beneficial for localisation. Is any kind of motion useful, or do certain motion trajectories deliver robust estimates rapidly? Eight listener motion strategies and a no motion baseline were tested, including simple approaches such as random walks and motion limited to head rotations only, as well as more sophisticated strategies designed to maximise the amount of new information available at each time step or to minimise the overall estimate uncertainty. Sequential integration of estimates was achieved using a particle filtering framework. Evaluations, performed in a simulated acoustic environment with single sources under both anechoic and reverberant conditions, demonstrated that two strategies were particularly effective for localisation. The first was simply to move towards the most likely source location, which is beneficial in increasing signal-to-noise ratio, particularly in reverberant conditions. The other high performing approach was based on moving in the direction which led to the largest reduction in the uncertainty of the location estimate. Both strategies achieved estimation errors nearly an order of magnitude less than those obtainable with a static approach, demonstrating the power of motion-based cues to sound source localisation. (C) 2010 Elsevier B.V. All rights reserved. C1 [Lu, Yan-Chen] Univ Sheffield, Dept Comp Sci, Sheffield S1 4DP, S Yorkshire, England. [Cooke, Martin] Ikerbasque Basque Fdn Sci, Bilbao 48011, Spain. [Cooke, Martin] Univ Basque Country, Language & Speech Lab, Fac Letters, Bilbao, Spain. RP Lu, YC (reprint author), Univ Sheffield, Dept Comp Sci, 211 Portobello St, Sheffield S1 4DP, S Yorkshire, England. EM y.c.lu@dcs.shef.ac.uk CR Aarabi P, 2002, IEEE T SYST MAN CY C, V32, P474, DOI 10.1109/TSMCB.2002.804369 ALLEN JB, 1979, J ACOUST SOC AM, V65, P943, DOI 10.1121/1.382599 Arulampalam MS, 2002, IEEE T SIGNAL PROCES, V50, P174, DOI 10.1109/78.978374 ASHMEAD DH, 1995, J EXP PSYCHOL HUMAN, V21, P239, DOI 10.1037/0096-1523.21.2.239 Asoh H., 2004, P FUS, P805 Blauert J., 1997, SPATIAL HEARING PSYC Bodden M., 1993, Acta Acustica, V1 Brandstein MS, 1997, INT CONF ACOUST SPEE, P375, DOI 10.1109/ICASSP.1997.599651 BRANDSTEIN MS, 1997, P WASPAA Campbell D. R., 2005, Computing and Information Systems, V9 Champagne B, 1996, IEEE T SPEECH AUDI P, V4, P148, DOI 10.1109/89.486067 Chen JD, 2006, EURASIP J APPL SIG P, DOI 10.1155/ASP/2006/26503 COOKE M, 2008, AUDITORY SIGNAL PROC DEFREITAS N, 1998, 328 CUEDFINFENGTR Del Moral P., 1996, MARKOV PROCESS RELAT, V2, P555 Douc R., 2005, Proceedings of the 4th International Symposium on Image and Signal Processing and Analysis (IEEE Cat. No. 05EX1094) Doucet A, 2000, STAT COMPUT, V10, P197, DOI 10.1023/A:1008935410038 Eyring C. 
F., 1930, J ACOUST SOC AM, V1, P168, DOI 10.1121/1.1901884 Faller C, 2004, J ACOUST SOC AM, V116, P3075, DOI 10.1121/1.1791872 GAIK W, 1993, J ACOUST SOC AM, V94, P98, DOI 10.1121/1.406947 GARDNER WG, 1996, STANDARDS COMPUTER G GORDON NJ, 1993, IEE PROC-F, V140, P107 Jan EE, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P1321 JEFFRESS LA, 1948, J COMP PHYSIOL PSYCH, V41, P35, DOI 10.1037/h0061495 Kidd G, 2005, J ACOUST SOC AM, V118, P3804, DOI 10.1121/1.2109187 Kitagawa G., 1996, J COMPUTATIONAL GRAP, V5, P1, DOI DOI 10.2307/1390750 KNAPP CH, 1976, IEEE T ACOUST SPEECH, V24, P320, DOI 10.1109/TASSP.1976.1162830 LINDEMANN W, 1986, J ACOUST SOC AM, V80, P1608, DOI 10.1121/1.394325 Liu JS, 1998, J AM STAT ASSOC, V93, P1032, DOI 10.2307/2669847 LOOMIS JM, 1990, J ACOUST SOC AM, V88, P1757, DOI 10.1121/1.400250 Lu Y.-C., 2007, P INT ANTW BELG AUG, P574 Lukowicz P, 2004, LECT NOTES COMPUT SC, V3001, P18 MACKENSEN P, 2004, THESIS TU BERLIN BER MARTINSON E, 2006, P IEEE RSJ INT C INT, P1139 OTANI M, 2007, P JAP CHIN JOINT C A Patterson R.D., 1988, 2341 APU Pitt MK, 1999, J AM STAT ASSOC, V94, P590, DOI 10.2307/2670179 REKLEITIS IM, 2003, THESIS MCGILL U MONT RUI Y, 2004, ACOUST SPEECH SIG PR, P133 Sasaki Y., 2006, P IEEE RSJ INT C INT, P380 Sawhney N., 2000, ACM Transactions on Computer-Human Interaction, V7, DOI 10.1145/355324.355327 Speigle J. M., 1993, Proceedings IEEE 1993 Symposium on Research Frontiers in Virtual Reality (Cat. No.93TH0585-0), DOI 10.1109/VRAIS.1993.378257 THURLOW WR, 1967, J ACOUST SOC AM, V42, P489, DOI 10.1121/1.1910605 Viste H, 2004, P 7 INT C DIG AUD EF, P145 Wallach H, 1940, J EXP PSYCHOL, V27, P339, DOI 10.1037/h0054629 Wang H, 1997, INT CONF ACOUST SPEE, P187, DOI 10.1109/ICASSP.1997.599595 Ward DB, 2003, IEEE T SPEECH AUDI P, V11, P826, DOI 10.1109/TSA.2003.818112 West M., 1997, BAYESIAN FORECASTING, V2nd Zahorik P, 2005, ACTA ACUST UNITED AC, V91, P409 NR 49 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY-JUN PY 2011 VL 53 IS 5 SI SI BP 622 EP 642 DI 10.1016/j.specom.2010.06.001 PG 21 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 757DV UT WOS:000290065900004 ER PT J AU Pichevar, R Najaf-Zadeh, H Thibault, L Landili, H AF Pichevar, Ramin Najaf-Zadeh, Hossein Thibault, Louis Landili, Hassan TI Auditory-inspired sparse representation of audio signals SO SPEECH COMMUNICATION LA English DT Article DE Sparse representations; Masking; Quantization; Temporal data mining; Episode discovery; Audio coding; Matching pursuit; Auditory pattern recognition ID MATCHING PURSUITS; EPISODES; SPIKES; FILTER AB This article deals with the generation of auditory-inspired spectro-temporal features aimed at audio coding. To do so, we first generate sparse audio representations we call spikegrams, using projections on gammatone/gammachirp kernels that generate neural spikes. Unlike Fourier-based representations, these representations are powerful at identifying auditory events, such as onsets, offsets, transients, and harmonic structures. We show that the introduction of adaptiveness in the selection of gammachirp kernels enhances the compression rate compared to the case where the kernels are non-adaptive. We also integrate a masking model that helps reduce bitrate without loss of perceptible audio quality. 
We finally propose a method to extract frequent audio objects (patterns) in the aforementioned sparse representations. The extracted frequency-domain patterns (audio objects) help us address spikes (audio events) collectively rather than individually. When audio compression is needed, the different patterns are stored in a small codebook that can be used to efficiently encode audio materials in a lossless way. The approach is applied to different audio signals and results are discussed and compared. This work is a first step towards the design of a high-quality auditory-inspired "object-based" audio coder. Crown Copyright (C) 2010 Published by Elsevier B.V. All rights reserved. C1 [Pichevar, Ramin; Najaf-Zadeh, Hossein; Thibault, Louis; Landili, Hassan] Commun Res Ctr, Ottawa, ON K2H 8S2, Canada. RP Pichevar, R (reprint author), Commun Res Ctr, 3701 Carling Ave, Ottawa, ON K2H 8S2, Canada. EM Ramin.Pichevar@usherbrooke.ca FU University of Sherbrooke FX The authors would like to thank Richard Boudreau, Hunter Hong, and Frederic Mustiere for proofreading the paper. They also express their gratitude to Debprakash Patnaik and Koniparambil Unnikrishnan for providing them with the GMiner toolbox and for fruitful discussions on frequent episode discovery. The first author would also like to thank Jean Rouat for fruitful discussions on machine learning, as well as the University of Sherbrooke for a travel grant that made the discussions on machine learning and frequent episode discovery possible. Many thanks also to the three anonymous reviewers for their constructive comments. CR Abdallah SA, 2006, IEEE T NEURAL NETWOR, V17, P179, DOI 10.1109/TNN.2005.861031 Abeles M., 1991, CORTICONICS NEURAL C Bech S., 2006, PERCEPTUAL AUDIO EVA Christensen MG, 2006, IEEE T AUDIO SPEECH, V14, P1340, DOI 10.1109/TSA.2005.858038 Feldbauer C, 2005, EURASIP J APPL SIG P, V2005, P1334, DOI 10.1155/ASP.2005.1334 Goodwin MM, 1999, IEEE T SIGNAL PROCES, V47, P1890, DOI 10.1109/78.771038 Graham D., 2006, EVOLUTION NERVOUS SY Gribonval R, 2001, IEEE T SIGNAL PROCES, V49, P994, DOI 10.1109/78.917803 HEUSDENS R, 2001, IEEE INT C AUD SPEEC IRINO T, 2006, IEEE T AUDIO SPEECH, V14, P2008 Irino T, 2001, J ACOUST SOC AM, V109, P2008, DOI 10.1121/1.1367253 Izhikevich EM, 2006, NEURAL COMPUT, V18, P245, DOI 10.1162/089976606775093882 Laxman S, 2007, IEEE T KNOWL DATA EN, V19, P1188, DOI [10.1109/TKDE.2007.1055, 10.1109/TKDE.2007.1055.] MALLAT SG, 1993, IEEE T SIGNAL PROCES, V41, P3397, DOI 10.1109/78.258082 Mannila H, 1997, DATA MIN KNOWL DISC, V1, P259, DOI 10.1023/A:1009748302351 NAJAFZADEH H, 2008, AUD ENG SOC CONV NET Patnaik D, 2008, SCI PROGRAMMING-NETH, V16, P49, DOI 10.3233/SPR-2008-0242 Patterson RD, 1986, FREQUENCY SELECTIVIT, P123 PICHEVAR R, 2007, AUD ENG SOC CONV AUS PICHEVAR R, 2008, EUR SIGN PROC C LAUS PICHEVAR R, 2010, INT JOINT C NEUR NET PICHEVAR R, 2008, AUD ENG SOC CONV NET Ravelli E, 2008, IEEE T AUDIO SPEECH, V16, P1361, DOI 10.1109/TASL.2008.2004290 Rozell CJ, 2008, NEURAL COMPUT, V20, P2526, DOI 10.1162/neco.2008.03-07-486 SMITH E, 2006, NATURE, V7079, P978 Smith E, 2005, NEURAL COMPUT, V17, P19, DOI 10.1162/0899766052530839 STRAHL S, 2008, BRAIN RES, V1220, P3 THIEDE T, 2000, AUD ENG C, P3 VERMA TS, 1999, ACOUST SPEECH SIG PR, P981 Zwicker E., 1990, PSYCHOACOUSTICS FACT ZWICKER E, 1984, J ACOUST SOC AM, V75, P219, DOI 10.1121/1.390398 NR 31 TC 6 Z9 7 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
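The spikegram representation in the Pichevar et al. record above is built by greedy projection onto gammatone/gammachirp kernels. The following is a minimal, generic matching pursuit sketch of that decomposition step only; the dictionary is left arbitrary, kernels are assumed unit-norm, and the adaptive kernel selection and masking model of the record are not reproduced:

```python
# Minimal generic matching pursuit sketch (the spikegram work uses gammatone/gammachirp
# kernels with adaptive selection; here the dictionary is an arbitrary list of arrays).
import numpy as np

def matching_pursuit(x, kernels, n_spikes=200):
    """Greedy decomposition of x into (kernel index, time, amplitude) 'spikes'."""
    residual = x.astype(float).copy()
    spikes = []
    for _ in range(n_spikes):
        best = None
        for k, g in enumerate(kernels):
            corr = np.correlate(residual, g, mode="valid")   # projection onto all shifts of kernel g
            t = int(np.argmax(np.abs(corr)))
            if best is None or abs(corr[t]) > abs(best[2]):
                best = (k, t, corr[t])
        k, t, a = best
        residual[t:t + len(kernels[k])] -= a * kernels[k]    # subtract the matched, scaled kernel
        spikes.append((k, t, a))
    return spikes, residual

# Kernels are assumed unit-norm, e.g. kernels = [g / np.linalg.norm(g) for g in raw_kernels]
```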
PD MAY-JUN PY 2011 VL 53 IS 5 SI SI BP 643 EP 657 DI 10.1016/j.specom.2010.09.008 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 757DV UT WOS:000290065900005 ER PT J AU Le Roux, J Kameoka, H Ono, N de Cheveigne, A Sagayama, S AF Le Roux, Jonathan Kameoka, Hirokazu Ono, Nobutaka de Cheveigne, Alain Sagayama, Shigeki TI Computational auditory induction as a missing-data model-fitting problem with Bregman divergence SO SPEECH COMMUNICATION LA English DT Article DE Auditory induction; Acoustical scene analysis; Missing data; Auxiliary function; Bregman divergence; EM algorithm; Non-negative matrix factorization; Harmonic-temporal clustering ID NONNEGATIVE MATRIX FACTORIZATION; SPEECH RECOGNITION; AUTOREGRESSIVE PROCESSES; ACOUSTIC-SIGNALS; LONG GAPS; INTERPOLATION; RECONSTRUCTION; REPRESENTATION; MUSIC AB The human auditory system has the ability, known as auditory induction, to estimate the missing parts of a continuous auditory stream briefly covered by noise and perceptually resynthesize them. In this article, we formulate this ability as a model-based spectrogram analysis and clustering problem with missing data, show how to solve it using an auxiliary function method, and explain how this method is generally related to the expectation-maximization (EM) algorithm for a certain type of divergence measures called Bregman divergences, thus enabling the use of prior distributions on the parameters. We illustrate how our method can be used to simultaneously analyze a scene and estimate missing information with two algorithms: the first, based on non-negative matrix factorization (NMF), performs analysis of polyphonic multi-instrumental musical pieces. Our method allows this algorithm to cope with gaps within the audio data, estimating the timbre of the instruments and their pitch, and reconstructing the missing parts. The second, based on a recently introduced technique for the analysis of complex acoustical scenes called harmonic-temporal clustering (HTC), enables us to perform robust fundamental frequency estimation from incomplete speech data. (C) 2010 Elsevier B.V. All rights reserved. C1 [Le Roux, Jonathan; Kameoka, Hirokazu] NTT Corp, NTT Commun Sci Labs, Kanagawa 2430198, Japan. [Le Roux, Jonathan; Ono, Nobutaka; Sagayama, Shigeki] Univ Tokyo, Grad Sch Informat Sci & Technol, Bunkyo Ku, Tokyo 1138656, Japan. [de Cheveigne, Alain] Univ Paris 05, Ctr Natl Rech Sci, F-75230 Paris 05, France. [de Cheveigne, Alain] Ecole Normale Super, F-75230 Paris 05, France. RP Le Roux, J (reprint author), NTT Corp, NTT Commun Sci Labs, 3-1 Morinosato Wakamiya, Kanagawa 2430198, Japan. EM leroux@cs.brl.ntt.co.jp; kameoka@cs.brl.ntt.co.jp; onono@hil.t.u-tokyo.ac.jp; Alain.de.Cheveigne@ens.fr; sagayama@hil.t.u-tokyo.ac.jp RI de Cheveigne, Alain/F-4947-2012 CR ACHAN K, 2005, P 2005 IEEE INT C AC, V5, P221, DOI 10.1109/ICASSP.2005.1416280 Bagshaw P., 1993, P EUR C SPEECH COMM, P1003 Banerjee A, 2005, J MACH LEARN RES, V6, P1705 Barker J., 2006, COMPUTATIONAL AUDITO, P297 Barker JP, 2005, SPEECH COMMUN, V45, P5, DOI 10.1016/j.specom.2004.05.002 Barlow H, 2001, BEHAV BRAIN SCI, V24, P602 Bertalmio M, 2000, COMP GRAPH, P417 Bregman AS., 1990, AUDITORY SCENE ANAL BREGMAN LM, 1967, COMP MATH MATH PHYS, V7, P620 Cemgil A.
T., 2008, CUEDFINFENGTR609 CEMGIL AT, 2005, P EUR SIGN PROC C EU CHERRY EC, 1953, J ACOUST SOC AM, V25, P975, DOI 10.1121/1.1907229 Clark P, 2008, INT CONF ACOUST SPEE, P3741, DOI 10.1109/ICASSP.2008.4518466 Cooke M, 2001, SPEECH COMMUN, V34, P267, DOI 10.1016/S0167-6393(00)00034-0 Criminisi A, 2004, IEEE T IMAGE PROCESS, V13, P1200, DOI 10.1109/TIP.2004.833105 CSISZAR I, 1975, ANN PROBAB, V3, P146, DOI 10.1214/aop/1176996454 de Cheveigne A, 2002, J ACOUST SOC AM, V111, P1917, DOI 10.1121/1.1458024 Eggert J., 2004, P IEEE INT JOINT C N, V4, P2529 Ellis D. P. W., 1996, THESIS MIT ELLIS DPW, 1993, P IEEE WORKSH APPL S Esquef PAA, 2006, IEEE T AUDIO SPEECH, V14, P1391, DOI 10.1109/TSA.2005.858018 Fevotte C, 2009, NEURAL COMPUT, V21, P793, DOI 10.1162/neco.2008.04-08-771 Fujisaki H., 1969, Annual Report of the Engineering Research Institute, Faculty of Engineering, University of Tokyo, V28 Godsill S.J., 1998, DIGITAL AUDIO RESTOR GRIFFIN DW, 1984, IEEE T ACOUST SPEECH, V32, P236, DOI 10.1109/TASSP.1984.1164317 Grunwald P. D., 2007, MINIMUM DESCRIPTION Helmholtz H., 1954, SENSATIONS TONE PHYS Helmholtz Hermann von, 1885, SENSATIONS TONE PHYS JANSSEN AJEM, 1986, IEEE T ACOUST SPEECH, V34, P317, DOI 10.1109/TASSP.1986.1164824 Kameoka H, 2007, IEEE T AUDIO SPEECH, V15, P982, DOI 10.1109/TASL.2006.885248 Kameoka H., 2007, THESIS U TOKYO Kameoka H, 2009, INT CONF ACOUST SPEE, P3437, DOI 10.1109/ICASSP.2009.4960364 Kashino M., 2006, Acoustical Science and Technology, V27, DOI 10.1250/ast.27.318 Lee DD, 2001, ADV NEUR IN, V13, P556 Lee DD, 1999, NATURE, V401, P788 LEROUX J, 2007, P AC SOC JPN AUT M, P351 Le Roux J, 2007, IEEE T AUDIO SPEECH, V15, P1135, DOI 10.1109/TASL.2007.894510 LEROUX J, 2008, 200811 METR U TOK LU L, 2003, P ICASSP, V5, P636 MAHER RC, 1994, J AUDIO ENG SOC, V42, P350 MAHER RC, 1993, 95 AES CONV NEW YORK, P1 MCAULAY RJ, 1986, IEEE T ACOUST SPEECH, V34, P744, DOI 10.1109/TASSP.1986.1164910 MENG XL, 1993, BIOMETRIKA, V80, P267, DOI 10.2307/2337198 Morup M., 2006, SPARSE NONNEGATIVE M Ono N., 2008, P SAPA SEPT, P23 Raj B, 2005, IEEE SIGNAL PROC MAG, V22, P101, DOI 10.1109/MSP.2005.1511828 Raj B, 2004, SPEECH COMMUN, V43, P275, DOI 10.1016/j.specom.2004.03.007 Rajan JJ, 1997, IEE P-VIS IMAGE SIGN, V144, P249, DOI 10.1049/ip-vis:19971305 RAYNER PJW, 1991, P IEEE WORKSH APPL S REYESGOMEZ M, 2004, P ISCA WORKSH STAT P, P25 Sajda P, 2003, P SOC PHOTO-OPT INS, V5207, P321, DOI 10.1117/12.504676 Schmidt M. N., 2006, P 6 INT C IND COMP A, P700 SMARAGDIS P, 2009, P IEEE WORKSH MACH L Smaragdis P., 2004, P INT C IND COMP AN, P494 VASEGHI SV, 1990, IEE PROC-I, V137, P38 Veldhuis R., 1990, RESTORATION LOST SAM Virtanenm T., 2008, P ISCA TUT RES WORKS, P17 von Helmholtz H.L.F., 1863, SENSATIONS TONE PHYS Wang D., 2006, COMPUTATIONAL AUDITO Warren R. M., 1982, AUDITORY PERCEPTION WARREN RM, 1970, SCIENCE, V167, P392, DOI 10.1126/science.167.3917.392 WOLFE PJ, 2005, P INT C AC SPEECH SI, V5, P517, DOI 10.1109/ICASSP.2005.1416354 NR 62 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
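One of the two algorithms in the Le Roux et al. record above is NMF that tolerates gaps in the spectrogram and fills them in from the fitted model. The sketch below shows the widely used masked (weighted) KL-divergence NMF multiplicative updates as a simplified stand-in; the rank, iteration count and initialisation are assumptions, and the record's auxiliary-function/Bregman-divergence framework and priors are not reproduced:

```python
# Simplified stand-in: KL-divergence NMF with a binary observation mask, so missing
# spectrogram cells are ignored during fitting and later filled in from W @ H.
import numpy as np

def masked_kl_nmf(V, M, rank=20, n_iter=200, eps=1e-10):
    """V: magnitude spectrogram (freq x frames); M: 1 where observed, 0 where missing."""
    F, T = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((F, rank)) + eps
    H = rng.random((rank, T)) + eps
    for _ in range(n_iter):
        R = W @ H + eps
        W *= ((M * V / R) @ H.T) / (M @ H.T + eps)   # masked multiplicative update for W
        R = W @ H + eps
        H *= (W.T @ (M * V / R)) / (W.T @ M + eps)   # masked multiplicative update for H
    V_filled = np.where(M > 0, V, W @ H)             # induction-style fill-in of the gaps
    return W, H, V_filled
```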
PD MAY-JUN PY 2011 VL 53 IS 5 SI SI BP 658 EP 676 DI 10.1016/j.specom.2010.08.009 PG 19 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 757DV UT WOS:000290065900006 ER PT J AU Li, JF Sakamoto, S Hongo, S Akagi, M Suzuki, Y AF Li, Junfeng Sakamoto, Shuichi Hongo, Satoshi Akagi, Masato Suzuki, Yoiti TI Two-stage binaural speech enhancement with Wiener filter for high-quality speech communication SO SPEECH COMMUNICATION LA English DT Article DE Binaural masking level difference; Equalization-cancellation model; Two-stage binaural speech enhancement (TS-BASE); Binaural cue preservation; Sound localization ID ARRAY HEARING-AIDS; NOISE-REDUCTION; ENVIRONMENTS; OUTPUT AB Speech enhancement has been researched extensively for many years to provide high-quality speech communication in the presence of background noise and concurrent interference signals. Human listening is robust against these acoustic interferences using only two ears, but state-of-the-art two-channel algorithms function poorly. Motivated by psychoacoustic studies of binaural hearing (equalization-cancellation (EC) theory), in this paper, we propose a two-stage binaural speech enhancement with Wiener filter (TS-BASE/WF) approach that is a two-input two-output system. In this proposed TS-BASE/WF, interference signals are first estimated by equalizing and cancelling the target signal in a way inspired by the EC theory, a time-variant Wiener filter is then applied to enhance the target signal given the noisy mixture signals. The main advantages of the proposed TS-BASE/WF are (1) effectiveness in dealing with non-stationary multiple-source interference signals, and (2) success in preserving binaural cues after processing. These advantages were confirmed according to the comprehensive objective and subjective evaluations in different acoustical spatial configurations in terms of speech enhancement and binaural cue preservation. (C) 2010 Elsevier B.V. All rights reserved. C1 [Li, Junfeng; Akagi, Masato] Japan Adv Inst Sci & Technol, Sch Informat Sci, Tokyo, Japan. [Sakamoto, Shuichi; Suzuki, Yoiti] Tohoku Univ, Elect Commun Res Inst, Sendai, Miyagi 980, Japan. [Hongo, Satoshi] Miyagi Natl Coll Technol, Dept Design & Comp Applicat, Sendai, Miyagi, Japan. RP Li, JF (reprint author), Japan Adv Inst Sci & Technol, Sch Informat Sci, Tokyo, Japan. EM junfeng@jaist.ac.jp CR AICHNER R, 2007, P ICASSP2007 Blauert J., 1997, SPATIAL HEARING PSYC BOGAERT TV, 2007, ICASSP2007, P565 BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 Brandstein M., 2001, MICROPHONE ARRAYS SI, V1st Campbell DR, 2003, SPEECH COMMUN, V39, P97, DOI 10.1016/S0167-6393(02)00061-4 Cappe O, 1994, IEEE T SPEECH AUDI P, V2, P345, DOI 10.1109/89.279283 Desloge JG, 1997, IEEE T SPEECH AUDI P, V5, P529, DOI 10.1109/89.641298 Doclo S, 2007, SPEECH COMMUN, V49, P636, DOI 10.1016/j.specom.2007.02.001 DORBECKER M, 1996, EUSIPCO1996, P995 Durlach N. 
I., 1972, F MODERN AUDITORY TH, V2, P369 DURLACH NI, 1963, J ACOUST SOC AM, V35, P1206, DOI 10.1121/1.1918675 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 Gannot S, 2001, IEEE T SIGNAL PROCES, V49, P1614, DOI 10.1109/78.934132 Griffiths J., 1982, IEEE T ANTENN PROPAG, V30, P27 Klasen TJ, 2007, IEEE T SIGNAL PROCES, V55, P1579, DOI 10.1109/TSP.2006.888897 KOCK WE, 1950, J ACOUST SOC AM, V22, P801, DOI 10.1121/1.1906692 Kollmeier B, 1993, Scand Audiol Suppl, V38, P28 Li JF, 2008, 2008 INTERNATIONAL CONFERENCE ON AUDIO, LANGUAGE AND IMAGE PROCESSING, VOLS 1 AND 2, PROCEEDINGS, P97 Li JF, 2008, IEICE T FUND ELECTR, VE91A, P1337, DOI 10.1093/ietfec/e91-a.6.1337 Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst Lotter T., 2005, P EUROSPEECH2005, P2285 Nakashima H., 2003, Acoustical Science and Technology, V24, DOI 10.1250/ast.24.172 Quackenbush S. R., 1988, OBJECTIVE MEASURES S Roman N, 2006, J ACOUST SOC AM, V120, P4040, DOI 10.1121/1.2355480 Scalart P., 1996, IEEE INT C AC SPEECH, V2, P629 Shields PW, 2001, J ACOUST SOC AM, V110, P3232, DOI 10.1121/1.1413750 Suzuki Y, 1999, IEICE T FUND ELECTR, VE82A, P588 Vincent E, 2006, IEEE T AUDIO SPEECH, V14, P1462, DOI 10.1109/TSA.2005.858005 Waibel Alex, 2008, 2008 Second International Symposium on Universal Communication, DOI 10.1109/ISUC.2008.78 Wang D., 2006, COMPUTATIONAL AUDITO Welker DP, 1997, IEEE T SPEECH AUDI P, V5, P543, DOI 10.1109/89.641299 Wiener N., 1949, EXTRAPOLATION INTERP NR 33 TC 6 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY-JUN PY 2011 VL 53 IS 5 SI SI BP 677 EP 689 DI 10.1016/j.specom.2010.04.009 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 757DV UT WOS:000290065900007 ER PT J AU Bach, JH Anemuller, J Kollmeier, B AF Bach, Joerg-Hendrik Anemueller, Joern Kollmeier, Birger TI Robust speech detection in real acoustic backgrounds with perceptually motivated features SO SPEECH COMMUNICATION LA English DT Article DE Speech detection; Pattern classification; Amplitude modulations; Fluctuating noise; Real-world scenario ID AUDITORY-CORTEX; RECOGNITION; CLASSIFICATION; PERIODICITY; FREQUENCY; PLP AB The current study presents an analysis of the robustness of a speech detector in real background sounds. One of the most important aspects of automatic speech/nonspeech classification is robustness in the presence of strongly varying external conditions. These include variations of the signal-to-noise ratio as well as fluctuations of the background noise. These variations are systematically evaluated by choosing different mismatched conditions between training and testing of the speech/nonspeech classifiers. The detection performance of the classifier with respect to these mismatched conditions is used as a measure of robustness and generalisation. The generalisation towards un-trained SNR conditions and unknown background noises is evaluated and compared to a matched baseline condition. The classifier consists of a feature front-end, which computes amplitude modulation spectral features (AMS), and a support vector machine (SVM) back-end. The AMS features are based on Fourier decomposition over time of short-term spectrograms. Mel-frequency cepstral coefficients (MFCC) as well as relative spectral features (RASTA) based on perceptual linear prediction (PLP) serve as baseline. 
The results show that RASTA-filtered PLP features perform best in the matched task. In the generalisation tasks however, the AMS features emerge as more robust in most cases, while MFCC features are outperformed by both other feature types. In a second set of experiments, a hierarchical approach is analysed which employs a background classification step prior to the speech/nonspeech classifier in order to improve the robustness of the detection scores in novel backgrounds. The background sounds used are recorded in typical everyday scenarios. The hierarchy provides a benefit in overall performance if the robust AMS features are employed. The generalisation capabilities of the hierarchy towards novel backgrounds and SNRs are found to be optimal when a limited number of training backgrounds is used (compared to the inclusion of all available background data). The best backgrounds in terms of generalisation capabilities are found to be backgrounds in which some component of speech (such as unintelligible background babble) is present, which corroborates the hypothesis that the AMS features provide a decomposition of signals which is by itself very suitable for training very general speech/nonspeech detectors. This is also supported by the finding that the SVMs combined with RASTA-PLPs require nonlinear kernels to reach a similar performance as the AMS patterns with linear kernels. (C) 2010 Elsevier B.V. All rights reserved. C1 [Bach, Joerg-Hendrik; Anemueller, Joern; Kollmeier, Birger] Carl von Ossietzky Univ Oldenburg, Dept Med Phys, D-26111 Oldenburg, Germany. RP Bach, JH (reprint author), Carl von Ossietzky Str 8-11, D-26111 Oldenburg, Germany. EM j.bach@uni-oldenburg.de FU European Union; European Graduate School for Neurosensory Science and Systems FX This work was supported by the European Union within the 6th Framework Programme through the IP IST DIRAC, and by the European Graduate School for Neurosensory Science and Systems. We are indebted to Nina Pohl and Georg Klump of the Group for Animal Physiology and Behaviour at the University of Oldenburg for providing high quality, undisturbed recordings of natural scenes, and to Hendrik Kayser, who contributed significantly to this work, most notably the office and city recordings. We thank Bernd T. Meyer for giving helpful advice in numerous discussions, and for proofreading an earlier version of this manuscript.
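The Bach et al. record above describes AMS features as a Fourier decomposition over time of short-term spectrogram envelopes, later fed to an SVM. The following is a minimal illustrative sketch of that type of feature; the sub-band grouping, window length and retained modulation range are assumptions, not the paper's exact front-end:

```python
# Illustrative amplitude modulation spectrogram (AMS) features: an FFT over time of
# log sub-band envelope trajectories taken from a short-term spectrogram.
import numpy as np
from scipy.signal import stft

def ams_features(x, fs=16000, n_bands=17, win=32, hop=16):
    f, t, Z = stft(x, fs, nperseg=512, noverlap=352)          # ~10 ms frame hop
    E = np.log(np.abs(Z) + 1e-10)
    # group FFT bins into coarse sub-bands (a crude stand-in for an auditory filterbank)
    edges = np.linspace(0, E.shape[0], n_bands + 1, dtype=int)
    bands = np.stack([E[a:b].mean(axis=0) for a, b in zip(edges[:-1], edges[1:])])
    feats = []
    for start in range(0, bands.shape[1] - win + 1, hop):
        seg = bands[:, start:start + win] * np.hanning(win)   # one ~320 ms analysis window
        mod = np.abs(np.fft.rfft(seg, axis=1))                # modulation spectrum per sub-band
        feats.append(mod[:, 1:9].ravel())                     # keep low modulation frequencies (roughly 3-25 Hz)
    return np.array(feats)                                    # one AMS vector per window, e.g. input to an SVM
```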
CR ANEMULLER J, 2008, P INT BRISB AUSTR ANEMULLER J, 2008, 1 INT C COGN SYST CO Bee MA, 2004, J NEUROPHYSIOL, V92, P1088, DOI 10.1152/jn.00884.2003 Bregman AS., 1990, AUDITORY SCENE ANAL Buchler M, 2005, EURASIP J APPL SIG P, V2005, P2991, DOI 10.1155/ASP.2005.2991 Chang C.-C., 2001, LIBSVM LIB SUPPORT V DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 DRESCHLER WA, 1999, J ACOUST SOC AM, V105, P1296, DOI 10.1121/1.426174 Garofolo JS, 1993, TIMIT ACOUSTIC PHONE GRAMSS T, 1990, SPEECH COMMUN, V9, P35, DOI 10.1016/0167-6393(90)90043-9 GREENBERG S, 1997, P ICASSP MUN GERM Happel MFK, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P670 Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616 HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423 *ITU, 1996, REC G X729 ANN B Kingsbury BED, 1997, INT CONF ACOUST SPEE, P1259, DOI 10.1109/ICASSP.1997.596174 Kingsbury BED, 1998, SPEECH COMMUN, V25, P117, DOI 10.1016/S0167-6393(98)00032-6 KLEINSCHMIDT M, 2002, ROBUST SPEECH RECOGN KOLLMEIER B, 1994, J ACOUST SOC AM, V95, P1593, DOI 10.1121/1.408546 Langner G, 1997, J COMP PHYSIOL A, V181, P665, DOI 10.1007/s003590050148 Lin HT, 2007, MACH LEARN, V68, P267, DOI 10.1007/s10994-007-5018-6 LUO J, 2008, P ICVS SANT GREEC MAGANTI HK, 2007, P ICASSP HON MARKAKI M, 2008, WORKSH STAT PERC AUD, P7 Martin A. F., 1997, P EUROSPEECH, P1895 Marzinzik M, 2002, IEEE T SPEECH AUDI P, V10, P109, DOI 10.1109/89.985548 Mesgarani N, 2008, J ACOUST SOC AM, V123, P899, DOI 10.1121/1.2816572 Mesgarani N, 2006, IEEE T AUDIO SPEECH, V14, P920, DOI 10.1109/TSA.2005.858055 Meyer BT, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P906 OSTENDORF M, 1998, P FORTSCHR AK DAGA 9, P402 Platt J., 2000, PROBABILISTIC OUTPUT RABINER LR, 1975, AT&T TECH J, V54, P297 SCHREINER CE, 1988, J NEUROPHYSIOL, V60, P1823 SHIRE ML, 2000, ICASSP IST Tchorz J, 2003, IEEE T SPEECH AUDI P, V11, P184, DOI 10.1109/TSA.2003.811542 Vapnik V., 1995, NATURE STAT LEARNING Young S., 2002, HTK BOOK HTK VERSION ZWICKER E, 1961, J ACOUST SOC AM, V33, P248, DOI 10.1121/1.1908630 NR 38 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY-JUN PY 2011 VL 53 IS 5 SI SI BP 690 EP 706 DI 10.1016/j.specom.2010.07.003 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 757DV UT WOS:000290065900008 ER PT J AU Yin, H Hohmann, V Nadeu, C AF Yin, Hui Hohmann, Volker Nadeu, Climent TI Acoustic features for speech recognition based on Gammatone filterbank and instantaneous frequency SO SPEECH COMMUNICATION LA English DT Article DE Gammatone filterbank; Instantaneous frequency; Speech recognition AB Most of the features used by modern automatic speech recognition systems, such as mel-frequency cepstral coefficients (MFCC) and perceptual linear predictive (PLP) coefficients, represent spectral envelope of the speech signal only. Nevertheless, phase or frequency modulation as represented in recent perceptual models of the peripheral auditory system might also contribute to speech decoding. Furthermore, such features can be complementary to the envelope features. This paper proposes a variety of features based on a linear auditory filterbank, the Gammatone filterbank. 
Envelope features are derived from the envelope of the subband filter outputs. Phase/frequency modulation is represented by the subband instantaneous frequency (IF) and is used explicitly by concatenating envelope-based and IF-based features or is used implicitly by IF-based frequency reassignment. Speech recognition experiments using a standard HMM-based recognizer under both clean training and multi-condition training are conducted on a Chinese mandarin digits corpus. The experimental results show that the proposed envelope and phase based features can improve recognition rates in clean and noisy conditions compared to the reference MFCC-based recognizer. (C) 2010 Elsevier B.V. All rights reserved. C1 [Yin, Hui] Beijing Inst Technol, Dept Elect Engn, Beijing 100081, Peoples R China. [Yin, Hui; Hohmann, Volker; Nadeu, Climent] Univ Politecn Cataluna, TALP Res Ctr, Barcelona, Spain. [Hohmann, Volker] Carl von Ossietzky Univ Oldenburg, D-2900 Oldenburg, Germany. RP Yin, H (reprint author), Beijing Inst Technol, Dept Elect Engn, Beijing 100081, Peoples R China. EM hchhuihui@gmail.com; volker.hohmann@uni-oldenburg.de; climent.nadeu@upc.edu RI Nadeu, Climent/B-9638-2014 OI Nadeu, Climent/0000-0002-5863-0983 FU Spanish Ministry of Education and Science [TEC2007-65470]; National Nature Science Foundation of China [NSFC 60605015] FX This research was partially supported by the Spanish project SAPIRE (TEC2007-65470) as well as a research grant to Volker Hohmann from the Spanish Ministry of Education and Science, and partially supported by the National Nature Science Foundation of China under Grant NSFC 60605015. CR Alsteris LD, 2007, DIGIT SIGNAL PROCESS, V17, P578, DOI 10.1016/j.dsp.2006.06.007 BOER E, 1978, J ACOUST SOC AM, V63, P115 Dimitriadis D, 2005, IEEE SIGNAL PROC LET, V12, P621, DOI 10.1109/LSP.2005.853050 *ETSI, 2003, 2002212 ETSI ES Gardner TJ, 2005, J ACOUST SOC AM, V117, P2896, DOI 10.1121/1.1863072 GU L, 2001, IEEE ICASSP 2001 Haque S, 2009, SPEECH COMMUN, V51, P58, DOI 10.1016/j.specom.2008.06.002 Herzke T, 2005, EURASIP J APPL SIG P, V2005, P3034, DOI 10.1155/ASP.2005.3034 HOHMANN V, 2006, INT S HEAR ISH 2006, P11 Hohmann V, 2002, ACTA ACUST UNITED AC, V88, P433 HOLMBERG M, 2007, SPEECH COMMUN, P917 IKBAL S, 2003, ACOUST SPEECH SIG PR, P133 Johannesma PIM, 1972, P IPO S HEARING THEO, P58 Kleinschmidt M, 2001, SPEECH COMMUN, V34, P75, DOI 10.1016/S0167-6393(00)00047-9 KUBO Y, 2008, IEICE T INF SYST, V91, P8 KUMARESAN R, 2003, AS C SIGN SYST COMP, V2, P2078 Potamianos A, 1996, J ACOUST SOC AM, V99, P3795, DOI 10.1121/1.414997 MOORE BCJ, 1983, J ACOUST SOC AM, V74, P750, DOI 10.1121/1.389861 Munkong R., 2008, IEEE SIGNAL PROCESSI, P98 Pagano M., 2000, PRINCIPLES BIOSTATIS Patterson RD, 1987, M IOC SPEECH GROUP A Plante F, 1998, IEEE T SPEECH AUDI P, V6, P282, DOI 10.1109/89.668821 Potamianos A, 2001, IEEE T SPEECH AUDI P, V9, P196, DOI 10.1109/89.905994 Schluter R, 2007, INT CONF ACOUST SPEE, P649 STARK AP, 2008, INTERSPEECH WANG Y, 2003, EPIDEMIC SPREADING R, P25 Wang YR, 2006, LECT NOTES COMPUT SC, V4274, P370 Young S., 1995, HIDDEN MARKOV MODEL NR 28 TC 2 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
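The Yin et al. record above derives phase/frequency-modulation information from subband instantaneous frequency. A minimal sketch of extracting subband IF via the analytic-signal phase is given below; a Butterworth bandpass stands in for the Gammatone filter, and the band edges and filter order are assumptions:

```python
# Sketch: subband instantaneous frequency (IF) from the analytic-signal phase of a
# band-passed signal; a Butterworth bandpass stands in for a Gammatone filter here.
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def subband_instantaneous_frequency(x, fs, f_lo, f_hi):
    sos = butter(4, [f_lo / (fs / 2), f_hi / (fs / 2)], btype="band", output="sos")
    band = sosfiltfilt(sos, x)
    phase = np.unwrap(np.angle(hilbert(band)))
    inst_freq = np.diff(phase) * fs / (2 * np.pi)   # in Hz; deviation from the band centre carries FM information
    return inst_freq

# Example: if_track = subband_instantaneous_frequency(signal, 16000, 400, 600)
```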
PD MAY-JUN PY 2011 VL 53 IS 5 SI SI BP 707 EP 715 DI 10.1016/j.specom.2010.04.008 PG 9 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 757DV UT WOS:000290065900009 ER PT J AU Kubo, Y Okawa, S Kurematsu, A Shirai, K AF Kubo, Yotaro Okawa, Shigeki Kurematsu, Akira Shirai, Katsuhiko TI Temporal AM-FM combination for robust speech recognition SO SPEECH COMMUNICATION LA English DT Article DE Speech recognition; Frequency modulation; Multistream speech recognition; HMM/MLP-tandem approach ID NOISY; FREQUENCY; HUMANS AB A novel method for feature extraction from the frequency modulation (FM) in speech signals is proposed for robust speech recognition. To exploit the potential of multistream speech recognizers, each stream should compensate for the shortcomings of the other streams. In this light, FM features are promising as complementary features of amplitude modulation (AM). In order to extract effective features from FM patterns, we applied the proposed feature extraction method by the data-driven modulation analysis of instantaneous frequency. By evaluating the frequency responses of the temporal filters obtained by the proposed method, we confirmed that the modulation observed around 4 Hz is important for the discrimination of FM patterns, as in the case of AM features. We evaluated the robustness of our method by performing noisy speech recognition experiments. We confirmed that our FM features can improve the noise robustness of speech recognizers even when the FM features are not combined with conventional AM and/or spectral envelope features. We also performed multistream speech recognition experiments. The experimental results show that combination of the conventional AM system and proposed FM system reduced word error by 43.6% at 10 dB SNR as compared to the baseline MFCC system and by 20.2% as compared to the conventional AM system. We investigated the complementarity of the AM and FM features by performing speech recognition experiments in artificial noisy environments. We found the FM features to be robust to wide-band noise, which certainly degrades the performance of AM features. Further, we evaluated the efficiency of multiconditional training. Although the performance of the proposed combination method was degraded by multiconditional training, we confirmed that the performance of the proposed FM method improved. Through a series of experiments, we confirmed that our FM features can be used as independent features as well as complementary features. (C) 2010 Elsevier B.V. All rights reserved. C1 [Kubo, Yotaro; Kurematsu, Akira; Shirai, Katsuhiko] Waseda Univ, Dept Comp Sci, Shinjuku Ku, Tokyo 1698555, Japan. [Okawa, Shigeki] Chiba Inst Technol, Chiba 2750016, Japan. RP Kubo, Y (reprint author), Waseda Univ, Dept Comp Sci, Shinjuku Ku, 3-4-1 Ohkubo, Tokyo 1698555, Japan. EM yotaro@ieee.org FU Ministry of Education, Culture, Sports, Science and Technology, Japan [21.04190] FX The authors would like to thank the anonymous reviewers and the editor for their valuable comments and suggestions for improving the quality of this paper. The authors also would like to thank Prof. Mikio Tohyama for introducing them to perceptual studies on zero-crossing points of signals. This study was partly supported by a Grant-in-Aid for JSPS Fellows (21.04190) from the Ministry of Education, Culture, Sports, Science and Technology, Japan.
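The Kubo et al. abstract above reports that temporal modulations around 4 Hz carry most of the discriminative information in both the AM and FM streams. The sketch below simply band-passes a frame-level feature trajectory around that rate; the paper's temporal filters are learned from data, so the fixed 2-8 Hz Butterworth filter and the 100 Hz frame rate used here are assumptions for illustration only.

```python
# Sketch: emphasize ~4 Hz temporal modulations in a frame-level feature trajectory
# (frame rate assumed to be 100 Hz). The paper derives its temporal filters from
# data; a fixed 2-8 Hz band-pass is used here purely for illustration.
import numpy as np
from scipy.signal import butter, sosfiltfilt

frame_rate = 100.0                        # frames per second (assumed)
sos = butter(2, [2.0, 8.0], btype="bandpass", fs=frame_rate, output="sos")

traj = np.random.randn(500)               # e.g. one IF (or cepstral) coefficient over time
traj_4hz = sosfiltfilt(sos, traj)          # trajectory with ~4 Hz modulations emphasized
```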
CR Allen JB, 1994, IEEE T SPEECH AUDI P, V2, P567, DOI 10.1109/89.326615 BOASHASH B, 1992, P IEEE, V80, P520, DOI 10.1109/5.135376 CHEN B, 2005, P ICASSP 05, V1, P945, DOI 10.1109/ICASSP.2005.1415271 CHEN B, 2003, P EUROSPEECH, P853 Chen JD, 2004, IEEE SIGNAL PROC LET, V11, P258, DOI 10.1109/LSP.2003.821689 Dimitriadis D, 2005, IEEE SIGNAL PROC LET, V12, P621, DOI 10.1109/LSP.2005.853050 GAJIC B, 2003, P ICASSP 2006 HONG K, P62 Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616 HERMANSKY H, 1998, P ICSLP 98 SYDN AUST Hermansky H., 2000, ACOUST SPEECH SIG PR, P1635 Hermansky H, 1998, SPEECH COMMUN, V25, P3, DOI 10.1016/S0167-6393(98)00027-2 HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423 HOUTGAST T, 1985, J ACOUST SOC AM, V77, P1069, DOI 10.1121/1.392224 IKBAL S, 2004, P INTERSPPECH ICSLP, P2553 Janin A., 1999, P 6 EUR C SPEECH COM, P591 Kaiser J., 1993, P IEEE ICASSP 93, P149 Kim DS, 1999, IEEE T SPEECH AUDI P, V7, P55 Kubo Y, 2008, INT CONF ACOUST SPEE, P4709 Kubo Y, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P642 Kubo Y, 2008, IEICE T INF SYST, VE91D, P448, DOI 10.1093/ietisy/e91-d.3.448 Lippmann RP, 1997, SPEECH COMMUN, V22, P1, DOI 10.1016/S0167-6393(97)00021-6 Morgan N., 1995, IEEE SIGNAL PROC MAY, P25 Morgan N, 2005, IEEE SIGNAL PROC MAG, V22, P81, DOI 10.1109/MSP.2005.1511826 Nakamura S, 2005, IEICE T INF SYST, VE88D, P535, DOI 10.1093/ietisy/e88-d.3.535 Rumelhart DH, 1986, PARALLEL DISTRIBUTED, V1 Sharma S., 1999, THESIS OREGON GRADUA Suzuki H., 2006, Acoustical Science and Technology, V27, DOI 10.1250/ast.27.163 VUUREN S, 1997, P EUROSPEECH 1997, P409 Wang Y., 2003, P 22 INT S REL DISTR, P25 YOSHIDA K, 2002, P ICASSP 2002 NR 30 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY-JUN PY 2011 VL 53 IS 5 SI SI BP 716 EP 725 DI 10.1016/j.specom.2010.08.012 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 757DV UT WOS:000290065900010 ER PT J AU Markaki, M Stylianou, Y AF Markaki, Maria Stylianou, Yannis TI Discrimination of speech from nonspeeech in broadcast news based on modulation frequency features SO SPEECH COMMUNICATION LA English DT Article DE Speech discrimination; Modulation spectrum; Mutual information; Higher order singular value decomposition ID SPEAKER DIARIZATION; SEGMENTATION; SYSTEMS AB In audio content analysis, the discrimination of speech and non-speech is the first processing step before speaker segmentation and recognition, or speech transcription. Speech/non-speech segmentation algorithms usually consist of a frame-based scoring phase using MFCC features, combined with a smoothing phase. In this paper, a content based speech discrimination algorithm is designed to exploit long-term information inherent in modulation spectrum. In order to address the varying degrees of redundancy and discriminative power of the acoustic and modulation frequency subspaces, we first employ a generalization of SVD to tensors (Higher Order SVD) to reduce dimensions. Projection of modulation spectral features on the principal axes with the higher energy in each subspace results in a compact set of features with minimum redundancy. We further estimate the relevance of these projections to speech discrimination based on mutual information to the target class. 
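The Markaki and Stylianou record above works on a joint acoustic-frequency by modulation-frequency representation. A minimal sketch of such a modulation spectrum (STFT magnitudes followed by a second FFT along time in each acoustic channel) follows; the paper's HOSVD reduction and mutual-information feature selection are not reproduced, and the window and hop settings are illustrative.

```python
# Sketch: joint acoustic-frequency x modulation-frequency representation.
# Step 1: STFT magnitudes give per-channel temporal envelopes.
# Step 2: an FFT along the time axis of each channel gives modulation frequencies.
# The HOSVD dimensionality reduction and mutual-information-based selection
# described in the record are not shown.
import numpy as np
from scipy.signal import stft

def modulation_spectrum(x, fs, nperseg=512, hop=160):
    f, t, X = stft(x, fs=fs, nperseg=nperseg, noverlap=nperseg - hop)
    env = np.abs(X)                                   # acoustic channels x frames
    env = env - env.mean(axis=1, keepdims=True)       # remove DC per channel
    mod = np.abs(np.fft.rfft(env, axis=1))            # modulation spectrum per channel
    frame_rate = fs / hop
    mod_freqs = np.fft.rfftfreq(env.shape[1], d=1.0 / frame_rate)
    return f, mod_freqs, mod                          # acoustic freqs, mod freqs, magnitudes

fs = 16000
x = np.random.randn(fs * 2)                           # 2 s of toy audio
f, mf, M = modulation_spectrum(x, fs)
print(M.shape)                                        # (acoustic bins, modulation bins)
```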
This system is built upon a segment-based SVM classifier in order to recognize the presence of voice activity in audio signal. Detection experiments using Greek and US English broadcast news data composed of many speakers in various acoustic conditions suggest that the system provides complementary information to state-of-the-art mel-cepstral features. (C) 2010 Elsevier B.V. All rights reserved. C1 [Markaki, Maria; Stylianou, Yannis] Univ Crete, Dept Comp Sci, Iraklion, Greece. RP Markaki, M (reprint author), Univ Crete, Dept Comp Sci, Iraklion, Greece. EM mmarkaki@csd.uoe.gr CR Aronowitz H, 2007, INT CONF ACOUST SPEE, P393 Atlas L, 2003, EURASIP J APPL SIG P, V2003, P668, DOI 10.1155/S1110865703305013 ATLAS L, 2005, MODULATION TOOLBOX M Barras C, 2006, IEEE T AUDIO SPEECH, V14, P1505, DOI 10.1109/TASL.2006.878261 Cover T M, 1991, ELEMENTS INFORM THEO De Lathauwer L, 2000, SIAM J MATRIX ANAL A, V21, P1253, DOI 10.1137/S0895479896305696 GREENBERG S, 1997, P ICASSP, V3, P1647 HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423 Jain A, 2005, PATTERN RECOGN, V38, P2270, DOI 10.1016/j.patcog.2005.01.012 Joachims T., 1999, ADV KERNEL METHODS S, P41 Kinnunen T., 2008, P OD SPEAK LANG REC KINNUNEN T, 2007, P SPECOM 2007 Lu L, 2003, MULTIMEDIA SYST, V8, P482, DOI 10.1007/s00530-002-0065-0 Malyska N, 2005, INT CONF ACOUST SPEE, P873 MARKAKI M, 2009, P IEEE EMBC 09 Martin A., 1997, DET CURVE ASSESSMENT, P1895 Mesgarani N, 2006, IEEE T AUDIO SPEECH, V14, P920, DOI 10.1109/TSA.2005.858055 Peng HC, 2005, IEEE T PATTERN ANAL, V27, P1226 QUATIERI TF, 2003, IEEE WORKSH APPL SIG Reynolds DA, 2000, DIGIT SIGNAL PROCESS, V10, P19, DOI 10.1006/dspr.1999.0361 Sanderson C., 2002, 0233 IDIAPRR Saunders J, 1996, INT CONF ACOUST SPEE, P993, DOI 10.1109/ICASSP.1996.543290 SCHEIRER E, 1997, P IEEE INT C AC SPEE, P1331 Schimmel SM, 2007, INT CONF ACOUST SPEE, P605 Slonim N, 2005, ESTIMATING MUTUAL IN Spina MS, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P594 Sukittanon S, 2004, IEEE T SIGNAL PROCES, V52, P3023, DOI 10.1109/TSP.2004.833861 Tranter SE, 2006, IEEE T AUDIO SPEECH, V14, P1557, DOI 10.1109/TASL.2006.878256 Wooters C., 2004, P FALL 2004 RICH TRA NR 29 TC 7 Z9 8 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY-JUN PY 2011 VL 53 IS 5 SI SI BP 726 EP 735 DI 10.1016/j.specom.2010.08.007 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 757DV UT WOS:000290065900011 ER PT J AU Heckmann, M Domont, X Joublin, F Goerick, C AF Heckmann, Martin Domont, Xavier Joublin, Frank Goerick, Christian TI A hierarchical framework for spectro-temporal feature extraction SO SPEECH COMMUNICATION LA English DT Article DE Spectro-temporal; Auditory; Robust speech recognition; Image processing; Learning; Competition; Hierarchical ID INDEPENDENT COMPONENT ANALYSIS; AUTOMATIC SPEECH RECOGNITION; OBJECT RECOGNITION; RECEPTIVE-FIELDS; AUDITORY-CORTEX; SOUNDS; ENHANCEMENT; MODULATIONS; PERCEPTION; COMPLEX AB In this paper we present a hierarchical framework for the extraction of spectro-temporal acoustic features. The design of the features targets higher robustness in dynamic environments. Motivated by the large gap between human and machine performance in such conditions we take inspirations from the organization of the mammalian auditory cortex in the design of our features. 
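As the continuation of the Heckmann et al. abstract below explains, the first layer of the HIST hierarchy learns its spectro-temporal receptive fields from spectrogram patches with Independent Component Analysis. The following sketch shows that step only, using scikit-learn's FastICA on randomly sampled log-spectrogram patches; patch size, patch count and spectrogram settings are assumptions, not values taken from the paper.

```python
# Sketch: learn first-layer spectro-temporal receptive fields with ICA, as described
# in the continuation of this record's abstract below. Patch size, patch count and
# spectrogram settings are illustrative only.
import numpy as np
from scipy.signal import stft
from sklearn.decomposition import FastICA

def learn_receptive_fields(x, fs, patch=(16, 16), n_patches=2000, n_components=32):
    _, _, X = stft(x, fs=fs, nperseg=256, noverlap=192)
    S = np.log(np.abs(X) + 1e-8)                       # log-magnitude spectrogram
    rng = np.random.default_rng(0)
    ph, pw = patch
    patches = np.empty((n_patches, ph * pw))
    for i in range(n_patches):                         # sample random spectrogram patches
        r = rng.integers(0, S.shape[0] - ph)
        c = rng.integers(0, S.shape[1] - pw)
        patches[i] = S[r:r + ph, c:c + pw].ravel()
    ica = FastICA(n_components=n_components, random_state=0, max_iter=1000)
    ica.fit(patches)
    return ica.components_.reshape(n_components, ph, pw)   # learned receptive fields

fields = learn_receptive_fields(np.random.randn(16000 * 5), 16000)
print(fields.shape)
```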
This includes the joint processing of spectral and temporal information, the organization in hierarchical layers, competition between coequal features, the use of high-dimensional sparse feature spaces, and the learning of the underlying receptive fields in a data-driven manner. Due to these properties we termed the features as hierarchical spectro-temporal (HIST) features. For the learning of the features at the first layer we use Independent Component Analysis (ICA). At the second layer of our feature hierarchy we apply Non-Negative Sparse Coding (NNSC) to obtain features spanning a larger frequency and time region. We investigate the contribution of the different subparts of this feature extraction process to the overall performance. This includes an analysis of the benefits of the hierarchical processing, the comparison of different feature extraction methods on the first layer, the evaluation of the feature competition, and the investigation of the influence of different receptive field sizes on the second layer. Additionally, we compare our features to MFCC and RASTA-PLP features in a continuous digit recognition task in noise. On a wideband dataset we constructed ourselves based on the Aurora-2 task, as well as on the actual Aurora-2 database. We show that a combination of the proposed HIST features and RASTA-PLP features yields significant improvements and that the proposed features carry complementary information to RASTA-PLP and MFCC features. (C) 2010 Elsevier B.V. All rights reserved. C1 [Heckmann, Martin; Domont, Xavier; Joublin, Frank; Goerick, Christian] Honda Res Inst Europe GmbH, D-63073 Offenbach, Germany. [Domont, Xavier] Tech Univ Darmstadt, Control Theory & Robot Lab, D-64283 Darmstadt, Germany. RP Heckmann, M (reprint author), Honda Res Inst Europe GmbH, D-63073 Offenbach, Germany. EM martin.heckmann@honda-ri.de; xavier.domont@rtr.tu-darmstadt.de; frank.joublin@honda-ri.de; christian.goerick@honda-ri.de CR BAER T, 1993, J REHABIL RES DEV, V30, P49 Behnke S., 2003, P INT JOINT C NEUR N, V4, P2758, DOI 10.1109/IJCNN.2003.1224004 CHEN B, 2004, P 8 INT C SPOK LANG CHILDERS DG, 1991, J ACOUST SOC AM, V90, P2394, DOI 10.1121/1.402044 Cho YC, 2005, PATTERN RECOGN LETT, V26, P1327, DOI 10.1016/j.patrec.2004.11.026 COMON P, 1994, SIGNAL PROCESS, V36, P287, DOI 10.1016/0165-1684(94)90029-9 CRICK F, 1984, P NATL ACAD SCI-BIOL, V81, P4586, DOI 10.1073/pnas.81.14.4586 DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 deCharms RC, 1998, SCIENCE, V280, P1439, DOI 10.1126/science.280.5368.1439 Domont X, 2008, INT CONF ACOUST SPEE, P4417, DOI 10.1109/ICASSP.2008.4518635 DOMONT X, 2007, LECT NOTES COMPUTER, P142 DUSAN S, 2005, 9 EUR C SPEEC COMM T ELHILALI M, 2006, P INT C AC SPEECH SI EZZAT T, 2007, P INTERSPEECH ISCA A Fant G., 1970, ACOUSTIC THEORY SPEE FANT G, 1979, SPEECH TRANSMISS LAB, V1, P70 Felleman DJ, 1991, CEREB CORTEX, V1, P1, DOI 10.1093/cercor/1.1.1 FERGUS R, 2003, P IEEE COMP SOC C CO, V2 Flynn R, 2008, SPEECH COMMUN, V50, P797, DOI 10.1016/j.specom.2008.05.004 Fritz J, 2003, NAT NEUROSCI, V6, P1216, DOI 10.1038/nn1141 FUKUSHIMA K, 1980, BIOL CYBERN, V36, P193, DOI 10.1007/BF00344251 Garofolo J. 
S., 1993, DARPA TIMIT ACOUSTIC Glaser C, 2010, IEEE T AUDIO SPEECH, V18, P224, DOI 10.1109/TASL.2009.2025536 Hague S., 2009, SPEECH COMMUN, V51, P58 HECKMANN M, 2010, ISCA TUT RE IN PRESS HECKMANN M, 2009, P INTERSPEECH ISCA B Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616 HERMANSKY H, 2000, P INT C AC SPEECH SI, V3 HERMANSKY H, 1998, 5 INT C SPOK LANG PR Hermansky H, 1998, SPEECH COMMUN, V25, P3, DOI 10.1016/S0167-6393(98)00027-2 HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423 Hirsch G., 2005, FANT FILTERING NOISE Hoyer PO, 2004, J MACH LEARN RES, V5, P1457 HUBEL DH, 1965, J NEUROPHYSIOL, V28, P229 Hyvarinen A, 1999, IEEE T NEURAL NETWOR, V10, P626, DOI 10.1109/72.761722 KIM C, 2010, P INT C AC SPEECH SI, P4574 King AJ, 2009, NAT NEUROSCI, V12, P698, DOI 10.1038/nn.2308 Klein DJ, 2003, EURASIP J APPL SIG P, V2003, P659, DOI 10.1155/S1110865703303051 Kleinschmidt M., 2002, P INT C SPOK LANG PR Kleinschmidt M, 2002, ACTA ACUST UNITED AC, V88, P416 Leonard R., 1984, INT C AC SPEECH SIGN, V9 Lippmann RP, 1997, SPEECH COMMUN, V22, P1, DOI 10.1016/S0167-6393(97)00021-6 Mesgarani N, 2006, IEEE T AUDIO SPEECH, V14, P920, DOI 10.1109/TSA.2005.858055 MEYER B, 2008, P INTERSPEECH ISCA B Morgan N, 2005, IEEE SIGNAL PROC MAG, V22, P81, DOI 10.1109/MSP.2005.1511826 Morgan N., 1995, IEEE Signal Processing Magazine, V12, DOI 10.1109/79.382443 Olshausen BA, 1996, NATURE, V381, P607, DOI 10.1038/381607a0 PATTERSON RD, 1992, ADV BIOSCI, V83, P429 Pearce D., 2000, P INT C SPOK LANG PR Rauschecker JP, 1998, CURR OPIN NEUROBIOL, V8, P516, DOI 10.1016/S0959-4388(98)80040-8 Read HL, 2002, CURR OPIN NEUROBIOL, V12, P433, DOI 10.1016/S0959-4388(02)00342-2 Riesenhuber M, 1999, NAT NEUROSCI, V2, P1019 Schreiner CE, 1994, AUDIT NEUROSCI, V1, P39 Scott SK, 2003, TRENDS NEUROSCI, V26, P100, DOI 10.1016/S0166-2236(02)00037-1 Shamma S, 2001, TRENDS COGN SCI, V5, P340, DOI 10.1016/S1364-6613(00)01704-6 SHERRY Y, 2008, P INTERSPEECH ISCA B Slaney M., 1993, 35 APPL COMP CO Sroka JJ, 2005, SPEECH COMMUN, V45, P401, DOI 10.1016/j.specom.2004.11.009 Stevens K.N., 2000, ACOUSTIC PHONETICS SUR M, 1988, SCIENCE, V242, P1437, DOI 10.1126/science.2462279 van Hateren JH, 1998, P ROY SOC B-BIOL SCI, V265, P2315 VARGA A, 1993, SPEECH COMMUN, V12, P247, DOI 10.1016/0167-6393(93)90095-3 Vilar JM, 2008, INT CONF ACOUST SPEE, P5101, DOI 10.1109/ICASSP.2008.4518806 WANG H, 2008, P INTERSPEECH ISCA B Wersing H, 2003, NEURAL COMPUT, V15, P1559, DOI 10.1162/089976603321891800 Young ED, 2008, PHILOS T R SOC B, V363, P923, DOI 10.1098/rstb.2007.2151 NR 66 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY-JUN PY 2011 VL 53 IS 5 SI SI BP 736 EP 752 DI 10.1016/j.specom.2010.08.006 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 757DV UT WOS:000290065900012 ER PT J AU Meyer, BT Kollmeier, B AF Meyer, Bernd T. 
Kollmeier, Birger TI Robustness of spectro-temporal features against intrinsic and extrinsic variations in automatic speech recognition SO SPEECH COMMUNICATION LA English DT Article DE Spectro-temporal feature extraction; Automatic speech recognition; Robustness; Intrinsic variability ID RECOGNIZERS AB The effect of bio-inspired spectro-temporal processing for automatic speech recognition (ASR) is analyzed for two different tasks with focus on the robustness of spectro-temporal Gabor features in comparison to mel-frequency cepstral coefficients (MFCCs). Experiments aiming at extrinsic factors such as additive noise and changes of the transmission channel were carried out on a digit classification task (AURORA 2) for which spectro-temporal features were found to be more robust than the MFCC baseline against a wide range of noise sources. Intrinsic variations, i.e., changes in speaking rate, speaking effort and pitch, were analyzed on a phoneme recognition task with matched training and test conditions. The sensitivity of Gabor and MFCC features against various speaking styles was found to be different in a systematic way. An analysis based on phoneme confusions for both feature types suggests that spectro-temporal and purely spectral features carry complementary information. The usefulness of the combined information was demonstrated in a system using a combination of both types of features which yields a decrease in word-error rate of 16% compared to the best single-stream recognizer and 47% compared to an MFCC baseline. (C) 2010 Elsevier B.V. All rights reserved. C1 [Meyer, Bernd T.; Kollmeier, Birger] Carl von Ossietzky Univ Oldenburg, D-26111 Oldenburg, Germany. RP Meyer, BT (reprint author), Carl von Ossietzky Univ Oldenburg, D-26111 Oldenburg, Germany. EM bernd.meyer@uni-oldenburg.de FU DFG [SFB/TRR 31] FX Supported by the DFG (SFB/TRR 31 'The active auditory system'; URL: http://www.uni-oldenburg.de/sfbtr31). The OLLO speech database OLLO has been developed as part of the EU DIVINES Project IST-2002-002034. 
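The Meyer and Kollmeier record above uses spectro-temporal Gabor filters applied to a spectrogram. Below is a minimal sketch of one complex 2D Gabor filter and its application to a toy log-spectrogram; the actual Gabor feature set (filter-bank parameters and filter selection) is not reproduced, and the modulation frequencies chosen here are arbitrary.

```python
# Sketch: one spectro-temporal Gabor filter applied to a log-spectrogram.
# The paper uses a whole bank of such filters; parameters below are illustrative.
import numpy as np
from scipy.signal import stft, convolve2d

def gabor_2d(omega_k, omega_n, size=(23, 23)):
    """Complex 2D Gabor: sinusoid in the (channel, frame) plane under a Hanning envelope."""
    K, N = size
    k = np.arange(K) - K // 2
    n = np.arange(N) - N // 2
    env = np.outer(np.hanning(K), np.hanning(N))
    carrier = np.exp(1j * (omega_k * k[:, None] + omega_n * n[None, :]))
    return env * carrier

_, _, X = stft(np.random.randn(16000 * 2), fs=16000, nperseg=400, noverlap=240)
logspec = np.log(np.abs(X) + 1e-8)
g = gabor_2d(omega_k=0.5, omega_n=0.25)            # one spectro-temporal modulation
feature_map = np.abs(convolve2d(logspec, g, mode="same"))
print(feature_map.shape)
```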
CR BARKER J, 2007, SPEECH COMM DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 Depireux DA, 2001, J NEUROPHYSIOL, V85, P1220 DEVALOIS RL, 1980, ANNU REV PSYCHOL, V31, P309, DOI 10.1146/annurev.ps.31.020180.001521 Dimitriadis D, 2005, IEEE SIGNAL PROC LET, V12, P621, DOI 10.1109/LSP.2005.853050 DRESCHLER WA, 1999, J ACOUST SOC AM, V105, P1296, DOI 10.1121/1.426174 Dreschler WA, 2001, AUDIOLOGY, V40, P148 Ellis D., 2003, RASTA PLP MATLAB EZZAT T, 2007, P INT GRAMSS T, 1991, P IEEE 2 INT C ART N, P180 GRAMSS T, 1990, SPEECH COMMUN, V9, P35, DOI 10.1016/0167-6393(90)90043-9 Happel MFK, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P670 HECKMANN M, 2008, P INT, P4417 Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616 Hermansky H., 2000, ACOUST SPEECH SIG PR, P1635 Hermansky H, 1998, SPEECH COMMUN, V25, P3, DOI 10.1016/S0167-6393(98)00027-2 HERMANSKY H, 1999, ACOUST SPEECH SIG PR, P289 HIRSCH H, 2000, P ISCA ITRW ASR, P2697 JUNQUA JC, 1993, J ACOUST SOC AM, V93, P510, DOI 10.1121/1.405631 KAERNBACH C, 2000, P CONTR PSYCH AC RES, P295 KLEINSCHMIDT M, 2003, THESIS Kleinschmidt M., 2002, P ICSLP Kleinschmidt M, 2002, ACTA ACUST UNITED AC, V88, P416 Leonard R., 1984, P ICASSP, V9, P328 LIEB M, 2002, P ICSLP, P449 Lippmann RP, 1997, SPEECH COMMUN, V22, P1, DOI 10.1016/S0167-6393(97)00021-6 MESGARANI N, 2007, P INT MEYER B, 2006, WORKSH SPEECH INTR V, P95 MEYER B, 2008, P INT Meyer B., 2007, P INT, P1485 Qiu AQ, 2003, J NEUROPHYSIOL, V90, P456, DOI 10.1152/jn.00851.2002 Scharenborg O, 2007, SPEECH COMMUN, V49, P336, DOI 10.1016/j.specom.2007.01.009 Tchorz J, 1999, J ACOUST SOC AM, V106, P2040, DOI 10.1121/1.427950 Wesker T., 2005, P INT, P1273 Young S., 1995, HTK BOOK Zhao SY, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P898 NR 36 TC 13 Z9 14 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY-JUN PY 2011 VL 53 IS 5 SI SI BP 753 EP 767 DI 10.1016/j.specom.2010.07.002 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 757DV UT WOS:000290065900013 ER PT J AU Wu, SQ Falk, TH Chan, WY AF Wu, Siqing Falk, Tiago H. Chan, Wai-Yip TI Automatic speech emotion recognition using modulation spectral features SO SPEECH COMMUNICATION LA English DT Article DE Emotion recognition; Speech modulation; Spectro-temporal representation; Affective computing; Speech analysis ID FREQUENCY; CLASSIFICATION; ENVELOPE AB In this study, modulation spectral features (MSFs) are proposed for the automatic recognition of human affective information from speech. The features are extracted from an auditory-inspired long-term spectro-temporal representation. Obtained using an auditory filterbank and a modulation filterbank for speech analysis, the representation captures both acoustic frequency and temporal modulation frequency components, thereby conveying information that is important for human speech perception but missing from conventional short-term spectral features. On an experiment assessing classification of discrete emotion categories, the MSFs show promising performance in comparison with features that are based on mel-frequency cepstral coefficients and perceptual linear prediction coefficients, two commonly used short-term spectral representations. 
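The Wu, Falk and Chan record above reports its best results when modulation spectral features augment prosodic features. The sketch below shows plain feature-level fusion followed by an SVM, using scikit-learn's SVC rather than the LIBSVM tool cited in the record; the feature arrays are synthetic placeholders.

```python
# Sketch: feature-level fusion of (placeholder) modulation spectral features with
# prosodic features, classified by an SVM. Data and dimensions are synthetic.
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n_utt = 200
msf = rng.standard_normal((n_utt, 60))         # placeholder MSF vectors (one per utterance)
prosodic = rng.standard_normal((n_utt, 12))    # placeholder pitch/energy/duration statistics
labels = rng.integers(0, 7, size=n_utt)        # seven emotion categories

X = np.hstack([msf, prosodic])                 # simple feature-level fusion
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X, labels)
print(clf.score(X, labels))                    # training accuracy on synthetic data
```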
The MSFs further render a substantial improvement in recognition performance when used to augment prosodic features, which have been extensively used for emotion recognition. Using both types of features, an overall recognition rate of 91.6% is obtained for classifying seven emotion categories. Moreover, in an experiment assessing recognition of continuous emotions, the proposed features in combination with prosodic features attain estimation performance comparable to human evaluation. (C) 2010 Elsevier B.V. All rights reserved. C1 [Wu, Siqing; Chan, Wai-Yip] Queens Univ, Dept Elect & Comp Engn, Kingston, ON K7L 3N6, Canada. [Falk, Tiago H.] Univ Toronto, Inst Biomat & Biomed Engn, Toronto, ON M5S 3G9, Canada. RP Wu, SQ (reprint author), Queens Univ, Dept Elect & Comp Engn, Kingston, ON K7L 3N6, Canada. EM siqing.wu@queensu.ca; tiago.falk@ieee.org; chan@queensu.ca CR Abelin A., 2000, P ISCA WORKSH SPEECH, P110 AERTSEN AMHJ, 1980, BIOL CYBERN, V38, P223, DOI 10.1007/BF00337015 [Anonymous], 1996, G729 ITUT Atlas L, 2003, EURASIP J APPL SIG P, V2003, P668, DOI 10.1155/S1110865703305013 BARRA R, 2006, P ICASSP 06, V1, P1085 Batliner A, 2006, P IS LTC 2006 LJUBL, P240 Bishop C. M., 2006, PATTERN RECOGNITION Burkhardt F., 2005, P INT, P1517 Busso C, 2009, IEEE T AUDIO SPEECH, V17, P582, DOI 10.1109/TASL.2008.2009578 Chang CC, 2009, LIBSVM LIB SUPPORT V Chih T., 2005, J ACOUST SOC AM, V118, P887 Clavel C, 2008, SPEECH COMMUN, V50, P487, DOI 10.1016/j.specom.2008.03.012 COWIE R, 1996, P 4 INT C SPOK LANG, V3, P1989, DOI 10.1109/ICSLP.1996.608027 Cowie R, 2001, IEEE SIGNAL PROC MAG, V18, P32, DOI 10.1109/79.911197 DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 Depireux DA, 2001, J NEUROPHYSIOL, V85, P1220 Douglas-Cowie E, 2003, SPEECH COMMUN, V40, P33, DOI 10.1016/S0167-6393(02)00070-5 Ekman P., 1999, BASIC EMOTIONS HDB C, P45 Ewert SD, 2000, J ACOUST SOC AM, V108, P1181, DOI 10.1121/1.1288665 Falk T., 2008, P INT WORKSH AC ECH Falk TH, 2010, IEEE T AUDIO SPEECH, V18, P90, DOI 10.1109/TASL.2009.2023679 Falk TH, 2010, IEEE T INSTRUM MEAS, V59, P978, DOI 10.1109/TIM.2009.2024697 Fletcher H, 1940, REV MOD PHYS, V12, P0047, DOI 10.1103/RevModPhys.12.47 Giannakopoulos T, 2009, INT CONF ACOUST SPEE, P65, DOI 10.1109/ICASSP.2009.4959521 GLASBERG BR, 1990, HEARING RES, V47, P103, DOI 10.1016/0378-5955(90)90170-T Grimm M, 2008, 2008 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, VOLS 1-4, P865, DOI 10.1109/ICME.2008.4607572 Grimm M., 2007, P IEEE INT C AC SPEE, VIV, P1085 Grimm M, 2007, SPEECH COMMUN, V49, P787, DOI 10.1016/j.specom.2007.01.010 Gunes H, 2007, J NETW COMPUT APPL, V30, P1334, DOI 10.1016/j.jnca.2006.09.007 HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423 Hsu CW, 2007, PRACTICAL GUIDE SUPP International Telecommunications Union, 1993, P56 ITUT Ishi CT, 2010, EURASIP J AUDIO SPEE, DOI 10.1155/2010/528193 Kaiser J., 1990, P IEEE INT C AC SPEE, V1, P381 Kanedera N, 1999, SPEECH COMMUN, V28, P43, DOI 10.1016/S0167-6393(99)00002-3 Kittler J., 1978, Pattern Recognition and Signal Processing Lee S., 2005, P EUR LISB PORT, P497 LUGGER M, 2008, P INT C AC SPEECH SI, V4, P4945 Morgan N, 2005, IEEE SIGNAL PROC MAG, V22, P81, DOI 10.1109/MSP.2005.1511826 Mozziconacci S, 2002, P 1 INT C SPEECH PRO, P1 Nwe TL, 2003, SPEECH COMMUN, V41, P603, DOI 10.1016/S01167-6393(03)00099-2 Picard R. 
W., 1997, AFFECTIVE COMPUTING Rabiner L, 1993, FUNDAMENTALS SPEECH Scherer KR, 2003, SPEECH COMMUN, V40, P227, DOI 10.1016/S0167-6393(02)00084-5 Scherer S., 2007, P INT ENV 2007, P152 Schuller B., 2007, P INT, P2253 Schuller B, 2007, P ICASSP, V4, P941 Schuller B., 2006, P SPEECH PROS Shami M, 2007, SPEECH COMMUN, V49, P201, DOI 10.1016/j.specom.2007.01.006 Shamma S, 2001, TRENDS COGN SCI, V5, P340, DOI 10.1016/S1364-6613(00)01704-6 Shamma S., 2003, IETE J RES, V49, P193 Slaney M, 1993, EFFICIENT IMPLEMENTA Sun R, 2009, INT CONF ACOUST SPEE, P4509, DOI 10.1109/ICASSP.2009.4960632 TALKIN D, 1995, ROBUST ALGORITHM PIT, pCH14 Vapnik V., 1995, NATURE STAT LEARNING Ververidis D, 2006, SPEECH COMMUN, V48, P1162, DOI 10.1016/j.specom.2006.04.003 Vlasenko B., 2007, P INTERSPEECH 2007, P2225 Wollmer M, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P597 Wu S., 2009, P INT C DIG SIGN PRO, P1 Zhou GJ, 2001, IEEE T SPEECH AUDI P, V9, P201, DOI 10.1109/89.905995 NR 60 TC 32 Z9 35 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY-JUN PY 2011 VL 53 IS 5 SI SI BP 768 EP 785 DI 10.1016/j.specom.2010.08.013 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 757DV UT WOS:000290065900014 ER PT J AU Alias, F Formiga, L Llora, X AF Alias, Francesc Formiga, Lluis Llora, Xavier TI Efficient and reliable perceptual weight tuning for unit-selection text-to-speech synthesis based on active interactive genetic algorithms: A proof-of-concept SO SPEECH COMMUNICATION LA English DT Article DE Perceptual weight tuning; Unit selection text-to-speech synthesis; Active interactive genetic algorithms AB Unit-selection speech synthesis is one of the current corpus-based text-to-speech synthesis techniques. The quality of the generated speech depends on the accuracy of the unit selection process, which in turn relies on the cost function definition. This function should map the user perceptual preferences when selecting synthesis units, which is still an open research issue. This paper proposes a complete methodology for the tuning of the cost function weights by fusing the human judgments with the cost function, through efficient and reliable interactive weight tuning. To that effect, active interactive genetic algorithms (aiGAs) are used to guide the subjective weight adjustments. The application of aiGAs to this process allows mitigating user fatigue and frustration by improving user consistency. However, it is still unfeasible to subjectively adjust the weights of the whole corpus units (diphones and triphones in this work). This makes it mandatory to perform unit clustering before conducting the tuning process. The aiGA-based weight tuning proposal is evaluated in a small speech corpus as a proof-of-concept and results in more natural synthetic speech when compared to previous objective and subjective-based approaches. (C) 2011 Elsevier B.V. All rights reserved. C1 [Alias, Francesc; Formiga, Lluis] La Salle Univ Ramon Llull, GTM Grp Rec Tecnol Media, Barcelona 08022, Spain. [Llora, Xavier] Univ Illinois, Natl Ctr Supercomp Applicat, Urbana, IL 61801 USA. RP Alias, F (reprint author), La Salle Univ Ramon Llull, GTM Grp Rec Tecnol Media, C Quatre Camins 2, Barcelona 08022, Spain. 
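The quantity tuned by the active interactive GA in the Alias et al. record above is the weight vector of the unit-selection cost function. A minimal sketch of such a weighted cost (target subcosts plus join costs) for one candidate unit sequence follows; the subcost names and all numeric values are hypothetical.

```python
# Sketch: the kind of weighted cost an interactive GA would tune. Each candidate unit
# contributes weighted target subcosts (pitch, duration, spectral distance - names are
# hypothetical) plus a join cost to the previous unit.
import numpy as np

def sequence_cost(target_subcosts, join_costs, target_weights, join_weight):
    """target_subcosts: (n_units, n_subcosts); join_costs: (n_units - 1,)."""
    target = np.sum(target_subcosts @ target_weights)
    join = join_weight * np.sum(join_costs)
    return target + join

subcosts = np.array([[0.2, 0.1, 0.4],           # e.g. pitch, duration, spectral distance
                     [0.1, 0.3, 0.2],
                     [0.5, 0.2, 0.1]])
joins = np.array([0.3, 0.6])
w_target = np.array([1.0, 0.5, 2.0])            # weights the aiGA would adjust
print(sequence_cost(subcosts, joins, w_target, join_weight=1.5))
```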
EM falias@salle.url.edu RI Alias, Francesc/L-1088-2014 OI Alias, Francesc/0000-0002-1921-2375 FU European Commission [FP6 IST-4-027122-IP] FX This work has been partially supported by the European Commission, Project SALERO (FP6 IST-4-027122-IP). We would like to thank The Andrew W. Mellon Foundation and the National Center for Supercomputing Applications for their support during the preparation of this manuscript. CR Alias F., 2004, P 8 INT C SPOK LANG, P1221 Alias F., 2003, P 8 EUR C SPEECH COM, P1333 ALIAS F, 2006, P ICASSP TOUL FRANC, V1, P865 BLACK A, 2002, IEEE WORKSH SPEECH S BLACK A, 2007, P ICASSP HON HI, V4, P1229 Black A. B., 2005, P INT 2005 LISB PORT, P77 Black A. W., 1997, P EUR, P601 Black A.W., 1997, HCRCTR83 U ED BREUER R, 2004, P 8 INT C SPOK LANG, P1217 Campillo F. D., 2006, SPEECH COMMUN, V48, P941, DOI 10.1016/j.specom.2005.12.004 CHU M, 2002, ACOUST SPEECH SIG PR, P453 Chu M, 2001, P 7 EUR C SPEECH COM, P2087 Clark RAJ, 2007, SPEECH COMMUN, V49, P317, DOI 10.1016/j.specom.2007.01.014 COELLOCOELLO CA, 1998, LANIARD0908 Cristianini N., 2000, INTRO SUPPORT VECTOR Deb K., 2000, 200001 KANGAL IND I Durant EA, 2004, IEEE T SPEECH AUDI P, V12, P144, DOI 10.1109/TSA.2003.822640 FERNANDEZ R, 2006, TC STAR WORKSH SPEEC, P175 FORMIGA L, 2010, P 2010 IEEE C EV COM, P2322 Goldberg D., 1989, COMPLEX SYSTEMS, V3, P493 Goldberg D. E., 2002, DESIGN INNOVATION LE Goldberg D. E, 1989, GENETIC ALGORITHMS S GUNTER S, 2001, 3 IAPR TC15 WORKSH G, P229 Hunt A., 1996, P INT C AC SPEECH SI, V1, P373 Kim NS, 2004, IEEE SIGNAL PROC LET, V11, P40, DOI 10.1109/LSP.2003.819345 LEE M, 2001, P 4 ISCA WORKSH SPEE, P75 Llora X., 2005, P GEN EV COMP C, P1363, DOI 10.1145/1068009.1068228 MENG H, 2002, P 7 INT C SPOK LANG, P2373 Meron Y., 1999, P EUROSPEECH BUD HUN, V5, P2319 Pareto f, 1896, COURS EC POLITIQUE, VII Pareto V., 1896, COURS EC POLITIQUE, P1 PARK SS, 2003, P EUR GEN SWITZ, V1, P281 Peng H., 2002, P 7 INT C SPOK LANG, P1341 Sebag M, 1998, LECT NOTES COMPUT SC, V1498, P418 Strom V, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P1873 Takagi H, 2001, P IEEE, V89, P1275, DOI 10.1109/5.949485 Toda T., 2004, P ICASSP 2004 MONTR, P657 Toda T, 2006, SPEECH COMMUN, V48, P45, DOI 10.1016/j.specom.2005.05.011 TODA T, 2003, THESIS NARA I SCI TE Wu CH, 2001, SPEECH COMMUN, V35, P219, DOI 10.1016/S0167-6393(00)00075-3 YI JRW, 2003, THESIS MIT CAMBRIDGE NR 41 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY-JUN PY 2011 VL 53 IS 5 SI SI BP 786 EP 800 DI 10.1016/j.specom.2011.01.004 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 757DV UT WOS:000290065900015 ER PT J AU Kim, W Hansen, JHL AF Kim, Wooil Hansen, John H. L. TI Variational noise model composition through model perturbation for robust speech recognition with time-varying background noise SO SPEECH COMMUNICATION LA English DT Article DE Variational model composition (VMC); Time-varying noise; Feature compensation; Multiple environmental models; Robust speech recognition ID MISSING-FEATURE RECONSTRUCTION; COMPENSATION; ENHANCEMENT; COMBINATION AB This study proposes a novel model composition method to improve speech recognition performance in time-varying background noise conditions. 
It is suggested that each element of the cepstral coefficients represents the frequency degree of the changing components in the envelope of the log-spectrum. With this motivation, in the proposed method, variational noise models are formulated by selectively applying perturbation factors to the mean parameters of a basis model, resulting in a collection of noise models that more accurately reflect the natural range of spectral patterns seen in the log-spectral domain. The basis noise model is obtained from the silence segments of the input speech. The perturbation factors are designed separately for changes in the energy level and spectral envelope. The proposed variational model composition (VMC) method is employed to generate multiple environmental models for our previously proposed parallel combined Gaussian mixture model (PCGMM) based feature compensation algorithm. The mixture sharing technique is integrated to reduce computational expenses caused by employing the variational models. Experimental results prove that the proposed method is considerably more effective at increasing speech recognition performance in time-varying background noise conditions, with +31.31%, +10.65%, and +20.54% average relative improvements in word error rate for speech babble, background music, and real-life in-vehicle noise conditions respectively, compared to the original basic PCGMM method. (C) 2010 Elsevier B.V. All rights reserved. C1 [Kim, Wooil; Hansen, John H. L.] Univ Texas Dallas, CRSS, Erik Jonsson Sch Engn & Comp Sci, Dept Elect Engn, Richardson, TX 75080 USA. RP Hansen, JHL (reprint author), Univ Texas Dallas, CRSS, Erik Jonsson Sch Engn & Comp Sci, Dept Elect Engn, 2601 N Floyd Rd,EC33, Richardson, TX 75080 USA. EM john.hansen@utdallas.edu FU USAF [FA8750-09-C-0067] FX This work was supported by the USAF under a subcontract to RADC, Inc., Contract FA8750-09-C-0067 (Approved for public release. Distribution unlimited). A preliminary study of this work was presented at the Interspeech-2009, Brighton, UK, September 2009 (Kim and Hansen, 2009c).
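The variational model composition idea in the Kim and Hansen record above amounts to generating a family of noise-model mean vectors by perturbing a basis mean with energy-level and spectral-envelope factors. The sketch below illustrates that mechanism only; the actual perturbation design, the PCGMM compensation and the mixture sharing are not reproduced, and all factor values are assumed.

```python
# Sketch: generate a collection of "variational" noise-model mean vectors by perturbing
# a basis mean. Two kinds of factors are applied, loosely following the record: an
# overall energy offset (0th coefficient) and a smooth envelope tilt. Factor values
# are illustrative only.
import numpy as np

def variational_means(basis_mean, energy_offsets, tilt_factors):
    dim = basis_mean.shape[0]
    tilt_shape = np.cos(np.pi * np.arange(dim) / dim)   # smooth low-order envelope change
    models = []
    for e in energy_offsets:
        for t in tilt_factors:
            m = basis_mean.copy()
            m[0] += e                                    # energy-level perturbation
            m += t * tilt_shape                          # spectral-envelope perturbation
            models.append(m)
    return np.stack(models)

basis = np.random.randn(13)                              # e.g. a cepstral mean from silence frames
models = variational_means(basis, energy_offsets=[-2.0, 0.0, 2.0],
                           tilt_factors=[-0.5, 0.0, 0.5])
print(models.shape)                                      # (9, 13) variational noise means
```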
CR Angkititrakul P., 2007, IEEE INT VEH S, P566 ANGKITITRAKUL P, 2009, INVEHICLE CORPUS SIG, pCH5 [Anonymous], 2000, 201108 ETSI ES BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 Cooke M, 2001, SPEECH COMMUN, V34, P267, DOI 10.1016/S0167-6393(00)00034-0 Deller J., 2000, DISCRETE TIME PROCES EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 ETSI, 2002, 202050 ETSI ES FREY J, 2001, EUR 2001 Gales MJF, 1996, IEEE T SPEECH AUDI P, V4, P352, DOI 10.1109/89.536929 Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278 Hansen JH, 2005, IEEE T SPEECH AUDI P, V13, P712, DOI 10.1109/TSA.2005.852088 Hansen JHL, 1994, IEEE T SPEECH AUDI P, V2, P598, DOI 10.1109/89.326618 HANSEN JHL, 1991, IEEE T SIGNAL PROCES, V39, P795, DOI 10.1109/78.80901 Hansen JHL, 2004, DSP IN VEHICLE MOBIL HIRSCH HG, 1995, INT CONF ACOUST SPEE, P153, DOI 10.1109/ICASSP.1995.479387 Hirsch H.G., 2000, ISCA ITRW ASR2000 KIM NS, 1997, IEEE WORKSH SPEECH R, P389 Kim NS, 2002, SPEECH COMMUN, V37, P231, DOI 10.1016/S0167-6393(01)00013-9 Kim W, 2009, SPEECH COMMUN, V51, P83, DOI 10.1016/j.specom.2008.06.004 Kim W, 2006, INT CONF ACOUST SPEE, P305 KIM W, 2009, INT 2009, P2399 Kim W, 2007, 2007 IEEE WORKSHOP ON AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING, VOLS 1 AND 2, P687 Kim W, 2009, IEEE T AUDIO SPEECH, V17, P1292, DOI 10.1109/TASL.2009.2015080 LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010 Martin R, 2001, IEEE T SPEECH AUDI P, V9, P504, DOI 10.1109/89.928915 Martin R., 1994, EUSIPCO 94, P1182 Moreno P.J., 1996, THESIS CARNEGIE MELL Moreno PJ, 1998, SPEECH COMMUN, V24, P267, DOI 10.1016/S0167-6393(98)00025-9 Raj B, 2004, SPEECH COMMUN, V43, P275, DOI 10.1016/j.specom.2004.03.007 Sasou A., 2004, ICSLP2004, P121 Stouten V., 2004, ICASSP2004, P949 VARGA AP, 1990, INT CONF ACOUST SPEE, P845, DOI 10.1109/ICASSP.1990.115970 YAO K, 2001, ADV NEURAL INFO PROC, V14 NR 34 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2011 VL 53 IS 4 BP 451 EP 464 DI 10.1016/j.specom.2010.12.001 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 742GJ UT WOS:000288929000001 ER PT J AU Paliwal, K Wojcicki, K Shannon, B AF Paliwal, Kuldip Wojcicki, Kamil Shannon, Benjamin TI The importance of phase in speech enhancement SO SPEECH COMMUNICATION LA English DT Article DE Speech enhancement; Analysis window; Short-time Fourier analysis; Analysis-modification-synthesis (AMS); Magnitude spectrum; Minimum mean-square error (MMSE) short-time spectral amplitude (STSA) estimator; Phase spectrum; Phase spectrum compensation (PSC); MMSE PSC ID TIME FOURIER-TRANSFORM; SPECTRAL AMPLITUDE ESTIMATOR; HUMAN LISTENING TESTS; SIGNAL RECONSTRUCTION; MAGNITUDE; RECOGNITION; SUBTRACTION; PERCEPTION; SUPPRESSION; NOISE AB Typical speech enhancement methods, based on the short-time Fourier analysis-modification-synthesis (AMS) framework, modify only the magnitude spectrum and keep the phase spectrum unchanged. In this paper our aim is to show that by modifying the phase spectrum in the enhancement process the quality of the resulting speech can be improved. For this we use analysis windows of 32 ms duration and investigate a number of approaches to phase spectrum computation. 
These include the use of matched or mismatched analysis windows for magnitude and phase spectra estimation during AMS processing, as well as the phase spectrum compensation (PSC) method. We consider four cases and conduct a series of objective and subjective experiments that examine the importance of the phase spectrum for speech quality in a systematic manner. In the first (oracle) case, our goal is to determine maximum speech quality improvements achievable when accurate phase spectrum estimates are available, but when no enhancement is performed on the magnitude spectrum. For this purpose speech stimuli are constructed, where (during AMS processing) the phase spectrum is computed from clean speech, while the magnitude spectrum is computed from noisy speech. While such a situation does not arise in practice, it does provide us with a useful insight into how much a precise knowledge of the phase spectrum can contribute towards speech quality. In this first case, matched and mismatched analysis window approaches are investigated. Particular attention is given to the choice of analysis window type used during phase spectrum computation, where the effect of spectral dynamic range on speech quality is examined. In the second (non-oracle) case, we consider a more realistic scenario where only the noisy spectra (observable in practice) are available. We study the potential of the mismatched window approach for speech quality improvements in this non-oracle case. We would also like to determine how much room for improvement exists between this case and the best (oracle) case. In the third case, we use the PSC algorithm to enhance the phase spectrum. We compare this approach with the oracle and non-oracle matched and mismatched window techniques investigated in the preceding cases. While in the first three cases we consider the usefulness of various approaches to phase spectrum computation within the AMS framework when noisy magnitude spectrum is used, in the fourth case we examine the usefulness of these techniques when enhanced magnitude spectrum is employed. Our aim (in the context of traditional magnitude spectrum-based enhancement methods) is to determine how much benefit in terms of speech quality can be attained by also processing the phase spectrum. For this purpose, the minimum mean-square error (MMSE) short-time spectral amplitude (STSA) estimates are employed instead of noisy magnitude spectra. The results of the oracle experiments show that accurate phase spectrum estimates can considerably contribute towards speech quality, as well as that the use of mismatched analysis windows (in the computation of the magnitude and phase spectra) provides significant improvements in both objective and subjective speech quality - especially when the choice of analysis window used for phase spectrum computation is carefully considered. The mismatched window approach was also found to improve speech quality in the non-oracle case. While the improvements were found to be statistically significant, they were only modest compared to those observed in the oracle case. This suggests that research into better phase spectrum estimation algorithms, while a challenging task, could be worthwhile. The results of the PSC experiments indicate that the PSC method achieves better speech quality improvements than the other non-oracle methods considered.
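The mismatched-window idea in the Paliwal et al. record above can be illustrated with a small analysis-modification-synthesis loop: one STFT supplies the magnitude spectrum, a second STFT with a different analysis window supplies the phase, and the two are recombined before overlap-add resynthesis. The window types and frame settings below are illustrative (the paper uses 32 ms frames), and no magnitude enhancement is performed here.

```python
# Sketch: AMS processing with "mismatched" analysis windows - magnitude from a
# Hamming-windowed STFT, phase from a second STFT with a different window
# (a Chebyshev window here, purely as an example).
import numpy as np
from scipy.signal import stft, istft, get_window

fs = 16000
x = np.random.randn(fs)                     # stand-in for noisy speech
nperseg, noverlap = 512, 384                # 32 ms frames, 75% overlap

win_mag = get_window("hamming", nperseg)
win_phase = get_window(("chebwin", 60), nperseg)

_, _, X_mag = stft(x, fs, window=win_mag, nperseg=nperseg, noverlap=noverlap)
_, _, X_phase = stft(x, fs, window=win_phase, nperseg=nperseg, noverlap=noverlap)

# Combine magnitude from one analysis and phase from the other, then resynthesize.
X_combined = np.abs(X_mag) * np.exp(1j * np.angle(X_phase))
_, y = istft(X_combined, fs, window=win_mag, nperseg=nperseg, noverlap=noverlap)
```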
The results of the MMSE experiments suggest that accurate phase spectrum estimates have a potential to significantly improve performance of existing magnitude spectrum-based methods. Out of the non-oracle approaches considered, the combination of the MMSE STSA method with the PSC algorithm produced significantly better speech quality improvements than those achieved by these methods individually. (C) 2010 Elsevier B.V. All rights reserved. C1 [Paliwal, Kuldip; Wojcicki, Kamil; Shannon, Benjamin] Griffith Univ, Griffith Sch Engn, Signal Proc Lab, Nathan, Qld 4111, Australia. RP Wojcicki, K (reprint author), Griffith Univ, Griffith Sch Engn, Signal Proc Lab, Nathan, Qld 4111, Australia. EM kamil.wojcicki@ieee.org CR Alsteris L., 2004, P IEEE INT C AC SPEE, V1, P573 ALSTERIS L, 2005, P INT S SIGN PROC AP Alsteris LD, 2007, COMPUT SPEECH LANG, V21, P174, DOI 10.1016/j.csl.2006.03.001 Berouti M., 1979, P IEEE INT C AC SPEE, V4, P208 BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550 EPHRAIM Y, 1995, IEEE T SPEECH AUDI P, V3, P251, DOI 10.1109/89.397090 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 Ghitza O, 2001, J ACOUST SOC AM, V110, P1628, DOI 10.1121/1.1396325 GRIFFIN DW, 1984, IEEE T ACOUST SPEECH, V32, P236, DOI 10.1109/TASSP.1984.1164317 HARRIS FJ, 1978, P IEEE, V66, P51, DOI 10.1109/PROC.1978.10837 Hayes M. H., 1996, STAT DIGITAL SIGNAL, V1st HAYES MH, 1980, IEEE T ACOUST SPEECH, V28, P672, DOI 10.1109/TASSP.1980.1163463 HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423 Hu Y, 2007, SPEECH COMMUN, V49, P588, DOI 10.1016/j.specom.2006.12.006 *ITU T, 2001, 0862 ITUT Kim DS, 2003, IEEE T SPEECH AUDI P, V11, P355, DOI 10.1109/TSA.2003.814409 LIM JS, 1979, P IEEE, V67, P1586, DOI 10.1109/PROC.1979.11540 Liu L, 1997, SPEECH COMMUN, V22, P403, DOI 10.1016/S0167-6393(97)00054-X Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst Loveimi E., 2010, P INT S COMM CONTR S, P1 Lu Y, 2008, SPEECH COMMUN, V50, P453, DOI 10.1016/j.specom.2008.01.003 Martin R., 1994, P 7 EUR SIGN PROC C, P1182 MCAULAY RJ, 1995, PSYCHOACOUSTICS FACT, P121 Nakagawa S, 2007, INTERSPEECH 2007: 8TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION, VOLS 1-4, P1065 NAWAB SH, 1983, IEEE T ACOUST SPEECH, V31, P986, DOI 10.1109/TASSP.1983.1164162 OPPENHEIM AV, 1979, P IEEE INT C AC SPEE, V4, P632 OPPENHEIM AV, 1981, P IEEE, V69, P529, DOI 10.1109/PROC.1981.12022 Paliwal K., 1987, P IEEE INT C AC SPEE, V12, P177 Paliwal K. K., 2003, P EUR 2003, P2117 Alsteris LD, 2006, SPEECH COMMUN, V48, P727, DOI 10.1016/j.specom.2005.10.005 Paliwal K.K., 2003, P IPSJ SPOK LANG PRO, P1 Paliwal KK, 2005, SPEECH COMMUN, V45, P153, DOI 10.1016/j.specom.2004.08.001 PICONE JW, 1993, P IEEE, V81, P1215, DOI 10.1109/5.237532 POBLOTH H, 1999, ACOUST SPEECH SIG PR, P29 Quatieri T. 
F., 2002, DISCRETE TIME SPEECH REDDY NS, 1985, IEEE T CIRCUITS SYST, V32, P616, DOI 10.1109/TCS.1985.1085749 Schluter R., 2001, P IEEE INT C AC SPEE, P133 Shannon BJ, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P1423 Shi GJ, 2006, IEEE T AUDIO SPEECH, V14, P1867, DOI 10.1109/TSA.2005.858512 Sim BL, 1998, IEEE T SPEECH AUDI P, V6, P328 SKOGLUND J, 1997, P IEEE SPEECH COD WO, P51 Stark AP, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P549 VARY P, 1985, SIGNAL PROCESS, V8, P387, DOI 10.1016/0165-1684(85)90002-7 Virag N, 1999, IEEE T SPEECH AUDI P, V7, P126, DOI 10.1109/89.748118 Wackerly D., 2007, MATH STAT APPL WANG DL, 1982, IEEE T ACOUST SPEECH, V30, P679, DOI 10.1109/TASSP.1982.1163920 Wang LB, 2009, INT CONF ACOUST SPEE, P4529, DOI 10.1109/ICASSP.2009.4960637 Wiener N., 1949, EXTRAPOLATION INTERP Wojcicki K, 2008, IEEE SIGNAL PROC LET, V15, P461, DOI 10.1109/LSP.2008.923579 WOJCICKI K, 2008, P ISCA TUT RES WORKS WOJCICKI K, 2007, P IEEE INT C AC SPEE, V4, P729 YEGNANARAYANA B, 1987, P IEEE INT C AC SPEE, V12, P301 NR 54 TC 17 Z9 18 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2011 VL 53 IS 4 BP 465 EP 494 DI 10.1016/j.specom.2010.12.003 PG 30 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 742GJ UT WOS:000288929000002 ER PT J AU Lu, CT AF Lu, Ching-Ta TI Enhancement of single channel speech using perceptual-decision-directed approach SO SPEECH COMMUNICATION LA English DT Article DE Speech enhancement; Power spectral subtraction; Masking property; Decision directed; Perceptual gain factor ID MASKING PROPERTIES; NOISE; TRANSFORM; QUALITY; CODERS AB The masking properties of the human ear have been successfully applied to adapt a speech enhancement system, yielding an improvement in speech quality. The accuracy of estimated speech spectra plays a major role in computing the noise masking threshold. Although traditional methods using the power-spectral-subtraction method to roughly estimate the speech spectra can provide an acceptable performance, the estimated speech spectra can be further improved for computing the noise masking threshold. In this article, we aim at finding a better spectral estimate of speech by the two-step-decision-directed method. In turn, this estimate is employed to compute the noise masking threshold of a perceptual gain factor. Experimental results show that the amounts of residual noise can be efficiently suppressed by embedding the two-step-decision-directed algorithm in the perceptual gain factor. (C) 2010 Elsevier B.V. All rights reserved. C1 Asia Univ, Dept Informat Commun, Taichung 41354, Taiwan. RP Lu, CT (reprint author), Asia Univ, Dept Informat Commun, 500 Lioufeng Rd, Taichung 41354, Taiwan. EM lucas1@ms26.hinet.net FU National Science Council, Taiwan [NSC 98-2221-E-468-006] FX This research was supported by the National Science Council, Taiwan, under Contract Number NSC 98-2221-E-468-006. The author would like to thank the anonymous reviewers for their valuable comments which improve the quality of this paper. 
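For reference alongside the Lu record above, the decision-directed a priori SNR estimate that the two-step procedure starts from has the standard Ephraim-Malah form below; \hat{A}(k,l-1) denotes the previous spectral amplitude estimate, \lambda_d(k,l) the noise power estimate and \gamma(k,l) the a posteriori SNR (this notation is assumed, not copied from the paper).

```latex
% Decision-directed a priori SNR estimate (standard Ephraim-Malah form) that the
% two-step procedure refines in a second pass.
\hat{\xi}(k,l) \;=\; \alpha\,\frac{\hat{A}^{2}(k,l-1)}{\lambda_d(k,l)}
\;+\;(1-\alpha)\,\max\bigl[\gamma(k,l)-1,\;0\bigr], \qquad 0 \le \alpha < 1
```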
CR Amehraye A, 2008, INT CONF ACOUST SPEE, P2081, DOI 10.1109/ICASSP.2008.4518051 [Anonymous], 2003, SUBJ TEST METH EV SP, P835 Cappe O, 1994, IEEE T SPEECH AUDI P, V2, P345, DOI 10.1109/89.279283 Ding HJ, 2009, SPEECH COMMUN, V51, P259, DOI 10.1016/j.specom.2008.09.003 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 Ghanbari Y, 2006, SPEECH COMMUN, V48, P927, DOI 10.1016/j.specom.2005.12.002 Hu Y, 2004, IEEE SIGNAL PROC LET, V11, P270, DOI 10.1109/LSP.2003.821714 *ITU T, 2001, 0862 ITUT Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst Lu C-T, 2007, DIGIT SIGNAL PROCESS, V17, P171 Lu CT, 2003, SPEECH COMMUN, V41, P409, DOI 10.1016/S0167-6393(03)00011-6 Lu CT, 2007, PATTERN RECOGN LETT, V28, P1300, DOI 10.1016/j.patrec.2007.03.001 Lu CT, 2004, ELECTRON LETT, V40, P394, DOI 10.1049/el:20040266 Martin R, 2001, IEEE T SPEECH AUDI P, V9, P504, DOI 10.1109/89.928915 Plapous C, 2006, IEEE T AUDIO SPEECH, V14, P2098, DOI 10.1109/TASL.2006.872621 Rix A., 2001, P IEEE INT C AC SPEE, P749 SCHROEDER MR, 1979, J ACOUST SOC AM, V66, P1647, DOI 10.1121/1.383662 Udrea RM, 2008, SIGNAL PROCESS, V88, P1293 Virag N, 1999, IEEE T SPEECH AUDI P, V7, P126, DOI 10.1109/89.748118 WANG SH, 1992, IEEE J SEL AREA COMM, V10, P819, DOI 10.1109/49.138987 YANG WH, 1998, ACOUST SPEECH SIG PR, P541 Yu G, 2008, IEEE T SIGNAL PROCES, V56, P1830, DOI 10.1109/TSP.2007.912893 NR 22 TC 7 Z9 10 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2011 VL 53 IS 4 BP 495 EP 507 DI 10.1016/j.specom.2010.11.008 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 742GJ UT WOS:000288929000003 ER PT J AU Gao, J Zhao, QW Yan, YH AF Gao, Jie Zhao, Qingwei Yan, Yonghong TI Towards precise and robust automatic synchronization of live speech and its transcripts SO SPEECH COMMUNICATION LA English DT Article DE Automatically synchronizing spoken utterances with their transcripts; Frame-synchronous likelihood ratio test; Hidden markov models; Simultaneous broadcast news subtitling ID SUBTITLING SYSTEM; RECOGNITION AB This paper presents our efforts in automatically synchronizing spoken utterances with their transcripts (textual contents) (ASUT), where the speech is a live stream and its corresponding transcripts are known. This task is first simplified to the problem of online detecting the end times of spoken utterances and then a solution based on a novel frame-synchronous likelihood ratio test (FSLRT) procedure is proposed. We detail the formulation and implementation of the proposed FSLRT procedure under the Hidden Markov Models (HMMs) framework, and we study its property and parameter settings empirically. Because synchronization failures may occur in the FSLRT-based AUST systems, this paper also extends the FSLRT procedure to its multiple-instance version to increase the robustness of the system. The proposed multiple-instance FSLRT can detect the synchronization failures and restart the system from an appropriate point. Therefore a fully automatic FSLRT-based ASUT system could be constructed. The FSLRT-based ASUT system is evaluated in a simultaneous broadcasting news subtitling task. Experimental results show that the proposed method achieves satisfying performance and it outperforms an automatic speech recognition-based method both in terms of robustness and precision. 
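The Gao et al. record above is built around a frame-synchronous likelihood ratio test. The paper's FSLRT is formulated over HMM state sequences and is not reproduced here; the sketch below only illustrates the general idea with a per-frame log-likelihood ratio accumulated between an "utterance ended" model and a "still speaking" model, using single Gaussians over a scalar frame score (all models, scores and thresholds are assumptions).

```python
# Sketch of a frame-synchronous likelihood ratio test in its most generic form:
# per frame, accumulate the log-likelihood ratio between an "utterance ended" model
# and a "still speaking" model, and fire once a threshold is crossed. This is only
# an illustration of the idea, not the FSLRT defined in the record.
import numpy as np
from scipy.stats import norm

def detect_end(frames, end_model, speech_model, threshold=5.0):
    """frames: 1-D sequence of frame scores (e.g. log-energy); models: (mean, std)."""
    llr = 0.0
    for i, f in enumerate(frames):
        llr += norm.logpdf(f, *end_model) - norm.logpdf(f, *speech_model)
        llr = max(llr, 0.0)                  # reset when evidence favours "still speaking"
        if llr > threshold:
            return i                         # frame index at which the end is declared
    return None

frames = np.concatenate([np.random.normal(5.0, 1.0, 80),    # speech-like frames
                         np.random.normal(0.0, 1.0, 40)])   # silence-like frames
print(detect_end(frames, end_model=(0.0, 1.0), speech_model=(5.0, 1.0)))
```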
Finally, the FSLRT-based news subtitling system can correctly subtitle about 90% of the sentences with an average time deviation of about 100 ms, running at the speed of 0.37 real time (RT). (C) 2011 Elsevier B.V. All rights reserved. C1 [Gao, Jie; Zhao, Qingwei; Yan, Yonghong] Chinese Acad Sci, ThinkIT Speech Lab, Inst Acoust, Beijing 100190, Peoples R China. RP Gao, J (reprint author), Chinese Acad Sci, ThinkIT Speech Lab, Inst Acoust, Beijing 100190, Peoples R China. EM jgao@hccl.ioa.ac.cn FU National Science & Technology Pillar Program [2008BAI50B03]; National Natural Science Foundation of China [10925419, 90920302, 10874203, 60875014] FX This work is partially supported by The National Science & Technology Pillar Program (2008BAI50B03), National Natural Science Foundation of China (No. 10925419, 90920302, 10874203, 60875014). We thank the members of ThinkIT Speech Lab for the fruitful discussions on the mathematical formulation of FSLRT. We also thank the reviewers for insightful their comments and suggestions, which are very helpful for improving the quality of our manuscript. CR Ando A, 2003, IEICE T INF SYST, VE86D, P15 Boulianne G, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P273 Duda R., 2000, PATTERN CLASSIFICATI Gao J, 2009, LECT NOTES COMPUT SC, V5553, P576 GAO J, 2009, P INT 2009 ISCA, P2115 GUO Y, 2007, P INT, P2949 Hosom J.-P., 2000, THESIS OREGON GRADUA HUANG C, 2003, AUTOMATIC CLOSED CAP Jurafsky Daniel, 2000, SPEECH LANGUAGE PROC Kan MY, 2008, IEEE T AUDIO SPEECH, V16, P338, DOI 10.1109/TASL.2007.911559 MANUEL J, 2002, IEEE T MUTIMEDIAS, V3, P88 MORENO P, 1998, P ICSLP 1998 ISCA Neto J, 2008, INT CONF ACOUST SPEE, P1561, DOI 10.1109/ICASSP.2008.4517921 Ney H, 1999, IEEE SIGNAL PROC MAG, V16, P64, DOI 10.1109/79.790984 RABINER LR, 1989, P IEEE, V77, P257, DOI 10.1109/5.18626 Rahim MG, 1997, IEEE T SPEECH AUDI P, V5, P266, DOI 10.1109/89.568733 ROBERTRIBES J, 1997, P EUR ISCA, P903 Shao J, 2008, IEICE T INF SYST, VE91D, P529, DOI 10.1093/ietisy/e9l-d.3.529 WEINTRAUB M, 1995, INT CONF ACOUST SPEE, P297, DOI 10.1109/ICASSP.1995.479532 WHEATLEY B, 1992, P IEEE INT C AC SPEE, P533, DOI 10.1109/ICASSP.1992.225853 NR 20 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2011 VL 53 IS 4 BP 508 EP 523 DI 10.1016/j.specom.2011.01.001 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 742GJ UT WOS:000288929000004 ER PT J AU Jan, T Wang, WW Wang, DL AF Jan, Tariqullah Wang, Wenwu Wang, DeLiang TI A multistage approach to blind separation of convolutive speech mixtures SO SPEECH COMMUNICATION LA English DT Article DE Independent component analysis (ICA); Convolutive mixtures; Ideal binary mask (IBM); Estimated binary mask; Cepstral smoothing; Musical noise ID TIME-FREQUENCY MASKING; NONSTATIONARY SOURCES; PERMUTATION PROBLEM; NATURAL GRADIENT; DOMAIN; REPRESENTATION; SENSORS AB We propose a novel algorithm for the separation of convolutive speech mixtures using two-microphone recordings, based on the combination of independent component analysis (ICA) and ideal binary mask (IBM), together with a post-filtering process in the cepstral domain. The proposed algorithm consists of three steps. First, a constrained convolutive ICA algorithm is applied to separate the source signals from two-microphone recordings. 
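The second step of the Jan, Wang and Wang method, described in the continuation of the abstract below, estimates a binary mask by comparing the energy of corresponding time-frequency units of the two ICA outputs. A minimal sketch of that masking step follows; the convolutive ICA itself and the final cepstral smoothing are not shown, and the inputs here are random stand-ins.

```python
# Sketch of the mask-estimation step described in the continuation of this abstract
# below: compare the energy of corresponding T-F units of the two ICA-separated
# outputs and keep, for each output, only the units it dominates in the mixture.
import numpy as np
from scipy.signal import stft, istft

def binary_mask_refine(s1, s2, mix, fs, nperseg=512):
    _, _, S1 = stft(s1, fs, nperseg=nperseg)
    _, _, S2 = stft(s2, fs, nperseg=nperseg)
    _, _, M = stft(mix, fs, nperseg=nperseg)
    mask1 = (np.abs(S1) ** 2 > np.abs(S2) ** 2).astype(float)   # estimated binary mask
    mask2 = 1.0 - mask1
    _, y1 = istft(mask1 * M, fs, nperseg=nperseg)
    _, y2 = istft(mask2 * M, fs, nperseg=nperseg)
    return y1, y2

fs = 16000
s1, s2 = np.random.randn(fs), np.random.randn(fs)   # stand-ins for the two ICA outputs
mix = s1 + s2                                        # stand-in for one microphone signal
y1, y2 = binary_mask_refine(s1, s2, mix, fs)
```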
In the second step, we estimate the IBM by comparing the energy of corresponding time frequency (T-F) units from the separated sources obtained with the convolutive ICA algorithm. The last step is to reduce musical noise caused by T-F masking using cepstral smoothing. The performance of the proposed approach is evaluated using both reverberant mixtures generated using a simulated room model and real recordings in terms of signal to noise ratio measurement. The proposed algorithm offers considerably higher efficiency and improved speech quality while producing similar separation performance compared with a recent approach. (C) 2011 Elsevier B.V. All rights reserved. C1 [Jan, Tariqullah; Wang, Wenwu] Univ Surrey, Ctr Vis Speech & Signal Proc, Guildford GU2 5XH, Surrey, England. [Wang, DeLiang] Ohio State Univ, Dept Comp Sci & Engn, Columbus, OH 43210 USA. [Wang, DeLiang] Ohio State Univ, Ctr Cognit Sci, Columbus, OH 43210 USA. RP Jan, T (reprint author), Univ Surrey, Ctr Vis Speech & Signal Proc, Guildford GU2 5XH, Surrey, England. EM t.jan@surrey.ac.uk; w.wang@surrey.ac.uk; dwang@cse.ohio-state.edu FU Royal Academy of Engineering [IJB/AM/08-587]; EPSRC [EP/H012842/1]; AFOSR [FA9550-08-1-0155]; NSF [IIS-0534707]; NWFP UET Peshawar, Pakistan FX We are grateful to M. S. Pedersen for providing the mat-lab code of Pedersen et al. (2008) and the assistance in the preparation of this work. Part of the work was conducted while W. Wang was visiting OSU. T. U. Jan was supported by the NWFP UET Peshawar, Pakistan. W. Wang was supported in part by a Royal Academy of Engineering travel Grant (IJB/AM/08-587) and an EPSRC Grant (EP/H012842/1). D. L. Wang was supported in part by an AFOSR Grant (FA9550-08-1-0155) and an NSF Grant (IIS-0534707). CR Aissa-El-Bey A, 2007, IEEE T AUDIO SPEECH, V15, P1540, DOI 10.1109/TASL.2007.898455 Amari S, 1997, FIRST IEEE SIGNAL PROCESSING WORKSHOP ON SIGNAL PROCESSING ADVANCES IN WIRELESS COMMUNICATIONS, P101, DOI 10.1109/SPAWC.1997.630083 Araki S., 2005, P IEEE INT C AC SPEE, V3, P81, DOI 10.1109/ICASSP.2005.1415651 Araki S, 2003, IEEE T SPEECH AUDI P, V11, P109, DOI 10.1109/TSA.2003.809193 ARAKI S, 2004, P INT C IND COMP AN, P898 Araki S, 2007, SIGNAL PROCESS, V87, P1833, DOI 10.1016/j.sigpro.2007.02.003 BACK AD, 1994, IEEE WORKSH NEUR NET, P565 Buchner H, 2004, AUDIO SIGNAL PROCESSING: FOR NEXT-GENERATION MULTIMEDIA COMMUNICATION SYSTEMS, P255, DOI 10.1007/1-4020-7769-6_10 CHERRY EC, 1953, J ACOUST SOC AM, V25, P975, DOI 10.1121/1.1907229 Cichocki A., 2002, ADAPTIVE BLIND SIGNA Dillon H., 2001, HEARING AIDS Douglas SC, 2007, IEEE T AUDIO SPEECH, V15, P1511, DOI 10.1109/TASL.2007.899176 Douglas SC, 2005, IEEE T SPEECH AUDI P, V13, P92, DOI 10.1109/TSA.2004.838538 Douglas SC, 2003, SPEECH COMMUN, V39, P65, DOI 10.1016/S0167-6393(02)00059-6 Gaubitch N. D., 1979, ALLEN BERKLEY IMAGE Han S, 2009, P 7 INT C INF COMM S, P356 He ZS, 2007, IEEE T AUDIO SPEECH, V15, P1551, DOI 10.1109/TASL.2007.898457 Hoel P.G., 1976, ELEMENTARY STAT Hu GN, 2004, IEEE T NEURAL NETWOR, V15, P1135, DOI 10.1109/TNN.2004.832812 Hyvarinen A, 2001, INDEPENDENT COMPONEN Jan T, 2009, INT CONF ACOUST SPEE, P1713, DOI 10.1109/ICASSP.2009.4959933 Lambert RH, 1997, INT CONF ACOUST SPEE, P423, DOI 10.1109/ICASSP.1997.599665 Lee T. 
W., 1998, INDEPENDENT COMPONEN Lee TW, 1997, 1997 IEEE INTERNATIONAL CONFERENCE ON NEURAL NETWORKS, VOLS 1-4, P2129, DOI 10.1109/ICNN.1997.614235 Madhu N, 2008, INT CONF ACOUST SPEE, P45, DOI 10.1109/ICASSP.2008.4517542 Makino S, 2005, IEICE T FUND ELECTR, VE88A, P1640, DOI 10.1093/ietfec/e88-a.7.1640 Matsuoka K., 2001, INT WORKSH ICA BSS, P722 Mazur R, 2009, IEEE T AUDIO SPEECH, V17, P117, DOI 10.1109/TASL.2008.2005349 MITIANONDIS N, 2002, INT J ADAPT CONTROL, P1 Mukai R, 2004, LECT NOTES COMPUT SC, V3195, P461 Murata N, 2001, NEUROCOMPUTING, V41, P1, DOI 10.1016/S0925-2312(00)00345-3 Nesta F., 2009, IEEE WORKSH APPL SIG, P105 NESTA F, 2008, P HANDS FREE SPEECH, P232 Nickel RM, 2006, INT CONF ACOUST SPEE, P629 Olsson RK, 2006, INT CONF ACOUST SPEE, P657 Oppenheim A. V., 1975, DIGITAL SIGNAL PROCE Parra L, 2000, IEEE T SPEECH AUDI P, V8, P320, DOI 10.1109/89.841214 Pedersen MS, 2008, IEEE T NEURAL NETWOR, V19, P475, DOI 10.1109/TNN.2007.911740 Rahbar K, 2005, IEEE T SPEECH AUDI P, V13, P832, DOI 10.1109/TSA.2005.851925 Reju VG, 2010, IEEE T AUDIO SPEECH, V18, P101, DOI 10.1109/TASL.2009.2024380 RODRIGUES GF, 2009, P 8 IND C AN SIGN SE, P621 Roman N, 2003, J ACOUST SOC AM, V114, P2236, DOI 10.1121/1.1610463 Sawada H, 2004, IEEE T SPEECH AUDI P, V12, P530, DOI 10.1109/TSA.2004.832994 Sawada H, 2006, IEEE T AUDIO SPEECH, V14, P2165, DOI 10.1109/TASL.2006.872599 Sawada H, 2003, IEICE T FUND ELECTR, VE86A, P590 Sawada H, 2007, IEEE T AUDIO SPEECH, V15, P1592, DOI 10.1109/TASL.2007.899218 Schobben L, 2002, IEEE T SIGNAL PROCES, V50, P1855 Smaragdis P, 1998, NEUROCOMPUTING, V22, P21, DOI 10.1016/S0925-2312(98)00047-2 SOON VC, 1993, P IEEE INT S CIRC SY, V1, P703 Wang D., 2008, TRENDS AMPLIF, V12, P332 Wang D., 2006, COMPUTATIONAL AUDITO Wang DL, 2005, SPEECH SEPARATION BY HUMANS AND MACHINES, P181, DOI 10.1007/0-387-22794-6_12 Wang DL, 2009, J ACOUST SOC AM, V125, P2336, DOI 10.1121/1.3083233 WANG W, 2004, OSA Wang WW, 2005, IEEE T SIGNAL PROCES, V53, P1654, DOI 10.1109/TSP.2005.845433 Yoshioka T., 2009, P 17 EUR SIGN PROC C, P1432 NR 56 TC 6 Z9 7 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2011 VL 53 IS 4 BP 524 EP 539 DI 10.1016/j.specom.2011.01.002 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 742GJ UT WOS:000288929000005 ER PT J AU Le, N Ambikairajah, E Epps, J Sethu, V Choi, EHC AF Phu Ngoc Le Ambikairajah, Eliathamby Epps, Julien Sethu, Vidhyasaharan Choi, Eric H. C. TI Investigation of spectral centroid features for cognitive load classification SO SPEECH COMMUNICATION LA English DT Article DE Cognitive load; Gaussian mixture model; Spectral centroid feature; Frequency scale; Kullback-Leibler distance ID SPEECH RECOGNITION; STRESS; SYSTEM; WORKLOAD AB Speech is a promising modality for the convenient measurement of cognitive load, and recent years have seen the development of several cognitive load classification systems. Many of these systems have utilised mel frequency cepstral coefficients (MFCC) and prosodic features like pitch and intensity to discriminate between different cognitive load levels. However, the accuracies obtained by these systems are still not high enough to allow for their use outside of laboratory environments. One reason for this might be the imperfect acoustic description of speech provided by MFCCs. 
Since these features do not characterise the distribution of the spectral energy within subbands, in this paper, we investigate the use of spectral centroid frequency (SCF) and spectral centroid amplitude (SCA) features, applying them to the problem of automatic cognitive load classification. The effect of varying the number of filters and the frequency scale used is also evaluated, in terms of the effectiveness of the resultant spectral centroid features in discriminating between cognitive loads. The results of classification experiments show that the spectral centroid features consistently and significantly outperform a baseline system employing MFCC, pitch, and intensity features. Experimental results reported in this paper indicate that the fusion of an SCF based system with an SCA based system results in a relative reduction in error rate of 39% and 29% for two different cognitive load databases. (C) 2011 Elsevier B.V. All rights reserved. C1 [Phu Ngoc Le; Ambikairajah, Eliathamby; Epps, Julien; Sethu, Vidhyasaharan] Univ New S Wales, Sch Elect Engn & Telecommun, UNSW, Sydney, NSW 2052, Australia. [Phu Ngoc Le; Ambikairajah, Eliathamby; Epps, Julien; Choi, Eric H. C.] Natl ICT Australia NICTA, ATP Res Lab, Eveleigh 2015, Australia. RP Le, N (reprint author), Univ New S Wales, Sch Elect Engn & Telecommun, UNSW, Sydney, NSW 2052, Australia. EM phule@unsw.edu.au; ambi@ee.unsw.edu.au; j.epps@unsw.edu.au; vidhyasaharan@gmail.com; Eric.Choi@nicta.com.au CR Berthold A, 1999, CISM COUR L, P235 BORIL H, 2010, P INT MAKH CHIB JAP, P502 Fernandez R, 2003, SPEECH COMMUN, V40, P145, DOI 10.1016/S0167-6393(02)00080-8 Gajic B, 2006, IEEE T AUDIO SPEECH, V14, P600, DOI 10.1109/TSA.2005.855834 GERVEN PWM, 2004, PSYCHOPHYSIOLOGY, V41, P167 Goldberger J., 2005, P INT, P1985 GRIFFIN GR, 1987, AVIAT SPACE ENVIR MD, V58, P1165 HE L, 2009, P 5 INT C NAT COMP T, P260 Hosseinzadeh D, 2008, EURASIP J ADV SIG PR, DOI 10.1155/2008/258184 Khawaja M. A., 2007, P 19 AUSTR C COMP HU, P57, DOI 10.1145/1324892.1324902 Kleinschmidt M, 2002, ACTA ACUST UNITED AC, V88, P416 Kua J. M. K., 2010, P OD SPEAK LANG REC, P34 Le P. N., 2009, P 7 INT C INF COMM S, P1 LE PN, 2010, P 20 INT C PATT REC, P4516 LIVELY SE, 1993, J ACOUST SOC AM, V93, P2962, DOI 10.1121/1.405815 LU X, 2007, SPEECH COMMUN, P312 Mendoza E, 1998, J VOICE, V12, P263, DOI 10.1016/S0892-1997(98)80017-9 *MET, 2007, LEX FRAM READ MULLER C, 2001, LECT NOTES COMPUTER, P24 Paas F, 2003, EDUC PSYCHOL, V38, P1, DOI 10.1207/S15326985EP3801_1 PALIWAL KK, 1998, ACOUST SPEECH SIG PR, P617 Pass F.G.W.C., 1994, J EDUC PSYCHOL, V86, P122 Pelecanos J., 2001, ODYSSEY 2001 CRET GR, P213 PETERSON GE, 1952, J ACOUST SOC AM, V24, P175, DOI 10.1121/1.1906875 Reynolds DA, 2000, DIGIT SIGNAL PROCESS, V10, P19, DOI 10.1006/dspr.1999.0361 Scherer K. R., 2002, P ICSLP, P2017 SHANNON BJ, 2003, P MICR ENG RES C BRI, P1 SHRIBERG E, 1992, P DARPA SPEECH NAT L, P419, DOI 10.3115/1075527.1075628 STEENEKEN HJM, 1999, ACOUST SPEECH SIG PR, P2079 Talkin D., 1995, SPEECH CODING SYNTHE, P495 Thiruvaran T., 2006, P 11 AUSTR INT C SPE, P148 Yap T. F., 2010, P ISSPA KUAL LUMP MA, P221 YAP TF, 2010, P INTERSPEECH MAK CH, P2022 Yap TF, 2009, INT CONF ACOUST SPEE, P4825 YAP TF, 2010, P ICASSP DALL TEX US, P5234 Yin B, 2008, INT CONF ACOUST SPEE, P2041, DOI 10.1109/ICASSP.2008.4518041 Yin B., 2007, P CHISIG, P249, DOI 10.1145/1324892.1324946 NR 37 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD APR PY 2011 VL 53 IS 4 BP 540 EP 551 DI 10.1016/j.specom.2011.01.005 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 742GJ UT WOS:000288929000006 ER PT J AU Legat, M Matousek, J Tihelka, D AF Legat, M. Matousek, J. Tihelka, D. TI On the detection of pitch marks using a robust multi-phase algorithm SO SPEECH COMMUNICATION LA English DT Article DE Glottal closure instant; Pitch mark; Speech signal polarity; Fundamental frequency ID SPEECH SYNTHESIS; GLOTTAL CLOSURE; VOICED SPEECH; INSTANTS; EXCITATION; MODEL; WAVE AB A large number of methods for identifying glottal closure instants (GCIs) in voiced speech have been proposed in recent years. In this paper, we propose to take advantage of both glottal and speech signals in order to increase the accuracy of detection of GCIs. All aspects of this particular issue, from determining speech polarity to handling a delay between glottal and corresponding speech signal, are addressed. A robust multi-phase algorithm (MPA), which combines different methods applied on both signals in a unique way, is presented. Within the process, special attention is paid to the determination of speech waveform polarity, as it was found to considerably influence the performance of the detection algorithms. Another feature of the proposed method is that every detected GCI is given a confidence score, which makes it possible to locate potentially inaccurate GCI subsequences. The performance of the proposed algorithm was tested and compared with other freely available GCI detection algorithms. The MPA algorithm was found to be more robust in terms of detection accuracy over various sets of sentences, languages and phone classes. Finally, some pitfalls of GCI detection are discussed. (C) 2011 Elsevier B.V. All rights reserved. C1 [Legat, M.; Matousek, J.; Tihelka, D.] Univ W Bohemia, Fac Sci Appl, Dept Cybernet, Plzen 30614, Czech Republic. RP Matousek, J (reprint author), Univ W Bohemia, Fac Sci Appl, Dept Cybernet, Univ 8, Plzen 30614, Czech Republic. EM legatm@kky.zcu.cz; jmatouse@kky.zeu.cz RI Matousek, Jindrich/C-2146-2011; Tihelka, Daniel/A-4318-2012 OI Matousek, Jindrich/0000-0002-7408-7730; Tihelka, Daniel/0000-0002-3149-2330 FU Ministry of Education of the Czech Republic [2C06020]; Grant Agency of the Czech Republic [GACR 102/09/0989] FX This research was supported by the Ministry of Education of the Czech Republic, Project No. 2C06020, and by the Grant Agency of the Czech Republic, Project No. GACR 102/09/0989. The access to the METACentrum clusters provided under the research intent MSM6383917201 is highly appreciated.
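A minimal Python sketch of two ingredients mentioned in the abstract above, checking speech-waveform polarity and picking glottal closure instant (GCI) candidates from a differentiated glottal (EGG) signal with a crude confidence score, is given below. It is an illustrative sketch only, not the multi-phase algorithm (MPA) of the paper; the function names, the skewness heuristic and the thresholds are assumptions.

```python
import numpy as np
from scipy.signal import find_peaks

def speech_polarity(speech):
    """Heuristic polarity check: sign of the sample skewness (+1 or -1)."""
    s = speech - np.mean(speech)
    skew = np.mean(s ** 3) / (np.std(s) ** 3 + 1e-12)
    return 1 if skew >= 0 else -1

def gci_candidates(egg, fs, f0_max=400.0):
    """GCI candidates = prominent negative peaks of the differentiated EGG signal."""
    degg = np.diff(egg)
    min_dist = int(fs / f0_max)                 # no two GCIs closer than 1/f0_max
    peaks, props = find_peaks(-degg, distance=min_dist,
                              height=0.1 * np.max(-degg))
    confidence = props["peak_heights"] / np.max(-degg)   # crude per-GCI confidence
    return peaks, confidence
```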
CR BANGA ER, 2002, IMPROVEMENTS SPEECH, P52 BELLEGARDA JR, 2004, P 5 ISCA SPEECH SYNT, P133 Boersma P., 2005, PRAAT SOFTWARE SPEEC Brookes M, 2006, IEEE T AUDIO SPEECH, V14, P456, DOI 10.1109/TSA.2005.857810 Chambers JM, 1983, GRAPHICAL METHODS DA Chen J., 2001, COMPUTATIONAL LINGUI, V6, P1 CHENGYUAN L, 2004, P INTERSPEECH JEJ KO, P1189 de Cheveigne A, 2002, J ACOUST SOC AM, V111, P1917, DOI 10.1121/1.1458024 Dutoit T., 2008, SPRINGER HDB SPEECH, P437, DOI 10.1007/978-3-540-49127-9_21 Dutoit T, 1996, SPEECH COMMUN, V19, P119, DOI 10.1016/0167-6393(96)00029-5 Hagmuller M, 2006, SPEECH COMMUN, V48, P1650, DOI 10.1016/j.specom.2006.07.008 Hamon C., 1989, P INT C AC SPEECH SI, P238 Hanzlicek Z, 2007, INTERSPEECH 2007: 8TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION, VOLS 1-4, P681 Huang X., 2001, SPOKEN LANGUAGE PROC HUSSEIN H, 2007, P INTERSPEECH ANTW B, P1653 Hussein H, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P135 KLEIJN WB, 2002, P IEEE WORKSH SPEECH, P163 Legat M, 2007, LECT NOTES ARTIF INT, V4629, P502 Legat M, 2007, INTERSPEECH 2007: 8TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION, VOLS 1-4, P1641 Ma C, 1994, IEEE T SPEECH AUDI P, V2, P258 MATOUSEK J, 2008, P 6 INT C LANG RES E MATOUSEK J, 2001, P EUR 2001 ALB, V3, P2047 MATOUSEK J, 2004, P ICSLP JEJ KOR, V3, P1933 Matousek J, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P1626 MCKENNA JG, 2001, P 4 ISCA TUT RES WOR MOULINES E, 1990, SPEECH COMMUN, V9, P453, DOI 10.1016/0167-6393(90)90021-Z Murthy PS, 1999, IEEE T SPEECH AUDI P, V7, P609 Noth E, 2002, SPEECH COMMUN, V36, P45, DOI 10.1016/S0167-6393(01)00025-5 ROTHENBERG M, 1988, J SPEECH HEAR RES, V31, P338 ROTHENBERG M, 1992, J VOICE, V6, P36, DOI 10.1016/S0892-1997(05)80007-4 SAKAMOTO M, 2000, P INT C SPOK LANG PR, V3, P650 Schoentgen J, 2003, J VOICE, V17, P114, DOI 10.1016/S0892-1997(03)0014-6 SMITS R, 1995, IEEE T SPEECH AUDI P, V3, P325, DOI 10.1109/89.466662 STRUBE HW, 1974, J ACOUST SOC AM, V56, P1625, DOI 10.1121/1.1903487 Stylianou Y, 2001, IEEE T SPEECH AUDI P, V9, P21, DOI 10.1109/89.890068 Tuan V. N., 1999, P EUR C SPEECH TECHN, P2805 Valbret H., 1992, P ICASSP92 MARCH, V1, P145 Verhelst W., 1993, P IEEE INT C AC SPEE, V2, P554 WEN D, 1998, P ICASSP SEATTL WA, V2, P857, DOI 10.1109/ICASSP.1998.675400 YEGNANARAYANA B, 1995, INT CONF ACOUST SPEE, P776, DOI 10.1109/ICASSP.1995.479809 NR 40 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2011 VL 53 IS 4 BP 552 EP 566 DI 10.1016/j.specom.2011.01.008 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 742GJ UT WOS:000288929000007 ER PT J AU Ananthakrishnan, G Engwall, O AF Ananthakrishnan, G. Engwall, Olov TI Mapping between acoustic and articulatory gestures SO SPEECH COMMUNICATION LA English DT Article DE Acoustic gestures; Articulatory gestures; Acoustic-to-articulatory inversion; Critical trajectory error ID SPEECH; SEGMENTATION; MOVEMENTS; FEATURES; MODEL; RECOGNITION; PERCEPTION; HMM AB This paper proposes a definition for articulatory as well as acoustic gestures along with a method to segment the measured articulatory trajectories and acoustic waveforms into gestures. 
Using a simultaneously recorded acoustic-articulatory database, the gestures are detected based on finding critical points in the utterance, both in the acoustic and articulatory representations. The acoustic gestures are parameterized using 2-D cepstral coefficients. The articulatory trajectories are essentially the horizontal and vertical movements of Electromagnetic Articulography (EMA) coils placed on the tongue, jaw and lips along the midsagittal plane. The articulatory movements are parameterized using a 2D-DCT, with the same transformation that is applied to the acoustics. The relationship between the detected acoustic and articulatory gestures in terms of the timing as well as the shape is studied. In order to study this relationship further, acoustic-to-articulatory inversion is performed using GMM-based regression. The accuracy of predicting the articulatory trajectories from the acoustic waveforms is on a par with state-of-the-art frame-based methods with dynamical constraints (with an average error of 1.45-1.55 mm for the two speakers in the database). In order to evaluate the acoustic-to-articulatory inversion in a more intuitive manner, a method based on the error in estimated critical points is suggested. Using this method, it was noted that the articulatory trajectories estimated by the acoustic-to-articulatory inversion methods were still not accurate enough to be within the perceptual tolerance of audio-visual asynchrony. (C) 2011 Elsevier B.V. All rights reserved. C1 [Ananthakrishnan, G.; Engwall, Olov] KTH Royal Inst Technol, Sch Comp Sci & Commun, Ctr Speech Technol CTT, SE-10044 Stockholm, Sweden. RP Ananthakrishnan, G (reprint author), KTH Royal Inst Technol, Sch Comp Sci & Commun, Ctr Speech Technol CTT, SE-10044 Stockholm, Sweden. EM agopal@kth.se; engwall@kth.se FU Swedish Research Council [621-2008-4490] FX This work is supported by Grant 621-2008-4490 from the Swedish Research Council.
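The GMM-based regression named in the abstract above for acoustic-to-articulatory inversion can be sketched generically: fit a joint Gaussian mixture on stacked acoustic/articulatory vectors and take the posterior-weighted conditional means as the articulatory estimate. The fragment below is a sketch of that general family (no dynamical constraints or delta features), with illustrative names, and is not the authors' implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def fit_joint_gmm(acoustic, articulatory, n_components=16, seed=0):
    """Fit a joint GMM on stacked [acoustic | articulatory] training vectors."""
    joint = np.hstack([acoustic, articulatory])
    gmm = GaussianMixture(n_components, covariance_type="full",
                          random_state=seed).fit(joint)
    return gmm, acoustic.shape[1]

def invert(gmm, dim_x, x):
    """MMSE estimate E[y | x]: posterior-weighted sum of conditional means."""
    means, covs, w = gmm.means_, gmm.covariances_, gmm.weights_
    resp = np.array([w[k] * multivariate_normal.pdf(x, means[k, :dim_x],
                                                    covs[k][:dim_x, :dim_x])
                     for k in range(len(w))])
    resp /= resp.sum()
    y = np.zeros(means.shape[1] - dim_x)
    for k in range(len(w)):
        mx, my = means[k, :dim_x], means[k, dim_x:]
        Sxx, Syx = covs[k][:dim_x, :dim_x], covs[k][dim_x:, :dim_x]
        y += resp[k] * (my + Syx @ np.linalg.solve(Sxx, x - mx))
    return y

# Usage sketch: gmm, dx = fit_joint_gmm(X_train, Y_train); y_hat = invert(gmm, dx, x_test)
```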
CR ANANTHAKRISHNAN G, 2006, P INT C INT SENS INF, P115 Ananthakrishnan G., 2009, P INT BRIGHT UK, P2799 ARIKI Y, 1989, IEE PROC-I, V136, P133 Atal S., 1978, J ACOUST SOC AM, V63, P1535 Bilmes J., 1998, INT COMPUT SCI I, V4, P1 Browman Catherine, 1986, PHONOLOGY YB, V3, P219 CHILDERS DG, 1995, SPEECH COMMUN, V16, P127, DOI 10.1016/0167-6393(94)00050-K Diehl RL, 2004, ANNU REV PSYCHOL, V55, P149, DOI 10.1146/annurev.psych.55.090902.142028 DUSAN S, 2000, P 5 SEM SPEECH PROD, P237 ENGWALL O, 2006, P 7 INT SEM SPEECH P, P469 FARHAT A, 1993, P EUROSPEECH, P657 Fowler CA, 1996, J ACOUST SOC AM, V99, P1730, DOI 10.1121/1.415237 GHOLAMPOUR I, 1998, P INT C SPOK LANG PR, P1555 Hiroya S, 2004, IEEE T SPEECH AUDI P, V12, P175, DOI 10.1109/TSA.2003.822636 Hoole P., 1996, FORSCHUNGSBERICHTE I, V34, P158 KATSAMANIS A, 2008, P EUR SIGN PROC C LA KEATING PA, 1984, LANGUAGE, V60, P286, DOI 10.2307/413642 Kjellstrom H, 2009, SPEECH COMMUN, V51, P195, DOI 10.1016/j.specom.2008.07.005 LIBERMAN AM, 1967, PSYCHOL REV, V74, P431, DOI 10.1037/h0020279 Liu SA, 1996, J ACOUST SOC AM, V100, P3417, DOI 10.1121/1.416983 MACNEILA.PF, 1970, PSYCHOL REV, V77, P182, DOI 10.1037/h0029070 MAEDA S, 1988, J ACOUST SOC AM, V84, pS146, DOI 10.1121/1.2025845 Markov K, 2006, SPEECH COMMUN, V48, P161, DOI 10.1016/j.specom.2005.07.003 McGowan RS, 2009, J ACOUST SOC AM, V126, P2011, DOI 10.1121/1.3184581 MILLER JD, 1989, J ACOUST SOC AM, V85, P2114, DOI 10.1121/1.397862 MILNER B, 1995, P EUROSPEECH, P519 MOORE BCJ, 1983, J ACOUST SOC AM, V74, P750, DOI 10.1121/1.389861 Neiberg D, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P1485 NEIBERG D, 2009, P INTERSPEECH BRIGHT, P1387 OUNI S, 2002, P INT C SPOK LANG PR, P2301 OZBEK IY, 2009, P INT BRIGHT UK, P2807 PERKELL JS, 1992, J ACOUST SOC AM, V92, P3078, DOI 10.1121/1.404204 Perrier P, 2008, J NEUROPHYSIOL, V100, P1171, DOI 10.1152/jn.01116.2007 Qin C, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P2306 Reeves B., 1993, EFFECTS AUDIO VIDEO Richmond K., 2002, THESIS CTR SPEECH TE Richmond K, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P577 SAITO T, 1998, P ICSLP SYDN AUSTR, V7, P2839 Sarkar A, 2005, INT CONF ACOUST SPEE, P397 SCHMIDT RA, 1979, PSYCHOL REV, V86, P415, DOI 10.1037//0033-295X.86.5.415 SENEFF S, 1988, TRANSCRIPTION ALIGNM Stephenson T., 2000, P INT C SPOK LANG PR, P951 Stevens KN, 2002, J ACOUST SOC AM, V111, P1872, DOI 10.1121/1.1458026 Sung H. G., 2004, THESIS RICE U HOUSTO Svendsen T., 1987, Proceedings: ICASSP 87. 1987 International Conference on Acoustics, Speech, and Signal Processing (Cat. No.87CH2396-0) Toda T., 2004, 5 ISCA SPEECH SYNTH, P31 Toda T., 2004, P ICSLP JEJ ISL KOR, P1129 Toda T, 2008, SPEECH COMMUN, V50, P215, DOI 10.1016/j.specom.2007.09.001 Toledano DT, 2003, IEEE T SPEECH AUDI P, V11, P617, DOI 10.1109/TSA.2003.813579 Toutios A., 2003, 6 HELL EUR C COMP MA, P1 VANHEMERT JP, 1991, IEEE T SIGNAL PROCES, V39, P1008, DOI 10.1109/78.80941 VIVIANI P, 1982, NEUROSCIENCE, V7, P431, DOI 10.1016/0306-4522(82)90277-9 Wrench A., 1999, MOCHA TIMIT ARTICULA Wrench A. 
A., 2000, P ICSLP BEIJ CHIN, P145 Yehia H, 1998, SPEECH COMMUN, V26, P23, DOI 10.1016/S0167-6393(98)00048-X Zhang L, 2008, IEEE SIGNAL PROC LET, V15, P245, DOI 10.1109/LSP.2008.917004 ZLOKARNIK I, 1993, P 3 EUR C SPEECH COM, P2215 ZUE V, 1989, P IEEE INT C ACOUSTI, P389 NR 58 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2011 VL 53 IS 4 BP 567 EP 589 DI 10.1016/j.specom.2011.01.009 PG 23 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 742GJ UT WOS:000288929000008 ER PT J AU Vayrynen, E Toivanen, J Seppanen, T AF Vayrynen, Eero Toivanen, Juhani Seppanen, Tapio TI Classification of emotion in spoken Finnish using vowel-length segments: Increasing reliability with a fusion technique SO SPEECH COMMUNICATION LA English DT Article DE Automatic classification of emotion; Prosodic features; Vocal source features; Classifier fusion; Vowel segments; Spoken Finnish ID FLOATING SEARCH METHODS; FEATURE-SELECTION; RECOGNITION; SPEECH; QUOTIENT; STRESS AB Classification of emotional content of short Finnish emotional [a:] vowel speech samples is performed using vocal source parameter and traditional intonation contour parameter derived prosodic features. A multiple kNN classifier based decision level fusion classification architecture is proposed for multimodal speech prosody and vocal source expert fusion. The sum fusion rule and the sequential forward floating search (SFFS) algorithm are used to produce leveraged expert classifiers. Automatic classification tests in five emotional classes demonstrate that significantly higher than random level emotional content classification performance is achievable using both prosodic and vocal source features. The fusion classification approach is further shown to be capable of emotional content classification in the vowel domain approaching the performance level of the human reference. (C) 2010 Elsevier B.V. All rights reserved. C1 [Vayrynen, Eero; Seppanen, Tapio] Univ Oulu, Elect & Informat Engn Dept, Comp Engn Lab, FI-90014 Oulu, Finland. [Toivanen, Juhani] Acad Finland, FI-90014 Oulu, Finland. [Toivanen, Juhani] Univ Oulu, Elect & Informat Engn Dept, Informat Proc Lab, FI-90014 Oulu, Finland. RP Vayrynen, E (reprint author), Univ Oulu, Elect & Informat Engn Dept, Comp Engn Lab, POB 4500, FI-90014 Oulu, Finland. EM eero.vayrynen@ee.oulu.fi; juhani.toi-vanen@ee.oulu.fi; tapio.seppanen@ee.oulu.fi FU Academy of Finland [1114920]; Infotech Oulu Graduate School of the University of Oulu FX Academy of Finland (project number 1114920) and Infotech Oulu Graduate School of the University of Oulu are gratefully acknowledged for financial support. 
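The decision-level fusion described in the abstract above (separate prosody and vocal-source experts combined with the sum rule) can be prototyped as below. This is a generic sketch with assumed variable names; it omits the SFFS feature-selection stage and any expert leveraging.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def sum_rule_fusion(prosodic_train, source_train, labels,
                    prosodic_test, source_test, k=7):
    """Two kNN experts fused at the decision level with the sum rule."""
    expert_prosody = KNeighborsClassifier(n_neighbors=k).fit(prosodic_train, labels)
    expert_source = KNeighborsClassifier(n_neighbors=k).fit(source_train, labels)
    # Sum (equivalently, average) of the experts' class-posterior estimates.
    posteriors = (expert_prosody.predict_proba(prosodic_test)
                  + expert_source.predict_proba(source_test))
    return expert_prosody.classes_[np.argmax(posteriors, axis=1)]
```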
CR Airas M., 2005, P INT 2005 LISB PORT, P2145 ALKU P, 1992, SPEECH COMMUN, V11, P109, DOI 10.1016/0167-6393(92)90005-R Alku P, 2002, J ACOUST SOC AM, V112, P701, DOI 10.1121/1.1490365 Alku P, 1996, SPEECH COMMUN, V18, P131, DOI 10.1016/0167-6393(95)00040-2 Banse R, 1996, J PERS SOC PSYCHOL, V70, P614, DOI 10.1037/0022-3514.70.3.614 Barra R, 2006, INT CONF ACOUST SPEE, P1085 Batliner A, 2000, P ISCA WORKSH SPEECH, P195 Breazeal C, 2002, AUTON ROBOT, V12, P83, DOI 10.1023/A:1013215010749 CAMPBELL JP, 2003, 9 8 EUR C SPEECH COM, P2665 Cowie R, 2001, IEEE SIGNAL PROC MAG, V18, P32, DOI 10.1109/79.911197 Cruttenden Alan, 1997, INTONATION, V2nd DELLAERT F, 1996, P 4 INT C SPOK LANG, V3, P1970, DOI 10.1109/ICSLP.1996.608022 Drugman T., 2008, P IEEE INT C SIGN PR Duda, 2001, PATTERN CLASSIFICATI Engberg I. S., 1996, DOCUMENTATION DANISH Gobl C, 2003, P ISCA TUT RES WORKS, P151 Kim S, 2007, 2007 IEEE NINTH WORKSHOP ON MULTIMEDIA SIGNAL PROCESSING, P48, DOI 10.1109/MMSP.2007.4412815 Kittler J, 2000, LECT NOTES COMPUT SC, V1876, P45 Kittler J, 1998, IEEE T PATTERN ANAL, V20, P226, DOI 10.1109/34.667881 Laukkanen AM, 1996, J PHONETICS, V24, P313, DOI 10.1006/jpho.1996.0017 Lee C.M., 2001, P IEEE WORKSH AUT SP McGilloway S., 2000, P ISCA WORKSH SPEECH, P207 Morrison D, 2007, SPEECH COMMUN, V49, P98, DOI 10.1016/j.specom.2006.11.004 Mozziconacci S., 1999, P 14 INT C PHON SCI, P2001 Nwe TL, 2003, SPEECH COMMUN, V41, P603, DOI 10.1016/S01167-6393(03)00099-2 Oudeyer PY, 2003, INT J HUMAN COMPUTER, V59, P157, DOI DOI 10.1016/S1071-581(02)00141-6 Paeschke A., 2000, P ISCA WORKSH SPEECH, P75 Polzin T., 2000, P ISCA WORKSH SPEECH, P201 PUDIL P, 1994, PATTERN RECOGN LETT, V15, P1119, DOI 10.1016/0167-8655(94)90127-9 Pulakka H., 2005, THESIS HELSINKI U TE Schaefer A, 2003, NEUROIMAGE, V18, P938, DOI 10.1016/S1053-8119(03)00009-0 Scherer KR, 2003, SPEECH COMMUN, V40, P227, DOI 10.1016/S0167-6393(02)00084-5 SCHULLER B, 2006, P 32 DTSCH JAHR AK D Seppanen Tapio, 2003, P EUR GEN SWITZ, P717 Slaney M, 2003, SPEECH COMMUN, V39, P367, DOI 10.1016/S0167-6393(02)00049-3 Somol P, 1999, PATTERN RECOGN LETT, V20, P1157, DOI 10.1016/S0167-8655(99)00083-5 Suomi K., 2008, STUDIA HUMANIORA OUL, V9 Suomi K, 2003, J PHONETICS, V31, P113, DOI 10.1016/S0095-4470(02)00074-8 ten Bosch L, 2003, SPEECH COMMUN, V40, P213, DOI 10.1016/S0167-6393(02)00083-3 Toivanen J., 2003, P 15 INT C PHON SCI, V3, P2469 Toivanen J, 2004, LANG SPEECH, V47, P383 Ververidis D., 2004, P ICASSP2004 Vroomen J, 1998, J MEM LANG, V38, P133, DOI 10.1006/jmla.1997.2548 NR 43 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2011 VL 53 IS 3 BP 269 EP 282 DI 10.1016/j.specom.2010.09.007 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 724FK UT WOS:000287561300001 ER PT J AU Shinoda, K Watanabe, Y Iwata, K Liang, YA Nakagawa, R Furui, S AF Shinoda, Koichi Watanabe, Yasushi Iwata, Kenji Liang, Yuan Nakagawa, Ryuta Furui, Sadaoki TI Semi-synchronous speech and pen input for mobile user interfaces SO SPEECH COMMUNICATION LA English DT Article DE User interfaces; Speech recognition; Handwritten character recognition; Multi-modal recognition; Adaptation AB This paper proposes new interfaces using semi-synchronous speech and pen input for mobile environments. A user speaks while writing, and the pen input complements the speech so that recognition performance will be higher than with speech alone. 
Since the input speed and input information are different between the two modes, speaking and writing, a time lag always exists between them. Therefore, conventional multi-modal recognition algorithms cannot be directly applied to this interface. To tackle this problem, we developed a multi-modal recognition algorithm that can handle this asynchronicity (time-lag) by using a segment-based unification scheme and a method of adapting to the time-lag characteristics of individual users. Five different pen-input interfaces, each of which is assumed to be given for a phrase unit in speech, were evaluated in speech recognition experiments using noisy speech data. The recognition accuracy of the proposed method was higher than that of speech alone in all five interfaces. We also carried out a subjective test to examine the usability of each interface. We found a trade-off between usability and improvement in recognition performance. (C) 2010 Elsevier B.V. All rights reserved. C1 [Shinoda, Koichi; Watanabe, Yasushi; Iwata, Kenji; Liang, Yuan; Nakagawa, Ryuta; Furui, Sadaoki] Tokyo Inst Technol, Dept Comp Sci, Meguro Ku, Tokyo 1528552, Japan. RP Shinoda, K (reprint author), Tokyo Inst Technol, Dept Comp Sci, Meguro Ku, 2-12-1 Ookayama, Tokyo 1528552, Japan. EM shinoda@cs.titech.ac.jp RI Shinoda, Koichi/D-3198-2014 OI Shinoda, Koichi/0000-0003-1095-3203 FU JSPS [15300054, 20300063] FX This research was partially supported by JSPS Grants-in-Aid for Scientific Research (B) 15300054 and 20300063. CR BAN H, 2004, P INTERSPEECH 2004 I HAYAMIZU S, 1993, IEICE T INF SYST, VE76D, P17 Hui PY, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P1197 Itahashi S., 1991, Journal of the Acoustical Society of Japan, V47 Itou K., 1998, P ICSLP, P3261 Lee A., 2001, P EUR C SPEECH COMM, P1691 LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010 NAKAGAWA M, 1995, TECHNICAL REPORT IEI, V95, P43 NAKAI N, 2001, P ICDAR 2001, P491 Tamura S, 2004, J VLSI SIG PROC SYST, V36, P117, DOI 10.1023/B:VLSI.0000015091.47302.07 Watanabe Y, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P2675 WATANABE Y, 2007, P ICASSP 2007 HON HA, V4, P409 Wu LZ, 1999, IEEE T MULTIMEDIA, V1, P334 ZHOU X, 2006, P ICASSP 2006, V1, P609 JAPANESE DICTATION T NR 15 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2011 VL 53 IS 3 BP 283 EP 291 DI 10.1016/j.specom.2010.10.001 PG 9 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 724FK UT WOS:000287561300002 ER PT J AU Vieru, B de Mareueil, PB Adda-Decker, M AF Vieru, Bianca de Mareueil, Philippe Boula Adda-Decker, Martine TI Characterisation and identification of non-native French accents SO SPEECH COMMUNICATION LA English DT Article DE Foreign accents; Non-native French; Perceptual experiments; Automatic speech alignment; Pronunciation variants; Data mining techniques; Automatic classification ID FOREIGN ACCENT; PRONUNCIATION VARIANTS; ENGLISH; PERCEPTION; LANGUAGE; SPEECH AB This paper focuses on foreign accent characterisation and identification in French. How many accents may a native French speaker recognise and which cues does (s)he use? 
Our interest concentrates on French productions stemming from speakers of six different mother tongues: Arabic, English, German, Italian, Portuguese and Spanish, also compared with native French speakers (from the Ile-de-France region). Using automatic speech processing, our objective is to identify the most reliable acoustic cues distinguishing these accents, and to link these cues with human perception. We measured acoustic parameters such as duration and voicing for consonants, the first two formant values for vowels, word-final schwa-related prosodic features and the percentages of confusions obtained using automatic alignment including non-standard pronunciation variants. Machine learning techniques were used to select the most discriminant cues distinguishing different accents and to classify speakers according to their accents. The results obtained in automatic identification of the different linguistic origins under investigation compare favourably to perceptual data. Major identified accent-specific cues include the devoicing of voiced stop consonants, /b/similar to/v/ and /s/similar to/z/ confusions, the "rolled r" and schwa fronting or raising. These cues can contribute to improve pronunciation modeling in automatic speech recognition of accented speech. (C) 2010 Elsevier B.V. All rights reserved. C1 [Vieru, Bianca; de Mareueil, Philippe Boula; Adda-Decker, Martine] LIMSI CNRS, F-91403 Orsay, France. RP Adda-Decker, M (reprint author), LIMSI CNRS, BP 133, F-91403 Orsay, France. EM madda@limsi.fr CR Abdelli-Beruh NB, 2004, PHONETICA, V61, P201, DOI 10.1159/000084158 Adank Patti, 2003, THESIS RADBOUD U NIJ Adda-Decker M, 1999, SPEECH COMMUN, V29, P83, DOI 10.1016/S0167-6393(99)00032-1 Adda-Decker M., 2007, P INT C PHON SCI ICP, P613 ALBA O, 2001, MANUAL FONETICA HISP ANGKITITRAKUL P, 2003, P INTERSPEECH 2003 E, P1353 Arai T., 1997, P EUR RHOD GREEC, P1011 Arslan LM, 1997, J ACOUST SOC AM, V102, P28, DOI 10.1121/1.419608 BARTKOVA K, 2004, P SPECOM 04 INT C SP, P22 Berkling K, 2001, SPEECH COMMUN, V35, P125, DOI 10.1016/S0167-6393(00)00100-X Boersma P., 2001, GLOT INT, V5, P341 BOSCH L, 2002, P ISCA WORKSH PRON M, P111 Boula de Marettil P., 2004, P INT JEJ ISL KOR, P341 Bouselmi G, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P109 Calliope, 1989, PAROLE SON TRAITEMEN CINCAREK T, 2004, P ICSLP JEJ ISL KOR, P1509 Clopper CG, 2004, J PHONETICS, V32, P111, DOI 10.1016/S0095-4470(03)00009-3 Delattre P, 1965, COMP PHONETIC FEATUR DELLWO V, 2010, THESIS U BONN GERMAN de Mareuil PB, 2006, PHONETICA, V63, P247, DOI 10.1159/000097308 DISNER SF, 1980, J ACOUST SOC AM, V67, P253, DOI 10.1121/1.383734 Durand J., 2003, TRIBUNE INT LANGUES, V33, P3 FERRAGNE E, 2007, SPEAKER CLASSIFICATI, V2, P243 Flege J., 1982, STUDIES 2 LANGUAGE A, V5, P1, DOI 10.1017/S0272263100004563 FLEGE JE, 1984, J ACOUST SOC AM, V76, P692, DOI 10.1121/1.391256 FLEGE JE, 1981, LANG SPEECH, V24, P125 Flege JE, 2003, SPEECH COMMUN, V40, P467, DOI 10.1016/S0167-6393(02)00128-0 FRELANDRICARD M, 1996, REV PHONET APPL, P61 Frota S., 2007, SEGMENTAL PROSODIC I, P131 Gauvain J.L., 2005, P INT LISB PORT, P1665 Gendrot C., 2005, P INT LISB, P2453 Ghazali S., 2002, P 1 INT C SPEECH PRO, P331 Goronzy S, 2004, SPEECH COMMUN, V42, P109, DOI 10.1016/j.specom.2003.09.003 Grabe Esther, 2002, LAB PHONOLOGY, V7, P515 Gutierrez Diez F., 2008, J ACOUST SOC AM, V123, P3886 GUYON I, 2003, J MACHINE LEARN RES, V3, P1265 Harrington J, 2000, NATURE, V408, P927, DOI 10.1038/35050160 HUCKVALE M, 
2007, P INT C PHON SCI SAA, P1821 HUCKVALE M, 2004, P ICSLP, P29 Ihaka R., 1996, J COMPUTATIONAL GRAP, V5, P299, DOI DOI 10.2307/1390807 Jilka M., 2000, THESIS U STUTTGART G King R. W., 1997, P EUROSPEECH 97, P2323 Lamel L, 2007, INT CONF ACOUST SPEE, P997 LIVESCU K, 2000, ACOUST SPEECH SIG PR, P1683 Magen HS, 1998, J PHONETICS, V26, P381, DOI 10.1006/jpho.1998.0081 Martin A. F., 2008, P OD SPEAK LANG REC NEAREY TM, 1989, J ACOUST SOC AM, V85, P2088, DOI 10.1121/1.397861 Piske T, 2001, J PHONETICS, V29, P191, DOI 10.1006/jpho.2001.0134 Quilis Antonio, 1993, TRATADO FONOLOGIA FO Ramus F., 1999, THESIS EHESS PARIS Raux A., 2004, P ICSLP 04 INT C SPO, P613 ROMANO A, 2010, DIMENSIONE TEMPORALE, P45 Rouas JL, 2008, SPEECH COMMUN, V50, P965, DOI 10.1016/j.specom.2008.05.006 Sangwan A, 2009, INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, VOLS 1-5, P172 SCHADEN S, 2003, P INT C LANG RES EV, P1395 SILKE G, 2004, SPEECH COMMUN, V42, P109 VELOSO J, 2007, ACT JOURN ET LING NA, P55 VIERUDIMULESCU B, 2008, THESIS U PARIS SUD O Witten I.H., 2005, DATA MINING PRACTICA WOEHRLING C, 2009, P INT BRIGHT UK, P2183 Woehrling Cecile, 2006, REV PAROLE, V37, P25 Yamada R. A., 1994, P INT C SPOK LANG PR, P2023 NR 62 TC 9 Z9 9 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2011 VL 53 IS 3 BP 292 EP 310 DI 10.1016/j.specom.2010.10.002 PG 19 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 724FK UT WOS:000287561300003 ER PT J AU Mayo, C Clark, RAJ King, S AF Mayo, Catherine Clark, Robert A. J. King, Simon TI Listeners' weighting of acoustic cues to synthetic speech naturalness: A multidimensional scaling analysis SO SPEECH COMMUNICATION LA English DT Article DE Speech synthesis; Evaluation; Speech perception; Acoustic cue weighting; Multidimensional scaling ID VOICE QUALITY ASSESSMENT; FINAL STOP CONSONANTS; COMPLEX SOUNDS; CHILDREN; ADULTS; PERCEPTION; CATEGORIZATION; ENGLISH; RATINGS AB The quality of current commercial speech synthesis systems is now so high that system improvements are being made at subtle sub- and supra-segmental levels. Human perceptual evaluation of such subtle improvements requires a highly sophisticated level of perceptual attention to specific acoustic characteristics or cues. However, it is not well understood what acoustic cues listeners attend to by default when asked to evaluate synthetic speech. It may, therefore, be potentially quite difficult to design an evaluation method that allows listeners to concentrate on only one dimension of the signal, while ignoring others that are perceptually more important to them. The aim of the current study was to determine which acoustic characteristics of unit-selection synthetic speech are most salient to listeners when evaluating the naturalness of such speech. This study made use of multidimensional scaling techniques to analyse listeners' pairwise comparisons of synthetic speech sentences. Results indicate that listeners place a great deal of perceptual importance on the presence of artifacts and discontinuities in the speech, somewhat less importance on aspects of segmental quality, and very little importance on stress/intonation appropriateness. 
These relative differences in importance will impact on listeners' ability to attend to these different acoustic characteristics of synthetic speech, and should therefore be taken into account when designing appropriate methods of synthetic speech evaluation. (C) 2010 Elsevier B.V. All rights reserved. C1 [Mayo, Catherine; Clark, Robert A. J.; King, Simon] Univ Edinburgh, Ctr Speech Technol Res, Edinburgh EH8 9AB, Midlothian, Scotland. RP Mayo, C (reprint author), Univ Edinburgh, Ctr Speech Technol Res, 10 Crichton St, Edinburgh EH8 9AB, Midlothian, Scotland. EM catherin@ling.ed.ac.uk; robert@cstr.ed.ac.uk; simon.king@ed.ac.uk CR Allen P, 1997, J ACOUST SOC AM, V102, P2255, DOI 10.1121/1.419637 Allen P, 2002, J ACOUST SOC AM, V112, P211, DOI 10.1121/1.1482075 BAILLY G, 2003, ISCA SPEC SESS HOT T BEST CT, 1981, PERCEPT PSYCHOPHYS, V29, P191, DOI 10.3758/BF03207286 Black A., 1997, FESTIVAL SPEECH SYNT Bradlow AR, 1999, PERCEPT PSYCHOPHYS, V61, P206, DOI 10.3758/BF03206883 Cernak M, 2005, P EUR C AC, P2725 CERNAK M, 2009, P ICSVI INT C SOUND CHEN JD, 1999, P EUROSPEECH, P611 Christensen LA, 1997, J ACOUST SOC AM, V102, P2297, DOI 10.1121/1.419639 Clark R. A. J., 1999, P EUR 99 6 EUR C SPE, P1623 Clark RAJ, 2007, SPEECH COMMUN, V49, P317, DOI 10.1016/j.specom.2007.01.014 CLARK RAJ, 2003, INT C PHON SCI BARCE, P1141 *EXP ADV GROUP LAN, 1996, EV NAT LANG PROC SYS FALK TH, 2008, P BLIZZ WORKSH BRISB Fisher C, 1996, SIGNAL TO SYNTAX: BOOTSTRAPPING FROM SPEECH TO GRAMMAR IN EARLY ACQUISITION, P343 Francis AL, 2008, J ACOUST SOC AM, V124, P1234, DOI 10.1121/1.2945161 Garofolo J., 1988, GETTING STARTED DARP GORDON PC, 1993, COGNITIVE PSYCHOL, V25, P1, DOI 10.1006/cogp.1993.1001 Hall JL, 2001, J ACOUST SOC AM, V110, P2167, DOI 10.1121/1.1397322 Hazan V, 2000, J PHONETICS, V28, P377, DOI 10.1006/jpho.2000.0121 Hazan V, 1998, SPEECH COMMUN, V24, P211, DOI 10.1016/S0167-6393(98)00011-9 HAZAN V, 1998, ICSLP SYD AUSTR, P2163 HIRST D, 1998, P ESCA COCOSDA WORKS *ITU T, 1994, P85 ITUT Iverson P, 2005, J ACOUST SOC AM, V118, P3267, DOI 10.1121/1.2062307 Iverson P, 2003, COGNITION, V87, pB47, DOI 10.1016/S0010-0277(02)00198-1 JILKA M, 2003, P 15 ICPHS BARC, P2549 JILKA M, 2005, P INT 2005 LISB PORT, P2393 Jusczyk P. W., 1997, DISCOVERY SPOKEN LAN KLABBERS E, 2001, IEEE T SPEECH AUDIO, V9 Klabbers E., 1998, P ICSLP, P1983 Kreiman J, 1998, J ACOUST SOC AM, V104, P1598, DOI 10.1121/1.424372 Kreiman J, 2007, J ACOUST SOC AM, V122, P2354, DOI 10.1121/1.2770547 Kreiman J, 2000, J ACOUST SOC AM, V108, P1867, DOI 10.1121/1.1289362 KREIMAN J, 2004, J ACOUST SOC AM, V115, P2609 Kruskal J. 
B., 1978, SAGE U PAPER SERIES LAMEL LF, 1989, P SPEECH I O ASS SPE, P2161 Marozeau J, 2003, J ACOUST SOC AM, V114, P2946, DOI 10.1121/1.1618239 Mayo C, 2004, J ACOUST SOC AM, V115, P3184, DOI 10.1121/1.1738838 Mayo C, 2005, J ACOUST SOC AM, V118, P1730, DOI 10.1121/1.1979451 MAYO C, 2005, P INT 2005 LISB PORT MOLLER S, 2009, P NAG DAGA 2009 ROTT, P1168 Nittrouer S, 2004, J ACOUST SOC AM, V115, P1777, DOI 10.1121/1.1651192 CUTLER A, 1994, J MEM LANG, V33, P824, DOI 10.1006/jmla.1994.1039 PLUMPE M, 1998, P ESCA COCOSDA WORKS RABINOV CR, 1995, J SPEECH HEAR RES, V38, P26 Schnieder W., 2002, E PRIME USERS GUIDE STYLIANOU Y, 2001, P ICASSP INT C AC SP SYRDAL A, 2004, J ACOUST SOC AM, V115, P2543 SYRDAL AK, 2001, P EUROSPEECH AALB DE, P979 Turk A., 2006, METHODS EMPIRICAL PR, P1 VAINIO M, 2002, P IEEE 2002 WORKSH S Vepa J., 2004, TEXT SPEECH SYNTHESI WARDRIPFRUIN C, 1985, J ACOUST SOC AM, V77, P1907, DOI 10.1121/1.391833 WARDRIPFRUIN C, 1982, J ACOUST SOC AM, V71, P187, DOI 10.1121/1.387346 Watson J. M. M., 1997, THESIS QUEEN MARGARE WIGHTMAN CW, 1992, J ACOUST SOC AM, V91, P1707, DOI 10.1121/1.402450 WOUTERS J, 1998, P ICSLP, V6, P2747, DOI DOI 10.1109/ICASSP.2001.941045 Zen H, 2007, COMPUT SPEECH LANG, V21, P153, DOI 10.1016/j.csl.2006.01.002 NR 60 TC 3 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2011 VL 53 IS 3 BP 311 EP 326 DI 10.1016/j.specom.2010.10.003 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 724FK UT WOS:000287561300004 ER PT J AU Paliwal, K Schwerin, B Wojcicki, K AF Paliwal, Kuldip Schwerin, Belinda Wojcicki, Kamil TI Role of modulation magnitude and phase spectrum towards speech intelligibility SO SPEECH COMMUNICATION LA English DT Article DE Analysis frame duration; Modulation frame duration; Modulation domain; Modulation magnitude spectrum; Modulation phase spectrum; Speech intelligibility; Speech transmission index (STI); Analysis-modification-synthesis (AMS) ID TRANSMISSION INDEX; QUALITY ESTIMATION; RECOGNITION; ENHANCEMENT AB In this paper our aim is to investigate the properties of the modulation domain and more specifically, to evaluate the relative contributions of the modulation magnitude and phase spectra towards speech intelligibility. For this purpose, we extend the traditional (acoustic domain) analysis modification synthesis framework to include modulation domain processing. We use this framework to construct stimuli that retain only selected spectral components, for the purpose of objective and subjective intelligibility tests. We conduct three experiments. In the first, we investigate the relative contributions to intelligibility of the modulation magnitude, modulation phase, and acoustic phase spectra. In the second experiment, the effect of modulation frame duration on intelligibility for processing of the modulation magnitude spectrum is investigated. In the third experiment, the effect of modulation frame duration on intelligibility for processing of the modulation phase spectrum is investigated. Results of these experiments show that both the modulation magnitude and phase spectra are important for speech intelligibility, and that significant improvement is gained by the inclusion of acoustic phase information. 
They also show that smaller modulation frame durations improve intelligibility when processing the modulation magnitude spectrum, while longer frame durations improve intelligibility when processing the modulation phase spectrum. (C) 2010 Elsevier B.V. All rights reserved. C1 [Paliwal, Kuldip; Schwerin, Belinda; Wojcicki, Kamil] Griffith Univ, Signal Proc Lab, Sch Engn, Brisbane, Qld 4111, Australia. RP Schwerin, B (reprint author), Griffith Univ, Signal Proc Lab, Sch Engn, Nathan Campus, Brisbane, Qld 4111, Australia. EM belsch71@gmail.com CR Atlas L., 2004, P IEEE INT C AC SPEE, V2, P761 Atlas LE, 2001, P SOC PHOTO-OPT INS, V4474, P1, DOI 10.1117/12.448636 DRULLMAN R, 1994, J ACOUST SOC AM, V95, P2670, DOI 10.1121/1.409836 Falk T., 2008, P INT WORKSH AC ECH Falk T. H., 2007, P ISCA C INT SPEECH, P970 Falk TH, 2010, IEEE T AUDIO SPEECH, V18, P90, DOI 10.1109/TASL.2009.2023679 Falk TH, 2010, IEEE T AUDIO SPEECH, V18, P1766, DOI 10.1109/TASL.2010.2052247 Goldsworthy RL, 2004, J ACOUST SOC AM, V116, P3679, DOI 10.1121/1.1804628 GREENBERG S, 1998, P INT C SPOK LANG PR, V6, P2803 GREENBERG S, 1997, P ICASSP, V3, P1647 GRIFFIN DW, 1984, IEEE T ACOUST SPEECH, V32, P236, DOI 10.1109/TASSP.1984.1164317 Hanson B. A., 1993, P IEEE INT C AC SPEE, VII, P79 HOUTGAST T, 1985, J ACOUST SOC AM, V77, P1069, DOI 10.1121/1.392224 Hu Y, 2007, SPEECH COMMUN, V49, P588, DOI 10.1016/j.specom.2006.12.006 Huang X., 2001, SPOKEN LANGUAGE PROC KANEDERA N, 1998, ACOUST SPEECH SIG PR, P613 Kim DS, 2004, IEEE SIGNAL PROC LET, V11, P849, DOI 10.1109/LSP.2004.835466 Kim DS, 2005, IEEE T SPEECH AUDI P, V13, P821, DOI 10.1109/TSA.2005.851924 Kingsbury BED, 1998, SPEECH COMMUN, V25, P117, DOI 10.1016/S0167-6393(98)00032-6 Liu L, 1997, SPEECH COMMUN, V22, P403, DOI 10.1016/S0167-6393(97)00054-X Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst Lyons JG, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P387 OPPENHEIM AV, 1981, P IEEE, V69, P529, DOI 10.1109/PROC.1981.12022 PALIWAL K, 2010, SPL101 GRIFF U Paliwal K, 2008, IEEE SIGNAL PROC LET, V15, P785, DOI 10.1109/LSP.2008.2005755 Paliwal K, 2010, SPEECH COMMUN, V52, P450, DOI 10.1016/j.specom.2010.02.004 Paliwal KK, 2005, SPEECH COMMUN, V45, P153, DOI 10.1016/j.specom.2004.08.001 Payton KL, 1999, J ACOUST SOC AM, V106, P3637, DOI 10.1121/1.428216 PICONE JW, 1993, P IEEE, V81, P1215, DOI 10.1109/5.237532 Quackenbush S. R., 1988, OBJECTIVE MEASURES S Quatieri T. F., 2002, DISCRETE TIME SPEECH Rix A., 2001, P862 ITUT SCHROEDER MR, 1975, P IEEE, V63, P1332, DOI 10.1109/PROC.1975.9941 STEENEKEN HJM, 1980, J ACOUST SOC AM, V67, P318, DOI 10.1121/1.384464 Thompson JK, 2003, P IEEE INT C AC SPEE, V5, P397 Tyagi V., 2003, P ISCA EUR C SPEECH, P981 WANG DL, 1982, IEEE T ACOUST SPEECH, V30, P679, DOI 10.1109/TASSP.1982.1163920 WOJCICKI K, 2007, P IEEE INT C AC SPEE, V4, P729 Wu S., 2009, INT C DIG SIGN PROC NR 39 TC 9 Z9 9 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2011 VL 53 IS 3 BP 327 EP 339 DI 10.1016/j.specom.2010.10.004 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 724FK UT WOS:000287561300005 ER PT J AU Ma, JF Loizou, PC AF Ma, Jianfen Loizou, Philipos C. 
TI SNR loss: A new objective measure for predicting the intelligibility of noise-suppressed speech SO SPEECH COMMUNICATION LA English DT Article DE Speech intelligibility; Speech enhancement; Speech intelligibility indices ID SPECTRAL AMPLITUDE ESTIMATOR; RECEPTION THRESHOLD; VECTOR QUANTIZATION; FLUCTUATING NOISE; SUBSPACE APPROACH; ENHANCEMENT; REDUCTION; INDEX; ALGORITHMS; PARAMETERS AB Most of the existing intelligibility measures do not account for the distortions present in processed speech, such as those introduced by speech-enhancement algorithms. In the present study, we propose three new objective measures that can be used for prediction of intelligibility of processed (e.g., via an enhancement algorithm) speech in noisy conditions. All three measures use a critical-band spectral representation of the clean and noise-suppressed signals and are based on the measurement of the SNR loss incurred in each critical band after the corrupted signal goes through a speech enhancement algorithm. The proposed measures are flexible in that they can provide different weights to the two types of spectral distortions introduced by enhancement algorithms, namely spectral attenuation and spectral amplification distortions. The proposed measures were evaluated with intelligibility scores obtained by normal-hearing listeners in 72 noisy conditions involving noise-suppressed speech (consonants and sentences) corrupted by four different maskers (car, babble, train and street interferences). Highest correlation (r = -0.85) with sentence recognition scores was obtained using a variant of the SNR loss measure that only included vowel/consonant transitions and weak consonant information. High correlation was maintained for all noise types, with a maximum correlation (r = -0.88) achieved in street noise conditions. (C) 2010 Elsevier B.V. All rights reserved. C1 [Ma, Jianfen; Loizou, Philipos C.] Univ Texas Dallas, Dept Elect Engn, Richardson, TX 75083 USA. [Ma, Jianfen] Taiyuan Univ Technol, Taiyuan 030024, Shanxi, Peoples R China. RP Loizou, PC (reprint author), Univ Texas Dallas, Dept Elect Engn, POB 830688,EC 33, Richardson, TX 75083 USA. 
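The band-wise idea behind an SNR-loss style measure can be illustrated schematically: per frame and band, compare the SNR available in the corrupted input with the SNR that remains between the clean signal and the enhanced output, and average the difference. The sketch below is a simplified illustration only; it uses rectangular FFT bands and omits the band weighting, range limiting and segment selection of the measures proposed in the paper.

```python
import numpy as np

def band_powers(x, band_edges, fs, nfft=512):
    """Per-band power from an FFT magnitude spectrum (rectangular bands)."""
    spec = np.abs(np.fft.rfft(x * np.hanning(len(x)), nfft)) ** 2
    freqs = np.fft.rfftfreq(nfft, 1.0 / fs)
    return np.array([spec[(freqs >= lo) & (freqs < hi)].sum()
                     for lo, hi in band_edges])

def snr_loss(clean, noisy, enhanced, band_edges, fs, frame=320, eps=1e-10):
    """Average per-band SNR loss of the enhanced output relative to the input."""
    losses = []
    for i in range(0, len(clean) - frame, frame // 2):
        s = band_powers(clean[i:i + frame], band_edges, fs)
        n_in = band_powers(noisy[i:i + frame] - clean[i:i + frame], band_edges, fs)
        n_out = band_powers(enhanced[i:i + frame] - clean[i:i + frame], band_edges, fs)
        snr_in = 10.0 * np.log10((s + eps) / (n_in + eps))
        snr_out = 10.0 * np.log10((s + eps) / (n_out + eps))
        losses.append(np.mean(snr_in - snr_out))   # real measures clip and weight this
    return float(np.mean(losses))
```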
EM loizou@utdallas.edu CR Allen JB, 1994, IEEE T SPEECH AUDI P, V2, P567, DOI 10.1109/89.326615 [Anonymous], 2000, P862 ITUT ANSI, 1997, S351997 ANSI Beerends J., 2004, P WORKSH MEAS SPEECH Benesty J, 2009, SPRINGER TOP SIGN PR, V2, P1, DOI 10.1007/978-3-642-00296-0_1 Benesty J, 2008, IEEE T AUDIO SPEECH, V16, P757, DOI 10.1109/TASL.2008.919072 Berouti M., 1979, P IEEE INT C AC SPEE, P208 Chen JD, 2006, IEEE T AUDIO SPEECH, V14, P1218, DOI 10.1109/TSA.2005.860851 COHEN I, 2008, HDB SPEECH PROCESSIN, P873 EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 FLETCHER H, 1950, J ACOUST SOC AM, V22, P89, DOI 10.1121/1.1906605 FRENCH NR, 1947, J ACOUST SOC AM, V19, P90, DOI 10.1121/1.1916407 GREENBERG JE, 1993, J ACOUST SOC AM, V94, P3009, DOI 10.1121/1.407334 Gustafsson H, 2001, IEEE T SPEECH AUDI P, V9, P799, DOI 10.1109/89.966083 HIRSCH H, 2000, P ISCA ITRW ASR200 HOUTGAST T, 1985, J ACOUST SOC AM, V77, P1069, DOI 10.1121/1.392224 Hu Y, 2004, IEEE T SPEECH AUDI P, V12, P59, DOI 10.1109/TSA.2003.819949 Hu Y, 2007, J ACOUST SOC AM, V122, P1777, DOI 10.1121/1.2766778 Hu Y, 2003, IEEE T SPEECH AUDI P, V11, P334, DOI 10.1109/TSA.2003.814458 IEEE Subcommittee, 1969, IEEE T AUDIO ELECTRO, V3, P225, DOI DOI 10.1109/TAU.1969.1162058 Jabloun F, 2003, IEEE T SPEECH AUDI P, V11, P700, DOI 10.1109/TSA.2003.818031 Kamath S.D., 2002, IEEE INT C AC SPEECH Kates J M, 1987, J Rehabil Res Dev, V24, P271 KATES JM, 1992, J ACOUST SOC AM, V91, P2236, DOI 10.1121/1.403657 Kates JM, 2005, J ACOUST SOC AM, V117, P2224, DOI 10.1121/1.1862575 KRYTER K, 1926, J ACOUST SOC AM, V34, P1698 KRYTER KD, 1962, J ACOUST SOC AM, V34, P1689, DOI 10.1121/1.1909094 Loizou PC, 2011, IEEE T AUDIO SPEECH, V19, P47, DOI 10.1109/TASL.2010.2045180 Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst Ma JF, 2009, J ACOUST SOC AM, V125, P3387, DOI 10.1121/1.3097493 MATTILA V, 2003, P ONL WORKSH MEAS SP Nein HW, 2001, IEEE T SPEECH AUDI P, V9, P73 PAAJANEN E, 2000, IEEE SPEECH COD WORK, P23 Paliwal K, 2008, IEEE SIGNAL PROC LET, V15, P785, DOI 10.1109/LSP.2008.2005755 Paliwal KK, 1993, IEEE T SPEECH AUDI P, V1, P3, DOI 10.1109/89.221363 PAVLOVIC CV, 1987, J ACOUST SOC AM, V82, P413, DOI 10.1121/1.395442 Quackenbush S. R., 1988, OBJECTIVE MEASURES S Rhebergen KS, 2006, J ACOUST SOC AM, V120, P3988, DOI 10.1121/1.2358008 Rhebergen KS, 2005, J ACOUST SOC AM, V117, P2181, DOI 10.1121/1.1861713 Scalart P, 1996, INT CONF ACOUST SPEE, P629, DOI 10.1109/ICASSP.1996.543199 Spriet A, 2005, IEEE T SPEECH AUDI P, V13, P487, DOI 10.1109/TSA.2005.845821 STEENEKEN HJM, 1980, J ACOUST SOC AM, V67, P318, DOI 10.1121/1.384464 YOON YS, 2006, SNR LOSS HEARING IMP NR 44 TC 19 Z9 20 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2011 VL 53 IS 3 BP 340 EP 354 DI 10.1016/j.specom.2010.10.005 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 724FK UT WOS:000287561300006 ER PT J AU So, S Paliwal, KK AF So, Stephen Paliwal, Kuldip K. 
TI Suppressing the influence of additive noise on the Kalman gain for low residual noise speech enhancement SO SPEECH COMMUNICATION LA English DT Article DE Kalman filtering; Speech enhancement; Linear prediction; Dolph-Chebycher windows ID LINEAR PREDICTION; FILTER AB In this paper, we present a detailed analysis of the Kalman filter for the application of speech enhancement and identify its shortcomings when the linear predictor model parameters are estimated from speech that has been corrupted with additive noise. We show that when only noise-corrupted speech is available, the poor performance of the Kalman filter may be attributed to the presence of large values in the Kalman gain during low speech energy regions, which cause a large degree of residual noise to be present in the output. These large Kalman gain values result from poor estimates of the LPCs due to the presence of additive noise. This paper presents the analysis and application of the Kalman gain trajectory as a useful indicator of Kalman filter performance, which can be used to motivate further methods of improvement. As an example, we analyse the previously-reported application of long and overlapped tapered windows using Kalman gain trajectories to explain the reduction and smoothing of residual noise in the enhanced output. In addition, we investigate further extensions, such as Dolph-Chebychev windowing and iterative LPC estimation. This modified Kalman filter was found to have improved on the conventional and iterative versions of the Kalman filter in both objective and subjective testing. (C) 2011 Elsevier B.V. All rights reserved. C1 [So, Stephen; Paliwal, Kuldip K.] Griffith Univ, Griffith Sch Engn, Signal Proc Lab, Brisbane, Qld 4111, Australia. RP So, S (reprint author), Griffith Univ, Griffith Sch Engn, Signal Proc Lab, Brisbane, Qld 4111, Australia. EM s.so@griffith.edu.au; k.paliwal@griffith.edu.au RI So, Stephen/D-6649-2011 CR Astrom K. J., 1997, PRENTICE HALL INFORM BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 CROCHIERE RE, 1980, IEEE T ACOUST SPEECH, V28, P99, DOI 10.1109/TASSP.1980.1163353 EPHRAIM Y, 1995, IEEE T SPEECH AUDI P, V3, P251, DOI 10.1109/89.397090 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 Erkelens JS, 1997, IEEE T SPEECH AUDI P, V5, P116, DOI 10.1109/89.554773 Gabrea M, 1999, IEEE SIGNAL PROC LET, V6, P55, DOI 10.1109/97.744623 Gannot S, 1998, IEEE T SPEECH AUDI P, V6, P373, DOI 10.1109/89.701367 GIBSON JD, 1991, IEEE T SIGNAL PROCES, V39, P1732, DOI 10.1109/78.91144 Hayes M. H., 1996, STAT DIGITAL SIGNAL, V1st Haykin S., 2002, PRENTICE HALL INFORM Holmes J., 2001, SPEECH SYNTHESIS REC Hu Y, 2006, P IEEE INT C AC SPEE, V1, P153 Hu Y, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P1447 Kalman R.E, 1960, J BASIC ENG, V82, P35, DOI DOI 10.1115/1.3662552 Kay S. M., 1993, PRENTICE HALL SIGNAL, V1 Li C. 
J., 2006, THESIS AARLBORG U DE LIM JS, 1978, IEEE T ACOUST SPEECH, V26, P197 Ma N, 2006, IEEE T AUDIO SPEECH, V14, P19, DOI 10.1109/TSA.2005.858515 MAKHOUL J, 1975, P IEEE, V63, P561, DOI 10.1109/PROC.1975.9792 MEHRA MK, 1970, IEEE T AUTOMAT CONTR, V15, P175 OHYA T, 1994, IEEE 44 VEH TECHN C, P1680 Paliwal K., 1987, P IEEE INT C AC SPEE, V12, P177 Rix A., 2001, P862 ITUT So S, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P391 SORQVIST P, 1997, P IEEE INT C AC SPEE, V2, P1219 WANG T, 2002, IEEE WORKSH SPEECH C Wiener N., 1949, EXTRAPOLATION INTERP NR 28 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2011 VL 53 IS 3 BP 355 EP 378 DI 10.1016/j.specom.2010.10.006 PG 24 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 724FK UT WOS:000287561300007 ER PT J AU Krause, JC Pelley-Lopez, KA Tessler, MP AF Krause, Jean C. Pelley-Lopez, Katherine A. Tessler, Morgan P. TI A method for transcribing the manual components of Cued Speech SO SPEECH COMMUNICATION LA English DT Article DE Transcription; Coarticulation; Cue accuracy; Handshape identification; Placement identification ID RECOGNITION; PERCEPTION; FRENCH; HAND; AID AB Designed to allow visual communication of speech signals, Cued Speech consists of discrete hand signals that are produced in synchrony with the visual mouth movements of speech. The purpose of this paper is to describe a method for transcribing these hand signals. Procedures are presented for identifying (1) the steady-state portion of the cue to be analyzed, (2) the cue's handshape, and (3) the cue's placement. Reliability is evaluated, using materials from 12 cuers that were transcribed on two separate occasions (either by the original rater or a second rater). Results show very good intra-rater and inter-rater reliability on average, which remained good across a variety of individual cuers, even when the cuer's hand gestures were heavily coarticulated. Given its high reliability, this transcription method may be of benefit to applications that require systematic and quantitative analysis of Cued Speech production in various populations. In addition, some of the transcription principles from this method may be helpful in improving accuracy of automatic Cued Speech recognition systems. (C) 2010 Elsevier B.V. All rights reserved. C1 [Krause, Jean C.; Pelley-Lopez, Katherine A.; Tessler, Morgan P.] Univ S Florida, Dept Commun Sci & Disorders, Tampa, FL 33620 USA. RP Krause, JC (reprint author), Univ S Florida, Dept Commun Sci & Disorders, Tampa, FL 33620 USA. EM jeankrause@usf.edu FU National Institute on Deafness and Other Communication Disorders (NIH) [5 R03 DC 007355] FX The authors wish to thank Dana Herrington and Jessica Vick for many helpful technical discussions, and Joe Frisbie for the cue chart used in Fig. 1. Financial support for this work was provided in part by a grant from the National Institute on Deafness and Other Communication Disorders (NIH Grant No. 5 R03 DC 007355). 
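Reliability figures of the kind reported for the transcription method above are commonly expressed as percent agreement or chance-corrected agreement between two transcription passes over the same cues. The snippet below is a generic computation (percent agreement plus Cohen's kappa via scikit-learn), not the scoring procedure used in the study; the example labels are invented.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def transcription_agreement(rater_a, rater_b):
    """Percent agreement and Cohen's kappa for two equal-length label sequences."""
    a, b = np.asarray(rater_a), np.asarray(rater_b)
    percent = 100.0 * float(np.mean(a == b))
    return percent, cohen_kappa_score(a, b)

# Example: handshape codes (1-8) assigned to the same six cues by two raters.
print(transcription_agreement([1, 4, 4, 2, 6, 3], [1, 4, 5, 2, 6, 3]))
```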
CR Alegria J, 2005, J DEAF STUD DEAF EDU, V10, P122, DOI 10.1093/deafed/eni013 Attina V, 2004, SPEECH COMMUN, V44, P197, DOI 10.1016/j.specom.2004.10.013 Auer ET, 2007, J SPEECH LANG HEAR R, V50, P1157, DOI 10.1044/1092-4388(2007/080) CORNETT RO, 1967, AM ANN DEAF, V112, P3 CORNETT RO, 2001, CUED SPEECH RESOURCE CORNETT RO, 1977, PROCESS AIDS DEAF, P224 *CUED SPEECH ASS U, 2009, WRIT CUES CUE SCRIPT Duchnowski P, 2000, IEEE T BIO-MED ENG, V47, P487, DOI 10.1109/10.828148 EBRAHIMI D, 1991, IEEE T BIOMED ENG, V38, P44 ERBER NP, 1975, J SPEECH HEAR DISORD, V40, P481 *FILMS HUM, 1989, LIF CYCL PLANTS Gibert G, 2005, J ACOUST SOC AM, V118, P1144, DOI 10.1121/1.1944587 Grant KW, 1998, J ACOUST SOC AM, V103, P2677, DOI 10.1121/1.422788 Heracleous P, 2009, IEEE SIGNAL PROC LET, V16, P339, DOI 10.1109/LSP.2009.2016011 Krause JC, 2008, J DEAF STUD DEAF EDU, V13, P432, DOI 10.1093/deafed/enm059 Leybaert J., 2003, OXFORD HDB DEAF STUD, P261 Leybaert J, 2001, J SPEECH LANG HEAR R, V44, P949, DOI 10.1044/1092-4388(2001/074) MASSARO DW, 2009, CUED SPEECH CUED LAN, pCH20 *NAT CUED SPEECH A, 1994, CUED SPEECH J, V5, P73 NICHOLLS GH, 1982, J SPEECH HEAR RES, V25, P262 UCHANSKI RM, 1994, J REHABIL RES DEV, V31, P20 UPTON HW, 1968, AM ANN DEAF, V113, P222 NR 22 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2011 VL 53 IS 3 BP 379 EP 389 DI 10.1016/j.specom.2010.11.001 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 724FK UT WOS:000287561300008 ER PT J AU Kocinski, J Libiszewski, P Sek, A AF Kocinski, Jedrzej Libiszewski, Pawel Sek, Aleksander TI Spatial efficiency of blind source separation based on decorrelation - subjective and objective assessment SO SPEECH COMMUNICATION LA English DT Article DE Blind source separation; Beamforming; Speech intelligibility; Speech enhancement ID FREQUENCY-DOMAIN; CONVOLUTIVE MIXTURES; SPEECH MIXTURES; NOISE-REDUCTION; INTELLIGIBILITY AB Blind source separation (BSS) method is one of the newest multisensorial methods that exploits statistical properties of simultaneously recorded independent signals to separate them out. The objective of this method is similar to that of beamforming, namely a set of spatial filters that separate source signals are calculated. Thus, it seems to be reasonable to investigate the spatial efficiency of BSS that is reported in this study. A dummy head with two microphones was used to record two signals in an anechoic chamber: target speech and babble noise in different spatial configurations. Then the speech reception thresholds (SRTs, i.e. signal-to-noise ratio, SNR yielding 50% speech intelligibility) before and after BSS algorithm (Parra and Spence, 2000) were determined for audiologically normal subjects. A significant speech intelligibility improvement was noticed after the BSS was applied. This happened in most cases when the target and masker sources were spatially separated. Moreover, the comparison of objective (SNR enhancement) and subjective (intelligibility improvement) assessment methods is reported here. It must be emphasized that these measures give different results. (C) 2010 Elsevier B.V. All rights reserved. C1 [Kocinski, Jedrzej; Libiszewski, Pawel; Sek, Aleksander] Adam Mickiewicz Univ, Inst Acoust, PL-61614 Poznan, Poland. RP Kocinski, J (reprint author), Adam Mickiewicz Univ, Inst Acoust, 85 Umultowska Str, PL-61614 Poznan, Poland. 
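Speech reception thresholds of the kind measured above are typically tracked with a simple adaptive (one-up/one-down) procedure that converges on the SNR giving about 50% sentence intelligibility. The sketch below shows only that generic measurement logic, with the presentation and scoring step left as a placeholder callback; it is not the exact procedure of the study.

```python
def track_srt(present_sentence, start_snr=0.0, step=2.0, n_sentences=20):
    """One-up/one-down adaptive track; returns the mean SNR of the later trials
    as an SRT estimate (the SNR yielding roughly 50% intelligibility).

    present_sentence(snr) is a placeholder callback: it plays one sentence at the
    given SNR and returns True if the listener repeated it correctly.
    """
    snr, history = start_snr, []
    for _ in range(n_sentences):
        correct = present_sentence(snr)
        history.append(snr)
        snr += -step if correct else step   # harder after a hit, easier after a miss
    # Illustrative assumption: discard the first four sentences before averaging.
    return sum(history[4:]) / len(history[4:])
```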
EM jedrzej.kocinski@amu.edu.pl FU Polish-Norwegian Research Fund; European Union FX This work was supported by Polish-Norwegian Research Fund and FP6 of the European Union 'HearCom'. The authors would like to thank two anonymous reviewers for useful comments and remarks on the earlier version of this manuscript. CR ANEMULLER J, 2000, ICA 2000, P215 Araki S, 2003, EURASIP J APPL SIG P, V2003, P1157, DOI 10.1155/S1110865703305074 Berouti M, 1979, IEEE INT C AC SPEECH, V4, P208, DOI 10.1109/ICASSP.1979.1170788 BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 Brandstein M., 2001, MICROPHONE ARRAYS SI, V1st CARDOSO JF, 1989, P IC ASSP, V89, P2109 Deller J., 2000, DISCRETE TIME PROCES Douglas SC, 2003, SPEECH COMMUN, V39, P65, DOI 10.1016/S0167-6393(02)00059-6 Drgas S, 2008, ARCH ACOUST, V33, P455 Ephraim E., 1984, IEEE T ACOUST SPEECH, VASSP-32, P1109 GAUTHAM JM, 2010, 9 INT C LAT VAR AN S Hao JC, 2009, IEEE T AUDIO SPEECH, V17, P24, DOI 10.1109/TASL.2008.2005342 HARMELING S, 2001, CONVBSS Hyvarinen A, 2001, INDEPENDENT COMPONEN Johnson D, 1993, ARRAY SIGNAL PROCESS KAJALA M, 2001, IEEE INT C AC SPEECH, V5, P2917 KITAWAKI N, 2007, ETSI WORKSH SPEECH Kocinski J., 2005, Archives of Acoustics, V30 Kocinski J, 2008, SPEECH COMMUN, V50, P29, DOI 10.1016/j.specom.2007.06.003 Kokkinakis K, 2008, J ACOUST SOC AM, V123, P2379, DOI 10.1121/1.2839887 Lee J. K., 2003, 32 INT C EXP NOIS CO LEVITT H, 1971, J ACOUST SOC AM, V49, P467, DOI 10.1121/1.1912375 Libiszewski P, 2007, ARCH ACOUST, V32, P337 Makino S, 2005, IEICE T FUND ELECTR, VE88A, P1640, DOI 10.1093/ietfec/e88-a.7.1640 MATSUOKA K, 1995, NEURAL NETWORKS, V8, P411, DOI 10.1016/0893-6080(94)00083-X Moore BC., 2003, INTRO PSYCHOL HEARIN MUKAI R, 2004, ISCAS 2004 Ozimek E., 2006, ARCH ACOUST, V31, P431 Ozimek E, 2009, INT J AUDIOL, V48, P433, DOI 10.1080/14992020902725521 Parra L, 2000, IEEE T SPEECH AUDI P, V8, P320, DOI 10.1109/89.841214 Parra LC, 2006, J ACOUST SOC AM, V119, P3839, DOI 10.1121/1.2197606 Peissig J, 1997, J ACOUST SOC AM, V101, P1660, DOI 10.1121/1.418150 Pham D.-T., 2003, ICA 2003 NAR JAP SARUWATARI H, 2003, 4 INT S IND COMP AN Sawada H., 2005, SPEECH ENHANCEMENT Scalart P., 1996, IEEE INT C AC SPEECH, V1, P629 Shinn-Cunningham BG, 2001, J ACOUST SOC AM, V110, P1118, DOI 10.1121/1.1386633 Smaragdis P, 1998, NEUROCOMPUTING, V22, P21, DOI 10.1016/S0925-2312(98)00047-2 SMARAGDIS P, 1997, INFORM THEORETIC APP WILSON KW, 2008, ICASSP LAS VEG NEV U Yilmaz O, 2004, IEEE T SIGNAL PROCES, V52, P1830, DOI [10.1109/TSP.2004.828896, 10.1109/TSP.2004.V8896] Yu T, 2009, INT CONF ACOUST SPEE, P213 Zhou Y, 2003, SIGNAL PROCESS, V83, P2037, DOI 10.1016/S0165-1684(03)00134-8 NR 43 TC 2 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2011 VL 53 IS 3 BP 390 EP 402 DI 10.1016/j.specom.2010.11.002 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 724FK UT WOS:000287561300009 ER PT J AU Stark, A Paliwal, K AF Stark, Anthony Paliwal, Kuldip TI MMSE estimation of log-filterbank energies for robust speech recognition SO SPEECH COMMUNICATION LA English DT Article DE Robust speech recognition; MMSE estimation; Speech enhancement methods ID SPECTRAL AMPLITUDE ESTIMATOR; NOISE SUPPRESSION FILTER; ENHANCEMENT AB In this paper, we derive a minimum mean square error log-filterbank energy estimator for environment-robust automatic speech recognition. 
While several such estimators exist within the literature, most involve trade-offs between simplifications of the log-filterbank noise distortion model and analytical tractability. To avoid this limitation, we extend a well known spectral domain noise distortion model for use in the log-filterbank energy domain. To do this, several mathematical transformations are developed to transform spectral domain models into filterbank and log-filterbank energy models. As a result, a new estimator is developed that allows for robust estimation of both log-filterbank energies and subsequent Mel-frequency cepstral coefficients. The proposed estimator is evaluated over the Aurora2, and RM speech recognition tasks, with results showing a significant reduction in word recognition error over both baseline results and several competing estimators. (C) 2010 Elsevier B.V. All rights reserved. C1 [Stark, Anthony; Paliwal, Kuldip] Griffith Univ, Signal Proc Lab, Brisbane, Qld 4111, Australia. RP Paliwal, K (reprint author), Griffith Univ, Signal Proc Lab, Nathan Campus, Brisbane, Qld 4111, Australia. EM a.stark@griffith.edu.au; k.paliwal@griffith.edu.au CR ACERO A, 2000, P INT Barker J., 2000, P ICSLP BEIJ CHIN, P373 Cohen I., 2002, SIGNAL PROCESSING LE, V9, P113 Cooke M, 2001, SPEECH COMMUN, V34, P267, DOI 10.1016/S0167-6393(00)00034-0 DAVIS S, 1990, READINGS SPEECH RECO Deng L., 2000, P ICSLP EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550 EPHRAIM Y, 1991, IEEE T SIGNAL PROCES, V39, P795 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 Erell A, 1993, IEEE T SPEECH AUDI P, V1, P84, DOI 10.1109/89.221370 FUJIMOTO M, 2000, IEEE INT C AC SPEECH, V3, P1727 GALES L, 1995, THESIS U CAMBRIDGE U Gemello R, 2006, IEEE SIGNAL PROC LET, V13, P56, DOI 10.1109/LSP.2005.860535 Gillick L., 1989, ICASSP-89: 1989 International Conference on Acoustics, Speech and Signal Processing (IEEE Cat. No.89CH2673-2), DOI 10.1109/ICASSP.1989.266481 Gradshteyn I. S., 2007, TABLE INTEGRALS SERI, V7th HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423 Hermus K, 2007, EURASIP J ADV SIG PR, DOI 10.1155/2007/45821 Indrebo KM, 2008, IEEE T AUDIO SPEECH, V16, P1654, DOI 10.1109/TASL.2008.2002083 Lathoud G., 2005, P 2005 IEEE ASRU WOR, P343 Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst MALAH D, 1999, ACOUST SPEECH SIG PR, P789 MCAULAY RJ, 1980, IEEE T ACOUST SPEECH, V28, P137, DOI 10.1109/TASSP.1980.1163394 Moreno P.J., 1996, THESIS CARNEGIE MELL Moreno PJ, 1996, INT CONF ACOUST SPEE, P733, DOI 10.1109/ICASSP.1996.543225 Pearce D., 2000, ISCA ITRW ASR2000, P29 PRICE P, 1988, IEEE ICASSP 88 NEW Y, V1, P651 Rabiner L.R., 1978, DIGITAL PROCESSING S Raj B, 2005, IEEE SIGNAL PROC MAG, V22, P101, DOI 10.1109/MSP.2005.1511828 Soon IY, 1999, SIGNAL PROCESS, V75, P151, DOI 10.1016/S0165-1684(98)00230-8 SPOUGE JL, 1994, SIAM J NUMER ANAL, V31, P931, DOI 10.1137/0731050 Stouten V., 2006, THESIS KATHOLIEKE U Young S., 2000, HTK BOOK VERSION 3 0 Yu D, 2008, IEEE T AUDIO SPEECH, V16, P1061, DOI 10.1109/TASL.2008.921761 NR 33 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD MAR PY 2011 VL 53 IS 3 BP 403 EP 416 DI 10.1016/j.specom.2010.11.004 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 724FK UT WOS:000287561300010 ER PT J AU Hoonhorst, I Medina, V Colin, C Markessis, E Radeau, M Deltenre, P Serniclaes, W AF Hoonhorst, I. Medina, V. Colin, C. Markessis, E. Radeau, M. Deltenre, P. Serniclaes, W. TI Categorical perception of voicing, colors and facial expressions: A developmental study SO SPEECH COMMUNICATION LA English DT Article DE Categorical perception; Boundary precision; Development; Voice onset time; Colors; Facial expressions ID SPEECH-PERCEPTION; CROSS-LANGUAGE; ONSET TIME; CV SYLLABLES; DISCRIMINATION; CHILDREN; INFANTS; SOUNDS; ADULTS; COARTICULATION AB The aim of the present paper was to compare the development of perceptual categorization of voicing, colors and facial expressions in French-speaking children (from 6 to 8 years) and adults. Differences in both categorical perception, i.e. the correspondence between identification and discrimination performances, and in boundary precision, indexed by the steepness of the identification slope, were investigated. Whereas there was no significant effect of age on categorical perception, boundary precision increased with age, both for voicing and facial expressions though not for colors. These results suggest that the development of boundary precision arises from a general cognitive maturation across different perceptual domains. However, this is not without domain specific effects since we found (1) a correlation between the development of voicing perception and some reading performances and (2) an earlier maturation of boundary precision for colors compared to voicing and facial expressions. These comparative data indicate that whereas general cognitive maturation has some influence on the development of perceptual categorization, this is not without domain-specific effects, the structural complexity of the categories being one of them. (C) 2010 Elsevier B.V. All rights reserved. C1 [Hoonhorst, I.; Colin, C.; Radeau, M.; Deltenre, P.] ULB, UNESCOG, B-1050 Brussels, Belgium. [Medina, V.] Univ Paris 07, UFR, F-75013 Paris, France. [Medina, V.; Serniclaes, W.] CNRS, LPP, F-75006 Paris, France. [Medina, V.; Serniclaes, W.] Univ Paris 05, F-75006 Paris, France. [Markessis, E.] ULB, Fac Med, B-1070 Brussels, Belgium. [Deltenre, P.] CHU Brugmann, Clin Neurophysiol, B-1020 Brussels, Belgium. RP Hoonhorst, I (reprint author), ULB, UNESCOG, 50 Ave Franklin Roosevelt CP 191, B-1050 Brussels, Belgium. EM ihoonhor@ulb.ac.be; medina_vicky@yahoo.fr; ccolin@ulb.ac.be; emarkess@ulb.ac.be; moradeau@ulb.ac.be; paul.deltenre@chu-brugmann.be; willy.serniclaes@parisdescartes.fr FU Belgian National Fund for Scientific Research (FNRS); U.L.B; Loicq Foundation; Van Goethem-Brichant Foundation; Brugmann Foundation; Belgian Kids Foundation; ANR (France) [ANR-07-BLAN-0014-01] FX This work was supported financially from funds given to I. Hoonhorst by the Belgian National Fund for Scientific Research (FNRS); to M. Radeau by an FER grant from U.L.B.; to P. Deltenre by the Loicq Foundation, the Van Goethem-Brichant Foundation and the Brugmann Foundation; to E. Markessis by the Belgian Kids Foundation; and to W. Serniclaes by the ANR Program PBELA ANR-07-BLAN-0014-01 (France). The authors are grateful to R. Carre (Dynamique du Langage Lab., CNRS- Lyon 2 University) for providing the speech synthesis software; to C. Van Nechel (Brugmann Hospital, Brussels), M. 
Vanhaelen (U.C.L., Louvain-la-Neuve) and R. Bruyer (U.C.L., Louvain-la-Neuve) for their help in the creation of color and facial expression stimuli; to Mr. Chaussard, schools inspector and to S. Lacourthiade and C. Moreul, headmistresses of the primary school of Capens and Marquefave, where the main part of the study took place. CR ASLIN RN, 1981, CHILD DEV, V52, P1135, DOI 10.1111/j.1467-8624.1981.tb03159.x BEALE JM, 1995, COGNITION, V57, P217, DOI 10.1016/0010-0277(95)00669-X BEAUPRE M, 2005, MONTREAL SET FACIAL Beddor PS, 2002, J PHONETICS, V30, P591, DOI 10.1006/jpho.2002.0177 Bogliotti C, 2008, J EXP CHILD PSYCHOL, V101, P137, DOI 10.1016/j.jecp.2008.03.006 BORNSTEIN MH, 1976, SCIENCE, V191, P201, DOI 10.1126/science.1246610 Bruyer R, 2007, EUR REV APPL PSYCHOL, V57, P37, DOI 10.1016/j.erap.2006.02.001 Burnham D., 2003, READ WRIT, V16, P573, DOI DOI 10.1023/A:1025593911070 BURNHAM DK, 1991, J CHILD LANG, V18, P231 Campanella S, 2001, VIS COGN, V8, P237 Damper RI, 2000, PERCEPT PSYCHOPHYS, V62, P843, DOI 10.3758/BF03206927 DARCY I, 2007, PAPERS LAB PHONOLOGY, V9 Dunn L. M., 1981, PEABODY PICTURE VOCA Dunn L. M., 1993, ECHELLE VOCABULAIRE EKMAN P, 1992, COGNITION EMOTION, V6, P169, DOI 10.1080/02699939208411068 ELLIOTT LL, 1981, J ACOUST SOC AM, V70, P669, DOI 10.1121/1.386929 ELLIOTT LL, 1986, CHILD DEV, V57, P628 Finney D. J., 1971, PROBIT ANAL, V3rd Franklin A, 2004, BRIT J DEV PSYCHOL, V22, P349, DOI 10.1348/0261510041552738 Franklin A, 2005, J EXP CHILD PSYCHOL, V90, P114, DOI 10.1016/j.jecp.2004.10.001 Gao XQ, 2009, J EXP CHILD PSYCHOL, V102, P503, DOI 10.1016/j.jecp.2008.11.002 Hazan V, 2000, J PHONETICS, V28, P377, DOI 10.1006/jpho.2000.0121 Hoonhorst I, 2009, J EXP CHILD PSYCHOL, V104, P353, DOI 10.1016/j.jecp.2009.07.005 Hoonhorst I, 2009, CLIN NEUROPHYSIOL, V120, P897, DOI 10.1016/j.clinph.2009.02.174 KLATT DH, 1980, J ACOUST SOC AM, V67, P971, DOI 10.1121/1.383940 Kotsoni E, 2001, PERCEPTION, V30, P1115, DOI 10.1068/p3155 Kraljic T, 2006, PSYCHON B REV, V13, P262, DOI 10.3758/BF03193841 KRAUSE SE, 1982, J ACOUST SOC AM, V71, P990, DOI 10.1121/1.387580 Lalonde CE, 1995, INFANT BEHAV DEV, V18, P459, DOI 10.1016/0163-6383(95)90035-7 LIBERMAN AM, 1957, J EXP PSYCHOL, V54, P358, DOI 10.1037/h0044417 MacMillan N. 
A., 2005, DETECTION THEORY USE Martin-Malivel J, 2007, BEHAV NEUROSCI, V121, P1145, DOI 10.1037/0735-7044.121.6.1145 McCullagh P, 1983, GEN LINEAR MODELS Medina V, 2010, J PHONETICS, V38, P493, DOI 10.1016/j.wocn.2010.06.002 MEDINA V, 2009, 10 ISCA C INT BRIGTH Mitterer H, 2008, PSYCHOL SCI, V19, P629, DOI 10.1111/j.1467-9280.2008.02133.x Mitterer H, 2006, PERCEPT PSYCHOPHYS, V68, P1227, DOI 10.3758/BF03193723 Mondloch CJ, 2002, PERCEPTION, V31, P553, DOI 10.1068/p3339 NEAREY TM, 1990, J PHONETICS, V18, P347 POLLACK I, 1971, PSYCHON SCI, V24, P299 RASKIN LA, 1983, PSYCHOL RES-PSYCH FO, V45, P135, DOI 10.1007/BF00308665 Roberson D, 2005, BEHAV BRAIN SCI, V28, P505 SANDELL JH, 1979, J COMP PHYSIOL PSYCH, V93, P626, DOI 10.1037/h0077594 Schouten B, 2003, SPEECH COMMUN, V41, P71, DOI 10.1016/S0167-6393(02)00094-8 Schwarzer G, 2000, CHILD DEV, V71, P391, DOI 10.1111/1467-8624.00152 Serniclaes W, 2005, COGNITION, V98, pB35, DOI 10.1016/j.cognition.2005.03.002 Serniclaes W., 1987, THESIS U LIBRE BRUXE SIMON C, 1978, J ACOUST SOC AM, V63, P925, DOI 10.1121/1.381772 SNOWDON CT, 1987, CATEGORICAL PERCEPTI, P332 STEVENS KN, 1974, J ACOUST SOC AM, V55, P653, DOI 10.1121/1.1914578 STREETER LA, 1976, J ACOUST SOC AM, V59, P448, DOI 10.1121/1.380864 SUMMERFIELD Q, 1975, SPEECH PERCEPTION SE, V2 von Frisch K, 1964, BEES THEIR VISION CH WERKER JF, 1985, PERCEPT PSYCHOPHYS, V37, P35, DOI 10.3758/BF03207136 WOOD CC, 1976, J ACOUST SOC AM, V60, P1381, DOI 10.1121/1.381231 WRIGHT AA, 1972, VISION RES, V12, P1447, DOI 10.1016/0042-6989(72)90171-X ZLATIN MA, 1975, J SPEECH HEAR RES, V18, P541 NR 57 TC 8 Z9 9 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2011 VL 53 IS 3 BP 417 EP 430 DI 10.1016/j.specom.2010.11.005 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 724FK UT WOS:000287561300011 ER PT J AU Patel, R McNab, C AF Patel, Rupal McNab, Catherine TI Displaying prosodic text to enhance expressive oral reading SO SPEECH COMMUNICATION LA English DT Article DE Children; Oral reading; Prosody; Reading software; Expressive reading ID VOCAL FUNDAMENTAL-FREQUENCY; 11-YEAR-OLD CHILDREN; CONTRASTIVE STRESS; POOR READERS; SIMPLE VIEW; FLUENCY; COMPREHENSION; INTONATION; LANGUAGE; CRIES AB This study assessed the effectiveness of software designed to facilitate expressive oral reading through text manipulations that convey prosody. The software presented stories in standard (S) and manipulated formats corresponding to variations in fundamental frequency (F), intensity (I), duration (D), and combined cues (C) indicating modulation of pitch, loudness and length, respectively. Ten early readers (mean age = 7.6 years) attended three sessions. During the first session, children read two stories in standard format to establish a baseline. The second session provided training and practice in the manipulated formats. In the third, post-training session, sections of each story were read in each condition (S, F, I, D, C in random order). Recordings were acoustically examined for changes in word duration, peak intensity and peak F0 from baseline to post-training. When provided with pitch cues (F), children increased utterance-wide peak F0 range (mean = 34.5 Hz) and absolute peak F0 for accented words. Pitch cues were more effective in isolation (F) than in combination (C). 
Although Condition I elicited increased intensity of salient words, Conditions S and D had minimal impact on prosodic variation. Findings suggest that textual manipulations conveying prosody can be readily learned by children to improve reading expressivity. (C) 2010 Elsevier B.V. All rights reserved. C1 [Patel, Rupal; McNab, Catherine] Northeastern Univ, Dept Speech Language Pathol & Audiol, Boston, MA 02115 USA. RP Patel, R (reprint author), Northeastern Univ, Dept Speech Language Pathol & Audiol, 360 Huntington Ave,102 Forsyth Bldg, Boston, MA 02115 USA. EM r.patel@neu.edu FU American Speech and Hearing Association; Northeastern University; National Science Foundation [IIS-0915527] FX This study was conducted in the Communication Analysis and Design Laboratory in the Department of Speech Language Pathology and Audiology at Northeastern University. This work was supported in part by funding from the American Speech and Hearing Association SPARC award (Students Preparing for Academic and Research Careers), the Northeastern University Provost Award, and the National Science Foundation (Grant IIS-0915527). The authors are grateful to Kevin Reilly, Michael Epstein and Timothy Mills for their time, effort, and suggestions and to Ghadeer Rahhal, who was instrumental in implementing the ReadN'Karaoke program and supplemental acoustic analysis software. Last, the authors thank the children and families who participated for their time, effort, and enthusiasm. CR Aylett M, 2006, J ACOUST SOC AM, V119, P3048, DOI 10.1121/1.2188331 BATES E, 1976, J CHILD LANG, V1, P227 Blevins W., 2001, BUILDING FLUENCY LES BOERSMA P, 2007, SYSTEM DOING PHONETI BOLINGER DL, 1961, LANGUAGE, V37, P83, DOI 10.2307/411252 Bolinger D., 1989, INTONATION ITS USES BREWSTER K, 1989, LINGUISTICS CLIN PRA, P186 CARVER RP, 1993, J READING BEHAV, V25, P439 Chafe W., 1988, WRIT COMMUN, V5, P396, DOI DOI 10.1177/0741088388005004001 Cooper W. E., 1980, SYNTAX SPEECH COOPER WE, 1985, J ACOUST SOC AM, V77, P2142, DOI 10.1121/1.392372 Cowie R, 2002, LANG SPEECH, V45, P47 Cromer W, 1970, J Educ Psychol, V61, P471, DOI 10.1037/h0030288 CRUTTENDEN A, 1985, J CHILD LANG, V12, P643 Crystal D., 1979, LANG ACQUIS, p[33, 174] Cutler A, 1997, LANG SPEECH, V40, P141 CUTLER A, 1987, J CHILD LANG, V14, P145 DOWHOWER SL, 1987, READING RES Q, P390 FRY DB, 1955, J ACOUST SOC AM, V27, P765, DOI 10.1121/1.1908022 FRY DB, 1958, LANG SPEECH, V1, P126 Fuchs L. S., 2001, SCI STUD READ, V5, P239, DOI DOI 10.1207/S1532799XSSR0503_3 FURROW D, 1984, J CHILD LANG, V11, P203 Gibson E. J., 1975, PSYCHOL READING Gilbert HR, 1996, INT J PEDIATR OTORHI, V34, P237, DOI 10.1016/0165-5876(95)01273-7 Grigos MI, 2007, J SPEECH LANG HEAR R, V50, P119, DOI 10.1044/1092-4388(2007/010) HERMAN PA, 1985, READING RES Q, V20, P535 HOOVER WA, 1990, READ WRIT, V2, P127, DOI 10.1007/BF00401799 Hudson RF, 2005, READ TEACH, V58, P702, DOI 10.1598/RT.58.8.1 Kent R., 1997, SPEECH SCI Kuhn MR, 2003, J EDUC PSYCHOL, V95, P3, DOI 10.1037/0022-0663.95.1.3 LABERGE D, 1974, COGNITIVE PSYCHOL, V6, P293, DOI 10.1016/0010-0285(74)90015-2 Ladd DR, 2008, CAMB STUD LINGUIST, V79, P1 Lehiste I., 1970, SUPRASEGMENTALS LeVasseur VM, 2006, APPL PSYCHOLINGUIST, V27, P423, DOI 10.1017/S0142716406060346 Lind K, 2002, INT J PEDIATR OTORHI, V64, P97, DOI 10.1016/S0165-5876(02)00024-1 Locke J. 
L., 1993, CHILDS PATH SPOKEN L Miller J, 2006, J EDUC PSYCHOL, V98, P839, DOI 10.1037/0022-0663.98.4.839 Morgan JL, 1996, SIGNAL TO SYNTAX: BOOTSTRAPPING FROM SPEECH TO GRAMMAR IN EARLY ACQUISITION, P1 MORRIS D, 2002, EVERY CHILD READING *NAEP, 1995, LIST CHILD READ AL O National Institute of Child Health and Human Development, 2000, NIH PUBL, V00-4769 OSHEA LJ, 1983, READ RES QUART, V18, P458, DOI 10.2307/747380 Patel R, 2006, SPEECH COMMUN, V48, P1308, DOI 10.1016/j.specom.2006.06.007 Patel R, 2009, J SPEECH LANG HEAR R, V52, P790, DOI 10.1044/1092-4388(2008/07-0137) Pinnell G., 1995, LISTENING CHILDREN R Protopapas A, 1997, J ACOUST SOC AM, V102, P3723, DOI 10.1121/1.420403 Rasinski T. V., 2003, FLUENT READER ORAL R RASINSKI TV, 1990, J EDUC RES, V83, P147 *READ NAT, READ NAT STRAT *READ THEAT, READ THEAT SCRIPTS P SAMUELS SJ, 1988, READ TEACH, V41, P756 Schreiber P., 1982, LANG ACQUIS, P78 Schreiber P. A., 1987, COMPREHENDING ORAL W, P243 Schreiber P. A., 1991, THEOR PRACT, V30, P158, DOI [10.2307/1476877, DOI 10.1080/00405849109543496] SCHREIBER PA, 1980, J READING BEHAV, V12, P177 Schwanenflugel PJ, 2004, J EDUC PSYCHOL, V96, P119, DOI 10.1037/0022-0663.96.1.119 ShattuckHufnagel S, 1996, J PSYCHOLINGUIST RES, V25, P193, DOI 10.1007/BF01708572 Snow D, 1998, J SPEECH LANG HEAR R, V41, P576 SNOW D, 1994, J SPEECH HEAR RES, V37, P831 Stanovich K., 1996, HDB READING RES, V2, P418 Stathopoulos ET, 1997, J SPEECH LANG HEAR R, V40, P595 Titze IR, 1994, PRINCIPLES VOICE PRO Traunmuller H, 2000, J ACOUST SOC AM, V107, P3438, DOI 10.1121/1.429414 Wells B, 2004, J CHILD LANG, V31, P749, DOI 10.1017/S030500090400652X WERKER JF, 1994, INFANT BEHAV DEV, V17, P323, DOI 10.1016/0163-6383(94)90012-4 Whalley K, 2006, J RES READ, V29, P288, DOI 10.1111/j.1467-9817.2006.00309.x Wiig E. H., 2004, CLIN EVALUATION LANG WILLIAMS CE, 1972, J ACOUST SOC AM, V52, P1238, DOI 10.1121/1.1913238 YOUNG A, 1995, J EXP CHILD PSYCHOL, V60, P428, DOI 10.1006/jecp.1995.1048 Young AR, 1996, APPL PSYCHOLINGUIST, V17, P59, DOI 10.1017/S0142716400009462 NR 70 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2011 VL 53 IS 3 BP 431 EP 441 DI 10.1016/j.specom.2010.11.007 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 724FK UT WOS:000287561300012 ER PT J AU Stan, A Yamagishi, J King, S Aylett, M AF Stan, Adriana Yamagishi, Junichi King, Simon Aylett, Matthew TI The Romanian speech synthesis (RSS) corpus: Building a high quality HMM-based speech synthesis system using a high sampling rate SO SPEECH COMMUNICATION LA English DT Article DE Speech synthesis; HTS; Romanian; HMMs; Sampling frequency; Auditory scale AB This paper first introduces a newly-recorded high quality Romanian speech corpus designed for speech synthesis, called "RSS", along with Romanian front-end text processing modules and HMM-based synthetic voices built from the corpus. All of these are now freely available for academic use in order to promote Romanian speech technology research. The RSS corpus comprises 3500 training sentences and 500 test sentences uttered by a female speaker and was recorded using multiple microphones at 96 kHz sampling frequency in a hemianechoic chamber. The details of the new Romanian text processor we have developed are also given. 
Using the database, we then revisit some basic configuration choices of speech synthesis, such as waveform sampling frequency and auditory frequency warping scale, with the aim of improving speaker similarity, which is an acknowledged weakness of current HMM-based speech synthesisers. As we demonstrate using perceptual tests, these configuration choices can make substantial differences to the quality of the synthetic speech. Contrary to common practice in automatic speech recognition, higher waveform sampling frequencies can offer enhanced feature extraction and improved speaker similarity for HMM-based speech synthesis. (C) 2010 Elsevier B.V. All rights reserved. C1 [Stan, Adriana] Tech Univ Cluj Napoca, Dept Commun, Cluj Napoca 400027, Romania. [Yamagishi, Junichi; King, Simon] Univ Edinburgh, Ctr Speech Technol Res, Edinburgh EH8 9AB, Midlothian, Scotland. [Aylett, Matthew] CereProc Ltd, Edinburgh EH8 9LE, Midlothian, Scotland. RP Stan, A (reprint author), Tech Univ Cluj Napoca, Dept Commun, 26-28 George Baritiu St, Cluj Napoca 400027, Romania. EM adriana.stan@com.utcluj.ro; jyamagis@staffmail.ed.ac.uk; simon.king@ed.ac.uk; matthew@cereproc.com RI Stan, Adriana /G-1257-2014 OI Stan, Adriana /0000-0003-2894-5770 FU European Social Fund [POSDRU/6/1.5/S/5]; European Community [213845]; eDIKT initiative FX Adriana Stan is funded by the European Social Fund, project POSDRU/6/1.5/S/5 and was visiting CSTR at the time of this work. Junichi Yamagishi and Simon King are partially funded by the European Community's Seventh Framework Programme (FP7/2007-2013) under Grant agreement 213845 (the EMIME project).This work has made use of the resources provided by the Edinburgh Compute and Data Facility (ECDF - http://www.ecdf.ed.ac.uk). The ECDF is partially supported by the eDIKT initiative (http://www.edikt.org.uk). CR Aylett M. P., 2007, P AISB 2007 NEWC UK, P174 Benoit C, 1996, SPEECH COMMUN, V18, P381, DOI 10.1016/0167-6393(96)00026-X Black A. 
W., 1995, P EUROSPEECH MADR SP, P581 BURILEANU D, 1999, P EUROSPEECH 99 BUD, P2063 DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1 Fant G, 2005, TEXT SPEECH LANG TEC, V24, P199 FERENCZ A, 1997, THESIS U CLUJ NAPOCA FRUNZA O, 2005, P EUROLAN 2005 WORKS Hunt AJ, 1996, INT CONF ACOUST SPEE, P373, DOI 10.1109/ICASSP.1996.541110 Karaiskos V., 2008, P BLIZZ CHALL WORKSH KAWAHARA H, 2001, 2 MAVEBAW Kawahara H, 1999, SPEECH COMMUN, V27, P187, DOI 10.1016/S0167-6393(98)00085-5 MURAOKA T, 1978, J AUDIO ENG SOC, V26, P252 Ohtani Y, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P2266 Olshen R., 1984, CLASSIFICATION REGRE, V1st PATTERSON RD, 1982, J ACOUST SOC AM, V76, P640 Shinoda K., 2000, Journal of the Acoustical Society of Japan (E), V21, DOI 10.1250/ast.21.79 Smith JO, 1999, IEEE T SPEECH AUDI P, V7, P697, DOI 10.1109/89.799695 TOKUDA K, 1991, IEICE T FUND ELECTR, V74, P1240 TOKUDA K, 1994, TECHNICAL REPORT NAG Tokuda K, 2002, IEICE T INF SYST, VE85D, P455 Tokuda K., 1994, P INT C SPOK LANG PR, P1043 Yamagishi J, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P581 Yamagishi Junichi, 2010, Proceedings 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2010, DOI 10.1109/ICASSP.2010.5495562 Yamagishi Junichi, 2008, P BLIZZ CHALL 2008 B Young SJ, 1994, P ARPA HUM LANG TECH, P307, DOI 10.3115/1075812.1075885 Zen H, 2007, P 6 ISCA WORKSH SPEE, P294 Zen H, 2007, IEICE T INF SYST, VE90D, P825, DOI 10.1093/ietisy/e90-d.5.825 Zen H, 2009, SPEECH COMMUN, V51, P1039, DOI 10.1016/j.specom.2009.04.004 Zen HG, 2007, IEICE T INF SYST, VE90D, P325, DOI 10.1093/ietisy/e90-d.1.325 ZWICKER E, 1965, PSYCH REV, V72, P2 NR 31 TC 6 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD MAR PY 2011 VL 53 IS 3 BP 442 EP 450 DI 10.1016/j.specom.2010.12.002 PG 9 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 724FK UT WOS:000287561300013 ER PT J AU Huijbregts, M de Jong, F AF Huijbregts, Marijn de Jong, Franciska TI Robust speech/non-speech classification in heterogeneous multimedia content SO SPEECH COMMUNICATION LA English DT Article DE Speech/non speech classification; Rich transcription; SHoUT toolkit ID RECOGNITION AB In this paper we present a speech/non-speech classification method that allows high quality classification without the need to know in advance what kinds of audible non-speech events are present in an audio recording and that does not require a single parameter to be tuned on in-domain data. Because no parameter tuning is needed and no training data is required to train models for specific sounds, the classifier is able to process a wide range of audio types with varying conditions and thereby contributes to the development of a more robust automatic speech recognition framework. Our speech/non-speech classification system does not attempt to classify all audible non-speech in a single run. Instead, first a bootstrap speech/silence classification is obtained using a standard speech/non-speech classifier. Next, models for speech, silence and audible non-speech are trained on the target audio using the bootstrap classification. The experiments show that the performance of the proposed system is 83% and 44% (relative) better than that of a common broadcast news speech/non-speech classifier when applied to a collection of meetings recorded with table-top microphones and a collection of Dutch television broadcasts used for TRECVID 2007. (C) 2010 Elsevier B.V. All rights reserved. C1 [Huijbregts, Marijn; de Jong, Franciska] Univ Twente, Dept Comp Sci, NL-7500 AE Enschede, Netherlands. RP Huijbregts, M (reprint author), Univ Twente, Dept Comp Sci, POB 217, NL-7500 AE Enschede, Netherlands.
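The two-pass scheme summarised in the abstract above (an untuned bootstrap speech/silence decision, followed by retraining models on the target audio itself) can be illustrated with the minimal sketch below. This is an illustrative reconstruction of the general idea, not the SHoUT toolkit implementation: the energy-based bootstrap, the MFCC features, the GMM sizes and the decision rule are all assumptions made for the sketch.

import numpy as np
from sklearn.mixture import GaussianMixture

def two_pass_speech_nonspeech(features, log_energy):
    # features: (n_frames, n_dims) acoustic features (e.g. MFCCs); log_energy: (n_frames,)
    # Pass 1: crude, parameter-free bootstrap using frame energy only.
    bootstrap_speech = log_energy > np.median(log_energy)
    # Pass 2: train speech and non-speech models on the target recording itself,
    # using the bootstrap labels, then re-classify every frame with those models.
    gmm_speech = GaussianMixture(n_components=4, random_state=0).fit(features[bootstrap_speech])
    gmm_other = GaussianMixture(n_components=4, random_state=0).fit(features[~bootstrap_speech])
    return gmm_speech.score_samples(features) > gmm_other.score_samples(features)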
FU Dutch government; EU [IST-FP6-506811, IST-FP6-027685, IST-PF6-027413] FX The work reported here was partly supported by the bsik-program MultimediaN, which is funded by the Dutch government (http://www.multimedian.nl), and the EU projects AMI (IST-FP6-506811), MESH (IST-FP6-027685), and Media Campaign (IST-PF6-027413). We would like to thank IDIAP for providing us with the speech/music benchmark files. CR Ajmera J, 2003, SPEECH COMMUN, V40, P351, DOI 10.1016/S0167-6393(02)00087-0 Anguera X., 2006, THESIS U POLITECNICA Anguera X., 2007, LECT NOTES COMPUTER, V4299 Byrne W, 2004, IEEE T SPEECH AUDI P, V12, P420, DOI 10.1109/TSA.2004.828702 Cassidy S, 2004, P NIST RT04S EV WORK Chen S., 1998, P DARPA BROADC NEWS Fiscus JG, 2006, LECT NOTES COMPUT SC, V4299, P309 Garofolo J., 2000, P RECH INF ASS ORD C Gauvain JL, 1999, P DARPA BROADC NEWS, P99 Goldman J., 2005, INT J DIGITAL LIB, V5, P287, DOI 10.1007/s00799-004-0101-0 Hain T., 1998, P DARPA BROADC NEWS, P133 Huang J, 2007, P NIST RICH TRANSCR Huijbregts M, 2001, PROSODY BASED BOUNDA HUIJBREGTS M., 2007, LECT NOTES COMPUTER Istrate D, 2006, LECT NOTES COMPUTER ITO MR, 1971, IEEE T ACOUST SPEECH, VAU19, P235, DOI 10.1109/TAU.1971.1162189 OOSTDIJK N, 2000, 2 INT C LANG RES EV, V2, P887 Pellom B., 2003, P ICASSP Rentzeperis E, 2007, LECT NOTES COMPUTER SCHWARZ G, 1978, ANN STAT, V6, P461, DOI 10.1214/aos/1176344136 Stolcke A, 2007, P NIST RICH TRANSCR van Leeuwen D, 2007, LECT NOTES COMPUTER Wolfel M, 2007, P NIST RICH TRANSCR NR 23 TC 7 Z9 7 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2011 VL 53 IS 2 BP 143 EP 153 DI 10.1016/j.specom.2010.08.008 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 699LQ UT WOS:000285665300001 ER PT J AU Krishnamoorthy, P Prasanna, SRM AF Krishnamoorthy, P. Prasanna, S. R. M.
TI Enhancement of noisy speech by temporal and spectral processing SO SPEECH COMMUNICATION LA English DT Article DE Speech enhancement; Temporal processing; Spectral processing; Temporal and spectral processing ID LINEAR PREDICTION; AMPLITUDE ESTIMATOR; SUBTRACTION METHOD; REPRESENTATION; REDUCTION; DATABASE; DOMAIN AB This paper presents a noisy speech enhancement method that combines linear prediction (LP) residual weighting in the time domain with spectral processing in the frequency domain to provide better noise suppression as well as better enhancement in the speech regions. The noisy speech is initially processed by the excitation source (LP residual) based temporal processing, which involves identifying and enhancing the excitation source based speech-specific features present at the gross and fine temporal levels. The gross level features are identified by estimating the following speech parameters: the sum of the peaks in the discrete Fourier transform (DFT) spectrum, the smoothed Hilbert envelope of the LP residual and the modulation spectrum values, all from the noisy speech signal. The fine level features are identified using the knowledge of the instants of significant excitation. A weight function is derived from the gross and fine weight functions to obtain the temporally processed speech signal. The temporally processed speech is further subjected to spectral domain processing. Spectral processing involves estimation and removal of degrading components, and also identification and enhancement of speech-specific spectral components. The proposed method is evaluated using different objective and subjective quality measures. The quality measures show that the proposed combined temporal and spectral processing method provides better enhancement compared to either temporal or spectral processing alone. (C) 2010 Elsevier B.V. All rights reserved. C1 [Prasanna, S. R. M.] Indian Inst Technol, Dept Elect & Commun Engn, Gauhati 781039, Assam, India. [Krishnamoorthy, P.] Samsung India Software Ctr, Noida 201301, India. RP Prasanna, SRM (reprint author), Indian Inst Technol, Dept Elect & Commun Engn, Gauhati 781039, Assam, India. CR ANANTHAPADMANABHA TV, 1979, IEEE T ACOUST SPEECH, V27, P309, DOI 10.1109/TASSP.1979.1163267 Berouti M., 1979, P IEEE INT C AC SPEE, P208 Munkong R, 2008, IEEE SIGNAL PROC MAG, V25, P98, DOI 10.1109/MSP.2008.918418 BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 Chang JH, 2007, PATTERN RECOGN, V40, P1123, DOI 10.1016/j.patcog.2006.07.006 Chen B, 2007, SPEECH COMMUN, V49, P134, DOI 10.1016/j.specom.2006.12.005 CHEN B, 2005, P IEEE ICASSP, V1, P1097 Deller J.
R., 1993, DISCRETE TIME PROCES DONOHO DL, 1995, IEEE T INFORM THEORY, V41, P613, DOI 10.1109/18.382009 EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 Greenberg S, 1997, INT CONF ACOUST SPEE, P1647, DOI 10.1109/ICASSP.1997.598826 Hu Y, 2007, SPEECH COMMUN, V49, P588, DOI 10.1016/j.specom.2006.12.006 Hu Y, 2006, P INT PHIL PA US Hu Y, 2008, IEEE T AUDIO SPEECH, V16, P229, DOI 10.1109/TASL.2007.911054 Jin W, 2006, SPEECH COMMUN, V48, P1349, DOI 10.1016/j.specom.2006.07.001 Kamath S., 2002, P IEEE INT C AC SPEE Kim W, 2000, IEE P-VIS IMAGE SIGN, V147, P423, DOI 10.1049/ip-vis:20000408 Krishnamoorthy P, 2009, IEEE T AUDIO SPEECH, V17, P253, DOI 10.1109/TASL.2008.2008039 Krishnamoorthy P, 2008, ADCOM: 2008 16TH INTERNATIONAL CONFERENCE ON ADVANCED COMPUTING AND COMMUNICATIONS, P112 LIM JS, 1979, P IEEE, V67, P1586, DOI 10.1109/PROC.1979.11540 Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst Lu CT, 2007, PATTERN RECOGN LETT, V28, P1300, DOI 10.1016/j.patrec.2007.03.001 MAKHOUL J, 1975, P IEEE, V63, P561, DOI 10.1109/PROC.1975.9792 Marple J., 1999, IEEE T SIGNAL PROCES, V47, P2600 Martin R, 2005, IEEE T SPEECH AUDI P, V13, P845, DOI 10.1109/TSA.2005.851927 Marzinzik M, 2002, IEEE T SPEECH AUDI P, V10, P109, DOI 10.1109/89.985548 MCAULAY RJ, 1986, IEEE T ACOUST SPEECH, V34, P744, DOI 10.1109/TASSP.1986.1164910 PRASANNA SRM, 2004, P IEEE INT C AC SPEE, V1, pI109 Prasanna SRM, 2005, IEEE P 3 INT C INT S, P140 Press W.H., 1992, NUMERICAL RECIPES C Proakis J. G., 1996, DIGITAL SIGNAL PROCE Rix AW, 2002, J AUDIO ENG SOC, V50, P755 SCHROEDE.MR, 1970, PR INST ELECTR ELECT, V58, P707, DOI 10.1109/PROC.1970.7725 Senapati S, 2008, SPEECH COMMUN, V50, P504, DOI 10.1016/j.specom.2008.03.004 Seok JW, 1999, ELECTRON LETT, V35, P123, DOI 10.1049/el:19990122 Shao Y, 2007, IEEE T SYST MAN CY B, V37, P877, DOI 10.1109/TSMCB.2007.895365 SMITS R, 1995, IEEE T SPEECH AUDI P, V3, P325, DOI 10.1109/89.466662 Sri Rama Murty K, 2007, P INT ANTW BELG, P2941 Yegnanarayana B, 2009, IEEE T AUDIO SPEECH, V17, P614, DOI 10.1109/TASL.2008.2012194 VARGA A, 1993, SPEECH COMMUN, V12, P247, DOI 10.1016/0167-6393(93)90095-3 Yamashita K, 2005, IEEE SIGNAL PROC LET, V12, P465, DOI 10.1109/LSP.2005.847864 Yang LP, 2005, J ACOUST SOC AM, V117, P1001, DOI 10.1121/1.1852873 Yegnanarayana B., 2002, P IEEE INT C AC SPEE, V1, P541 Yegnanarayana B, 1999, SPEECH COMMUN, V28, P25, DOI 10.1016/S0167-6393(98)00070-3 ZUE V, 1990, SPEECH COMMUN, V9, P351, DOI 10.1016/0167-6393(90)90010-7 NR 46 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD FEB PY 2011 VL 53 IS 2 BP 154 EP 174 DI 10.1016/j.specom.2010.08.011 PG 21 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 699LQ UT WOS:000285665300002 ER PT J AU Wang, RL Lu, JL AF Wang, Ruili Lu, Jingli TI Investigation of golden speakers for second language learners from imitation preference perspective by voice modification SO SPEECH COMMUNICATION LA English DT Article DE Computer assisted language learning (CALL); Computer Assisted Pronunciation Training (CAPT); Voice modification; Pitch; Speech rate AB This paper investigates what voice features (e.g., speech rate and pitch-formants) make a teacher's voice preferable for second language learners to imitate, when they practice sentence pronunciation using Computer-Assisted Pronunciation Training (CAPT) systems. The CAPT system employed in our investigation uses a single teacher's voice as the source to automatically resynthesize several sample voices with different voice features based on the features of a learner's voice. Our approach is different from that in the study conducted by Probst et al., which uses multiple native speakers' voices as sample voices [Probst, K., Ke, Y., Eskenazi, M., 2002. Enhancing foreign language tutors - in search of the golden speaker. Speech Communication 37 (3-4), 161-173]. Our approach can reduce the influence of characteristics of teachers' voices (e.g., voice quality and clarity) on the investigation. Our experimental results show that a teacher's voice, which has similar speech rate and pitch-formants to a learner's voice, is not always the learner's first imitation preference. Many factors can influence learners' imitation preferences, e.g., background and proficiency of the language that they are learning. Also, a learner's preferences may change at different learning stages. We thus advocate an automatic voice modification function in CAPT systems to provide speech learning material with a wide variety of voice features, e.g., different speech rates or different pitch-formants. Learners then can control the voice modifications according to their preferences. (C) 2010 Elsevier B.V. All rights reserved. C1 [Wang, Ruili] Massey Univ, Sch Engn & Adv Technol, Palmerston North, New Zealand. Nanjing Univ, State Key Lab Novel Software Technol, Nanjing, Peoples R China. RP Wang, RL (reprint author), Massey Univ, Sch Engn & Adv Technol, Palmerston North, New Zealand. CR Arnett M.K., 1952, J SO STATES COMM ASS, V17, P203 Bissiri MP, 2009, SPEECH COMMUN, V51, P933, DOI 10.1016/j.specom.2009.03.001 Black A, 2007, P ISCA ITRW SLATE WO Boersma P., 2009, PRAAT DOING PHONETIC Clark J., 2007, INTRO PHONETICS PHON Derwing TM, 2003, CAN MOD LANG REV, V59, P546 Dyck C, 2002, LANG LEARN TECHNOL, V6, P27 Erro D, 2007, INTERSPEECH 2007 EUR Eskenazi M., 2000, P INSTIL 2000 INT SP, P73 Eskenazi M, 1998, P SPEECH TECHN LANG, P77 Eskenazi M, 2009, SPEECH COMMUN, V51, P832, DOI 10.1016/j.specom.2009.04.005 Fant G., 1960, ACOUSTIC THEORY SPEE Felps D, 2009, SPEECH COMMUN, V51, P920, DOI 10.1016/j.specom.2008.11.004 Hirose K, 2004, P INT S TON ASP LANG, P77 Hismanoglu M., 2006, J LANGUAGE LINGUISTI, V2, P101 Jacob A, 2008, ADV HUMAN COMPUTER I Lee S.
T., 2008, THESIS AUSTR CATHOLI Lu J, 2010, INTERSPEECH2010 MAK, P606 Meszaros K, 2005, FOLIA PHONIATR LOGO, V57, P111, DOI 10.1159/000083572 MOULINES E, 1990, SPEECH COMMUN, V9, P453, DOI 10.1016/0167-6393(90)90021-Z Nagano K, 1990, 1 INT C SPOK LANG PR, P1169 Nolan Francis, 2003, P 15 INT C PHON SCI, P771 Ostendorf M, 1995, 95001 ECS BOST U Probst K, 2002, SPEECH COMMUN, V37, P161, DOI 10.1016/S0167-6393(01)00009-7 Sundstrom A., 1998, P ISCA WORKSH SPEECH, P49 NR 25 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2011 VL 53 IS 2 BP 175 EP 184 DI 10.1016/j.specom.2010.08.015 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 699LQ UT WOS:000285665300003 ER PT J AU Lobdell, BE Allen, JB Hasegawa-Johnson, MA AF Lobdell, B. E. Allen, J. B. Hasegawa-Johnson, M. A. TI Intelligibility predictors and neural representation of speech SO SPEECH COMMUNICATION LA English DT Article DE Speech perception; Articulation Index; Speech recognition; Speech representation ID ARTICULATION INDEX; - DISTINCTION; WORD RECOGNITION; PERCEPTION; NOISE; MODEL; IDENTIFICATION; PLACE; CUES; INTEGRATION AB Intelligibility predictors tell us a great deal about human speech perception, in particular which acoustic factors strongly affect human behavior and which do not. A particular intelligibility predictor, the Articulation Index (AI), is interesting because it models human behavior in noise, and its form has implications about the representation of speech in the brain. Specifically, the Articulation Index implies that a listener pre-consciously estimates the masking noise distribution and uses it to classify time/frequency samples as speech or non-speech. We classify consonants using representations of speech and noise which are consistent with this hypothesis and determine whether their error rate and error patterns are more or less consistent with human behavior than representations typical of automatic speech recognition systems. The new representations resulted in error patterns more similar to humans in cases where the testing and training data sets do not have the same masking noise spectrum. (C) 2010 Elsevier B.V. All rights reserved. C1 [Lobdell, B. E.; Allen, J. B.; Hasegawa-Johnson, M. A.] Univ Illinois, Beckman Inst, Urbana, IL 61820 USA. RP Lobdell, BE (reprint author), Univ Illinois, Beckman Inst, 405 N Mathews Ave, Urbana, IL 61820 USA.
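The account given in the abstract above treats intelligibility prediction as a comparison, in each frequency band, of the speech-plus-noise level with an estimate of the masking noise. The sketch below illustrates a generic, textbook-style AI computation (clipped band SNRs averaged over bands) together with a per-cell speech/non-speech labelling; the 30 dB range, the +12 dB offset, the equal band weights and the 3 dB margin are assumptions of the sketch, not values taken from the paper.

import numpy as np

def articulation_index(speech_band_db, noise_band_db):
    # Generic AI-style index: each band's SNR is mapped linearly onto [0, 1]
    # over an assumed 30 dB range and the bands are averaged with equal weights.
    snr_db = np.asarray(speech_band_db) - np.asarray(noise_band_db)
    audibility = np.clip((snr_db + 12.0) / 30.0, 0.0, 1.0)
    return float(audibility.mean())

def speech_like_cells(power_db, noise_estimate_db, margin_db=3.0):
    # Label a time/frequency cell as "speech" when it exceeds the masker estimate
    # by a fixed margin, mirroring the classify-then-recognize idea described above.
    return power_db > (noise_estimate_db + margin_db)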
J., 1982, Speech Communication, V1, DOI 10.1016/0167-6393(82)90006-1 DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 DRULLMAN R, 1994, J ACOUST SOC AM, V95, P1053, DOI 10.1121/1.408467 DRULLMAN R, 1994, J ACOUST SOC AM, V95, P2670, DOI 10.1121/1.409836 Drullman R, 1996, J ACOUST SOC AM, V99, P2358, DOI 10.1121/1.415423 DURLACH NI, 1986, J ACOUST SOC AM, V80, P63, DOI 10.1121/1.394084 FLETCHER H, 1950, J ACOUST SOC AM, V22, P89, DOI 10.1121/1.1906605 Fletcher H, 1938, J ACOUST SOC AM, V9, P275, DOI 10.1121/1.1915935 FRENCH NR, 1947, J ACOUST SOC AM, V19, P90, DOI 10.1121/1.1916407 FURUI S, 1986, J ACOUST SOC AM, V80, P1016, DOI 10.1121/1.393842 Hant JJ, 2003, SPEECH COMMUN, V40, P291, DOI 10.1016/S0167-6393(02)00068-7 HEDRICK MS, 1993, J ACOUST SOC AM, V94, P2005, DOI 10.1121/1.407503 Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616 Hermansky H, 1998, SPEECH COMMUN, V25, P3, DOI 10.1016/S0167-6393(98)00027-2 HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423 JONGMAN A, 1989, J ACOUST SOC AM, V85, P1718, DOI 10.1121/1.397961 Kewley Port D, 1983, J ACOUST SOC AM, V73, P1779 KRYTER K, 1926, J ACOUST SOC AM, V34, P1698 KRYTER KD, 1962, J ACOUST SOC AM, V34, P1689, DOI 10.1121/1.1909094 LEE B, 2007, BIENNIAL DSP VEHICLE Martin R, 2001, IEEE T SPEECH AUDI P, V9, P504, DOI 10.1109/89.928915 Musch H, 2000, ACOUST RES LET ONLIN, V2, P25 Omar MK, 2004, IEEE T SIGNAL PROCES, V52, P2701, DOI 10.1109/TSP.2004.834344 Padmanabhan M, 2005, IEEE T SPEECH AUDI P, V13, P512, DOI 10.1109/TSA.2005.848876 PAVLOVIC CV, 1984, J ACOUST SOC AM, V75, P1606, DOI 10.1121/1.390870 Phatak SA, 2007, J ACOUST SOC AM, V121, P2312, DOI 10.1121/1.2642397 Phatak SA, 2008, J ACOUST SOC AM, V124, P1220, DOI 10.1121/1.2913251 REMEZ RE, 1981, SCIENCE, V212, P947, DOI 10.1126/science.7233191 REPP BH, 1988, J ACOUST SOC AM, V83, P237, DOI 10.1121/1.396529 REPP BH, 1986, J ACOUST SOC AM, V79, P1987, DOI 10.1121/1.393207 Ronan D, 2004, J ACOUST SOC AM, V116, P1749, DOI 10.1121/1.1777858 SHANNON RV, 1995, SCIENCE, V270, P303, DOI 10.1126/science.270.5234.303 SHARF DJ, 1972, J ACOUST SOC AM, V51, P652, DOI 10.1121/1.1912890 STEENEKEN HJM, 1980, J ACOUST SOC AM, V67, P318, DOI 10.1121/1.384464 STEVENS KN, 1978, J ACOUST SOC AM, V64, P1358, DOI 10.1121/1.382102 Strope B, 1997, IEEE T SPEECH AUDI P, V5, P451, DOI 10.1109/89.622569 VIEMEISTER NF, 1991, J ACOUST SOC AM, V90, P858, DOI 10.1121/1.401953 NR 44 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD FEB PY 2011 VL 53 IS 2 BP 185 EP 194 DI 10.1016/j.specom.2010.08.016 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 699LQ UT WOS:000285665300004 ER PT J AU Alwan, A Jiang, JT Chen, W AF Alwan, Abeer Jiang, Jintao Chen, Willa TI Perception of place of articulation for plosives and fricatives in noise SO SPEECH COMMUNICATION LA English DT Article DE Speech perception; Place of articulation; Plosives; Fricatives; Noise; Psychoacoustics ID VOICELESS STOP CONSONANTS; LOCUS EQUATIONS; NONNATIVE LISTENERS; IMPAIRED LISTENERS; ENGLISH FRICATIVES; RELATIVE AMPLITUDE; SPEECH-PERCEPTION; WORD RECOGNITION; CUES; CONFUSIONS AB This study aims at uncovering perceptually-relevant acoustic cues for the labial versus alveolar place of articulation distinction in syllable-initial plosives {/b/, /d/, /p/, /t/} and fricatives {/f/, /s/, /v/, /z/} in noise. Speech materials consisted of naturally-spoken consonant-vowel (CV) syllables from four talkers, where the vowel was one of {/a/, /i/, /u/}. Acoustic analyses using logistic regression show that formant frequency measurements, relative spectral amplitude measurements, and burst/noise durations are generally reliable cues for labial/alveolar classification. In a subsequent perceptual experiment, each pair of syllables with the labial/alveolar distinction (e.g., /ba, da/) was presented to listeners at various levels of signal-to-noise ratio (SNR) in a 2-AFC task. A threshold SNR was obtained for each syllable pair using sigmoid fitting of the percent correct scores. Results show that the perception of the labial/alveolar distinction in noise depends on the manner of articulation, the vowel context, and the interaction between voicing and manner of articulation. Correlation analyses of the acoustic measurements and the threshold SNRs show that formant frequency measurements (such as F1 and F2 onset frequencies and F2 and F3 frequency changes) become increasingly important for the perception of labial/alveolar distinctions as the SNR degrades. (C) 2010 Elsevier B.V. All rights reserved. C1 [Alwan, Abeer; Jiang, Jintao; Chen, Willa] Univ Calif Los Angeles, Dept Elect Engn, Los Angeles, CA 90095 USA. RP Alwan, A (reprint author), Univ Calif Los Angeles, Dept Elect Engn, Los Angeles, CA 90095 USA.
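Two quantitative steps are named in the abstract above: logistic-regression classification of place from acoustic measurements, and sigmoid fitting of 2-AFC scores to obtain a threshold SNR. The snippet below sketches only the first step; the feature set, the token values and the cross-validation setup are placeholders invented for illustration, not the study's data or exact procedure.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# columns (per CV token): F2 onset (Hz), F3 onset (Hz), relative burst/noise amplitude (dB), duration (ms)
X = np.array([[1100.0, 2400.0, -8.0, 12.0],
              [1750.0, 2900.0, 2.0, 18.0],
              [1200.0, 2500.0, -6.0, 10.0],
              [1800.0, 3000.0, 4.0, 20.0]])
y = np.array([0, 1, 0, 1])  # 0 = labial, 1 = alveolar

clf = LogisticRegression().fit(X, y)
print(clf.coef_)  # coefficient signs/magnitudes indicate which cues separate the classes
print(cross_val_score(LogisticRegression(), X, y, cv=2).mean())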
FU NIH-NIDCD [R29-DC02033]; NSF; Radcliffe Institute FX This work was supported in part by the NIH-NIDCD Grant R29-DC02033, the NSF, and a Fellowship from the Radcliffe Institute to Abeer Alwan We thank Marcia Chen for her help in data analysis and Wendy Espeland, Marwa Elshakry, and Christine Stansell for commenting on an earlier version of this manuscript Thanks also to Steven Lulich for constructive comments The views expressed here are those of the authors and do not necessarily represent those of the NSF CR Alwan A, 1992, P INT C SPOK LANG PR, P1063 BEHRENS S, 1988, J ACOUST SOC AM, V84, P861, DOI 10.1121/1.396655 Benki JR, 2003, J ACOUST SOC AM, V113, P1689, DOI 10.1121/1.1534102 BLUMSTEIN SE, 1979, J ACOUST SOC AM, V66, P1001, DOI 10.1121/1.383319 Bradlow AR, 2007, J ACOUST SOC AM, V121, P2339, DOI 10.1121/1.2642103 Cooke M, 2008, J ACOUST SOC AM, V123, P414, DOI 10.1121/1.2804952 Cooke M, 2006, J ACOUST SOC AM, V119, P1562, DOI 10.1121/1.2166600 Cutler A, 2004, J ACOUST SOC AM, V116, P3668, DOI 10.1121/1.1810292 Cutler A, 2008, J ACOUST SOC AM, V124, P1264, DOI 10.1121/1.2946707 DELATTRE PC, 1955, J ACOUST SOC AM, V27, P769, DOI 10.1121/1.1908024 ENGEN KJV, 2007, J ACOUST SOC AM, V121, P519 Fant G., 1973, SPEECH SOUNDS FEATUR, P110 Farar CL, 1987, J ACOUST SOC AM, V81, P1085 Fruchter D, 1997, J ACOUST SOC AM, V102, P2997, DOI 10.1121/1.421012 GLASBERG BR, 1990, HEARING RES, V47, P103, DOI 10.1016/0378-5955(90)90170-T Guerlekian JA, 1981, J ACOUST SOC AM, V70, P1624 Hant JJ, 2000, THESIS U CALIFORNIA Hant JJ, 2003, SPEECH COMMUN, V40, P291, DOI 10.1016/S0167-6393(02)00068-7 Hant JJ, 2000, P 6 INT C SPOK LANG, P941 HARRIS KS, 1958, LANG SPEECH, V1, P1 HEDRICK MS, 1993, J ACOUST SOC AM, V94, P2005, DOI 10.1121/1.407503 Hedrick MS, 1996, J ACOUST SOC AM, V100, P3398, DOI 10.1121/1.416981 Hedrick MS, 2007, J SPEECH LANG HEAR R, V50, P254, DOI 10.1044/1092-4388(2007/019) HEDRICK MS, 1995, J ACOUST SOC AM, V98, P1292, DOI 10.1121/1.413466 HEINZ JM, 1961, J ACOUST SOC AM, V33, P589, DOI 10.1121/1.1908734 HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423 Jiang JT, 2006, J ACOUST SOC AM, V119, P1092, DOI 10.1121/1.2149841 JONGMAN A, 1989, J ACOUST SOC AM, V85, P1718, DOI 10.1121/1.397961 Kewley Port D, 1982, J ACOUST SOC AM, V72, P379 Lecumberri MLG, 2006, J ACOUST SOC AM, V119, P2445, DOI 10.1021/1.2180210 LEVITT H, 1971, J ACOUST SOC AM, V49, P467, DOI 10.1121/1.1912375 Li N, 2007, J ACOUST SOC AM, V122, P1165, DOI 10.1121/1.2749454 Liberman AM, 1954, PSYCHOL MONOGR-GEN A, V68, P1 MILLER GA, 1955, J ACOUST SOC AM, V27, P338, DOI 10.1121/1.1907526 Nittrouer S, 2003, J ACOUST SOC AM, V113, P2254 OHDE RN, 1983, J ACOUST SOC AM, V74, P706, DOI 10.1121/1.389856 Parikh G, 2005, J ACOUST SOC AM, V118, P3874, DOI 10.1121/1.2118407 Potter R. K., 1947, VISIBLE SPEECH Redford MA, 1999, J ACOUST SOC AM, V106, P1555, DOI 10.1121/1.427152 Shadle CH, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P1521 SHANNON RV, 1995, SCIENCE, V270, P303, DOI 10.1126/science.270.5234.303 Simpson SA, 2005, J ACOUST SOC AM, V118, P2775, DOI 10.1121/1.2062650 SOLI SD, 1979, J ACOUST SOC AM, V66, P46, DOI 10.1121/1.382972 Stevens K. 
N., 1985, PHONETIC LINGUISTICS, P243 Stevens KN, 1999, P INT C PHON SCI SAN, P1117 Stevens K.N., 1998, ACOUSTIC PHONETICS STEVENS KN, 1978, J ACOUST SOC AM, V64, P1358, DOI 10.1121/1.382102 Strope B, 1997, IEEE T SPEECH AUDI P, V5, P451, DOI 10.1109/89.622569 Suchato A., 2004, THESIS MIT CAMBRIDGE SUSSMAN HM, 1995, J ACOUST SOC AM, V97, P3112, DOI 10.1121/1.411873 SUSSMAN HM, 1993, J ACOUST SOC AM, V94, P1256, DOI 10.1121/1.408178 SUSSMAN HM, 1991, J ACOUST SOC AM, V90, P1309, DOI 10.1121/1.401923 WANG MD, 1973, J ACOUST SOC AM, V54, P1248, DOI 10.1121/1.1914417 You H. Y., 1979, THESIS U EDMONTON ED Zue V. W., 1976, THESIS MIT CAMBRIDGE NR 55 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2011 VL 53 IS 2 BP 195 EP 209 DI 10.1016/j.specom.2010.09.001 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 699LQ UT WOS:000285665300005 ER PT J AU Borowicz, A Petrovsky, A AF Borowicz, Adam Petrovsky, Alexandr TI Signal subspace approach for psychoacoustically motivated speech enhancement SO SPEECH COMMUNICATION LA English DT Article DE Speech enhancement; KLT; Psychoacoustics ID NOISE AB In this paper we deal with the perceptually motivated signal subspace methods for speech enhancement. We focus on an extended spectral-domain-constrained (SDC) estimator. It is obtained using the Lagrange multipliers method. We present an algorithm for a precise computation of the Lagrange multipliers, allowing for direct shaping of the residual noise power spectrum. In addition, the SDC estimator is presented in a new, possibly more effective form. As a practical implementation of the estimator we propose a perceptually constrained signal subspace (PCSS) method for speech enhancement. The approach utilizes masking phenomena for residual noise shaping and is optimal for the case of coloured noise. Also, a less demanding approximate version of this method is derived. Finally, a comparative evaluation of the best known subspace-based methods is performed using objective speech quality measures and listening tests. Results show that the PCSS method outperforms the other methods, providing high noise attenuation and better speech quality. (C) 2010 Elsevier B.V. All rights reserved. C1 [Borowicz, Adam; Petrovsky, Alexandr] Bialystok Tech Univ, Dept Real Time Syst, PL-15351 Bialystok, Poland. RP Borowicz, A (reprint author), Bialystok Tech Univ, Dept Real Time Syst, Wiejska Str 45A, PL-15351 Bialystok, Poland. CR EPHRAIM Y, 1995, IEEE T SPEECH AUDI P, V3, P251, DOI 10.1109/89.397090 Garofolo J. S., 1993, DARPA TIMIT ACOUSTIC GUSTAFSSON S, 1998, P IEEE INT C AC SPEE, V1, P397, DOI 10.1109/ICASSP.1998.674451 Hu Y, 2003, IEEE T SPEECH AUDI P, V11, P334, DOI 10.1109/TSA.2003.814458 JABLOUN F, 2002, P IEEE INT C AC SPEE, V1, P569 Jabloun F, 2003, IEEE T SPEECH AUDI P, V11, P700, DOI 10.1109/TSA.2003.818031 JOHNSTON JD, 1988, IEEE J SEL AREA COMM, V6, P314, DOI 10.1109/49.608 Lev An H, 2003, IEEE SIGN PROCESS LE, V10, P104 Mittal P, 2000, IEEE T SPEECH AUDIO, V8, P159 Petrovsky A., 2004, AES CONV 116 BERL GE Rezayee A, 2001, IEEE T SPEECH AUDI P, V9, P87, DOI 10.1109/89.902276 Vetter R, 1999, P EUR C SPEECH COMM, P2411 YANG B, 1995, IEEE T SIGNAL PROCES, V43, P95 YANG WH, 1998, ACOUST SPEECH SIG PR, P541 NR 14 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
PD FEB PY 2011 VL 53 IS 2 BP 210 EP 219 DI 10.1016/j.specom.2010.09.002 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 699LQ UT WOS:000285665300006 ER PT J AU Feld, J Sommers, M AF Feld, Julia Sommers, Mitchell TI There goes the neighborhood: Lipreading and the structure of the mental lexicon SO SPEECH COMMUNICATION LA English DT Article DE Lipreading; Spoken word recognition; Lexical competition; Lexical neighborhoods ID AUDIOVISUAL WORD RECOGNITION; SPEECH-PERCEPTION; SPOKEN WORDS; DISTINCTIVENESS; MODEL; COMPETITION; ADULTS AB A central question in spoken word recognition research is whether words are recognized relationally, in the context of other words in the mental lexicon (McClelland and Elman, 1986; Luce and Pisoni, 1998). The current research evaluated metrics for measuring the influence of the mental lexicon on visually perceived (lipread) spoken word recognition. Lexical competition (the extent to which perceptually similar words influence recognition of a stimulus word) was quantified using metrics that are well-established in the literature, as well as a novel statistical method for calculating perceptual confusability, based on the Phi-square statistic. The Phi-square statistic proved an effective measure for assessing lexical competition and explained significant variance in visual spoken word recognition beyond that accounted for by traditional metrics. Because these values include the influence of a large subset of the lexicon (rather than only perceptually similar words), they suggest that even perceptually distant words may receive some activation, and therefore provide competition, during spoken word recognition. This work supports and extends earlier research (Auer, 2002; Mattys et al., 2002) that proposed a common recognition system underlying auditory and visual spoken word recognition, and provides support for the use of the Phi-square statistic for quantifying lexical competition. (C) 2010 Elsevier B.V. All rights reserved. C1 [Feld, Julia; Sommers, Mitchell] Washington Univ, Dept Psychol, St Louis, MO 63103 USA. RP Feld, J (reprint author), Washington Univ, Dept Psychol, Campus Box 1125, St Louis, MO 63103 USA.
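A common way to turn identification-confusion data into the kind of perceptual confusability measure mentioned above is the phi-square statistic: the chi-square of two stimuli's response-count vectors, normalised by the total number of responses. The sketch below shows that generic formulation; it is offered as an illustration under that assumption, not as the exact computation used in the paper, and the example counts are invented.

import numpy as np

def phi_square(resp_a, resp_b):
    # Phi-square between two stimuli given response counts over the same categories:
    # chi-square of the 2 x K contingency table divided by the total response count.
    table = np.vstack([np.asarray(resp_a, float), np.asarray(resp_b, float)])
    n = table.sum()
    expected = table.sum(axis=1, keepdims=True) * table.sum(axis=0, keepdims=True) / n
    nonzero = expected > 0  # skip response categories neither stimulus ever received
    chi2 = (((table - expected) ** 2)[nonzero] / expected[nonzero]).sum()
    return chi2 / n

# invented visual confusion counts for two stimuli over the response categories (b, d, g, v)
print(phi_square([40, 3, 2, 15], [5, 38, 12, 5]))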
RI Strand, Julia/J-5432-2014 OI Strand, Julia/0000-0001-5950-0139 FU National Institute on Aging [RO1 AG 18029-4] FX We thank Lorin Lachs, Luis Hernandez, and David Pisoni for generously sharing with us their identification data from the Hoosier Audiovisual Multitalker Database This work was supported in part by grant award number RO1 AG 18029-4 from the National Institute on Aging CR AUER E, 2008, J ACOUST SOC AM, V124, P2459 Auer ET, 2002, PSYCHON B REV, V9, P341, DOI 10.3758/BF03196291 Auer ET, 1997, J ACOUST SOC AM, V102, P3704, DOI 10.1121/1.420402 Balota DA, 2007, BEHAV RES METHODS, V39, P445, DOI 10.3758/BF03193014 BERNSTEIN LE, 1997, P ESCA ESCOP WORKSH, P21 GOLDINGER SD, 1989, J MEM LANG, V28, P501, DOI 10.1016/0749-596X(89)90009-0 Iverson P, 1998, SPEECH COMMUN, V26, P45, DOI 10.1016/S0167-6393(98)00049-1 JACKSON PL, 1988, NEW REFLECTIONS SPEE, P99 Jusczyk PW, 2002, EAR HEARING, V23, P2, DOI 10.1097/00003446-200202000-00002 Kaiser AR, 2003, J SPEECH LANG HEAR R, V46, P390, DOI 10.1044/1092-4388(2003/032) LACHS L, 1998, RES SPOKEN LANGUAGE, V22, P377 Luce PA, 1998, EAR HEARING, V19, P1, DOI 10.1097/00003446-199802000-00001 Lund K, 1996, BEHAV RES METH INSTR, V28, P203, DOI 10.3758/BF03204766 Mattys SL, 2002, PERCEPT PSYCHOPHYS, V64, P667, DOI 10.3758/BF03194734 MCCLELLAND JL, 1986, COGNITIVE PSYCHOL, V18, P1, DOI 10.1016/0010-0285(86)90015-0 MORTON J, 1979, PSYCHOLINGUISTICS ST, V2 MURRAY NT, 2008, FDN AURAL REHABILITA MURRAY NT, 2007, TRENDS AMPLIF, V11, P233 NORRIS D, 1994, COGNITION, V52, P189, DOI 10.1016/0010-0277(94)90043-4 OWENS E, 1985, J SPEECH HEAR RES, V28, P381 Seitz P. F., 1998, PHLEX PHONOLOGICALLY Sommers MS, 1999, PSYCHOL AGING, V14, P458, DOI 10.1037/0882-7974.14.3.458 SUMMERFIELD Q, 1992, PHILOS T ROY SOC B, V335, P71, DOI 10.1098/rstb.1992.0009 Vitevitch MS, 1998, PSYCHOL SCI, V9, P325, DOI 10.1111/1467-9280.00064 WALDEN BE, 1977, J SPEECH HEAR RES, V20, P130 NR 25 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD FEB PY 2011 VL 53 IS 2 BP 220 EP 228 DI 10.1016/j.specom.2010.09.003 PG 9 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 699LQ UT WOS:000285665300007 ER PT J AU Morrison, GS AF Morrison, Geoffrey Stewart TI A comparison of procedures for the calculation of forensic likelihood ratios from acoustic-phonetic data: Multivariate kernel density (MVKD) versus Gaussian mixture model-universal background model (GMM-UBM) SO SPEECH COMMUNICATION LA English DT Article DE Forensic voice comparison; Likelihood ratio; Acoustic-phonetic; Multivariate kernel density; GMM-UBM ID SPEAKER RECOGNITION; CASEWORK; SYSTEMS; SCIENCE; FUSION; IMPACT AB Two procedures for the calculation of forensic likelihood ratios were tested on the same set of acoustic-phonetic data. One procedure was a multivariate kernel density procedure (MVKD), which is common in acoustic-phonetic forensic voice comparison, and the other was a Gaussian mixture model-universal background model (GMM-UBM) procedure, which is common in automatic forensic voice comparison. The data were coefficient values from discrete cosine transforms fitted to second-formant trajectories of /ai/, /ei/, /ou/, /au/, and /oi/ tokens produced by 27 male speakers of Australian English. Scores were calculated separately for each phoneme and then fused using logistic regression. The performance of the fused GMM-UBM system was much better than that of the fused MVKD system, both in terms of accuracy (as measured using the log-likelihood ratio cost, C-llr) and precision (as measured using an empirical estimate of the 95% credible interval for the likelihood ratios from the different-speaker comparisons). (C) 2010 Elsevier B.V. All rights reserved. C1 [Morrison, Geoffrey Stewart] Univ New S Wales, Sch Elect Engn & Telecommun, Forens Voice Comparison Lab, Sydney, NSW 2052, Australia. [Morrison, Geoffrey Stewart] Australian Natl Univ, Sch Language Studies, Canberra, ACT 0200, Australia. RP Morrison, GS (reprint author), Univ New S Wales, Sch Elect Engn & Telecommun, Forens Voice Comparison Lab, Sydney, NSW 2052, Australia. 
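The two evaluation ingredients named in the Morrison abstract, logistic-regression fusion of per-phoneme scores and the log-likelihood-ratio cost (Cllr), can be sketched in Python with synthetic scores. The Cllr function follows the standard definition; the score distributions, the number of trials, and the shortcut of treating fused log-odds as log likelihood ratios are assumptions made only for this example.

import numpy as np
from sklearn.linear_model import LogisticRegression

def cllr(lr_same: np.ndarray, lr_diff: np.ndarray) -> float:
    """Log-likelihood-ratio cost: 0.5 * [mean log2(1 + 1/LR) over same-speaker
    trials + mean log2(1 + LR) over different-speaker trials]."""
    return 0.5 * (np.mean(np.log2(1.0 + 1.0 / lr_same))
                  + np.mean(np.log2(1.0 + lr_diff)))

rng = np.random.default_rng(0)

# Synthetic per-phoneme scores: rows = comparisons, columns = five phonemes.
# Same-speaker comparisons score higher on average than different-speaker ones.
same = rng.normal(2.0, 1.0, (200, 5))
diff = rng.normal(-2.0, 1.0, (2000, 5))

# Logistic-regression fusion of the per-phoneme scores into one score per trial.
X = np.vstack([same, diff])
y = np.concatenate([np.ones(len(same)), np.zeros(len(diff))])
fuser = LogisticRegression().fit(X, y)
fused = fuser.decision_function(X)     # fused log-odds-style score per trial
lr = np.exp(fused)                     # treated here as likelihood ratios

print(f"Cllr of fused system: {cllr(lr[:200], lr[200:]):.3f}")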
FU Australian Research Council [DP0774115] FX This research was funded by Australian Research Council Discovery Grant No DP0774115 Thanks to Yuko Kinoshita for supplying the original audio data Thanks to Julien Epps for comments on an earlier version of this paper CR Agresti A, 2007, INTRO CATEGORICAL DA, V2nd Aitken CGG, 2004, STAT EVALUATION FORE Aitken CGG, 2004, J ROY STAT SOC C-APP, V53, P109, DOI 10.1046/j.0035-9254.2003.05271.x Aitken C.G.G., 1991, USE STAT FORENSIC SC Aitken CGG, 2004, APPL STAT, V53, P665, DOI [10 1111/j 1467 9876 2004 02031 x, DOI 10.1111/J.1467.9876.2004.02031.X] Alderman T., 2004, P 10 AUSTR INT C SPE, P177 Alexander A, 2005, INT J SPEECH LANG LA, V12, P214, DOI 10.1558/sll.2005.12.2.214 Alexander A., 2004, P 8 INT C SPOK LANG ALEXANDER A, 2005, THESIS ECOLE POLYTEC [Anonymous], 2002, FORENSIC SPEAKER IDE Association of Forensic Science Providers, 2009, SCI JUSTICE, V49, P161, DOI [DOI 10.1016/J.SCIJUS.2009.07.004, 10.1016/j.scijus.2009.07.004] Balding DJ, 2005, STAT PRACT, P1, DOI 10.1002/9780470867693 Becker T, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P1505 Becker T, 2010, P OD 2010 LANG SPEAK, P58 Becker T, 2009, P NAG DAGA INT C AC Broeders APA, 1995, P INT C PHON SCI STO, V3, P154 Brummer N, 2007, IEEE T AUDIO SPEECH, V15, P2072, DOI 10.1109/TASL.2007.902870 Brummer N, 2006, COMPUT SPEECH LANG, V20, P230, DOI 10.1016/j.csl.2005.08.001 Buckleton J, 2005, FORENSIC DNA EVIDENCE INTERPRETATION, P27 Castro D.R., 2007, THESIS U AUTONOMA MA Champod C, 2000, SPEECH COMMUN, V31, P193, DOI 10.1016/S0167-6393(99)00078-3 Cook R, 1998, SCI JUSTICE, V38, P231, DOI 10.1016/S1355-0306(98)72117-3 Curran J. M., 2005, LAW PROBABILITY RISK, V4, P115, DOI [10.1093/1pr/mgi009, DOI 10.1093/LPR/MGI009, 10 1093/Ipr/mgi009] Curran JM, 2002, SCI JUSTICE, V42, P29, DOI 10.1016/S1355-0306(02)71794-2 Drygajlo A, 2007, IEEE SIGNAL PROC MAR, P132, DOI [10 1109/MSP 2007 323278, DOI 10.1109/MSP.2007.323278] Duda R., 2000, PATTERN CLASSIFICATI Enzinger E, 2010, P 39 AUD ENG SOC C A Evett I, 2000, SCI JUSTICE, V40, P233, DOI 10.1016/S1355-0306(00)71993-9 Evett IW, 1996, ADV FOREN H, V6, P79 Evett I.W., 1990, FORENSIC SCI PROGR, V4, P141 Evett IW, 1998, SCI JUSTICE, V38, P198, DOI 10.1016/S1355-0306(98)72105-7 Ferrer L, 2009, J MACH LEARN RES, V10, P2079 Friedman J., 2009, ELEMENTS STAT LEARNI, V2nd Gonzalez Rodriguez J, 2007, IEEE T AUDIO SPEECH, V15, P2104, DOI [10 1109/ TASL 2007 902747, DOI 10.1109/TASL.2007.902747] Gonzalez Rodriguez J, 2005, FORENSIC SCI INT, V155, P126, DOI [10 1016/j forsciint 2004 11 007, DOI 10.1016/J.FORSCIINT.2004.11.007] Guillemin BJ, 2008, INT J SPEECH LANG LA, V15, P193, DOI 10.1558/ijsll.v15i2.193 Hosmer DW, 2000, APPL LOGISTIC REGRES, V2nd Ishihara S, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P1941 Jessen M., 2008, LANG LING COMPASS, V2, P671, DOI [DOI 10.1111/J.1749-818X.2008.00066.X, 10.1111/j.1749-818x.2008.00066.x] Kahn J, 2010, P OD 2010 LANG SPEAK, P109 Kinoshita Y, 2008, P OD 2008 SPEAK LANG Kinoshita Y, 2009, INT J SPEECH LANG LA, V16, P91, DOI 10.1558/ijsll.v16i1.91 Kinoshita Y., 2006, P 11 AUSTR INT C SPE, P112 Lewis S., 1984, P I ACOUSTICS, V6, P69 LINDLEY DV, 1977, BIOMETRIKA, V64, P207, DOI 10.1093/biomet/64.2.207 Lucy D., 2005, INTRO STAT FORENSIC Menard S., 2002, APPL LOGISTIC REGRES Meuwly D, 2001, THESIS U LAUSANNE LA Meuwly D, 2001, P 2001 SPEAK OD SPEA Morrison G. 
S., 2010, P OD 2010 LANG SPEAK, P63 MORRISON GS, 2008, INT J SPEECH LANG LA, V15, P247, DOI DOI 10.1558/IJSLL.V15I2.249 Morrison GS, 2009, SCI JUSTICE, V49, P298, DOI 10.1016/j.scijus.2009.09.002 Morrison GS, 2009, J ACOUST SOC AM, V125, P2387, DOI 10.1121/1.3081384 Morrison GS, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P1501 Morrison GS, 2010, EXPERT EVIDENCE National Research Council, 2009, STRENGTH FOR SCI US Pigeon S, 2000, DIGIT SIGNAL PROCESS, V10, P237, DOI 10.1006/dspr.1999.0358 Ramos Castro D, 2006, P OD 2006 SPEAK LANG, DOI [10 1109/ODYSSEY 2006 248088, DOI 10.1109/ODYSSEY.2006.248088] Reynolds DA, 2000, DIGIT SIGNAL PROCESS, V10, P19, DOI 10.1006/dspr.1999.0361 Robertson B., 1995, INTERPRETING EVIDENC Rodriguez J., 2006, COMPUT SPEECH LANG, V20, P331, DOI DOI 10.1016/J.CSL.2005.08.005 Rose P., 2004, P 10 AUSTR INT C SPE, P492 Rose P, 2006, COMPUT SPEECH LANG, V20, P159, DOI 10.1016/j.csl.2005.07.003 Rose P, 2009, INT J SPEECH LANG LA, V16, P139, DOI 10.1558/ijsll.v16i1.139 Rose P., 2006, P 11 AUSTR INT C SPE, P329 Rose P, 2006, P OD 2006 SPEAK LANG, DOI [10 1109/ODYSSEY 2006 248095, DOI 10.1109/ODYSSEY.2006.248095] Rose P., 2007, P 16 INT C PHON SCI, P1817 Rose Phil, 2006, P 11 AUSTR INT C SPE, P64 Thiruvaran T, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P1497 van Leeuwen D., 2007, SPEAKER CLASSIFICATI, VI, P330, DOI [DOI 10.1007/978-3-540-74200-5_19, 10.1007/978- 3-540-74200-5_19] NR 70 TC 11 Z9 13 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2011 VL 53 IS 2 BP 242 EP 256 DI 10.1016/j.specom.2010.09.005 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 699LQ UT WOS:000285665300009 ER PT J AU Lei, Y Hansen, JHL AF Lei, Yun Hansen, John H. L. TI Mismatch modeling and compensation for robust speaker verification SO SPEECH COMMUNICATION LA English DT Article DE Speaker verification; Factor analysis; Eigenchannel; Mismatch compensation ID SPEECH RECOGNITION; CHANNEL COMPENSATION; VARIABILITY; NOISE AB In this study, the primary channel mismatch scenario between enrollment and test conditions in a speaker verification task is analyzed and modeled. A novel Gaussian mixture model with universal background model (GMM-UBM) frame-based compensation model related to the mismatch is formulated and evaluated using National Institute of Standards and Technology (NIST) speaker recognition evaluation (SRE) 2008 data, along with a comparison to the well-known eigenchannel model. The proposed compensation method shows significant improvement over the eigenchannel model when only the supervector of the UBM is employed. Here, the supervector of the enrollment speaker model is not included in the estimation of the mismatch, since it is difficult to obtain the true supervector of the speaker from the limited 5 min of channel-dependent speech data alone. The proposed mismatch compensation model therefore shows that a supervector constructed from the UBM can more accurately describe the mismatch between enrollment and test data, resulting in effective classification performance improvement for speaker/speech applications. (C) 2010 Elsevier B.V. All rights reserved. C1 [Hansen, John H. L.] Univ Texas Dallas, Erik Jonsson Sch Engn & Comp Sci, CRSS, Dept Elect Engn, Richardson, TX 75080 USA. 
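The Lei and Hansen record builds on standard GMM-UBM supervector machinery: MAP-adapting a universal background model towards an utterance and concatenating the adapted means. A minimal Python sketch of that machinery is given below, assuming sklearn's GaussianMixture as the UBM, random data in place of MFCC frames, and a textbook relevance factor; it illustrates the background the abstract relies on, not the proposed mismatch compensation model itself.

import numpy as np
from sklearn.mixture import GaussianMixture

def map_adapted_supervector(ubm: GaussianMixture, frames: np.ndarray,
                            relevance: float = 16.0) -> np.ndarray:
    """Mean-only MAP adaptation of a GMM-UBM followed by mean concatenation."""
    post = ubm.predict_proba(frames)              # (T, M) responsibilities
    n_m = post.sum(axis=0)                        # zeroth-order statistics
    f_m = post.T @ frames                         # first-order statistics (M, D)
    e_m = f_m / np.maximum(n_m[:, None], 1e-10)   # per-component data means
    alpha = (n_m / (n_m + relevance))[:, None]    # adaptation coefficients
    adapted = alpha * e_m + (1.0 - alpha) * ubm.means_
    return adapted.reshape(-1)                    # (M*D,) supervector

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    background = rng.normal(size=(5000, 13))      # stand-in for MFCC frames
    ubm = GaussianMixture(n_components=8, covariance_type="diag",
                          random_state=0).fit(background)
    utterance = rng.normal(0.3, 1.0, size=(300, 13))
    print(map_adapted_supervector(ubm, utterance).shape)   # (8 * 13,)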
RP Hansen, JHL (reprint author), Univ Texas Dallas, Erik Jonsson Sch Engn & Comp Sci, CRSS, Dept Elect Engn, 2601 N Floyd Rd,EC33, Richardson, TX 75080 USA. FU AFRL [FA8750-09-C-0067] FX This project was supported by AFRL under a subcontract to RADC Inc under FA8750-09-C-0067 Approved for public release, distribution unlimited CR Acero A., 1990, THESIS CARNEGIE MELL Acero Alex, 2000, INTERSPEECH, P869 Burget L, 2007, IEEE T AUDIO SPEECH, V15, P1979, DOI 10.1109/TASL.2007.902499 Campbell WM, 2006, INT CONF ACOUST SPEE, P97 Castaldo F, 2007, IEEE T AUDIO SPEECH, V15, P1969, DOI 10.1109/TASL.2007.901823 GALES MJF, 1995, COMPUT SPEECH LANG, V9, P289, DOI 10.1006/csla.1995.0014 Gales M.J.F, 1995, THESIS CAMBRIDGE U Hank Liao, 2007, THESIS U CAMBRIDGE Hansen J. H. L., 1988, THESIS GEORGIA I TEC Hansen JHL, 1996, SPEECH COMMUN, V20, P151, DOI 10.1016/S0167-6393(96)00050-7 Kenny P, 2008, IEEE T AUDIO SPEECH, V16, P980, DOI 10.1109/TASL.2008.925147 Kenny P, 2007, IEEE T AUDIO SPEECH, V15, P1435, DOI 10.1109/TASL.2006.881693 Kenny P., 2004, IEEE ICASSP 2004 MON, P37 Kenny P, 2005, IEEE T SPEECH AUDI P, V13, P345, DOI 10.1109/TSA.2004.840940 Kenny P, 2005, INT CONF ACOUST SPEE, P637 Kim DY, 1998, SPEECH COMMUN, V24, P39, DOI 10.1016/S0167-6393(97)00061-7 Matrouf D, 2007, INTERSPEECH, P1242 Moreno P.J., 1996, THESIS CARNEGIE MELL Pelecanos J., 2001, IEEE OD 2001 SPEAK L, P18 REYNOLDS DA, 1995, INT CONF ACOUST SPEE, P329, DOI 10.1109/ICASSP.1995.479540 Reynolds DA, 2000, DIGIT SIGNAL PROCESS, V10, P19, DOI 10.1006/dspr.1999.0361 Solomonoff A, 2005, INT CONF ACOUST SPEE, P629 Stolcke A., 2005, INTERSPEECH, P2425 Vair C, 2006, IEEE OD SPEAK LANG R, P1 VARGA AP, 1990, INT CONF ACOUST SPEE, P845, DOI 10.1109/ICASSP.1990.115970 Vogt R, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P853 Vogt R., 2005, INTERSPEECH, P3117 NR 27 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD FEB PY 2011 VL 53 IS 2 BP 257 EP 268 DI 10.1016/j.specom.2010.09.006 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 699LQ UT WOS:000285665300010 ER PT J AU Dai, P Soon, IY AF Dai, Peng Soon, Ing Yann TI A temporal warped 2D psychoacoustic modeling for robust speech recognition system SO SPEECH COMMUNICATION LA English DT Article DE Automatic speech recognition; 2D mask; Simultaneous masking; Temporal masking; Temporal warping ID WORD RECOGNITION; AUDITORY-SYSTEM; MASKING; ADAPTATION; FREQUENCY AB The human auditory system performs better than speech recognition systems under noisy conditions, which leads us to the idea of incorporating the human auditory system into automatic speech recognition engines. In this paper, a hybrid feature extraction method, which utilizes forward masking, backward masking, and lateral inhibition, is incorporated into mel-frequency cepstral coefficients (MFCC). The integration is implemented using a warped 2D psychoacoustic filter. The AURORA2 database is utilized for testing, and the Hidden Markov Model (HMM) is used for recognition. Comparison is made against lateral inhibition (LI), forward masking (FM), cepstral mean and variance normalization (CMVN), the original 2D psychoacoustic filter and the RASTA filter. Experimental results show that the word recognition rate is significantly improved, especially under noisy conditions. (C) 2010 Elsevier B.V. All rights reserved. C1 [Dai, Peng; Soon, Ing Yann] Nanyang Technol Univ, Sch Elect & Elect Engn, Singapore 639798, Singapore. RP Dai, P (reprint author), Nanyang Technol Univ, Sch Elect & Elect Engn, Singapore 639798, Singapore. RI Soon, Ing Yann/A-5173-2011 CR Allen JB, 1994, IEEE T SPEECH AUDI P, V2, P567, DOI 10.1109/89.326615 Ambikairajah E, 1997, ELECTRON COMMUN ENG, V9, P165, DOI 10.1049/ecej:19970403 Brookes M., 1997, VOICEBOX CHENG YM, 1991, IEEE T SIGNAL PROCES, V39, P1943, DOI 10.1109/78.134427 Dai P., 2009, P ICICS DEC DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 Golden B, 2000, SPEECH AUDIO SIGNAL Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616 Hermansky H, 1998, SPEECH COMMUN, V25, P3, DOI 10.1016/S0167-6393(98)00027-2 Hirsch H. G., 2000, P ISCA ITRW ASR2000, P181 JESTEADT W, 1982, J ACOUST SOC AM, V71, P950, DOI 10.1121/1.387576 Kvale MN, 2004, J NEUROPHYSIOL, V91, P604, DOI 10.1152/jn.00484.2003 LEONARD RG, 1984, P ICASSP, V3, P42 Lois LE, 1962, J ACOUST SOC AM, V34, P1116 Luo XW, 2008, 2008 INTERNATIONAL CONFERENCE ON AUDIO, LANGUAGE AND IMAGE PROCESSING, VOLS 1 AND 2, PROCEEDINGS, P1105 Nghia PT, 2008, P ICATC, P349 Oxenham AJ, 2001, J ACOUST SOC AM, V109, P732, DOI 10.1121/1.1336501 Oxenham AJ, 2000, HEARING RES, V150, P258, DOI 10.1016/S0378-5955(00)00206-9 Park KY, 2003, NEUROCOMPUTING, V52-4, P615, DOI 10.1016/S0925-2312(02)00791-9 SHAMMA SA, 1985, J ACOUST SOC AM, V78, P1622, DOI 10.1121/1.392800 Strope B, 1997, IEEE T SPEECH AUDI P, V5, P451, DOI 10.1109/89.622569 Virag N, 1999, IEEE T SPEECH AUDI P, V7, P126, DOI 10.1109/89.748118 NR 22 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2011 VL 53 IS 2 BP 229 EP 241 DI 10.1016/j.specom.2010.09.004 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 699LQ UT WOS:000285665300008 ER PT J AU Kim, W Stern, RM AF Kim, Wooil Stern, Richard M. 
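One way to picture the warped 2D psychoacoustic filtering described in the Dai and Soon record is as a 2D convolution over a log-Mel spectrogram whose time axis encodes forward/backward masking and whose frequency axis encodes lateral inhibition. The kernel coefficients below are invented for the sketch (the paper derives its filter, including the temporal warping, from psychoacoustic masking data), so this illustrates the idea rather than the published filter.

import numpy as np
from scipy.signal import convolve2d

# Illustrative 2D kernel: the column (time) profile mimics backward/forward
# masking, the row (frequency) profile a centre-surround lateral inhibition.
time_profile = np.array([0.15, 0.3, 1.0, 0.45, 0.2])   # backward | current | forward
freq_profile = np.array([-0.2, 1.0, -0.2])              # lateral inhibition
kernel = np.outer(freq_profile, time_profile)
kernel /= np.abs(kernel).sum()

def filter_log_mel(log_mel: np.ndarray) -> np.ndarray:
    """Apply the 2D kernel to a (n_mels, n_frames) log-Mel spectrogram;
    MFCC-style features would then be the DCT of each filtered column."""
    return convolve2d(log_mel, kernel, mode="same", boundary="symm")

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    spec = rng.normal(size=(26, 200))          # stand-in log-Mel spectrogram
    print(filter_log_mel(spec).shape)          # (26, 200)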
TI Mask classification for missing-feature reconstruction for robust speech recognition in unknown background noise SO SPEECH COMMUNICATION LA English DT Article DE Robust speech recognition; Missing-feature reconstruction; Frequency-dependent mask classification; Colored-noise masker generation; Multi-band partition method AB "Missing-feature" techniques to improve speech recognition accuracy are based on the blind determination of which cells in a spectrogram-like display of speech are corrupted by the effects of noise or other types of disturbance (and hence are "missing"). In this paper we present three new approaches that improve the speech recognition accuracy obtained using missing-feature techniques. It had been found in previous studies (e.g. Seltzer et al., 2004) that Bayesian approaches to missing-feature classification are effective in ameliorating the effects of various types of additive noise. While Seltzer et al. primarily used white noise for training their Bayesian classifier, we have found that this is not the best type of training signal when noise with greater spectral and/or temporal variation is encountered in the testing environment. The first innovation introduced in this paper, referred to as frequency-dependent classification, involves independent classification in each of the various frequency bands in which the incoming speech is analyzed based on parallel sets of frequency-dependent features. The second innovation, referred to as colored-noise generation using multi-band partitioning, involves the use of masking noises with artificially-introduced spectral and temporal variation in training the Bayesian classifier used to determine which spectro-temporal components of incoming speech are corrupted by noise in unknown testing environments. The third innovation consists of an adaptive method to estimate the a priori values of the mask classifier that determines whether a particular time-frequency segment of the test data should be considered to be reliable or not. It is shown that these innovations provide improved speech recognition accuracy on a small vocabulary test when missing-feature restoration is applied to incoming speech that is corrupted by additive noise of an unknown nature, especially at lower signal-to-noise ratios. (C) 2010 Elsevier B.V. All rights reserved. C1 [Kim, Wooil] Univ Texas Dallas, Dept Elect Engn, Erik Jonsson Sch Engn & Comp Sci, CRSS, Richardson, TX 75080 USA. [Stern, Richard M.] Carnegie Mellon Univ, Dept Elect & Comp Engn, Pittsburgh, PA 15213 USA. [Stern, Richard M.] Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA 15213 USA. RP Kim, W (reprint author), Univ Texas Dallas, Dept Elect Engn, Erik Jonsson Sch Engn & Comp Sci, CRSS, 2601 N Floyd Rd,EC33, Richardson, TX 75080 USA. EM wikim@utdallas.edu FU Korea Science and Engineering Foundation (KOSEF); National Science Foundation [IIS-0420866] FX This work was supported by the Postdoctoral Fellow Program of the Korea Science and Engineering Foundation (KOSEF) and by the National Science Foundation (Grant IIS-0420866). Much of the work described here was previously presented at the Interspeech 2005 (Kim et al., 2005) and IEEE ICASSP 2006 (Kim and Stern, 2006) conferences, although with somewhat different notation. 
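The "colored-noise generation using multi-band partitioning" innovation in the Kim and Stern record could plausibly be pictured as tiling the spectro-temporal plane with random frequency-band by time-segment blocks, each given its own noise level, so that the Bayesian mask classifier is trained on maskers with spectral and temporal variation. The block sizes, level range, and log-magnitude interpretation below are guesses made for illustration only, not the construction or settings used in the paper.

import numpy as np

def multiband_colored_noise(n_bands: int = 20, n_frames: int = 500,
                            max_band_width: int = 5, max_seg_len: int = 40,
                            level_range_db: float = 20.0,
                            seed: int = 0) -> np.ndarray:
    """Training masker with artificial spectral/temporal variation: the
    (band, frame) plane is tiled with random blocks, each assigned its own
    level in dB, then white noise is shaped by those per-cell levels."""
    rng = np.random.default_rng(seed)
    levels = np.zeros((n_bands, n_frames))
    b = 0
    while b < n_bands:
        width = rng.integers(1, max_band_width + 1)
        t = 0
        while t < n_frames:
            seg = rng.integers(5, max_seg_len + 1)
            levels[b:b + width, t:t + seg] = rng.uniform(-level_range_db, 0.0)
            t += seg
        b += width
    return rng.normal(size=(n_bands, n_frames)) * 10.0 ** (levels / 20.0)

if __name__ == "__main__":
    print(multiband_colored_noise().shape)   # (20, 500)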
CR Barker J., 2000, ICSLP 2000, P373 BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 Cooke M, 1997, INT CONF ACOUST SPEE, P863, DOI 10.1109/ICASSP.1997.596072 Cooke M, 2001, SPEECH COMMUN, V34, P267, DOI 10.1016/S0167-6393(00)00034-0 DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 *ETSI, 2000, 201108V112 ETSI ES Harding S, 2005, INT CONF ACOUST SPEE, P537 Hirsch H.G., 2000, ISCA ITRW ASR 2000 Jancovic P., 2003, EUROSP GEN SWITZ, P2161 JOHNSTON JD, 1988, IEEE J SEL AREA COMM, V6, P314, DOI 10.1109/49.608 Josifovski L., 1999, EUROSPEECH 99, P2837 Kim W, 2006, INT CONF ACOUST SPEE, P305 Kim W., 2005, INTERSPEECH 2005, P2637 Lippmann R.P., 1997, EUROSPEECH, P37 Martin R., 1994, EUSIPCO 94, P1182 Moreno PJ, 1996, INT CONF ACOUST SPEE, P733, DOI 10.1109/ICASSP.1996.543225 Park HM, 2009, SPEECH COMMUN, V51, P15, DOI 10.1016/j.specom.2008.05.012 Raj B, 2005, IEEE SIGNAL PROC MAG, V22, P101, DOI 10.1109/MSP.2005.1511828 Raj B., 2005, ASRU 2005, P65 Raj B, 2004, SPEECH COMMUN, V43, P275, DOI 10.1016/j.specom.2004.03.007 Renevey P., 2001, CRAC2001 Seltzer ML, 2004, SPEECH COMMUN, V43, P379, DOI 10.1016/j.specom.2004.03.006 Seltzer M.L., 2000, THESIS CARNEGIE MELL Singh R., 2002, CRC HDB NOISE REDUCT Srinivasan S, 2006, SPEECH COMMUN, V48, P1486, DOI 10.1016/j.specom.2006.09.003 NR 25 TC 7 Z9 8 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2011 VL 53 IS 1 BP 1 EP 11 DI 10.1016/j.specom.2010.08.005 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 699LA UT WOS:000285663700001 ER PT J AU Knoll, M Scharrer, L Costall, A AF Knoll, Monja Scharrer, Lisa Costall, Alan TI "Look at the shark": Evaluation of student- and actress-produced standardised sentences of infant- and foreigner-directed speech SO SPEECH COMMUNICATION LA English DT Article DE IDS; Standardised sentences; Simulated speech; Actresses; Hyperarticulation ID VOCAL EMOTION EXPRESSION; CONVERSATIONAL SPEECH; IRONIC COMMUNICATION; ACOUSTIC PROFILES; CUES; INTELLIGIBILITY; PERCEPTION; INTONATION; LANGUAGE; VOICE AB Standardised sentence production is routinely used in speech research to avoid content variability typical of natural speech production. However, the validity of such standardised material is not well understood. Here, we evaluated the use of standardised sentences by comparing them to two existing, non-standardised datasets of simulated free and natural speech (the latter produced by mothers in real interactions). Standardised sentences and simulated free speech were produced by students and actresses without an interaction partner. Each dataset comprised recordings of infant- (IDS), foreigner- (FDS) and adult-directed (ADS) speech, which were analysed for mean F-0, vowel duration and hyperarticulation. Whilst students' mean F-0 pattern in standardised speech was closer to the natural speech than their previous 'simulated free speech', no difference in vowel hyperarticulation and duration patterns was found for students' standardised sentences between the three speech styles. Actresses' F-0, vowel duration and hyperarticulation patterns in standardised speech were similar to the natural speech, and a part improvement on their 'simulated free speech'. 
These results suggest that successful reproduction of some acoustic measures (e.g., F-0) can be achieved with standardised content regardless of the type of speaker, whereas the production of other acoustic measures (e.g., hyperarticulation) are context- and speaker-dependent. (C) 2010 Elsevier B.V. All rights reserved. C1 [Knoll, Monja] Univ W Scotland, Sch Social Sci, Paisley PA1 2BE, Renfrew, Scotland. [Scharrer, Lisa] Univ Munster, Dept Psychol, D-48149 Munster, Germany. [Knoll, Monja; Costall, Alan] Univ Portsmouth, Dept Psychol, Portsmouth PO1 2DY, Hants, England. RP Knoll, M (reprint author), Univ W Scotland, Sch Social Sci, Paisley PA1 2BE, Renfrew, Scotland. EM monja.knoll@uws.ac.uk; Lisa.Scharrer@uni-muenster.de; alan.costall@port.ac.uk FU Economic and Social Research Council (ESRC) FX We are grateful to two anonymous reviewers for helpful comments that greatly improved this manuscript. Our particular thanks go to David Bauckham and the 'The Bridge Theatre Training Company' for their invaluable help in recruiting the actresses used in this study. This research was supported by a research bursary from the Economic and Social Research Council (ESRC) to Monja Knoll. CR Anolli L, 2002, INT J PSYCHOL, V37, P266, DOI 10.1080/00207590244000106 Anolli L, 2000, J PSYCHOLINGUIST RES, V29, P275, DOI 10.1023/A:1005100221723 Banse R, 1996, J PERS SOC PSYCHOL, V70, P614, DOI 10.1037/0022-3514.70.3.614 Banziger T, 2005, SPEECH COMMUN, V46, P252, DOI 10.1016/j.specom.2005.02.016 Biersack S, 2005, PHONETICA, V62, P106, DOI 10.1159/000090092 Biersack S., 2005, P 9 EUR C SPEECH COM, P2401 Boersma P., 2006, PRAAT DOING PHONETIC Bradlow A.R., 2007, J ACOUSTICAL SOC A 2, V121, P3072 Bradlow AR, 1996, SPEECH COMMUN, V20, P255, DOI 10.1016/S0167-6393(96)00063-5 Bradlow AR, 2003, J SPEECH LANG HEAR R, V46, P80, DOI 10.1044/1092-4388(2003/007) Breitenstein C, 2001, COGNITION EMOTION, V15, P57, DOI 10.1080/0269993004200114 Burnham D., 2002, SCIENCE, V296, P1095 Cheang HS, 2008, SPEECH COMMUN, V50, P366, DOI 10.1016/j.specom.2007.11.003 COHEN A, 1961, AM J PSYCHOL, V74, P90, DOI 10.2307/1419829 Ferguson SH, 2002, J ACOUST SOC AM, V112, P259, DOI 10.1121/1.1482078 FRIEND M, 1994, J ACOUST SOC AM, V96, P1283, DOI 10.1121/1.410276 JAKOBOVITS LA, 1965, PSYCHOL REP, V17, P785 Johnson N. L., 1970, CONTINUOUS UNIVARIAT, V1 Johnson N. 
L., 1970, CONTINUOUS UNIVARIAT, V2 KALIKOW DN, 1977, J ACOUST SOC AM, V61, P1337, DOI 10.1121/1.381436 Knoll M, 2009, SPEECH COMMUN, V51, P296, DOI 10.1016/j.specom.2008.10.001 Knower FH, 1941, J SOC PSYCHOL, V14, P369 KRAMER E, 1964, J ABNORM SOC PSYCH, V68, P390, DOI 10.1037/h0042473 Kuhl PK, 1997, SCIENCE, V277, P684, DOI 10.1126/science.277.5326.684 LADD DR, 1985, J ACOUST SOC AM, V78, P435, DOI 10.1121/1.392466 Laukka P, 2005, COGNITION EMOTION, V19, P633, DOI 10.1080/02699930441000445 LINDBLOM B, 1990, NATO ADV SCI I D-BEH, V55, P403 MILMOE S, 1967, J ABNORM PSYCHOL, V72, P78, DOI 10.1037/h0024219 Oviatt S, 1998, SPEECH COMMUN, V24, P87, DOI 10.1016/S0167-6393(98)00005-3 PAPOUSEK M, 1991, APPL PSYCHOLINGUIST, V12, P481, DOI 10.1017/S0142716400005889 Parkinson B, 1996, BRIT J PSYCHOL, V87, P663 PICHENY MA, 1985, J SPEECH HEAR RES, V28, P96 Rockwell P, 2000, J PSYCHOLINGUIST RES, V29, P483, DOI 10.1023/A:1005120109296 ROGERS PL, 1971, BEHAV RES METH INSTR, V3, P16, DOI 10.3758/BF03208115 ROSS M, 1973, AM ANN DEAF, V118, P37 Schaeffler F., 2006, 3 SPEECH PROS C DRES SCHERER KR, 1986, PSYCHOL BULL, V99, P143, DOI 10.1037//0033-2909.99.2.143 SCHERER KR, 1971, J EXP RES PERS, V5, P155 Scherer KR, 2007, EMOTION, V7, P158, DOI 10.1037/1528-3542.7.1.158 Scherer KR, 2001, J CROSS CULT PSYCHOL, V32, P76, DOI 10.1177/0022022101032001009 SCHERER KR, 1972, J PSYCHOLINGUIST RES, V1, P269, DOI 10.1007/BF01074443 SCHERER KR, 1991, MOTIV EMOTION, V15, P123, DOI 10.1007/BF00995674 Scherer KR, 2003, SPEECH COMMUN, V40, P227, DOI 10.1016/S0167-6393(02)00084-5 Smiljanic R, 2005, J ACOUST SOC AM, V118, P1677, DOI 10.1121/1.2000788 Soskin W.F., 1961, J COMMUN, V11, P73, DOI 10.1111/j.1460-2466.1961.tb00331.x STARKWEATHER JA, 1956, AM J PSYCHOL, V69, P121, DOI 10.2307/1418129 Starkweather J.A., 1967, RES VERBAL BEHAV SOM, P253 Tischer B., 1988, SPRACHE KOGNIT, V7, P205 Trainor LJ, 2000, PSYCHOL SCI, V11, P188, DOI 10.1111/1467-9280.00240 Uther M, 2007, SPEECH COMMUN, V49, P2, DOI 10.1016/j.specom.2006.10.003 Viscovich N, 2003, PERCEPT MOTOR SKILL, V96, P759, DOI 10.2466/PMS.96.3.759-771 WALLBOTT HG, 1986, J PERS SOC PSYCHOL, V51, P690, DOI 10.1037//0022-3514.51.4.690 WILLIAMS CE, 1972, J ACOUST SOC AM, V52, P1238, DOI 10.1121/1.1913238 NR 53 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2011 VL 53 IS 1 BP 12 EP 22 DI 10.1016/j.specom.2010.08.004 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 699LA UT WOS:000285663700002 ER PT J AU Hjalmarsson, A AF Hjalmarsson, Anna TI The additive effect of turn-taking cues in human and synthetic voice SO SPEECH COMMUNICATION LA English DT Article DE Turn-taking; Speech synthesis; Human-like interaction; Conversational interfaces ID CONVERSATION AB A previous line of research suggests that interlocutors identify appropriate places to speak by cues in the behaviour of the preceding speaker. If used in combination, these cues have an additive effect on listeners' turn-taking attempts. The present study further explores these findings by examining the effect of such turn-taking cues experimentally. The objective is to investigate the possibilities of generating turn-taking cues with a synthetic voice. Thus, in addition to stimuli realized with a human voice, the experiment included dialogues where one of the speakers is replaced with a synthesis. 
The turn-taking cues investigated include intonation, phrase-final lengthening, semantic completeness, stereotyped lexical expressions and non-lexical speech production phenomena such as lexical repetitions, breathing and lip-smacks. The results show that the turn-taking cues realized with a synthetic voice affect the judgements similar to the corresponding human version and there is no difference in reaction times between these two conditions. Furthermore, the results support Duncan's findings: the more turn-taking cues with the same pragmatic function, turn-yielding or turn-holding, the higher the agreement among subjects on the expected outcome. In addition, the number of turn-taking cues affects the reaction times for these decisions. Thus, the more cues, the faster the reaction time. (C) 2010 Elsevier B.V. All rights reserved. C1 KTH, SE-10044 Stockholm, Sweden. RP Hjalmarsson, A (reprint author), KTH, Lindstedsvagen 24, SE-10044 Stockholm, Sweden. EM annah@speech.kth.se FU Swedish Research Council [2007-6431, GENDIAL] FX This research was carried out at Centre for Speech Technology, KTH. The research is also supported by the Swedish Research Council Project #2007-6431, GENDIAL. Many thanks to Rolf Carlson, Jens Edlund, Joakim Gustafson, Mattias Heldner, Julia Hirschberg and Gabriel Skantze for help with valuable comments and annotation of data. Many thanks also to the reviewers for valuable comments that helped to improve the paper. CR BEATTIE GW, 1982, NATURE, V300, P744, DOI 10.1038/300744a0 Bruce Gosta, 1977, SWEDISH WORD ACCENTS Butterworth B., 1975, J PSYCHOLINGUIST RES, V4 Campione E., 2002, ESCA WORKSH SPEECH P, P199 Clark H., 2002, SPEECH COMMUN, V36 Cutler Anne, 1986, INTONATION DISCOURSE, P139 De Ruiter JP, 2006, LANGUAGE, V82, P515 DUNCAN S, 1972, J PERS SOC PSYCHOL, V23, P283, DOI 10.1037/h0033031 Duncan Jr S, 1977, FACE TO FACE INTERAC Edlund J, 2005, PHONETICA, V62, P215, DOI 10.1159/000090099 Edlund J, 2008, SPEECH COMMUN, V50, P630, DOI 10.1016/j.specom.2008.04.002 Fernandez R., 2007, P 11 WORKSH SEM PRAG, P25 Ferrer L., 2003, IEEE INT C AC SPEECH Ferrer L., 2002, P INT C SPOK LANG PR, P2061 Ford C. E., 1996, INTERACTION GRAMMAR, P134, DOI 10.1017/CBO9780511620874.003 Gravano A., 2009, THESIS COLUMBIA U Gustafson J, 2008, LECT NOTES ARTIF INT, V5078, P293, DOI 10.1007/978-3-540-69369-7_35 Heldner M, 2010, J PHONETICS, V38, P555, DOI 10.1016/j.wocn.2010.08.002 Hjalmarsson A., 2007, P SIGDIAL ANTW BELG, P132 Hjalmarsson A., 2008, P SIGDIAL 2008 COL O IZDEBSKI K, 1978, J SPEECH HEAR RES, V21, P638 KENDON A, 1967, ACTA PSYCHOL, V26, P22, DOI 10.1016/0001-6918(67)90005-4 Kilger A., 1995, RR9511 GERM RES CTR Koiso H, 1998, LANG SPEECH, V41, P295 Levelt W. J. M., 1989, SPEAKING FROM INTENT LOCAL J, 1986, HUM STUD, V9, P185, DOI 10.1007/BF00148126 LOCAL JK, 1986, J LINGUIST, V22, P411, DOI 10.1017/S0022226700010859 METZ CE, 1978, SEMIN NUCL MED, V8, P283, DOI 10.1016/S0001-2998(78)80014-2 Oliveira Jr M., 2008, SPEECH PROSODY 2008, P485 Reiter E., 1997, NAT LANG ENG, V3, P57, DOI DOI 10.1017/S1351324997001502 SACKS H, 1974, LANGUAGE, V50, P696, DOI 10.2307/412243 SCHAFFER D, 1983, J PHONETICS, V11, P243 Schourup L, 1999, LINGUA, V107, P227, DOI 10.1016/S0024-3841(96)90026-1 Selting M., 1996, PRAGMATICS, V6, P357 Shriberg E. 
E., 1994, THESIS U CALIFORNIA Watanabe M, 2008, SPEECH COMMUN, V50, P81, DOI 10.1016/j.specom.2007.06.002 Weilhammer K., 2003, ICPHS 2003 BARC SPAI WIGHTMAN CW, 1992, J ACOUST SOC AM, V91, P1707, DOI 10.1121/1.402450 NR 38 TC 7 Z9 7 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2011 VL 53 IS 1 BP 23 EP 35 DI 10.1016/j.specom.2010.08.003 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 699LA UT WOS:000285663700003 ER PT J AU Stark, A Paliwal, K AF Stark, Anthony Paliwal, Kuldip TI Use of speech presence uncertainty with MMSE spectral energy estimation for robust automatic speech recognition SO SPEECH COMMUNICATION LA English DT Article DE Robust speech recognition; MMSE estimation; Speech enhancement methods ID AMPLITUDE ESTIMATOR; ENHANCEMENT; NOISE; SUPPRESSION; EPHRAIM AB In this paper, we investigate the use of the minimum mean square error (MMSE) spectral energy estimator for use in environment-robust automatic speech recognition (ASR). In the past, it has been common to use the MMSE log-spectral amplitude estimator for this task. However, this estimator was originally derived under subjective human listening criteria. Therefore its complex suppression rule may not be optimal for use in ASR. On the other hand, it can be shown that the MMSE spectral energy estimator is closely related to the MMSE Mel-frequency cepstral coefficient (MFCC) estimator. Despite this, the spectral energy estimator has tended to suffer from the problem of excessive residual noise. We examine the cause of this residual noise and show that the introduction of a heuristic based speech presence uncertainty (SPU) can significantly improve its performance as a front-end ASR enhancement regime. The proposed spectral energy SPU estimator is evaluated on the Aurora2, RM and OLLO2 speech recognition tasks and can be shown to significantly improve additive noise robustness over the more common spectral amplitude and log-spectral amplitude estimators. (C) 2010 Elsevier B.V. All rights reserved. C1 [Stark, Anthony; Paliwal, Kuldip] Griffith Univ, Signal Proc Lab, Nathan, Qld 4111, Australia. RP Paliwal, K (reprint author), Griffith Univ, Signal Proc Lab, Nathan, Qld 4111, Australia. 
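A sketch of the estimator family discussed in the Stark and Paliwal record: under the usual complex-Gaussian assumptions the MMSE estimate of the clean spectral energy has a closed form in the a priori and a posteriori SNRs, and a speech presence probability can attenuate it in speech-absent regions. The two-state SPU weighting, the noise floor, and the crude a priori SNR estimate below are generic illustrative choices; the paper's heuristic SPU differs in its details.

import numpy as np

def mmse_energy_with_spu(Y: np.ndarray, noise_psd: np.ndarray,
                         xi: np.ndarray, q: float = 0.3) -> np.ndarray:
    """MMSE estimate of the clean spectral energy E[|X|^2 | Y], attenuated by
    a posterior speech presence probability. Y: noisy STFT coefficients for
    one frame, noise_psd: noise power estimate, xi: a priori SNR, q: prior
    probability of speech absence."""
    gamma = np.abs(Y) ** 2 / noise_psd            # a posteriori SNR
    wiener = xi / (1.0 + xi)
    energy = wiener ** 2 * np.abs(Y) ** 2 + wiener * noise_psd
    nu = np.minimum(wiener * gamma, 50.0)         # clipped to avoid overflow
    lam = ((1.0 - q) / q) * np.exp(nu) / (1.0 + xi)
    p_speech = lam / (1.0 + lam)                  # posterior presence probability
    floor = 1e-3 * noise_psd                      # keep a small spectral floor
    return p_speech * energy + (1.0 - p_speech) * floor

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Y = rng.normal(size=257) + 1j * rng.normal(size=257)    # one noisy frame
    noise_psd = np.ones(257)
    xi = np.clip(np.abs(Y) ** 2 / noise_psd - 1.0, 1e-3, None)  # crude estimate
    print(mmse_energy_with_spu(Y, noise_psd, xi)[:5])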
EM a.stark@griffith.edu.au; k.paliwal@griffith.edu.au CR Berouti M, 1979, IEEE INT C AC SPEECH, V4, P208, DOI 10.1109/ICASSP.1979.1170788 BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 Cappe O, 1994, IEEE T SPEECH AUDI P, V2, P345, DOI 10.1109/89.279283 CROCHIERE RE, 1980, IEEE T ACOUST SPEECH, V28, P99, DOI 10.1109/TASSP.1980.1163353 EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550 EPHRAIM Y, 1995, IEEE T SPEECH AUDI P, V3, P251, DOI 10.1109/89.397090 EPHRAIM Y, 1991, IEEE T SIGNAL PROCES, V39, P795 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 FUJIMOTO M, 2000, IEEE INT C AC SPEECH, V3, P1727 GALES L, 1995, THESIS U CAMBRIDGE U Gemello R, 2006, IEEE SIGNAL PROC LET, V13, P56, DOI 10.1109/LSP.2005.860535 HERMUS K, 2007, EURASIP J APPL SIG P, P195 Huang X., 2001, SPOKEN LANGUAGE PROC Lathoud G., 2005, P 2005 IEEE ASRU WOR, P343 Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst Malah D, 1999, INT CONF ACOUST SPEE, P789 MCAULAY RJ, 1980, IEEE T ACOUST SPEECH, V28, P137, DOI 10.1109/TASSP.1980.1163394 PALIWAL K, 1987, IEEE INT C AC SPEECH, P297 Pearce D., 2000, ISCA ITRW ASR2000, P29 Price P., 1988, ICASSP 88: 1988 International Conference on Acoustics, Speech, and Signal Processing (Cat. No.88CH2561-9), DOI 10.1109/ICASSP.1988.196669 Wesker T., 2005, P INT, P1273 Wiener N., 1949, EXTRAPOLATION INTERP Young S., 2000, HTK BOOK VERSION 3 0 NR 23 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2011 VL 53 IS 1 BP 51 EP 61 DI 10.1016/j.specom.2010.08.001 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 699LA UT WOS:000285663700005 ER PT J AU Raab, M Gruhn, R Noth, E AF Raab, Martin Gruhn, Rainer Noeth, Elmar TI A scalable architecture for multilingual speech recognition on embedded devices SO SPEECH COMMUNICATION LA English DT Article DE Multilingual speech recognition; Non-native speech; Projections between Gaussian spaces; Gaussian Mixture Model distances ID MIXTURE; MODELS AB In-car infotainment and navigation devices are typical examples where speech based interfaces are successfully applied. While classical applications are monolingual, such as voice commands or monolingual destination input, the trend goes towards multilingual applications. Examples are music player control or multilingual destination input. As soon as more languages are considered the training and decoding complexity of the speech recognizer increases. For large multilingual systems, some kind of parameter tying is needed to keep the decoding task feasible on embedded systems with limited resources. A traditional technique for this is to use a semi-continuous Hidden Markov Model as the acoustic model. The monolingual codebook on which such a system relies is not appropriate for multilingual recognition. We introduce Multilingual Weighted Codebooks that give good results with low decoding complexity. These codebooks depend on the actual language combination and increase the training complexity. Therefore an algorithm is needed that can reduce the training complexity. Our first proposal are mathematically motivated projections between Hidden Markov Models defined in Gaussian spaces. Although theoretically optimal, these projections were difficult to employ directly in speech decoders. 
We found approximated projections to be most effective for practical application, giving good performance without requiring major modifications to the common speech recognizer architecture. With a combination of the Multilingual Weighted Codebooks and Gaussian Mixture Model projections we create an efficient and scalable architecture for non-native speech recognition. Our new architecture offers a solution to the combinatoric problems of training and decoding for multiple languages. It builds new multilingual systems in only 0.002% of the time of a traditional HMM training, and achieves comparable performance on foreign languages. (C) 2010 Elsevier B.V. All rights reserved. C1 [Raab, Martin; Gruhn, Rainer] Harman Becker Automot Syst, Speech Dialog Syst, Ulm, Germany. [Raab, Martin; Noeth, Elmar] Univ Erlangen Nurnberg, Chair Pattern Recognit, Erlangen, Germany. RP Raab, M (reprint author), Harman Becker Automot Syst, Speech Dialog Syst, Ulm, Germany. EM martin.raab@informatik.uni-erlangen.de; rainer.gruhn@alumni.uni-ulm.de CR Bartkova K, 2006, INT CONF ACOUST SPEE, P1037 Biehl M., 1990, P ASI SUMM WORKSH NE BOUSELMI G, 2007, P INT ANTW BELG AUG, P1449 Boyd S., 2004, CONVEX OPTIMIZATION Dalsgaard P., 1998, P ICSLP SYDN AUSTR, P2623 Fuegen C., 2003, P ASRU, P441 Goronzy S., 2001, P EUROSPEECH 2001, P309 Harbeck S., 1998, P WORKSH TEXT SPEECH, P375 Hershey JR, 2007, INT CONF ACOUST SPEE, P317 Huang X.D., 1990, P ICASSP, P689 Iskra D., 2002, P LREC, P329 Jensen J.H., 2007, P INT C MUS INF RETR, P107 Jian B, 2005, IEEE I CONF COMP VIS, P1246 JUANG BH, 1985, AT&T TECH J, V64, P391 Koch W., 2004, THESIS U KIEL KIEL Koehler J., 2001, SPEECH COMMUN, V35, P21 Kuhn H.W., 1951, P 2 BERK S MATH STAT, P481 LADEFOGED P, 1990, LANGUAGE, V66, P550, DOI 10.2307/414611 Lang H., 2009, THESIS U ULM ULM Lieb E.H., 2001, ANALYSIS Menzel W., 2000, P LREC ATH GREEC, P957 Niesler T., 2006, P ITRW STELL S AFR Noord G., 2009, TEXTCAT Noth E., 1999, NATO ASI SERIES F, P363 Park J., 2004, P INT C SPOK LANG PR, P693 Petersen KB, 2008, MATRIX COOKBOOK Platt J.C., 1988, CONSTRAINED DIFFEREN Raab M., 2009, P DAGA ROTT NETH, P411 Raab M, 2008, INT CONF ACOUST SPEE, P4257, DOI 10.1109/ICASSP.2008.4518595 RAAB M, 2007, P ASRU KYOT JAP, P413 RAAB M, 2008, P TSD BRNO CZECH REP, P485 SCHADEN S, 2006, THESIS U DUISBURG ES Schultz T, 2001, SPEECH COMMUN, V35, P31, DOI 10.1016/S0167-6393(00)00094-7 Segura J., 2007, HIWIRE DATABASE NOIS Tan TP, 2007, INT CONF ACOUST SPEE, P1009 Tomokiyo L.M., 2001, P MSLP AALB DENM, P39 Uebler U, 2001, SPEECH COMMUN, V35, P53, DOI 10.1016/S0167-6393(00)00095-9 UEDA Y, 1990, P INT C SPOKEN LANGU, P1209 Wang Z., 2002, P ICMI OCT, P247 Weng F., 1997, P EUR RHOD, P359 Witt S., 1999, THESIS CAMBRIDGE U C Zhang F., 2005, SCHUR COMPLEMENT ITS NR 42 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
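The "approximated projections" mentioned in the Raab abstract suggest a simple picture: each Gaussian of a language-dependent acoustic model is tied to the closest Gaussian in the shared multilingual codebook, so that decoding only ever evaluates the codebook. The sketch below uses a KL-divergence nearest-neighbour mapping between diagonal Gaussians as the closeness criterion; the criterion, dimensions and random parameters are assumptions, not the projection actually derived in the paper.

import numpy as np

def kl_diag_gauss(mu_p, var_p, mu_q, var_q):
    """KL divergence KL(p || q) between two diagonal-covariance Gaussians."""
    return 0.5 * np.sum(np.log(var_q / var_p)
                        + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)

def project_to_codebook(state_means, state_vars, cb_means, cb_vars):
    """Map each state Gaussian of a (foreign-language) model to its closest
    codebook Gaussian; returns one codebook index per state Gaussian."""
    idx = []
    for mu, var in zip(state_means, state_vars):
        d = [kl_diag_gauss(mu, var, m, v) for m, v in zip(cb_means, cb_vars)]
        idx.append(int(np.argmin(d)))
    return np.array(idx)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cb_means, cb_vars = rng.normal(size=(256, 13)), rng.uniform(0.5, 2.0, (256, 13))
    st_means, st_vars = rng.normal(size=(3, 13)), rng.uniform(0.5, 2.0, (3, 13))
    print(project_to_codebook(st_means, st_vars, cb_means, cb_vars))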
PD JAN PY 2011 VL 53 IS 1 BP 62 EP 74 DI 10.1016/j.specom.2010.07.007 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 699LA UT WOS:000285663700006 ER PT J AU Loots, L Niesler, T AF Loots, Linsen Niesler, Thomas TI Automatic conversion between pronunciations of different English accents SO SPEECH COMMUNICATION LA English DT Article DE English accents; Pronunciation modelling; G2P; P2P; GP2P; Decision trees; South African English AB We describe the application of decision trees to the automatic conversion of pronunciations between American, British and South African English accents. The resulting phoneme-to-phoneme (P2P) conversion technique derives the pronunciation of a word in a new target accent by taking advantage of its existing available pronunciation in a different source accent. We find that it is substantially more accurate to derive pronunciations in this way than directly from the orthography and available target accent pronunciations using more conventional grapheme-to-phoneme (G2P) conversion. Furthermore, by including both the graphemes and the phonemes of the source accent, grapheme-and-phoneme-to-phoneme (GP2P) conversion delivers additional increases in accuracy in relation to P2P. These findings are particularly important for less-resourced varieties of English, for which extensive manually-prepared pronunciation dictionaries are not available. By means of the P2P and GP2P approaches, the pronunciations of new words can be obtained with better accuracy than is possible using G2P methods. (C) 2010 Elsevier B.V. All rights reserved. C1 [Loots, Linsen; Niesler, Thomas] Univ Stellenbosch, Dept Elect & Elect Engn, ZA-7600 Stellenbosch, South Africa. RP Niesler, T (reprint author), Univ Stellenbosch, Dept Elect & Elect Engn, ZA-7600 Stellenbosch, South Africa. EM linsen@sun.ac.za; trn@sun.ac.za FU South African National Research Foundation (NRF) [FA2007022300015, TTK2007041000010] FX This material is based on work supported financially by the South African National Research Foundation (NRF) under Grants FA2007022300015 and TTK2007041000010, and was executed using the High-Performance Computing (HPC) facility at Stellenbosch University. 
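The GP2P idea in the Loots and Niesler record, predicting a target-accent phoneme from the grapheme plus source-accent phonemic context with decision trees, can be sketched with scikit-learn. The toy training pairs, the context window, the deletion symbol "-", and the ordinal encoding are all invented for the example; real systems first align the source and target dictionaries and use richer contexts.

import numpy as np
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Toy GP2P training pairs: (grapheme, source-accent phone, left phone, right
# phone) mapped to a hypothetical target-accent phone ("-" = deletion).
X_sym = [
    ["a", "ae", "k", "t"],
    ["a", "ae", "d", "n"],
    ["r", "r",  "aa", "#"],
    ["r", "r",  "ao", "#"],
    ["o", "ow", "g", "#"],
]
y = ["aa", "aa", "-", "-", "ou"]

enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
X = enc.fit_transform(X_sym)
tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Predict the target-accent phone for an unseen grapheme/phoneme context.
test = enc.transform([["a", "ae", "p", "s"]])
print(tree.predict(test))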
CR Andersen O., 1996, P ICSLP PHIL US BEEP, 2009, BRIT EXAMPLE PRONUNC Bekker I., 2009, THESIS NW U POTCHEFS Bisani M., 2004, P ICASSP MONTR CAN Black A., 1998, P ESCA SPEECH SYNTH Bowerman Sean, 2004, HDB VARIETIES ENGLIS, VI, P931 CMU, 2009, CARN MELL U PRON DIC Daelemans W., 1994, PROGR SPEECH SYNTHES, P77 Daelemans W, 1999, MACH LEARN, V34, P11, DOI 10.1023/A:1007585615670 Damper R., 2004, P 5 ISCA SPEECH SYNT Damper RI, 1999, COMPUT SPEECH LANG, V13, P155, DOI 10.1006/csla.1998.0117 Demberg V., 2007, P ACL PRAG CZECH REP Dines J., 2008, P SPOK LANG TECHN WO Han K.-S., 2004, P ICSLP JEJ KOR Humphries J., 2001, P ICASSP SALT LAK CI Kienappel A.K., 2001, P EUR AALB DENM LDC, 2009, COMLEX ENGL PRON LEX Loots L., 2009, P INT BRIGHT UK Marchand Y, 2000, COMPUT LINGUIST, V26, P195, DOI 10.1162/089120100561674 Niesler T., 2005, SO AFRICAN LINGUISTI, V23, P459, DOI 10.2989/16073610509486401 Olshen R., 1984, CLASSIFICATION REGRE, V1st Pagel V., 1998, P ICSLP SYDN AUSTR Rabiner L, 1993, FUNDAMENTALS SPEECH Suontausta J., 2000, P ICSLP BEIJ CHIN Taylor P., 2005, P INT LISB PORT Torkkola K., 1993, P ICASSP MINN US van den Heuvel H., 2007, P INT ANTW BELG Webster G., 2008, P INT BRISB AUSTR Wells John, 1982, ACCENTS ENGLISH NR 29 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2011 VL 53 IS 1 BP 75 EP 84 DI 10.1016/j.specom.2010.07.006 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 699LA UT WOS:000285663700007 ER PT J AU Lazaridis, A Mporas, I Ganchev, T Kokkinakis, G Fakotakis, N AF Lazaridis, Alexandros Mporas, Iosif Ganchev, Todor Kokkinakis, George Fakotakis, Nikos TI Improving phone duration modelling using support vector regression fusion SO SPEECH COMMUNICATION LA English DT Article DE Duration modelling; Parallel fusion scheme; Phone duration prediction; Support vector regression; Text-to-speech synthesis ID TEXT-TO-SPEECH; SEGMENTAL DURATION; NETWORKS; DATABASE; APPROXIMATION; RECOGNITION AB In the present work, we propose a scheme for the fusion of different phone duration models, operating in parallel. Specifically, the predictions from a group of dissimilar and independent to each other individual duration models are fed to a machine learning algorithm, which reconciles and fuses the outputs of the individual models, yielding more precise phone duration predictions. The performance of the individual duration models and of the proposed fusion scheme is evaluated on the American-English KED TIMIT and on the Greek WCL-1 databases. On both databases, the SVR-based individual model demonstrates the lowest error rate. When compared to the second-best individual algorithm, a relative reduction of the mean absolute error (MAE) and the root mean square error (RMSE) by 5.5% and 3.7% on KED TIMIT, and 6.8% and 3.7% on WCL-1 is achieved. At the fusion stage, we evaluate the performance of 12 fusion techniques. The proposed fusion scheme, when implemented with SVR-based fusion, contributes to the improvement of the phone duration prediction accuracy over the one of the best individual model, by 1.9% and 2.0% in terms of relative reduction of the MAE and RMSE on KED TIMIT, and by 2.6% and 1.8% on the WCL-1 database. (C) 2010 Elsevier B.V. All rights reserved. 
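The parallel fusion scheme described in the Lazaridis record (independent phone duration models whose predictions are reconciled by a support vector regressor) maps naturally onto a stacking setup. The sketch below uses synthetic phone features and durations and an arbitrary choice of base regressors; only the SVR-as-fuser structure mirrors the abstract.

import numpy as np
from sklearn.ensemble import StackingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

rng = np.random.default_rng(0)
# Stand-in phone features (identity, stress, position, etc. encoded
# numerically) and phone durations in milliseconds.
X = rng.normal(size=(2000, 12))
y = 80 + 25 * X[:, 0] - 10 * X[:, 1] + rng.normal(0, 15, 2000)

# Dissimilar base duration models fused by an SVR meta-learner.
fusion = StackingRegressor(
    estimators=[("lr", LinearRegression()),
                ("knn", KNeighborsRegressor(n_neighbors=10)),
                ("rf", RandomForestRegressor(n_estimators=100, random_state=0))],
    final_estimator=SVR(kernel="rbf", C=10.0),
)
fusion.fit(X[:1500], y[:1500])
pred = fusion.predict(X[1500:])
print(f"MAE of fused duration model: {np.mean(np.abs(pred - y[1500:])):.1f} ms")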
C1 [Lazaridis, Alexandros; Mporas, Iosif; Ganchev, Todor; Kokkinakis, George; Fakotakis, Nikos] Univ Patras, Dept Elect & Comp Engn, Wire Commun Lab, Artificial Intelligence Grp, Rion 26500, Greece. RP Lazaridis, A (reprint author), Univ Patras, Dept Elect & Comp Engn, Wire Commun Lab, Artificial Intelligence Grp, Rion 26500, Greece. EM alaza@upatras.gr; imporas@upatras.gr; tganchev@ieee.org; gkokkin@wel.ee.upatras.gr; fakotaki@upatras.gr CR AHA DW, 1991, MACH LEARN, V6, P37, DOI 10.1007/BF00153759 AKAIKE H, 1974, IEEE T AUTOMAT CONTR, VAC19, P716, DOI 10.1109/TAC.1974.1100705 Allen J., 1987, TEXT SPEECH MITALK S BARTKOVA K, 1987, SPEECH COMMUN, V6, P245, DOI 10.1016/0167-6393(87)90029-X Beckman M. E., 1994, GUIDELINES TOBI LABE Bellegarda JR, 2001, IEEE T SPEECH AUDI P, V9, P52, DOI 10.1109/89.890071 Bourlard H, 1996, SPEECH COMMUN, V18, P205, DOI 10.1016/0167-6393(96)00003-9 Breiman L, 1996, MACH LEARN, V24, P123, DOI 10.1023/A:1018054314350 Campbell W. N., 1992, TALKING MACHINES THE, P211 CARLSON R, 1986, PHONETICA, V43, P140 Chen SH, 2003, IEEE T SPEECH AUDI P, V11, P308, DOI 10.1109/TSA.2003.814377 Chen SH, 1998, IEEE T SPEECH AUDI P, V6, P226 CRYSTAL TH, 1988, J ACOUST SOC AM, V83, P1553, DOI 10.1121/1.395911 *CSTR, 2001, CSTR US KED TIMIT Dutoit T., 1997, INTRO TEXT TO SPEECH EDWARDS J, 1988, PHONETICA, V45, P156 Epitropakis G., 1993, P EUROSPEECH 93 BERL, P1995 Ferrer L., 2003, P EUR GEN, P2017 Freedman D. A., 2007, STATISTICS Friedman JH, 2002, COMPUT STAT DATA AN, V38, P367, DOI 10.1016/S0167-9473(01)00065-2 Friedman JH, 2001, ANN STAT, V29, P1189, DOI 10.1214/aos/1013203451 Furui S., 2000, DIGITAL SPEECH PROCE Goubanova O, 2008, SPEECH COMMUN, V50, P301, DOI 10.1016/j.specom.2007.10.002 Goubanova O., 2000, P ICSLP 2000 BEIJ CH, P427 Huang X., 2001, SPOKEN LANGUAGE PROC Iwahashi N, 2000, IEICE T INF SYST, VE83D, P1550 Jennequin N., 2007, P ICASSP 2007 HON HA, P641 Kaariainen M, 2004, J MACH LEARN RES, V5, P1107 Klatt D. H., 1976, J ACOUST SOC AM, V59, P1209 KLATT DH, 1987, J ACOUST SOC AM, V82, P737, DOI 10.1121/1.395275 Kohavi R, 1997, ARTIF INTELL, V97, P273, DOI 10.1016/S0004-3702(97)00043-X Kohler K.J., 1988, DIGITALE SPRACHVERAR, P165 KOMINEK J, 2004, P 8 INT C SPOK LANG, P1385 KOMINEK J, 2003, CMULTI03177 SCH COMP Laver J, 1980, PHONETIC DESCRIPTION Laver John, 1994, PRINCIPLES PHONETICS Lazaridis A, 2007, PROC INT C TOOLS ART, P518, DOI 10.1109/ICTAI.2007.33 Lee S., 1999, P OR COCOSDA 99, P109 Levinson S, 1986, P IEEE ICASSP, P1241 MITCHELL CD, 1995, DIGIT SIGNAL PROCESS, V5, P43, DOI 10.1006/dspr.1995.1004 MONKOWSKI MD, 1995, INT CONF ACOUST SPEE, P528, DOI 10.1109/ICASSP.1995.479645 Olive J.P., 1985, J ACOUST SOC AM S1, V78, pS6, DOI 10.1121/1.2022951 PARK J, 1993, NEURAL COMPUT, V5, P305, DOI 10.1162/neco.1993.5.2.305 Platt JC, 1999, ADVANCES IN KERNEL METHODS, P185 Pols LCW, 1996, SPEECH COMMUN, V19, P161, DOI 10.1016/0167-6393(96)00033-7 Quinlan R. J., 1992, P 5 AUSTR JOINT C AR, P343 Rao KS, 2007, COMPUT SPEECH LANG, V21, P282, DOI 10.1016/j.csl.2006.06.003 Rao KS, 2005, 2005 INTERNATIONAL CONFERENCE ON INTELLIGENT SENSING AND INFORMATION PROCESSING, PROCEEDINGS, P258 Riley M., 1992, TALKING MACHINES THE, P265 Scholkopf B., 2002, LEARNING KERNELS Shih C., 1997, PROGR SPEECH SYNTHES, P383 Silverman K., 1992, P INT C SPOK LANG PR, P867 Simoes A.R.M., 1990, P WORKSH SPEECH SYNT, P173 Smola A.J., 1998, 1998030 NEUROCOLT TR TAKEDA K, 1989, J ACOUST SOC AM, V86, P2081, DOI 10.1121/1.398467 Van Santen J. P. 
H., 1990, Computer Speech and Language, V4, DOI 10.1016/0885-2308(90)90016-Y VANSANTEN JPH, 1994, COMPUT SPEECH LANG, V8, P95, DOI 10.1006/csla.1994.1005 VANSANTEN JPH, 1992, SPEECH COMMUN, V11, P513, DOI 10.1016/0167-6393(92)90027-5 Vapnik V., 1995, NATURE STAT LEARNING Vapnik V, 1998, STAT LEARNING THEORY Vilalta R., 2002, ARTIF INTELL REV, V18, p[2, 77] Wang L., 2004, P ICASSP 2004, P641 Wang Y, 1997, 9 EUR C MACH LEARN P, P128 WILCOXON F, 1945, BIOMETRICS BULL, V1, P80, DOI 10.2307/3001968 Witten H., 1999, DATA MINING PRACTICA Yallop C., 1995, INTRO PHONETICS PHON Yamagishi J, 2008, SPEECH COMMUN, V50, P405, DOI 10.1016/j.specom.2007.12.003 Yiourgalis N, 1996, SPEECH COMMUN, V19, P21, DOI 10.1016/0167-6393(96)00012-X Zervas P, 2008, J QUANT LINGUIST, V15, P154, DOI 10.1080/09296170801961827 NR 69 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2011 VL 53 IS 1 BP 85 EP 97 DI 10.1016/j.specom.2010.07.005 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 699LA UT WOS:000285663700008 ER PT J AU Ghosh, PK Narayanan, SS AF Ghosh, Prasanta Kumar Narayanan, Shrikanth S. TI Joint source-filter optimization for robust glottal source estimation in the presence of shimmer and jitter SO SPEECH COMMUNICATION LA English DT Article DE Glottal flow derivative; Shimmer; Jitter; Glottal source estimation ID SPEECH; MODEL; FLOW; WAVE; QUALITY AB We propose a glottal source estimation method robust to shimmer and jitter in the glottal flow. The proposed estimation method is based on a joint source-filter optimization technique. The glottal source is modeled by the Liljencrants-Fant (LF) model and the vocal-tract filter is modeled by an auto-regressive filter, which is common in the source-filter approach to speech production. The optimization estimates the parameters of the LF model, the amplitudes of the glottal flow in each pitch period, and the vocal-tract filter coefficients so that the speech production model best describes the observed speech samples. Experiments with synthetic and real speech data show that the proposed estimation method is robust to different phonation types with varying shimmer and jitter characteristics. (C) 2010 Elsevier B.V. All rights reserved. C1 [Ghosh, Prasanta Kumar; Narayanan, Shrikanth S.] Univ So Calif, Dept Elect Engn, Signal Anal & Interpretat Lab, Los Angeles, CA 90089 USA. RP Ghosh, PK (reprint author), Univ So Calif, Dept Elect Engn, Signal Anal & Interpretat Lab, Los Angeles, CA 90089 USA. EM prasantg@usc.edu RI Narayanan, Shrikanth/D-5676-2012 FU National Science Foundation (NSF); US Army FX This research work was supported by the National Science Foundation (NSF) and the US Army. CR Airas M, 2006, PHONETICA, V63, P26, DOI 10.1159/000091405 Airas M, 2008, LOGOP PHONIATR VOCO, V33, P49, DOI 10.1080/14015430701855333 ALKU P, 1992, SPEECH COMMUN, V11, P109, DOI 10.1016/0167-6393(92)90005-R Baken R. J., 1999, CLIN MEASUREMENT SPE Carre R., 1981, 4 FASE S APR 1981 VE Childers D.G., 2000, SPEECH PROCESSING SY DING W, 1995, IEICE T INF SYST, VE78D, P738 Drugman T., 2009, P INTERSPEECH, P2891 Fant G., 1985, STL QPSR 4 85 R I TE Flanagan J. 
L., 1965, SPEECH ANAL SYNTHESI Frohlich M, 2001, J ACOUST SOC AM, V110, P479, DOI 10.1121/1.1379076 Fu Q, 2006, IEEE T AUDIO SPEECH, V14, P492, DOI 10.1109/TSA.2005.857807 HALL MG, 1983, SIGNAL PROCESS, V5, P267, DOI 10.1016/0165-1684(83)90074-9 Hess W., 1983, PITCH DETERMINATION KLATT DH, 1990, J ACOUST SOC AM, V87, P820, DOI 10.1121/1.398894 KRISHNAMURTHY AK, 1992, IEEE T SIGNAL PROCES, V40, P682, DOI 10.1109/78.120812 KRISHNAMURTHY AK, 1986, IEEE T ACOUST SPEECH, V34, P730, DOI 10.1109/TASSP.1986.1164909 Markel JD, 1976, LINEAR PREDICTION SP MILLER RL, 1959, J ACOUST SOC AM, V31, P667, DOI 10.1121/1.1907771 Moore E., 2003, P 25 ANN C ENG MED B, V3, P2849 Plumpe MD, 1999, IEEE T SPEECH AUDI P, V7, P569, DOI 10.1109/89.784109 Quatieri T. F., 2001, DISCRETE TIME SPEECH Rabiner L. R., 2010, THEORY APPL DIGITAL ROSENBER.AE, 1971, J ACOUST SOC AM, V49, P583, DOI 10.1121/1.1912389 Saratxaga I., 2009, P INT BRIGHT UK, P1075 Shimamura T, 2001, IEEE T SPEECH AUDI P, V9, P727, DOI 10.1109/89.952490 Strik H, 1998, J ACOUST SOC AM, V103, P2659, DOI 10.1121/1.422786 Veldhuis R, 1998, J ACOUST SOC AM, V103, P566, DOI 10.1121/1.421103 WONG DY, 1979, IEEE T ACOUST SPEECH, V27, P350, DOI 10.1109/TASSP.1979.1163260 Yoshiyuki H., 1982, J SPEECH HEAR RES, V25, P12 NR 30 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2011 VL 53 IS 1 BP 98 EP 109 DI 10.1016/j.specom.2010.07.004 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 699LA UT WOS:000285663700009 ER PT J AU Apsingekar, VR De Leon, PL AF Apsingekar, Vijendra Raj De Leon, Phillip L. TI Speaker verification score normalization using speaker model clusters SO SPEECH COMMUNICATION LA English DT Article DE Speaker verification; Score normalization ID IDENTIFICATION AB Among the various proposed score normalizations, T- and Z-norm are most widely used in speaker verification systems. The main idea in these normalizations is to reduce the variations in impostor scores in order to improve accuracy. These normalizations require selection of a set of cohort models or utterances in order to estimate the impostor score distribution. In this paper we investigate basing this selection on recently-proposed speaker model clusters (SMCs). We evaluate this approach using the NTIMIT and NIST-2002 corpora and compare against T- and Z-norm which use other cohort selection methods. We also propose three new normalization techniques, Delta-, Delta T- and TC-norm, which also use SMCs to estimate the normalization parameters. Our results show that we can lower the equal error rate and minimum decision cost function with fewer cohort models using SMC-based score normalization approaches. (C) 2010 Elsevier B.V. All rights reserved. C1 [Apsingekar, Vijendra Raj; De Leon, Phillip L.] New Mexico State Univ, Klipsch Sch Elect & Comp Engn, Las Cruces, NM 88003 USA. RP De Leon, PL (reprint author), New Mexico State Univ, Klipsch Sch Elect & Comp Engn, Las Cruces, NM 88003 USA. EM vijendra@nmsu.edu; pde-leon@nmsu.edu RI De Leon, Phillip/N-8884-2014 OI De Leon, Phillip/0000-0002-7665-9632 CR Apsingekar VR, 2009, IEEE T AUDIO SPEECH, V17, P848, DOI 10.1109/TASL.2008.2010882 Auckenthaler R, 2000, DIGIT SIGNAL PROCESS, V10, P42, DOI 10.1006/dspr.1999.0360 Bimbot F, 2004, EURASIP J APPL SIG P, V2004, P430, DOI 10.1155/S1110865704310024 Judge GG, 1988, THEORY PRACTICE ECON Li K. 
P., 1998, P IEEE INT C AC SPEE, V1, P595 Longworth C, 2009, IEEE T AUDIO SPEECH, V17, P748, DOI 10.1109/TASL.2008.2012193 RAMASWAMY GN, 2003, P ICASSP, V2, P61 Ramos-Castro D, 2007, PATTERN RECOGN LETT, V28, P90, DOI 10.1016/j.patrec.2006.06.008 Ravulakollu K, 2008, 42ND ANNUAL 2008 IEEE INTERNATIONAL CARNAHAN CONFERENCE ON SECURITY TECHNOLOGY, PROCEEDINGS, P56, DOI 10.1109/CCST.2008.4751277 REYNOLDS DA, 1995, SPEECH COMMUN, V17, P91, DOI 10.1016/0167-6393(95)00009-D Reynolds D. A., 1995, Lincoln Laboratory Journal, V8 REYNOLDS DA, 2003, ACOUST SPEECH SIG PR, P53 Reynolds DA, 2000, DIGIT SIGNAL PROCESS, V10, P19, DOI 10.1006/dspr.1999.0361 Sturim D. E., 2005, P IEEE INT C AC SPEE, P741, DOI 10.1109/ICASSP.2005.1415220 van Leeuwen David A., 2005, INTERSPEECH 1981 198, P1981 Zhang SX, 2009, INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, VOLS 1-5, P1275 NR 16 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2011 VL 53 IS 1 BP 110 EP 118 DI 10.1016/j.specom.2010.07.001 PG 9 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 699LA UT WOS:000285663700010 ER PT J AU Mak, MW Rao, W AF Mak, Man-Wai Rao, Wei TI Utterance partitioning with acoustic vector resampling for GMM-SVM speaker verification SO SPEECH COMMUNICATION LA English DT Article DE Speaker verification; GMM-supervectors (GSV); Utterance partitioning; GMM-SVM; Support vector machine; Random resampling; Data imbalance ID RECOGNITION; MACHINES; CLASSIFICATION; ENSEMBLE AB Recent research has demonstrated the merit of combining Gaussian mixture models and support vector machine (SVM) for text-independent speaker verification. However, one unaddressed issue in this GMM-SVM approach is the imbalance between the numbers of speaker-class utterances and impostor-class utterances available for training a speaker-dependent SVM. This paper proposes a resampling technique - namely utterance partitioning with acoustic vector resampling (UP-AVR) - to mitigate the data imbalance problem. Briefly, the sequence order of acoustic vectors in an enrollment utterance is first randomized, which is followed by partitioning the randomized sequence into a number of segments. Each of these segments is then used to produce a GM M supervector via MAP adaptation and mean vector concatenation. The randomization and partitioning processes are repeated several times to produce a sufficient number of speaker-class supervectors for training an SVM. Experimental evaluations based on the NIST 2002 and 2004 SRE suggest that UP-AVR can reduce the error rate of GMM-SVM systems. (C) 2010 Elsevier B.V. All rights reserved. C1 [Mak, Man-Wai; Rao, Wei] Hong Kong Polytech Univ, Elect & Informat Engn Dept, Ctr Signal Proc, Hong Kong, Hong Kong, Peoples R China. RP Mak, MW (reprint author), Hong Kong Polytech Univ, Elect & Informat Engn Dept, Ctr Signal Proc, Hong Kong, Hong Kong, Peoples R China. EM enmwmak@polyu.edu.hk FU Center for Signal Processing, The Hong Polytechnic University [1-BB9W]; Research Grant Council of The Hong Kong SAR [PolyU 5264/09E] FX This work was in part supported by Center for Signal Processing, The Hong Polytechnic University (1-BB9W) and Research Grant Council of The Hong Kong SAR (PolyU 5264/09E). 
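The UP-AVR procedure is described concretely enough in the Mak and Rao abstract to sketch directly: shuffle the acoustic-vector order, split the shuffled sequence into segments, and repeat, so that one enrollment utterance yields many speaker-class examples. Segment counts, rounds and the MFCC stand-in below are illustrative; each returned segment would still need MAP adaptation and mean concatenation (as in the abstract) to become a supervector.

import numpy as np

def up_avr_partitions(frames: np.ndarray, n_parts: int = 4,
                      n_rounds: int = 3, seed: int = 0):
    """Utterance partitioning with acoustic vector resampling: randomise the
    frame order, split into n_parts segments, repeat n_rounds times, giving
    n_rounds * n_parts positive examples for the speaker's SVM."""
    rng = np.random.default_rng(seed)
    segments = []
    for _ in range(n_rounds):
        order = rng.permutation(len(frames))
        segments.extend(np.array_split(frames[order], n_parts))
    return segments

if __name__ == "__main__":
    mfcc = np.random.default_rng(1).normal(size=(600, 39))  # stand-in MFCCs
    segs = up_avr_partitions(mfcc)
    print(len(segs), segs[0].shape)   # 12 segments of roughly 150 frames each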
CR ATAL BS, 1974, J ACOUST SOC AM, V55, P1304, DOI 10.1121/1.1914702 Auckenthaler R, 2000, DIGIT SIGNAL PROCESS, V10, P42, DOI 10.1006/dspr.1999.0360 Bar-Yosef Y., 2009, INTERSPEECH, P1271 BoIle R., 2004, GUIDE BIOMETRICS Bolle R. M., 1999, P AUTOID 99, P9 Bonastre J. F., 2005, ICASSP, V1, P737 Campbell W. M., 2006, P IEEE INT C AC SPEE, V1, P97 Campbell WM, 2006, IEEE SIGNAL PROC LET, V13, P308, DOI 10.1109/LSP.2006.870086 Chawla N. V., 2003, 7 EUR C PRINC PRACT, P107 Chawla NV, 2002, J ARTIF INTELL RES, V16, P321 Cieri C., 2004, P 4 INT C LANG RES E, P69 DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 Dehak N, 2009, INT CONF ACOUST SPEE, P4237, DOI 10.1109/ICASSP.2009.4960564 EFRON B, 1983, AM STAT, V37, P36, DOI 10.2307/2685844 Fauve B., 2008, ODYSSEY 2008 Ferrer L., 2004, ICASSP, P173 Gillick L., 1989, P ICASSP, P532 Gonzalez-Rodriguez J, 2006, COMPUT SPEECH LANG, V20, P331, DOI 10.1016/j.csl.2005.08.005 Kang PS, 2006, LECT NOTES COMPUT SC, V4232, P837 Kenny P., 2003, EUROSPEECH, P2961 LeCun Y., 2005, P 10 INT WORKSH ART Lin Y, 2002, MACH LEARN, V46, P191, DOI 10.1023/A:1012406528296 Lin ZY, 2009, LECT NOTES COMPUT SC, V5678, P536 LONGWORTH C, 2009, IEEE T AUDIO SPEECH, V17 Martin A. F., 1997, P EUROSPEECH, P1895 Pelecanos J., 2001, P SPEAK OD SPEAK REC, P213 Ramaswamy G. N., 2003, ICASSP, V2, P61 Reynolds DA, 2000, DIGIT SIGNAL PROCESS, V10, P19, DOI 10.1006/dspr.1999.0361 Solomonoff A., 2004, P OD SPEAK LANG REC, P41 Solomonoff A, 2005, INT CONF ACOUST SPEE, P629 Sun AX, 2009, DECIS SUPPORT SYST, V48, P191, DOI 10.1016/j.dss.2009.07.011 Tang YC, 2009, IEEE T SYST MAN CY B, V39, P281, DOI 10.1109/TSMCB.2008.2002909 van Leeuwen D. A., 2005, INTERSPEECH, P1981 Veropoulos K., 1999, P INT JOINT C ART IN, P55 Wu G, 2005, IEEE T KNOWL DATA EN, V17, P786, DOI 10.1109/TKDE.2005.95 Zhang SX, 2009, INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, VOLS 1-5, P1275 NR 36 TC 8 Z9 10 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2011 VL 53 IS 1 BP 119 EP 130 DI 10.1016/j.specom.2010.06.011 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 699LA UT WOS:000285663700011 ER PT J AU Alpan, A Maryn, Y Kacha, A Grenez, F Schoentgen, J AF Alpan, A. Maryn, Y. Kacha, A. Grenez, F. Schoentgen, J. TI Multi-band dysperiodicity analyses of disordered connected speech SO SPEECH COMMUNICATION LA English DT Article DE Connected disordered speech; Variogram; Signal-to-dysperiodicity ratio; Cepstral peak prominence; Multi-band analysis; Multi-variable analysis ID VOCAL DYSPERIODICITIES; SPASMODIC DYSPHONIA; ACOUSTIC MEASURES; SUSTAINED VOWELS; RUNNING SPEECH; CEPSTRAL PEAK; VOICES; DISCRIMINATION; LARYNGOGRAPH; PREDICTION AB The objective is to analyse vocal dysperiodicities in connected speech produced by dysphonic speakers. The analysis involves a variogram-based method that enables tracking instantaneous vocal dysperiodicities. The dysperiodicity trace is summarized by means of the signal-to-dysperiodicity ratio, which has been shown to correlate strongly with the perceived degree of hoarseness of the speaker. Previously, this method has been evaluated on small corpora only. In this article, analyses have been carried out on two corpora comprising over 250 and 700 speakers. This has enabled carrying out multi-frequency band and multi-cue analyses without risking overfitting. 
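The signal-to-dysperiodicity ratio named in the abstract above compares the energy of the speech signal with the energy of a cycle-to-cycle dysperiodicity trace. The sketch below is a deliberately simplified stand-in, assuming that the dysperiodicity in each frame is the residual left after subtracting the best-matching stretch one (variable) pitch period earlier; the published variogram-based estimator and its multi-band extension differ in detail.

```python
import numpy as np

def dysperiodicity_trace(x, fs, fmin=60.0, fmax=400.0, frame_ms=30):
    """Crude dysperiodicity estimate: per frame, keep the residual against the
    best-matching delayed copy within the plausible pitch-period range."""
    lag_min, lag_max = int(fs / fmax), int(fs / fmin)
    hop = int(fs * frame_ms / 1000)
    e = np.zeros_like(x)
    for start in range(lag_max, len(x) - hop, hop):
        frame = x[start:start + hop]
        best = None
        for lag in range(lag_min, lag_max + 1):
            resid = frame - x[start - lag:start - lag + hop]
            cost = np.sum(resid ** 2)
            if best is None or cost < best[0]:
                best = (cost, resid)
        e[start:start + hop] = best[1]
    return e

def signal_to_dysperiodicity_ratio(x, e):
    # SDR in dB: larger values correspond to more periodic (less hoarse) speech
    return 10.0 * np.log10(np.sum(x ** 2) / (np.sum(e ** 2) + 1e-12))
```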
The analysis results are compared to the cepstral peak prominence, which is a popular cue that indirectly summarizes vocal dysperiodicities frame-wise. A perceptual rating has been available for the first corpus whereas speakers in the second corpus have been categorized as normal or pathological only. For the first corpus, results show that the correlation with perceptual scores increases statistically significantly for multi-band analysis compared to conventional full-band analysis. Also, combining the cepstral peak prominence with the low-frequency band signal-to-dysperiodicity ratio statistically significantly increases their combined correlation with perceptual scores. The signal-to-dysperiodicity ratios of the two corpora have been separately submitted to principal component analysis. The results show that the first two principal components are interpretable in terms of the degree of dysphonia and the spectral slope, respectively. The clinical relevance of the principal components has been confirmed by linear discriminant analysis. (C) 2010 Elsevier B.V. All rights reserved. C1 [Alpan, A.] Univ Libre Bruxelles, Lab Images Signals & Telecommun Devices, LIST, B-1050 Brussels, Belgium. [Maryn, Y.] Sint Jan Gen Hosp, Dept Speech Language Pathol & Audiol, Dept Otorhinolaryngol & Head & Neck Surgery, Brugge, Belgium. RP Alpan, A (reprint author), Univ Libre Bruxelles, Lab Images Signals & Telecommun Devices, LIST, CP165-51,Av F Roosevelt 50, B-1050 Brussels, Belgium. EM aalpan@ulb.ac.be; Youri.Maryn@azbrugge.be; akacha@ulb.ac.be; fgrenez@ulb.ac.be; jschoent@ulb.ac.be FU "Region Wallonne", Belgium FX This research was supported by the "Region Wallonne", Belgium, in the framework of the "WALEO II" programme. We are grateful to Dr. Marc Remade, Department of Otorhinolaryngology and Head and Neck Surgery, University Hospital of Louvain at Mont-Godinne, Belgium, for providing the Kay Elemetrics Database. CR Alpan A., 2009, P INT BRIGHT UK, P959 Alpan A., 2007, P INT ANTW BELG AUG, P1178 Awan SN, 2005, J VOICE, V19, P268, DOI 10.1016/j.jvoice.2004.03.005 Bettens F, 2005, J ACOUST SOC AM, V117, P328, DOI 10.1121/1.1835511 DOLANSKY L, 1968, IEEE T ACOUST SPEECH, VAU16, P51, DOI 10.1109/TAU.1968.1161962 DUNN OJ, 1969, J AM STAT ASSOC, V64, P366, DOI 10.2307/2283746 Edgar JD, 2001, J VOICE, V15, P362, DOI 10.1016/S0892-1997(01)00038-8 Fairbanks G, 1960, VOICE ARTICULATION D Flalberstam B., 2004, ORL, V66, P70 FOURCIN AJ, 1971, MED BIOL ILLUS, V21, P172 FOURCIN AJ, 1977, PHONETICA, V34, P313 Fredouille C, 2009, EURASIP J ADV SIG PR, DOI 10.1155/2009/982102 Haslett J, 1997, STATISTICIAN, V46, P475 HECKER MHL, 1971, J ACOUST SOC AM, V49, P1275, DOI 10.1121/1.1912490 Heman-Ackah YD, 2002, J VOICE, V16, P20, DOI 10.1016/S0892-1997(02)00067-X Heman-Ackah YD, 2004, J VOICE, V18, P203, DOI 10.1016/j.jvoice.2004.01.005 Hernan-Ackah Y.D., 2003, ANN OTOLRHINOLLARYNG, V112, P324 Hernandez-Espinosa C., 2000, Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks. IJCNN 2000. Neural Computing: New Challenges and Perspectives for the New Millennium, DOI 10.1109/IJCNN.2000.860781 Hillenbrand J, 1996, J SPEECH HEAR RES, V39, P311 Hinkle DE, 1998, APPL STAT BEHAV SCI Hirano M, 1981, CLIN EXAMINATION VOI Hotelling H, 1940, ANN MATH STAT, V11, P271, DOI 10.1214/aoms/1177731867 Jayant N. S., 1984, CODING WAVEFORMS PRI Jolliffe I. 
T., 2002, PRINCIPAL COMPONENT, V2nd Kacha A, 2006, SPEECH COMMUN, V48, P1365, DOI 10.1016/j.specom.2006.07.003 Kacha A, 2006, BIOMED SIGNAL PROCES, V1, P137, DOI 10.1016/j.bspc.2006.07.002 Kay Elemetrics Corp, 1994, DIS VOIC DAT VER 1 0 Klingholz F., 1987, SPEECH COMMUN, V6, P1 KLINGHOLTZ F, 1990, J ACOUST SOC AM, V87, P2218, DOI 10.1121/1.399189 LAVER J, 1986, J PHONETICS, V14, P517 LIEBERMAN P, 1963, J ACOUST SOC AM, V35, P344, DOI 10.1121/1.1918465 Maryn Y., J VOICE IN PRESS MURRY T, 1980, J SPEECH HEAR RES, V23, P361 MUTA H, 1988, J ACOUST SOC AM, V84, P1292, DOI 10.1121/1.396628 Oppenheimer A., 1975, DIGITAL SIGNAL PROCE Parsa V, 2001, J SPEECH LANG HEAR R, V44, P327, DOI 10.1044/1092-4388(2001/027) Parsa V., 2002, P INT C SPOK LANG PR, P2505 Qi YY, 1999, J ACOUST SOC AM, V105, P2532, DOI 10.1121/1.426860 Sapienza CM, 2002, J SPEECH LANG HEAR R, V45, P830, DOI 10.1044/1092-4388(2002/067) Schoentgen J, 2003, J ACOUST SOC AM, V113, P553, DOI 10.1121/1.1523384 Stevens SS, 1937, J ACOUST SOC AM, V8, P185, DOI 10.1121/1.1915893 Umapathy K, 2005, IEEE T BIO-MED ENG, V52, P421, DOI 10.1109/TBME.2004.842962 Yiu E, 2000, CLIN LINGUIST PHONET, V14, P295 NR 43 TC 11 Z9 12 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2011 VL 53 IS 1 BP 131 EP 141 DI 10.1016/j.specom.2010.06.010 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 699LA UT WOS:000285663700012 ER PT J AU Mori, H Satake, T Nakamura, M Kasuya, H AF Mori, Hiroki Satake, Tomoyuki Nakamura, Makoto Kasuya, Hideki TI Constructing a spoken dialogue corpus for studying paralinguistic information in expressive conversation and analyzing its statistical/acoustic characteristics SO SPEECH COMMUNICATION LA English DT Article DE Emotional state; Expressive speech; Annotation; Abstract dimensions; Spontaneous speech; Spoken dialogue ID FACIAL EXPRESSIONS; EMOTIONAL STATES; SPEECH; GENERATION; RATINGS AB The Utsunomiya University (UU) Spoken Dialogue Database for Paralinguistic Information Studies is introduced. The UU Database is especially intended for use in understanding the usage, structure and effect of paralinguistic information in expressive Japanese conversational speech. Paralinguistic information refers to meaningful information, such as emotion or attitude, delivered along with linguistic messages. The UU Database comes with labels of perceived emotional states for all utterances. The emotional states were annotated with six abstract dimensions: pleasant-unpleasant, aroused-sleepy, dominant-submissive, credible-doubtful, interested-indifferent, and positive-negative. To stimulate expressively-rich and vivid conversation, the "4-frame cartoon sorting task" was devised. In this task, four cards each containing one frame extracted from a cartoon are shuffled, and each participant with two cards out of the four then has to estimate the original order. The effectiveness of the method was supported by a broad distribution of subjective emotional state ratings. Preliminary annotation experiments by a large number of annotators confirmed that most annotators could provide fairly consistent ratings for a repeated identical stimulus, and the inter-rater agreement was good (W similar or equal to 0.5) for three of the six dimensions. Based on the results, three annotators were selected for labeling all 4840 utterances. 
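The inter-rater agreement figure quoted above (W close to 0.5) is Kendall's coefficient of concordance. A minimal sketch of the tie-free form of the statistic follows; the ratings matrix is invented for illustration, and the study's own computation may include tie corrections that are omitted here.

```python
import numpy as np
from scipy.stats import rankdata

def kendalls_w(ratings):
    """Kendall's W for a (raters x items) matrix of ratings, ignoring ties."""
    ranks = np.apply_along_axis(rankdata, 1, ratings)   # rank items within each rater
    m, n = ranks.shape                                   # m raters, n items
    rank_sums = ranks.sum(axis=0)
    s = np.sum((rank_sums - rank_sums.mean()) ** 2)      # spread of the column rank sums
    return 12.0 * s / (m ** 2 * (n ** 3 - n))

# e.g., three annotators rating one dimension for five utterances (toy data)
ratings = np.array([[1, 3, 4, 6, 7],
                    [2, 3, 5, 7, 6],
                    [1, 4, 3, 5, 7]])
print(round(kendalls_w(ratings), 2))
```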
The high degree of agreement was verified using such measures as Kendall's W. The results of correlation analyses showed that not only prosodic parameters such as intensity and f(0) but also a voice quality parameter were related to the dimensions. Multiple correlation of above 0.7 and RMS error of about 0.6 were obtained for the recognition of some dimensions using linear combinations of the speech parameters. Overall, the perceived emotional states of speakers can be accurately estimated from the speech parameters in most cases. (C) 2010 Elsevier B.V. All rights reserved. C1 [Mori, Hiroki; Satake, Tomoyuki; Kasuya, Hideki] Utsunomiya Univ, Grad Sch Engn, Utsunomiya, Tochigi 3218585, Japan. [Nakamura, Makoto] Utsunomiya Univ, Fac Int Studies, Utsunomiya, Tochigi 3218505, Japan. RP Mori, H (reprint author), Utsunomiya Univ, Grad Sch Engn, 7-1-2 Yoto, Utsunomiya, Tochigi 3218585, Japan. EM hiroki@klab.ee.utsunomiya-u.ac.jp CR ANDERSON AH, 1991, LANG SPEECH, V34, P351 ARGYLE M, 1965, SOCIOMETRY, V28, P289, DOI 10.2307/2786027 ARGYLE M, 1971, EUR J SOC PSYCHOL, V1, P385, DOI 10.1002/ejsp.2420010307 Arimoto Y, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P322 Auberge V., 2003, P EUR 2003, P185 Burgoon J. K., 1989, NONVERBAL COMMUNICAT Campbell N., 2003, 1 JST CREST INT WORK, P61 Campbell N., 2004, J PHONET SOC JPN, V8, P9 Cowie R, 2001, IEEE SIGNAL PROC MAG, V18, P32, DOI 10.1109/79.911197 Cowie R, 2003, SPEECH COMMUN, V40, P5, DOI 10.1016/S0167-6393(02)00071-7 Devillers L, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P801 Douglas-Cowie E., 2000, P ISCA WORKSH SPEECH, P39 Douglas-Cowie E, 2003, SPEECH COMMUN, V40, P33, DOI 10.1016/S0167-6393(02)00070-5 Duck S.W., 1993, SOCIAL CONTEXT RELAT, V3 Erickson D., 2005, Acoustical Science and Technology, V26, DOI 10.1250/ast.26.317 Horiuchi Y., 1999, Journal of Japanese Society for Artificial Intelligence, V14 Greasley P., 1995, P 13 INT C PHON SCI, P242 Grimm M, 2008, 2008 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, VOLS 1-4, P865, DOI 10.1109/ICME.2008.4607572 HUTTAR GL, 1968, J SPEECH HEAR RES, V11, P481 Kasuya H., 1999, P INT C PHON SCI SAN, P2505 Kasuya H., 2000, P INT C SPOK LANG PR, P345 Keeley M., 1994, UNDERSTANDING RELATI, V4, P35 MacIntyre R, 1995, DYSFLUENCY ANNOTATIO Mehrabian A., 1974, THEORY AFFILIATION Miller G.R., 1993, INTERPERSONAL COMMUN, V14 Mori Hiroki, 2009, Acoustical Science and Technology, V30, DOI 10.1250/ast.30.376 Mori H., 2005, J ACOUST SOC JPN, V61, P690 Mori H., 2007, SIGSLUDA60303 JSAI Mori H, 2008, IEICE T INF SYST, VE91D, P1628, DOI 10.1093/ietisy/e91-d.6.1628 Mori H., 2007, P INT 2007, P102 Morimoto T., 1994, P ICSLP, P1791 MURRAY IR, 1993, J ACOUST SOC AM, V93, P1097, DOI 10.1121/1.405558 Ohtsuka T., 2001, P EUR 2001, V3, P2267 PALMER MT, 1989, COMMUN MONOGR, V56, P1 Patterson M.L., 1991, FUNDAMENTALS NONVERB, P58 Rosenthal R., 1979, SENSITIVITY NONVERBA RUSSELL JA, 1985, J PERS SOC PSYCHOL, V48, P1290, DOI 10.1037/0022-3514.48.5.1290 RUSSELL JA, 1989, J PERS SOC PSYCHOL, V57, P493, DOI 10.1037/0022-3514.57.3.493 Satake T., 2009, SIGSLUDA80313 JSAI Scherer K. R., 1977, MOTIV EMOTION, V1, P331, DOI 10.1007/BF00992539 Scherer K. 
R., 1989, HDB SOCIAL PSYCHOPHY, P165 SCHLOSBERG H, 1952, J EXP PSYCHOL, V44, P229, DOI 10.1037/h0055778 Schroder M, 2004, THESIS SAARLAND U STANG DJ, 1973, J PERS SOC PSYCHOL, V27, P405, DOI 10.1037/h0034940 The Japan Ministry of Education Culture Sports Science and Technology, 2000, PHYS FITN TEST, P1 Truong KP, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P318 WARNER RM, 1987, J NONVERBAL BEHAV, V11, P57, DOI 10.1007/BF00990958 Witten I, 2005, DATA MINING NR 48 TC 7 Z9 7 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2011 VL 53 IS 1 BP 36 EP 50 DI 10.1016/j.specom.2010.08.002 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 699LA UT WOS:000285663700004 ER PT J AU Cutler, A Cooke, M Lecumberri, MLG AF Cutler, Anne Cooke, Martin Lecumberri, M. Luisa Garcia TI Special Issue: Non-native Speech Perception in Adverse Conditions Preface SO SPEECH COMMUNICATION LA English DT Editorial Material RI Cutler, Anne/C-9467-2012 NR 0 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV-DEC PY 2010 VL 52 IS 11-12 SI SI BP 863 EP 863 DI 10.1016/j.specom.2010.11.003 PG 1 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 692BO UT WOS:000285126700001 ER PT J AU Lecumberri, MLG Cooke, M Cutler, A AF Garcia Lecumberri, Maria Luisa Cooke, Martin Cutler, Anne TI Non-native speech perception in adverse conditions: A review SO SPEECH COMMUNICATION LA English DT Review DE Non-native; Speech perception; Noise; Review ID TRAINING JAPANESE LISTENERS; SPOKEN-WORD RECOGNITION; INFORMATIONAL MASKING; LANGUAGE-ACQUISITION; ENGLISH CONSONANTS; NATIVE-LANGUAGE; LEXICAL ACCESS; 2ND LANGUAGE; FOREIGN ACCENT; 2ND-LANGUAGE ACQUISITION AB If listening in adverse conditions is hard, then listening in a foreign language is doubly so: non-native listeners have to cope with both imperfect signals and imperfect knowledge. Comparison of native and non-native listener performance in speech-in-noise tasks helps to clarify the role of prior linguistic experience in speech perception, and, more directly, contributes to an understanding of the problems faced by language learners in everyday listening situations. This article reviews experimental studies on non-native listening in adverse conditions, organised around three principal contributory factors: the task facing listeners, the effect of adverse conditions on speech, and the differences among listener populations. Based on a comprehensive tabulation of key studies, we identify robust findings, research trends and gaps in current knowledge. (C) 2010 Elsevier B.V. All rights reserved. C1 [Garcia Lecumberri, Maria Luisa; Cooke, Martin] Univ Basque Country, Fac Letras, Language & Speech Lab, Vitoria 01006, Spain. [Cooke, Martin] Basque Fdn Sci, IKERBASQUE, Bilbao 48011, Spain. [Cutler, Anne] Max Planck Inst Psycholinguist, NL-6500 AH Nijmegen, Netherlands. [Cutler, Anne] Univ Western Sydney, MARCS Auditory Labs, Penrith, NSW 1797, Australia. [Cutler, Anne] Radboud Univ Nijmegen, Donders Inst Brain Cognit & Behav, Nijmegen, Netherlands. RP Lecumberri, MLG (reprint author), Univ Basque Country, Fac Letras, Language & Speech Lab, Paseo Univ 5, Vitoria 01006, Spain. 
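Several of the studies surveyed in this review, and the consonant-identification paper later in this issue, use speech-shaped noise as the energetic masker. One common way to generate such a masker, offered here only as an illustrative sketch (random-phase resynthesis of the signal's long-term spectrum, not any particular study's procedure), is:

```python
import numpy as np

def speech_shaped_noise(speech, seed=0):
    """Noise with approximately the long-term magnitude spectrum of `speech`."""
    rng = np.random.default_rng(seed)
    spectrum = np.abs(np.fft.rfft(speech))              # long-term magnitude spectrum
    phases = rng.uniform(0, 2 * np.pi, spectrum.shape)  # randomize the phase
    noise = np.fft.irfft(spectrum * np.exp(1j * phases), n=len(speech))
    return noise / np.max(np.abs(noise))                # normalise the peak level
```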
EM garcia.lecumberri@ehu.es RI Cutler, Anne/C-9467-2012 FU EU, Basque Government [IT311-10]; Spanish Government [FFI2009-10264] FX M.L. Garcia Lecumberri and M. Cooke were supported by the EU Marie Curie RTN "Sound to Sense", Basque Government Grant IT311-10 and Spanish Government grant FFI2009-10264. We thank Madhu Shashanka for recording the reverberant sound example. CR Adank P, 2009, J EXP PSYCHOL HUMAN, V35, P520, DOI 10.1037/a0013552 Akker Evelien, 2003, BILING-LANG COGN, V6, P81, DOI 10.1017/S1366728903001056 Assmann P. F., 2004, SPRINGER HDB AUDITOR, V18 BASHFORD JA, 1988, J ACOUST SOC AM, V84, P1635, DOI 10.1121/1.397178 BASHFORD JA, 1987, PERCEPT PSYCHOPHYS, V42, P114, DOI 10.3758/BF03210499 Bent T, 2003, J ACOUST SOC AM, V114, P1600, DOI 10.1121/1.1603234 BERGMAN, 1980, AGING PERCEPTION SPE, P123 Best C. T., 1995, SPEECH PERCEPTION LI, P171 Best CT, 2001, J ACOUST SOC AM, V109, P775, DOI 10.1121/1.1332378 BLACK JW, 1962, J SPEECH HEAR RES, V5, P70 Bloomfield Leonard, 1933, LANGUAGE Bohn O. S., 1995, SPEECH PERCEPTION LI, P279 BOHN OS, 1990, APPL PSYCHOLINGUIST, V11, P303, DOI 10.1017/S0142716400008912 Bohn OS, 2000, AMST STUD THEORY HIS, V198, P1 BOHN OS, 2007, HONOR JE FLEGE BOSCH L, 1997, EUROSPEECH 97, V1, P231 Bradlow A, 2010, SPEECH COMMUN, V52, P930, DOI 10.1016/j.specom.2010.06.003 Bradlow AR, 2007, J ACOUST SOC AM, V121, P2339, DOI 10.1121/1.2642103 Bradlow AR, 2002, J ACOUST SOC AM, V112, P272, DOI 10.1121/1.1487837 Broersma M, 2008, SYSTEM, V36, P22, DOI 10.1016/j.system.2007.11.003 BROERSMA M, 2010, Q J EXP PSYCHOL Broersma M, 2010, SPEECH COMMUN, V52, P980, DOI 10.1016/j.specom.2010.08.010 Brungart DS, 2001, J ACOUST SOC AM, V110, P2527, DOI 10.1121/1.1408946 Burki-Cohen J, 2001, LANG SPEECH, V44, P149 BUUS S, 1986, P INT 86, P895 CARHART R, 1969, J ACOUST SOC AM, V45, P694, DOI 10.1121/1.1911445 Cebrian J, 2006, J PHONETICS, V34, P372, DOI 10.1016/j.wocn.2005.08.003 Cieslicka A, 2006, SECOND LANG RES, V22, P115, DOI 10.1191/0267658306sr263oa Clahsen H, 2006, APPL PSYCHOLINGUIST, V27, P3, DOI 10.1017/S0142716406060024 Clopper C.G., 2006, J ACOUST SOC AM, V119, P3424 Cooke M, 2010, SPEECH COMMUN, V52, P954, DOI 10.1016/j.specom.2010.04.004 Cooke M, 2008, J ACOUST SOC AM, V123, P414, DOI 10.1121/1.2804952 Cooke M, 2006, J ACOUST SOC AM, V119, P1562, DOI 10.1121/1.2166600 Cooper N, 2002, LANG SPEECH, V45, P207 Cutler A, 2004, J ACOUST SOC AM, V116, P3668, DOI 10.1121/1.1810292 Cutler A, 2005, SPEECH COMMUN, V47, P32, DOI 10.1016/j.specom.2005.02.001 Cutler A, 2008, J ACOUST SOC AM, V124, P1264, DOI 10.1121/1.2946707 Cutler A, 2006, J PHONETICS, V34, P269, DOI 10.1016/j.wocn.2005.06.002 Cutler A, 2007, P INTERSPEECH 2007 A, P1585 Darwin CJ, 2003, J ACOUST SOC AM, V114, P2913, DOI 10.1121/1.1616924 Darwin CJ, 2008, PHILOS T R SOC B, V363, P1011, DOI 10.1098/rstb.2007.2156 ECKMAN FR, 1977, LANG LEARN, V27, P315, DOI 10.1111/j.1467-1770.1977.tb00124.x Eisner F, 2005, PERCEPT PSYCHOPHYS, V67, P224, DOI 10.3758/BF03206487 Ellis Rod, 1994, STUDY 2 LANGUAGE ACQ Ezzatian P, 2010, SPEECH COMMUN, V52, P919, DOI 10.1016/j.specom.2010.04.001 FAIRBANKS G, 1958, J ACOUST SOC AM, V30, P596, DOI 10.1121/1.1909702 FESTEN JM, 1990, J ACOUST SOC AM, V88, P1725, DOI 10.1121/1.400247 Flege J. E., 1988, HUMAN COMMUNICATION, V1, P224 Flege J. E., 1995, SPEECH PERCEPTION LI, P233 Flege J. 
E., 1987, APPLIED LINGUISTICS, V8, P162, DOI [10.1093/applin/8.2.162, DOI 10.1093/APPLIN/8.2.162] Flege James Emil, 2001, STUDIES 2 LANGUAGE A, V23, P527 Flege JE, 1999, SEC LANG ACQ RES, P101 Flege JE, 1999, J MEM LANG, V41, P78, DOI 10.1006/jmla.1999.2638 FLEGE JE, 1995, J ACOUST SOC AM, V97, P3125, DOI 10.1121/1.413041 Flege JE, 1997, J PHONETICS, V25, P169, DOI 10.1006/jpho.1996.0040 Flege JE, 1997, J PHONETICS, V25, P437, DOI 10.1006/jpho.1997.0052 FOX RA, 1995, J ACOUST SOC AM, V97, P2540, DOI 10.1121/1.411974 Florentine M., 1985, P INT 85, P1021 Florentine M, 1984, J ACOUST SOC AM, V75, pS84, DOI 10.1121/ 1.2021645 Frauenfelder U. H., 1998, LANGUAGE COMPREHENSI, P1 Freyman RL, 2004, J ACOUST SOC AM, V115, P2246, DOI 10.1121/1.689343 GARLAND S, 2007, BILINGUAL SPECTRUM Garnier M., 2007, THESIS U PARIS 6 Gaskell MG, 1997, LANG COGNITIVE PROC, V12, P613 GAT IB, 1978, AUDIOLOGY, V17, P339 Golestani N, 2009, BILING-LANG COGN, V12, P385, DOI 10.1017/S1366728909990150 Gooskens C, 2010, SPEECH COMMUN, V52, P1022, DOI 10.1016/j.specom.2010.06.005 Grosjean F, 1998, LANGUAGE COGNITION, V1, P131 Grosjean F., 2010, BILINGUAL LIFE REALI Grosjean F., 2001, ONE MIND 2 LANGUAGES, P1 Guion SG, 2000, J PHONETICS, V28, P27, DOI 10.1006/jpho.2000.0104 Hardison DM, 1996, LANG LEARN, V46, P3, DOI 10.1111/j.1467-1770.1996.tb00640.x HARLEY B, 1995, LANG LEARN, V45, P43, DOI 10.1111/j.1467-1770.1995.tb00962.x Hazan V, 2000, LANG SPEECH, V43, P273 Hazan V, 1998, SPEECH COMMUN, V24, P211, DOI 10.1016/S0167-6393(98)00011-9 Hazan V, 2010, SPEECH COMMUN, V52, P996, DOI 10.1016/j.specom.2010.05.003 Heinrich A, 2010, SPEECH COMMUN, V52, P1038, DOI 10.1016/j.specom.2010.09.009 Hoen M., 2007, SPEECH COMMUN, V12, P905, DOI 10.1016/j.specom.2007.05.008 HOWES D, 1957, J ACOUST SOC AM, V29, P296, DOI 10.1121/1.1908862 Imai S, 2005, J ACOUST SOC AM, V117, P896, DOI 10.1121/1.1823291 IOUP G, 1984, LANG LEARN, V34, P1, DOI 10.1111/j.1467-1770.1984.tb01001.x IVERSON P, 1995, J ACOUST SOC AM, V97, P553, DOI 10.1121/1.412280 Jiang JT, 2006, J ACOUST SOC AM, V119, P1092, DOI 10.1121/1.2149841 Jones C, 2007, COMPUT SPEECH LANG, V21, P641, DOI 10.1016/j.csl.2007.03.001 KREUL EJ, 1968, J SPEECH HEAR RES, V11, P536 Kuhl PK, 2004, NAT REV NEUROSCI, V5, P831, DOI 10.1038/nrn1533 KUHL PK, 1993, J ACOUST SOC AM, V93, P2423 Kuhl PK, 2000, P NATL ACAD SCI USA, V97, P11850, DOI 10.1073/pnas.97.22.11850 KUHL PK, 1993, J PHONETICS, V21, P125 Lado R., 1957, LINGUISTICS CULTURES LANE H, 1971, J SPEECH HEAR RES, V14, P677 Leather J., 1991, STUDIES 2ND LANGUAGE, V13, P305, DOI 10.1017/S0272263100010019 Lecumberri MLG, 2006, J ACOUST SOC AM, V119, P2445, DOI 10.1021/1.2180210 Lee CY, 2010, SPEECH COMMUN, V52, P900, DOI 10.1016/j.specom.2010.01.004 Lenneberg E., 1967, BIOL FDN LANGUAGE LIVELY SE, 1993, J ACOUST SOC AM, V94, P1242, DOI 10.1121/1.408177 LIVELY SE, 1994, J ACOUST SOC AM, V96, P2076, DOI 10.1121/1.410149 LOGAN JS, 1991, J ACOUST SOC AM, V89, P874, DOI 10.1121/1.1894649 Lombard E., 1911, ANN MALADIES OREILLE, V37, P101 Long M., 1990, STUDIES 2ND LANGUAGE, V12, P251, DOI DOI 10.1017/S0272263100009165 Lovitt A, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P2154 LU Y, 2010, THESIS U SHEFFIELD Luce PA, 1998, EAR HEARING, V19, P1, DOI 10.1097/00003446-199802000-00001 MacKay IRA, 2001, J ACOUST SOC AM, V110, P516, DOI 10.1121/1.1377287 MacKay IRA, 2001, PHONETICA, V58, P103, DOI 10.1159/000028490 Macnamara J. 
T., 1969, DESCRIPTION MEASUREM, P80 Maddieson I., 1984, PATTERNS SOUNDS Major R., 1998, STUDIES 2 LANGUAGE A, V20, P131 Major R. C, 2001, FOREIGN ACCENT ONTOG MAJOR RC, 1999, PHONOLOGICAL ISSUES, P151 Markham D., 1997, PHONETIC IMITATION A MARSLENWILSON W, 1994, PSYCHOL REV, V101, P653, DOI 10.1037//0033-295X.101.4.653 Mattys SL, 2009, COGNITIVE PSYCHOL, V59, P203, DOI 10.1016/j.cogpsych.2009.04.001 Mattys SL, 2010, SPEECH COMMUN, V52, P887, DOI 10.1016/j.specom.2010.01.005 Mayo LH, 1997, J SPEECH LANG HEAR R, V40, P686 MCCLELLAND JL, 1986, COGNITIVE PSYCHOL, V18, P1, DOI 10.1016/0010-0285(86)90015-0 McQueen J. M., 2007, OXFORD HDB PSYCHOLIN, P37 MCQUEEN JM, SPEECH COMMUNI UNPUB McQueen JM, 2006, COGNITIVE SCI, V30, P1113, DOI 10.1207/s15516709cog0000_79 McQueen JM, 1999, J EXP PSYCHOL HUMAN, V25, P1363, DOI 10.1037//0096-1523.25.5.1363 Meador D, 2000, BILING-LANG COGN, V3, P55, DOI 10.1017/S1366728900000134 MILLER GA, 1947, PSYCHOL BULL, V44, P105, DOI 10.1037/h0055960 MILLER GA, 1955, J ACOUST SOC AM, V27, P338, DOI 10.1121/1.1907526 MULLENNIX JW, 1989, J ACOUST SOC AM, V85, P365, DOI 10.1121/1.397688 Munro M. J., 1998, STUDIES 2 LANGUAGE A, V20, P139 NABELEK AK, 1982, J ACOUST SOC AM, V71, P1242 NABELEK AK, 1974, J SPEECH HEAR RES, V17, P724 NABELEK AK, 1988, J ACOUST SOC AM, V84, P476 NABELEK AK, 1984, J ACOUST SOC AM, V75, P632 Nelson P, 2005, LANG SPEECH HEAR SER, V36, P219, DOI 10.1044/0161-1461(2005/022) NEUMAN AC, 1983, J ACOUST SOC AM, V73, P2145, DOI 10.1121/1.389538 Norris D, 2008, PSYCHOL REV, V115, P357, DOI 10.1037/0033-295X.115.2.357 Norris D, 2006, COGNITIVE PSYCHOL, V53, P146, DOI 10.1016/j.cogpsych.2006.03.001 Norris D, 2003, COGNITIVE PSYCHOL, V47, P204, DOI 10.1016/S0010-0285(03)00006-9 NORRIS D, 1994, COGNITION, V52, P189, DOI 10.1016/0010-0277(94)90043-4 NYGAARD LC, 1994, PSYCHOL SCI, V5, P42, DOI 10.1111/j.1467-9280.1994.tb00612.x Penfield W, 1959, SPEECH BRAIN MECH Picard M, 2001, AUDIOLOGY, V40, P221 PICHCNY MA, 1981, THESIS MIT Pickering MJ, 2004, BEHAV BRAIN SCI, V27, P169 PISKE T, 1999, P 14 INT C PHON SCI, P1433 Polianov E., 1931, TRAVAUX CERCLE LINGU, P79 POLLACK I, 1959, J ACOUST SOC AM, V31, P273, DOI 10.1121/1.1907712 Quene H, 2010, SPEECH COMMUN, V52, P911, DOI 10.1016/j.specom.2010.03.005 REPP BH, 1988, J ACOUST SOC AM, V84, P1929, DOI 10.1121/1.397159 Rhebergen KS, 2005, J ACOUST SOC AM, V118, P1274, DOI 10.1121/1.2000751 Rogers CL, 2004, LANG SPEECH, V47, P139 Rogers CL, 2006, APPL PSYCHOLINGUIST, V27, P465, DOI 10.1017/S014271640606036X SAMUEL AG, 1981, J EXP PSYCHOL GEN, V110, P474, DOI 10.1037/0096-3445.110.4.474 SAVIN HB, 1963, J ACOUST SOC AM, V35, P200, DOI 10.1121/1.1918432 Schegloff EA, 2000, LANG SOC, V29, P1 SCOVEL T, 1969, LANG LEARN, V19, P245, DOI 10.1111/j.1467-1770.1969.tb00466.x Scovel T., 1988, TIME SPEAK PSYCHOLIN Seliger H. 
W., 1978, 2 LANGUAGE ACQUISITI, P11 Shimizu T, 2002, AURIS NASUS LARYNX, V29, P121, DOI 10.1016/S0385-8146(01)00133-X Simpson SA, 2005, J ACOUST SOC AM, V118, P2775, DOI 10.1121/1.2062650 SINGH S, 1966, J ACOUST SOC AM, V40, P635, DOI 10.1121/1.1910130 Singleton D., 1989, LANGUAGE ACQUISITION Sjerps MJ, 2010, J EXP PSYCHOL HUMAN, V36, P195, DOI 10.1037/a0016803 SLOWIACZEK LM, 1987, J EXP PSYCHOL LEARN, V13, P64, DOI 10.1037//0278-7393.13.1.64 Sorace Antonella, 1993, SECOND LANG RES, V9, P22 Soto-Faraco S, 2001, J MEM LANG, V45, P412, DOI 10.1006/jmla.2000.2783 Spivey MJ, 1999, PSYCHOL SCI, V10, P281, DOI 10.1111/1467-9280.00151 SPOLSKY B, 1968, LANG LEARN, V18, P79, DOI 10.1111/j.1467-1770.1968.tb00224.x STEENEKEN HJM, 1980, J ACOUST SOC AM, V67, P318, DOI 10.1121/1.384464 Stockwell Robert P., 1965, SOUNDS ENGLISH SPANI STRANGE W., 1995, SPEECH PERCEPTION LI, P3 Taft M, 1986, LANG COGNITIVE PROC, V1, P297, DOI 10.1080/01690968608404679 TAKATA Y, 1990, J ACOUST SOC AM, V88, P663, DOI 10.1121/1.399769 Tremblay A, 2008, APPL PSYCHOLINGUIST, V29, P553, DOI 10.1017/S0142716408080247 Trubetzkoy N. S., 1939, PRINCIPLES PHONOLOGY Vanlancker-Sidtis D, 2003, APPL PSYCHOLINGUIST, V24, P45, DOI 10.1017/S0142716403000031 VANDERVLUGT M, 1986, 21 IPO, P41 van Dommelen WA, 2010, SPEECH COMMUN, V52, P968, DOI 10.1016/j.specom.2010.05.001 Van Engen KJ, 2010, SPEECH COMMUN, V52, P943, DOI 10.1016/j.specom.2010.05.002 Van Engen KJ, 2007, J ACOUST SOC AM, V121, P519, DOI 10.1121/1.2400666 VANSUMMERS W, 1988, J ACOUST SOC AM, V84, P917 van Wijngaarden SJ, 2004, J ACOUST SOC AM, V115, P1281, DOI 10.1121/1.1647145 van Wijngaarden SJ, 2002, J ACOUST SOC AM, V111, P1906, DOI 10.1121/1.1456928 Volin J, 2010, SPEECH COMMUN, V52, P1010, DOI 10.1016/j.specom.2010.06.009 von Hapsburg Deborah, 2004, J Am Acad Audiol, V15, P88, DOI 10.3766/jaaa.15.1.9 von Hapsburg D, 2002, J SPEECH LANG HEAR R, V45, P202 Walsh T., 1981, INDIVIDUAL DIFFERENC, P3 WANG MD, 1973, J ACOUST SOC AM, V54, P1248, DOI 10.1121/1.1914417 WANG Y, 2008, J PHONETICS, V37, P344 WARREN RM, 1970, SCIENCE, V167, P392, DOI 10.1126/science.167.3917.392 Watkins AJ, 2005, ACTA ACUST UNITED AC, V91, P892 Weber A, 2004, J MEM LANG, V50, P1, DOI 10.1016/S0749-596X(03)00105-0 WEISS W, 2008, J AM ACAD AUDIOL, V19, P5 WODE H, 1980, 2 LANGUAGE DEV TREND Wright R., 2004, PHONETICALLY BASED P ZWITSERLOOD P, 1989, COGNITION, V32, P25, DOI 10.1016/0010-0277(89)90013-9 NR 192 TC 21 Z9 21 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV-DEC PY 2010 VL 52 IS 11-12 SI SI BP 864 EP 886 DI 10.1016/j.specom.2010.08.014 PG 23 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 692BO UT WOS:000285126700002 ER PT J AU Mattys, SL Carroll, LM Li, CKW Chan, SLY AF Mattys, Sven L. Carroll, Lucy M. Li, Carrie K. W. Chan, Sonia L. Y. 
TI Effects of energetic and informational masking on speech segmentation by native and non-native speakers SO SPEECH COMMUNICATION LA English DT Article DE Spoken-word recognition; Speech segmentation; Bilingualism; Processing load; Cognitive load; Energetic masking; Informational masking ID SPOKEN-WORD RECOGNITION; NOISE; PERCEPTION; LISTENERS; LANGUAGE; INTELLIGIBILITY; 2ND-LANGUAGE; INTEGRATION; ENGLISH; CONTEXT AB In this study, we asked whether native and non-native speakers of English use a similar balance of lexical knowledge and acoustic cues, e.g., juncture-specific allophones, to segment spoken English, and whether the two groups are equally affected by energetic masking (a competing talker) and by cognitive load (a simultaneous visual search task). In intact speech, as well as in both adverse conditions, non-native speakers gave relatively less weight to lexical plausibility than to acoustic cues. Under energetic masking, overall segmentation accuracy decreased, but this decrease was of comparable magnitude in native and non-natives speakers. Under cognitive load, native speakers relied relatively more on lexical plausibility than on acoustic cues. This lexical drift was not observed in the non-native group. These results indicate that non-native speakers pay less attention to lexical information-and relatively more attention to acoustic detail-than previously thought. They also suggest that the penetrability of the speech system by cognitive factors depends on listener's proficiency with the language, and especially their level of lexical-semantic knowledge. (C) 2010 Elsevier B.V. All rights reserved. C1 [Mattys, Sven L.; Carroll, Lucy M.; Li, Carrie K. W.; Chan, Sonia L. Y.] Univ Bristol, Dept Expt Psychol, Bristol BS8 1TU, Avon, England. RP Mattys, SL (reprint author), Univ Bristol, Dept Expt Psychol, 12A Priory Rd, Bristol BS8 1TU, Avon, England. EM Sven.Mattys@bris.ac.uk FU Leverhulme Trust [F/00 182/BG]; Marie Curie foundation [MRTN-CT-2006-035561] FX This study was made possible thanks to a grant from the Leverhulme Trust (F/00 182/BG) to S.L. Mattys, and a Research Training Network grant from the Marie Curie foundation (MRTN-CT-2006-035561). We thank Martin Cooke for calibrating the babble noise and calculating the glimpsing percentages. We also thank Lukas Wiget for contributing to data collection and Jeff Bowers for comments on an earlier draft. CR Altenberg EP, 2005, SECOND LANG RES, V21, P325, DOI 10.1191/0267658305sr250oa Baayen R. 
H., 1995, CELEX LEXICAL DATABA Baayen RH, 2008, J MEM LANG, V59, P390, DOI 10.1016/j.jml.2007.12.005 Barker J, 2007, SPEECH COMMUN, V49, P402, DOI 10.1016/j.specom.2006.11.003 Bradlow AR, 2007, J ACOUST SOC AM, V121, P2339, DOI 10.1121/1.2642103 Bradlow AR, 2002, J ACOUST SOC AM, V112, P272, DOI 10.1121/1.1487837 Brungart DS, 2001, J ACOUST SOC AM, V109, P1101, DOI 10.1121/1.1345696 Casini L, 2009, COGNITION, V112, P318, DOI 10.1016/j.cognition.2009.04.005 Cooke M, 2008, J ACOUST SOC AM, V123, P414, DOI 10.1121/1.2804952 Cooke M, 2006, J ACOUST SOC AM, V119, P1562, DOI 10.1121/1.2166600 Cutler A, 2004, J ACOUST SOC AM, V116, P3668, DOI 10.1121/1.1810292 Cutler A, 2008, J ACOUST SOC AM, V124, P1264, DOI 10.1121/1.2946707 Davis MH, 2002, J EXP PSYCHOL HUMAN, V28, P218, DOI 10.1037//0096-1523.28.1.218 Dilley L, 1996, J PHONETICS, V24, P423, DOI 10.1006/jpho.1996.0023 Elston-Guttler KE, 2005, J MEM LANG, V52, P256, DOI 10.1016/j.jml.2004.11.002 Farris C, 2008, TESOL QUART, V42, P397 FESTEN JM, 1990, J ACOUST SOC AM, V88, P1725, DOI 10.1121/1.400247 GOW DW, 1995, J EXP PSYCHOL HUMAN, V21, P344, DOI 10.1037//0096-1523.21.2.344 Hazan V, 2000, LANG SPEECH, V43, P273 Lecumberri MLG, 2006, J ACOUST SOC AM, V119, P2445, DOI 10.1021/1.2180210 Love T, 2003, EXP PSYCHOL, V50, P204, DOI 10.1027//1618-3169.50.3.204 Mattys SL, 2009, COGNITIVE PSYCHOL, V59, P203, DOI 10.1016/j.cogpsych.2009.04.001 Mattys SL, 2005, J EXP PSYCHOL GEN, V134, P477, DOI 10.1037/0096-3445.134.4.477 Mayo LH, 1997, J SPEECH LANG HEAR R, V40, P686 NABELEK AK, 1984, J ACOUST SOC AM, V75, P632 NORRIS D, 1995, J EXP PSYCHOL LEARN, V21, P1209 OLLER DK, 1973, J ACOUST SOC AM, V54, P1235, DOI 10.1121/1.1914393 Rosenhouse J., 2006, INT J BILINGUAL, V10, P119 Sanders LD, 2002, J SPEECH LANG HEAR R, V45, P519, DOI 10.1044/1092-4388(2002/041) Sanders LD, 2003, COGNITIVE BRAIN RES, V15, P214, DOI 10.1016/S0926-6410(02)00194-5 Ito Kikuyo, 2009, J Acoust Soc Am, V125, P2348, DOI 10.1121/1.3082103 Styles E., 1997, PSYCHOL ATTENTION TAKANO Y, 1993, J CROSS CULT PSYCHOL, V24, P445, DOI 10.1177/0022022193244005 Thorn ASC, 1999, Q J EXP PSYCHOL-A, V52, P303, DOI 10.1080/027249899391089 TREISMAN AM, 1980, COGNITIVE PSYCHOL, V12, P97, DOI 10.1016/0010-0285(80)90005-5 Van Engen KJ, 2007, J ACOUST SOC AM, V121, P519, DOI 10.1121/1.2400666 van Wijngaarden SJ, 2002, J ACOUST SOC AM, V111, P1906, DOI 10.1121/1.1456928 WHITE L, 2010, Q J EXP PSYCHOL NR 38 TC 9 Z9 10 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV-DEC PY 2010 VL 52 IS 11-12 SI SI BP 887 EP 899 DI 10.1016/j.specom.2010.01.005 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 692BO UT WOS:000285126700003 ER PT J AU Lee, CY Tao, LA Bond, ZS AF Lee, Chao-Yang Tao, Liang Bond, Z. S. TI Identification of multi-speaker Mandarin tones in noise by native and non-native listeners SO SPEECH COMMUNICATION LA English DT Article DE Mandarin tones; Speech perception; Speaker variability; Noise ID SPEAKER NORMALIZATION; TALKER VARIABILITY; SPEECH-PERCEPTION; LEXICAL TONE; RECOGNITION MEMORY; WORD RECOGNITION; CHINESE TONES; SPOKEN WORDS; ACQUISITION; INFORMATION AB The similarities and contrasts between native and non-native identification of multi-speaker Mandarin tones in quiet and in noise were explored in a perception experiment. 
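The tone-identification record that begins above presents stimuli at several signal-to-noise ratios, and most of the speech-in-noise studies in this issue construct their stimuli the same way: scale the masker so that the target-to-masker power ratio equals the requested value. A minimal sketch, assuming target and masker are mono arrays at the same sampling rate and that the masker is at least as long as the target (the mixing details of the actual experiments may differ):

```python
import numpy as np

def mix_at_snr(target, masker, snr_db):
    """Scale `masker` so that 10*log10(P_target / P_masker) == snr_db, then add."""
    masker = masker[:len(target)]              # assumes masker covers the target
    p_target = np.mean(target ** 2)
    p_masker = np.mean(masker ** 2)
    gain = np.sqrt(p_target / (p_masker * 10 ** (snr_db / 10)))
    return target + gain * masker

# e.g., the noise conditions listed for the tone task, assuming `syllable` and
# `babble` arrays already exist (hypothetical variable names):
# stimuli = {snr: mix_at_snr(syllable, babble, snr) for snr in (0, -5, -10, -15)}
```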
Mandarin tone materials produced by three male and three female speakers were presented with five levels of signal-to-noise ratios (quiet, 0, -5, -10, and -15 dB) in two presentation formats (blocked by speaker and mixed across speakers) to listeners with various Mandarin experience (native, first-year, second-year, third-year, and fourth-year students). Stimuli blocked by speaker yielded higher accuracy and shorter reaction time. The additional demand of processing mixed-speaker stimuli, however, did not compromise non-native performance more than native performance. Noise expectedly compromised identification performance, although it did not compromise non-native identification more than native identification. Native listeners expectedly outperformed non-native listeners, although identification performance did not vary systematically as a function of duration of Mandarin experience. It is speculated that sources of variability in speech would affect non-native more than native tone identification only if syllable-internal, canonical F0 information is removed or altered. Published by Elsevier B.V. C1 [Lee, Chao-Yang] Ohio Univ, Sch Hearing Speech & Language Sci, Athens, OH 45701 USA. [Tao, Liang; Bond, Z. S.] Ohio Univ, Dept Linguist, Athens, OH 45701 USA. RP Lee, CY (reprint author), Ohio Univ, Sch Hearing Speech & Language Sci, Athens, OH 45701 USA. EM leec1@ohio.edu FU Ohio University [RC-09-088] FX We would like to thank Ning Zhou for assistance in speech processing, Na Wang and Lauren Dutton for assistance in administering the experiment, and Jessica Stillwell and Jana Van Hooser for assistance in stimulus preparation. We are also grateful to three anonymous reviewers for their helpful comments. This research was partially supported by Research Challenge Award RC-09-088 from Ohio University. CR Bluhme H., 1971, STUD LINGUIST, V22, P51 Bradlow AR, 1999, J ACOUST SOC AM, V106, P2074, DOI 10.1121/1.427952 Bradlow AR, 2002, J ACOUST SOC AM, V112, P272, DOI 10.1121/1.1487837 Chin T., 1987, J CHINESE LANGUAGE T, V22, P87 CHURCH BA, 1994, J EXP PSYCHOL LEARN, V20, P521, DOI 10.1037//0278-7393.20.3.521 CREELMAN CD, 1957, J ACOUST SOC AM, V29, P655, DOI 10.1121/1.1909003 Cutler A, 2004, J ACOUST SOC AM, V116, P3668, DOI 10.1121/1.1810292 FOX RA, 1985, J CHINESE LINGUIST, V13, P69 FOX RA, 1990, J CHINESE LINGUIST, V18, P261 GANDOUR J, 1983, J PHONETICS, V11, P149 GANDOUR JT, 1978, LANG SPEECH, V22, P1 Goldinger SD, 1996, J EXP PSYCHOL LEARN, V22, P1166, DOI 10.1037/0278-7393.22.5.1166 Goldinger SD, 1998, PSYCHOL REV, V105, P251, DOI 10.1037/0033-295X.105.2.251 Gottfried TL, 1997, J PHONETICS, V25, P207, DOI 10.1006/jpho.1997.0042 Halle PA, 2004, J PHONETICS, V32, P395, DOI 10.1016/S0095-4470(03)00016-0 Hardison DM, 2003, APPL PSYCHOLINGUIST, V24, P495, DOI 10.1017/S0142716403000250 Johnson K, 2005, BLACKW HBK LINGUIST, P363, DOI 10.1002/9780470757024.ch15 KIRILOFF C, 1969, PHONETICA, V20, P63 Kong YY, 2006, J ACOUST SOC AM, V120, P2830, DOI 10.1121/1.2346009 LEATHER J, 1983, J PHONETICS, V11, P373 LEE CY, 2010, LANG SPEECH, P53 Lee CY, 2008, J PHONETICS, V36, P537, DOI 10.1016/j.wocn.2008.01.002 Lee CY, 2009, J ACOUST SOC AM, V125, P1125, DOI 10.1121/1.3050322 Lee CY, 2009, J PHONETICS, V37, P1, DOI 10.1016/j.wocn.2008.08.001 Lin T., 1984, ZHONGGUO YUYAN XUEBA, V2, P59 Lin W. C. 
J., 1985, RELC J, V16, P31, DOI 10.1177/003368828501600207 LIVELY SE, 1993, J ACOUST SOC AM, V94, P1242, DOI 10.1121/1.408177 Mayo LH, 1997, J SPEECH LANG HEAR R, V40, P686 Meador D, 2000, BILING-LANG COGN, V3, P55, DOI 10.1017/S1366728900000134 Mertus J. A., 2000, BROWN LAB INTERACTIV Miracle W. C., 1989, J CHINESE LANGUAGE T, V24, P49 Moore CB, 1997, J ACOUST SOC AM, V102, P1864, DOI 10.1121/1.420092 MULLENNIX JW, 1989, J ACOUST SOC AM, V85, P365, DOI 10.1121/1.397688 NABELEK AK, 1984, J ACOUST SOC AM, V75, P632 NILSSON M, 1994, J ACOUST SOC AM, V95, P1085, DOI 10.1121/1.408469 Nusbaum H. C., 1992, SPEECH PERCEPTION PR, P113 PALMERI TJ, 1993, J EXP PSYCHOL LEARN, V19, P309, DOI 10.1037/0278-7393.19.2.309 REPP BH, 1990, J PHONETICS, V18, P481 Sebastian-Galles N, 2005, BLACKW HBK LINGUIST, P546, DOI 10.1002/9780470757024.ch22 Sereno J. A., 2007, LANGUAGE EXPERIENCE, P239 Shen X. S., 1989, J CHINESE LANGUAGE T, V24, P27 Summerfield Q., 1973, REPORT SPEECH RES PR, V2, P12 Takayanagi S, 2002, J SPEECH LANG HEAR R, V45, P585, DOI 10.1044/1092-4388(2002/047) TSAI CH, 2000, CH TSAIS TECHNOLOGY VERBRUGGE RR, 1976, J ACOUST SOC AM, V60, P198, DOI 10.1121/1.381065 Wang Y, 1999, J ACOUST SOC AM, V106, P3649, DOI 10.1121/1.428217 Wei CG, 2007, EAR HEARING, V28, p62S, DOI 10.1097/AUD.0b013e318031512c Wong PCM, 2003, J SPEECH LANG HEAR R, V46, P413, DOI 10.1044/1092-4388(2003/034) Xu Y, 1997, J PHONETICS, V25, P61, DOI 10.1006/jpho.1996.0034 XU Y, 1994, J ACOUST SOC AM, V95, P2240, DOI 10.1121/1.408684 Zhou N, 2008, EAR HEARING, V29, P326, DOI 10.1097/AUD.0b013e3181662c42 NR 51 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV-DEC PY 2010 VL 52 IS 11-12 SI SI BP 900 EP 910 DI 10.1016/j.specom.2010.01.004 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 692BO UT WOS:000285126700004 ER PT J AU Quene, H van Delft, LE AF Quene, Hugo van Delft, L. E. TI Non-native durational patterns decrease speech intelligibility SO SPEECH COMMUNICATION LA English DT Article DE Non-native speech; Duration patterns; Segmental durations; Speech Reception Threshold; Intelligibility ID RECEPTION THRESHOLD; SEGMENTAL DURATION; ENGLISH; 2ND-LANGUAGE; PERCEPTION; LANGUAGE; DUTCH; PROSODY; ACCENT; NOISE AB In native speech, durational patterns convey linguistically relevant phenomena such as phrase structure, lexical stress, rhythm, and word boundaries. The lower intelligibility of non-native speech may be partly due to its deviant durational patterns. The present study aims to quantify the relative contributions of non-native durational patterns and of non-native speech sounds to intelligibility. In a Speech Reception Threshold study, duration patterns were transplanted between native and non-native versions of Dutch sentences. Intelligibility thresholds (critical speech-to-noise ratios) differed by about 4 dB between the matching versions with unchanged durational patterns. Results for manipulated versions suggest that about 0.4-1.1 dB of this difference was due to the durational patterns, and that this contribution was larger if the native and non-native patterns were more deviant. The remainder of the difference must have been due to non-native speech sounds in these materials. This finding supports recommendations to attend to durational patterns as well as native-like speech sounds, when learning to speak a foreign language. (C) 2010 Elsevier B.V. 
All rights reserved. C1 [Quene, Hugo; van Delft, L. E.] Univ Utrecht, Utrecht Inst Linguist OTS, NL-3512 JK Utrecht, Netherlands. RP Quene, H (reprint author), Univ Utrecht, Utrecht Inst Linguist OTS, Trans 10, NL-3512 JK Utrecht, Netherlands. EM h.quene@uu.nl CR Adams C., 1979, ENGLISH SPEECH RHYTH ANDERSONHSIEH J, 1992, LANG LEARN, V42, P529, DOI 10.1111/j.1467-1770.1992.tb01043.x Baayen RH, 2008, J MEM LANG, V59, P390, DOI 10.1016/j.jml.2007.12.005 Bates D., 2005, R NEWS, V5, P27, DOI DOI 10.1111/J.1523-1739.2005.00280.X Bent T, 2008, PHONETICA, V65, P131, DOI 10.1159/000144077 Boersma P., 2008, PRAAT DOING PHONETIC CAMBIERLANGEVEL.G, 2000, THESIS U AMSTERDAM Chun Dorothy, 2002, DISCOURSE INTONATION Cutler A, 1997, LANG SPEECH, V40, P141 Derwing T. M., 1997, STUDIES 2 LANGUAGE A, V20, P1 EEFTING W, 1993, ANAL SYNTHESIS SPEEC, P225 Faraway J. J., 2006, EXTENDING LINEAR MOD FLEGE JE, 1993, J ACOUST SOC AM, V93, P1589, DOI 10.1121/1.406818 FLEGE JE, 1981, LANG SPEECH, V24, P125 Goetry V, 2000, PSYCHOL BELG, V40, P115 Holm S., 2008, THESIS NORWEGIAN U S Hox J., 2002, MULTILEVEL ANAL TECH KLATT DH, 1976, J ACOUST SOC AM, V59, P1208, DOI 10.1121/1.380986 LAEUFER C, 1992, J PHONETICS, V20, P411 MAASSEN B, 1984, SPEECH COMMUN, V3, P123, DOI 10.1016/0167-6393(84)90034-7 Mareuil P., 2006, PHONETICA, V63.4, P247, DOI 10.1159/000097308 MOULINES E, 1990, SPEECH COMMUN, V9, P453, DOI 10.1016/0167-6393(90)90021-Z Munro MJ, 1999, LANG LEARN, V49, P285, DOI 10.1111/0023-8333.49.s1.8 Munro Murray J., 2008, PHONOLOGY 2 LANGUAGE, P193 Nooteboom S., 1997, HDB PHONETIC SCI, P640 Patel A., 2008, MUSIC LANGUAGE BRAIN PETERSON GE, 1960, J ACOUST SOC AM, V32, P693, DOI 10.1121/1.1908183 Pinheiro J. C., 2000, STAT COMPUTING PLOMP R, 1979, AUDIOLOGY, V18, P43 PLOMP R, 1986, J SPEECH HEAR RES, V29, P146 Quene H, 2004, SPEECH COMMUN, V43, P103, DOI 10.1016/j.specom 2004.02.004 Quene H, 2008, J MEM LANG, V59, P413, DOI 10.1016/j.jml.2008.02.002 QUENE H, 1992, J PHONETICS, V20, P331 Quene H, 2005, PHONETICA, V62, P1, DOI 10.1159/000087222 R Development Core Team, 2008, R LANG ENV STAT COMP Rajadurai J., 2007, WORLD ENGLISH, V26, P87, DOI 10.1111/j.1467-971X.2007.00490.x Shatzman KB, 2006, PERCEPT PSYCHOPHYS, V68, P1, DOI 10.3758/BF03193651 SLIS IH, 1969, LANG SPEECH, V12, P80 SLUIJTER AMC, 1995, PHONETICA, V52, P71 Smith R., 2004, THESIS U CAMBRIDGE STUDEBAKER GA, 1985, J SPEECH HEAR RES, V28, P455 Tajima K, 1997, J PHONETICS, V25, P1, DOI 10.1006/jpho.1996.0031 VANSANTEN JPH, 1994, COMPUT SPEECH LANG, V8, P95, DOI 10.1006/csla.1994.1005 van Wijngaarden SJ, 2002, J ACOUST SOC AM, V112, P3004, DOI 10.1121/1.1512289 van Wijngaarden SJ, 2001, SPEECH COMMUN, V35, P103, DOI 10.1016/S0167-6393(00)00098-4 White L., 2007, CURRENT ISSUES LINGU, P237 White L, 2007, J PHONETICS, V35, P501, DOI 10.1016/j.wocn.2007.02.003 NR 47 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV-DEC PY 2010 VL 52 IS 11-12 SI SI BP 911 EP 918 DI 10.1016/j.specom.2010.03.005 PG 8 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 692BO UT WOS:000285126700005 ER PT J AU Ezzatian, P Avivi, M Schneider, BA AF Ezzatian, Payam Avivi, Meital Schneider, Bruce A. TI Do nonnative listeners benefit as much as native listeners from spatial cues that release speech from masking? 
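Speech reception thresholds of the kind reported in the durational-patterns record above (and measured again in the record that begins here) are usually obtained adaptively: the signal-to-noise ratio is lowered after a correct response and raised after an error, and the threshold is taken as the mean of the later trial levels. The sketch below is a generic one-up/one-down procedure with a 2 dB step, not the specific protocol of either study; sentence_correct is a stand-in for the listener's scored response.

```python
def adaptive_srt(sentence_correct, n_trials=20, start_snr=0.0, step_db=2.0, burn_in=4):
    """Generic 1-up/1-down staircase returning the SNR near 50% sentence correct.

    `sentence_correct(trial_index, snr_db)` must return True/False per trial.
    """
    snr = start_snr
    track = []
    for t in range(n_trials):
        correct = sentence_correct(t, snr)
        track.append(snr)
        snr += -step_db if correct else step_db   # harder after success, easier after failure
    levels = track[burn_in:]                      # discard the initial approach
    return sum(levels) / len(levels)

# usage with a simulated listener who is correct whenever SNR exceeds -6 dB (assumption)
print(adaptive_srt(lambda t, snr: snr > -6.0))
```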
SO SPEECH COMMUNICATION LA English DT Article DE Second language; Bilingualism; Speech comprehension; Speech perception; Informational masking; Energetic masking; Stream segregation ID INFORMATIONAL MASKING; COMPETING SPEECH; SIMULTANEOUS TALKERS; ENERGETIC MASKING; COCKTAIL PARTY; PERCEPTION; NOISE; RECOGNITION; SEPARATION; HEARING AB Since most everyday communication takes place in less than optimal acoustic settings, it is important to understand how such environments affect nonnative listeners. In this study we compare the speech reception abilities of native and nonnative English speakers when they are asked to repeat semantically anomalous sentences masked by steady-state noise or two other talkers in two conditions: when the target and masker appear to be colocated; and when the target and masker appear to emanate from different loci. We found that the later the age of language acquisition, the higher the threshold for speech reception under all conditions, suggesting that the ability to extract speech information from masking sounds in complex acoustic situations depends on language competency. Interestingly, however, native and nonnative listeners benefited equally from perceived spatial separation (an acoustic cue that releases speech from masking) independent of whether the speech target was masked by speech or noise, suggesting that the acoustic factors that release speech from masking are not affected by linguistic competence. In addition speech reception thresholds were correlated with vocabulary scores in all individuals, both native and nonnative. The implications of these findings for nonnative listeners in acoustically complex environments are discussed. (C) 2010 Elsevier B.V. All rights reserved. C1 [Ezzatian, Payam; Avivi, Meital; Schneider, Bruce A.] Univ Toronto Mississauga, Ctr Res Biol Commun Syst, Dept Psychol, Mississauga, ON L5L 1C6, Canada. RP Schneider, BA (reprint author), Univ Toronto Mississauga, Ctr Res Biol Commun Syst, Dept Psychol, Mississauga, ON L5L 1C6, Canada. EM bruce.schneider@utoronto.ca FU Canadian Institute of Health Research [MT15359]; Natural Sciences and Engineering Research Council of Canada [RGPIN 9952] FX This work was supported by the Canadian Institute of Health Research (MT15359) and Natural Sciences and Engineering Research Council of Canada (RGPIN 9952). We would like to thank James Qi for creating the program used to run our experiments and Lulu Li for help in recruiting participants. CR Abrahamsson N, 2009, LANG LEARN, V59, P249, DOI 10.1111/j.1467-9922.2009.00507.x Akeroyd MA, 2000, J ACOUST SOC AM, V107, P3394, DOI 10.1121/1.429410 BILGER RC, 1984, J SPEECH HEAR RES, V27, P32 Bradlow AR, 1999, J ACOUST SOC AM, V106, P2074, DOI 10.1121/1.427952 Bradlow AR, 2007, J ACOUST SOC AM, V121, P2339, DOI 10.1121/1.2642103 Bradlow AR, 2002, J ACOUST SOC AM, V112, P272, DOI 10.1121/1.1487837 Bregman AS., 1990, AUDITORY SCENE ANAL Brungart DS, 2001, J ACOUST SOC AM, V110, P2527, DOI 10.1121/1.1408946 Brungart DS, 2006, J ACOUST SOC AM, V120, P4007, DOI 10.1121/1.2363929 Brungart DS, 2002, J ACOUST SOC AM, V112, P664, DOI 10.1121/1.1490592 Cooke M, 2008, J ACOUST SOC AM, V123, P414, DOI 10.1121/1.2804952 Darwin CJ, 2003, J ACOUST SOC AM, V114, P2913, DOI 10.1121/1.1616924 Flege J. 
E., 1995, SPEECH PERCEPTION LI, P233 FLORENTINE M, 1985, P ACOUST SOC JAPAN Freyman RL, 1999, J ACOUST SOC AM, V106, P3578, DOI 10.1121/1.428211 Freyman RL, 2001, J ACOUST SOC AM, V109, P2112, DOI 10.1121/1.1354984 Freyman RL, 2004, J ACOUST SOC AM, V115, P2246, DOI 10.1121/1.689343 Freyman RL, 2007, J ACOUST SOC AM, V121, P1040, DOI 10.1121/1.2427117 Hasher L., 1988, PSYCHOL LEARN MOTIV, V22, P193, DOI DOI 10.1016/S0079-7421(08)60041-9 Hawley ML, 2004, J ACOUST SOC AM, V115, P833, DOI 10.1121/1.1639908 Heinrich A, 2008, Q J EXP PSYCHOL, V61, P735, DOI 10.1080/17470210701402372 Helfer KS, 1997, J SPEECH LANG HEAR R, V40, P432 Helfer KS, 2005, J ACOUST SOC AM, V117, P842, DOI [10.1121/1.1836832, 10.1121/1.183682] Helfer KS, 2008, EAR HEARING, V29, P87 KALIKOW DN, 1977, J ACOUST SOC AM, V61, P1337, DOI 10.1121/1.381436 Kidd G, 1998, J ACOUST SOC AM, V104, P422, DOI 10.1121/1.423246 Kidd G, 2005, J ACOUST SOC AM, V118, P3804, DOI 10.1121/1.2109187 Li Ang, 2004, Journal of Experimental Psychology Human Perception and Performance, V30, P1077 Litovsky RY, 2005, J ACOUST SOC AM, V117, P3091, DOI 10.1121/1.1873913 Marrone N, 2008, J ACOUST SOC AM, V124, P3064, DOI 10.1121/1.2980441 Marslen-Wilson W. D., 1989, LEXICAL REPRESENTATI, P3 Mayo LH, 1997, J SPEECH LANG HEAR R, V40, P686 Meador D, 2000, BILING-LANG COGN, V3, P55, DOI 10.1017/S1366728900000134 Noble W, 2002, PERCEPT PSYCHOPHYS, V64, P1325, DOI 10.3758/BF03194775 Rogers CL, 2008, J ACOUST SOC AM, V124, P1278, DOI 10.1121/1.2939127 Rogers CL, 2006, APPL PSYCHOLINGUIST, V27, P465, DOI 10.1017/S014271640606036X Saffran JR, 1996, J MEM LANG, V35, P606, DOI 10.1006/jmla.1996.0032 Schneider B. A., 2010, SPRINGER HDB AUDITOR, P167 Schneider BA, 2007, J AM ACAD AUDIOL, V18, P559, DOI 10.3766/jaaa.18.7.4 van Wijngaarden SJ, 2002, J ACOUST SOC AM, V111, P1906, DOI 10.1121/1.1456928 Verhaeghen P, 1997, PSYCHOL BULL, V122, P231, DOI 10.1037/0033-2909.122.3.231 Yang ZG, 2007, SPEECH COMMUN, V49, P892, DOI 10.1016/j.specom.2007.05.005 NR 42 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV-DEC PY 2010 VL 52 IS 11-12 SI SI BP 919 EP 929 DI 10.1016/j.specom.2010.04.001 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 692BO UT WOS:000285126700006 ER PT J AU Bradlow, A Clopper, C Smiljanic, R Walter, MA AF Bradlow, Ann Clopper, Cynthia Smiljanic, Rajka Walter, Mary Ann TI A perceptual phonetic similarity space for languages: Evidence from five native language listener groups SO SPEECH COMMUNICATION LA English DT Article DE Phonetic similarity; Cross-language speech intelligibility; Language classification ID SPEECH-INTELLIGIBILITY BENEFIT; FREE CLASSIFICATION; ENGLISH; NOISE; RECOGNITION; CONTRASTS; FEATURES; TALKER AB The goal of the present study was to devise a means of representing languages in a perceptual similarity space based on their overall phonetic similarity. In Experiment 1, native English listeners performed a free classification task in which they grouped 17 diverse languages based on their perceived phonetic similarity. A similarity matrix of the grouping patterns was then submitted to clustering and multidimensional scaling analyses. In Experiment 2, an independent group of native English listeners sorted the group of 17 languages in terms of their distance from English. 
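The clustering and multidimensional scaling analyses described in this abstract start from listeners' free groupings of languages; a common way to turn such groupings into a similarity matrix is to count how often two languages end up in the same pile. A minimal sketch assuming scipy and scikit-learn are available (toy data, not the authors' pipeline):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform
from sklearn.manifold import MDS

def similarity_from_sorts(sorts, n_items):
    """sorts: one grouping per listener, each a list of piles of item indices."""
    sim = np.zeros((n_items, n_items))
    for piles in sorts:
        for pile in piles:
            for i in pile:
                for j in pile:
                    sim[i, j] += 1
    return sim / len(sorts)            # proportion of listeners grouping i with j

# toy data: two listeners sorting four "languages" (indices 0..3)
sorts = [[[0, 1], [2, 3]], [[0, 1, 2], [3]]]
sim = similarity_from_sorts(sorts, 4)
dist = 1.0 - sim
np.fill_diagonal(dist, 0.0)

tree = linkage(squareform(dist), method="average")          # hierarchical clustering
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(dist)            # 2-D perceptual space
```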
Experiment 3 repeated Experiment 2 with four groups of non-native English listeners: Dutch, Mandarin, Turkish and Korean listeners. Taken together, the results of these three experiments represent a step towards establishing an approach to assess the overall phonetic similarity of languages. This approach could potentially provide the basis for developing predictions regarding foreign-accented speech intelligibility for various listener groups, and regarding speech perception accuracy in the context of background noise in various languages. (C) 2010 Elsevier B.V. All rights reserved. C1 [Bradlow, Ann] Northwestern Univ, Dept Linguist, Evanston, IL 60208 USA. [Clopper, Cynthia] Ohio State Univ, Dept Linguist, Columbus, OH 43210 USA. [Smiljanic, Rajka] Univ Texas Austin, Dept Linguist, Austin, TX 78712 USA. [Walter, Mary Ann] Middle E Tech Univ, Ankara, Turkey. RP Bradlow, A (reprint author), Northwestern Univ, Dept Linguist, 2016 Sheridan Rd, Evanston, IL 60208 USA. EM abradlow@northwestern.edu FU NIH [F32 DC007237, R01 DC005794] FX We are grateful to Rachel Baker, Arim Choi and Susanne Brouwer for research assistance. This work was supported by NIH Grants F32 DC007237 and R01 DC005794. CR [Anonymous], 1999, HDB INT PHONETIC ASS BARKAT M, 2001, P EUR 2001, P1065 Bent T, 2003, J ACOUST SOC AM, V114, P1600, DOI 10.1121/1.1603234 Bent T, 2008, PHONETICA, V65, P131, DOI 10.1159/000144077 Best CT, 2001, J ACOUST SOC AM, V109, P775, DOI 10.1121/1.1332378 Boersma P., 2009, PRAAT DOING PHONETIC CALANDRUCCIO L, J SPEECH LA IN PRESS Clopper CG, 2007, J PHONETICS, V35, P421, DOI 10.1016/j.wocn.2006.06.001 Clopper CG, 2008, BEHAV RES METHODS, V40, P575, DOI 10.3758/BRM.40.2.575 CORTER JE, 1982, BEHAV RES METH INSTR, V14, P353 Derwing T. M., 1997, STUDIES 2 LANGUAGE A, V20, P1 Dunn M, 2005, SCIENCE, V309, P2072, DOI 10.1126/science.1114615 Flege J. E., 1995, SPEECH PERCEPTION LI, P233 Hayes-Harb R, 2008, J PHONETICS, V36, P664, DOI 10.1016/j.wocn.2008.04.002 Heeringa W, 2009, SPEECH COMMUN, V51, P167, DOI 10.1016/j.specom.2008.07.006 Imai S, 2005, J ACOUST SOC AM, V117, P896, DOI 10.1121/1.1823291 Kruskal JB, 1978, MULTIDIMENSIONAL SCA Kuhl PK, 2008, PHILOS T R SOC B, V363, P979, DOI 10.1098/rstb.2007.2154 Lecumberri MLG, 2006, J ACOUST SOC AM, V119, P2445, DOI 10.1021/1.2180210 MEYER J, 2003, P INT C PHON SCI MUNRO MJ, 1995, LANG LEARN, V45, P73, DOI 10.1111/j.1467-1770.1995.tb00963.x Munro MJ, 2006, STUD SECOND LANG ACQ, V28, P111, DOI 10.1017/S0272263106060049 Ramus F, 1999, COGNITION, V73, P265, DOI 10.1016/S0010-0277(99)00058-X RHEBERGEN KS, 2005, J ACOUST SOC AM, V118, P1 SMILJANIC R, 2007, P 26 INT C PHON SCI Stibbard RM, 2006, J ACOUST SOC AM, V120, P433, DOI 10.1121/1.2203595 Stockmal V, 2000, APPL PSYCHOLINGUIST, V21, P383, DOI 10.1017/S0142716400003052 Strange W., 2007, LANGUAGE EXPERIENCE, P35, DOI 10.1075/lllt.17.08str TAKANE Y, 1977, PSYCHOMETRIKA, V42, P7, DOI 10.1007/BF02293745 Van Engen KJ, 2007, J ACOUST SOC AM, V121, P519, DOI 10.1121/1.2400666 van Wijngaarden SJ, 2001, SPEECH COMMUN, V35, P103, DOI 10.1016/S0167-6393(00)00098-4 van Wijngaarden SJ, 2002, J ACOUST SOC AM, V111, P1906, DOI 10.1121/1.1456928 VASILESCU I., 2005, P INT, P1773 VASILESCU I, 2000, P INT C SPOK LANG PR, V2, P543 NR 34 TC 9 Z9 9 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD NOV-DEC PY 2010 VL 52 IS 11-12 SI SI BP 930 EP 942 DI 10.1016/j.specom.2010.06.003 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 692BO UT WOS:000285126700007 ER PT J AU Van Engen, KJ AF Van Engen, Kristin J. TI Similarity and familiarity: Second language sentence recognition in first- and second-language multi-talker babble SO SPEECH COMMUNICATION LA English DT Article DE Speech-in-noise perception; Informational masking; Multi-talker babble; Bilingual speech perception ID NONNATIVE LISTENERS; SPEECH-PERCEPTION; INFORMATIONAL MASKING; NATIVE-LANGUAGE; CLEAR SPEECH; NOISE; REVERBERATION; ENGLISH; INTELLIGIBILITY; IDENTIFICATION AB The intelligibility of speech in noisy environments depends not only on the functionality of listeners' peripheral auditory systems, but also on cognitive factors such as their language learning experience. Previous studies have shown, for example, that normal-hearing listeners attending to a non-native language have more difficulty in identifying speech targets in noisy conditions than do native listeners. Furthermore, native listeners have more difficulty in understanding speech targets in the presence of speech noise in their native language versus a foreign language. The present study addresses the role of listeners' experience with both the target and noise languages by examining second-language sentence recognition in first- and second-language noise. Native English speakers and non-native English speakers whose native language is Mandarin were tested on English sentence recognition in English and Mandarin 2-talker babble. Results show that both listener groups experienced greater difficulty in English versus Mandarin babble, but that native Mandarin listeners experienced a smaller release from masking in Mandarin babble relative to English babble. These results indicate that both the similarity between the target and noise and the language experience of the listeners contribute to the amount of interference listeners experience when listening to speech in the presence of speech noise. (C) 2010 Elsevier B.V. All rights reserved. C1 Northwestern Univ, Dept Linguist, Evanston, IL 60208 USA. RP Van Engen, KJ (reprint author), Northwestern Univ, Dept Linguist, 2016 Sheridan Rd, Evanston, IL 60208 USA. EM k-van@northwestern.edu FU NIH-NIDCD [F31DC009516, R01-DC005794] FX The author thanks Ann Bradlow for helpful discussions at various stages of this project. Special thanks also to Chun Chan for software development and technical support, to Page Piccinini for assistance in data collection, and to Matt Goldrick for assistance with data analysis. This research was supported by Award No. F31DC009516 (Kristin Van Engen, PI) and Grant No. R01-DC005794 from NIH-NIDCD (Ann Bradlow, PI). The content is solely the responsibility of the author and does not necessarily represent the official views of the NIDCD or the NIH. 
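The "release from masking" discussed in the abstract above is simply the per-listener improvement when the babble changes from the harder to the easier condition, usually expressed as a threshold difference in dB. A minimal sketch with invented numbers (not the study's data):

```python
import numpy as np

# SNR thresholds in dB per listener: lower = better. Hypothetical values only.
english_babble = np.array([-2.0, -1.5, -3.0, -2.5])
mandarin_babble = np.array([-4.5, -4.0, -5.5, -4.0])

# Release from masking: positive values mean the Mandarin-babble condition was easier.
release = english_babble - mandarin_babble
print("per-listener release (dB):", release)
print("mean release (dB):", round(release.mean(), 2))
```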
CR Bamford J., 1979, SPEECH HEARING TESTS, P148 Bradlow AR, 2007, J ACOUST SOC AM, V121, P2339, DOI 10.1121/1.2642103 Bradlow AR, 2002, J ACOUST SOC AM, V112, P272, DOI 10.1121/1.1487837 Brungart DS, 2001, J ACOUST SOC AM, V110, P2527, DOI 10.1121/1.1408946 CALANDRUCCIO L, J SPEECH LA IN PRESS Callan DE, 2004, NEUROIMAGE, V22, P1182, DOI 10.1016/j.neuroimage.2004.03.006 Clahsen H, 2006, TRENDS COGN SCI, V10, P564, DOI 10.1016/j.tics.2006.10.002 Cooke M, 2008, J ACOUST SOC AM, V123, P414, DOI 10.1121/1.2804952 Cooke M, 2006, J ACOUST SOC AM, V119, P1562, DOI 10.1121/1.2166600 Cutler A, 2004, J ACOUST SOC AM, V116, P3668, DOI 10.1121/1.1810292 Cutler A, 2008, J ACOUST SOC AM, V124, P1264, DOI 10.1121/1.2946707 Durlach N, 2006, J ACOUST SOC AM, V120, P1787, DOI 10.1121/1.2335426 *ETS, 2005, TOEFL INT BAS TEST S Felty RA, 2009, J ACOUST SOC AM, V125, pEL93, DOI 10.1121/1.3073733 Freyman RL, 2004, J ACOUST SOC AM, V115, P2246, DOI 10.1121/1.689343 Hazan V, 2000, LANG SPEECH, V43, P273 Jaeger TF, 2008, J MEM LANG, V59, P434, DOI 10.1016/j.jml.2007.11.007 Kidd Jr G., 2007, AUDITORY PERCEPTION, P143 Lecumberri MLG, 2006, J ACOUST SOC AM, V119, P2445, DOI 10.1021/1.2180210 Marion V, 2007, J SPEECH LANG HEAR R, V50, P940, DOI 10.1044/1092-4388(2007/067) Mattys SL, 2009, COGNITIVE PSYCHOL, V59, P203, DOI 10.1016/j.cogpsych.2009.04.001 Mayo LH, 1997, J SPEECH LANG HEAR R, V40, P686 Mueller JL, 2005, SECOND LANG RES, V21, P152, DOI 10.1191/0267658305sr256oa NABELEK AK, 1984, J ACOUST SOC AM, V75, P632 NILSSON M, 1994, J ACOUST SOC AM, V95, P1085, DOI 10.1121/1.408469 R Development Core Team, 2005, R LANG ENV STAT COMP RHEBERGEN KS, 2005, J ACOUST SOC AM, V118, P1 Rogers CL, 2006, APPL PSYCHOLINGUIST, V27, P465, DOI 10.1017/S014271640606036X Smiljanic R, 2005, J ACOUST SOC AM, V118, P1677, DOI 10.1121/1.2000788 Sperry J L, 1997, J Am Acad Audiol, V8, P71 STUDEBAKER GA, 1985, J SPEECH HEAR RES, V28, P455 TAKATA Y, 1990, J ACOUST SOC AM, V88, P663, DOI 10.1121/1.399769 TICE R, 1998, LEVEL16 VANENGEN KJ, 2007, J ACOUST SOC AM, V122, P2994, DOI 10.1121/1.2942684 Van Engen KJ, 2007, J ACOUST SOC AM, V121, P519, DOI 10.1121/1.2400666 van Wijngaarden SJ, 2002, J ACOUST SOC AM, V111, P1906, DOI 10.1121/1.1456928 VONHAPSBURG D, 2004, J AM ACAD AUDIOL, V14, P559 von Hapsburg D, 2002, J SPEECH LANG HEAR R, V45, P202 NR 38 TC 17 Z9 17 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV-DEC PY 2010 VL 52 IS 11-12 SI SI BP 943 EP 953 DI 10.1016/j.specom.2010.05.002 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 692BO UT WOS:000285126700008 ER PT J AU Cooke, M Lecumberri, MLG Scharenborg, O van Dommelen, WA AF Cooke, Martin Garcia Lecumberri, Maria Luisa Scharenborg, Odette van Dommelen, Wim A. TI Language-independent processing in speech perception: Identification of English intervocalic consonants by speakers of eight European languages SO SPEECH COMMUNICATION LA English DT Article DE Consonant identification; Non-native; Cross-language; Noise ID NONNATIVE LISTENERS; BACKGROUND-NOISE; CUE-ENHANCEMENT; NATIVE SPEAKERS; NORMAL-HEARING; INTELLIGIBILITY; RECOGNITION; CONFUSIONS; TALKER; MASKING AB Processing speech in a non-native language requires listeners to cope with influences from their first language and to overcome the effects of limited exposure and experience. These factors may be particularly important when listening in adverse conditions. 
However, native listeners also suffer in noise, and the intelligibility of speech in noise clearly depends on factors which are independent of a listener's first language. The current study explored the issue of language-independence by comparing the responses of eight listener groups differing in native language when confronted with the task of identifying English intervocalic consonants in three masker backgrounds, viz. stationary speech-shaped noise, temporally-modulated speech-shaped noise and competing English speech. The study analysed the effects of (i) noise type, (ii) speaker, (iii) vowel context, (iv) consonant, (v) phonetic feature classes, (vi) stress position, (vii) gender and (viii) stimulus onset relative to noise onset. A significant degree of similarity in the response to many of these factors was evident across all eight language groups, suggesting that acoustic and auditory considerations play a large role in determining intelligibility. Language-specific influences were observed in the rankings of individual consonants and in the masking effect of competing speech relative to speech-modulated noise. (C) 2010 Elsevier B.V. All rights reserved. C1 [Cooke, Martin] Basque Fdn Sci, Bilbao 48011, Spain. [Cooke, Martin; Garcia Lecumberri, Maria Luisa] Univ Basque Country, Fac Letters, Language & Speech Lab, Vitoria 01006, Spain. [Scharenborg, Odette] Radboud Univ Nijmegen, Ctr Language & Speech Technol, NL-6500 HD Nijmegen, Netherlands. [van Dommelen, Wim A.] NTNU, Dept Language & Commun Studies, NO-7491 Trondheim, Norway. RP Cooke, M (reprint author), Basque Fdn Sci, Bilbao 48011, Spain. EM m.cooke@ikerbasque.org RI Scharenborg, Odette/E-2056-2012 FU EU; Netherlands Organisation for Scientific Research (NWO) FX Corpus recording, annotation and native English listening tests took place while Martin Cooke was at the University of Sheffield, UK. We extend our thanks to Francesco Cutugno, Mircea Giurgiu, Bernd Meyer and Jan Volin for coordinating listener groups in Naples, Cluj-Napoca, Oldenburg and Prague; Youyi Lu (University of Sheffield) for speech material; Stuart Rosen (UCL) for making available the FIX software package; and the developers of the R statistical language R Development Core Team (2008). All authors were supported by the EU Marie Curie Research Training Network "Sound to Sense". Odette Scharenborg was supported by a Veni-grant from the Netherlands Organisation for Scientific Research (NWO). We also thank Marc Swerts and the reviewers for their insightful comments on an earlier version of the paper. CR AINSWORTH WA, 1994, J ACOUST SOC AM, V96, P687, DOI 10.1121/1.410306 Alamsaputra DM, 2006, AUGMENT ALTERN COMM, V22, P258, DOI 10.1080/00498250600718555 Barker J, 2007, SPEECH COMMUN, V49, P402, DOI 10.1016/j.specom.2006.11.003 Benki JR, 2003, PHONETICA, V60, P129, DOI 10.1159/000071450 Best C. 
T., 1995, SPEECH PERCEPTION LI, P171 Bradlow AR, 1999, J ACOUST SOC AM, V106, P2074, DOI 10.1121/1.427952 Bradlow AR, 1996, SPEECH COMMUN, V20, P255, DOI 10.1016/S0167-6393(96)00063-5 Bradlow AR, 2007, J ACOUST SOC AM, V121, P2339, DOI 10.1121/1.2642103 Bradlow AR, 2002, J ACOUST SOC AM, V112, P272, DOI 10.1121/1.1487837 BRUNGART DS, 2001, J ACOUST SOC AM, V100, P2527 CARHART R, 1969, J ACOUST SOC AM, V45, P694, DOI 10.1121/1.1911445 Cervera T, 2005, ACTA ACUST UNITED AC, V91, P132 Cooke M, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P1765 Cooke M, 2008, J ACOUST SOC AM, V123, P414, DOI 10.1121/1.2804952 Cooke M, 2006, J ACOUST SOC AM, V119, P1562, DOI 10.1121/1.2166600 Cutler A, 2004, J ACOUST SOC AM, V116, P3668, DOI 10.1121/1.1810292 Cutler A, 2008, J ACOUST SOC AM, V124, P1264, DOI 10.1121/1.2946707 CUTLER A, 1979, SENTENCE PROCESSING, P171 CUTLER A, 1987, COGNITIVE PSYCHOL, V19, P141, DOI 10.1016/0010-0285(87)90010-7 Darwin CJ, 2008, PHILOS T R SOC B, V363, P1011, DOI 10.1098/rstb.2007.2156 Detey S, 2008, LINGUA, V118, P66, DOI 10.1016/j.lingua.2007.04.003 DUBNO JR, 1981, J ACOUST SOC AM, V69, P249, DOI 10.1121/1.385345 FESTEN JM, 1990, J ACOUST SOC AM, V88, P1725, DOI 10.1121/1.400247 Flege J. E., 1995, SPEECH PERCEPTION LI, P233 Florentine M, 1984, J ACOUST SOC AM, V75, pS84, DOI 10.1121/ 1.2021645 FOSS DJ, 1980, COGNITIVE PSYCHOL, V12, P1, DOI 10.1016/0010-0285(80)90002-X Fullgrabe C, 2006, HEARING RES, V211, P74, DOI 10.1016/j.heares.2005.09.001 Gamer M, 2007, IRR VARIOUS COEFFICI Hazan V, 2000, LANG SPEECH, V43, P273 Hazan V, 1998, SPEECH COMMUN, V24, P211, DOI 10.1016/S0167-6393(98)00011-9 Hazan V, 2004, J ACOUST SOC AM, V116, P3108, DOI 10.1121/1.1806826 Imai S, 2005, J ACOUST SOC AM, V117, P896, DOI 10.1121/1.1823291 Jiang JT, 2006, J ACOUST SOC AM, V119, P1092, DOI 10.1121/1.2149841 Kendall MG, 1948, RANK CORRELATION MET KUHL PK, 1993, J ACOUST SOC AM, V93, P2423 Lecumberri MLG, 2006, J ACOUST SOC AM, V119, P2445, DOI 10.1021/1.2180210 Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst Lovitt A, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P2154 LU Y, 2010, THESIS U SHEFFIELD MacKay IRA, 2001, PHONETICA, V58, P103, DOI 10.1159/000028490 Mayo LH, 1997, J SPEECH LANG HEAR R, V40, P686 MILLER GA, 1955, J ACOUST SOC AM, V27, P338, DOI 10.1121/1.1907526 Moore B. C., 2004, INTRO PSYCHOL HEARIN Parikh G, 2005, J ACOUST SOC AM, V118, P3874, DOI 10.1121/1.2118407 PICHENY MA, 1985, J SPEECH HEAR RES, V28, P96 Pinheiro J, 2008, NLME LINEAR NONLINEA Pinheiro J. C., 2000, MIXED EFFECTS MODELS R Development Core Team, 2008, R LANG ENV STAT COMP Rogers CL, 2006, APPL PSYCHOLINGUIST, V27, P465, DOI 10.1017/S014271640606036X STUDEBAKER GA, 1985, J SPEECH HEAR RES, V28, P455 TAKATA Y, 1990, J ACOUST SOC AM, V88, P663, DOI 10.1121/1.399769 Van Engen KJ, 2007, J ACOUST SOC AM, V121, P519, DOI 10.1121/1.2400666 van Wijngaarden SJ, 2002, J ACOUST SOC AM, V111, P1906, DOI 10.1121/1.1456928 WANG MD, 1973, J ACOUST SOC AM, V54, P1248, DOI 10.1121/1.1914417 Wright R., 2004, PHONETICALLY BASED P, P34, DOI 10.1017/CBO9780511486401.002 NR 55 TC 9 Z9 10 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD NOV-DEC PY 2010 VL 52 IS 11-12 SI SI BP 954 EP 967 DI 10.1016/j.specom.2010.04.004 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 692BO UT WOS:000285126700009 ER PT J AU van Dommelen, WA Hazan, V AF van Dommelen, Wim A. Hazan, Valerie TI Perception of English consonants in noise by native and Norwegian listeners SO SPEECH COMMUNICATION LA English DT Article DE Consonant identification; Non-native; English; Norwegian; Noise ID NONNATIVE LISTENERS; SPEECH-PERCEPTION; INFORMATIONAL MASKING; RECOGNITION; LANGUAGE; JAPANESE; VOWELS; L2; IDENTIFICATION; ASSIMILATION AB The aim of this study was to investigate factors that affect second language speech perception. Listening tests were run in which native and non-native (Norwegian) participants identified English consonants in VCV syllables in quiet and in different noise conditions. An assimilation test investigated the mapping of English consonants onto Norwegian counterparts. Results of the identification test showed a lower non-native performance but there was no evidence that the non-native disadvantage was greater in noise than in quiet. Poorer identification was found for sounds that occur only in English ('novel category' consonants) but this was the case for both English and Norwegian listeners, and thus likely to be related to the acoustic-phonetic properties of consonants in that category. Information transfer analyses revealed a certain impact of phonological factors on L2 perception, as the transmission of the voicing feature was more affected for Norwegian listeners than the transmission of place or manner information. The relation between the results of the identification in noise and assimilation tasks suggests that, at least in higher proficiency L2 learners, assimilation patterns may not be predictive of listeners' ability to hear non-native speech sounds. (C) 2010 Elsevier B.V. All rights reserved. C1 [van Dommelen, Wim A.] Norwegian Univ Sci & Technol, Dept Language & Commun Studies, N-7491 Trondheim, Norway. [Hazan, Valerie] UCL, Dept Speech Hearing & Phonet Sci, London WC1N 1PF, England. RP van Dommelen, WA (reprint author), Norwegian Univ Sci & Technol, Dept Language & Commun Studies, N-7491 Trondheim, Norway. EM wim.van.dommelen@ntnu.no; v.hazan@ucl.ac.uk RI Hazan, Valerie/C-9722-2009 OI Hazan, Valerie/0000-0001-6572-6679 FU EU FX The speech material used in this study is part of the corpus from the Consonant Challenge project organized by Martin Cooke, Maria Luisa Garcia Lecumberri and Odette preparation of the listening test materials, and for making the data for native speakers available to us. Part of the study was financially supported by the EU Marie Curie Research Training Network "Sound to Sense". We acknowledge the useful comments on an earlier version of this paper given by two anonymous reviewers. CR Aoyama K, 2004, J PHONETICS, V32, P233, DOI 10.1016/S0095-4470(03)00036-6 Best C. 
T., 1995, SPEECH PERCEPTION LI, P171 BEST CT, 1988, J EXP PSYCHOL HUMAN, V14, P345, DOI 10.1037/0096-1523.14.3.345 Best CT, 2007, LANGUAGE EXPERIENCE, P13 Boersma P., 2008, PRAAT DOING PHONETIC Bradlow AR, 1999, J ACOUST SOC AM, V106, P2074, DOI 10.1121/1.427952 Bradlow AR, 2007, J ACOUST SOC AM, V121, P2339, DOI 10.1121/1.2642103 Bradlow AR, 2002, J ACOUST SOC AM, V112, P272, DOI 10.1121/1.1487837 Brungart DS, 2001, J ACOUST SOC AM, V110, P2527, DOI 10.1121/1.1408946 Cebrian J, 2006, J PHONETICS, V34, P372, DOI 10.1016/j.wocn.2005.08.003 Cooke M, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P1765 Cooke M, 2010, SPEECH COMMUN, V52, P954, DOI 10.1016/j.specom.2010.04.004 Cooke M, 2008, J ACOUST SOC AM, V123, P414, DOI 10.1121/1.2804952 Cutler A, 2004, J ACOUST SOC AM, V116, P3668, DOI 10.1121/1.1810292 Cutler A, 2005, SPEECH COMMUN, V47, P32, DOI 10.1016/j.specom.2005.02.001 Cutler A, 2008, J ACOUST SOC AM, V124, P1264, DOI 10.1121/1.2946707 DAVIDSENNIELSEN N, 1975, ENGLISH PHONETICS Docherty Gerard J., 1992, TIMING VOICING BRIT EDWARDS TJ, 1981, J ACOUST SOC AM, V69, P535, DOI 10.1121/1.385482 Flege J. E., 1995, SPEECH PERCEPTION LI, P229 Goedegebure A, 2002, INT J AUDIOL, V41, P414, DOI 10.3109/14992020209090419 Guion SG, 2000, J ACOUST SOC AM, V107, P2711, DOI 10.1121/1.428657 Halle PA, 2007, J ACOUST SOC AM, V121, P2899, DOI 10.1121/1.2534656 HAUGEN E, 1995, NORWEGIAN ENGLISH DI Hazan V, 2000, LANG SPEECH, V43, P273 Iverson P, 2007, J ACOUST SOC AM, V122, P2842, DOI 10.1121/1.2783198 Iverson P, 2003, COGNITION, V87, pB47, DOI 10.1016/S0010-0277(02)00198-1 Kang KH, 2006, J ACOUST SOC AM, V119, P1672, DOI 10.1121/1.2166607 Kingston J, 2003, LANG SPEECH, V46, P295 Kristoffersen G., 2000, PHONOLOGY NORWEGIAN Lecumberri MLG, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P1781 Lecumberri MLG, 2006, J ACOUST SOC AM, V119, P2445, DOI 10.1021/1.2180210 Lengeris A, 2009, PHONETICA, V66, P169, DOI 10.1159/000235659 Levy ES, 2009, J ACOUST SOC AM, V125, P1138, DOI 10.1121/1.3050256 Mayo LH, 1997, J SPEECH LANG HEAR R, V40, P686 MCALLISTER R, 2007, LANGUAGE EXPERIENCE, P153 MILLER GA, 1955, J ACOUST SOC AM, V27, P338, DOI 10.1121/1.1907526 Piske T, 2001, J PHONETICS, V29, P191, DOI 10.1006/jpho.2001.0134 Rhebergen KS, 2005, J ACOUST SOC AM, V118, P1274, DOI 10.1121/1.2000751 Rogers CL, 2006, APPL PSYCHOLINGUIST, V27, P465, DOI 10.1017/S014271640606036X Sagi E, 2008, J ACOUST SOC AM, V123, P2848, DOI 10.1121/1.2897914 Simpson SA, 2005, J ACOUST SOC AM, V118, P2775, DOI 10.1121/1.2062650 Strange W., 2007, LANGUAGE EXPERIENCE, P35, DOI 10.1075/lllt.17.08str Strange W., 2004, J ACOUST SOC AM, V115, P2606 TAKATA Y, 1990, J ACOUST SOC AM, V88, P663, DOI 10.1121/1.399769 van der Horst R, 1999, J ACOUST SOC AM, V105, P1801, DOI 10.1121/1.426718 VANDOMMELEN WA, 2007, P FON 2007 STOCKH MA, V50, P5 VANENGEN KJ, LANG SPEECH Van Engen KJ, 2007, J ACOUST SOC AM, V121, P519, DOI 10.1121/1.2400666 NR 49 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD NOV-DEC PY 2010 VL 52 IS 11-12 SI SI BP 968 EP 979 DI 10.1016/j.specom.2010.05.001 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 692BO UT WOS:000285126700010 ER PT J AU Broersma, M Scharenborg, O AF Broersma, Mirjam Scharenborg, Odette TI Native and non-native listeners' perception of English consonants in different types of noise SO SPEECH COMMUNICATION LA English DT Article DE Speech Perception; Consonants; Identification; Noise; Non-native Language ID SPEECH-IN-NOISE; NORMAL-HEARING; LANGUAGE; INTELLIGIBILITY; IDENTIFICATION; RECOGNITION; CONFUSIONS; BABBLE AB This paper shows that the effect of different types of noise on recognition of different phonemes by native versus non-native listeners is highly variable, even within classes of phonemes with the same manner or place of articulation. In a phoneme identification experiment, English and Dutch listeners heard all 24 English consonants in VCV stimuli in quiet and in three types of noise: competing talker, speech-shaped noise, and modulated speech-shaped noise (all with SNRs of -6 dB). Differential effects of noise type for English and Dutch listeners were found for eight consonants (/p t k g m n eta r/) but not for the other 16 consonants. For those eight consonants, effects were again highly variable: each noise type hindered non-native listeners more than native listeners for some of the target sounds, but none of the noise types did so for all of the target sounds, not even for phonemes with the same manner or place of articulation. The results imply that the noise types employed will strongly affect the outcomes of any study of native and non-native speech perception in noise. (C) 2010 Elsevier B.V. All rights reserved. C1 [Broersma, Mirjam] Max Planck Inst Psycholinguist, NL-6500 AH Nijmegen, Netherlands. [Broersma, Mirjam] Radboud Univ Nijmegen, Donders Inst Brain Cognit & Behav, NL-6500 HE Nijmegen, Netherlands. Radboud Univ Nijmegen, Ctr Language & Speech Technol, NL-6500 HD Nijmegen, Netherlands. RP Broersma, M (reprint author), Max Planck Inst Psycholinguist, POB 310, NL-6500 AH Nijmegen, Netherlands. EM mirjam@mirjambroersma.nl; o.scharenborg@let.ru.nl RI Scharenborg, Odette/E-2056-2012; Broersma, Mirjam/B-2032-2015 OI Broersma, Mirjam/0000-0001-8511-2877 FU Netherlands Organisation for Scientific Research (NWO) FX Each of the authors was supported by an individual (separate) Veni grant from the Netherlands Organisation for Scientific Research (NWO). We would like to thank Martin Cooke (University of the Basque Country) for kindly providing the English VCV data and two anonymous reviewers for helpful comments. 
CR BOHN OS, 2007, HONOR JE FLEGE Bradlow AR, 2007, J ACOUST SOC AM, V121, P2339, DOI 10.1121/1.2642103 Bradlow AR, 2002, J ACOUST SOC AM, V112, P272, DOI 10.1121/1.1487837 Broersma M, 2005, J ACOUST SOC AM, V117, P3890, DOI 10.1121/1.1906060 Broersma M, 2010, J ACOUST SOC AM, V127, P1636, DOI 10.1121/1.3292996 Broersma M, 2008, J ACOUST SOC AM, V124, P712, DOI 10.1121/1.2940578 COOKE M, 2008, P INT 2008 BRISB AUS Cooke M, 2008, J ACOUST SOC AM, V123, P414, DOI 10.1121/1.2804952 Cooke M, 2006, J ACOUST SOC AM, V119, P1562, DOI 10.1121/1.2166600 Cutler A, 2004, J ACOUST SOC AM, V116, P3668, DOI 10.1121/1.1810292 Cutler A, 2005, SPEECH COMMUN, V47, P32, DOI 10.1016/j.specom.2005.02.001 Cutler A, 2008, J ACOUST SOC AM, V124, P1264, DOI 10.1121/1.2946707 Felty RA, 2009, J ACOUST SOC AM, V125, pEL93, DOI 10.1121/1.3073733 FESTEN JM, 1990, J ACOUST SOC AM, V88, P1725, DOI 10.1121/1.400247 Golestani N, 2009, BILING-LANG COGN, V12, P385, DOI 10.1017/S1366728909990150 Gussenhoven C., 1999, HDB INT PHONETIC ASS, P74 Hazan V, 2000, LANG SPEECH, V43, P273 LECUMBERRI MLG, 2008, P INT 2008 BRISB AUS LU Y, 2010, THESIS U SHEFFIELD U Maniwa K, 2008, J ACOUST SOC AM, V123, P1114, DOI 10.1121/1.2821966 Mayo LH, 1997, J SPEECH LANG HEAR R, V40, P686 NABELEK AK, 1984, J ACOUST SOC AM, V75, P632 Phatak SA, 2008, J ACOUST SOC AM, V124, P1220, DOI 10.1121/1.2913251 Simpson SA, 2005, J ACOUST SOC AM, V118, P2775, DOI 10.1121/1.2062650 Strange W., 1995, SPEECH PERCEPTION LI Van Engen KJ, 2007, J ACOUST SOC AM, V121, P519, DOI 10.1121/1.2400666 van Wijngaarden SJ, 2001, SPEECH COMMUN, V35, P103, DOI 10.1016/S0167-6393(00)00098-4 van Wijngaarden SJ, 2002, J ACOUST SOC AM, V111, P1906, DOI 10.1121/1.1456928 Wagner A, 2006, J ACOUST SOC AM, V120, P2267, DOI 10.1121/1.2335422 WANG MD, 1973, J ACOUST SOC AM, V54, P1248, DOI 10.1121/1.1914417 NR 30 TC 10 Z9 11 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV-DEC PY 2010 VL 52 IS 11-12 SI SI BP 980 EP 995 DI 10.1016/j.specom.2010.08.010 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 692BO UT WOS:000285126700011 ER PT J AU Hazan, V Kim, J Chen, YC AF Hazan, Valerie Kim, Jeesun Chen, Yuchun TI Audiovisual perception in adverse conditions: Language, speaker and listener effects SO SPEECH COMMUNICATION LA English DT Article DE Audiovisual; L2 perception; Speaker/listener variability ID SPEECH-PERCEPTION; CONSONANT RECOGNITION; NONNATIVE LISTENERS; HEARING LIPS; VISUAL CUES; NOISE; ENGLISH; REVERBERATION; 2ND-LANGUAGE; INTEGRATION AB This study investigated the relative contribution of auditory and visual information to speech perception by looking at the effect of visual and auditory degradation on the weighting given to visual cues for native and non-native speakers. Multiple iterations of /ba/, /da/ and /ga/ by five Australian English and five Mandarin Chinese speakers were presented to Australian English, British English and Mandarin Chinese participants. Tokens were presented in auditory, visual and congruent/incongruent audiovisual (AV) modes, either in clear or with visual degradation (blurring), auditory degradation (noise) or combined degradations. In the AV clear condition, English-speaking participants showed greater visual weighting for non-native speakers, but this was not found for Chinese participants. 
In 'single-channel degradation' conditions, the weighting of the intact channel increased significantly, with little influence of speaker language. There was no strong evidence of native-language effects on the weighting of visual cues. The degree of visual weighting varied widely across individual participants, and was also affected by individual speaker characteristics. The weighting of auditory and visual cues is therefore highly flexible and dependent on the information load of each channel; non-native speaker and language-background effects may influence visual weighting but individual perceiver and speaker strategies also have a strong impact. (C) 2010 Elsevier B.V. All rights reserved. C1 [Hazan, Valerie] UCL, London WC1N 1PF, England. [Kim, Jeesun] Univ Western Sydney, MARCS Auditory Labs, Penrith, NSW 1797, Australia. [Chen, Yuchun] Natl Taiwan Normal Univ, Dept Special Educ, Taipei 10644, Taiwan. RP Hazan, V (reprint author), UCL, Chandler House,2 Wakefield St, London WC1N 1PF, England. EM v.hazan@ucl.ac.uk; j.kim@uws.edu.au RI Hazan, Valerie/C-9722-2009 OI Hazan, Valerie/0000-0001-6572-6679 FU Australian Research Council [DP0666857, TS0669874] FX We are extremely grateful to Chris Davis for his advice at all stages of this study. We also thank the following for their valuable contribution to the study: Steve Nevard and Andrew Faulkner for help with the audiovisual recordings, Erin Cvejic and Michael Fitzpatrick for help in the processing of the speech materials, Jennifer Le in running the data collection in Australia. The second author acknowledges the support of Australian Research Council, Grant Nos. DP0666857 and TS0669874. CR Best C. T., 1995, SPEECH PERCEPTION LI, P171 Best CT, 2007, LANGUAGE EXPERIENCE, P13 BINNIE CA, 1974, J SPEECH HEAR RES, V17, P619 BRANDY WT, 1966, J SPEECH HEAR RES, V9, P461 Chen TH, 2004, PERCEPT PSYCHOPHYS, V66, P820, DOI 10.3758/BF03194976 Chen Y., 2007, P 16 INT C PHON SCI, P2177 Chen YC, 2009, J ACOUST SOC AM, V126, P858, DOI 10.1121/1.3158823 Cutler A, 2004, J ACOUST SOC AM, V116, P3668, DOI 10.1121/1.1810292 Cutler A, 2008, J ACOUST SOC AM, V124, P1264, DOI 10.1121/1.2946707 Davis C, 2001, ARTIF INTELL REV, V16, P37, DOI 10.1023/A:1011086120667 Davis C, 2004, Q J EXP PSYCHOL-A, V57, P1103, DOI 10.1080/02724980343000701 DEGELDER B, 1992, COGNITIVE PROCESSING, P413 DEGELDER B, 1995, P 4 EUR C SPEECH COM, P1699 DODD B, 1977, PERCEPTION, V6, P31, DOI 10.1068/p060031 ERBER NP, 1969, J SPEECH HEAR RES, V12, P423 Fixmer E., 1998, P AUD VIS SPEECH PRO, P27 Flege J. E., 1995, SPEECH PERCEPTION LI, P229 Forster KI, 2003, BEHAV RES METH INS C, V35, P116, DOI 10.3758/BF03195503 Fuster-Duran A., 1996, SPEECHREADING HUMANS, P135 Gagne J. P., 1994, J ACAD REHABIL AUDIO, V27, P135 Gagne JP, 2002, SPEECH COMMUN, V37, P213, DOI 10.1016/S0167-6393(01)00012-7 Grant KW, 1998, J ACOUST SOC AM, V103, P2677, DOI 10.1121/1.422788 Grant KW, 2000, J ACOUST SOC AM, V108, P1197, DOI 10.1121/1.1288668 Hardison DM, 1999, LANG LEARN, V49, P213, DOI 10.1111/0023-8333.49.s1.7 HAYASHI Y, 1998, P AUD VIS SPEECH PRO, P61 Hazan V, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P1191 Hazan V, 2006, J ACOUST SOC AM, V119, P1740, DOI 10.1121/1.2166611 Kuhl P. K., 1994, P INT C SPOK LANG PR, P539 Lecumberri MLG, 2006, J ACOUST SOC AM, V119, P2445, DOI 10.1021/1.2180210 MacDonald J, 2000, PERCEPTION, V29, P1155, DOI 10.1068/p3020 Massaro D. 
W., 1998, PERCEIVING TALKING F MASSARO DW, 1995, MEM COGNITION, V23, P113, DOI 10.3758/BF03210561 Mayo LH, 1997, J SPEECH LANG HEAR R, V40, P686 MCGURK H, 1976, NATURE, V264, P746, DOI 10.1038/264746a0 Navarra J, 2007, PSYCHOL RES-PSYCH FO, V71, P4, DOI 10.1007/s00426-005-0031-5 Nielsen K., 2004, P INTERSPEECH 2004, P2533 Ortega-Llebaria M., 2001, P INT C AUD VIS SPEE, P149 Rogers CL, 2006, APPL PSYCHOLINGUIST, V27, P465, DOI 10.1017/S014271640606036X Ross Lars A, 2007, Cereb Cortex, V17, P1147, DOI 10.1093/cercor/bhl024 SEKIYAMA K, 2003, P INT C AUD VIS SPEE, P61 SEKIYAMA K, 1993, J PHONETICS, V21, P427 Sekiyama K, 1997, PERCEPT PSYCHOPHYS, V59, P73, DOI 10.3758/BF03206849 Sekiyama K, 2008, DEVELOPMENTAL SCI, V11, P303 SEKIYAMA K, 1995, P 13 INT C PHON SCI, V3, P214 SEKIYAMA K, 1991, J ACOUST SOC AM, V90, P1805 SUMBY WH, 1954, J ACOUST SOC AM, V26, P212, DOI 10.1121/1.1907309 TAKATA Y, 1990, J ACOUST SOC AM, V88, P663, DOI 10.1121/1.399769 Thomas SM, 2002, PERCEPT PSYCHOPHYS, V64, P932, DOI 10.3758/BF03196797 Wang Y, 2008, J ACOUST SOC AM, V124, P1716, DOI 10.1121/1.2956483 Wang Y, 2009, J PHONETICS, V37, P344, DOI 10.1016/j.wocn.2009.04.002 NR 50 TC 8 Z9 8 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV-DEC PY 2010 VL 52 IS 11-12 SI SI BP 996 EP 1009 DI 10.1016/j.specom.2010.05.003 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 692BO UT WOS:000285126700012 ER PT J AU Volin, J Skarnitzl, R AF Volin, Jan Skarnitzl, Radek TI The strength of foreign accent in Czech English under adverse listening conditions SO SPEECH COMMUNICATION LA English DT Article DE Foreign accent; Czech English; Low-pass filtering; Perception; Rhythm metrics; Signal-to-noise ratio ID SPEAKER NORMALIZATION; SPONTANEOUS SPEECH; PERCEIVED ACCENT; INTELLIGIBILITY; PERCEPTION; COMMUNICATION; COMPREHENSION; LANGUAGE; NOISE; AGE AB The study connects two major topics in current speech research: foreign accentedness and speech in adverse conditions. We parallel the research in intelligibility of non-native speech, but instead of linguistic unit recognition we focus on the perception of the foreign accent strength. First, the question of type and degree of perceptual deficiencies occurring along with certain types of signal degradation is tackled. Second, we measure correlations between the accent ratings and certain candidate phenomena that may influence them, e.g., articulation rate, temporal patterning, contrasts in sound pressure levels on selected syllables and F0 variation. The impacts of different types of signal degradation help to estimate the role of segmental/suprasegmental information in assessments of foreignness in Czech English. The full appreciation of the strength of foreign accent is apparently not possible without fine phonetic detail on the segmental level. However, certain suprasegmental features of foreignness are robust enough to manifest at severe levels of signal degradation. Pair-wise variability indices of vowel durations and variation in F0 tracks seem to guide the listener even better in the degraded than in the 'clean' speech signal. (C) 2010 Elsevier B.V. All rights reserved. C1 [Volin, Jan; Skarnitzl, Radek] Charles Univ Prague, Fac Arts, Inst Phonet, Prague 11638 1, Czech Republic. RP Volin, J (reprint author), Charles Univ Prague, Fac Arts, Inst Phonet, Nam Jana Palacha 2, Prague 11638 1, Czech Republic. 
EM jan.volin@ff.cuni.cz; radek.skarnitzl@ff.cuni.cz FU European Union [MRTN-CT-2006-035561]; Czech Ministry of Education [VZ MSM0021620825] FX We thank our anonymous reviewers for many helpful comments on earlier versions of this paper. This work was supported by the European Union Grant MRTN-CT-2006-035561 - Sound to Sense, and by the Czech Ministry of Education Grant VZ MSM0021620825. CR ANDERSONHSIEH J, 1988, LANG LEARN, V38, P561, DOI 10.1111/j.1467-1770.1988.tb00167.x Asu E. L., 2006, P SPEECH PROS 2006 T, P249 Barry WJ, 2003, P 15 INT C PHON SCI, P2693 Bent T, 2003, J ACOUST SOC AM, V114, P1600, DOI 10.1121/1.1603234 BLADON RAW, 1984, LANG COMMUN, V4, P59, DOI 10.1016/0271-5309(84)90019-3 Boersma P., 2009, PRAAT DOING PHONETIC Bradlow AR, 1996, SPEECH COMMUN, V20, P255, DOI 10.1016/S0167-6393(96)00063-5 Bradlow AR, 2002, J ACOUST SOC AM, V112, P272, DOI 10.1121/1.1487837 BRENNAN EM, 1981, J PSYCHOLINGUIST RES, V10, P487, DOI 10.1007/BF01076735 Cooke M, 2008, J ACOUST SOC AM, V123, P414, DOI 10.1121/1.2804952 Derwing TM, 2009, LANG TEACH, V42, P476, DOI 10.1017/S026144480800551X Eskenazi M, 2009, SPEECH COMMUN, V51, P832, DOI 10.1016/j.specom.2009.04.005 FLEGE JE, 1984, J ACOUST SOC AM, V76, P692, DOI 10.1121/1.391256 Flege J. E., 2007, LAB PHONOLOGY, P353 FLEGE JE, 1988, J ACOUST SOC AM, V84, P70, DOI 10.1121/1.396876 FLEGE JE, 1995, J ACOUST SOC AM, V97, P3125, DOI 10.1121/1.413041 GHESQUIERE P, 2002, P INT C AC SPEECH SI, V1, P749 Gibbon D., 2001, P EUROSPEECH, P91 Grabe Esther, 2002, LAB PHONOLOGY, V7, P515 Hahn LD, 2004, TESOL QUART, V38, P201 Hazan V, 2004, J ACOUST SOC AM, V116, P3108, DOI 10.1121/1.1806826 HENTON C, 1990, J PHONETICS, V18, P203 Ikeno A, 2006, INT CONF ACOUST SPEE, P401 INGRAM J, 1987, J PHONETICS, V15, P127 Johnson K, 2005, BLACKW HBK LINGUIST, P363, DOI 10.1002/9780470757024.ch15 Koreman J, 2006, J ACOUST SOC AM, V119, P582, DOI 10.1121/1.2133436 Lecumberri MLG, 2006, J ACOUST SOC AM, V119, P2445, DOI 10.1021/1.2180210 LIEBERMAN P, 1985, J ACOUST SOC AM, V77, P649, DOI 10.1121/1.391883 Mackay IRA, 2006, APPL PSYCHOLINGUIST, V27, P157, DOI 10.1017/S0142716406060231 Magen HS, 1998, J PHONETICS, V26, P381, DOI 10.1006/jpho.1998.0081 MARKOVA P, 2009, P 19 CZECH GERM WORK, P56 Mayo LH, 1997, J SPEECH LANG HEAR R, V40, P686 Meador D, 2000, BILING-LANG COGN, V3, P55, DOI 10.1017/S1366728900000134 Munro M. J., 2001, STUDIES 2 LANGUAGE A, V23, P451 MUNRO MJ, 1995, LANG LEARN, V45, P73, DOI 10.1111/j.1467-1770.1995.tb00963.x Munro MJ, 2006, STUD SECOND LANG ACQ, V28, P111, DOI 10.1017/S0272263106060049 Munro MJ, 1998, LANG LEARN, V48, P159, DOI 10.1111/1467-9922.00038 Paeschke A., 2000, P ISCA WORKSH SPEECH, P75 PFITZINGER HR, 2006, P 3 INT C SPEECH PRO, V1, P105 PFITZINGER HR, 1998, P ICSLP 98 SYDN AUST, P1087 Pickering L, 2001, TESOL QUART, V35, P233, DOI 10.2307/3587647 Ramus F, 1999, COGNITION, V73, P265, DOI 10.1016/S0010-0277(99)00058-X RUBIN DL, 1992, RES HIGH EDUC, V33, P511, DOI 10.1007/BF00973770 Scott SK, 2009, J ACOUST SOC AM, V125, P1737, DOI 10.1121/1.3050255 SKARNITZL R, 2005, P 2 PRAG C LING LIT, P11 Southwood MH, 1999, CLIN LINGUIST PHONET, V13, P335 STRIK H, 2003, P 15 ICPHS, P227 Volin J, 2009, INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, VOLS 1-5, P3051 Wagner P. 
S., 2004, P SPEECH PROS, P227 White L, 2007, P 16 INT C PHON SCI, P1009 Wu TY, 2010, SPEECH COMMUN, V52, P83, DOI 10.1016/j.specom.2009.08.010 NR 51 TC 4 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV-DEC PY 2010 VL 52 IS 11-12 SI SI BP 1010 EP 1021 DI 10.1016/j.specom.2010.06.009 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 692BO UT WOS:000285126700013 ER PT J AU Gooskens, C van Heuven, VJ van Bezooijen, R Pacilly, JJA AF Gooskens, Charlotte van Heuven, Vincent J. van Bezooijen, Renee Pacilly, Jos J. A. TI Is spoken Danish less intelligible than Swedish? SO SPEECH COMMUNICATION LA English DT Article DE Mutual intelligibility; Danish; Swedish; Babble noise; Semantically unpredictable sentences; Map tasks; Cognates ID WORD RECOGNITION; SPEECH; CORPUS; NOISE AB The most straightforward way to explain why Danes understand spoken Swedish relatively better than Swedes understand spoken Danish would be that spoken Danish is intrinsically a more difficult language to understand than spoken Swedish. We discuss circumstantial evidence suggesting that Danish is intrinsically poorly intelligible. We then report on a formal experiment in which we tested the intelligibility of Danish and Swedish materials spoken by three representative male speakers per language (isolated cognate and non-cognate words, words in semantically unpredictable sentences, words in spontaneous interaction in map tasks) presented in descending levels of noise to native listeners of Danish (N = 18) and Swedish (N = 24), respectively. The results show that Danish is as intelligible to Danish listeners as Swedish is to Swedish listeners. In a separate task, the same listeners recognized the same materials (presented without noise) in the neighboring language. The asymmetry that has traditionally been claimed was indeed found, even when differences in familiarity with the non-native language were controlled for. Possible reasons for the asymmetry are discussed. (C) 2010 Elsevier B.V. All rights reserved. C1 [van Heuven, Vincent J.] Leiden Univ, Ctr Linguist, Phonet Lab, NL-2300 RA Leiden, Netherlands. [Gooskens, Charlotte; van Bezooijen, Renee] Univ Groningen, NL-9700 AB Groningen, Netherlands. RP van Heuven, VJ (reprint author), Leiden Univ, Ctr Linguist, Phonet Lab, POB 9515, NL-2300 RA Leiden, Netherlands. EM v.j.j.p.van.heuven@hum.leidenuniv.nl CR Allen Sture, 1970, NUSVENSK FREKVENSORD ANDERSON AH, 1991, LANG SPEECH, V34, P351 BASBOL H, 2005, PHONOLOGY DANISH Benoit C, 1996, SPEECH COMMUN, V18, P381, DOI 10.1016/0167-6393(96)00026-X Bergenholtz H., 1992, DANSK FREKVENSORDBOG Bezooijen Renee van, 1999, LINGUISTICS NETHERLA, P1 Bleses D, 2008, J CHILD LANG, V35, P619, DOI 10.1017/S0305000908008714 Bleses Dorthe, 2004, 20 DAN 2003 S BRAIN, P165 Bo I., 1978, 4 ROG Borestam Ulla, 1987, DANSK SVENSK SPRAKGE BRAUNMULLER K, 2002, APPL LINGUIST, V12, P1 Brink Lars, 1975, DANSK RIGSMAL LYDUDV Delsing Lars-Olof, 2005, HALLER SPRAKET IHOP Derwing T. 
M., 1997, STUDIES 2 LANGUAGE A, V19, P1, DOI [DOI 10.1017/S0272263197001010, 10.1017/S0272263197001010] ELBERLING C, 1989, SCAND AUDIOL, V18, P169, DOI 10.3109/01050398909070742 Elert C-C., 1970, LJUD ORD SVENSKAN ENGSTRAND O, 1990, SWEDISH J INT PHONET, V20, P42 Garlen C., 1984, SVENSKANS FONOLOGI K GOOSKENS C, 2007, NEAR LANGUAGES COLLA, P99 GRONNUM N, 1998, DANISH J INT PHONETI, V28, P99 GRONNUM N, 2003, TAKE DANISH FOR INST Gronnum Nina, 1998, FONETIK FONOLOGI ALM Gronnum N, 2009, SPEECH COMMUN, V51, P594, DOI 10.1016/j.specom.2008.11.002 HAGERMAN B, 1984, SCAND AUDIOL, V13, P57, DOI 10.3109/01050398409076258 HANSEN PM, 1990, UDTALEORDBOG Hedelin Per, 1997, NORSTEDTS SVENSKA UT JENSEN JB, 1989, HISPANIA-J DEV INTER, V72, P848, DOI 10.2307/343562 Kennedy S, 2008, CAN MOD LANG REV, V64, P459, DOI 10.3138/cmlr.64.3.459 LIDEN G, 1954, Acta Otolaryngol Suppl, V116, P189 MALMBERG B, 1968, SVENSK FONETIK Maurud Oivind, 1976, NABOSPRAKSFORSTAELSE Perre L, 2008, BRAIN RES, V1188, P132, DOI 10.1016/j.brainres.2007.10.084 Perre L, 2009, PSYCHOPHYSIOLOGY, V46, P739, DOI 10.1111/j.1469-8986.2009.00813.x Tang Chaoju, 2009, LINGUA, V119, P709, DOI DOI 10.1016/J.LINGUA.2008.10.001 TELEMAN U, 1987, NORDISK SPRAKSEKRETA, V8, P70 van Bezooijen Renee, 1997, HDB STANDARDS RESOUR, P481 Van Heuven Vincent J, 2008, International Journal of Humanities and Arts Computing, V2, DOI 10.3366/E1753854809000305 van Wijngaarden SJ, 2001, SPEECH COMMUN, V35, P103, DOI 10.1016/S0167-6393(00)00098-4 Wagener K, 2003, INT J AUDIOL, V42, P10, DOI 10.3109/14992020309056080 Yule G, 1984, TEACHING TALK STRATE NR 40 TC 6 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV-DEC PY 2010 VL 52 IS 11-12 SI SI BP 1022 EP 1037 DI 10.1016/j.specom.2010.06.005 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 692BO UT WOS:000285126700014 ER PT J AU Heinrich, A Flory, Y Hawkins, S AF Heinrich, Antje Flory, Yvonne Hawkins, Sarah TI Influence of English r-resonances on intelligibility of speech in noise for native English and German listeners SO SPEECH COMMUNICATION LA English DT Article DE Speech intelligibility; L2 acquisition; r-resonances ID SHORT-TERM-MEMORY; VERTICAL-BAR; NONNATIVE LISTENERS; NATURAL SPEECH; PERCEPTION; COARTICULATION; ORGANIZATION; RECOGNITION; CONTRASTS; PATTERNS AB Non-rhotic British English speakers and Germans living in England were compared in their use of short- and long-domain r-resonances (cues to an upcoming [1]) in read English sentences heard in noise. The sentences comprised 52 pairs differing only in /r/ or /I/ in a minimal-pair target word (mirror, miller). Target words were cross-spliced into a different utterance of the same sentence-base (match) and into a base originally containing the other target word (mismatch), making a four-stimulus set for each sentence-pair. Intelligibility of target and some preceding unspliced words was measured. English listeners were strongly influenced by r-resonances in the sonorant immediately preceding the critical /r/. A median split of the German group showed that those who had lived in southeast England for 3-20 months used the weaker long-domain r-resonances, whereas Germans who had lived in England for 21-105 months ignored all r-resonances, possibly in favour of word frequency. 
A preliminary study of German speech showed differences in temporal extent and spectral balance (frequency of F3 and higher forniants) between English and German r-resonances. The perception and production studies together suggest sophisticated application of exposure-induced changes in acoustic phonetic and phonological knowledge of L1 to a partially similar sound in L2. (C) 2010 Elsevier B.V. All rights reserved. C1 [Heinrich, Antje; Hawkins, Sarah] Univ Cambridge, Dept Linguist, Cambridge CB3 9DA, England. [Flory, Yvonne] Univ Saarland, Saarbrucken, Germany. RP Heinrich, A (reprint author), Univ Cambridge, Dept Linguist, Sidgwick Ave, Cambridge CB3 9DA, England. EM ah540@cam.ac.uk FU ERA-AGE (FLARE) [RG49525]; Marie Curie Research Training Network [MRTN-CT-2006-035561-S2S] FX Funded by an ERA-AGE (FLARE) Grant (RG49525) to the first author, and a Marie Curie Research Training Network Grant, Sound to Sense (S2S: MRTN-CT-2006-035561-S2S) to the third author. Experiment 1 was conducted by the second author as part of a Master's thesis at the Universitat des Saarlandes. We thank Pia Rubig for helping with stimulus preparation and data collection, and Bill Barry for discussion. Data for the English listeners comprised part of a paper presented at the 2009 CR American National Standards Institute (ANSI), 1996, S361996 ANSI BARRY WJ, 1995, PHONETICA, V52, P228 BELLBERTI F, 1982, J ACOUST SOC AM, V71, P449, DOI 10.1121/1.387466 BENGUERE.AP, 1974, PHONETICA, V30, P41 Best C. T., 1994, DEV SPEECH PERCEPTIO, P167 Best CT, 2001, J ACOUST SOC AM, V109, P775, DOI 10.1121/1.1332378 Bradlow AR, 2007, J ACOUST SOC AM, V121, P2339, DOI 10.1121/1.2642103 Bronkhorst AW, 2000, ACUSTICA, V86, P117 Carter P., 2003, PAPERS LAB PHONOLOGY, P237 CARTER P, 1999, P 14 INT C PHON SCI, V1, P105 CARTER P, 2002, THESIS U YORK Cohen J., 1988, STAT POWER ANAL BEHA, V2nd Coleman J, 2003, J PHONETICS, V31, P351, DOI 10.1016/j.wocn.2003.10.001 Craik FIM, 1996, J EXP PSYCHOL GEN, V125, P159, DOI 10.1037/0096-3445.125.2.159 Cruttenden Alan, 2001, GIMSONS PRONUNCIATIO Cutler A, 2008, J ACOUST SOC AM, V124, P1264, DOI 10.1121/1.2946707 DARWIN CJ, 1985, SPEECH COMMUN, V4, P231, DOI 10.1016/0167-6393(85)90049-4 Faul F, 2007, BEHAV RES METHODS, V39, P175, DOI 10.3758/BRM.41.4.1149 GRUNKE ME, 1982, PERCEPT PSYCHOPHYS, V31, P210, DOI 10.3758/BF03202525 Hawkins S, 2003, J PHONETICS, V31, P373, DOI 10.1016/j.wocn.2003.09.006 Hawkins S, 2004, J PHONETICS, V32, P199, DOI 10.1016/S0095-4470(03)00031-7 HAWKINS S, 1996, SOUND PATTERNS CONNE, P173 Hawkins S, 2010, J PHONETICS, V38, P60, DOI 10.1016/j.wocn.2009.02.001 HAWKINS S, 1995, P 13 INT C PHON SCI Hawkins S., 1994, P INT C SPOK LANG PR, P57 Heid S., 2000, P 5 SEM SPEECH PROD, P77 Heinrich A, 2008, Q J EXP PSYCHOL, V61, P735, DOI 10.1080/17470210701402372 HUGGINS AWF, 1972, J ACOUST SOC AM, V51, P1279, DOI 10.1121/1.1912972 HUGGINS AWF, 1972, J ACOUST SOC AM, V51, P1270, DOI 10.1121/1.1912971 Kahneman D., 1973, ATTENTION EFFORT KELLY J, 1986, C PUBLICATION, V258, P304 Kelly J., 1989, DOING PHONOLOGY LADEFOGED PETER, 2001, COURSE PHONETICS Laver John, 1994, PRINCIPLES PHONETICS Lecumberri MLG, 2006, J ACOUST SOC AM, V119, P2445, DOI 10.1021/1.2180210 Leech G., 2001, WORD FREQUENCIES WRI Lindau M., 1985, PHONETIC LINGUISTICS, P157 Local J, 2003, J PHONETICS, V31, P321, DOI 10.1016/S0095-4470(03)00045-7 Lodge K, 2003, LINGUA, V113, P931, DOI 10.1016/S0024-3841(02)00142-0 LOFTUS GR, 1994, PSYCHON B REV, V1, P476, DOI 10.3758/BF03210951 LUCE PA, 1983, HUM FACTORS, V25, P17 Luce PA, 1998, 
EAR HEARING, V19, P1, DOI 10.1097/00003446-199802000-00001 MACHELETT K, 1996, LESEN SONAGRAMMEN V1 Mayo LH, 1997, J SPEECH LANG HEAR R, V40, P686 Moore RK, 2007, SPEECH COMMUN, V49, P418, DOI 10.1016/j.specom.2007.01.011 Nieto-Castanon A, 2005, J ACOUST SOC AM, V117, P3196, DOI 10.1121/1.1893271 Ogden R., 2009, INTRO ENGLISH PHONET Raven J., 1998, RAVEN MANUAL SECTION Raven J. C., 1982, MILL HILL VOCABULARY REMEZ RE, 1994, PSYCHOL REV, V101, P129, DOI 10.1037/0033-295X.101.1.129 Remez RE, 2003, J PHONETICS, V31, P293, DOI 10.1016/S0095-4470(03)00042-1 REMEZ RE, 1994, HDB PSYCHOLINGUISTIC, P145 Secord W., 2007, ELICITING SOUNDS TEC Simpson Adrian P., 1998, ZAS WORKING PAPERS L, V11, P91 Sommers MS, 1996, PSYCHOL AGING, V11, P333, DOI 10.1037/0882-7974.11.2.333 Stevens K.N., 1998, ACOUSTIC PHONETICS Ito Kikuyo, 2009, J Acoust Soc Am, V125, P2348, DOI 10.1121/1.3082103 Tunley A., 1999, THESIS U CAMBRIDGE ULBRICHT H, 1972, INSTRUMENTALPHONETIS Wagener KC, 2005, INT J AUDIOL, V44, P144, DOI 10.1080/14992020500057517 WEST P, 1999, P 14 INT C PHON SCI, V3, P1901 WEST P, 2001, J PHONETICS, V27, P405 Zhou XH, 2008, J ACOUST SOC AM, V123, P4466, DOI 10.1121/1.2902168 NR 63 TC 6 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV-DEC PY 2010 VL 52 IS 11-12 SI SI BP 1038 EP 1055 DI 10.1016/j.specom.2010.09.009 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 692BO UT WOS:000285126700015 ER PT J AU Hansen, JHL Gray, SS Kim, W AF Hansen, John H. L. Gray, Sharmistha S. Kim, Wooil TI Automatic voice onset time detection for unvoiced stops (/p/,/t/,/k/) with application to accent classification SO SPEECH COMMUNICATION LA English DT Article DE Voice Onset Time (VOT); Voice Onset Region (VOR); Teager Energy Operator (TEO); Accent classification ID AMERICAN ENGLISH; FOREIGN ACCENT; SPEECH; RECOGNITION; TRANSFORMS; PERCEPTION; INVARIANT; ROTATION; OPERATOR; FRENCH AB Articulation characteristics of particular phonemes can provide cues to distinguish accents in spoken English. For example, as shown in Arslan and Hansen (1996, 1997), Voice Onset Time (VOT) can be used to classify mandarin, Turkish, German and American accented English. Our goal in this study is to develop an automatic system that classifies accents using VOT in unvoiced stops(1). VOT is an important temporal feature which is often overlooked in speech perception, speech recognition, as well as accent detection. Fixed length frame-based speech processing inherently ignores VOT. In this paper, a more effective VOT detection scheme using the non-linear energy tracking algorithm Teager Energy Operator (TEO), across a sub-frequency band partition for unvoiced stops (/p/, /t/ and /k/), is introduced. The proposed VOT detection algorithm also incorporates spectral differences in the Voice Onset Region (VOR) and the succeeding vowel of a given stop-vowel sequence to classify speakers having accents due to different ethnic origin. The spectral cues are enhanced using one of the four types of feature parameter extractions - Discrete Mellin Transform (DMT), Discrete Mellin Fourier Transform (DMFT) and Discrete Wavelet Transform using the lowest and the highest frequency resolutions (DWTlfr and DWThfr). A Hidden Markov Model (HMM) classifier is employed with these extracted parameters and applied to the problem of accent classification. 
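The abstract above centres on the Teager Energy Operator (TEO) as the core of the VOT detector. As a rough illustration only, the NumPy sketch below computes the discrete TEO of a waveform and a smoothed energy profile; it does not reproduce the paper's sub-band partition, the spectral-difference features in the voice onset region, or the HMM back end, and the smoothing helper and its window length are illustrative assumptions rather than the authors' settings.

```python
import numpy as np

def teager_energy(x):
    """Discrete Teager Energy Operator: psi[n] = x[n]**2 - x[n-1]*x[n+1]."""
    x = np.asarray(x, dtype=float)
    psi = np.empty_like(x)
    psi[1:-1] = x[1:-1] ** 2 - x[:-2] * x[2:]
    psi[0], psi[-1] = psi[1], psi[-2]  # pad the edges by replication
    return psi

def smooth(energy, win=80):
    """Moving-average smoothing of a TEO profile (window length is arbitrary)."""
    kernel = np.ones(win) / win
    return np.convolve(energy, kernel, mode="same")

# Hypothetical usage: for an unvoiced stop, the burst followed by the low-energy
# gap before voicing onset appears as a dip in the smoothed TEO profile, which a
# detector could threshold to estimate the voice onset time.
```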
Three different language groups (American English, Chinese, and Indian) are used from the CU-Accent database. The VOT is detected with less than 10% error when compared to the manual detected VOT with a success rate of 79.90%, 87.32% and 47.73% for English, Chinese and Indian speakers (includes atypical cases for Indian case), respectively. It is noted that the DMT and DWTlfr features are good for parameterizing speech samples which exhibit substitution of succeeding vowel after the stop in accented speech. The successful accent classification rates of DMT and DWTlfr features are 66.13% and 71.67%, for /p/ and /t/ respectively, for pairwise accent detection. Alternatively, the DMFT feature works on all accent sensitive words considered, with a success rate of 70.63%. This study shows that effective VOT detection can be achieved using an integrated TEO processing with spectral difference analysis in the VOR that can be employed for accent classification. (C) 2010 Elsevier B.V. All rights reserved. C1 [Hansen, John H. L.; Gray, Sharmistha S.; Kim, Wooil] Univ Texas Dallas, CRSS, Erik Jonsson Sch Engn & Comp Sci, Dept Elect Engn, Richardson, TX 75080 USA. RP Hansen, JHL (reprint author), Univ Texas Dallas, CRSS, Erik Jonsson Sch Engn & Comp Sci, Dept Elect Engn, 2601 N Floyd Rd,EC33, Richardson, TX 75080 USA. EM john.hansen@utdallas.edu FU US Air Force Research Laboratory, Rome NY [FA8750-04-1-0058] FX This work was supported by the US Air Force Research Laboratory, Rome NY, under contract number FA8750-04-1-0058. CR Allen JS, 2003, J ACOUST SOC AM, V113, P544, DOI 10.1121/1.1528172 Arslan LM, 1996, SPEECH COMMUN, V18, P353, DOI 10.1016/0167-6393(96)00024-6 Arslan LM, 1997, J ACOUST SOC AM, V102, P28, DOI 10.1121/1.419608 Bahoura M, 2001, IEEE SIGNAL PROC LET, V8, P10, DOI 10.1109/97.889636 BERKLING K, 2002, SPEECH COMMUN, V35, P125 CAIRNS DA, 1994, J ACOUST SOC AM, V96, P3392, DOI 10.1121/1.410601 CHAN AYW, 2000, CULTURE CURRICULUM, V13, P67 CHEN QS, 1994, IEEE T PATTERN ANAL, V16, P1156 Comrie B., 1990, WORLDS MAJOR LANGUAG DAS S, 2004, IEEE NORSIG, P344 Deller J., 1999, DISCRETE TIME PROCES Esposito A, 2002, PHONETICA, V59, P197, DOI 10.1159/000068347 FANG M, 1990, APPL OPTICS, V29, P704, DOI 10.1364/AO.29.000704 FLEGE JE, 1984, J ACOUST SOC AM, V76, P692, DOI 10.1121/1.391256 FLEGE JE, 1988, J ACOUST SOC AM, V84, P70, DOI 10.1121/1.396876 Francis AL, 2003, J ACOUST SOC AM, V113, P1025, DOI 10.1121/1.1536169 Ghesquiere PJ, 2002, INT CONF ACOUST SPEE, P749 GRAY SS, 2005, IEEE AUT SPEECH REC GRAY SS, 2005, THESIS U COLORADO BO GROVER C, 1987, LANG SPEECH, V30, P277 Hansen JHL, 1998, IEEE T BIO-MED ENG, V45, P300, DOI 10.1109/10.661155 HOSHINO A, 2003, IEEE ICASSP 03, P472 Johnson C., 2002, INT J BILINGUAL, V6, P271, DOI 10.1177/13670069020060030401 KAISER JF, 1990, INT CONF ACOUST SPEE, P381, DOI 10.1109/ICASSP.1990.115702 KAZEMZADEH A, 2006, INT 2006 Kumpf K, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P1740 KUMPF K, 1997, EUROSPEECH 97 Ladefoged Peter, 1993, COURSE PHONETICS Levkovitz J, 1997, APPL OPTICS, V36, P3035, DOI 10.1364/AO.36.003035 LOPEZBASCUAS LE, 2004, J ACOUST SOC AM, V115, P2465 MAHADEVA PSR, 2009, IEEE T AUDIO SIGNAL, V17, P556 Major R. 
C, 2001, FOREIGN ACCENT ONTOG MARAGOS P, 1993, IEEE T SIGNAL PROCES, V41, P3024, DOI 10.1109/78.277799 MCGORY J, 2001, J ACOUST SOC AM, V109, P2474 NIST/SEMATECH, 2005, E HDB STAT METH Peng Long, 2004, WORLD ENGLISH, V23, P535, DOI 10.1111/j.0083-2919.2004.00376.x POSER WJ, 2004, LANGUAGE LOG POST 20 ROSNER BS, 1984, J ACOUST SOC AM, V75, P1231, DOI 10.1121/1.390775 SHENG Y, 1988, OPT ENG SHENG Y, 1986, J OPT SOC AM STEINSCHNEIDER M, 1999, AM PHYSL SOC, P2346 Stouten V, 2009, SPEECH COMMUN, V51, P1194, DOI 10.1016/j.specom.2009.06.003 Sundaram N., 2003, INSTANTANEOUS NONLIN Teager H. M., 1983, SPEECH SCI RECENT AD, P73 TEAGER HM, 1980, IEEE T ACOUST SPEECH, V28, P599, DOI 10.1109/TASSP.1980.1163453 TORRENCE C, 1998, PRACTICAL GUIDE WAVE Zhou GJ, 2001, IEEE T SPEECH AUDI P, V9, P201, DOI 10.1109/89.905995 ZWICKE PE, 1983, IEEE T PATTERN ANAL, V5, P191 2010, CU ACCENT NR 49 TC 12 Z9 12 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2010 VL 52 IS 10 BP 777 EP 789 DI 10.1016/j.specom.2010.05.004 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 638ST UT WOS:000280917000001 ER PT J AU Valente, F AF Valente, Fabio TI Hierarchical and parallel processing of auditory and modulation frequencies for automatic speech recognition SO SPEECH COMMUNICATION LA English DT Article DE Automatic speech recognition (ASR); TANDEM features; Multi Layer Perceptron (MLP); Auditory and modulation frequencies AB This paper investigates from an automatic speech recognition perspective, the most effective way of combining Multi Layer Perceptron (M LP) classifiers trained on different ranges of auditory and modulation frequencies. Two different combination schemes based on M LP are considered. The first one operates in parallel fashion and is invariant to the order in which feature streams are introduced. The second one operates in hierarchical fashion and is sensitive to the order in which feature streams are introduced. The study is carried on a Large Vocabulary Continuous Speech Recognition system for transcription of meetings data using the TANDEM approach. Results reveal that (1) the combination of MLPs trained on different ranges of auditory frequencies is more effective if performed in parallel fashion; (2) the combination of MLPs trained on different ranges of modulation frequencies is more effective if performed in hierarchical fashion moving from high to low modulations; (3) the improvement obtained from separate processing of two modulation frequency ranges (12% relative WER reduction w.r.t. the single classifier approach) is considerably larger than the improvement obtained from separate processing of two auditory frequency ranges (4% relative WER reduction w.r.t. the single classifier approach). Similar results are also verified on other LVCSR systems and on other languages. Furthermore, the paper extends the discussion to the combination of classifiers trained on separate auditory-modulation frequency channels showing that previous conclusions hold also in this scenario. (C) 2010 Elsevier B.V. All rights reserved. C1 IDIAP Res Inst, CH-1920 Martigny, Switzerland. RP Valente, F (reprint author), IDIAP Res Inst, CH-1920 Martigny, Switzerland. 
EM fabio.valente@idiap.ch FU Defense Advanced Research Projects Agency (DARPA) [HR0011-06-C-0023]; European Union FX This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-06-C-0023 and by the European Union under the integrated project AMIDA. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Defense Advanced Research Projects Agency (DARPA). The author thanks colleagues from the AMIDA and GALE projects for their help with the different LVCSR systems and the reviewers for their comments. CR Allen J., 2005, ARTICULATION INTELLI ALLEN JB, 1994, IEEE T SPEECH AUDIO, V2 BOURLARD H, 2004, P DARPA EARS EFF AFF BOURLARD H, 1996, P ICSLP 96 Bourlard Ha, 1994, CONNECTIONIST SPEECH CHEN B, 2003, P EUR 2003 Dau T, 1997, J ACOUST SOC AM, V102, P2892, DOI 10.1121/1.420344 Fletcher H., 1953, SPEECH HEARING COMMU HAIN T, 2005, NIST RT05 WORKSH ED HERMANSKY H, 1996, P ICSLP 96 HERMANSKY H, 1999, P ICASSP 99 HERMANSKY H, 2000, P ICASSP 2000 Hermansky H., 2005, P INT 2005 HERMANSKY H, 2003, P ASRU 2003 Hermansky H, 1998, SPEECH COMMUN, V25, P3, DOI 10.1016/S0167-6393(98)00027-2 HERMANSKY H, 1997, P EUR 97 Hermansky Hynek, 1994, IEEE T SPEECH AUDIO, V2 HOUTGAST T, 1989, J ACOUST SOC AM, V88 Kleinschmidt M, 2002, ACTA ACUST UNITED AC, V88, P416 MILLER, 2002, J NEUROPHYSIOL, V87 MISRA H, 2003, P ICASSP 2003 MOORE D, 2006, P MLMI 2006 MORGAN N, 2004, P ICASSP 2004 PLAHL C, 2009, P 10 ANN C INT SPEEC RUMELHART DE, 1986, NATURE, V323, P533, DOI 10.1038/323533a0 SIVADAS S, 2002, P ICASSP 2002 VALENTE F, 2008, P ICASSP 2008 VALENTE F, 2009, P 10 ANN C INT SPEEC VALENTE F, 2007, INT 2007 VALENTE F, 2007, P ICASSP 2007 ZHAO S, 2009, P INT 2009 NR 31 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2010 VL 52 IS 10 BP 790 EP 800 DI 10.1016/j.specom.2010.05.007 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 638ST UT WOS:000280917000002 ER PT J AU Riedhammer, K Favre, B Hakkani-Tur, D AF Riedhammer, Korbinian Favre, Benoit Hakkani-Tuer, Dilek TI Long story short - Global unsupervised models for keyphrase based meeting summarization SO SPEECH COMMUNICATION LA English DT Article DE Multi-party meetings speech; Summarization; Keyphrases; Global optimization ID TEXT AB We analyze and compare two different methods for unsupervised extractive spontaneous speech summarization in the meeting domain. Based on utterance comparison, we introduce an optimal formulation for the widely used greedy maximum marginal relevance (MMR) algorithm. Following the idea that information is spread over the utterances in form of concepts, we describe a system which finds an optimal selection of utterances covering as many unique important concepts as possible. Both optimization problems are formulated as an integer linear program (ILP) and solved using public domain software. We analyze and discuss the performance of both approaches using various evaluation setups on two well studied meeting corpora. We conclude on the benefits and drawbacks of the presented models and give an outlook on future aspects to improve extractive meeting summarization. (C) 2010 Elsevier B.V. All rights reserved. C1 [Riedhammer, Korbinian] Univ Erlangen Nurnberg, Lehrstuhl Informat 5, D-91058 Erlangen, Germany. 
[Riedhammer, Korbinian; Favre, Benoit; Hakkani-Tuer, Dilek] Int Comp Sci Inst, Berkeley, CA 94704 USA. [Favre, Benoit] Univ Maine, Lab Informat, F-72085 Le Mans 9, France. RP Riedhammer, K (reprint author), Univ Erlangen Nurnberg, Lehrstuhl Informat 5, Martensstr 3, D-91058 Erlangen, Germany. EM korbinian.riedhammer@informatik.uni-erlangen.de; benoit.favre@lium.univ-lemans.fr; dilek@icsi.berkeley.edu RI Riedhammer, Korbinian/A-2293-2012 OI Riedhammer, Korbinian/0000-0003-3582-2154 CR Burges CJC, 1998, DATA MIN KNOWL DISC, V2, P2, DOI DOI 10.1023/A:1009715923555 Carbonell J., 1998, Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, DOI 10.1145/290941.291025 Christensen H, 2004, LECT NOTES COMPUT SC, V2997, P223 Filatova E., 2004, P ACL WORKSH SUMM Furui S, 2004, IEEE T SPEECH AUDI P, V12, P401, DOI 10.1109/TSA.2004.828699 GARG N, 2009, P ANN C INT SPEECH C, P1499 GILLICK D, 2009, P IEEE INT C AC SPEE, P4769 Gillick D., 2009, P ACL HLT WORKSH INT, P10, DOI 10.3115/1611638.1611640 GILLICK D, 2008, P TEXT AN C WORKSH, P227 Ha L.Q., 2002, P 19 INT C COMP LING, P1 HORI C, 2000, P INT C SPOK LANG PR, P326 HORI C, 2002, P INT C AC SPEECH SI, P9 HOVY E, 2006, P INT C LANG RES EV Huang Z., 2007, P EMNLP CONLL, P1093 INOUE A, 2004, P ICASSP, P599 JANIN A, 2003, P ICASSP, P364 Lin C., 2004, P WORKSH TEXT SUMM B, P25 LIN H, 2009, P IEEE WORKSH SPEECH, P381 Liu F., 2009, P HLT NAACL, P620, DOI 10.3115/1620754.1620845 Liu F., 2009, P ACL IJCNLP, P261, DOI 10.3115/1667583.1667664 Liu F., 2008, P ACL, P201, DOI 10.3115/1557690.1557747 Liu FH, 2008, LECT NOTES COMPUT SC, V5299, P181 LIU Y, 2008, P ICASSP, P5009 Maskey S., 2005, P EUR C SPEECH COMM, P621 McCowan I., 2005, P MEAS BEH McDonald R, 2007, LECT NOTES COMPUT SC, V4425, P557 Mieskes M, 2007, Proceedings of the IASTED International Conference on Artificial Intelligence and Applications, P627 MROZINSKI J, 2005, P IEEE INT C AC SPEE, P981 MURRAY G, 2007, P ACM WORKSH MACH LE, P156 Murray G., 2005, P ACL 2005 WORKSH IN, P33 Murray G., 2005, P INT 2005 LISB PORT, P593 Murray G, 2008, P INT WORKSH MACH LE, P349 Murray G., 2006, P HUM LANG TECHN C N, P367, DOI 10.3115/1220835.1220882 NENKOVA A, 2004, P JOINT ANN M HLT NA Penn G., 2008, P ACL, P470 RENALS S, 2007, P IEEE WORKSH SPEECH RIEDHAMMER K, 2008, P INT, P2434 RIEDHAMMER K, 2008, P IEEE WORKSH SPOK L, P153 Santorini Beatrice, 1990, MSCIS9047 U PENNS DE Schapire RE, 2000, MACH LEARN, V39, P135, DOI 10.1023/A:1007649029923 Takamura H., 2009, P 12 C EUR CHAPT ACL, P781, DOI 10.3115/1609067.1609154 Thede S. M., 1999, P 37 ANN M ACL, P175, DOI 10.3115/1034678.1034712 Xie S., 2009, P ANN C INT SPEECH C, P1503 XIE S, 2009, P IEEE WORKSH SPEECH Zechner K, 2002, COMPUT LINGUIST, V28, P447, DOI 10.1162/089120102762671945 Zhang J., 2007, P NAACL HLT COMP VOL, P213 Zhu Q., 2005, P INT, P2141 ZHU X, 2006, IEEE INT C MULT EXP, P793 Zhu XR, 2009, PROCEEDINGS OF THE 2009 WRI GLOBAL CONGRESS ON INTELLIGENT SYSTEMS, VOL III, P549, DOI 10.1109/GCIS.2009.201 NR 49 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
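The meeting-summarization abstract above contrasts greedy maximum marginal relevance (MMR) with optimal ILP reformulations. For orientation, the sketch below shows only the standard greedy MMR selection under generic assumptions (precomputed query and pairwise utterance similarities, a fixed lambda trade-off, and a sentence budget); the paper's ILP formulations and concept-coverage model are not reproduced here.

```python
import numpy as np

def mmr_select(sim_query, sim_utt, budget, lam=0.7):
    """Greedy maximum marginal relevance over utterances.

    sim_query: (N,) relevance of each utterance to the query (e.g. a centroid);
    sim_utt:   (N, N) pairwise utterance similarities;
    budget:    number of utterances to select.
    """
    selected, candidates = [], list(range(len(sim_query)))
    while candidates and len(selected) < budget:
        def score(i):
            # Relevance minus redundancy against what is already selected.
            redundancy = max((sim_utt[i, j] for j in selected), default=0.0)
            return lam * sim_query[i] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```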
PD OCT PY 2010 VL 52 IS 10 BP 801 EP 815 DI 10.1016/j.specom.2010.06.002 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 638ST UT WOS:000280917000003 ER PT J AU Engelbrecht, KP Moller, S AF Engelbrecht, Klaus-Peter Moeller, Sebastian TI Sequential classifiers for the prediction of user judgments about spoken dialog systems SO SPEECH COMMUNICATION LA English DT Article DE Prediction model; Evaluation; PARADISE; Spoken dialog system; Usability ID QUALITY AB So far, predictions of user quality judgments in response to spoken dialog systems have been achieved on the basis of interaction parameters describing the dialog, e.g. in the PARADISE framework. These parameters do not take into account the temporal position of events happening in the dialog. It seems promising to apply sequence classification algorithms to the raw annotations of the data, instead of interaction parameters describing the overall dialog. As dialogs can be of very different length, Hidden Markov Models (HMM) and Markov Chains (MC) are handy, because they describe the likelihood of traversing to a state given only the previous state and the transition probability, thus they can be trained and applied to sequences of different lengths. This paper analyzes the feasibility of predicting user judgments with HMMs and MCs. In order to test the models, we acquire data with different types of users, forcing users to do as similar interactions as possible, and asking for user judgments after each turn. This allows comparing predicted distributions of judgments to the distributions measured empirically. We also apply the models to less rich corpora and compare them with results from Linear Regression models as used in the PARADISE framework. (C) 2010 Elsevier B.V. All rights reserved. C1 [Engelbrecht, Klaus-Peter; Moeller, Sebastian] TU Berlin, Qual & Usabil Lab, Deutsch Telekom Labs, D-10587 Berlin, Germany. RP Engelbrecht, KP (reprint author), TU Berlin, Qual & Usabil Lab, Deutsch Telekom Labs, Ernst Reuter Pl 7, D-10587 Berlin, Germany. EM klaus-peter.engelbrecht@telekom.de; sebastian.moeller@telekom.de FU Deutsche Telekom Laboratories, TU Berlin FX The work presented in this paper was supported by a number of students and colleagues from Deutsche Telekom Laboratories, TU Berlin. Felix Hartard and Florian Godde helped collect the data described in the first part of the paper. Babette Wiezorek made a large part of the annotations of features. Hamed Ketabdar gave advice for the implementation of the HMM-related algorithms, and Benjamin Weiss helped with comments to a first draft of the paper. We are very grateful for this support. In addition, we would like to thank the anonymous reviewers for their kind reviews and valuable comments, which helped to improve this paper.
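A minimal Python sketch of the Markov Chain idea described in the Engelbrecht and Moeller abstract above: transition probabilities are estimated from annotated event sequences of each judgment class, and a new dialog of arbitrary length is scored by its log-likelihood under each class. The label alphabet, the toy training sequences and the add-one smoothing are assumptions made for illustration.

# Illustrative sketch: first-order Markov chain over annotated dialog events,
# trained per judgment class and applied to sequences of any length.
import math
from collections import defaultdict

def train_markov_chain(sequences, alphabet):
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1
    probs = {}
    for prev in alphabet:
        total = sum(counts[prev].values()) + len(alphabet)   # add-one smoothing (assumption)
        probs[prev] = {cur: (counts[prev][cur] + 1) / total for cur in alphabet}
    return probs

def log_likelihood(seq, probs):
    return sum(math.log(probs[p][c]) for p, c in zip(seq, seq[1:]))

alphabet = ["ok", "no_match", "help", "bye"]                 # hypothetical event labels
good = [["ok", "ok", "bye"], ["ok", "help", "ok", "bye"]]
bad = [["no_match", "no_match", "help", "no_match", "bye"]]
models = {"good": train_markov_chain(good, alphabet), "bad": train_markov_chain(bad, alphabet)}
test = ["ok", "no_match", "ok", "bye"]
print({k: round(log_likelihood(test, m), 3) for k, m in models.items()})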
CR Ai H., 2008, P 9 SIGDIAL WORKSH D, P164, DOI 10.3115/1622064.1622097 Bortz J., 2005, STAT HUMAN SOZIALWIS, V6 Cuayahuitl H., 2005, P IEEE WORKSH AUT SP, P290 DORNER D, 2002, NEURONALE THEORIE HA Eckert W., 1997, P IEEE WORKSH AUT SP Engelbrecht KP, 2009, SPEECH COMMUN, V51, P1234, DOI 10.1016/j.specom.2009.06.007 ENGELBRECHT KP, 2009, P 1 INT WORKSH SPOK ENGELBRECHT KP, 2007, P 8 SIGDIAL WORKSH D, P291 Engelbrecht K.-P., 2008, P ESSV, P86 Engelbrecht K.-P., 2009, P SIGDIAL WORKSH DIS, P170, DOI 10.3115/1708376.1708402 EVANINI K, 2008, P SPOK LANG TECHN WO, P129 Fraser N., 1997, HDB STANDARDS RESOUR, P564 Guski R., 1999, NOISE HEALTH, V1, P45 HASTIE HW, 2002, P 3 INT C LANG RES E, V2, P641 HONE KS, 2001, P EUROSPEECH AALB DE, P2083 *ITU T, 2003, 851 ITU T Jekosch U., 2005, ASSESSMENT EVALUATIO Levenshtein V., 1966, SOV PHYS DOKL, V10, P707 Litman D., 2007, P 8 SIGDIAL WORKSH D, P124 McTear M., 2004, SPOKEN DIALOGUE TECH Moller S, 2008, SPEECH COMMUN, V50, P730, DOI 10.1016/j.specom.2008.03.001 MOLLER S, 2005, P 4 EUR C AC FOR AC, P2681 Moller S., 2005, QUALITY TELEPHONE BA MOLLER S, 2006, P 9 INT C SPOK LANG, P1786 Nielsen J., 1993, USABILITY ENG Okun MA, 1990, EDUC PSYCHOL REV, V2, P59, DOI 10.1007/BF01323529 Raake A., 2006, SPEECH QUALITY VOIP RABINER LR, 1989, P IEEE, V77, P257, DOI 10.1109/5.18626 Rieser V., 2008, P 6 INT C LANG RES E, P2356 RUSSEL S, 2004, MODERNER ANSATZ SCHLEICHER R, 2008, THESIS U COLOGNE Schmitt A, 2008, LECT NOTES ARTIF INT, V5078, P72, DOI 10.1007/978-3-540-69369-7_9 SKOWRONEK J, 2002, THESIS RUHR U BOCHUM WALKER M, 2000, P 2 INT C LANG RES E, V1, P189 Walker M. A., 2000, NAT LANG ENG, V6, P363, DOI 10.1017/S1351324900002503 Walker Marilyn A, 1997, P 35 ANN M ASS COMP, P271 Weiss B, 2009, ACTA ACUST UNITED AC, V95, P1140, DOI 10.3813/AAA.918245 Witten I.H., 2005, DATA MINING PRACTICA NR 38 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2010 VL 52 IS 10 BP 816 EP 833 DI 10.1016/j.specom.2010.06.004 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 638ST UT WOS:000280917000004 ER PT J AU Ling, ZH Richmond, K Yamagishi, J AF Ling, Zhen-Hua Richmond, Korin Yamagishi, Junichi TI An Analysis of HMM-based prediction of articulatory movements SO SPEECH COMMUNICATION LA English DT Article DE Hidden Markov model; Articulatory features; Parameter generation ID SPEECH PRODUCTION; MODEL; ACOUSTICS AB This paper presents an investigation into predicting the movement of a speaker's mouth from text input using hidden Markov models (HMM). A corpus of human articulatory movements, recorded by electromagnetic articulography (EMA), is used to train HMMs. To predict articulatory movements for input text, a suitable model sequence is selected and a maximum-likelihood parameter generation (MLPG) algorithm is used to generate output articulatory trajectories. Unified acoustic-articulatory HMMs are introduced to integrate acoustic features when an acoustic signal is also provided with the input text. Several aspects of this method are analyzed in this paper, including the effectiveness of context-dependent modeling, the role of supplementary acoustic input, and the appropriateness of certain model structures for the unified acoustic-articulatory models. 
When text is the sole input, we find that fully context-dependent models significantly outperform monophone and quinphone models, achieving an average root mean square (RMS) error of 1.945 mm and an average correlation coefficient of 0.600. When both text and acoustic features are given as input to the system, the difference between the performance of quinphone models and fully context-dependent models is no longer significant. The best performance overall is achieved using unified acoustic-articulatory quinphone HMMs with separate clustering of acoustic and articulatory model parameters, a synchronous-state sequence, and a dependent-feature model structure, with an RMS error of 0.900 mm and a correlation coefficient of 0.855 on average. Finally, we also apply the same quinphone HMMs to the acoustic-articulatory, or inversion, mapping problem, where only acoustic input is available. An average root mean square (RMS) error of 1.076 mm and an average correlation coefficient of 0.812 are achieved. Taken together, our results demonstrate how text and acoustic inputs both contribute to the prediction of articulatory movements in the method used. (C) 2010 Elsevier B.V. All rights reserved. C1 [Ling, Zhen-Hua] Univ Sci & Technol China, iFLYTEK Speech Lab, Hefei 230027, Anhui, Peoples R China. [Richmond, Korin; Yamagishi, Junichi] Univ Edinburgh, CSTR, Edinburgh EH8 9LW, Midlothian, Scotland. RP Ling, ZH (reprint author), Univ Sci & Technol China, iFLYTEK Speech Lab, Hefei 230027, Anhui, Peoples R China. EM zhling@ustc.edu; korin@cstr.ed.ac.uk; jyamagis@inf.ed.ac.uk FU Marie Curie Early Stage Training (EST) Network; National Nature Science Foundation of China [60905010]; Engineering and Physical Sciences Research Council (EPSRC); EC FX The authors thank Phil Hoole of Ludwig-Maximilian University, Munich for his great effort in helping record the EMA database. This work was supported by the Marie Curie Early Stage Training (EST) Network, "Edinburgh Speech Science and Technology (EdSST)". Zhen-Hua Ling is funded by the National Nature Science Foundation of China (Grant No. 60905010). Korin Richmond is funded by the Engineering and Physical Sciences Research Council (EPSRC). Junichi Yamagishi is funded by EPSRC and an EC FP7 collaborative project called the EMIME project. CR BAER T, 1987, MAGNETIC RESONANCE I, V5, P7 Fitt S., 1999, EUROSPEECH, V2, P823 Hiroya S, 2006, SPEECH COMMUN, V48, P1677, DOI 10.1016/j.specom.2006.08.002 Hiroya S, 2004, IEEE T SPEECH AUDI P, V12, P175, DOI 10.1109/TSA.2003.822636 Kawahara H, 1999, SPEECH COMMUN, V27, P187, DOI 10.1016/S0167-6393(98)00085-5 KIRITANI S, 1986, SPEECH COMMUN, V5, P119, DOI 10.1016/0167-6393(86)90003-8 Ling ZH, 2009, IEEE T AUDIO SPEECH, V17, P1171, DOI 10.1109/TASL.2009.2014796 PAPCUN G, 1992, J ACOUST SOC AM, V92, P688, DOI 10.1121/1.403994 RICHMOND K, 2007, NOLISP, P263 RICHMOND K, 2009, P INT BRIGHT UK SCHONLE PW, 1987, BRAIN LANG, V31, P26, DOI 10.1016/0093-934X(87)90058-7 Shinoda K., 2000, Journal of the Acoustical Society of Japan (E), V21, DOI 10.1250/ast.21.79 Blackburn CS, 2000, J ACOUST SOC AM, V107, P1659, DOI 10.1121/1.428450 Tamura M, 1999, EUROSPEECH, P959 Taylor P. 
A., 1998, 3 ESCA WORKSH SPEECH, P147 Toda T, 2008, SPEECH COMMUN, V50, P215, DOI 10.1016/j.specom.2007.09.001 Tokuda K, 1999, INT CONF ACOUST SPEE, P229 Tokuda K, 2000, ICASSP, V3, P1315 Tokuda K., 2004, TEXT SPEECH SYNTHESI Yoshimura T, 1998, P ICSLP, P29 Young S., 2002, HTK BOOK HTK VERSION Zen H, 2007, 6 ISCA WORKSH SPEECH, P294 Zhang L, 2008, IEEE SIGNAL PROC LET, V15, P245, DOI 10.1109/LSP.2008.917004 NR 23 TC 8 Z9 9 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2010 VL 52 IS 10 BP 834 EP 846 DI 10.1016/j.specom.2010.06.006 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 638ST UT WOS:000280917000005 ER PT J AU Pontes, JDA Furui, S AF Pontes, Josafa de Jesus Aguiar Furui, Sadaoki TI Predicting the phonetic realizations of word-final consonants in context - A challenge for French grapheme-to-phoneme converters SO SPEECH COMMUNICATION LA English DT Article DE Decision trees; Liaison in French; Post-lexical rules; Speech synthesis; Grapheme-to-phoneme conversions AB One of the main problems in developing a text-to-speech (TTS) synthesizer for French lies in grapheme-to-phoneme conversion. Automatic converters produce still too many errors in their phoneme sequences, to be helpful for people learning French as a foreign language. The prediction of the phonetic realizations of word-final consonants (WFCs) in general, and liaison in particular (les haricots vs. les escargots), are some of the main causes of such conversion errors. Rule-based methods have been used to solve these issues. Yet, the number of rules and their complex interaction make maintenance a problem. In order to alleviate such problems, we propose here an approach that, starting from a database (compiled from cases documented in the literature), allows to build C4.5 decision trees and subsequently, automate the generation of the required phonetic rules. We investigated the relative efficiency of this method both for classification of contexts and word-final consonant phoneme prediction. A prototype based on this approach reduced Obligatory context classification errors by 52%. Our method has the advantage to spare us the trouble to code rules manually, since they are contained already in the training database. Our results suggest that predicting the realization of WFCs as well as context classification is still a challenge for the development of a TTS application for teaching French pronunciation. (C) 2010 Elsevier B.V. All rights reserved. C1 [Pontes, Josafa de Jesus Aguiar; Furui, Sadaoki] Tokyo Inst Technol, Dept Comp Sci, Meguro Ku, Tokyo 1528552, Japan. RP Pontes, JDA (reprint author), Tokyo Inst Technol, Dept Comp Sci, Meguro Ku, 2-12-1-W8-77 Ookayama, Tokyo 1528552, Japan. EM josafa@furui.cs.titech.ac.jp; furui@furui.cs.titech.ac.jp CR BATTYE A, 2000, FRENCH LANGUAGE TODA, P109 Black A., 1998, P 3 ESCA WORKSH SPEE, P77 Black A. 
W., 1999, FESTIVAL SPEECH SYNT Boe L.-J., 1992, ZUT DICT PHONETIQUE CORREARD MH, 2003, OXFORD HACHETTE FREN COTE MH, 2005, PHONOLOGIC FRANCAISE, V58 DELATTRE PIERRE, 1951, PRINCIPES PHONETIQUE Durand J, 2008, J FR LANG STUD, V18, P33, DOI 10.1017/S0959269507003158 Encreve P., 1988, LIAISON AVEC SANS EN Fouche P., 1959, TRAITE PRONONCIATION GOLDMAN JP, 1999, ACT TALN99 CARG CORS, P165 GREVISSE M, 1997, USAGE GRAMMAIRE FRAN, P45 MAREUIL PN, 2003, LIAISONS FRENCH CORP NEW B, 2003, LEXIQUE, V2 *OFF QUEB LANG FRA, 2002, BDL BANQ DEP LING Pierret J-M., 1994, PHONETIQUE HIST FRAN, P98 Quinlan J. R., 1993, C4 5 PROGRAMS MACHIN ROBERT P, 2005, GRAND ROBERT LANGUE RONKOHAVI R, 1998, LEARNING KNOWLEDGE D, V30, P1998 SANDERS C, 1993, FRENCH TODAY LANGUAG, P263 SPROAT R, 1998, MULTILINGUAL TEXT TO, P56 STAMMERJOHANN H, 1976, NEUEREN SPRACHEN, V75, P489 *SYN, 2006, BIBL CORD ET LEMM, V11 Tzoukermann E., 1998, P ICSLP98 SYDN AUSTR, V5, P2039 *U SO DENM, 2008, CORP VERLUYTEN SP, 1987, LING ANTVERP NEW SER, V21, P175 Yvon F, 1998, COMPUT SPEECH LANG, V12, P393, DOI 10.1006/csla.1998.0104 NR 27 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2010 VL 52 IS 10 BP 847 EP 862 DI 10.1016/j.specom.2010.06.007 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 638ST UT WOS:000280917000006 ER PT J AU Chakroborty, S Saha, G AF Chakroborty, Sandipan Saha, Goutam TI Feature selection using singular value decomposition and QR factorization with column pivoting for text-independent speaker identification SO SPEECH COMMUNICATION LA English DT Article DE Speaker identification; MFCC; LFCC; GMFCC; GMM; Divergence; Subband; Correlation; SVD; QRcp; F-Ratio ID RECOGNITION SYSTEM; SPEECH RECOGNITION; MUTUAL INFORMATION; VERIFICATION; MODELS AB Selection of features is one of the important tasks in the application like Speaker Identification (SI) and other pattern recognition problems. When multiple features are extracted from the same frame of speech, it is expected that a feature vector would contain redundant features. Redundant features confuse the speaker model in multidimensional space resulting in degraded performance by the system. Careful selection of potential features can remove this redundancy while helping to achieve the higher rate of accuracy at lower computational cost. Although the selection of features is difficult without having exhaustive search, this paper proposes an alternative and straight forward technique for feature selection using Singular Value Decomposition (SVD) followed by QR Decomposition with Column Pivoting (QRcp). The idea is to capture the most salient part of the information from the speakers' data by choosing those features that can explain different dimensions showing minimal similarities (or maximum acoustic variability) among them in orthogonal sense. The performances after selection of features using proposed criterion have been compared with using Mel-frequency Cepstral Coefficients (MFCC), Linear Frequency (LF) Cepstral Coefficients (LFCC) and a new feature proposed in this paper that is based on Gaussian shaped filters on mel-scale. It is shown that proposed SVD-QRcp based feature selection outperforms F-Ratio based method and the proposed feature extraction tool is superior to baseline MFCC & LFCC. (C) 2010 Elsevier B.V. All rights reserved. 
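The SVD-plus-QRcp selection described in the abstract above can be sketched in a few lines of Python: the dominant right singular vectors summarize the feature space, and QR factorization with column pivoting applied to that basis ranks the original feature columns by how much non-redundant information they contribute. The toy data and the number of retained features are assumptions, not the paper's configuration.

# Illustrative sketch: feature-subset selection by SVD followed by
# QR factorization with column pivoting (QRcp).
import numpy as np
from scipy.linalg import qr

def svd_qrcp_select(X, k):
    """Return indices of k feature columns ranked by SVD + QR column pivoting."""
    # X: (frames x features) matrix, e.g. concatenated cepstral features
    _, _, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
    # pivoted QR on the top-k right singular directions ranks original columns
    _, _, piv = qr(Vt[:k, :], pivoting=True)
    return piv[:k]            # most informative, least redundant columns first

rng = np.random.default_rng(1)
base = rng.standard_normal((200, 3))
# two nearly redundant copies of existing columns are appended
X = np.hstack([base, base[:, :2] + 0.01 * rng.standard_normal((200, 2))])
print(svd_qrcp_select(X, k=3))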
C1 [Chakroborty, Sandipan; Saha, Goutam] Indian Inst Technol Kharagpur, Dept Elect & Elect Commun Engn, Kharagpur 721302, W Bengal, India. RP Chakroborty, S (reprint author), Indian Inst Technol Kharagpur, Dept Elect & Elect Commun Engn, Kharagpur 721302, W Bengal, India. EM mail2sandi@gmail.com; gsaha@ece.iitkgp.ernet.in CR ARI S, 2007, INT J BIOMED SCI, V2 ARI S, 2008, J APPL SOFT COMPUT, DOI DOI 10.1016/J.ASOC.2008.04.010 ATAL BS, 1976, P IEEE, V64, P460, DOI 10.1109/PROC.1976.10155 BATTITI R, 1994, IEEE T NEURAL NETWOR, V5, P537, DOI 10.1109/72.298224 Bellman R. E., 1957, DYNAMIC PROGRAMMING Besacier L, 2000, SIGNAL PROCESS, V80, P1245, DOI 10.1016/S0165-1684(00)00033-5 Burget L, 2007, IEEE T AUDIO SPEECH, V15, P1979, DOI 10.1109/TASL.2007.902499 Campbell J., 1995, P INT C AC SPEECH SI, P341 CHAKROBORTY S, 2007, INT J SIGNAL PROCESS, V5, P11 CHANG CY, 1973, IEEE T SYST MAN CYB, VSMC3, P166 Charlet D, 1997, PATTERN RECOGN LETT, V18, P873, DOI 10.1016/S0167-8655(97)00064-0 CHEUNG RS, 1978, IEEE T ACOUST SPEECH, V26, P397, DOI 10.1109/TASSP.1978.1163142 DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 DHARANIPRAGDA S, 2007, IEEE T AUDIO SPEECH, V15 Ellis D. P. W., 2000, P INT C SPOK LANG PR, P79 Eriksson T, 2005, IEEE SIGNAL PROC LET, V12, P500, DOI 10.1109/LSP.2005.849495 Errity A, 2007, Proceedings of the 2007 15th International Conference on Digital Signal Processing, P587 Gajic B, 2006, IEEE T AUDIO SPEECH, V14, P600, DOI 10.1109/TSA.2005.855834 Gales MJF, 1999, IEEE T SPEECH AUDI P, V7, P272, DOI 10.1109/89.759034 Ganchev T, 2006, P 5 INT S COMM SYST, P314 GJELSVIK E, 1999, P 5 INT S SIGN PROC, P637 Golub G. H., 1996, MATRIX COMPUTATIONS, P48 Haydar A, 1998, ELECTRON LETT, V34, P1457, DOI 10.1049/el:19981069 HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423 Hua YB, 1998, IEEE SIGNAL PROC LET, V5, P141 Kanjilal PP, 1999, IEEE T SYST MAN CY B, V29, P1, DOI 10.1109/3477.740161 KANJILAL PP, 1995, ADAPTIVE PREDICTION, P56 Kumar N, 1998, SPEECH COMMUN, V26, P283, DOI 10.1016/S0167-6393(98)00061-2 Kumar N, 1997, THESIS J HOPKINS U B Kwak N, 2002, IEEE T PATTERN ANAL, V24, P1667, DOI 10.1109/TPAMI.2002.1114861 MELIN H, 1996, P COST 250 WORKSH AP, P59 Milner B, 2007, IEEE T AUDIO SPEECH, V15, P24, DOI 10.1109/TASL.2006.876880 Murty KR, 2006, IEEE SIGNAL PROC LET, V13, P52, DOI 10.1109/LSP.2005.860538 NELSON GD, 1968, IEEE T SYST SCI CYB, VSSC4, P145, DOI 10.1109/TSSC.1968.300141 Nicholson S., 1997, P EUR SPEECH C SPEEC, P413 Paliwal K. K., 1992, Digital Signal Processing, V2, DOI 10.1016/1051-2004(92)90005-J PANDIT M, 1998, P ICASSP, V2, P769, DOI 10.1109/ICASSP.1998.675378 Papoulis A., 2002, PROBABILITY RANDOM V, P72 PEACOCKE RD, 1990, COMPUTER, V23, P26, DOI 10.1109/2.56868 Petrovska D., 1998, POLYCOST TELEPHONE S, P211 PRASAD KS, 2007, P INT C SIGN PROC CO, P20 PRUZANSKY S, 1964, J ACOUST SOC AM, V36, P2041, DOI 10.1121/1.1919320 REYNOLDS DA, 1995, IEEE T SPEECH AUDI P, V3, P72, DOI 10.1109/89.365379 SAHA G, 2005, P IEEE ANN C IND 200, P70 SAMBUR MR, 1975, IEEE T ACOUST SPEECH, VAS23, P176, DOI 10.1109/TASSP.1975.1162664 SILVERMAN BW, 1986, MONOGRAPHS STAT APPL, P75 Wolf JJ, 1971, J ACOUST SOC AM, V51, P2044 ZHENG F, 2000, P INT C SPOK LANG PR, P389 Zilca RD, 2006, IEEE T AUDIO SPEECH, V14, P467, DOI [10.1109/TSA.2005.857809, 10.1109/FSA.2005.857809] NR 49 TC 7 Z9 8 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD SEP PY 2010 VL 52 IS 9 BP 693 EP 709 DI 10.1016/j.specom.2010.04.002 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 631BD UT WOS:000280320600001 ER PT J AU Bao, CC Xu, H Xia, BY Liu, ZY Qiu, JW AF Bao, Changchun Xu, Hao Xia, Bingyin Liu, Zhangyu Qiu, Jianwei TI An efficient transcoding algorithm between AMR-NB and G.729ab SO SPEECH COMMUNICATION LA English DT Article DE Speech coding; Speech transcoding; CELP; G.729ab; AMR-NB AB In this paper, an efficient transcoding algorithm between AMR-NB and G.729ab is proposed. The proposed algorithm further elaborates on solutions when a DTX function is adopted between the source and destination coding systems. When neither, either or both of the source and destination coding systems adopt the DTX function, the proposed algorithm can carry out the transcoding operation between the two coding systems efficiently. When neither of the two coding systems adopts the DTX function, transcoding methods in different domains are proposed. A scalable distortion measure method based on parameter domain, specifically related to codebook gain conversion, is proposed to keep the amplitude of synthesized speech. The effect on subjective speech quality due to the amplitude of synthesized speech is cancelled out by using the proposed method and the computational complexity is reduced as well. When either or both of the two coding systems adopt the DTX function, depending on the type of the destination frame, transcoding methods between speech frames and non-speech frames are proposed. When the frame is declared as an erased frame, a linear prediction-based pitch recovery and transcoding method is used in this paper. By employing the proposed algorithm in transcoders, complexity is reduced by about 26-82% and quality is also improved compared to the conventional DTE method. (C) 2010 Elsevier B.V. All rights reserved. C1 [Bao, Changchun; Xu, Hao; Xia, Bingyin; Liu, Zhangyu; Qiu, Jianwei] Beijing Univ Technol, Sch Elect Informat & Control Engn, Speech & Audio Signal Proc Lab, Beijing 100124, Peoples R China. RP Bao, CC (reprint author), Beijing Univ Technol, Sch Elect Informat & Control Engn, Speech & Audio Signal Proc Lab, Beijing 100124, Peoples R China. EM baochch@bjut.edu.cn FU Huawei Technologies Co., Ltd.; Funding Project for Academic Human Resources Development in Institutions of Higher Learning Under the Jurisdiction of Beijing Municipality FX This work was supported by Huawei Technologies Co., Ltd. and Funding Project for Academic Human Resources Development in Institutions of Higher Learning Under the Jurisdiction of Beijing Municipality. CR *3GPP, 2007, 26094 3GPP REC TS *3GPP, 2007, 26092 3GPP TS *3GPP, 1999, 26090 3GPP REC TS [Anonymous], 2003, P8621 ITUT Benyassine A, 1997, IEEE COMMUN MAG, V35, P64, DOI 10.1109/35.620527 CHOI JK, 2004, ICASSP 04, P269 CHRISTOPHE B, 2007, IEEE 9 WORKSH MULT S, P155 GHENANIA M, 2004, P EUR, P1681 *ITU T, 2005, G7291 ITUT ITU-T, 1996, G729 ITUT ITU-T (Telecommunication Standardization Sector International Telecommunication Union), 2005, G191 ITUT Kang HG, 2003, IEEE T MULTIMEDIA, V5, P24, DOI 10.1109/TMM.2003.808823 KANG HG, 2000, IEEE WORKSH SPEECH C, P78 LEE ED, 2007, ICHIT 06, P178 NETO AFC, 1999, IEEE P INT C AC SPEE, P177 OTA Y, 2002, IEEE INT C COMM, P114 TSUCHINAGA Y, 2002, Patent No. 020072104 NR 17 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD SEP PY 2010 VL 52 IS 9 BP 710 EP 724 DI 10.1016/j.specom.2010.04.003 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 631BD UT WOS:000280320600002 ER PT J AU Jafari, A Almasganj, F AF Jafari, Ayyoob Almasganj, Farshad TI Using Laplacian eigenmaps latent variable model and manifold learning to improve speech recognition accuracy SO SPEECH COMMUNICATION LA English DT Article DE Laplacian eigenmaps; Latent variable model; Speech recognition ID NONLINEAR DIMENSIONALITY REDUCTION AB This paper demonstrates the application of the Laplacian eigenmaps latent variable model (LELVM) to the task of speech recognition. LELVM is a new dimension reduction method that combines the benefits of latent variable models a multimodal probability density for latent and observed variables, and globally differentiable nonlinear mappings for the tasks of reconstruction and dimensionality reduction with spectral manifold learning methods no local optimum, ability to unfold nonlinear manifolds, and excellent practical scaling to latent spaces of high dimensions. LELVM is achieved by defining an out-of-sample mapping for Laplacian eigenmaps using a semi-supervised learning procedure. LELVM is simple, non-parametric and computationally inexpensive. In this research, LELVM is used to project MFCC features to a new subspace which leads to more discrimination among different phonetic categories. To evaluate the performance of the proposed feature modification system, a HMM-based speech recognition system and TIMIT speech database are employed. The experiments represent about 5% of the accuracy improvement in an isolated phoneme recognition task. The experiments imply the superiority of the proposed method to the usual PCA methods. Moreover, the proposed method keeps its benefits in noisy environments and does not degrade in such conditions. Crown Copyright (C) 2010 Published by Elsevier B.V. All rights reserved. C1 [Jafari, Ayyoob; Almasganj, Farshad] Amirkabir Univ Technol, Dept Biomed Engn, Tehran, Iran. RP Jafari, A (reprint author), Amirkabir Univ Technol, Dept Biomed Engn, Tehran, Iran. EM ajafari20@aut.ac.ir; almas@aut.ac.ir CR Belkin M, 2003, NEURAL COMPUT, V15, P1373, DOI 10.1162/089976603321780317 Belkin M, 2002, ADV NEUR IN, V14, P585 CARREIRAPERPINA.MA, 2001, THESIS U SHEELD UK CARREIRAPERPINA.MA, 2007, PEOPLE TRACKING USIN Fant G., 1970, ACOUSTIC THEORY SPEE Ham J., 2005, AISTATS, P120 Jansen A., 2005, GEOMETRIC PERSPECTIV JIAYAN J, 2006, INT JOINT C NEUR NET Jolliffe I, 1986, SPRINGER SERIES STAT LANG C, P ICSP 04 Lawrence N, 2005, J MACH LEARN RES, V6, P1783 MORSE PM, 1953, METHODS THEORETICA 1, P43 ROSENBERG S, 1997, LAPLACIAN REIMMANNIA Roweis ST, 2000, SCIENCE, V290, P2323, DOI 10.1126/science.290.5500.2323 Tenenbaum JB, 2000, SCIENCE, V290, P2319, DOI 10.1126/science.290.5500.2319 TOGNERI R, 1992, IEE PROC-I, V139, P123 Wand M. P., 1995, KERNEL SMOOTHING Yang X., 2006, ICML 06, P1065 Zhu X., 2003, ICML, P912 NR 19 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
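A minimal Python sketch of plain Laplacian eigenmaps, the spectral embedding underlying the LELVM described in the Jafari and Almasganj abstract above (the semi-supervised out-of-sample mapping that defines LELVM itself is not shown); the kNN graph size, kernel width and toy MFCC-like data are assumptions.

# Illustrative sketch: Laplacian eigenmaps embedding of feature vectors.
import numpy as np
from scipy.linalg import eigh

def laplacian_eigenmaps(X, n_neighbors=10, n_components=2, sigma=1.0):
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)       # pairwise squared distances
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:n_neighbors + 1]            # skip self
        W[i, nbrs] = np.exp(-d2[i, nbrs] / (2 * sigma ** 2))
    W = np.maximum(W, W.T)                                      # symmetrize the affinity graph
    D = np.diag(W.sum(axis=1))
    L = D - W
    # generalized eigenproblem L v = lambda D v; drop the constant eigenvector
    vals, vecs = eigh(L, D)
    return vecs[:, 1:n_components + 1]

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 13))   # stand-in for 13-dimensional MFCC vectors
print(laplacian_eigenmaps(X).shape)  # (100, 2)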
PD SEP PY 2010 VL 52 IS 9 BP 725 EP 735 DI 10.1016/j.specom.2010.04.005 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 631BD UT WOS:000280320600003 ER PT J AU Hines, A Harte, N AF Hines, Andrew Harte, Naomi TI Speech intelligibility from image processing SO SPEECH COMMUNICATION LA English DT Article DE Auditory periphery model; Hearing aids; Sensorineural hearing loss; Structural similarity; MSSIM; Speech intelligibility ID AUDITORY-NERVE RESPONSES; FINE-STRUCTURE; VOWEL EPSILON; MODEL; PERCEPTION; PERIPHERY; ENVELOPE; QUALITY; SOUNDS; FIBERS AB Hearing loss research has traditionally been based on perceptual criteria, speech intelligibility and threshold levels. The development of computational models of the auditory periphery has allowed experimentation via simulation to provide quantitative, repeatable results at a more granular level than would be practical with clinical research on human subjects. The responses of the model used in this study have been previously shown to be consistent with a wide range of physiological data from both normal and impaired ears for stimuli presentation levels spanning the dynamic range of hearing. The model output can be assessed by examination of the spectro-temporal output visualised as neurograms. The effect of sensorineural hearing loss (SNHL) on phonemic structure was evaluated in this study using two types of neurograms: temporal fine structure (TFS) and average discharge rate or temporal envelope. A new systematic way of assessing phonemic degradation is proposed using the outputs of an auditory nerve model for a range of SNHLs. The mean structured similarity index (MSSIM) is an objective measure originally developed to assess perceptual image quality. The measure is adapted here for use in measuring the phonemic degradation in neurograms derived from impaired auditory nerve outputs. A full evaluation of the choice of parameters for the metric is presented using a large amount of natural human speech. The metric's boundedness and the results for TFS neurograms indicate it is a superior metric to standard point to point metrics of relative mean absolute error and relative mean squared error. MSSIM as an indicative score of intelligibility is also promising, with results similar to those of the standard speech intelligibility index metric. (C) 2010 Elsevier B.V. All rights reserved. C1 [Hines, Andrew; Harte, Naomi] Trinity Coll Dublin, Dept Elect & Elect Engn, Sigmedia Grp, Dublin, Ireland. RP Hines, A (reprint author), Trinity Coll Dublin, Dept Elect & Elect Engn, Sigmedia Grp, Dublin, Ireland. 
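The mean structural similarity index adapted in the Hines and Harte abstract above can be computed with scikit-image; the sketch below scores a degraded neurogram-like array against a reference one. The synthetic arrays, window size and data range are assumptions made for illustration.

# Illustrative sketch: MSSIM-style comparison of two neurogram-like arrays.
import numpy as np
from skimage.metrics import structural_similarity as ssim

rng = np.random.default_rng(3)
reference = rng.random((64, 200))                           # unimpaired neurogram: bands x time bins
degraded = 0.6 * reference + 0.1 * rng.random((64, 200))    # crude stand-in for an impaired ear

score = ssim(reference, degraded, data_range=1.0, win_size=7)
print(f"MSSIM-style score: {score:.3f}")                    # 1.0 would mean identical neurograms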
EM hinesa@tcd.ie CR American National Standards Institute (ANSI), 1997, S351997R2007 ANSI BONDY J, 2004, NIPS 2003 ADV NEURAL, V16, P1409 Bruce IC, 2003, J ACOUST SOC AM, V113, P369, DOI 10.1121/1.1519544 Bruce I.C., 2007, AUD SIGN PROC HEAR I, P73 DARPA UDC, 1990, 111 NIST DENG L, 1987, J ACOUST SOC AM, V82, P2001, DOI 10.1121/1.395644 Dillon H., 2001, HEARING AIDS DINATH F, 2008, P 30 INT IEEE ENG ME, P1793 Elhilali M, 2003, SPEECH COMMUN, V41, P331, DOI 10.1016/S0167-6393(02)00134-6 FRENCH NR, 1947, J ACOUST SOC AM, V19, P90, DOI 10.1121/1.1916407 Heinz MG, 2009, JARO-J ASSOC RES OTO, V10, P407, DOI 10.1007/s10162-009-0169-8 Hines A., 2009, INT 09 BRIGHT ENGL, P1119 Houtgast T., 2001, J ACOUST SOC AM, V110, P529 JERGER J, 1971, ARCHIV OTOLARYNGOL, V93, P573 Kandadai S., 2008, P IEEE INT C AC SPEE, P221 LIBERMAN MC, 1978, J ACOUST SOC AM, V63, P442, DOI 10.1121/1.381736 Lopez-Poveda EA, 2005, INT REV NEUROBIOL, V70, P7, DOI 10.1016/S0074-7742(05)70001-5 Lorenzi C, 2006, P NATL ACAD SCI USA, V103, P18866, DOI 10.1073/pnas.0607364103 ROSEN S, 1992, PHILOS T ROY SOC B, V336, P367, DOI 10.1098/rstb.1992.0070 Sachs MB, 2002, ANN BIOMED ENG, V30, P157, DOI 10.1114/1.1458592 Smith ZM, 2002, NATURE, V416, P87, DOI 10.1038/416087a STEENEKEN HJM, 1980, J ACOUST SOC AM, V67, P318, DOI 10.1121/1.384464 Steeneken HJM, 2002, SPEECH COMMUN, V38, P399, DOI 10.1016/S0167-6393(02)00011-0 Studebaker GA, 1999, J ACOUST SOC AM, V105, P2431, DOI 10.1121/1.426848 Wang Z, 2004, IEEE T IMAGE PROCESS, V13, P600, DOI 10.1109/TIP.2003.819861 Wang Z., 2005, P IEEE INT C AC SPEE, V2, P573 WIENER FM, 1946, J ACOUST SOC AM, V18, P401, DOI 10.1121/1.1916378 Wong JC, 1998, HEARING RES, V123, P61, DOI 10.1016/S0378-5955(98)00098-7 Xu L, 2003, J ACOUST SOC AM, V114, P3024, DOI 10.1121/1.1623786 Zhang XD, 2001, J ACOUST SOC AM, V109, P648, DOI 10.1121/1.1336503 Zilany M. S. A., 2007, THESIS MCMASTER U HA Zilany MSA, 2007, J ACOUST SOC AM, V122, P402, DOI 10.1121/1.2735117 Zilany MSA, 2006, J ACOUST SOC AM, V120, P1446, DOI 10.1121/1.2225512 NR 33 TC 7 Z9 7 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2010 VL 52 IS 9 BP 736 EP 752 DI 10.1016/j.specom.2010.04.006 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 631BD UT WOS:000280320600004 ER PT J AU Nosratighods, M Ambikairajah, E Epps, J Carey, MJ AF Nosratighods, Mohaddeseh Ambikairajah, Eliathamby Epps, Julien Carey, Michael John TI A segment selection technique for speaker verification SO SPEECH COMMUNICATION LA English DT Article DE Speaker verification; Segment selection; Null hypothesis ID RECOGNITION; NORMALIZATION; GMM AB The performance of speaker verification systems degrades considerably when the test segments are utterances of very short duration. This might be either due to variations in score-matching arising from the unobserved speech sounds of short speech utterances or the fact that the shorter the utterance, the greater the effect of individual speech sounds on the average likelihood score. In other words, the effects of individual speech sounds have not been cancelled out by a large number of speech sounds in very short utterances. This paper presents a score-based segment selection technique for discarding portions of speech that result in poor discrimination ability in a speaker verification task. 
Theory is developed to detect the most significant and reliable speech segments based on the probability that the test segment comes from a fixed set of cohort models. This approach, suitable for any duration of test utterance, reduces the effect of acoustic regions of the speech that are not accurately modelled due to sparse training data, and makes a decision based only on the segments that provide the best-matched scores from the segment selection algorithm. The proposed segment selection technique provides reductions in relative error rate of 22% and 7% in terms of minimum Detection Cost Function (DCF) and Equal Error Rate (EER) compared with a baseline that used the segment-based normalization, when evaluated on the short utterances of the NIST 2002 Speaker Recognition Evaluation dataset. (C) 2010 Elsevier B.V. All rights reserved. C1 [Nosratighods, Mohaddeseh; Ambikairajah, Eliathamby; Epps, Julien] Univ New S Wales, Sch Elect Engn & Telecommun, Sydney, NSW 2052, Australia. [Carey, Michael John] Univ Birmingham, Dept Elect Elect & Comp Engn, Birmingham B15 2TT, W Midlands, England. [Ambikairajah, Eliathamby] NICTA, Eveleigh 1430, Australia. RP Nosratighods, M (reprint author), Univ New S Wales, Sch Elect Engn & Telecommun, Sydney, NSW 2052, Australia. EM hadis@unsw.edu.au; ambi@ee.unsw.edu; j.epps@unsw.edu.au; m.carey@bham.ac.uk CR ATAL BS, 1976, P IEEE, V64, P460, DOI 10.1109/PROC.1976.10155 Auckenthaler R, 2000, DIGIT SIGNAL PROCESS, V10, P42, DOI 10.1006/dspr.1999.0360 AUCKENTHALER R, 1999, IEEE INT C ACOUSTICS, V1, P313 BARRAS C, 2003, IEEE INT C AC SPEECH, V2, P49 CAREY MJ, 1997, IEEE INT C AC SPEECH, V2, P1083 Devore J.L., 1995, PROBABILITY STAT ENG Gales MJF, 1996, IEEE T SPEECH AUDI P, V4, P352, DOI 10.1109/89.536929 Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278 Gauvain JL, 2002, SPEECH COMMUN, V37, P89, DOI 10.1016/S0167-6393(01)00061-9 Heck LP, 1997, INT CONF ACOUST SPEE, P1071, DOI 10.1109/ICASSP.1997.596126 Koolwaaij J, 2000, DIGIT SIGNAL PROCESS, V10, P113, DOI 10.1006/dspr.1999.0357 Li K.-P., 1988, ICASSP 88: 1988 International Conference on Acoustics, Speech, and Signal Processing (Cat. No.88CH2561-9), DOI 10.1109/ICASSP.1988.196655 Malayath N, 2000, DIGIT SIGNAL PROCESS, V10, P55, DOI 10.1006/dspr.1999.0363 MARTIN AF, 2002, NIST SPEAKER EVALUAT Nosratighods M, 2007, INT CONF ACOUST SPEE, P269 NOSRATIGHODS M, 2006, INT C SPEECH SCI TEC, P136 Pelecanos J, 2006, INT CONF ACOUST SPEE, P109 PELECANOS J, 2001, SPEAKER ODYSSEY SPEA, P175 PRZYBOCKI M, 2004, SPEAKER ODYSSEY SPEA, P15 REYNOLDS D, 1994, EUR C SPEECH COMM TE Reynolds DA, 1992, GAUSSIAN MIXTURE MOD ROSE RC, 1991, INT CONF ACOUST SPEE, P401, DOI 10.1109/ICASSP.1991.150361 Wackerly D. D., 1996, MATH STAT APPL NR 23 TC 3 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
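A simplified Python sketch in the spirit of the segment selection described above: per-segment log-likelihood ratios are computed and only the best-matched segments contribute to the final verification score. The Gaussian stand-ins for GMMs, the top-fraction selection rule and the keep fraction are assumptions, not the paper's cohort-based criterion.

# Illustrative sketch: score only the best-matched test segments.
import numpy as np

def segment_llr(segments, target_logpdf, background_logpdf):
    """Average frame log-likelihood ratio for each segment of frames."""
    return np.array([np.mean(target_logpdf(seg) - background_logpdf(seg)) for seg in segments])

def selected_score(segments, target_logpdf, background_logpdf, keep_fraction=0.5):
    llrs = segment_llr(segments, target_logpdf, background_logpdf)
    k = max(1, int(round(keep_fraction * len(llrs))))
    return float(np.mean(np.sort(llrs)[-k:]))      # average over the best-matched segments only

# toy usage with 1-D Gaussian "models" in place of GMMs
rng = np.random.default_rng(4)
segments = [rng.normal(loc=0.5, scale=1.0, size=30) for _ in range(6)]
target = lambda x: -0.5 * (x - 0.5) ** 2           # log-densities up to a constant
background = lambda x: -0.5 * (x - 0.0) ** 2
print(selected_score(segments, target, background))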
PD SEP PY 2010 VL 52 IS 9 BP 753 EP 761 DI 10.1016/j.specom.2010.04.007 PG 9 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 631BD UT WOS:000280320600005 ER PT J AU Laska, B Bolic, M Goubran, R AF Laska, Brady Bolic, Miodrag Goubran, Rafik TI Discrete cosine transform particle filter speech enhancement SO SPEECH COMMUNICATION LA English DT Article DE Speech enhancement; Noise reduction; Particle filtering; Discrete cosine transform (DCT) ID SPECTRAL AMPLITUDE ESTIMATOR; SUBSPACE APPROACH; KALMAN FILTER; NOISY SPEECH; ALGORITHMS AB A discrete cosine transform (DCT) domain speech enhancement algorithm is proposed that models the evolution of speech DCT coefficients as a time-varying autoregressive process. Rao-Blackwellized particle filter (RBPF) techniques are used to estimate the model parameters and recover the clean signal coefficients. Using very low-order models for each coefficient and operating at a decimated frame rate, the proposed approach provides a significant complexity reduction compared to the standard full-band RBPF speech enhancement algorithm. In addition to the complexity gains, performance is also improved. Modeling the speech signal in the DCT-domain is shown to provide a better fit in spectral troughs, leading to more noise reduction and less speech distortion. To illustrate possible frequency-dependent processing strategies, a hybrid structure is proposed that offers a complexity/performance trade-off by substituting a simple DCT Wiener filter for the DCT-RBPF in some bands. In comparisons with high performing speech enhancement algorithms using wide-band speech and noise, the proposed DCT-RBPF algorithm achieves higher scores on objective quality and intelligibility measures. (C) 2010 Elsevier B.V. All rights reserved. C1 [Laska, Brady] Res Mot Ltd, Ottawa, ON K2K 3K1, Canada. [Goubran, Rafik] Carleton Univ, Dept Syst & Comp Engn, Ottawa, ON K1S 5B6, Canada. [Bolic, Miodrag] Univ Ottawa, Sch Informat Technol & Engn, Ottawa, ON K1N 6N5, Canada. RP Laska, B (reprint author), Res Mot Ltd, 4000 Innovat Dr, Ottawa, ON K2K 3K1, Canada. EM blaska@rim.com; mbolic@site.uottawa.ca; goubran@sce.carleton.ca FU Siemens AG; NSERC FX This work was funded in part by Siemens AG, and NSERC. Thanks to Rajbabu Velmurugan and the anonymous reviewers for their valuable comments and to Frederic Mustiere for sharing his RBPF expertise. 
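The simple per-coefficient DCT-domain Wiener filter that the hybrid structure in the Laska et al. abstract substitutes for the RBPF in some bands can be sketched as follows (the Rao-Blackwellized particle filter itself is not shown); the frame length, the noise-only lead-in used for the noise estimate and the gain floor are assumptions.

# Illustrative sketch: per-coefficient Wiener gain applied in the DCT domain.
import numpy as np
from scipy.fft import dct, idct

def dct_wiener_enhance(noisy, frame_len=256, noise_frames=10, gain_floor=0.1):
    n_frames = len(noisy) // frame_len
    frames = noisy[:n_frames * frame_len].reshape(n_frames, frame_len)
    coeffs = dct(frames, type=2, norm='ortho', axis=1)
    noise_var = np.mean(coeffs[:noise_frames] ** 2, axis=0)     # noise-only lead-in assumed
    enhanced = np.empty_like(coeffs)
    for t in range(n_frames):
        speech_var = np.maximum(coeffs[t] ** 2 - noise_var, 0.0)
        gain = np.maximum(speech_var / (speech_var + noise_var + 1e-12), gain_floor)
        enhanced[t] = gain * coeffs[t]
    return idct(enhanced, type=2, norm='ortho', axis=1).reshape(-1)

rng = np.random.default_rng(5)
clean = np.sin(2 * np.pi * 440 * np.arange(4096) / 16000)
noisy = np.concatenate([np.zeros(2560), clean]) + 0.3 * rng.standard_normal(4096 + 2560)
print(dct_wiener_enhance(noisy).shape)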
CR Arulampalam MS, 2002, IEEE T SIGNAL PROCES, V50, P174, DOI 10.1109/78.978374 BOLIC M, 2003, P IEEE INT C AC SPEE, V2, P589 CHEVALIER M, 1985, P IEEE ICASSP, V10, P501 Daum F., 2003, P IEEE C AER, V4, P1979 DENDRINOS M, 1991, SPEECH COMMUN, V10, P45, DOI 10.1016/0167-6393(91)90027-Q DENG Y, 2006, P EUR EUSIPCO FLOR I Doucet A., 2000, P 16 C UNC ART INT, P176 EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550 EPHRAIM Y, 1995, IEEE T SPEECH AUDI P, V3, P251, DOI 10.1109/89.397090 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 Fong W, 2002, IEEE T SIGNAL PROCES, V50, P438, DOI 10.1109/78.978397 Gannot S, 1998, IEEE T SPEECH AUDI P, V6, P373, DOI 10.1109/89.701367 GORDON NJ, 1993, IEE PROC-F, V140, P107 Hu Y, 2003, IEEE T SPEECH AUDI P, V11, P334, DOI 10.1109/TSA.2003.814458 Hu Y, 2008, IEEE T AUDIO SPEECH, V16, P229, DOI 10.1109/TASL.2007.911054 JENSEN SH, 1995, IEEE T SPEECH AUDI P, V3, P439, DOI 10.1109/89.482211 Kates JM, 2005, J ACOUST SOC AM, V117, P2224, DOI 10.1121/1.1862575 Labarre D, 2007, IEEE T SIGNAL PROCES, V55, P5195, DOI 10.1109/TSP.2007.899587 Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst Lotter T., 2005, EURASIP J APPL SIG P, V7, P1110 Ma N, 2006, IEEE T AUDIO SPEECH, V14, P19, DOI 10.1109/TSA.2005.858515 Martin R, 2005, SIG COM TEC, P43, DOI 10.1007/3-540-27489-8_3 MCAULAY RJ, 1980, IEEE T ACOUST SPEECH, V28, P137, DOI 10.1109/TASSP.1980.1163394 MUSTIERE F, 2006, P IEEE ICASSP TOUL F, V3, P21 MUSTIERE F, 2007, P IEEE ICASSP HON HI, P1197 Paliwal K. K., 1987, Proceedings: ICASSP 87. 1987 International Conference on Acoustics, Speech, and Signal Processing (Cat. No.87CH2396-0) PUDER H, 2006, TOPICS ACOUSTIC ECHO Rao S, 1996, IEEE T INFORM THEORY, V42, P1160, DOI 10.1109/18.508839 Soon IY, 1998, SPEECH COMMUN, V24, P249, DOI 10.1016/S0167-6393(98)00019-3 TRIKI M, 2009, P IEEE ICASSP, P29 Vermaak J, 2002, IEEE T SPEECH AUDI P, V10, P173, DOI 10.1109/TSA.2002.1001982 Wu WR, 1998, IEEE T CIRCUITS-II, V45, P1072 NR 32 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2010 VL 52 IS 9 BP 762 EP 775 DI 10.1016/j.specom.2010.05.005 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 631BD UT WOS:000280320600006 ER PT J AU Bitouk, D Verma, R Nenkova, A AF Bitouk, Dmitri Verma, Ragini Nenkova, Ani TI Class-level spectral features for emotion recognition SO SPEECH COMMUNICATION LA English DT Article DE Emotions; Emotional speech classification; Spectral features ID SPEECH AB The most common approaches to automatic emotion recognition rely on utterance-level prosodic features. Recent studies have shown that utterance-level statistics of segmental spectral features also contain rich information about expressivity and emotion. In our work we introduce a more fine-grained yet robust set of spectral features: statistics of Mel-Frequency Cepstral Coefficients computed over three phoneme type classes of interest stressed vowels, unstressed vowels and consonants in the utterance. We investigate performance of our features in the task of speaker-independent emotion recognition using two publicly available datasets. Our experimental results clearly indicate that indeed both the richer set of spectral features and the differentiation between phoneme type classes are beneficial for the task. 
Classification accuracies are consistently higher for our features compared to prosodic or utterance-level spectral features. Combination of our phoneme class features with prosodic features leads to even further improvement. Given the large number of class-level spectral features, we expected feature selection will improve results even further, but none of several selection methods led to clear gains. Further analyses reveal that spectral features computed from consonant regions of the utterance contain more information about emotion than either stressed or unstressed vowel features. We also explore how emotion recognition accuracy depends on utterance length. We show that, while there is no significant dependence for utterance-level prosodic features, accuracy of emotion recognition using class-level spectral features increases with the utterance length. (C) 2010 Elsevier B.V. All rights reserved. C1 [Bitouk, Dmitri; Verma, Ragini] Univ Penn, Dept Radiol, Sect Biomed Image Anal, Philadelphia, PA 19104 USA. [Nenkova, Ani] Univ Penn, Dept Comp & Informat Sci, Philadelphia, PA 19104 USA. RP Bitouk, D (reprint author), Univ Penn, Dept Radiol, Sect Biomed Image Anal, 3600 Market St,Suite 380, Philadelphia, PA 19104 USA. EM Dmitri.Bitouk@uphs.upenn.edu FU NIH [R01 MHO73174] FX This work is supported by NIH Grant R01 MHO73174. The authors would like to thank Dr. Jiahong Yuan for providing us with the code and English acoustic models for forced alignment. CR Banse R, 1996, J PERS SOC PSYCHOL, V70, P614, DOI 10.1037/0022-3514.70.3.614 BITOUK D, 2009, P INT 2009 Boersma P., 2001, GLOT INT, V5, P341 Burkhardt F., 2005, P INT 2005, P1 Chang C.-C., 2001, LIBSVM LIB SUPPORT V Dellaert F., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.608022 Fernandez R, 2003, SPEECH COMMUN, V40, P145, DOI 10.1016/S0167-6393(02)00080-8 GRIMM M, 2006, P 14 EUR SIGN PROC C Hall M. 
A., 1997, P 4 INT C NEUR INF P, P855 Hall MA, 1998, AUST COMP S, V20, P181 HASEGAWAJOHNSON M, 2004, P INT 2004 HU H, 2007, P IEEE INT C AC SPEE, V4, P413 HUANG R, 2006, P INT C PATT REC ICP, P1204 KIM S, 2007, P IEEE 9 WORKSH MULT KWON O.W., 2003, P 8 EUR C SPEECH COM, P125 Lee CM, 2004, P INT 2004, P205 LEGETTER C, 1996, COMPUT SPEECH LANG, V10, P249 *LING DAT CONS, 2002, LDC2002528 U PENNS Luengo I, 2005, P INTERSPEECH, P493 McGilloway S., 2000, P ISCA WORKSH SPEECH, P200 MENG H, 2007, P IEEE INT C SIGN PR, P1179 Neiberg D, 2006, P INT C SPOK LANG PR, P809 Nicholson J, 2000, NEURAL COMPUT APPL, V9, P290, DOI 10.1007/s005210070006 Nwe TL, 2003, SPEECH COMMUN, V41, P603, DOI 10.1016/S01167-6393(03)00099-2 ODELL J, 2002, HTK BOOK PAO T, 2005, COMPUT LINGUISTICS C, V10 Sato N., 2007, INFORM MEDIA TECHNOL, V2, P835 Scherer S., 2007, P INT ENV 2007, P152 SCHULLER B., 2005, P INT LISB PORT, P805 SCHULLER B, 2006, P INTERSPEECH 2006 I, P1818 SETHU V, 2008, P INT 2008, P617 Shafran I., 2003, P IEEE AUT SPEECH RE, P31 Shamia M., 2005, P IEEE INT C MULT EX Song ML, 2004, LECT NOTES COMPUT SC, V3046, P406 TABATABAEI T, 2007, P IEEE INT S CIRC SY, P345 Vlasenko B, 2008, LECT NOTES ARTIF INT, V5078, P217, DOI 10.1007/978-3-540-69369-7_24 Vlasenko B, 2007, LECT NOTES COMPUT SC, V4738, P139 Vondra M, 2009, LECT NOTES ARTIF INT, V5398, P256 WANG Y, 2005, P IEEE INT C AC SPEE, P1125 Yacoub S., 2003, P EUROSPEECH, P729 Ye CX, 2008, LECT NOTES COMPUT SC, V5353, P61 NR 41 TC 28 Z9 30 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL-AUG PY 2010 VL 52 IS 7-8 BP 613 EP 625 DI 10.1016/j.specom.2010.02.010 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 608HH UT WOS:000278573700001 ER PT J AU Munro, MJ Derwing, TM Burgess, CS AF Munro, Murray J. Derwing, Tracey M. Burgess, Clifford S. TI Detection of nonnative speaker status from content-masked speech SO SPEECH COMMUNICATION LA English DT Article DE Foreign accent; Voice quality; Backwards speech ID PERCEIVED FOREIGN ACCENT; LANGUAGE DISCRIMINATION; MAGNITUDE ESTIMATION; ESL LEARNERS; L2 SPEECH; ENGLISH; VOICE; 2ND-LANGUAGE; INTELLIGIBILITY; PRONUNCIATION AB Listeners are highly sensitive to divergences from native-speaker patterns of speech production, such that they can recognize second-language speakers even from very short stretches of speech. However, the processes by which nonnative speaker detection is accomplished are not fully understood. In this investigation, we used content-masked (backwards) speech to assess listeners' sensitivity to nonnative speaker status when potential segmental, grammatical, and lexical cues were removed. The listeners performed at above-chance levels across utterances of varying lengths and across three accents (Mandarin, Cantonese, and Czech). Reduced sensitivity was observed when variability in speaking rates and F0 was removed, and when temporal integrity was severely disrupted. The results indicate that temporal properties, pitch, and voice quality probably played a role in the listeners' judgments. (C) 2010 Elsevier B.V. All rights reserved. C1 [Munro, Murray J.; Burgess, Clifford S.] Simon Fraser Univ, Dept Linguist, Burnaby, BC V5A 1S6, Canada. [Derwing, Tracey M.] Univ Alberta, Dept Educ Psychol, Edmonton, AB T6G 2G5, Canada. RP Munro, MJ (reprint author), Simon Fraser Univ, Dept Linguist, 8888 Univ Dr, Burnaby, BC V5A 1S6, Canada. 
EM mjmunro@sfu.ca FU Social Sciences and Humanities Research Council of Canada FX The authors thank H. Li, N. Penner, R. Thomson, and K. Jamieson for their assistance with stimulus preparation and data collection. We also thank Bruce Derwing for his comments. This research was supported by the Social Sciences and Humanities Research Council of Canada. CR ANDERSONHSIEH J, 1992, LANG LEARN, V42, P529, DOI 10.1111/j.1467-1770.1992.tb01043.x Andrianopoulos MV, 2001, J VOICE, V15, P61, DOI 10.1016/S0892-1997(01)00007-8 BLACK JW, 1973, J SPEECH HEAR RES, V16, P165 Boersma P., 2008, PRAAT DOING PHONETIC BRENNAN EM, 1981, LANG SPEECH, V24, P207 BRENNAN EM, 1975, J PSYCHOLINGUIST RES, V4, P27, DOI 10.1007/BF01066988 Chan AY, 2007, CAN J LING/REV CAN L, V52, P231 Collins SA, 2000, ANIM BEHAV, V60, P773, DOI 10.1006/anbe.2000.1523 DAVILA A, 1993, SOC SCI QUART, V74, P902 Derwing T. M., 1997, STUDIES 2 LANGUAGE A, V19, P1, DOI [DOI 10.1017/S0272263197001010, 10.1017/S0272263197001010] Derwing T. M., 2002, J MULTILING MULTICUL, V23, P245, DOI DOI 10.1080/01434630208666468 Derwing TM, 2008, APPL LINGUIST, V29, P359, DOI 10.1093/applin/amm041 DONALDSON W, 1993, B PSYCHONOMIC SOC, V31, P271 Esling J. H., 2000, VOICE QUALITY MEASUR, P25 ESLING JH, 1983, TESOL QUART, V17, P89, DOI 10.2307/3586426 ESLING JH, 1994, PRONUNCIATION PEDAGO, P49 Fitch WT, 2000, TRENDS COGN SCI, V4, P258, DOI 10.1016/S1364-6613(00)01494-7 Fitch WT, 1999, J ACOUST SOC AM, V106, P1511, DOI 10.1121/1.427148 Flege J. E., 1988, HUMAN COMMUNICATION, P224 FLEGE JE, 1984, J ACOUST SOC AM, V76, P692, DOI 10.1121/1.391256 FLEGE JE, 1991, J ACOUST SOC AM, V89, P395, DOI 10.1121/1.400473 FLEGE JE, 1992, J ACOUST SOC AM, V91, P370, DOI 10.1121/1.402780 FRIEND M, 1994, J ACOUST SOC AM, V96, P1283, DOI 10.1121/1.410276 Gick B., 2008, PHONOLOGY 2 LANGUAGE, P309 HANLEY TD, 1966, PHONETICA, V14, P97 Hartmann W. M., 1998, SIGNALS SOUND SENSAT HONIKMAN B, 1964, PAPERS CONTRIBUTED O, P73 Jones R. H., 1995, ELT J, V49, P244, DOI 10.1093/elt/49.3.244 KIMURA D, 1968, SCIENCE, V161, P395, DOI 10.1126/science.161.3839.395 Ladefoged Peter, 2005, VOWELS CONSONANTS IN Laver J, 1980, PHONETIC DESCRIPTION Levi SV, 2007, J ACOUST SOC AM, V121, P2327, DOI 10.1121/1.2537345 Lippi-Green Rosina, 1997, ENGLISH ACCENT LANGU Mackay IRA, 2006, APPL PSYCHOLINGUIST, V27, P157, DOI 10.1017/S0142716406060231 Magen HS, 1998, J PHONETICS, V26, P381, DOI 10.1006/jpho.1998.0081 Major RC, 2007, STUD SECOND LANG ACQ, V29, P539, DOI 10.1017/S0272263107070428 Mennen I, 2004, J PHONETICS, V32, P543, DOI 10.1016/j.wocn.2004.02.002 Munro M., 2003, P 15 INT C PHON SCI, P535 Munro M., 2003, TESL CANADA J, V20, P38 Munro M. J., 1995, STUDIES 2 LANGUAGE A, V17, P17, DOI 10.1017/S0272263100013735 Munro M. 
J., 2001, STUDIES 2 LANGUAGE A, V23, P451 MUNRO MJ, 1995, LANG LEARN, V45, P73, DOI 10.1111/j.1467-1770.1995.tb00963.x Munro MJ, 2006, STUD SECOND LANG ACQ, V28, P111, DOI 10.1017/S0272263106060049 Munro Murray J., 2008, PHONOLOGY 2 LANGUAGE, P193 Nolan F., 1998, FORENSIC LINGUIST, V3, P39 Ramus F, 2000, SCIENCE, V288, P349, DOI 10.1126/science.288.5464.349 RAUPACH M, 1980, STUDIES HONOUR F GOL, P263 RIGGENBACH H, 1991, DISCOURSE PROCESS, V14, P423 Scovel T., 1988, TIME SPEAK PSYCHOLIN Sherman D, 1954, J SPEECH HEAR DISORD, V19, P312 Southwood MH, 1999, CLIN LINGUIST PHONET, V13, P335 Tajima K, 1997, J PHONETICS, V25, P1, DOI 10.1006/jpho.1996.0031 THOMPSON I, 1991, LANG LEARN, V41, P177, DOI 10.1111/j.1467-1770.1991.tb00683.x Toro JM, 2003, ANIM COGN, V6, P131, DOI 10.1007/s10071-003-0172-0 Trofimovich P, 2006, STUD SECOND LANG ACQ, V28, P1, DOI 10.1017/S0272263106060013 Tsukada K, 2004, PHONETICA, V61, P67, DOI 10.1159/000082557 vanDommelen WA, 1995, LANG SPEECH, V38, P267 VANELS T, 1987, MOD LANG J, V71, P147, DOI 10.2307/327199 VANLANCKER D, 1985, J PHONETICS, V13, P19 Varonis E. M., 1982, STUDIES 2ND LANGUAGE, V4, P114, DOI 10.1017/S027226310000437X White L, 2007, J PHONETICS, V35, P501, DOI 10.1016/j.wocn.2007.02.003 Wilson I. L., 2006, THESIS U BRIT COLUMB NR 62 TC 12 Z9 13 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL-AUG PY 2010 VL 52 IS 7-8 BP 626 EP 637 DI 10.1016/j.specom.2010.02.013 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 608HH UT WOS:000278573700002 ER PT J AU Reubold, U Harrington, J Kleber, F AF Reubold, Ulrich Harrington, Jonathan Kleber, Felicitas TI Vocal aging effects on F-0 and the first formant: A longitudinal analysis in adult speakers SO SPEECH COMMUNICATION LA English DT Article DE Vocal aging; F-0; Formants; Longitudinal analysis; Perception; Source-tract-interaction ID SPEAKING FUNDAMENTAL-FREQUENCY; LARYNGEAL AIRWAY-RESISTANCE; AGE-RELATED-CHANGES; ACOUSTIC CHARACTERISTICS; VOWEL PRODUCTION; VERTICAL-BAR; SPEECH; VOICE; TRACT; ENGLISH AB This paper presents a longitudinal analysis of the extent to which age affects F-0 and formant frequencies. Five speakers at two time intervals showed a clear effect for F-0 and F-1 but no systematic effects for F-2 or F-3. In two speakers for which recordings were available in successive years over a 50 year period, results showed with increasing age a decrease in both F-0 and F-1 for a female speaker and a V-shaped pattern, i.e. a decrease followed by an increase in both F-0 and F-1 for a male speaker. This analysis also provided strong evidence that F-1 approximately tracked F-0 across the years: i.e., the rate of change of (the logarithm of) F-0 and F-1 were generally the same. We then also tested that the changes in F-1 were not an acoustic artifact of changing F-0. Perception experiments with the main aim of assessing whether changes in F-1 contributed to age judgments beyond those from F-0 showed that the contribution of F-1 was inconsistent and negligible. The general conclusion is that age-related changes in F-1 may be compensatory to offset a physiologically induced decline in F-0 and thereby maintain a relatively constant auditory distance between F-0 and F-1. (C) 2010 Elsevier B.V. All rights reserved. C1 [Reubold, Ulrich; Harrington, Jonathan; Kleber, Felicitas] Univ Munich, Inst Phonet & Speech Proc IPS, D-80799 Munich, Germany. 
Ocean Univ Qingdao, Inst Comp Sci & Engn, Shandong, Peoples R China. RP Reubold, U (reprint author), Univ Munich, Inst Phonet & Speech Proc IPS, Schellingstr 3-2, D-80799 Munich, Germany. EM reubold@phonetik.uni-muenchen.de CR Abitbol J, 1999, J VOICE, V13, P424, DOI 10.1016/S0892-1997(99)80048-4 Baayen R. Harald, 2008, ANAL LINGUISTIC DATA BADIN P, 1984, 23 STL QPSR, P53 Bailey Guy, 1991, LANG VAR CHANGE, V3.3, P241, DOI [10.1017/S0954394500000569, DOI 10.1017/S0954394500000569] Baken RJ, 2005, J VOICE, V19, P317, DOI 10.1016/j.jvoice.2004.07.005 Barney A, 2007, ACTA ACUST UNITED AC, V93, P1046 Beckman Mary E., 2010, HDB PHONETIC SCI, P603, DOI 10.1002/9781444317251.ch16 BENJAMIN BJ, 1982, J PSYCHOLINGUIST RES, V11, P159 Beyerlein P., 2008, VOCAL AGING EXPLAINE BROWN WS, 1991, J VOICE, V5, P310, DOI 10.1016/S0892-1997(05)80061-X Campbell J, 2004, DIGIT SIGNAL PROCESS, V14, P295, DOI 10.1016/j.dsp.2004.06.001 Chambers J. K., 2002, HDB LANGUAGE VARIATI CHILDERS DG, 1994, IEEE T BIO-MED ENG, V41, P663, DOI 10.1109/10.301733 Cox F., 2006, AUSTR J LINGUISTICS, V26, P147, DOI 10.1080/07268600600885494 DEJONG KJ, 1995, J ACOUST SOC AM, V97, P491, DOI 10.1121/1.412275 Decoster W, 2000, J VOICE, V14, P184, DOI 10.1016/S0892-1997(00)80026-0 DEPINTO O, 1982, J PHONETICS, V10, P367 DOCHERTY G, OXFORD HDB IN PRESS ENDRES W, 1971, J ACOUST SOC AM, V49, P1842, DOI 10.1121/1.1912589 Flanagan J. L., 1965, SPEECH ANAL SYNTHESI FLUGEL C, 1991, MECH AGEING DEV, V61, P65, DOI 10.1016/0047-6374(91)90007-M Foulkes P, 2006, J PHONETICS, V34, P409, DOI 10.1016/j.wocn.2005.08.002 GUGATSCHKA M, J VOICE IN PRESS Harnsberger James D, 2008, J Voice, V22, P58, DOI 10.1016/j.jvoice.2006.07.004 Harrington J., 2000, PAPERS LAB PHONOLOGY, VV, P40 Harrington J, 2008, J ACOUST SOC AM, V123, P2825, DOI 10.1121/1.2897042 Harrington J, 2000, NATURE, V408, P927, DOI 10.1038/35050160 HARRINGTON J, 2005, GIFT SPEECH, P227 Harrington J, 2006, J PHONETICS, V34, P439, DOI 10.1016/j.wocn.2005.08.001 HARRINGTON J, 2007, LAB PHONOLOGY, V9 Harrington Jonathan, 2000, J INT PHON ASSOC, V30, P63, DOI [10.1017/S0025100300006666, DOI 10.1017/S0025100300006666] HOIT JD, 1987, J SPEECH HEAR RES, V30, P351 HOIT JD, 1992, J SPEECH HEAR RES, V35, P309 HOLLIEN H, 1972, J SPEECH HEAR RES, V15, P155 Holmes J., 2001, SPEECH SYNTHESIS REC Honda K, 1999, LANG SPEECH, V42, P401 Huber JE, 2008, J SPEECH LANG HEAR R, V51, P651, DOI 10.1044/1092-4388(2008/047) KLATT DH, 1990, J ACOUST SOC AM, V87, P820, DOI 10.1121/1.398894 LABOV W, 1994, SOCIOLINGUISTIC PATT, V1 Labov William, 1972, SOCIOLINGUISTIC PATT Labov William, 2001, PRINCIPLES LINGUISTI, V2 LINDBLOM BE, 1971, J ACOUST SOC AM, V50, P1166, DOI 10.1121/1.1912750 Linville SE, 2001, J VOICE, V15, P323, DOI 10.1016/S0892-1997(01)00034-0 Linville SE, 2001, VOCAL AGING Linville SE, 1996, J VOICE, V10, P190, DOI 10.1016/S0892-1997(96)80046-4 Linville SE, 1987, J VOICE, V1, P44, DOI 10.1016/S0892-1997(87)80023-1 LINVILLE SE, 1985, J GERONTOL, V40, P324 LINVILLE SE, 1985, J ACOUST SOC AM, V78, P40, DOI 10.1121/1.392452 MAEDA S, 1990, NATO ADV SCI I D-BEH, V55, P131 Maeda S., 1979, ACT 10 JOURN ET PAR, P152 MELCON MC, 1989, J SPEECH HEAR DISORD, V54, P282 MENDOZADENTON N, INTER INTRA IN PRESS Morris RJ, 1987, J VOICE, V1, P38, DOI 10.1016/S0892-1997(87)80022-X MOULINES E, 1990, SPEECH COMMUN, V9, P453, DOI 10.1016/0167-6393(90)90021-Z MWANGI S, 2009, P INT C AC NAG DAGA, P1761 Nishio M, 2008, FOLIA PHONIATR LOGO, V60, P120, DOI 10.1159/000118510 Palethorpe Sallyanne, 2007, INTERSPEECH 2007, P2753 Pedhazur 
Elazar J., 1997, MULTIPLE REGRESSION Rastatter MP, 1997, FOLIA PHONIATR LOGO, V49, P1 RUSSELL A, 1995, J SPEECH HEAR RES, V38, P101 Sankoff G, 2007, LANGUAGE, V83, P560, DOI 10.1353/lan.2007.0106 Sato K, 1997, ANN OTO RHINOL LARYN, V106, P44 SCUKANEC GP, 1991, PERCEPT MOTOR SKILL, V73, P203, DOI 10.2466/PMS.73.4.203-208 SEGRE R, 1971, EYE EAR NOSE THROAT, V50, P62 SHIPP T, 1975, J SPEECH HEAR RES, V18, P707 SLAWSON AW, 1968, J ACOUST SOC AM, V43, P87, DOI 10.1121/1.1910769 SMITH BL, 1987, J SPEECH HEAR RES, V30, P522 SYRDAL AK, 1986, J ACOUST SOC AM, V79, P1086, DOI 10.1121/1.393381 TRAUNMULLER H, 1981, J ACOUST SOC AM, V69, P1465 TRAUNMULLER H, 1984, SPEECH COMMUN, V3, P49, DOI 10.1016/0167-6393(84)90008-6 TRAUNMULLER H, 1991, P 12 INT C PHON SCI, V5, P62 Trudgill Peter, 1979, SOCIAL MARKERS SPEEC Verdonck-de Leeuw Irma M, 2004, J Voice, V18, P193, DOI 10.1016/j.jvoice.2003.10.002 Weinreich U., 1968, DIRECTIONS HIST LING Wind J., 1970, PHYLOGENY ONTOGENY H WINKLER R, 2007, P 16 INT C PHON SCI WINKLER R, 2007, P INT 2007 ANTW Xue SA, 2003, J SPEECH LANG HEAR R, V46, P689, DOI 10.1044/1092-4388(2003/054) Zemlin WR, 1998, SPEECH HEARING SCI A NR 79 TC 18 Z9 18 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL-AUG PY 2010 VL 52 IS 7-8 BP 638 EP 651 DI 10.1016/j.specom.2010.02.012 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 608HH UT WOS:000278573700003 ER PT J AU Yu, K Gales, M Wang, L Woodland, PC AF Yu, Kai Gales, Mark Wang, Lan Woodland, Philip C. TI Unsupervised training and directed manual transcription for LVCSR SO SPEECH COMMUNICATION LA English DT Article DE Unsupervised training; Discriminative training; Automatic transcription; Data selection AB A significant cost in obtaining acoustic training data is the generation of accurate transcriptions. When no transcription is available, unsupervised training techniques must be used. Furthermore, the use of discriminative training has become a standard feature of state-of-the-art large vocabulary continuous speech recognition (LVCSR) system. In unsupervised training, unlabelled data are recognised using a seed model and the hypotheses from the recognition system are used as transcriptions for training. In contrast to maximum likelihood training, the performance of discriminative training is more sensitive to the quality of the transcriptions. One approach to deal with this issue is data selection, where only well recognised data are selected for training. More effectively, as the key contribution of this work, an active learning technique, directed manual transcription, can be used. Here a relatively small amount of poorly recognised data is manually transcribed to supplement the automatic transcriptions. Experiments show that using the data selection approach for discriminative training yields disappointing performance improvement on the data which is mismatched to the training data type of the seed model. However, using the directed manual transcription approach can yield significant improvements in recognition accuracy on all types of data. (C) 2010 Elsevier B.V. All rights reserved. C1 [Yu, Kai; Gales, Mark; Wang, Lan; Woodland, Philip C.] Univ Cambridge, Dept Engn, Machine Intelligence Lab, Cambridge CB2 1PZ, England. RP Yu, K (reprint author), Univ Cambridge, Dept Engn, Machine Intelligence Lab, Cambridge CB2 1PZ, England. 
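A minimal Python sketch of the two selection ideas in the Yu et al. abstract above: automatically transcribed utterances with high recognition confidence are kept for unsupervised training, while the most poorly recognised ones are queued for directed manual transcription. The record format, confidence threshold and manual budget are assumptions made for illustration.

# Illustrative sketch: confidence-based data selection plus directed manual transcription.
def split_for_training(hypotheses, keep_threshold=0.8, manual_budget=100):
    """hypotheses: list of (utt_id, hyp_text, avg_word_confidence) tuples."""
    auto = [(u, h) for u, h, c in hypotheses if c >= keep_threshold]
    worst_first = sorted(hypotheses, key=lambda x: x[2])
    to_transcribe = [u for u, _, c in worst_first if c < keep_threshold][:manual_budget]
    return auto, to_transcribe

hyps = [("utt1", "hello world", 0.95), ("utt2", "uh", 0.40), ("utt3", "good morning", 0.85)]
auto_data, manual_queue = split_for_training(hyps, keep_threshold=0.8, manual_budget=1)
print(auto_data)       # kept with their automatic transcriptions
print(manual_queue)    # sent for directed manual transcription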
EM ky219@cam.ac.uk RI Yu, Kai/B-1772-2012 OI Yu, Kai/0000-0002-7102-9826 FU Defense Advanced Research Projects Agency [HR0011-06-C-0022] FX This work was supported in part under the GALE program of the Defense Advanced Research Projects Agency, Contract No. HR0011-06-C-0022. Many thanks go to X.A. Liu for training some of the language models used in the experiments. CR CHAN H, 2004, ICASSP MONTR COHN D, 1994, MACH LEARN, V15, P201, DOI 10.1023/A:1022673506211 Cox S. J., 1989, P ICASSP GLASG DOUMPIOTIS V, 2003, P EUROSPEECH EVERMANN G, 2003, P ASRU ST THOM EVERMANN G, 2005, P ICASSP PHIL GALES MJF, 2005, P ICASSP PHIL KAMM TM, 2002, P HUM LANG TECHN SAN KEMP T, 1999, P EUROSPEECH BUD Kumar N, 1997, THESIS J HOPKINS U LAMEL L, 2001, P ICASSP SALT LAK CI Lamel L, 2002, COMPUT SPEECH LANG, V16, P115, DOI 10.1006/csla.2001.0186 MA J, 2006, P ICASSP TOUL MANGU L, 2000, THESIS J HOPKINS U NAKAMURA M, 2007, COMPUTER SPEECH LANG, V22, P171 PALLETT DS, 1990, P ICASSP POVEY D, 2002, P ICASSP ORL Povey D., 2003, THESIS CAMBRIDGE U RICCARDI G, 2003, P EUROSPEECH GEN SINHA R, 2006, P ICASSP TOUL WANG L, 2007, P ICASSP HON Wessel F, 2005, IEEE T SPEECH AUDI P, V13, P23, DOI 10.1109/TSA.2004.838537 WOODLAND PC, 1995, ARPA WORKSH SPOK LAN, P104 WOODLAND PC, 1997, DARPA SPEECH REC WOR, P73 YU K, 2007, P INTERSPEECH ANTW NR 25 TC 8 Z9 8 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL-AUG PY 2010 VL 52 IS 7-8 BP 652 EP 663 DI 10.1016/j.specom.2010.02.014 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 608HH UT WOS:000278573700004 ER PT J AU Gorriz, JM Ramirez, J Lang, EW Puntonet, CG Turias, I AF Gorriz, J. M. Ramirez, J. Lang, E. W. Puntonet, C. G. Turias, I. TI Improved likelihood ratio test based voice activity detector applied to speech recognition SO SPEECH COMMUNICATION LA English DT Article DE Voice activity detection; Generalized complex Gaussian probability distribution function; Robust speech recognition ID NOISE; LRT; INFORMATION; MODEL; VAD AB Nowadays, the accuracy of speech processing systems is strongly affected by acoustic noise. This is a serious obstacle regarding the demands of modern applications. Therefore, these systems often need a noise reduction algorithm working in combination with a precise voice activity detector (VAD). The computation needed to achieve denoising and speech detection must not exceed the limitations imposed by real time speech processing systems. This paper presents a novel VAD for improving speech detection robustness in noisy environments and the performance of speech recognition systems in real time applications. The algorithm is based on a Multivariate Complex Gaussian (MCG) observation model and defines an optimal likelihood ratio test (LRT) involving multiple and correlated observations (MCO) based on a jointly Gaussian probability distribution (jGpdf) and a symmetric covariance matrix. The complete derivation of the jGpdf-LRT for the general case of a symmetric covariance matrix is shown in terms of the Cholesky decomposition which allows to efficiently compute the VAD decision rule. 
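For a concrete picture of a likelihood-ratio voice activity decision over multiple correlated observations, the following real-valued Python sketch evaluates the two Gaussian log-likelihoods through a Cholesky factorisation of the covariance. It is only a simplified illustration under assumed zero-mean real Gaussian models with made-up covariances and an arbitrary threshold, not the paper's multivariate complex Gaussian derivation.

    import numpy as np

    def gaussian_logpdf(x, cov):
        """Zero-mean multivariate Gaussian log-density, evaluated through the
        Cholesky factor of the covariance (no explicit matrix inverse)."""
        d = x.size
        L = np.linalg.cholesky(cov)                  # cov = L @ L.T
        z = np.linalg.solve(L, x)                    # whitened observation
        log_det = 2.0 * np.sum(np.log(np.diag(L)))
        return -0.5 * (d * np.log(2.0 * np.pi) + log_det + z @ z)

    def lrt_vad(obs, noise_cov, speech_cov, threshold=0.0):
        """Decide 'speech present' when the log-likelihood ratio between the
        speech-plus-noise and noise-only Gaussian models exceeds the threshold."""
        llr = gaussian_logpdf(obs, speech_cov) - gaussian_logpdf(obs, noise_cov)
        return llr > threshold, llr

    # Toy usage with three correlated observations (all values are made up):
    noise_cov = np.array([[1.0, 0.3, 0.1],
                          [0.3, 1.0, 0.3],
                          [0.1, 0.3, 1.0]])
    speech_cov = 4.0 * noise_cov     # speech assumed to add correlated energy
    obs = np.random.default_rng(0).multivariate_normal(np.zeros(3), speech_cov)
    print(lrt_vad(obs, noise_cov, speech_cov))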
An extensive analysis of the proposed methodology for a low dimensional observation model demonstrates: (i) the improved robustness of the proposed approach by means of a clear reduction of the classification error as the number of observations is increased, and (ii) the trade-off between the number of observations and the detection performance. The proposed strategy is also compared to different VAD methods including the G.729, AMR and AFE standards, as well as other recently reported algorithms showing a sustained advantage in speech/non-speech detection accuracy and speech recognition performance using the AURORA databases. (C) 2010 Elsevier B.V. All rights reserved. C1 [Gorriz, J. M.; Ramirez, J.] Univ Granada, Dpt Signal Theory Networking & Commun, E-18071 Granada, Spain. [Lang, E. W.] Univ Regensburg, Inst Biophys, D-93040 Regensburg, Germany. [Puntonet, C. G.] Univ Granada, Dpt Comp Architecture & Technol, E-18071 Granada, Spain. [Turias, I.] Univ Cadiz, Dpt Lenguajes & Sistemas Informat, Algeciras 11202, Spain. RP Gorriz, JM (reprint author), Univ Granada, Dpt Signal Theory Networking & Commun, E-18071 Granada, Spain. EM gorriz@ugr.es RI Puntonet, Carlos/B-1837-2012; Prieto, Ignacio/B-5361-2013; Gorriz, Juan/C-2385-2012; Turias, Ignacio/L-7211-2014; Ramirez, Javier/B-1836-2012 OI Turias, Ignacio/0000-0003-4627-0252; Ramirez, Javier/0000-0002-6229-2921 CR [Anonymous], 1999, 301708 ETSI EN [Anonymous], 2000, 201108 ETSI ES Benyassine A, 1997, IEEE COMMUN MAG, V35, P64, DOI 10.1109/35.620527 Berouti M., 1979, P IEEE INT C AC SPEE, P208 BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 Bouquin-Jeannes R. L., 1995, SPEECH COMMUN, V16, P245 BOUQUINJEANNES RL, 1994, ELECTRON LETT, V30, P930 Chang JH, 2004, ELECTRON LETT, V40, P1561, DOI 10.1049/el:20047090 Chengalvarayan R, 1999, P EUROSPEECH 1999 BU, P61 Cho D, 2005, SIGNAL PROCESS-IMAGE, V20, P77, DOI 10.1016/j.image.2004.10.003 Cho YD, 2001, P INT C AC SPEECH SI, V2, P737 ETSI, 2002, 202050 ETSI ES Golub G.H., 1996, MATRIX COMPUTATIONS Gorriz J. M., 2006, Speech Communication, V48, DOI 10.1016/j.specom.2006.07.006 Gorriz J, 2009, IEEE T AUDIO SPEECH, V16, P1565 Gorriz JM, 2006, J ACOUST SOC AM, V120, P470, DOI 10.1121/1.2208450 Gorriz JM, 2005, ELECTRON LETT, V41, P877, DOI 10.1049/el:20051761 Gorriz JM, 2006, IEEE SIGNAL PROC LET, V13, P636, DOI 10.1109/LSP.2006.876340 Hirsch H. G., 2000, ISCA ITRW ASR2000 AU International Telecommunication Union (ITU), 1996, G729 ITUT KARRAY L, 2003, SPEECH COMMUN, V3, P261 Li Q, 2002, IEEE T SPEECH AUDI P, V10, P146 Manly B.F.J., 1986, MULTIVARIATE STAT ME Marzinzik M., 2002, IEEE T SPEECH AUDIO, V10, P341 MORENO A, 2000, P 2 LREC C Niehsen W, 1999, IEEE T SIGNAL PROCES, V47, P217, DOI 10.1109/78.738256 Ramirez J, 2005, IEEE T SPEECH AUDI P, V13, P1119, DOI 10.1109/TSA.2005.853212 Ramirez J, 2004, SPEECH COMMUN, V42, P271, DOI 10.1016/j.specom.2003.10.002 Ramirez J, 2006, IEEE SIGNAL PROC LET, V13, P497, DOI 10.1109/LSP.2006.873147 RAMIREZ J, 2001, IEEE SIGNAL PROCESS, V12, P837 SOHN J, 1999, IEEE SIGNAL PROCESSI, V16, P1 Tanyer SG, 2000, IEEE T SPEECH AUDI P, V8, P478, DOI 10.1109/89.848229 TUCKER R, 1992, IEE PROC-I, V139, P377 Woo KH, 2000, ELECTRON LETT, V36, P180, DOI 10.1049/el:20000192 Yamani HA, 1997, J PHYS A-MATH GEN, V30, P2889, DOI 10.1088/0305-4470/30/8/029 YOUNG S, 1997, HTK BOOK NR 36 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD JUL-AUG PY 2010 VL 52 IS 7-8 BP 664 EP 677 DI 10.1016/j.specom.2010.03.003 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 608HH UT WOS:000278573700005 ER PT J AU Christiansen, C Pedersen, MS Dau, T AF Christiansen, Claus Pedersen, Michael Syskind Dau, Torsten TI Prediction of speech intelligibility based on an auditory preprocessing model SO SPEECH COMMUNICATION LA English DT Article DE Speech intelligibility; Auditory processing model; Ideal binary mask; Speech intelligibility index; Speech transmission index ID SHORT-TERM ADAPTATION; RECEPTION THRESHOLD; AMPLITUDE-MODULATION; AUDIO QUALITY; QUANTITATIVE MODEL; TRANSMISSION INDEX; FLUCTUATING NOISE; NERVE RESPONSES; NORMAL-HEARING; ITU STANDARD AB Classical speech intelligibility models, such as the speech transmission index (STI) and the speech intelligibility index (SII) are based on calculations on the physical acoustic signals. The present study predicts speech intelligibility by combining a psychoacoustically validated model of auditory preprocessing [Dau et al., 1997. J. Acoust. Soc. Am. 102,2892-2905] with a simple central stage that describes the similarity of the test signal with the corresponding reference signal at a level of the internal representation of the signals. The model was compared with previous approaches, whereby a speech in noise experiment was used for training and an ideal binary mask experiment was used for evaluation. All three models were able to capture the trends in the speech in noise training data well, but the proposed model provides a better prediction of the binary mask test data, particularly when the binary masks degenerate to a noise vocoder. (C) 2010 Elsevier B.V. All rights reserved. C1 [Christiansen, Claus; Dau, Torsten] Tech Univ Denmark, Dept Elect Engn, Ctr Appl Hearing Res, DK-2800 Lyngby, Denmark. [Pedersen, Michael Syskind] Oticon AS, DK-2765 Smorum, Denmark. RP Christiansen, C (reprint author), Tech Univ Denmark, Dept Elect Engn, Ctr Appl Hearing Res, DK-2800 Lyngby, Denmark. EM cfc@elektro.dtu.dk; msp@oticon.dk; tda@elektro.dtu.dk FU Danish research council; Oticon; Widex; GN Resound FX We are grateful to Ulrik Kjems for providing all the speech material as well as all the measured data from the psychoacoustic measurements. We also thank two anonymous reviewers for their very helpful comments on an earlier version of this paper. This work has been partly supported by the Danish research council and partly by Oticon, Widex and GN Resound through a research consortium. 
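As a loose illustration of the "central stage" idea summarised in the abstract above, the sketch below correlates two internal representations (whatever the auditory preprocessing produced for the degraded and reference signals) and maps the result to a percent-correct score. The auditory model itself is not reproduced here, and the logistic parameters are placeholders one would fit to training data, not values from the paper.

    import numpy as np

    def representation_similarity(internal_test, internal_ref, eps=1e-12):
        """Pearson correlation between two internal representations
        (arrays of the same shape), used as a similarity index in [-1, 1]."""
        a = np.ravel(internal_test).astype(float)
        b = np.ravel(internal_ref).astype(float)
        a = (a - a.mean()) / (a.std() + eps)
        b = (b - b.mean()) / (b.std() + eps)
        return float(np.mean(a * b))

    def predicted_intelligibility(similarity, slope=8.0, midpoint=0.4):
        """Map the similarity index to a percent-correct prediction with a
        logistic function (slope and midpoint are placeholder parameters)."""
        return 100.0 / (1.0 + np.exp(-slope * (similarity - midpoint)))

In practice the slope and midpoint would be estimated from a training condition (such as speech in stationary noise) before the mapping is applied to unseen conditions.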
CR ANSI, 1997, S351997 ANSI BEERENDS JG, 1992, J AUDIO ENG SOC, V40, P963 Beerends JG, 2002, J AUDIO ENG SOC, V50, P765 Bekesy G., 1960, EXPT HEARING Brungart DS, 2006, J ACOUST SOC AM, V120, P4007, DOI 10.1121/1.2363929 CARTER GC, 1973, IEEE T ACOUST SPEECH, VAU21, P337, DOI 10.1109/TAU.1973.1162496 Dau T, 1996, J ACOUST SOC AM, V99, P3615, DOI 10.1121/1.414959 Dau T, 1997, J ACOUST SOC AM, V102, P2906, DOI 10.1121/1.420345 Dau T, 1996, J ACOUST SOC AM, V99, P3623, DOI 10.1121/1.414960 Dau T, 1997, J ACOUST SOC AM, V102, P2892, DOI 10.1121/1.420344 Derleth RP, 2000, J ACOUST SOC AM, V108, P285, DOI 10.1121/1.429464 DRULLMAN R, 1994, J ACOUST SOC AM, V95, P2670, DOI 10.1121/1.409836 Elhilali M, 2003, SPEECH COMMUN, V41, P331, DOI 10.1016/S0167-6393(02)00134-6 FESTEN JM, 1990, J ACOUST SOC AM, V88, P1725, DOI 10.1121/1.400247 FRENCH NR, 1947, J ACOUST SOC AM, V19, P90, DOI 10.1121/1.1916407 Goldsworthy RL, 2004, J ACOUST SOC AM, V116, P3679, DOI 10.1121/1.1804628 HAGERMAN B, 1984, SCAND AUDIOL, V13, P57, DOI 10.3109/01050398409076258 HAGERMAN B, 1982, SCAND AUDIOL, V11, P79, DOI 10.3109/01050398209076203 HAGERMAN B, 1984, Scandinavian Audiology Supplementum, P1 HAGERMAN B, 1982, SCAND AUDIOL, V11, P191, DOI 10.3109/01050398209076217 HAGERMAN B, 1995, SCAND AUDIOL, V24, P71, DOI 10.3109/01050399509042213 Holube I, 1996, J ACOUST SOC AM, V100, P1703, DOI 10.1121/1.417354 Hu GN, 2004, IEEE T NEURAL NETWOR, V15, P1135, DOI 10.1109/TNN.2004.832812 Huber R, 2006, IEEE T AUDIO SPEECH, V14, P1902, DOI 10.1109/TASL.2006.883259 Jepsen ML, 2008, J ACOUST SOC AM, V124, P422, DOI 10.1121/1.2924135 Karjalainen M., 1985, P IEEE INT C AC SPEE, V10, P608 Kates JM, 2005, J ACOUST SOC AM, V117, P2224, DOI 10.1121/1.1862575 Kim DS, 2005, IEEE T SPEECH AUDI P, V13, P821, DOI 10.1109/TSA.2005.851924 Kjems U, 2009, J ACOUST SOC AM, V126, P1415, DOI 10.1121/1.3179673 KOCH R, 1992, THESIS G AUGUST U GO LUDVIGSEN C, 1990, ACTA OTO-LARYNGOL, P190 MCCLELLAND JL, 1986, COGNITIVE PSYCHOL, V18, P1, DOI 10.1016/0010-0285(86)90015-0 MILLER GA, 1950, J ACOUST SOC AM, V22, P167, DOI 10.1121/1.1906584 MOORE BCJ, 1977, INTRO PSYCHOL HEARIN NIELSEN LB, 1993, THESIS TU DENMARK OT PALMER AR, 1986, HEARING RES, V24, P1, DOI 10.1016/0378-5955(86)90002-X Patterson RD, 1987, M IOC SPEECH GROUP A Payton K. L., 2002, PAST PRESENT FUTURE, P125 Payton KL, 1999, J ACOUST SOC AM, V106, P3637, DOI 10.1121/1.428216 Peters RW, 1998, J ACOUST SOC AM, V103, P577, DOI 10.1121/1.421128 Pickles JO, 1988, INTRO PHYSL HEARING Plack C. J., 2005, THE SENSE OF HEARING Rhebergen KS, 2005, J ACOUST SOC AM, V117, P2181, DOI 10.1121/1.1861713 Ruggero MA, 1997, J ACOUST SOC AM, V101, P2151, DOI 10.1121/1.418265 SMITH RL, 1977, J NEUROPHYSIOL, V40, P1098 STEENEKEN HJM, 1980, J ACOUST SOC AM, V67, P318, DOI 10.1121/1.384464 Thiede T, 2000, J AUDIO ENG SOC, V48, P3 Verhey JL, 1999, J ACOUST SOC AM, V106, P2733, DOI 10.1121/1.428101 Wagener K, 2003, INT J AUDIOL, V42, P10, DOI 10.3109/14992020309056080 Wang DL, 2005, SPEECH SEPARATION BY HUMANS AND MACHINES, P181, DOI 10.1007/0-387-22794-6_12 WESTERMAN LA, 1984, HEARING RES, V15, P249, DOI 10.1016/0378-5955(84)90032-7 Zilany MSA, 2007, 3 INT IEEE EMBS C NE, P481 Zilany MSA, 2006, J ACOUST SOC AM, V120, P1446, DOI 10.1121/1.2225512 NR 53 TC 21 Z9 21 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD JUL-AUG PY 2010 VL 52 IS 7-8 BP 678 EP 692 DI 10.1016/j.specom.2010.03.004 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 608HH UT WOS:000278573700006 ER PT J AU Dohen, M Schwartz, JL Bailly, G AF Dohen, Marion Schwartz, Jean-Luc Bailly, Gerard TI Speech and face-to-face communication - An introduction SO SPEECH COMMUNICATION LA English DT Editorial Material DE Multimodality; Interaction; Nonverbal communication AB This issue focuses on face-to-face speech communication. Research works have demonstrated that this communicative situation is essential to language acquisition and development (e.g. naming). Face-to-face communication is in fact much more than speaking and speech is greatly influenced both in substance and content by this essential form of communication. Face-to-face communication is multimodal: interacting involves multimodality and nonverbal communication to a large extent. Speakers not only hear but also see each other producing sounds as well as facial and more generally body gestures. Gaze together with speech contribute to maintain mutual attention and to regulate turn-taking for example. Moreover, speech communication involves not only linguistic but also psychological, affective and social aspects of interaction. Face-to-face communication is situated: the true challenge of spoken communication is to take into account and integrate information not only from the speakers but also from the entire physical environment in which the interaction takes place. The communicative setting, the "task" in which the interlocutors are involved, their respective roles and the environmental conditions of the conversation indeed greatly influence how the spoken interaction unfolds. The present issue aims at synthesizing the most recent developments in this topic considering its various aspects from complementary perspectives: cognitive and neurocognitive (multisensory and perceptuo-motor interactions), linguistic (dialogic face to face interactions), paralinguistic (emotions and affects, turn-taking, mutual attention), computational (animated conversational agents, multimodal interacting communication systems). (C) 2010 Elsevier B.V. All rights reserved. C1 [Dohen, Marion; Schwartz, Jean-Luc; Bailly, Gerard] Grenoble Univ, GIPSA Lab, Speech & Cognit Dept, CNRS,UMR 5216, Grenoble, France. RP Dohen, M (reprint author), INPG, 961 Rue Houille Blanche,Domaine Univ,BP 46, F-38402 St Martin Dheres, France. EM Marion.Dohen@gipsa-lab.grenoble-inp.fr NR 0 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2010 VL 52 IS 6 SI SI BP 477 EP 480 DI 10.1016/j.specom.2010.02.016 PG 4 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 604MB UT WOS:000278282000001 ER PT J AU Weiss, B Kuehnel, C Wechsung, I Fagel, S Moller, S AF Weiss, Benjamin Kuehnel, Christine Wechsung, Ina Fagel, Sascha Moeller, Sebastian TI Quality of talking heads in different interaction and media contexts SO SPEECH COMMUNICATION LA English DT Article DE Embodied conversational agent; Smart home; Talking head; Usability; WOZ ID AGENT; INTERFACE; FACE AB We investigate the impact of three different factors on the quality of talking heads as metaphors of a spoken dialogue system in the smart home domain. 
The main focus lies on the effect of voice and head characteristics on audio and video quality, as well as overall quality. Furthermore, the influence of interactivity and of media context on user perception is analysed. For this purpose two subsequent experiments were conducted: the first was designed as a non-interactive rating test of videos of talking heads, while the second experiment was interactive. Here, the participants had to solve a number of tasks in dialogue with a talking head. To assess the impact of the media context, redundant information was provided via an additional visual output channel to half of the participants. As a secondary effect, the importance of participants' gender is examined. It is shown that perceived quality differences observed in the non-interactive setting are blurred when the interactivity and media contexts provide distraction from the talking head. Furthermore, a simple additional feedback screen improves the perceived quality of the talking heads. Gender effects are negligible concerning the ratings in interaction, but female and male participants exhibit different behaviour in the experiment. This advocates for more realistic evaluation settings in order to increase the external validity of the obtained quality judgements. (C) 2010 Elsevier B.V. All rights reserved. C1 [Weiss, Benjamin; Kuehnel, Christine; Wechsung, Ina; Moeller, Sebastian] TU Berlin, Deutsch Telekom Labs, Qual & Usabil Lab, D-10587 Berlin, Germany. [Fagel, Sascha] TU Berlin, Inst Sprache & Kommunikat, D-10587 Berlin, Germany. RP Weiss, B (reprint author), TU Berlin, Deutsch Telekom Labs, Qual & Usabil Lab, Ernst Reuter Pl 7, D-10587 Berlin, Germany. EM BWeiss@telekom.de; Christine.Kuehnel@telekom.de; Ina.Wechsung@telekom.de; Sascha.Fagel@tu-berlin.de; Sebastian.Moeller@telekom.de FU Deutsche Forschungsgemeinschaft DFG (German Research Community) [MO 1038/6-1] FX The project was financially supported by the Deutsche Forschungsgemeinschaft DFG (German Research Community), Grant MO 1038/6-1. CR Adcock A. B., 2005, J INTERACTIVE LEARNI, V16, P195 Andre E, 1998, KNOWL-BASED SYST, V11, P25, DOI 10.1016/S0950-7051(98)00057-4 Berry DC, 2005, INT J HUM-COMPUT ST, V63, P304, DOI 10.1016/j.ijhcs.2005.03.006 BREITFUSS W, 2008, P AISB 2008 S MULT O, P18 Buisine S, 2004, HUM COM INT, V7, P217 Burnham D., 2008, P INT C AUD VIS SPEE CANADA K, 1991, ETR&D-EDUC TECH RES, V39, P43, DOI 10.1007/BF02298153 Cassell J., 2000, EMBODIED CONVERSATIO Costello A.B., 2005, PRACTICAL ASSESSMENT, V10 Cowell AJ, 2005, INT J HUM-COMPUT ST, V62, P281, DOI 10.1016/j.ijhcs.2004.11.008 Dehn DM, 2000, INT J HUM-COMPUT ST, V52, P1, DOI 10.1006/ijhc.1999.0325 Dutoit T., 1996, P ICSLP 96 PHIL, V3, P1393, DOI 10.1109/ICSLP.1996.607874 ERICKSON T, 1997, DESIGNING AGENTS PEO Fagel S., 2007, P INT C AUD VIS SPEE FAGEL S, 2008, P INT C AUD VIS SPEE Fagel S, 2004, SPEECH COMMUN, V44, P141, DOI 10.1016/j.specom.2004.10.006 Frokjaer E, 2000, P ACM C HUM FACT COM, P345 Gong L, 2008, COMPUT HUM BEHAV, V24, P2074, DOI 10.1016/j.chb.2007.09.008 Gong L, 2007, HUM COMMUN RES, V33, P163, DOI 10.1111/j.1468-2958.2007.00295.x HORN JL, 1965, PSYCHOMETRIKA, V30, P179, DOI 10.1007/BF02289447 Hutcheson G. D., 1999, MULTIVARIATE SOCIAL KING WJ, 1996, P C HUM FACT COMP SY Kipp M, 2004, THESIS BOCA RATON Koda T., 1996, Proceedings. 5th IEEE International Workshop on Robot and Human Communication RO-MAN'96 Tsukuba (Cat. 
No.96TH8179), DOI 10.1109/ROMAN.1996.568812 Kramer N.C., 2002, VIRTUELLE REALITATEN, P203 Kramer N.C., 2008, SOZIALE WIRKUNGEN VI KUHNEL C, 2008, P INT C MULT INT ICM Lester J. C., 1997, P 8 WORLD C ART INT, P23 MASSARO DW, 2000, EMBODIED CONVERSATIO, P286 MCBREEN HM, 2000, P AAAI FALL S SOC IN, P122 MOLLER S, 2003, P 8 EUR C SPEECH COM, V3, P1953 NASS C, 1999, P INT C AUD VIS SPEE Nass C, 1997, J APPL SOC PSYCHOL, V27, P864, DOI 10.1111/j.1559-1816.1997.tb00275.x Nowak K. L., 2004, J COMPUTER MEDIATED, V9 Nowak K. L., 2005, J COMPUTER MEDIATED, V11 Pandzic IS, 1999, VISUAL COMPUT, V15, P330, DOI 10.1007/s003710050182 Pelachaud C., 2004, BROWS TRUST EVALUATI Pollick F. E., 2010, P SIGCHI C HUM FACT, V40, P69, DOI 10.1145/1240624.1240626 PRENDINGER H, 2004, KI ZEITSCHRIFT GERMA, V1, P4 Schroder M., 2003, International Journal of Speech Technology, V6, DOI 10.1023/A:1025708916924 Sproull L, 1996, HUM-COMPUT INTERACT, V11, P97, DOI 10.1207/s15327051hci1102_1 Takeuchi A., 1995, Human Factors in Computing Systems. CHI'95 Conference Proceedings THEOBALD B, 2008, P INT 2008 BRISB AUS, P2310 VANMULKEN S, 1998, P HCI PEOPL COMP Walker MA, 1998, COMPUT SPEECH LANG, V12, P317, DOI 10.1006/csla.1998.0110 WEISS B, 2009, P HUM COMP INT INT H, P349 XIAO J, 2002, EMBODIED CONVERSATIO Xiao J., 2006, THESIS GEORGIA I TEC Zimmerman J, 2005, P C DES PLEAS PROD I, P233 NR 49 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2010 VL 52 IS 6 SI SI BP 481 EP 492 DI 10.1016/j.specom.2010.02.011 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 604MB UT WOS:000278282000002 ER PT J AU Badin, P Tarabalka, Y Elisei, F Bailly, G AF Badin, Pierre Tarabalka, Yuliya Elisei, Frederic Bailly, Gerard TI Can you 'read' tongue movements? Evaluation of the contribution of tongue display to speech understanding SO SPEECH COMMUNICATION LA English DT Article DE Lip reading; Tongue reading; Audiovisual speech perception; Virtual audiovisual talking head; Augmented speech; ElectroMagnetic Articulography (EMA) ID PERCEPTION; FRENCH; MODELS AB Lip reading relies on visible articulators to ease speech understanding. However, lips and face alone provide very incomplete phonetic information: the tongue, that is generally not entirely seen, carries an important part of the articulatory information not accessible through lip reading. The question is thus whether the direct and full vision of the tongue allows tongue reading. We have therefore generated a set of audiovisual VCV stimuli with an audiovisual talking head that can display all speech articulators, including tongue, in an augmented speech mode. The talking head is a virtual clone of a human speaker and the articulatory movements have also been captured on this speaker using ElectroMagnetic Articulography (EMA). These stimuli have been played to subjects in audiovisual perception tests in various presentation conditions (audio signal alone, audiovisual signal with profile cutaway display with or without tongue, complete face), at various Signal-to-Noise Ratios. 
The results indicate: (1) the possibility of implicit learning of tongue reading, (2) better consonant identification with the cutaway presentation with the tongue than without the tongue, (3) no significant difference between the cutaway presentation with the tongue and the more ecological rendering of the complete face, (4) a predominance of lip reading over tongue reading, but (5) a certain natural human capability for tongue reading when the audio signal is strongly degraded or absent. We conclude that these tongue reading capabilities could be used for applications in the domains of speech therapy for speech retarded children, of perception and production rehabilitation of hearing impaired children, and of pronunciation training for second language learners. (C) 2010 Elsevier B.V. All rights reserved. C1 [Badin, Pierre] Grenoble Univ, GIPSA Lab, DPC, ICP,CNRS,ENSE3,UMR 5216, F-38402 St Martin Dheres, France. RP Badin, P (reprint author), Grenoble Univ, GIPSA Lab, DPC, ICP,CNRS,ENSE3,UMR 5216, 961 Rue Houille Blanche,BP 46, F-38402 St Martin Dheres, France. EM Pierre.Badin@gipsa-lab.grenoble-inp.fr; Yuliya.Tarabalka@gipsa-lab.grenoble-inp.fr; Frederic.E-lisei@gipsa-lab.grenoble.inp.fr; Gerard.Bailly@gipsa-lab.gre-noble-inp.fr CR Badin P., 2006, 7 INT SEM SPEECH PRO, P395 Badin P, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P2635 BALTER O, 2005, 7 INT ACM SIGACCESS, P36 Benoit C, 1998, SPEECH COMMUN, V26, P117, DOI 10.1016/S0167-6393(98)00045-4 Benoit C., 1996, SPEECHREADING HUMANS, P315 CATHIARD MA, 1996, SPEECHREADING HUMANS, P211 CORNETT RO, 1967, AM ANN DEAF, V112, P3 Engwall O, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P2631 ERBER NP, 1975, J SPEECH HEAR DISORD, V40, P481 Fagel S, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P2643 GRAUWINKEL K, 2007, INTERSPEECH, P706 Hoole P., 1997, FORSCHUNGSBERICHTE I, V35, P177 IJSSELDIJK FJ, 1992, J SPEECH HEAR RES, V35, P466 BENOI C, 1994, J SPEECH HEAR RES, V37, P1195 Kroger BJ, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P2639 Massaro D., 2003, EUR 2003 GEN SWITZ, P2249 Massaro DW, 2004, J SPEECH LANG HEAR R, V47, P304, DOI 10.1044/1092-4388(2004/025) Massaro DW, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P2623 Mills A. E., 1987, HEARING EYE PSYCHOL, P145 MONTGOMERY D, 1981, PSYCHOL RES, V43 Mulford R., 1988, EMERGENT LEXICON CHI, P293 Narayanan S, 2004, J ACOUST SOC AM, V115, P1771, DOI 10.1121/1.1652588 Odisio M, 2004, SPEECH COMMUN, V44, P63, DOI 10.1016/j.specom.2004.10.008 PERKELL JS, 1992, J ACOUST SOC AM, V92, P3078, DOI 10.1121/1.404204 Serrurier A, 2008, J ACOUST SOC AM, V123, P2335, DOI 10.1121/1.2875111 STOELGAMMON C, 1988, J SPEECH HEAR DISORD, V53, P302 SUMBY WH, 1954, J ACOUST SOC AM, V26, P212, DOI 10.1121/1.1907309 TYEMURRAY N, 1993, NCVS STATUS PROG REP, V4, P41 VIHMAN MM, 1985, LANGUAGE, V61, P397, DOI 10.2307/414151 Wik P, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P2627 NR 30 TC 8 Z9 10 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD JUN PY 2010 VL 52 IS 6 SI SI BP 493 EP 503 DI 10.1016/j.specom.2010.03.002 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 604MB UT WOS:000278282000003 ER PT J AU Heracleous, P Beautemps, D Aboutabit, N AF Heracleous, Panikos Beautemps, Denis Aboutabit, Noureddine TI Cued Speech automatic recognition in normal-hearing and deaf subjects SO SPEECH COMMUNICATION LA English DT Article DE French Cued Speech; Hidden Markov models; Automatic recognition; Feature fusion; Multi-stream HMM decision fusion ID LANGUAGE AB This article discusses the automatic recognition of Cued Speech in French based on hidden Markov models (HMMs). Cued Speech is a visual mode which, by using hand shapes in different positions and in combination with lip patterns of speech, makes all the sounds of a spoken language clearly understandable to deaf people. The aim of Cued Speech is to overcome the problems of lipreading and thus enable deaf children and adults to understand spoken language completely. In the current study, the authors demonstrate that visible gestures are as discriminant as audible orofacial gestures. Phoneme recognition and isolated word recognition experiments have been conducted using data from a normal-hearing cuer. The results obtained were very promising, and the study has been extended by applying the proposed methods to a deaf cuer. The achieved results have not shown any significant differences compared to automatic Cued Speech recognition in a normal-hearing subject. In automatic recognition of Cued Speech, lip shape and gesture recognition are required. Moreover, the integration of the two modalities is of great importance. In this study, lip shape component is fused with hand component to realize Cued Speech recognition. Using concatenative feature fusion and multi-stream HMM decision fusion, vowel recognition, consonant recognition, and isolated word recognition experiments have been conducted. For vowel recognition, an 87.6% vowel accuracy was obtained showing a 61.3% relative improvement compared to the sole use of lip shape parameters. In the case of consonant recognition, a 78.9% accuracy was obtained showing a 56% relative improvement compared to the use of lip shape only. In addition to vowel and consonant recognition, a complete phoneme recognition experiment using concatenated feature vectors and Gaussian mixture model (GMM) discrimination was conducted, obtaining a 74.4% phoneme accuracy. Isolated word recognition experiments in both normal-hearing and deaf subjects were also conducted providing a word accuracy of 94.9% and 89%, respectively. The obtained results were compared with those obtained using audio signal, and comparable accuracies were observed. (C) 2010 Elsevier B.V. All rights reserved. C1 [Heracleous, Panikos] ATR, Intelligent Robot & Commun Labs, Kyoto 6190288, Japan. [Heracleous, Panikos; Beautemps, Denis; Aboutabit, Noureddine] Univ Grenoble 3, Speech & Cognit Dept, GIPSA Lab, CNRS,UJF,INPG,UMR 5216, F-38402 St Martin Dheres, France. RP Heracleous, P (reprint author), ATR, Intelligent Robot & Commun Labs, 2-2-2 Hikaridai Seika Cho, Kyoto 6190288, Japan. EM panikos@atr.jp FU ANR FX The authors would like to thank the volunteer cuers Sabine Chevalier, Myriam Diboui, and Clementine Huriez for their time spending on Cued Speech data recording, and also for accepting the recording constraints. 
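The multi-stream decision fusion mentioned in the abstract above can be pictured, in very reduced form, as a weighted combination of per-class log-likelihoods from the lip-shape and hand streams. The stream weight, class set and score values in this Python sketch are illustrative assumptions, not figures from the study.

    import numpy as np

    def fuse_streams(lip_loglik, hand_loglik, lip_weight=0.6):
        """Combine per-class log-likelihoods from two streams with exponent
        weights summing to one, then return the index of the best class."""
        lip = np.asarray(lip_loglik, dtype=float)
        hand = np.asarray(hand_loglik, dtype=float)
        combined = lip_weight * lip + (1.0 - lip_weight) * hand
        return int(np.argmax(combined)), combined

    # Hypothetical scores for three candidate vowels from two single-stream HMMs:
    lip_scores = [-210.0, -198.5, -205.2]
    hand_scores = [-180.3, -185.0, -179.9]
    print(fuse_streams(lip_scores, hand_scores, lip_weight=0.6))

    # Concatenative feature fusion, by contrast, would simply join the two
    # streams' feature vectors, e.g. np.concatenate([lip_feats, hand_feats]),
    # before a single recogniser is trained on the combined vector.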
Also the authors would like to thank Christophe Savariaux and Coriandre Vilain for their help in the Cued Speech material recording. This work was mainly performed at GIPSA-lab, Speech and Cognition Department and was supported by the TELMA project (ANR, 2005 edition). CR ABOUTABIT N, 2006, P ICASSP2006, P633 ABOUTABIT N, 2007, P INT C AUD VIS SPEE Aboutabit N., 2007, THESIS I NATL POLYTE Adjoudani A., 1996, SPEECHREADING HUMANS, P461 Auer ET, 2007, J SPEECH LANG HEAR R, V50, P1157, DOI 10.1044/1092-4388(2007/080) BERNSTEIN L, 2007, CUED SPEECH CUED LAN Bourlard H., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607145 CORNETT RO, 1967, AM ANN DEAF, V112, P3 Dreuw P, 2007, P INT, P2513 FLEETWOOD E, 1999, CUED LANGUAGE STRUCT Gibert G, 2005, J ACOUST SOC AM, V118, P1144, DOI 10.1121/1.1944587 Gillick L., 1989, P ICASSP, P532 Hennecke M. E., 1996, SPEECHREADING HUMANS, P331 Leybaert J, 2000, J EXP CHILD PSYCHOL, V75, P291, DOI 10.1006/jecp.1999.2539 MERKX P, 2005, FINAL PROJECT MATH S, V196, P1 MONTGOMERY AA, 1983, J ACOUST SOC AM, V73, P2134, DOI 10.1121/1.389537 Nakamura S., 2002, Proceedings Fourth IEEE International Conference on Multimodal Interfaces, DOI 10.1109/ICMI.2002.1167011 Nefian A. V., 2002, P ICASSP NICHOLLS GH, 1982, J SPEECH HEAR RES, V25, P262 Ong SCW, 2005, IEEE T PATTERN ANAL, V27, P873, DOI 10.1109/TPAMI.2005.112 Potamianos G, 2003, P IEEE, V91, P1306, DOI 10.1109/JPROC.2003.817150 UCHANSKI RM, 1994, J REHABIL RES DEV, V31, P20 Young S., 2001, HTK BOOK NR 23 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2010 VL 52 IS 6 SI SI BP 504 EP 512 DI 10.1016/j.specom.2010.03.001 PG 9 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 604MB UT WOS:000278282000004 ER PT J AU Troille, E Cathiard, MA Abry, C AF Troille, Emilie Cathiard, Marie-Agnes Abry, Christian TI Speech face perception is locked to anticipation in speech production SO SPEECH COMMUNICATION LA English DT Article DE Auditory-visual speech perception; Speech production; Anticipation AB At the beginning of the 90's, it was definitively demonstrated that as early as the visual speech information is perceivable, speech identification can be processed. Cathiard et al. [Cathiard, M.-A., Tiberghien, G., Tseva, A., Lallouache, M.-T., Escudier, P., 1991. Visual perception of anticipatory rounding during acoustic pauses: a cross-language study. In: Proceedings of the XIIth International Congress of Phonetic Sciences, Aix-en-Provence, France, 4, pp. 50-53] used different V-to-V anticipatory spans, with articulatory measurements, along silent pauses, in a perceptual gating paradigm, and established that up to 200 ms "speech can be seen before it is heard". These results were later framed into the framework of a general anticipatory control model, the Movement Expansion Model [Abry, C., Lallouache, M.-T., Cathiard, M.-A., 1996]. How can coarticulation models account for speech sensitivity to audio visual desynchronization? In: Stork, D., Hennecke, M. (Eds.), Speechreading by Humans and Machines, NATO ASI Series F: Computer, Vol. 150. Springer-Verlag, Berlin, Tokyo, pp. 247-255]. Surprisingly the timing of the vowel and consonant auditory and visual streams remained until now poorly understood within the typical CVCV span. 
A first preliminary test was published by Escudier, Benoit and Lallouache [Escudier, P., Benoit, C., Lallouache, M.-T., 1990. Identification visuelle de stimuli associes a l'opposition /i/-/y/: etude statique. Colloque de physique, supplement au n(o) 2, tome 51, ler Congres Francais d'Acoustique, C2-541-544]: this is the issue we took up again more than 10 years later. And for the first time we found that "speech can be heard before it is seen". The main purpose of the present contribution will be to bring new data in order to clear up apparent contradictions, essentially due to misconceptions of variability and lawfulness in speakers' behavior. (C) 2009 Elsevier B.V. All rights reserved. C1 [Troille, Emilie; Cathiard, Marie-Agnes; Abry, Christian] Univ Grenoble 3, CRI, EA 610, F-38040 Grenoble 9, France. [Troille, Emilie] Univ Grenoble 3, GIPSA Lab, CNRS, ICP,INPG,UMR 5216, F-38040 Grenoble 9, France. RP Cathiard, MA (reprint author), 10 Chemin Ruy, F-38690 Chabons, France. EM emilie.troille@gmail.com; marieagnes.cathiard@u-grenoble3.fr; chris.abry@orange.fr FU Region Rhone-Alpes, France FX We thank: Deborah Kowalski and Jean-Luc Schwartz, our speakers; Alain Arnal, Helene Loevenbruck, Solange Rossato and Christophe Savariaux for their technical assistance at the Institut de la Communication Parlee, Grenoble, France. This work was supported by a grant from Region Rhone-Alpes, France. CR ABRY C, 1995, P ICPHS STOCKH SUED, V4, P152 Abry C., 1995, B COMM PARLEE, V3, P85 Abry C, 1996, NATO ASI SERIES F, V150, P247 ABRY C, 1989, J PHONETICS, V17, P47 BENOIT C, 1986, J ACOUST SOC AM, V80, P1846, DOI 10.1121/1.394302 BENOIT C, 1986, P 12 INT C AC TOR Byrd D, 2003, J PHONETICS, V31, P149, DOI 10.1016/S0095-4470(02)00085-2 CATHIARD MA, 1996, NATO ASI SERIES F, V150, P211 CATHIARD MA, 2007, SPEC SESS AUD SPEECH, P291 CATHIARD MA, 1991, P 12 INT C PHON SCI, V4, P50 CATHIARD MA, 1994, THESIS GRENOBLE ESCUDIER P, 1990, C PHYS C FRANC AC S2, V51 Evans N, 2009, BEHAV BRAIN SCI, V32, P429, DOI 10.1017/S0140525X0999094X Farnetani E., 1999, COARTICULATION THEOR, P144 Finney D. J., 1971, PROBIT ANAL, V3rd GAITENBY J, 1965, SR2 HASK LAB, P1 Ghazanfar AA, 2005, J NEUROSCI, V25, P5004, DOI 10.1523/JNEUROSCI.0799-05.2005 GROSJEAN F, 1980, PERCEPT PSYCHOPHYS, V28, P267, DOI 10.3758/BF03204386 Grosjean Francois, 1997, GUIDE SPOKEN WORD RE, P597 Lallouache M. T., 1991, THESIS ENSERG GRENOB Munson B, 2004, J SPEECH LANG HEAR R, V47, P58, DOI [10.1044/1092-4388(2004/006), 10.1044/1092-4388(2204/006)] NITTROUER S, 1988, J ACOUST SOC AM, V84, P1653, DOI 10.1121/1.397180 Noiray A, 2008, EMERGENCE LANGUAGE A, P100 NOIRAY A, 2006, P 7 INT SEM SPEECH P, P319 SCHWARTZ JL, 1993, J PHONETICS, V21, P411 SMEELE PMT, 1994, P INT C SPOK LANG PR, V3, P1431 SMEELE PMT, 1994, THESIS TU DELFT DELF Summerfield Q, 1987, HEARING EYE PSYCHOL, P3 Troille E., 2007, P INT C AUD VIS SPEE, P281 TULLER B, 1984, J ACOUST SOC AM, V76, P1030, DOI 10.1121/1.391421 WHALEN DH, 1984, PERCEPT PSYCHOPHYS, V35, P49, DOI 10.3758/BF03205924 NR 31 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD JUN PY 2010 VL 52 IS 6 SI SI BP 513 EP 524 DI 10.1016/j.specom.2009.12.005 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 604MB UT WOS:000278282000005 ER PT J AU Fort, M Spinelli, E Savariaux, C Kandel, S AF Fort, Mathilde Spinelli, Elsa Savariaux, Christophe Kandel, Sonia TI The word superiority effect in audiovisual speech perception SO SPEECH COMMUNICATION LA English DT Article DE Audiovisual speech; Lexical access; Speech perception in noise; Word recognition ID RECOGNITION; FRENCH; IDENTIFICATION; RESTORATION; CONTEXT AB Seeing the facial gestures of a speaker enhances phonemic identification in noise. The goal of this study was to assess whether the visual information regarding consonant articulation activates lexical representations. We conducted a phoneme monitoring task with word and pseudo-words in audio only (A) and audiovisual (AV) contexts with two levels of white noise masking the acoustic signal. The results confirmed that visual information enhances consonant detection in noisy conditions and also revealed that it accelerates the phoneme detection process. The consonants were detected faster in AV than in A only condition. Furthermore, when the acoustic signal was deteriorated, the consonant phonemes were better recognized when they were embedded in words rather than in pseudo-words in the AV condition. This provides evidence indicating that visual information on phoneme identity can contribute to lexical activation processes during word recognition. (C) 2010 Elsevier B.V. All rights reserved. C1 [Fort, Mathilde; Spinelli, Elsa; Kandel, Sonia] Univ Pierre Mendes France, Lab Psychol & NeuroCognit, CNRS, UMR 5105, F-38040 Grenoble 9, France. [Spinelli, Elsa; Kandel, Sonia] Inst Univ France, F-75005 Paris, France. [Savariaux, Christophe] Univ Grenoble 3, GIPSA Lab, Dpt Parole & Cognit, CNRS,UMR 5216, F-38040 Grenoble 9, France. RP Fort, M (reprint author), Univ Pierre Mendes France, Lab Psychol & NeuroCognit, CNRS, UMR 5105, BP 47, F-38040 Grenoble 9, France. EM mathilde.fort@upmf-grenoble.fr; elsa.spinelli@upmf-grenoble.fr; christophe.savariaux@gipsa-lab.inpg.fr; sonia.kandel@upmf-grenoble.fr CR AMANO J, 1998, P AUD VIS SPEECH PRO, P43 Barutchu A, 2008, EUR J COGN PSYCHOL, V20, P1, DOI 10.1080/09541440601125623 Brancazio L, 2004, J EXP PSYCHOL HUMAN, V30, P445, DOI 10.1037/0096-1523.30.3.445 Buchwald AB, 2009, LANG COGNITIVE PROC, V24, P580, DOI 10.1080/01690960802536357 COLIN C, 2003, ANN PSYCHOL, V104, P497 Connine CM, 1996, LANG COGNITIVE PROC, V11, P635 CUTLER A, 1987, COGNITIVE PSYCHOL, V19, P141, DOI 10.1016/0010-0285(87)90010-7 ERBER NP, 1969, J SPEECH HEAR RES, V12, P423 FRAUENFELDER UH, 1990, J EXP PSYCHOL HUMAN, V16, P77, DOI 10.1037/0096-1523.16.1.77 GANONG WF, 1980, J EXP PSYCHOL HUMAN, V6, P110, DOI 10.1037/0096-1523.6.1.110 Gow DW, 2003, PERCEPT PSYCHOPHYS, V65, P575, DOI 10.3758/BF03194584 Green K. 
P., 1998, ADV PSYCHOL SPEECHRE, P3 BENOI C, 1994, J SPEECH HEAR RES, V37, P1195 Kim J, 2004, COGNITION, V93, pB39, DOI 10.1016/j.cognition.2003.11.003 LOCASTO PC, 2007, LANG SPEECH, V50, P54 MARSLENWILSON W, 1990, ACL MIT NAT, P148 MCCLELLAND JL, 1986, COGNITIVE PSYCHOL, V18, P1, DOI 10.1016/0010-0285(86)90015-0 MCGURK H, 1976, NATURE, V264, P746, DOI 10.1038/264746a0 MILLER GA, 1955, J ACOUST SOC AM, V27, P338, DOI 10.1121/1.1907526 New B, 2001, ANN PSYCHOL, V101, P447 Norris D, 2000, BEHAV BRAIN SCI, V23, P299, DOI 10.1017/S0140525X00003241 Robert-Ribes J, 1998, J ACOUST SOC AM, V103, P3677, DOI 10.1121/1.423069 Sams M, 1998, SPEECH COMMUN, V26, P75, DOI 10.1016/S0167-6393(98)00051-X SAMUEL AG, 1981, J EXP PSYCHOL GEN, V110, P474, DOI 10.1037/0096-3445.110.4.474 Spinelli E., 2005, PSYCHOL LANGAGE ECRI SUMBY WH, 1954, J ACOUST SOC AM, V26, P212, DOI 10.1121/1.1907309 Summerfield Q, 1987, HEARING EYE PSYCHOL, P3 Tiippana K, 2004, EUR J COGN PSYCHOL, V16, P457, DOI 10.1080/09541440340000268 WARREN RM, 1970, SCIENCE, V167, P392, DOI 10.1126/science.167.3917.392 NR 29 TC 7 Z9 7 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2010 VL 52 IS 6 SI SI BP 525 EP 532 DI 10.1016/j.specom.2010.02.005 PG 8 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 604MB UT WOS:000278282000006 ER PT J AU Sato, M Buccino, G Gentilucci, M Cattaneo, L AF Sato, Marc Buccino, Giovanni Gentilucci, Maurizio Cattaneo, Luigi TI On the tip of the tongue: Modulation of the primary motor cortex during audiovisual speech perception SO SPEECH COMMUNICATION LA English DT Article DE Audiovisual speech perception; Transcranial magnetic stimulation; Motor system; Mirror-neuron system; Motor theory of speech perception; McGurk effect ID TRANSCRANIAL MAGNETIC STIMULATION; CORTICO-HYPOGLOSSAL PROJECTIONS; HUMAN AUDITORY-CORTEX; HUMAN BRAIN-STEM; VISUAL SPEECH; LINGUAL MUSCLES; SEEING VOICES; HEARING LIPS; BROCAS AREA; EXCITABILITY AB Recent neurophysiological studies show that cortical brain regions involved in the planning and execution of speech gestures are also activated in processing speech sounds. These findings suggest that speech perception is in part mediated by reference to the motor actions afforded in the speech signal. Since interactions between auditory and visual modalities are beneficial in speech perception and face-to-face communication, we used single-pulse transcranial magnetic stimulation (TMS) to investigate whether audiovisual speech perception might induce excitability changes in the left tongue-related primary motor cortex and whether acoustic and visual speech inputs might differentially modulate motor excitability. To this aim, motor-evoked potentials obtained with focal TMS applied over the left tongue primary motor cortex were recorded from participants' tongue muscles during the perception of matching and conflicting audiovisual syllables incorporating tongue- and/or lip-related phonemes (i.e. visual and acoustic /ba/, /ga/ and /da/, visual /ba/ and acoustic /ga/, visual /ga/ and acoustic /ba/). Compared to the presentation of congruent /ba/ syllable, which primarily involves lip movements when pronounced, exposure to syllables incorporating visual and/or acoustic tongue-related phonemes induced a greater excitability of the left tongue primary motor cortex as early as 100-200 ms after the consonantal onset of the acoustically presented syllable. 
These results provide evidence that both visual and auditory modalities specifically modulate activity in the tongue primary motor cortex at an early stage during audiovisual speech perception. Because no interaction between the two modalities was observed, these results suggest that information from each sensory channel is recoded separately in the primary motor cortex at that point of time. These findings are discussed in relation to theories assuming a link between perception and action in the human speech processing system and theoretical models of audiovisual interaction. (C) 2009 Elsevier B.V. All rights reserved. C1 [Sato, Marc] CNRS, GIPSA Lab, UMR 5216, Dept Parole & Cognit, F-38040 Grenoble 9, France. [Sato, Marc] Grenoble Univ, F-38040 Grenoble 9, France. [Buccino, Giovanni] Magna Graecia Univ Catanzaro, Dept Med Sci, Catanzaro, Italy. [Gentilucci, Maurizio] Univ Parma, Dept Neurosci, Sez Fisiol, I-43100 Parma, Italy. [Cattaneo, Luigi] Univ Trent, Ctr Mind Brain Sci, CIMeC, I-38100 Trento, Italy. RP Sato, M (reprint author), CNRS, GIPSA Lab, UMR 5216, Dept Parole & Cognit, 1180 Ave Cent,BP 25, F-38040 Grenoble 9, France. EM marc.sato@gipsa-lab.inpg.fr FU MIUR (Ministero Italiano dell'Istruzione, dell'Universita e della Ricerca); CNRS (Centre National de la Recherche Scientifique) FX We wish to thank Elena Borra and Helene Loevenbruck for their help with this study. This research was supported by MIUR (Ministero Italiano dell'Istruzione, dell'Universita e della Ricerca) and CNRS (Centre National de la Recherche Scientifique). CR Arbib MA, 2005, BEHAV BRAIN SCI, V28, P105, DOI 10.1017/S0140525X05000038 Besle J, 2004, EUR J NEUROSCI, V20, P2225, DOI 10.1111/j.1460-9568.2004.03670.x Brancazio L, 2005, PERCEPT PSYCHOPHYS, V67, P759, DOI 10.3758/BF03193531 Callan DE, 2004, J COGNITIVE NEUROSCI, V16, P805, DOI 10.1162/089892904970771 Callan DE, 2003, NEUROREPORT, V14, P2213, DOI 10.1097/00001756-200312020-00016 Calvert GA, 2000, CURR BIOL, V10, P649, DOI 10.1016/S0960-9822(00)00513-3 Calvert GA, 2003, J COGNITIVE NEUROSCI, V15, P57, DOI 10.1162/089892903321107828 Chen CH, 1999, NEUROLOGY, V52, P411 D'Ausillo A, 2009, CURR BIOL, V19, P381, DOI 10.1016/j.cub.2009.01.017 Fadiga L, 2005, CURR OPIN NEUROBIOL, V15, P213, DOI 10.1016/j.conb.2005.03.013 Fadiga L, 2002, EUR J NEUROSCI, V15, P399, DOI 10.1046/j.0953-816x.2001.01874.x Galantucci B, 2006, PSYCHON B REV, V13, P361, DOI 10.3758/BF03193857 Gentilucci M, 2006, NEUROSCI BIOBEHAV R, V30, P949, DOI 10.1016/j.neubiorev.2006.02.004 Gentilucci M, 2005, EXP BRAIN RES, V167, P66, DOI 10.1007/s00221-005-0008-z Grant KW, 1998, J ACOUST SOC AM, V103, P2677, DOI 10.1121/1.422788 Green K P, 1998, HEARING EYE, P3 Hertrich I, 2007, NEUROPSYCHOLOGIA, V45, P1342, DOI 10.1016/j.neuropsychologia.2006.09.019 Jones JA, 2003, NEUROREPORT, V14, P1129, DOI 10.1097/01.wnr.0000074343.81633.2a BENOI C, 1994, J SPEECH HEAR RES, V37, P1195 Klucharev Vasily, 2003, Brain Res Cogn Brain Res, V18, P65 Liberman A. M., 2000, TRENDS COGN SCI, V3, P254 LIBERMAN AM, 1985, COGNITION, V21, P1, DOI 10.1016/0010-0277(85)90021-6 LIBERMAN AM, 1967, PSYCHOL REV, V74, P431, DOI 10.1037/h0020279 MACLEOD A, 1987, British Journal of Audiology, V21, P131, DOI 10.3109/03005368709077786 Massaro D. 
W., 1998, PERCEIVING TALKING F MCGURK H, 1976, NATURE, V264, P746, DOI 10.1038/264746a0 Meister IG, 2007, CURR BIOL, V17, P1692, DOI 10.1016/j.cub.2007.08.064 Mottonen R, 2004, NEUROSCI LETT, V363, P112, DOI 10.1016/j.neulet.2004.03.076 Möttönen Riikka, 2002, Brain Res Cogn Brain Res, V13, P417 MOTTONEN R, 2004, THESIS HELSINKI U TE Muellbacher W, 1997, BRAIN, V120, P1909, DOI 10.1093/brain/120.10.1909 MUELLBACHER W, 1994, J NEUROL NEUROSUR PS, V57, P309, DOI 10.1136/jnnp.57.3.309 Nishitani N, 2002, NEURON, V36, P1211, DOI 10.1016/S0896-6273(02)01089-9 Ojanen V, 2005, NEUROIMAGE, V25, P333, DOI 10.1016/j.neuroimage.2004.12.001 OJANEN V, 2005, THESIS HELSINKI U TE OLDFIELD RC, 1971, NEUROPSYCHOLOGIA, V9, P97, DOI 10.1016/0028-3932(71)90067-4 Paulesu E, 2003, J NEUROPHYSIOL, V90, P2005, DOI 10.1152/jn.00926.2002 Pekkola J, 2006, NEUROIMAGE, V29, P797, DOI 10.1016/j.neuroimage.2005.09.069 Pulvermuller F, 2006, P NATL ACAD SCI USA, V103, P7865, DOI 10.1073/pnas.0509989103 Reisberg D., 1987, HEARING EYE PSYCHOL, P97 Rizzolatti G, 2004, ANNU REV NEUROSCI, V27, P169, DOI 10.1146/annurev.neuro.27.070203.144230 Rizzolatti G, 1998, TRENDS NEUROSCI, V21, P188, DOI 10.1016/S0166-2236(98)01260-0 Rodel RMW, 2003, ANN OTO RHINOL LARYN, V112, P71 ROSSINI PM, 1994, ELECTROEN CLIN NEURO, V91, P79, DOI 10.1016/0013-4694(94)90029-9 Roy AC, 2008, J PHYSIOLOGY-PARIS, V102, P101, DOI 10.1016/j.jphysparis.2008.03.006 SAMS M, 1991, NEUROSCI LETT, V127, P141, DOI 10.1016/0304-3940(91)90914-F Sato M, 2009, BRAIN LANG, V111, P1, DOI 10.1016/j.bandl.2009.03.002 SCHWARTZ JL, 1998, ADV PSYCHOL SPEECHRE, P85 Schwartz JL, 2008, REV FR LING APPL, V13, P9 Sekiyama K, 2003, NEUROSCI RES, V47, P277, DOI 10.1016/S0168-0102(03)00214-1 Skipper JI, 2007, CEREB CORTEX, V17, P2387, DOI 10.1093/cercor/bhl147 Skipper JI, 2005, NEUROIMAGE, V25, P76, DOI 10.1016/j.neuroimage.2004.11.006 SUMBY WH, 1954, J ACOUST SOC AM, V26, P212, DOI 10.1121/1.1907309 Sundara M, 2001, NEUROREPORT, V12, P1341, DOI 10.1097/00001756-200105250-00010 Urban PP, 1996, BRAIN, V119, P1031, DOI 10.1093/brain/119.3.1031 van Wassenhove V, 2005, P NATL ACAD SCI USA, V102, P1181, DOI 10.1073/pnas.0408949102 Wassermann EM, 1998, EVOKED POTENTIAL, V108, P1, DOI 10.1016/S0168-5597(97)00096-8 Watkins K, 2004, J COGNITIVE NEUROSCI, V16, P978, DOI 10.1162/0898929041502616 Watkins KE, 2003, NEUROPSYCHOLOGIA, V41, P989, DOI 10.1016/S0028-3932(02)00316-0 Wilson SM, 2004, NAT NEUROSCI, V7, P701, DOI 10.1038/nn1263 Wilson SM, 2006, NEUROIMAGE, V33, P316, DOI 10.1016/j.neuroimage.2006.05.032 NR 61 TC 11 Z9 11 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2010 VL 52 IS 6 SI SI BP 533 EP 541 DI 10.1016/j.specom.2009.12.004 PG 9 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 604MB UT WOS:000278282000007 ER PT J AU Flecha-Garcia, ML AF Flecha-Garcia, Maria L. TI Eyebrow raises in dialogue and their relation to discourse structure, utterance function and pitch accents in English SO SPEECH COMMUNICATION LA English DT Article DE Dialogue structure; Pitch accent; Facial movement; Non-verbal communication; Multimodality ID PROSODIC PROMINENCE; SPEECH AB Face-to-face interaction involves both verbal and non-verbal communication. Studies have suggested a relationship between eyebrow raises and the verbal message, but our knowledge is still limited. 
If we could characterise a relation between eyebrow raises and the linguistic signal we could better understand and reproduce human multimodal communication behaviour. Based on previous observations on body movement, this research investigated eyebrow raising in face-to-face dialogue in English in connection with (1) discourse structure and utterance function and (2) pitch accents. Small but significant results partially supported the predictions, suggesting a link between eyebrow raising and spoken language. Eyebrow raises occurred more frequently at the start of high-level discourse segments than anywhere else in the dialogue, and more frequently in instructions than in requests for or acknowledgements of information. Interestingly, contrary to the hypothesis queries did not have more raises than any other type of utterance. Additionally, as predicted, eyebrow raises seemed to be aligned with pitch accents, preceding them by an average of 0.06 s. Possible linguistic functions are proposed, namely the structuring and emphasising of information in the verbal message. Finally, methodological issues and practical applications are briefly discussed. (C) 2009 Elsevier B.V. All rights reserved. C1 Univ Edinburgh, Sch Philosophy Psychol & Language Sci, Edinburgh EH8 9AD, Midlothian, Scotland. RP Flecha-Garcia, ML (reprint author), Univ Edinburgh, Sch Philosophy Psychol & Language Sci, Dugald Stewart Bldg,3 Charles St, Edinburgh EH8 9AD, Midlothian, Scotland. EM marisaflecha@gmail.com FU EPSRC FX I am very grateful to Dr. Ellen G. Bard and Prof. D. Robert Ladd for their valuable help in the supervision of this project. I also thank Dr. Holly Brannigan and Dr. Marc Swerts for their constructive comments. This project was partially funded by EPSRC. CR ANDERSON AH, 1991, LANG SPEECH, V34, P351 Bolinger D., 1986, INTONATION ITS PARTS Carletta J, 1997, COMPUT LINGUIST, V23, P13 Cassell J., 2001, P 41 ANN M ASS COMP, P106 Cave C., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607235 Cave C, 2002, P 7 INT C SPOK LANG, P2353 Chovil N., 1991, RES LANG SOC INTERAC, V25, P163 De Ruiter J. P., 2000, LANGUAGE GESTURE, P284, DOI DOI 10.1017/CBO9780511620850.018 Ekman P., 1979, HUMAN ETHOLOGY, P169 FLECHAGARCIA ML, 2006, THESIS U EDINBURGH E Granstrom B, 2005, SPEECH COMMUN, V46, P473, DOI 10.1016/j.specom.2005.02.017 Keating P., 2003, P 16 INT C PHON SCI, P2071 Kendon A., 1980, RELATIONSHIP VERBAL, P207 Kendon Adam, 1972, STUDIES DYADIC COMMU, P177 Krahmer E, 2007, J MEM LANG, V57, P396, DOI 10.1016/j.jml.2007.06.005 KRAHMER E, 2004, BROWS TRUST EVALUATI Ladd D. R., 1996, INTONATIONAL PHONOLO Ladd DR, 2003, J PHONETICS, V31, P81, DOI 10.1016/S0095-4470(02)00073-6 McClave E, 1998, J PSYCHOLINGUIST RES, V27, P69, DOI 10.1023/A:1023274823974 McClave EZ, 2000, J PRAGMATICS, V32, P855, DOI 10.1016/S0378-2166(99)00079-X McNeill D., 1992, HAND MIND WHAT GESTU McNeill D., 2001, GESTURE, V1, P9, DOI DOI 10.1075/GEST.1.1.03MCN) Neter J, 1996, APPL LINEAR STAT MOD Pitrelli J., 1994, P 3 INT C SPOK LANG, P123 Silverman K., 1992, P INT C SPOK LANG PR, P867 Srinivasan RJ, 2003, LANG SPEECH, V46, P1 Swerts M, 2008, J PHONETICS, V36, P219, DOI 10.1016/j.wocn.2007.05.001 NR 27 TC 10 Z9 10 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD JUN PY 2010 VL 52 IS 6 SI SI BP 542 EP 554 DI 10.1016/j.specom.2009.12.003 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 604MB UT WOS:000278282000008 ER PT J AU Cvejic, E Kim, J Davis, C AF Cvejic, Erin Kim, Jeesun Davis, Chris TI Prosody off the top of the head: Prosodic contrasts can be discriminated by head motion SO SPEECH COMMUNICATION LA English DT Article DE Prosody; Suprasegmental; Visual speech; Rigid head motion; Upper head ID AUDIOVISUAL SPEECH-PERCEPTION; VISUAL-PERCEPTION; INTONATION; STATEMENTS; QUESTIONS; ENGLISH; STRESS; FOCUS AB The current study investigated people's ability to discriminate prosody related head and face motion from videos showing only the upper face of the speaker saying the same sentence with different prosody. The first two experiments used a visual-visual matching task. These videos were either fully textured (Experiment 1) or showed only the outline of the speaker's head (Experiment 2). Participants were presented with two stimulus pairs of silent videos, with their task to select the pair that had the same prosody. The overall results of the visual-visual matching experiments showed that people could discriminate same- from different-prosody sentences with a high degree of accuracy. Similar levels of discrimination performance were obtained for the fully textured (containing rigid and non-rigid motions) and the outline only (rigid motion only) videos. Good visual-visual matching performance shows that people are sensitive to the underlying factor that determined whether the movements were the same or not, i.e., the production of prosody. However, testing auditory-visual matching provides a more direct test concerning people's sensitivity to how head motion/face motion relates to spoken prosody. Experiments 3 (with fully textured videos) and 4 (with outline only videos) employed a cross-modal matching task that required participants to match auditory with visual tokens that had the same prosody. As with the previous experiments, participants performed this discrimination very well. Similarly, no decline in performance was observed for the outline only videos. This result supports the proposal that rigid head motion provides an important visual cue to prosody. (C) 2010 Elsevier B.V. All rights reserved. C1 [Cvejic, Erin; Kim, Jeesun; Davis, Chris] Univ Western Sydney, MARCS Auditory Labs, Penrith, NSW, Australia. RP Cvejic, E (reprint author), Univ Western Sydney, MARCS Auditory Labs, Bldg 5,Bankstown Campus,Locked Bag 1797, Penrith, NSW, Australia. EM e.cvejic@uws.edu.au; j.kim@uws.edu.au; chris.davis@uws.edu.au FU School of Psychology, University of Western Sydney and MARCS Auditory Laboratories; Australian Research Council [DP0666857, TS0669874] FX The authors wish to thank Bronson Harry for his patient assistance with the recording of audio-visual stimuli, and two anonymous reviewers and the guest editors for their helpful suggestions to improve the manuscript. The first author also wishes to acknowledge the support of the School of Psychology, University of Western Sydney and MARCS Auditory Laboratories for providing generous financial support. The second and third authors acknowledge support from Australian Research Council (DP0666857 & TS0669874). 
CR Benoit C, 1998, SPEECH COMMUN, V26, P117, DOI 10.1016/S0167-6393(98)00045-4 BERNSTEIN LE, 1989, J ACOUST SOC AM, V85, P397, DOI 10.1121/1.397690 Boersma P., 2008, PRAAT DOING PHONETIC BOLINGER D, 1972, LANGUAGE, V48, P633, DOI 10.2307/412039 Bolinger D., 1989, INTONATION ITS USES BURNHAM D, 2007, INTERSPEECH, P698 BURNHAM D, 2007, INTERSPEECH, P701 Cave C, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P2175 Cutler A, 1997, LANG SPEECH, V40, P141 Davis C, 2006, COGNITION, V100, pB21, DOI 10.1016/j.cognition.2005.09.002 Davis C, 2004, Q J EXP PSYCHOL-A, V57, P1103, DOI 10.1080/02724980343000701 DOHEN M, 2005, INTERSPEECH 2005, P2413 DOHEN M, 2005, INTERSPEECH, P2416 DOHEN M, 2006, SPEECH PROSODY, P224 DOHEN M, 2006, SPEECH PROSODY 2006, P221 Dohen M, 2004, SPEECH COMMUN, V44, P155, DOI 10.1016/j.specom.2004.10.009 EADY SJ, 1986, J ACOUST SOC AM, V80, P402, DOI 10.1121/1.394091 Erickson D, 1998, LANG SPEECH, V41, P399 FLECHAGARCIA ML, 2006, THESIS U EDINBURGH Forster KI, 2003, BEHAV RES METH INS C, V35, P116, DOI 10.3758/BF03195503 GUA I, 2009, LANG SPEECH, V52, P207 IEEE, 1969, IEEE T AUDIO ELECTRO, VAE-17, P227 Ishi C. T., 2007, IEEE RSJ INT C INT R, P548 Krahmer E, 2001, SPEECH COMMUN, V34, P391, DOI 10.1016/S0167-6393(00)00058-3 Krahmer E, 2004, HUM COM INT, V7, P191 Lansing CR, 1999, J SPEECH LANG HEAR R, V42, P526 Lappin JS, 2009, J VISION, V9, DOI 10.1167/9.1.30 Lee HM, 2008, J HIGH ENERGY PHYS MCKEE SP, 1984, VISION RES, V24, P25, DOI 10.1016/0042-6989(84)90140-8 Nooteboom S., 1997, HDB PHONETIC SCI, P640 Pare M, 2003, PERCEPT PSYCHOPHYS, V65, P553, DOI 10.3758/BF03194582 Scarborough R, 2009, LANG SPEECH, V52, P135, DOI 10.1177/0023830909103165 Srinivasan RJ, 2003, LANG SPEECH, V46, P1 SUMBY WH, 1954, J ACOUST SOC AM, V26, P212, DOI 10.1121/1.1907309 SUMMERFIELD Q, 1992, PHILOS T ROY SOC B, V335, P71, DOI 10.1098/rstb.1992.0009 Swerts M, 2008, J PHONETICS, V36, P219, DOI 10.1016/j.wocn.2007.05.001 Vatikiotis-Bateson E, 1998, PERCEPT PSYCHOPHYS, V60, P926, DOI 10.3758/BF03211929 Yehia HC, 2002, J PHONETICS, V30, P555, DOI 10.1006/jpho.2002.0165 NR 38 TC 9 Z9 9 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2010 VL 52 IS 6 SI SI BP 555 EP 564 DI 10.1016/j.specom.2010.02.006 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 604MB UT WOS:000278282000009 ER PT J AU Colletta, JM Pellenq, C Guidetti, M AF Colletta, Jean-Marc Pellenq, Catherine Guidetti, Michele TI Age-related changes in co-speech gesture and narrative: Evidence from French children and adults SO SPEECH COMMUNICATION LA English DT Article DE Co-speech gestures; Narratives; Multimodality; French children and adults ID ICONIC GESTURES AB As children's language abilities develop, so may their use of co-speech gesture. We tested this hypothesis by studying oral narratives produced by French children and adults. One hundred and twenty-two participants, divided into three age groups (6 years old, 10 years old and adults), were asked to watch a Tom and Jerry cartoon and then tell the story to the experimenter. All narratives were videotaped, and subsequently transcribed and annotated for language and gesture using the ELAN software. The results showed a strong effect of age on language complexity, discourse construction and gesture. 
The age effect was only partly related to the length of the narratives, as adults produced shorter narratives than 10-year-olds. The study thus confirms that co-speech gestures develop with age in the context of narrative activity and play a crucial role in discourse cohesion and the framing of verbal utterances. This developmental shift towards more complex narratives through both words and gestures is discussed in terms of its theoretical implications in the study of gesture and discourse development. (C) 2010 Elsevier B.V. All rights reserved. C1 [Colletta, Jean-Marc] Univ Grenoble 3, Lab Lidilem, EA 609, F-38040 Grenoble 9, France. [Pellenq, Catherine] IUFM, F-38100 Grenoble, France. [Pellenq, Catherine] Univ Grenoble 1, Lab Sci Educ, EA 602, F-38100 Grenoble, France. [Guidetti, Michele] Univ Toulouse, UTM, Unite Rech Interdisciplinare Octogone, Lab Cognit Commun & Dev ECCD,EA 4156, F-31058 Toulouse 9, France. RP Colletta, JM (reprint author), Univ Grenoble 3, Lab Lidilem, EA 609, 1180 Ave Cent,BP 25, F-38040 Grenoble 9, France. EM jean-marc.colletta@u-grenoble3.fr; catherine.pellenq@ujf-grenoble.fr; guidetti@univ-tlse2.fr FU ANR (French National Research Agency) [0178-01] FX This research was supported by grant no. 0178-01 from the ANR (French National Research Agency) project entitled "L'acquisition et les troubles du langage au regard de la multimodalite de la communication parlee". We are grateful to Isabelle Rousset from Lidilem, and to all the children and adult students who took part in this study. CR Bamberg M., 1987, ACQUISITION NARRATIV Beattie G, 1999, SEMIOTICA, V123, P1, DOI 10.1515/semi.1999.123.1-2.1 Berman R. A., 1994, RELATING EVENTS NARR Bouvet Danielle, 2001, DIMENSION CORPORELLE Butcher C., 2000, LANGUAGE GESTURE, P235, DOI DOI 10.1017/CBO9780511620850.015 Calbris G., 2003, EXPRESSION GESTUELLE CAPIRCI O, 2002, ESSAYS HONOR WC STOK, P213 Capirci O, 2008, GESTURE, V8, P22, DOI 10.1075/gest.8.1.04cap Capirci O, 1996, J CHILD LANG, V23, P645 CAPIRCI O, 2007, P 3 INT SOC GEST STU Coirier P., 1996, PSYCHOLINGUISTIQUE T COLLETTA JM, 2009, MULTIMODAL CORPORA Colletta J.-M., 2004, DEV PAROLE CHEZ ENFA Colletta J.-M., 2009, EXPOSITORY DISCOURSE, P63 Colletta JM, 2009, GESTURE, V9, P61, DOI 10.1075/gest.9.1.03col De Ruiter J. 
P., 2000, LANGUAGE GESTURE, P284, DOI DOI 10.1017/CBO9780511620850.018 Diessel H., 2004, CAMBRIDGE STUDIES LI, V105 DUCEYKAUFMANN V, 2007, THESIS U STENDHAL GR Duncan S., 2000, LANGUAGE GESTURE, P141, DOI 10.1017/CBO9780511620850.010 Fayol M., 1997, IDEES TEXTE PSYCHOL FAYOL M, 1985, RECIT CONSTRUCTION A FEYEREISEN P, 1994, CERVEAU COMMUNICATIO Goldin-Meadow S, 2003, POINTING: WHERE LANGAUAGE, CULTURE, AND COGNITON MEET, P85 Gombert JE, 1990, DEV METALINGUISTIQUE GRAZIANO M, 2009, THESIS U STUDI SUOR Guidetti M., 2003, PRAGMATIQUE PSYCHOL Guidetti M., 2002, 1 LANGUAGE, V22, P265 Gullberg M, 2008, GESTURE, V8, P149, DOI 10.1075/gest.8.2.03gul Gullberg M., 2008, FIRST LANG, V28, P200, DOI DOI 10.1177/0142723707088074 Hadar U, 1997, SEMIOTICA, V115, P147, DOI 10.1515/semi.1997.115.1-2.147 Halliday Michael, 1976, COHESION ENGLISH Hickmann Maya, 2003, CHILDRENS DISCOURSE Iverson JM, 1998, NEW DIRECTIONS CHILD Jisa H, 2000, LINGUISTICS, V38, P591, DOI 10.1515/ling.38.3.591 Jisa H., 2004, LANGUAGE DEV CHILDHO, P135 Kendon A., 1980, RELATIONSHIP VERBAL, P207 Kendon A., 2004, GESTURE VISIBLE ACTI Kita S, 2007, LANG COGNITIVE PROC, V22, P1212, DOI 10.1080/01690960701461426 Kita S., 2000, LANGUAGE GESTURE, P162, DOI DOI 10.1017/CBO9780511620850.011 Kita S, 2003, J MEM LANG, V48, P16, DOI 10.1016/S0749-596X(02)00505-3 Kita S, 2007, GESTURE STUD, V1, P67 Krauss R. M., 2000, LANGUAGE GESTURE, P261, DOI DOI 10.1017/CBO9780511620850.017 KUNENE R, 2007, P 10 INT PRAGM C GOT Labov W., 1978, PARLER ORDINAIRE LAFOREST M, 1996, AUTOUR NARRATION LEONARD JL, 1993, LANGAGE SOC, V65, P39 LUNDQUIST L., 1980, COHERENCE TEXTUELLE Marcos H, 1998, COMMUNICATION PRELIN Mayberry R., 2000, LANGUAGE GESTURE, P199, DOI 10.1017/CBO9780511620850.013 McNeill D., 1992, HAND MIND WHAT GESTU NICOLADIS E, 2008, P 11 INT C STUD CHIL OZCALISKAN S, 2006, CONSTRUCTIONS ACQUIS, P31 Ozyürek Asli, 2008, Dev Psychol, V44, P1040, DOI 10.1037/0012-1649.44.4.1040 PIZZUTO E, 2007, GESTURAL COMMUNICATI, P164 Rime B., 1991, FUNDAMENTALS NONVERB, P239 SPERBER D, 1989, COMMUNICATION COGNIT Streeck Jurgen, 1992, ADV NONVERBAL COMMUN, P3 THOMPSON LA, 1994, J EXP CHILD PSYCHOL, V57, P327, DOI 10.1006/jecp.1994.1016 Tolchinsky L, 2004, LANGUAGE DEV CHILDHO, P233, DOI 10.1075/tilar.3.15tol VANDERSTRATTEN A, 1991, PREMIERS GESTES PREM NR 60 TC 11 Z9 11 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2010 VL 52 IS 6 SI SI BP 565 EP 576 DI 10.1016/j.specom.2010.02.009 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 604MB UT WOS:000278282000010 ER PT J AU Aubanel, V Nguyen, N AF Aubanel, Vincent Nguyen, Noel TI Automatic recognition of regional phonological variation in conversational interaction SO SPEECH COMMUNICATION LA English DT Article DE Conversational interaction; Regional phonological and phonetic variation; Automatic speech processing; French; Sociophonetics ID SOCIAL DESIRABILITY; SPEECH RECOGNITION; PERCEPTION; LANGUAGE; DIALECT; FRENCH; CONVERGENCE; ACCENT; CORPUS; SCALE AB One key aspect of face-to-face communication concerns the differences that may exist between speakers' native regional accents. This paper focuses on the characterization of regional phonological variation in a conversational setting. A new, interactive task was designed in which 12 pairs of participants engaged in a collaborative game leading them to produce a number of purpose-built names. 
In each game, the participants were native speakers of Southern French and Northern French, respectively. How the names were produced by each of the two participants was automatically determined from the recordings using ASR techniques and a pre-established set of possible regional variants along five phonological dimensions. A naive Bayes classifier was then applied to these phonetic forms, with a view to differentiating the speakers' native regional accents. The results showed that native regional accent was correctly recognized for 79% of the speakers. These results also revealed or confirmed the existence of accent-dependent differences in how segments are phonetically realized, such as the affrication of /d/ in /di/ sequences. Our data allow us to better characterize the phonological and phonetic patterns associated with regional varieties of French on a large scale and in a natural, interactional situation. (C) 2010 Elsevier B.V. All rights reserved. C1 [Aubanel, Vincent] CNRS, Lab Parole & Langage, F-13100 Aix En Provence, France. Aix Marseille Univ, F-13100 Aix En Provence, France. RP Aubanel, V (reprint author), CNRS, Lab Parole & Langage, 5 Ave Pasteur, F-13100 Aix En Provence, France. EM vincent.aubanel@lpl-aix.fr; noel.nguyen@lpl-aix.fr FU Region Provence-Alpes-Cote d'Azur; [ANR-08-BLAN-0276-01] FX This work was supported by a Ph.D. Scholarship awarded to the first author by the Region Provence-Alpes-Cote d'Azur, and by the project ANR-08-BLAN-0276-01. We are grateful to Stephane Rauzy for statistical advice and to the staff and students of the lycee Thiers, Marseille, for their kind participation. We also thank two anonymous reviewers for helpful comments. CR Adda-Decker M., 2008, TRAITEMENT AUTOMATIQ, V49-3, P13 Adda-Decker M, 2007, REV FR LING APPL, V12, P71 Adda-Decker M, 1999, SPEECH COMMUN, V29, P83, DOI 10.1016/S0167-6393(99)00032-1 ANDERSON AH, 1991, LANG SPEECH, V34, P351 Bertrand R., 2008, TRAITEMENT AUTOMATIQ, V49, P105 Binisti N., 2003, CAHIERS FRANCAIS CON, V8, P107 Bradlow A.R., 2007, J ACOUSTICAL SOC A 2, V121, P3072 Brunelliere A, 2009, COGNITION, V111, P390, DOI 10.1016/j.cognition.2009.02.013 Carton F., 1983, ACCENTS FRANCAIS Clopper CG, 2008, LANG SPEECH, V51, P175, DOI 10.1177/0023830908098539 Conrey B, 2005, BRAIN LANG, V95, P435, DOI 10.1016/j.bandl.2005.06.008 Coveney A., 2001, SOUNDS CONT FRENCH A CROWNE DP, 1960, J CONSULT PSYCHOL, V24, P349, DOI 10.1037/h0047358 Cutler A, 2005, SPEECH COMMUN, V47, P32, DOI 10.1016/j.specom.2005.02.001 Delvaux V, 2007, PHONETICA, V64, P145, DOI 10.1159/000107914 Dufour S., 2007, J ACOUST SOC AM, V121, P131 Durand J., 2003, CORPUS VARIATION PHO, P11 Durand J., 1988, RECHERCHES LINGUISTI, V17, P29 Durand J., 2004, VARIATION FRANCOPHON, P217 Durand J., 1990, GENERATIVE NONLINEAR Durand J., 2003, TRIBUNE INT LANGUES, V33, P3 Evans BG, 2004, J ACOUST SOC AM, V115, P352, DOI 10.1121/1.1635413 Eychenne J., 2006, THESIS U TOULOUSE LE FAGYAL Z, 2002, P 24 JOURN ET PAR NA, P165 Fagyal-Le Mentec Z, 2006, FRENCH: A LINGUISTIC INTRODUCTION, P17, DOI 10.1017/CBO9780511791185.003 Floccia C, 2006, J EXP PSYCHOL HUMAN, V32, P1276, DOI 10.1037/0096-1523.32.5.1276 FONAGY I, 1989, REV ROMANE, V24, P225 Hansen Anita Berit, 2001, LINGUISTIQUE, V37, P33 Hay J, 2006, LINGUIST REV, V23, P351, DOI 10.1515/TLR.2006.014 Kraljic T, 2008, COGNITION, V107, P54, DOI 10.1016/j.cognition.2007.07.013 LENNOX RD, 1984, J PERS SOC PSYCHOL, V46, P1349, DOI 10.1037/0022-3514.46.6.1349 MALECOT A, 1976, PHONETICA, V33, P45 MARLOWE D, 1961, J CONSULT PSYCHOL, 
V25, P100 MARTINET A, 1958, ROMANCE PHILOL, V11, P345 Martinet A., 1945, PRONONCIATION FRANCA NATALE M, 1975, J PERS SOC PSYCHOL, V32, P790, DOI 10.1037/0022-3514.32.5.790 New B, 2004, BEHAV RES METH INS C, V36, P516, DOI 10.3758/BF03195598 Pardo JS, 2006, J ACOUST SOC AM, V119, P2382, DOI 10.1121/1.2178720 Racine I., 2008, THESIS U GENEVE Scharenborg O, 2007, SPEECH COMMUN, V49, P336, DOI 10.1016/j.specom.2007.01.009 SNYDER M, 1974, J PERS SOC PSYCHOL, V30, P526, DOI 10.1037/h0037039 Sumner M, 2009, J MEM LANG, V60, P487, DOI 10.1016/j.jml.2009.01.001 TRIMAILLE C, 2008, AFLS C OXF 3 5 SEPT van Rijsbergen CJ., 1979, INFORM RETRIEVAL VANRULLEN T, 2005, P TRAIT AUT LANG NAT Vieru-Dimulescu B., 2008, TRAITEMENT AUTOMATIQ, V49, P135 WOEHRLING C, 2009, THESIS U PARIS SUD NR 47 TC 8 Z9 8 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2010 VL 52 IS 6 SI SI BP 577 EP 586 DI 10.1016/j.specom.2010.02.008 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 604MB UT WOS:000278282000011 ER PT J AU Kopp, S AF Kopp, Stefan TI Social resonance and embodied coordination in face-to-face conversation with artificial interlocutors SO SPEECH COMMUNICATION LA English DT Article DE Social Resonance; Coordination; Embodied conversational agents; Gesture ID NONCONSCIOUS MIMICRY; IMITATION; RAPPORT; SOUND AB Human natural face-to-face communication is characterized by inter-personal coordination. In this paper, phenomena are analyzed that yield coordination of behaviors, beliefs, and attitudes between interaction partners, which can be tied to a concept of establishing social resonance. It is discussed whether these mechanisms can and should be transferred to conversation with artificial interlocutors like ECAs or humanoid robots. It is argued that one major step in this direction is embodied coordination, mutual adaptations that are mediated by flexible modules for the top-down production and bottom-up perception of expressive conversational behavior that ground in and, crucially, coalesce in the same sensorimotor structures. Work on modeling this for ECAs with a focus on coverbal gestures is presented. (C) 2010 Elsevier B.V. All rights reserved. C1 Univ Bielefeld, Sociable Agents Grp, D-33501 Bielefeld, Germany. RP Kopp, S (reprint author), Univ Bielefeld, Sociable Agents Grp, POB 100131, D-33501 Bielefeld, Germany. EM skopp@techfak.uni-bielefeld.de RI Kopp, Stefan/K-3456-2013 OI Kopp, Stefan/0000-0002-4047-9277 FU Deutsche Forschungsgemeinschaft (DFG) [SFB 673]; Center of Excellence "Cognitive Interaction Technology" (CITEC) FX This research is supported by the Deutsche Forschungsgemeinschaft (DFG) in SFB 673 "Alignment in Communication" and the Center of Excellence "Cognitive Interaction Technology" (CITEC). CR Allwood J., 1992, Journal of Semantics, V9, DOI 10.1093/jos/9.1.1 Amit R., 2002, Proceedings 2nd International Conference on Development and Learning. ICDL 2002, DOI 10.1109/DEVLRN.2002.1011867 Bailenson JN, 2008, COMPUT HUM BEHAV, V24, P66, DOI 10.1016/j.chb.2007.01.015 Bergmann K, 2009, LECT NOTES ARTIF INT, V5773, P76 Bergmann Kirsten, 2009, P 8 INT C AUT AG MUL, P361 Bernieri F. 
J., 1991, FUNDAMENTALS NONVERB, P401 BERNIERI FJ, 1994, PERS SOC PSYCHOL B, V20, P303, DOI 10.1177/0146167294203008 Bertenthal BI, 2006, J EXP PSYCHOL HUMAN, V32, P210, DOI 10.1037/0096-1523.32.2.210 BICKMORE T, 2006, P CHI, P550 Bickmore T., 2003, THESIS MIT Billard A, 2002, FROM ANIM ANIMAT, P281 BRANIGAN H, 2010, J PRAGMATIC IN PRESS Breazeal C, 2005, ARTIF LIFE, V11, P31, DOI 10.1162/1064546053278955 Brennan SE, 1996, J EXP PSYCHOL LEARN, V22, P1482, DOI 10.1037/0278-7393.22.6.1482 BUSCHMEIER H, 2009, 12 EUR WORKSH NAT LA, P82 Cassell J., 2000, P INT NAT LANG GEN C, P171 Cassell J., 2000, EMBODIED CONVERSATIO CASSELL J, 2007, ACL WORKSH EMB LANG, P40 CASSELL J, 2001, P ACM CHI 2001 C SEA, P396 Chartrand TL, 1999, J PERS SOC PSYCHOL, V76, P893, DOI 10.1037//0022-3514.76.6.893 Clark H. H., 1996, USING LANGUAGE DUNCAN S, 2007, P GEST 2007 C INT SO Finkel EJ, 2006, J PERS SOC PSYCHOL, V91, P456, DOI 10.1037/0022-3514.91.3.456 Gallese V, 2004, TRENDS COGN SCI, V8, P396, DOI 10.1016/j.tics.2004.07.002 GARDENFORS P, 1996, CONTRIBUTION SCI TEC Giles Howard, 1991, LANGUAGE CONTEXTS CO Gratch J, 2006, LECT NOTES ARTIF INT, V4133, P14 Hall J. A., 2001, INTERPERSONAL SENSIT HAMILTON A, 2008, ATTENTION PERFORMANC, V22 HSEE CK, 1990, COGNITION EMOTION, V4, P327, DOI 10.1080/02699939008408081 Kang S.-H, 2008, P INT C AUT AG MULT, P120 Kendon Adam, 1973, SOCIAL COMMUNICATION, P29 KIMBARA I, 2005, GESTURE, V6, P39 Kopp S., 2004, P INT C MULT INT ICM, P97, DOI 10.1145/1027933.1027952 Kopp S, 2005, LECT NOTES ARTIF INT, V3661, P329 Kopp S, 2004, COMPUT ANIMAT VIRT W, V15, P39, DOI 10.1002/cav.6 Kramer NC, 2008, LECT NOTES COMPUT SC, V5208, P507 Lakin JL, 2008, PSYCHOL SCI, V19, P816, DOI 10.1111/j.1467-9280.2008.02162.x Lakin JL, 2003, J NONVERBAL BEHAV, V27, P145, DOI 10.1023/A:1025389814290 McNeill D., 1992, HAND MIND WHAT GESTU MILES L, 2009, J EXPT SOCIAL PSYCHO, P585 Montgomery KJ, 2007, SOC COGN AFFECT NEUR, V2, P114, DOI 10.1093/scan/nsm004 CONDON WS, 1966, J NERV MENT DIS, V143, P338, DOI 10.1097/00005053-196610000-00005 Pickering MJ, 2004, BEHAV BRAIN SCI, V27, P169 Press C, 2006, EUR J NEUROSCI, V24, P2415, DOI 10.1111/j.1460-9568.2006.05115.x Reddy M., 1979, METAPHOR THOUGHT, P284 Reeves B., 1996, MEDIA EQUATION PEOPL Rizzolatti G, 2001, NAT REV NEUROSCI, V2, P661, DOI 10.1038/35090060 SADEGHIPOUR A, 2009, LECT NOTES ARTIF INT, V5773, P80 SCHEFLEN AE, 1964, PSYCHIATR, V27, P316 Scheflen A. E., 1982, INTERACTION RHYTHMS, P13 Shockley K, 2003, J EXP PSYCHOL HUMAN, V29, P326, DOI 10.1037/0096-1523.29.2.326 SHON A, 2007, 2007 IEEE INT C ROB, P2847 Sowa Timo, 2005, P KOGWIS 2005, P183 STRONKS B, 2002, P CHI 02 WORKSH PHIL, P25 Sugathapala De Silva M. W., 1980, ASPECTS LINGUISTIC B, P105 Tickle-Degnen L, 1990, PSYCHOL INQ, V1, P285, DOI DOI 10.1207/S15327965PLI0104_1 TRAUM D, 1992, 2 INT C SPOK LANG PR, P137 ULDALL B, OPTIMAL DISTIN UNPUB Wallbott Harald G, 1995, MUTUALITIES DIALOGUE, P82 Wilson M, 2005, PSYCHOL BULL, V131, P460, DOI 10.1037/0033-2909.131.3.460 Yngve Victor, 1970, CHICAGO LINGUISTIC S, V6, P567 NR 62 TC 26 Z9 26 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD JUN PY 2010 VL 52 IS 6 SI SI BP 587 EP 597 DI 10.1016/j.specom.2010.02.007 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 604MB UT WOS:000278282000012 ER PT J AU Bailly, G Raidt, S Elisei, F AF Bailly, Gerard Raidt, Stephan Elisei, Frederic TI Gaze, conversational agents and face-to-face communication SO SPEECH COMMUNICATION LA English DT Article DE Conversational agents; Face-to-face communication; Gaze ID AUDIOVISUAL SPEECH-PERCEPTION; SOCIAL ATTENTION; DIRECTION; SIGNALS; EYES; MIND AB In this paper, we describe two series of experiments that examine audiovisual face-to-face interaction between naive human viewers and either a human interlocutor or a virtual conversational agent. The main objective is to analyze the interplay between speech activity and mutual gaze patterns during mediated face-to-face interactions. We first quantify the impact of deictic gaze patterns of our agent. We further aim at refining our experimental knowledge on mutual gaze patterns during human face-to-face interaction by using new technological devices such as non-invasive eye trackers and pinhole cameras, and at quantifying the impact of a selection of cognitive states and communicative functions on recorded gaze patterns. (C) 2010 Elsevier B.V. All rights reserved. C1 [Bailly, Gerard; Raidt, Stephan; Elisei, Frederic] Univ Grenoble, Speech & Cognit Dept, GIPSA Lab, CNRS,UMR 5216, Grenoble, France. RP Bailly, G (reprint author), Univ Grenoble, Speech & Cognit Dept, GIPSA Lab, CNRS,UMR 5216, Grenoble, France. EM gerard.bailly@gipsa-lab.grenoble-inp.fr FU Rhone-Alpes region FX We thank our colleague and target speaker Helene Loevenbruck for her incredible patience and complicity. We also thank all of our subjects - the ones whose data have been used here and the others whose data have been corrupted by deficiencies of recording devices. Edouard Gentaz has helped us in statistical processing. This paper benefited from the pertinent suggestions of the two anonymous reviewers. We thank Peter F. Dominey and Marion Dohen for the proofreading. This project has been financed by the project Presence of the cluster ISLE of the Rhone-Alpes region. CR Argyle Michael, 1976, GAZE AND MUTUAL GAZE BAILLY G, 2005, HUMAN COMPUTER INTER Bailly G., 2003, International Journal of Speech Technology, V6, DOI 10.1023/A:1025700715107 BARONCOHEN S, 1985, COGNITION, V21, P37, DOI 10.1016/0010-0277(85)90022-8 Benoit C, 1996, SPEECH COMMUN, V18, P381, DOI 10.1016/0167-6393(96)00026-X Blais C, 2008, PLOS ONE, V3, DOI 10.1371/journal.pone.0003022 BREAZEAL C, 2000, THESIS MIT BOSTON Buchan JN, 2008, BRAIN RES, V1242, P162, DOI 10.1016/j.brainres.2008.06.083 Buchan JN, 2007, SOC NEUROSCI, V2, P1, DOI 10.1080/17470910601043644 Carpenter M., 2000, COMMUNICATIVE LANGUA, V9, P30 Cassell J., 2000, EMBODIED CONVERSATIO CASTIELLO U, 1991, BRAIN, V114, P2639, DOI 10.1093/brain/114.6.2639 CHEN M, 2002, SIGCHI, P49 Clair R. N. 
S., 1979, LANGUAGE SOCIAL PSYC Driver J, 1999, VIS COGN, V6, P509 DUNCAN S, 1972, J PERS SOC PSYCHOL, V23, P283, DOI 10.1037/h0033031 ELISEI F, 2007, EYEGAZE AWARE ANAL S, P120 EVINGER C, 1994, EXP BRAIN RES, V100, P337 FUJIE S, 2005, BACK CHANNEL FEEDBAC, P889 GEIGER G, 2003, PERCEPTUAL EVALUATIO, P224 GOODWIN C, 1980, SOCIOL INQ, V50, P272, DOI 10.1111/j.1475-682X.1980.tb00023.x HADDINGTON P, 2002, STUDIA LINGUSITICA L, P107 Itti L, 2003, SPIE 48 ANN INT S OP, P64 Kaur M., 2003, INT C MULT INT VANC, P151 KENDON A, 1967, ACTA PSYCHOL, V26, P22, DOI 10.1016/0001-6918(67)90005-4 Langton SRH, 2000, TRENDS COGN SCI, V4, P50, DOI 10.1016/S1364-6613(99)01436-9 Langton SRH, 1999, VIS COGN, V6, P541 Lee SP, 2002, ACM T GRAPHIC, V21, P637 Leslie A.M., 1994, MAPPING MIND DOMAIN, P119, DOI DOI 10.1017/CBO9780511752902.006 Lewkowicz DJ, 1996, J EXP PSYCHOL HUMAN, V22, P1094 Matsusaka Y, 2003, IEICE T INF SYST, VE86D, P26 Miller LM, 2005, J NEUROSCI, V25, P5884, DOI 10.1523/JNEUROSCI.0896-05.2005 Morgan J. L., 1996, SIGNAL SYNTAX OVERVI NOVICK DG, 1996, COORDINATING TURN TA OS ED, 2005, SPEECH COMMUN, V47, P194 Peters C, 2003, ATTENTION DRIVEN EYE Peters C, 2005, LECT NOTES ARTIF INT, V3661, P229 Picot A, 2007, LECT NOTES ARTIF INT, V4722, P272 POSNER MI, 1990, ANNU REV NEUROSCI, V13, P25, DOI 10.1146/annurev.neuro.13.1.25 POSNER MI, 1980, Q J EXP PSYCHOL, V32, P3, DOI 10.1080/00335558008248231 Pourtois G, 2004, EUR J NEUROSCI, V20, P3507, DOI 10.1111/j.1460-9568.2004.03794.x Povinelli RJ, 2003, IEEE T KNOWL DATA EN, V15, P339, DOI 10.1109/TKDE.2003.1185838 PREMACK D, 1978, BEHAV BRAIN SCI, V1, P515 RAIDT S, 2006, LANG RESS EV C LREC, P2544 RAIDT S, 2008, THESIS I NATL POLYTE, P175 Reveret L., 2000, INT C SPEECH LANG PR, P755 Riva G, 2003, BEING THERE CONCEPTS Rochet-Capellan A, 2008, J SPEECH LANG HEAR R, V51, P1507, DOI 10.1044/1092-4388(2008/07-0173) RUTTER DR, 1987, DEV PSYCHOL, V23, P54, DOI 10.1037//0012-1649.23.1.54 Salvucci D.D., 2000, ETRA 00, P71 Scassellati B. M., 2001, FDN THEORY MIND HUMA Thorisson KR, 2002, TEXT SPEECH LANG TEC, V19, P173 Vatikiotis-Bateson E, 1998, PERCEPT PSYCHOPHYS, V60, P926, DOI 10.3758/BF03211929 WALLBOTT HG, 1991, J PERS SOC PSYCHOL, V61, P147, DOI 10.1037/0022-3514.61.1.147 Yarbus A. L, 1967, EYE MOVEMENTS VISION, DOI [10.1007/978-1-4899-5379-7, DOI 10.1007/978-1-4899-5379-7] NR 55 TC 18 Z9 18 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2010 VL 52 IS 6 SI SI BP 598 EP 612 DI 10.1016/j.specom.2010.02.015 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 604MB UT WOS:000278282000013 ER PT J AU Gunawan, TS Ambikairajah, E Epps, J AF Gunawan, Teddy Surya Ambikairajah, Eliathamby Epps, Julien TI Perceptual speech enhancement exploiting temporal masking properties of human auditory system SO SPEECH COMMUNICATION LA English DT Article DE Human auditory system; Speech enhancement; Temporal masking; Subjective test; Objective test ID NOISE SUPPRESSION; MODEL; RECOGNITION; ESTIMATOR; ALGORITHM; FREQUENCY; PHASE AB The use of simultaneous masking in speech enhancement has shown promise for a range of noise types. In this paper, a new speech enhancement algorithm based on a short-term temporal masking threshold to noise ratio (MNR) is presented. A novel functional model for forward masking based on three parameters is incorporated into a speech enhancement framework based on speech boosting. 
The performance of the speech enhancement algorithm using the proposed forward masking model was compared with seven other speech enhancement methods over 12 different noise types and four SNRs. Objective evaluation using PESQ revealed that using the proposed forward masking model, the speech enhancement algorithm outperforms the other algorithms by 6-20% depending on the SNR. Moreover, subjective evaluation using 16 listeners confirmed the objective test results. (C) 2009 Elsevier B.V. All rights reserved. C1 [Gunawan, Teddy Surya] Int Islamic Univ Malaysia, Dept Elect & Comp Engn, Kuala Lumpur 53100, Malaysia. [Ambikairajah, Eliathamby; Epps, Julien] Univ New S Wales, Sch Elect Engn & Telecommun, Sydney, NSW 2052, Australia. RP Gunawan, TS (reprint author), Int Islamic Univ Malaysia, Dept Elect & Comp Engn, Kuala Lumpur 53100, Malaysia. EM tsgunawan@gmail.com CR AMBIKAIRAJAH E, 1998, INT C SPOK LANG PROC [Anonymous], 2003, P835 ITUT BEROUTI M, 1979, INT C AC SPEECH SIGN BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 Bouquin RL., 1996, SPEECH COMMUN, V18, P3 Cappe O, 1994, IEEE T SPEECH AUDI P, V2, P345, DOI 10.1109/89.279283 CARNERO B, 1999, IEEE T SIGNAL PROCES, V47 Cohen I, 2004, IEEE SIGNAL PROC LET, V11, P725, DOI [10.1109/LSP.2004.833478, 10.1109/LSP.2004.833278] EBU, 1988, SOUND QUAL ASS MAT R EPHRAIM Y, 1995, IEEE T SPEECH AUDI P, V3, P251, DOI 10.1109/89.397090 EPHRAIM Y, 1992, P IEEE, V80, P1526, DOI 10.1109/5.168664 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 FEDER M, 1989, IEEE T ACOUST SPEECH, V37, P204, DOI 10.1109/29.21683 FLORENTINE M, 1988, J ACOUST SOC AM, V84, P195, DOI 10.1121/1.396964 GAGNON L, 1991, P INT C AC SPEECH SI GUNAWAN TS, 2006, 10 INT C COMM SYST S GUNAWAN TS, 2004, 10 INT C SPEECH SCI GUNAWAN TS, 2006, IEEE INT C AC SPEECH GUSTAFSSON S, 1998, INT C AC SPEECH SIGN Hansen J. H. L., 1999, ENCY ELECT ELECT ENG HIRSCH HG, 2000, AURORA EXPT FRAMEWOR Hu Y, 2004, IEEE SIGNAL PROC LET, V11, P270, DOI 10.1109/LSP.2003.821714 Hu Y, 2008, IEEE T AUDIO SPEECH, V16, P229, DOI 10.1109/TASL.2007.911054 Hu Y, 2003, IEEE T SPEECH AUDI P, V11, P457, DOI 10.1109/TSA.2003.815936 IRINO T, 1999, P INT C AC SPEECH SI *ITU, 1996, P830 ITUT ITU, 1998, BS1387 ITUR ITU, 2001, P862 ITUT JESTEADT W, 1982, J ACOUST SOC AM, V71, P950, DOI 10.1121/1.387576 JOHNSTON JD, 1988, IEEE J SEL AREA COMM, V6, P314, DOI 10.1109/49.608 KUBIN G, 1999, INT C AC SPEECH SIGN LIM JS, 1979, P IEEE, V67, P1586, DOI 10.1109/PROC.1979.11540 LIM JS, 1978, IEEE T ACOUST SPEECH, V26, P197 LIM JS, 1983, SPEECH ENHANCEMENT Lin L, 2003, ELECTRON LETT, V39, P754, DOI 10.1049/el:20030480 Lin L, 2002, ELECTRON LETT, V38, P1486, DOI 10.1049/el:20020965 LOCKWOOD P, 1992, SPEECH COMMUN, V11, P215, DOI 10.1016/0167-6393(92)90016-Z Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst Ma N, 2006, IEEE T AUDIO SPEECH, V14, P19, DOI 10.1109/TSA.2005.858515 Martin R, 2001, IEEE T SPEECH AUDI P, V9, P504, DOI 10.1109/89.928915 Moore B. C. J. 
E., 1995, HEARING Moore BC., 2003, INTRO PSYCHOL HEARIN MOORER JA, 1986, J AUDIO ENG SOC, V34, P143 OSHAUGHNESSY D, 1989, IEEE COMMUN MAG, V27, P46, DOI 10.1109/35.17653 SCALART P, 1996, INT C AC SPEECH SIGN SCHROEDER MR, 1979, J ACOUST SOC AM, V66, P1647, DOI 10.1121/1.383662 Strope B, 1997, IEEE T SPEECH AUDI P, V5, P451, DOI 10.1109/89.622569 Tsoukalas DE, 1997, IEEE T SPEECH AUDI P, V5, P497, DOI 10.1109/89.641296 VARGA AP, 1992, NOISEX 92 STUDY EFFE VARY P, 1985, SIGNAL PROCESS, V8, P387, DOI 10.1016/0165-1684(85)90002-7 Vaseghi S. V., 2000, ADV DIGITAL SIGNAL P Virag N, 1999, IEEE T SPEECH AUDI P, V7, P126, DOI 10.1109/89.748118 WANG DL, 1982, IEEE T ACOUST SPEECH, V30, P679, DOI 10.1109/TASSP.1982.1163920 WESTERLUND N, 2003, APPL SPEECH ENHANCEM ZWICKER E, 1999, PSYCHOACOUTICS FACTS NR 55 TC 9 Z9 11 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2010 VL 52 IS 5 BP 381 EP 393 DI 10.1016/j.specom.2009.12.006 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 594NK UT WOS:000277544200001 ER PT J AU Barra-Chicote, R Yamagishi, J King, S Montero, JM Macias-Guarasa, J AF Barra-Chicote, Roberto Yamagishi, Junichi King, Simon Manuel Montero, Juan Macias-Guarasa, Javier TI Analysis of statistical parametric and unit selection speech synthesis systems applied to emotional speech SO SPEECH COMMUNICATION LA English DT Article DE Emotional speech synthesis; HMM-based synthesis; Unit selection AB We have applied two state-of-the-art speech synthesis techniques (unit selection and HMM-based synthesis) to the synthesis of emotional speech. A series of carefully designed perceptual tests to evaluate speech quality, emotion identification rates and emotional strength were used for the six emotions which we recorded - happiness, sadness, anger, surprise, fear, disgust. For the HMM-based method, we evaluated spectral and source components separately and identified which components contribute to which emotion. Our analysis shows that, although the HMM method produces significantly better neutral speech, the two methods produce emotional speech of similar quality, except for emotions having context-dependent prosodic patterns. Whilst synthetic speech produced using the unit selection method has better emotional strength scores than the HMM-based method, the HMM-based method has the ability to manipulate the emotional strength. For emotions that are characterized by both spectral and prosodic components, synthetic speech using unit selection methods was more accurately identified by listeners. For emotions mainly characterized by prosodic components, HMM-based synthetic speech was more accurately identified. This finding differs from previous results regarding listener judgements of speaker similarity for neutral speech. We conclude that unit selection methods require improvements to prosodic modeling and that HMM-based methods require improvements to spectral modeling for emotional speech. Certain emotions cannot be reproduced well by either method. (C) 2009 Elsevier B.V. All rights reserved. C1 [Barra-Chicote, Roberto; Manuel Montero, Juan] Univ Politecn Madrid, Grp Tecnol Habla, ETSI Telecomunicac, E-28040 Madrid, Spain. [Yamagishi, Junichi; King, Simon] Univ Edinburgh, Ctr Speech Technol Res, Informat Forum, Edinburgh EH8 9AB, Midlothian, Scotland. [Macias-Guarasa, Javier] Univ Alcala De Henares, Dept Elect, Madrid 28805, Spain. 
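The Gunawan, Ambikairajah and Epps record above describes a perceptual enhancement framework that builds on conventional over-subtraction spectral subtraction. A minimal Python sketch of that baseline gain follows; the paper's three-parameter forward-masking model and its masking-threshold-to-noise ratio (MNR) weighting are not reproduced here, and the alpha and beta values are illustrative assumptions only.

import numpy as np

def spectral_subtraction_gain(noisy_psd, noise_psd, alpha=2.0, beta=0.01):
    """Per-bin gain for one frame: subtract an inflated noise estimate from the
    noisy power spectrum, then floor the result to limit musical noise."""
    clean_est = noisy_psd - alpha * noise_psd             # over-subtraction
    clean_est = np.maximum(clean_est, beta * noisy_psd)   # spectral floor
    return np.sqrt(clean_est / noisy_psd)                 # gain applied to the magnitude spectrum

# Hypothetical single frame: flat unit-power noise with a few "speech" bins standing out.
noise_psd = np.full(8, 1.0)
noisy_psd = noise_psd + np.array([0.1, 0.2, 6.0, 9.0, 0.3, 4.0, 0.2, 0.1])
print(np.round(spectral_subtraction_gain(noisy_psd, noise_psd), 3))

The printed gains stay close to 1 in the speech-dominated bins and fall to the floor in noise-only bins, which is the behaviour a masking-based weighting would then refine.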
RP Barra-Chicote, R (reprint author), Univ Politecn Madrid, Grp Tecnol Habla, ETSI Telecomunicac, Ciudad Univ S-N, E-28040 Madrid, Spain. EM barra@die.upm.es RI Macias-Guarasa, Javier/J-4625-2012; Montero, Juan M/K-2381-2014; Barra-Chicote, Roberto/L-4963-2014 OI Montero, Juan M/0000-0002-7908-5400; Barra-Chicote, Roberto/0000-0003-0844-7037 FU Spanish Ministry of Education; ROBONAUTA [DPI2007-66846-c02-02]; EPSRC; EC; SD-TEAM-UPM [TIN2008-06856-C05-03]; SD-TEAM-UAH [TIN2008-06856-C05-05]; eDIKT FX RB was visiting CSTR at the time of this work. RB was supported by the Spanish Ministry of Education and by project ROBONAUTA (DPI2007-66846-c02-02). JY is supported by EPSRC and the EC FP7 EMIME project. SK holds an EPSRC Advanced Research Fellowship. JMM and JMG are supported by projects SD-TEAM-UPM (TIN2008-06856-C05-03) and SD-TEAM-UAH (TIN2008-06856-C05-05), respectively. This work has made use of the resources provided by the Edinburgh Compute and Data Facility which is partially supported by the eDIKT initiative (http://www.edikt.org.uk). We also thank the two anonymous reviewers for their constructive feedback and helpful suggestions. The associate editor coordinating the review of this manuscript for publication was Dr. Marc Swerts. CR Pitrelli JF, 2006, IEEE T AUDIO SPEECH, V14, P1099, DOI 10.1109/TASL.2006.876123 BARRA R, 2007, P INT 2007, P2233 BARRACHICOTE R, 2008, P 6 INT C LANG RES E BARRACHICOTE R, 2008, V J TECNOL, P115 Barra R, 2006, INT CONF ACOUST SPEE, P1085 BENNETT C, 2006, P BLIZZ CHALL 2006 Black A., 2003, P EUROSPEECH GEN SWI, P1649 Black A. B., 2005, P INT 2005 LISB PORT, P77 Black A. W., 1995, P EUROSPEECH MADR SP, P581 Bulut M., 2002, P INT C SPOK LANG PR, P1265 Burkhardt F., 2005, P INT, P1517 CHARONNAT L, 2008, P LANG RES EV C, P2376 CLARK R, 2006, P BLIZZ CHALL WORKSH Clark R., 2007, P BLZ3 2007 P SSW6 Clark RAJ, 2007, SPEECH COMMUN, V49, P317, DOI 10.1016/j.specom.2007.01.014 Donovan RE, 1999, COMPUT SPEECH LANG, V13, P223, DOI 10.1006/csla.1999.0123 Eide E, 2002, PROCEEDINGS OF THE 2002 IEEE WORKSHOP ON SPEECH SYNTHESIS, P127, DOI 10.1109/WSS.2002.1224388 FRASER M, 2007, P BLZ3 2007 P SSW6 A GALLARDOANTOLIN A, 2007, P INTERSPEECH 2007 GOMES C, 2004, CPAIOR04, P387 HAMZA W, 2004, P ICSLP 2004 Hofer G., 2005, P INTERSPEECH LISB P, P501 Hunt A. J., 1996, P ICASSP 96, P373 Karaiskos V., 2008, P BLIZZ CHALL WORKSH Kawahara H, 1999, SPEECH COMMUN, V27, P187, DOI 10.1016/S0167-6393(98)00085-5 Montero J. 
M., 1998, P 5 INT C SPOK LANG, P923 MOULINES E, 1990, SPEECH COMMUN, V9, P453, DOI 10.1016/0167-6393(90)90021-Z Nose T, 2007, IEICE T INF SYST, VE90D, P1406, DOI 10.1093/ietisy/e90-d.9.1406 Scherer KR, 2003, SPEECH COMMUN, V40, P227, DOI 10.1016/S0167-6393(02)00084-5 SCHRODER M, 2004, THESIS SAARLAND U SA Schroder M., 2001, P EUROSPEECH 2001 SE, P561 STROM V, 2008, P INT 2008, P1873 SYRDAL AK, 2000, P ICSLP 2000 OCT, P411 Tachibana M, 2006, IEICE T INF SYST, VE89D, P1092, DOI 10.1093/ietisy/e89-d.3.1092 Tachibana M, 2005, IEICE T INF SYST, VE88D, P2484, DOI 10.1093/ietisy/e88-d.11.2484 Toda T, 2007, IEICE T INF SYST, VE90D, P816, DOI 10.1093/ietisy/e90-d.5.816 Tokuda K., 2008, HMM BASED SPEECH SYN Yamagishi J., 2008, P BLIZZ CHALL 2008 Yamagishi J, 2009, IEEE T AUDIO SPEECH, V17, P1208, DOI 10.1109/TASL.2009.2016394 Yoshimura T., 1999, P EUROSPEECH 99 SEPT, P2374 YOSHIMURA T, 2000, IEICE T D 2, V83, P2099 Zen H, 2007, IEICE T INF SYST, VE90D, P825, DOI 10.1093/ietisy/e90-d.5.825 Zen H, 2009, SPEECH COMMUN, V51, P1039, DOI 10.1016/j.specom.2009.04.004 Zen HG, 2007, IEICE T INF SYST, VE90D, P325, DOI 10.1093/ietisy/e90-d.1.325 NR 44 TC 6 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2010 VL 52 IS 5 BP 394 EP 404 DI 10.1016/j.specom.2009.12.007 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 594NK UT WOS:000277544200002 ER PT J AU Lee, YH Kim, HK AF Lee, Young Han Kim, Hong Kook TI Entropy coding of compressed feature parameters for distributed speech recognition SO SPEECH COMMUNICATION LA English DT Article DE Entropy coding; Distributed speech recognition; Huffman coding; Mel-frequency cepstral coefficient; Voicing class ID QUANTIZATION AB In this paper, we propose several entropy coding methods to further compress quantized mel-frequency cepstral coefficients (MFCCs) used for distributed speech recognition (DSR). As a standard DSR front-end, the European Telecommunications Standards Institute (ETSI) published an extended front-end that includes the split-vector quantization of MFCCs and voicing class information. By exploring entropy variances of compressed MFCCs according to the voicing class of the analysis frame and the amount of the entropy due to MFCC subvector indices, voicing class-dependent and subvector-wise Huffman coding methods are proposed. In addition, differential Huffman coding is then applied to further enhance the coding gain against class-dependent and subvector-wise Huffman codings. Subsequent experiments show that the average bit-rate of the subvector-wise differential Huffman coding is measured at 33.93 bits/frame, which is the smallest among the proposed Huffman coding methods, whereas that of a traditional Huffman coding that does not consider voicing class and encodes with a single Huffman coding tree for all the subvectors is measured at 42.22 bits/frame for the TIMIT database. In addition, we evaluate the performance of the proposed Huffman coding methods applied to speech in noise by using the Aurora 4 database, a standard speech database for DSR. As a result, it is shown that the subvector-wise differential Huffman coding method provides the smallest average bit-rate. (C) 2010 Elsevier B.V. All rights reserved. C1 [Lee, Young Han; Kim, Hong Kook] Gwangju Inst Sci & Technol, Dept Informat & Commun, Kwangju 500712, South Korea. 
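The Lee and Kim record above describes Huffman coding of quantized MFCC subvector indices. A minimal Python sketch of that idea follows; the index stream and the resulting bit counts are illustrative assumptions, not the ETSI extended front-end tables, and the paper's voicing-class-dependent and differential variants are only noted in the comments.

import heapq
from collections import Counter

def huffman_code(counts):
    """Return {symbol: bitstring} built from a dict of symbol counts."""
    heap = [[freq, i, [sym, ""]] for i, (sym, freq) in enumerate(counts.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                       # degenerate case: a single symbol
        return {heap[0][2][0]: "0"}
    next_id = len(heap)                      # tie-breaker so symbol lists are never compared
    while len(heap) > 1:
        lo = heapq.heappop(heap)
        hi = heapq.heappop(heap)
        for pair in lo[2:]:
            pair[1] = "0" + pair[1]
        for pair in hi[2:]:
            pair[1] = "1" + pair[1]
        heapq.heappush(heap, [lo[0] + hi[0], next_id] + lo[2:] + hi[2:])
        next_id += 1
    return {sym: bits for sym, bits in heap[0][2:]}

# Hypothetical stream of quantizer indices for one MFCC subvector over 16 frames;
# a skewed index distribution is what makes entropy coding pay off.
indices = [3, 3, 1, 0, 3, 2, 3, 1, 3, 3, 0, 3, 2, 3, 3, 1]
code = huffman_code(Counter(indices))
bits = sum(len(code[i]) for i in indices)
print(code, "->", bits / len(indices), "bits per index on average")
# A class-dependent scheme would build one such table per voicing class, and the
# differential variant would encode frame-to-frame index differences instead.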
RP Kim, HK (reprint author), Gwangju Inst Sci & Technol, Dept Informat & Commun, 1 Oryong Dong, Kwangju 500712, South Korea. EM cpumaker@gist.ac.kr; hongkook@gist.ac.kr FU Korea government (MEST) [2009-0057194]; Ministry of Knowledge and Economy, Korea [NIPA-2009-C1090-0902-0010]; Gwangiu Institute of Science and Technology FX This work was supported in part by the Korea Science and Engineering Foundation (KOSEF) grant funded by the Korea government (MEST) (No. 2009-0057194), by the Ministry of Knowledge and Economy, Korea, under the ITRC support program supervised by the National IT Industry Promotion Agency (NIPA)(NIPA-2009-C1090-0902-0010), and by the basic research project grant provided by the Gwangiu Institute of Science and Technology in 2009. CR [Anonymous], 2003, 202211 ETSI ES Borgstrom B. J., 2007, P INT AUG, P578 Digalakis VV, 1999, IEEE J SEL AREA COMM, V17, P82, DOI 10.1109/49.743698 Gallardo-Antolin A, 2005, IEEE T SPEECH AUDI P, V13, P1186, DOI 10.1109/TSA.2005.853210 Garofolo J., 1988, GETTING STARTED DARP Hirsch Guenter, 2002, EXPT FRAMEWORK PERFO HIRSCH HG, 1998, P ICSLP DENV CO, P1877 Huerta JM, 1998, P 5 INT C SPOK LANG, V4, P1463 HUFFMAN DA, 1952, P IRE, V40, P1098, DOI 10.1109/JRPROC.1952.273898 Kim HK, 2001, IEEE T SPEECH AUDI P, V9, P558 KISS I, 1999, P EUROSPEECH, P2183 KISS I, 2000, P ICSLP BEIJ CHIN, V4, P250 So S, 2006, SPEECH COMMUN, V48, P746, DOI 10.1016/j.specom.2005.10.002 RAJ B, 2001, P IEEE WORKSH ASRU T, P127 RAMASWAMY G, 1998, P ICASSP, V2, P977, DOI 10.1109/ICASSP.1998.675430 SORIN A, 2004, P ICASSP MAY, P129 Srinivasamurthy N, 2006, SPEECH COMMUN, V48, P888, DOI 10.1016/j.specom.2005.11.003 Tan ZH, 2005, SPEECH COMMUN, V47, P220, DOI 10.1016/j.specom.2005.05.007 ZHU Q, 2001, P IEEE INT C AC SPEE, V1, P113 NR 19 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2010 VL 52 IS 5 BP 405 EP 412 DI 10.1016/j.specom.2010.01.002 PG 8 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 594NK UT WOS:000277544200003 ER PT J AU Vicsi, K Szaszak, G AF Vicsi, Klara Szaszak, Gyoergy TI Using prosody to improve automatic speech recognition SO SPEECH COMMUNICATION LA English DT Article DE Speech recognition; Prosody; Syntactic unit; Sentence modality; Hidden Markov models AB In this paper acoustic processing and modelling of the supra-segmental characteristics of speech is addressed, with the aim of incorporating advanced syntactic and semantic level processing of spoken language for speech recognition/understanding tasks. The proposed modelling approach is very similar to the one used in standard speech recognition, where basic HMM units (the most often acoustic phoneme models) are trained and are then connected according to the dictionary and some grammar (language model) to obtain a recognition network, along which recognition can be interpreted also as an alignment process. In this paper the HMM framework is used to model speech prosody, and to perform initial syntactic and/or semantic level processing of the input speech in parallel to standard speech recognition. As acoustic prosodic features, fundamental frequency and energy are used. A method was implemented for syntactic level information extraction from the speech. 
The method was designed to work for fixed-stress languages, and it yields a segmentation of the input speech for syntactically linked word groups, or even single words corresponding to a syntactic unit (these word groups are sometimes referred to as phonological phrases in psycholinguistics, which can consist of one or more words). These so-called word-stress units are marked by prosody, and have an associated fundamental frequency and/or energy contour which allows their discovery. For this, HMMs for the different types of word-stress unit contours were trained and then used for recognition and alignment of such units from the input speech. This prosodic segmentation of the input speech also allows word-boundary recovery and can be used for N-best lattice rescoring based on prosodic information. The syntactic level input speech segmentation algorithm was evaluated for the Hungarian and for the Finnish languages that have fixed stress on the first syllable. (This means if a word is stressed, stress is realized on the first syllable of the word.) The N-best rescoring based on syntactic level word-stress unit alignment was shown to augment the number of correctly recognized words. For further syntactic and semantic level processing of the input speech in ASR, clause and sentence boundary detection and modality (sentence type) recognition were implemented. Again, the classification was carried out by HMMs, which model the prosodic contour for each clause and/or sentence modality type. Clause (and hence also sentence) boundary detection was based on HMMs' excellent capacity in dynamically aligning the reference prosodic structure to the utterance coming from the ASR input. This method also allows punctuation to be automatically marked. This semantic level processing of speech was investigated for the Hungarian and for the German languages. The correctness of recognized types of modalities was 69% for Hungarian, and 78% for German. (C) 2010 Elsevier B.V. All rights reserved. C1 [Vicsi, Klara; Szaszak, Gyoergy] Budapest Univ Technol & Econ, Lab Speech Acoust, TMIT, H-1111 Budapest, Hungary. RP Vicsi, K (reprint author), Budapest Univ Technol & Econ, Lab Speech Acoust, TMIT, Stoczek U 2, H-1111 Budapest, Hungary. EM vicsi@tmit.bme.hu; szaszak@tmit.bme.hu FU Hungarian Research Foundations OTKA T [046487]; IKTA [00056] FX The authors would like to thank Toomas Altosaar (Helsinki University of Technology) for his kind help and his contribution to the use of the Finnish Speech Database. The work has been supported by the Hungarian Research Foundations OTKA T 046487 ELE and IKTA 00056. CR AINSWORTH WA, 1976, MECH SPEECH RECOGNIT BATLINER A, 1994, NATO ASI SERIES F Becchetti C., 1999, SPEECH RECOGNITION T *C ALBR U KIEL I P, 1994, KIEL CORP READ SPEEC, V1 Gallwitz F, 2002, SPEECH COMMUN, V36, P81, DOI 10.1016/S0167-6393(01)00027-9 Hirota K., 2001, Physica C KOMPE R, 1995, P 4 EUR C SPEECH COM, P1333 RABINER LR, 1989, P IEEE, V77, P257, DOI 10.1109/5.18626 Silverman K., 1992, P INT C SPOK LANG PR, P867 SZASZAK G, 2007, COST ACT 2102 INT WO, P138 VAINIO M, 1999, P ICPHS 1999 SAN FRA, P2347 Veilleux N. 
M., 1993, P ARPA WORKSH HUM LA, P335, DOI 10.3115/1075671.1075749 VICSI K, 2008, USING PROSODY IMPROV Vicsi K., 2004, 2 MAG SZAM NYELV K S, P315 VICSI K, 2005, SPEECH TECHNOL, P363 VICSI K, 1998, 1 HUNGARIAN SPEECH D, P163 Wahlster W., 2000, VERBMOBIL FDN SPEECH Young S., 2005, HTK BOOK HTK VERSION NR 18 TC 15 Z9 15 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2010 VL 52 IS 5 BP 413 EP 426 DI 10.1016/j.specom.2010.01.003 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 594NK UT WOS:000277544200004 ER PT J AU Benesty, J Chen, JD Huang, YY AF Benesty, Jacob Chen, Jingdong Huang, Yiteng (Arden) TI On widely linear Wiener and tradeoff filters for noise reduction SO SPEECH COMMUNICATION LA English DT Article DE Noise reduction; Wiener filter; Widely linear Wiener filter; Circularity; Noncircularity ID COMPLEX-VARIABLES; STATISTICS; CIRCULARITY; SIGNALS AB Noise reduction is often formulated as a linear filtering problem in the frequency domain. With this formulation, the core issue of noise reduction becomes how to design an optimal frequency-domain filter that can significantly suppress noise without introducing perceptually noticeable speech distortion. While higher-order information can be used, most existing approaches use only second-order statistics to design the noise-reduction filter because they are relatively easier to estimate and are more reliable. When we transform non-stationary speech signals into the frequency domain and work with the short-time discrete Fourier transform coefficients, there are two types of second-order statistics, i.e., the variance and the so-called pseudo-variance due to the noncircularity of the signal. So far, only the variance information has been exploited in designing different noise-reduction filters while the pseudo-variance has been neglected. In this paper, we attempt to shed some light on how to use noncircularity in the context of noise reduction. We will discuss the design of optimal and suboptimal noise reduction filters using both the variance and pseudo-variance and answer the basic question whether noncircularity can be used to improve the noise-reduction performance. (C) 2010 Elsevier B.V. All rights reserved. C1 [Benesty, Jacob] Univ Quebec, INRS EMT, Montreal, PQ H5A 1K6, Canada. [Chen, Jingdong; Huang, Yiteng (Arden)] WeVoice Inc, Bridgewater, NJ 08807 USA. RP Chen, JD (reprint author), 9 Iroquois Trail, Branchburg, NJ 08876 USA. EM benesty@emt.inrs.ca; jingdongchen@ieee.org; ardenhuang@gmail.com CR Amblard PO, 1996, SIGNAL PROCESS, V53, P15, DOI 10.1016/0165-1684(96)00072-2 Amblard PO, 1996, SIGNAL PROCESS, V53, P1, DOI 10.1016/0165-1684(96)00071-0 [Anonymous], 1990, DARPA TIMIT ACOUSTIC Benesty J, 2009, SPRINGER TOP SIGN PR, V2, P1, DOI 10.1007/978-3-642-00296-0_1 Benesty J., 2005, SPEECH ENHANCEMENT CHEN J, 2003, ADAPTIVE SIGNAL PROC, P129 Chen JD, 2006, IEEE T AUDIO SPEECH, V14, P1218, DOI 10.1109/TSA.2005.860851 CHEVALIER P, 2009, P IEEE ICASSP, P3573 Diethorn EJ, 2004, AUDIO SIGNAL PROCESSING: FOR NEXT-GENERATION MULTIMEDIA COMMUNICATION SYSTEMS, P91, DOI 10.1007/1-4020-7769-6_4 ERIKSSON J, 2009, P IEEE ICASSP, P3565 Hirsch H. 
G., 1995, P IEEE INT C AC SPEE, V1, P153 Huang Y., 2006, ACOUSTIC MIMO SIGNAL LEE KF, 1989, IEEE T ACOUST SPEECH, V37, P1641, DOI 10.1109/29.46546 Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst Mandic D., 2009, COMPLEX VALUED NONLI Martin R, 2001, IEEE T SPEECH AUDI P, V9, P504, DOI 10.1109/89.928915 NEESER FD, 1993, IEEE T INFORM THEORY, V39, P1293, DOI 10.1109/18.243446 Ollila E, 2008, IEEE SIGNAL PROC LET, V15, P841, DOI 10.1109/LSP.2008.2005050 PICINBONO B, 1994, IEEE T SIGNAL PROCES, V42, P3473, DOI 10.1109/78.340781 PICINBONO B, 1995, IEEE T SIGNAL PROCES, V43, P2030, DOI 10.1109/78.403373 Schreier PJ, 2003, IEEE T SIGNAL PROCES, V51, P714, DOI [10.1109/TSP.2002.808085, 10.1109/TCP.2002.808085] STAHL V, 2000, P ICASSP, V3, P1875 Vary P, 2006, DIGITAL SPEECH TRANS NR 23 TC 3 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2010 VL 52 IS 5 BP 427 EP 439 DI 10.1016/j.specom.2010.02.003 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 594NK UT WOS:000277544200005 ER PT J AU Vicente-Pena, J Diaz-de-Maria, F AF Vicente-Pena, Jesus Diaz-de-Maria, Fernando TI Uncertainty decoding on Frequency Filtered parameters for robust ASR SO SPEECH COMMUNICATION LA English DT Article DE Robust speech recognition; Spectral subtraction; Uncertainty decoding; Frequency Filtered; Bounded distance HMM; SSBD-HMM ID SPEECH RECOGNITION; SPECTRAL SUBTRACTION; NOISE; MODEL; TIME AB The use of feature enhancement techniques to obtain estimates of the clean parameters is a common approach for robust automatic speech recognition (ASR). However, the decoding algorithm typically ignores how accurate these estimates are. Uncertainty decoding methods incorporate this type of information. In this paper, we develop a formulation of the uncertainty decoding paradigm for Frequency Filtered (FF) parameters using spectral subtraction as a feature enhancement method. Additionally, we show that the uncertainty decoding method for FF parameters admits a simple interpretation as a spectral weighting method that assigns more importance to the most reliable spectral components. Furthermore, we suggest combining this method with SSBD-HMM (Spectral Subtraction and Bounded Distance HMM), one recently proposed technique that is able to compensate for the effects of features that are highly contaminated (outliers). This combination pursues two objectives: to improve the results achieved by uncertainty decoding methods and to determine which part of the improvements is due to compensating for the effects of outliers and which part is due to compensating for other less deteriorated features. (C) 2010 Elsevier B.V. All rights reserved. C1 [Vicente-Pena, Jesus; Diaz-de-Maria, Fernando] Univ Carlos III Madrid, Dept Signal Proc & Commun, EPS, Madrid 28911, Spain. RP Vicente-Pena, J (reprint author), Univ Carlos III Madrid, Dept Signal Proc & Commun, EPS, Avda Univ 30, Madrid 28911, Spain. EM jvicente@tsc.uc3m.es; fdiaz@tsc.uc3m.es RI Diaz de Maria, Fernando/E-8048-2011 CR Arrowood J. 
A., 2002, P ICSLP, P1561 BENITEZ C, 2004, P ICSLP JEJ ISL KOR, P137 BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 *CMU, 1998, CMU V 0 6 PRON DICT Deng L., 2002, P ICSLP, P2449 de Veth J, 2001, SPEECH COMMUN, V34, P247, DOI 10.1016/S0167-6393(00)00037-6 de Veth J, 2001, SPEECH COMMUN, V34, P57, DOI 10.1016/S0167-6393(00)00046-7 Droppo J., 2002, P ICASSP 2002, V1, P57 GONG YF, 1995, SPEECH COMMUN, V16, P261, DOI 10.1016/0167-6393(94)00059-J HIRSCH G, 2002, AU41702 ETSI STQ AUR KRISTJANSSON T, 2002, P ICASSP, V1, P61 Martin R, 2001, IEEE T SPEECH AUDI P, V9, P504, DOI 10.1109/89.928915 Moreno PJ, 1996, P IEEE INT C AC SPEE, V2, P733 MORRIS A, 2001, WISP WORKSH INN METH NADEU C, 1995, P EUR 95, P1381 Nadeu C, 2001, SPEECH COMMUN, V34, P93, DOI 10.1016/S0167-6393(00)00048-0 *NIST, 1992, NIST RES MAN CORP RM Paliwal KK, 1999, P EUR C SPEECH COMM, P85 Papoulis A., 2002, PROBABILITY RANDOM V PARIHAR N, 2001, AURORA WORKING GROUP PAUL DB, 1992, HLT 91, P357 RABINER LR, 1989, P IEEE, V77, P257, DOI 10.1109/5.18626 Stouten V, 2006, SPEECH COMMUN, V48, P1502, DOI 10.1016/j.specom.2005.12.006 Varga AP, 1992, TECH REP DRA SPEECH VICENTEPENA J, 2006, P INT C SPOK LANG PR, P1491 Vicente-Pena J, 2010, SPEECH COMMUN, V52, P123, DOI 10.1016/j.specom.2009.09.002 Vicente-Pena J, 2006, SPEECH COMMUN, V48, P1379, DOI 10.1016/j.specom.2006.07.007 Weiss NA, 1993, INTRO STAT, P407 WOODLAND PC, 1994, P IEEE INT C AC SPEE, V2, P125 Yoma NB, 1998, IEEE T SPEECH AUDI P, V6, P579, DOI 10.1109/89.725325 Yoma NB, 2002, IEEE T SPEECH AUDI P, V10, P158, DOI 10.1109/TSA.2002.1001980 Young S., 2002, HTK BOOK HTK VERSION NR 32 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2010 VL 52 IS 5 BP 440 EP 449 DI 10.1016/j.specom.2010.02.002 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 594NK UT WOS:000277544200006 ER PT J AU Paliwal, K Wojcicki, K Schwerin, B AF Paliwal, Kuldip Wojcicki, Kamil Schwerin, Belinda TI Single-channel speech enhancement using spectral subtraction in the short-time modulation domain SO SPEECH COMMUNICATION LA English DT Article DE Speech enhancement; Modulation spectral subtraction; Speech enhancement fusion; Analysis-modification-synthesis (AMS); Musical noise ID PRIMARY AUDITORY-CORTEX; FOURIER-ANALYSIS; AMPLITUDE-MODULATION; TRANSMISSION INDEX; QUALITY ESTIMATION; MASKING PROPERTIES; RECOGNITION; FREQUENCY; NOISE; INTELLIGIBILITY AB In this paper we investigate the modulation domain as an alternative to the acoustic domain for speech enhancement. More specifically, we wish to determine how competitive the modulation domain is for spectral subtraction as compared to the acoustic domain. For this purpose, we extend the traditional analysis-modification-synthesis framework to include modulation domain processing. We then compensate the noisy modulation spectrum for additive noise distortion by applying the spectral subtraction algorithm in the modulation domain. Using an objective speech quality measure as well as formal subjective listening tests, we show that the proposed method results in improved speech quality. Furthermore, the proposed method achieves better noise suppression than the MMSE method. In this study, the effect of modulation frame duration on speech quality of the proposed enhancement method is also investigated. 
The results indicate that modulation frame durations of 180-280 ms, provide a good compromise between different types of spectral distortions, namely musical noise and temporal slurring. Thus given a proper selection of modulation frame duration, the proposed modulation spectral subtraction does not suffer from musical noise artifacts typically associated with acoustic spectral subtraction. In order to achieve further improvements in speech quality, we also propose and investigate fusion of modulation spectral subtraction with the MMSE method. The fusion is performed in the short-time spectral domain by combining the magnitude spectra of the above speech enhancement algorithms. Subjective and objective evaluation of the speech enhancement fusion shows consistent speech quality improvements across input SNRs. (C) 2010 Elsevier B.V. All rights reserved. C1 [Paliwal, Kuldip; Wojcicki, Kamil; Schwerin, Belinda] Griffith Univ, Griffith Sch Engn, Signal Proc Lab, Nathan, Qld 4111, Australia. RP Wojcicki, K (reprint author), Griffith Univ, Griffith Sch Engn, Signal Proc Lab, Nathan, Qld 4111, Australia. EM kamil.wojcicki@ieee.org CR ALLEN JB, 1977, IEEE T ACOUST SPEECH, V25, P235, DOI 10.1109/TASSP.1977.1162950 ALLEN JB, 1977, P IEEE, V65, P1558, DOI 10.1109/PROC.1977.10770 Arai T., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607318 Atlas L, 2003, EURASIP J APPL SIG P, V2003, P668, DOI 10.1155/S1110865703305013 Atlas L., 2004, P IEEE INT C AC SPEE, V2, P761 ATLAS L, 2003, 155 IEICE U Atlas LE, 2001, P SOC PHOTO-OPT INS, V4474, P1, DOI 10.1117/12.448636 BACON SP, 1989, J ACOUST SOC AM, V85, P2575, DOI 10.1121/1.397751 Berouti M., 1979, P IEEE INT C AC SPEE, V4, P208 BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 Cappe O, 1994, IEEE T SPEECH AUDI P, V2, P345, DOI 10.1109/89.279283 CROCHIERE RE, 1980, IEEE T ACOUST SPEECH, V28, P99, DOI 10.1109/TASSP.1980.1163353 Depireux DA, 2001, J NEUROPHYSIOL, V85, P1220 DRULLMAN R, 1994, J ACOUST SOC AM, V95, P1053, DOI 10.1121/1.408467 DRULLMAN R, 1994, J ACOUST SOC AM, V95, P2670, DOI 10.1121/1.409836 EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 Falk T. 
H., 2007, P ISCA C INT SPEECH, P970 Goldsworthy RL, 2004, J ACOUST SOC AM, V116, P3679, DOI 10.1121/1.1804628 Greenberg S., 2001, P 7 EUR C SPEECH COM, P473 GRIFFIN DW, 1984, IEEE T ACOUST SPEECH, V32, P236, DOI 10.1109/TASSP.1984.1164317 Hasan MK, 2004, IEEE SIGNAL PROC LET, V11, P450, DOI 10.1109/LSP.2004.824017 Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616 Hermansky H., 1995, P IEEE INT C AC SPEE, V1, P405 HOUTGAST T, 1985, J ACOUST SOC AM, V77, P1069, DOI 10.1121/1.392224 Hu Y, 2007, SPEECH COMMUN, V49, P588, DOI 10.1016/j.specom.2006.12.006 Hu Y, 2004, IEEE T SPEECH AUDI P, V12, P59, DOI 10.1109/TSA.2003.819949 Kamath S., 2002, P IEEE INT C AC SPEE Kanedera N, 1999, SPEECH COMMUN, V28, P43, DOI 10.1016/S0167-6393(99)00002-3 Kim DS, 2004, IEEE SIGNAL PROC LET, V11, P849, DOI 10.1109/LSP.2004.835466 Kim DS, 2005, IEEE T SPEECH AUDI P, V13, P821, DOI 10.1109/TSA.2005.851924 Kingsbury BED, 1998, SPEECH COMMUN, V25, P117, DOI 10.1016/S0167-6393(98)00032-6 Kinnunen T, 2008, P ISCA SPEAK LANG RE KINNUNEN T, 2006, P IEEE INT C AC SPEE, V1, P665 Kowalski N, 1996, J NEUROPHYSIOL, V76, P3503 LIM JS, 1979, P IEEE, V67, P1586, DOI 10.1109/PROC.1979.11540 Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst Lu CT, 2007, PATTERN RECOGN LETT, V28, P1300, DOI 10.1016/j.patrec.2007.03.001 Lu X, 2010, SPEECH COMMUN, V52, P1, DOI 10.1016/j.specom.2009.08.006 Lyons J., 2008, P ISCA C INT SPEECH, P387 Malayath N, 2000, DIGIT SIGNAL PROCESS, V10, P55, DOI 10.1006/dspr.1999.0363 Mesgarani N., 2005, P ICASSP, V1, P1105, DOI 10.1109/ICASSP.2005.1415311 Nadeu C, 1997, SPEECH COMMUN, V22, P315, DOI 10.1016/S0167-6393(97)00030-7 Paliwal K, 2008, IEEE SIGNAL PROC LET, V15, P785, DOI 10.1109/LSP.2008.2005755 Payton K. L., 2002, PAST PRESENT FUTURE, P125 Payton KL, 1999, J ACOUST SOC AM, V106, P3637, DOI 10.1121/1.428216 PORTNOFF MR, 1981, IEEE T ACOUST SPEECH, V29, P364, DOI 10.1109/TASSP.1981.1163580 Quatieri T. F., 2002, DISCRETE TIME SPEECH Rix AW, 2001, PERCEPTUAL EVALUATIO, P862 SCHREINER CE, 1986, HEARING RES, V21, P227, DOI 10.1016/0378-5955(86)90221-2 Shamma SA, 1996, NETWORK-COMP NEURAL, V7, P439, DOI 10.1088/0954-898X/7/3/001 Shannon B., 2006, P INT C SPOK LANG PR, P1423 SHEFT S, 1990, J ACOUST SOC AM, V88, P796, DOI 10.1121/1.399729 STEENEKEN HJM, 1980, J ACOUST SOC AM, V67, P318, DOI 10.1121/1.384464 Thompson JK, 2003, P IEEE INT C AC SPEE, V5, P397 Tyagi V., 2003, P ISCA EUR C SPEECH, P981 VASEGHI SV, 1992, J AUDIO ENG SOC, V40, P791 Virag N, 1999, IEEE T SPEECH AUDI P, V7, P126, DOI 10.1109/89.748118 Vuuren S. V., 1998, P INT C SPOK LANG PR, P3205 WANG DL, 1982, IEEE T ACOUST SPEECH, V30, P679, DOI 10.1109/TASSP.1982.1163920 XIAO X, 2007, P ICASSP 2007, V4, P1021 ZADEH LA, 1950, P IRE, V38, P291, DOI 10.1109/JRPROC.1950.231083 NR 62 TC 25 Z9 28 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2010 VL 52 IS 5 BP 450 EP 475 DI 10.1016/j.specom.2010.02.004 PG 26 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 594NK UT WOS:000277544200007 ER PT J AU Denby, B Schultz, T Honda, K AF Denby, Bruce Schultz, Tanja Honda, Kiyoshi TI Special Issue Silent Speech Interfaces SO SPEECH COMMUNICATION LA English DT Editorial Material EM denby@ieee.org NR 0 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
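The Paliwal, Wojcicki and Schwerin record above applies spectral subtraction to the modulation spectrum of each acoustic-frequency bin's magnitude trajectory rather than to the acoustic spectrum itself. A simplified Python sketch of that per-bin processing follows; the trajectory, modulation frame length, noise estimate and flooring rule are assumptions for illustration (the paper reports 180-280 ms modulation frames working best), and recombination with the noisy acoustic phase and resynthesis to a waveform are omitted.

import numpy as np

def modspec_subtract(traj, noise_mag, frame=32, shift=16, beta=0.002):
    """Enhance the magnitude trajectory |X(n, k)| of one acoustic bin k by
    spectral subtraction in the modulation (trajectory-FFT) domain."""
    win = np.hanning(frame)
    out = np.zeros(len(traj))
    norm = np.zeros(len(traj))
    for start in range(0, len(traj) - frame + 1, shift):
        seg = traj[start:start + frame] * win
        spec = np.fft.rfft(seg)                      # modulation spectrum
        mag = np.abs(spec) - noise_mag               # subtract noise estimate
        mag = np.maximum(mag, beta * np.abs(spec))   # spectral floor
        seg_hat = np.fft.irfft(mag * np.exp(1j * np.angle(spec)), frame)
        out[start:start + frame] += seg_hat * win    # weighted overlap-add
        norm[start:start + frame] += win ** 2
    return out / np.maximum(norm, 1e-8)

# Hypothetical bin trajectory: a slow speech-like modulation plus random noise.
rng = np.random.default_rng(0)
n = np.arange(400)
traj = 1.0 + 0.5 * np.sin(2 * np.pi * n / 25.0) + 0.2 * rng.random(400)
noise_est = np.full(17, 0.3)   # 32-point rfft -> 17 modulation bins, crude flat guess
print(np.round(modspec_subtract(traj, noise_est)[:8], 3))

In the full method this is repeated for every acoustic bin, and the enhanced magnitudes are combined with the noisy short-time phase before overlap-add synthesis back to audio.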
PD APR PY 2010 VL 52 IS 4 SI SI BP 269 EP 269 DI 10.1016/j.specom.2010.02.001 PG 1 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 574XB UT WOS:000276026400001 ER PT J AU Denby, B Schultz, T Honda, K Hueber, T Gilbert, JM Brumberg, JS AF Denby, B. Schultz, T. Honda, K. Hueber, T. Gilbert, J. M. Brumberg, J. S. TI Silent speech interfaces SO SPEECH COMMUNICATION LA English DT Article DE Silent speech; Speech pathologies; Cellular telephones; Speech recognition; Speech synthesis ID BRAIN-COMPUTER INTERFACES; ULTRASOUND IMAGES; MOTOR CORTEX; RECOGNITION; SYSTEM; MOVEMENTS; ELECTRODE; SURFACE; COMMUNICATION; TETRAPLEGIA AB The possibility of speech processing in the absence of an intelligible acoustic signal has given rise to the idea of a 'silent speech' interface, to be used as an aid for the speech-handicapped, or as part of a communications system operating in silence-required or high-background-noise environments. The article first outlines the emergence of the silent speech interface from the fields of speech production, automatic speech processing, speech pathology research, and telecommunications privacy issues, and then follows with a presentation of demonstrator systems based on seven different types of technologies. A concluding section underlining some of the common challenges faced by silent speech interface researchers, and ideas for possible future directions, is also provided. (C) 2009 Elsevier B.V. All rights reserved. C1 [Denby, B.] Univ Paris 06, F-75005 Paris, France. [Schultz, T.] Univ Karlsruhe, Cognit Syst Lab, D-76131 Karlsruhe, Germany. [Honda, K.] ATR Cognit Informat Sci Labs, Seika, Kyoto 6190288, Japan. [Denby, B.; Hueber, T.] ESPCI ParisTech, Elect Lab, F-75005 Paris, France. [Gilbert, J. M.] Univ Hull, Dept Engn, Kingston Upon Hull HU6 7RX, N Humberside, England. [Brumberg, J. S.] Boston Univ, Dept Cognit & Neural Syst, Boston, MA 02215 USA. RP Denby, B (reprint author), Univ Paris 06, 4 Pl Jussieu, F-75005 Paris, France. EM denby@ieee.org; tanja@ira.uka.de; honda@atr.jp; hueber@ieee.org; J.M.Gilbert@hull.ac.uk; brum-berg@cns.bu.edu FU French Department of Defense (DGA); Centre de Microelectronique de Paris Ile-de-France (CEMIP); French National Research Agency (ANR) [ANR-06-BLAN-0166]; ENT Consultants' Fund; Hull and East Yorkshire Hospitals NHS Trust; National Institute on Deafness and other Communication Disorders [R01 DC07683, R44 DC007050-02]; National Science Foundation [SBE-0354378] FX The authors acknowledge support from the French Department of Defense (DGA); the "Centre de Microelectronique de Paris Ile-de-France" (CEMIP); the French National Research Agency (ANR) under the contract number ANR-06-BLAN-0166; the ENT Consultants' Fund, Hull and East Yorkshire Hospitals NHS Trust; the National Institute on Deafness and other Communication Disorders (R01 DC07683; R44 DC007050-02); and the National Science Foundation (SBE-0354378). They also wish to thank Gerard Chollet; Maureen Stone; Laurent Benaroya; Gerard Dreyfus; Pierre Roussel; and Szu-Chen (Stan) Jou for their help in preparing this article. 
CR ARNAL A, 2000, 23 JOURN ET PAR, P425 BAKEN RJ, 1984, J SPEECH HEAR DISORD, V49, P202 Bartels J, 2008, J NEUROSCI METH, V174, P168, DOI 10.1016/j.jneumeth.2008.06.030 Betts BJ, 2006, INTERACT COMPUT, V18, P1242, DOI 10.1016/j.intcom.2006.08.012 Birbaumer N, 2000, IEEE T REHABIL ENG, V8, P190, DOI 10.1109/86.847812 Blankertz B, 2006, IEEE T NEUR SYS REH, V14, P147, DOI 10.1109/TNSRE.2006.875557 BLOM ED, 1979, LARYNGECTOMEE REHABI, P251 BOS JC, 2005, 2005064 DRCD TOR CR Brown DR, 2005, MEAS SCI TECHNOL, V16, P2381, DOI 10.1088/0957-0233/16/11/033 Brown DR, 2004, MEAS SCI TECHNOL, V15, P1291, DOI 10.1088/0957-0233/15/7/010 BRUMBERG JS, 2007, NEUR M PLANN 2007 SA Brumberg JS, 2010, SPEECH COMMUN, V52, P367, DOI 10.1016/j.specom.2010.01.001 BRUMBERG JS, 2008, NEUR M PLANN 2007 WA BURNETT GC, 1997, J ACOUST SOC AM, V102, pA3168, DOI 10.1121/1.420785 Chan A. D. C., 2003, THESIS U NEW BRUNSWI Chan ADC, 2001, MED BIOL ENG COMPUT, V39, P500, DOI 10.1007/BF02345373 Crevier-Buchman Lise, 2002, Rev Laryngol Otol Rhinol (Bord), V123, P137 DASALLA CS, 2009, P 3 INT CONV REH ENG DAVIDSON L, 2005, J ACOUST SOC AM, V120, P407 DEKENS T, 2008, P 6 INT LANG RES EV DENBY B, 2004, P IEEE INT C AC SPEE, V1, P1685 DENBY B, 2006, PROSPECTS SILENT SPE, P1365 Dornhege G., 2007, BRAIN COMPUTER INTER DRUMMOND S, 1996, MOTOR SKILLS, V83, P801 Dupont S., 2004, P ROB 2004 WORKSH IT EPSTEIN CM, 1983, INTRO EEG EVOKED POT EPSTEIN MA, 2005, CLIN LINGUIST PHONET, V16, P567 Fagan MJ, 2008, MED ENG PHYS, V30, P419, DOI 10.1016/j.medengphy.2007.05.003 FITZPATRICK M, 2002, NEW SCI 0403 Furui S., 2001, DIGITAL SPEECH PROCE GEORGOPOULOS AP, 1982, J NEUROSCI, V2, P1527 GIBBON F, 2005, BIBLIO ELECTROPALATO Gracco VL, 2005, NEUROIMAGE, V26, P294, DOI 10.1016/j.neuroimage.2005.01.033 GUENTHER FH, 2008, NEUR M PLANN 2008 WA HASEGAWA T, 1992, P IEEE ICCS ISITA 19, V20, P617 HASEGAWAJOHNSON M, 2008, COMMUNICATION HERACLEOUS P, 2007, EURASIP J ADV SIG PR, P1 Hirahara T, 2010, SPEECH COMMUN, V52, P301, DOI 10.1016/j.specom.2009.12.001 HOCHBERG LR, 2008, NEUR M PLANN 2008 WA Hochberg LR, 2006, NATURE, V442, P164, DOI 10.1038/nature04970 Holmes J., 2001, SPEECH SYNTHESIS REC Hoole P., 1999, COARTICULATION THEOR, P260 HOUSE D, 2002, LECT NOTES COMPUTER, V2443, P65 Hueber T., 2008, INT SEM SPEECH PROD, P365 HUEBER T, 2008, PHONE RECOGNITION UL, P2032 Hueber T, 2010, SPEECH COMMUN, V52, P288, DOI 10.1016/j.specom.2009.11.004 HUEBER T, 2007, INT C PHON SCI SAARB, P2193 HUEBER T, 2007, IEEE INT C AC SPEECH, V7, P1245 HUEBER T, 2007, CONTINUOUS SPEECH PH, P658 Hummel J, 2006, PHYS MED BIOL, V51, pN205, DOI 10.1088/0031-9155/51/10/N01 *IEEE, 2008, IEEE COMPUTER, V41 JORGENSEN C, 2003, P INT JOINT C NEUR N, V4, P3128, DOI 10.1109/IJCNN.2003.1224072 Jorgensen C, 2010, SPEECH COMMUN, V52, P354, DOI 10.1016/j.specom.2009.11.003 JORGENSEN C, 2005, P 38 ANN HAW INT C S JOU S, 2007, INT C AC SPEECH SIGN JOU S, 2006, INTERSPEECH 2006 9 I, V2, P573 KENNEDY PR, 1989, J NEUROSCI METH, V29, P181, DOI 10.1016/0165-0270(89)90142-8 KENNEDY PR, 2006, ELECT ENG HDB SERIES, V1 Kennedy PR, 2000, IEEE T REHABIL ENG, V8, P198, DOI 10.1109/86.847815 Kennedy PR, 1998, NEUROREPORT, V9, P1707, DOI 10.1097/00001756-199806010-00007 Kim S., 2007, NEURAL ENG, P486 Levinson SE, 2005, MATHEMATICAL MODELS FOR SPEECH TECHNOLOGY, P1, DOI 10.1002/0470020911 Lotte F, 2007, J NEURAL ENG, V4, pR1, DOI 10.1088/1741-2560/4/R01 MAEDA S, 1990, NATO ADV SCI I D-BEH, V55, P131 Maier-Hein L., 2005, IEEE WORKSH AUT SPEE, P331 MANABE H, 2004, P 26 ANN INT ENG MED, V2, P4389 Manabe H., 2003, P 
EXT ABSTR CHI2003, P794 MARCHAL A, 1993, LANG SPEECH, V36, P3 Maynard EM, 1997, ELECTROEN CLIN NEURO, V102, P228, DOI 10.1016/S0013-4694(96)95176-0 Millet L, 2007, NY TIMES BK REV, P20 Morse M. S., 1991, P 13 ANN INT C IEEE, V13, P1877 MORSE MS, 1989, IMAGES 21 CENTURY 2, V11, P1793 MORSE MS, 1986, COMPUT BIOL MED, V16, P399, DOI 10.1016/0010-4825(86)90064-8 MUNHALL KG, 1995, J ACOUST SOC AM, V98, P1222, DOI 10.1121/1.413621 Nakajima Y., 2003, P EUROSPEECH, P2601 Nakajima Y., 2003, P ICASSP, P708 Nakajima Y, 2006, IEICE T INF SYST, VE89D, P1, DOI 10.1093/ietisy/e89-d.1.1 NAKAMURA H, 1988, Patent No. 4769845 Nakada Y., 2005, Scientific Reports of the Kyoto Prefectural University, Human Environment and Agriculture, P7 NessAiver MS, 2006, J MAGN RESON IMAGING, V23, P92, DOI 10.1002/jmri.20463 Neuper C, 2003, CLIN NEUROPHYSIOL, V114, P399, DOI 10.1016/S1388-2457(02)00387-5 Ng L., 2000, P INT C AC SPEECH SI, V1, P229 NGUYEN N, 1996, J PHONETICS, P77 Nijholt A, 2008, IEEE INTELL SYST, V23, P72, DOI 10.1109/MIS.2008.41 Otani Makoto, 2008, Acoustical Science and Technology, V29, DOI 10.1250/ast.29.195 *OUISP, 2006, OR ULTR SYNTH SPEECH Patil SA, 2010, SPEECH COMMUN, V52, P327, DOI 10.1016/j.specom.2009.11.006 PERKELL JS, 1992, J ACOUST SOC AM, V92, P3078, DOI 10.1121/1.404204 PETAJAN ED, 1984, IEEE COMM SOC GLOB T Porbadnigk A, 2009, BIOSIGNALS 2009: PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON BIO-INSPIRED SYSTEMS AND SIGNAL PROCESSING, P376 PREUSS RD, 2006, 8 INT C SIGN PROC IC, V1, P16 Quatieri TF, 2006, IEEE T AUDIO SPEECH, V14, P533, DOI 10.1109/TSA.2005.855838 ROTHENBERG M, 1992, J VOICE, V6, P36, DOI 10.1016/S0892-1997(05)80007-4 RUBIN P, 1998, P AUDIO VISUAL SPEEC, P233 SAJDA P, 2008, IEEE SIGNAL PROCESS SCHONLE PW, 1987, BRAIN LANG, V31, P26, DOI 10.1016/0093-934X(87)90058-7 SCHROETER J, 2000, IEEE INT C MULT EXP, P571 Schultz T, 2010, SPEECH COMMUN, V52, P341, DOI 10.1016/j.specom.2009.12.002 Stone M, 2005, CLIN LINGUIST PHONET, V19, P455, DOI 10.1080/02699200500113558 Stone M, 1995, J ACOUST SOC AM, V98, P3107, DOI 10.1121/1.413799 Stone M, 1986, Dysphagia, V1, P78, DOI 10.1007/BF02407118 STONE M, 1983, J PHONETICS, V11, P207 SUGIE N, 1985, IEEE T BIO-MED ENG, V32, P485, DOI 10.1109/TBME.1985.325564 Suppes P, 1997, P NATL ACAD SCI USA, V94, P14965, DOI 10.1073/pnas.94.26.14965 TARDELLI JD, 2003, ESCTR2004084 MIT LIN TATHAM, 1971, BEHAV TECHNOL, V6 TERC, 2009, TERC Titze IR, 2000, J ACOUST SOC AM, V107, P581, DOI 10.1121/1.428324 TRAN VA, 2008, P SPEECH PROS CAMP B Tran VA, 2010, SPEECH COMMUN, V52, P314, DOI 10.1016/j.specom.2009.11.005 Tran VA, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P1465 Truccolo W, 2008, J NEUROSCI, V28, P1163, DOI 10.1523/JNEUROSCI.4415-07.2008 WALLICZEK M, 2006, P INT PITTSB US, P1487 WAND M, 2009, COMMUNICATI IN PRESS WESTER M, 2006, THESIS U KARLSRUHE Wolpaw JR, 2002, CLIN NEUROPHYSIOL, V113, P767, DOI 10.1016/S1388-2457(02)00057-3 WRENCH A, 2007, ULTRAFEST, V4 WRENCH AA, 2003, 6 INT SEM SPEECH PRO, P314 WRIGHT EJ, 2007, NEUR M PLANN 2007 SA 2008, CARSTENS MEDIZINELEK NR 120 TC 39 Z9 42 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD APR PY 2010 VL 52 IS 4 SI SI BP 270 EP 287 DI 10.1016/j.specom.2009.08.002 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 574XB UT WOS:000276026400002 ER PT J AU Hueber, T Benaroya, EL Chollet, G Denby, B Dreyfus, G Stone, M AF Hueber, Thomas Benaroya, Elie-Laurent Chollet, Gerard Denby, Bruce Dreyfus, Gerard Stone, Maureen TI Development of a silent speech interface driven by ultrasound and optical images of the tongue and lips SO SPEECH COMMUNICATION LA English DT Article DE Silent speech; Ultrasound; Corpus-based speech synthesis; Visual phone recognition ID SYSTEM AB This article presents a segmental vocoder driven by ultrasound and optical images (standard CCD camera) of the tongue and lips for a "silent speech interface" application, usable either by a laryngectomized patient or for silent communication. The system is built around an audio visual dictionary which associates visual to acoustic observations for each phonetic class. Visual features are extracted from ultrasound images of the tongue and from video images of the lips using a PCA-based image coding technique. Visual observations of each phonetic class are modeled by continuous HMMs. The system then combines a phone recognition stage with corpus-based synthesis. In the recognition stage, the visual HMMs are used to identify phonetic targets in a sequence of visual features. In the synthesis stage, these phonetic targets constrain the dictionary search for the sequence of diphones that maximizes similarity to the input test data in the visual space, subject to a concatenation cost in the acoustic domain. A prosody-template is extracted from the training corpus, and the final speech waveform is generated using "Harmonic plus Noise Model" concatenative synthesis techniques. Experimental results are based on an audiovisual database containing I h of continuous speech from each of two speakers. (C) 2009 Elsevier B.V. All rights reserved. C1 [Hueber, Thomas; Benaroya, Elie-Laurent; Denby, Bruce; Dreyfus, Gerard] Ecole Super Phys & Chim Ind Ville Paris ESPCI Par, Elect Lab, F-75231 Paris 05, France. [Hueber, Thomas; Chollet, Gerard] Telecom ParisTech, CNRS, LTCI, F-75634 Paris, France. [Denby, Bruce] Univ Paris 06, F-75252 Paris, France. [Stone, Maureen] Univ Maryland, Sch Dent, Vocal Tract Visualizat Lab, Baltimore, MD 21201 USA. RP Hueber, T (reprint author), ESPCI ParisTech, Elect Lab, 10 Rue Vauquelin, F-75005 Paris, France. EM hueber@ieee.org FU French Department of Defense (DGA); Centre de Microelectronique de Paris Ile-de-France (CEMIP); French National Research Agency (ANR) [ANR-06-BLAN-0166] FX This work was supported by the French Department of Defense (DGA), the "Centre de Microelectronique de Paris Ile-de-France" (CEMIP) and the French National Research Agency (ANR), under the contract number ANR-06-BLAN-0166. The authors would like to thank the anonymous reviewers for numerous valuable suggestions and corrections. They also acknowledge the seven synthesis transcribers for their excellent work, as well as the contributions of the collaboration members and numerous visitors who have attended Ouisper Brainstormings over the past 3 years. 
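As a rough picture of the PCA-based image coding step mentioned in the Hueber et al. abstract above, the sketch below flattens ultrasound (or lip) frames and projects them onto leading principal components to obtain a compact visual feature vector per frame. The frame size, component count and stand-in data are illustrative assumptions, not the authors' settings.

```python
# Illustrative sketch (assumptions, not the authors' pipeline): PCA coding of
# ultrasound tongue frames into low-dimensional, "eigentongue"-style visual features.
import numpy as np
from sklearn.decomposition import PCA

def pca_visual_features(frames, n_components=30):
    """frames: (n_frames, height, width) grey-level images -> (n_frames, n_components)."""
    X = frames.reshape(len(frames), -1).astype(np.float64)  # flatten each image
    pca = PCA(n_components=n_components)                    # eigen-images from training frames
    feats = pca.fit_transform(X)                            # per-frame projection coefficients
    return feats, pca

# Example with random stand-in data (a real system would use recorded ultrasound frames).
frames = np.random.rand(200, 64, 64)
feats, model = pca_visual_features(frames)
print(feats.shape)  # (200, 30)
```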
CR Akgul YS, 2000, IEEE WORKSHOP ON MATHEMATICAL METHODS IN BIOMEDICAL IMAGE ANALYSIS, PROCEEDINGS, P135 BIRKHOLZ P, 2003, P 15 INT C PHON SCI, P2597 EFRON B, 1981, BIOMETRIKA, V68, P589, DOI 10.1093/biomet/68.3.589 EPSTEIN M, 2001, J ACOUST SOC AM, V115, P2631 Fagan MJ, 2008, MED ENG PHYS, V30, P419, DOI 10.1016/j.medengphy.2007.05.003 FORNEY GD, 1973, P IEEE, V61, P268, DOI 10.1109/PROC.1973.9030 GRAVIER G, 2002, P 2 INT C HUM LANG T HERACLEOUS P, 2005, TISSUE CONDUCTIVE AC, P93 Hogg R., 1996, PROBABILITY STAT INF Hueber T., 2008, INT SEM SPEECH PROD, P365 HUEBER T, 2007, EIGENTONGUE FEATURE, V1, P1245 HUEBER T, 2008, PHONE RECOGNITION UL, P2032 HUEBER T, 2009, VISUOPHONETIC DECODI, P640 HUEBER T, 2007, CONTINUOUS SPEECH PH, P658 HUNT A, 1996, UNIT SELECTION CONCA, P373 JORGENSEN C, 2003, P INT JOINT C NEUR N, V4, P3128, DOI 10.1109/IJCNN.2003.1224072 Kominek J., 2004, P 5 ISCA SPEECH SYNT, P223 Li M, 2005, CLIN LINGUIST PHONET, V19, P545, DOI 10.1080/02699200500113616 Lucey P., 2006, P 8 IEEE WORKSH MULT, P24 MAEDA S, 1990, NATO ADV SCI I D-BEH, V55, P131 Maier-Hein L., 2005, IEEE WORKSH AUT SPEE, P331 PERONA P, 1990, IEEE T PATTERN ANAL, V12, P629, DOI 10.1109/34.56205 SINDER D, 1997, P ASVA97 TOK, P439 Stone M, 1995, J ACOUST SOC AM, V98, P3107, DOI 10.1121/1.413799 STYLIANOU Y, 1997, DIPHONE CONCATENATIO, P613 TOKUDA K, 2000, SPEECH PARAMETER GEN, P1315 Tran VA, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P1465 TURK M, 1991, FACE RECOGNITION USI, P586 Young S., 2005, HTK BOOK Young S., 1989, FINFENGTR38 CUED CAM Yu YJ, 2002, IEEE T IMAGE PROCESS, V11, P1260, DOI [10.1109/TIP.2002.804276, 10.1109/TIP.2002.804279] NR 31 TC 11 Z9 12 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2010 VL 52 IS 4 SI SI BP 288 EP 300 DI 10.1016/j.specom.2009.11.004 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 574XB UT WOS:000276026400003 ER PT J AU Hirahara, T Otani, M Shimizu, S Toda, T Nakamura, K Nakajima, Y Shikano, K AF Hirahara, Tatsuya Otani, Makoto Shimizu, Shota Toda, Tomoki Nakamura, Keigo Nakajima, Yoshitaka Shikano, Kiyohiro TI Silent-speech enhancement using body-conducted vocal-tract resonance signals SO SPEECH COMMUNICATION LA English DT Article DE Non-audible murmur; Body-conducted sound; Voice conversion; Talking aids ID VOICE CONVERSION AB The physical characteristics of weak body-conducted vocal-tract resonance signals called non-audible murmur (NAM) and the acoustic characteristics of three sensors developed for detecting these signals have been investigated. NAM signals attenuate 50 dB at 1 kHz; this attenuation consists of 30-dB full-range attenuation due to air-to-body transmission loss and 10 dB/octave spectral decay due to a sound propagation loss within the body. These characteristics agree with the spectral characteristics of measured NAM signals. The sensors have a sensitivity of between 41 and 58 dB [V/Pa] at I kHz, and the mean signal-to-noise ratio of the detected signals was 15 dB. 
On the basis of these investigations, three types of silent-speech enhancement systems were developed: (1) simple, direct amplification of weak vocal-tract resonance signals using a wired urethane-elastomer NAM microphone, (2) simple, direct amplification using a wireless urethane-elastomer-duplex NAM microphone, and (3) transformation of the weak vocal-tract resonance signals sensed by a soft-silicone NAM microphone into whispered speech using statistical conversion. Field testing of the systems showed that they enable voice impaired people to communicate verbally using body-conducted vocal-tract resonance signals. Listening tests demonstrated that weak body-conducted vocal-tract resonance sounds can be transformed into intelligible whispered speech sounds. Using these systems, people with voice impairments can re-acquire speech communication with less effort. (C) 2009 Elsevier B.V. All rights reserved. C1 [Hirahara, Tatsuya; Otani, Makoto; Shimizu, Shota] Toyama Prefectural Univ, Dept Intelligent Syst Design Engn, Toyama 9390398, Japan. [Toda, Tomoki; Nakamura, Keigo; Nakajima, Yoshitaka; Shikano, Kiyohiro] Nara Inst Sci & Technol, Grad Sch Informat Sci, Nara 6300192, Japan. RP Hirahara, T (reprint author), Toyama Prefectural Univ, Dept Intelligent Syst Design Engn, 5180 Kurokawa, Toyama 9390398, Japan. EM hirahara@pu-toyama.ac.jp FU SCOPE of the Ministry of Internal Affairs and Communications of Japan FX This work was supported by SCOPE of the Ministry of Internal Affairs and Communications of Japan. CR Abe M., 1990, TRI0166 ATR INT TEL Espy-Wilson CY, 1998, J SPEECH LANG HEAR R, V41, P1253 Fant G., 1970, ACOUSTIC THEORY SPEE Fitch WT, 1999, J ACOUST SOC AM, V106, P1511, DOI 10.1121/1.427148 FUJISAKA Y, 2004, TECHNICAL REPORT I E, V103, P13 Heracleous P., 2003, P ASRU, P73 Higashikawa M, 1999, J SPEECH LANG HEAR R, V42, P583 KIKUCHI Y, 2004, P INT C SPEECH PROS, P761 NAKAGIRI M, 2006, P INTERSPEECH PITTSB, P2270 Nakajima Y., 2005, P INTERSPEECH LISB P, P389 Nakajima Y., 2003, P EUROSPEECH, P2601 NAKAJIMA Y, 2005, TECHNICAL REPORT I E, V105, P7 Nakajima Y., 2003, P ICASSP, P708 Nakajima Y, 2006, IEICE T INF SYST, VE89D, P1, DOI 10.1093/ietisy/e89-d.1.1 NAKAMURA K, 2007, P INTERSPEECH ANTW B, P2517 Nakamura K., 2006, P INTERSPEECH PITTSB, P1395 Nota Y., 2007, Acoustical Science and Technology, V28, DOI 10.1250/ast.28.33 OESTREICHER HL, 1951, J ACOUST SOC AM, V23, P707, DOI 10.1121/1.1906828 Otani M, 2009, APPL ACOUST, V70, P469, DOI 10.1016/j.apacoust.2008.05.003 SAGISAKA Y, 1992, J ACOUST SOC JPN, V48, P878 Stylianou Y, 1998, IEEE T SPEECH AUDI P, V6, P131, DOI 10.1109/89.661472 TITZE IR, 1989, J ACOUST SOC AM, V85, P1699, DOI 10.1121/1.397959 TODA T, 2005, P ICASSP PHIL US MAR, V1, P9, DOI 10.1109/ICASSP.2005.1415037 Toda T., 2005, P INTERSPEECH LISB P, P1957 Toda T, 2007, IEEE T AUDIO SPEECH, V15, P2222, DOI 10.1109/TASL.2007.907344 Uemi N., 1994, Proceedings. 3rd IEEE International Workshop on Robot and Human Communication. RO-MAN '94 Nagoya (Cat. No.94TH0679-1), DOI 10.1109/ROMAN.1994.365931 NR 26 TC 8 Z9 9 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
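The attenuation model quoted in the Hirahara et al. abstract above (a 30-dB broadband air-to-body loss plus a 10 dB/octave decay, totalling about 50 dB at 1 kHz) can be checked with the short calculation below. The 250-Hz reference frequency for the decay term is inferred from those figures and is not stated in the record.

```python
# Back-of-the-envelope check of the attenuation figures given in the abstract.
# Assumption: the 10 dB/octave decay is zero at f_ref and the two losses add in dB.
import math

def nam_attenuation_db(f_hz, flat_db=30.0, decay_db_per_oct=10.0, f_ref=250.0):
    octaves = math.log2(f_hz / f_ref)
    return flat_db + max(0.0, decay_db_per_oct * octaves)

print(round(nam_attenuation_db(1000.0), 1))  # ~50 dB, matching the value quoted at 1 kHz
```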
PD APR PY 2010 VL 52 IS 4 SI SI BP 301 EP 313 DI 10.1016/j.specom.2009.12.001 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 574XB UT WOS:000276026400004 ER PT J AU Tran, VA Bailly, G Loevenbruck, H Toda, T AF Tran, Viet-Anh Bailly, Gerard Loevenbruck, Helene Toda, Tomoki TI Improvement to a NAM-captured whisper-to-speech system SO SPEECH COMMUNICATION LA English DT Article DE Non-audible murmur; Whispered speech; Audiovisual voice conversion; Silent speech interface ID EXTRACTION; PITCH AB Exploiting a tissue-conductive sensor (a stethoscopic microphone), the system developed at NAIST, which converts non-audible murmur (NAM) to audible speech by GMM-based statistical mapping, is a very promising technique. The quality of the converted speech is, however, still insufficient for computer-mediated communication, notably because of the poor estimation of F-0 from unvoiced speech and because of impoverished phonetic contrasts. This paper presents our investigations to improve the intelligibility and naturalness of the synthesized speech and first objective and subjective evaluations of the resulting system. The first improvement concerns voicing and F-0 estimation. Instead of using a single GMM for both, we estimate a continuous F-0 using a GMM trained on target voiced segments only. The continuous F-0 estimation is filtered by a voicing decision computed by a neural network. The objective and subjective improvement is significant. The second improvement concerns the input time window and its dimensionality reduction: we show that the precision of F-0 estimation is also significantly improved by extending the input time window from 90 to 450 ms and by using a Linear Discriminant Analysis (LDA) instead of the original Principal Component Analysis (PCA). Estimation of the spectral envelope is also slightly improved with LDA but is degraded with larger time windows. A third improvement consists in adding visual parameters both as input and output parameters. The positive contribution of this information is confirmed by a subjective test. Finally, HMM-based conversion is compared with GMM-based conversion. (C) 2009 Elsevier B.V. All rights reserved. C1 [Tran, Viet-Anh; Bailly, Gerard; Loevenbruck, Helene] Grenoble Univ, CNRS, UMR 5216, GIPSA Lab, Grenoble, France. [Toda, Tomoki] Nara Inst Sci & Technol, Grad Sch Informat Sci, Nara, Japan. RP Tran, VA (reprint author), Grenoble Univ, CNRS, UMR 5216, GIPSA Lab, Grenoble, France.
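A hedged sketch of the GMM-based statistical mapping that the Tran et al. record above builds on: a joint GMM is trained on paired source/target feature vectors, and the target (for example log-F0) is predicted as its conditional mean given the source part. Dimensions, mixture count and feature choices are assumptions for illustration only.

```python
# Illustrative sketch (not the NAIST implementation): GMM-based regression mapping.
# Fit a GMM on joint [source, target] vectors, then predict E[y|x] for new source frames.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_joint_gmm(X, Y, n_components=8):
    """X: (n, dx) source features, Y: (n, dy) target features from parallel training data."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="full", random_state=0)
    gmm.fit(np.hstack([X, Y]))
    return gmm

def gmm_map(gmm, X, dx):
    """Minimum mean-square-error mapping E[y|x] under the joint GMM."""
    n, K = len(X), gmm.n_components
    dy = gmm.means_.shape[1] - dx
    log_resp = np.zeros((n, K))
    cond_means = np.zeros((K, n, dy))
    for k in range(K):
        mu_x, mu_y = gmm.means_[k, :dx], gmm.means_[k, dx:]
        S_xx = gmm.covariances_[k, :dx, :dx]
        S_yx = gmm.covariances_[k, dx:, :dx]
        inv_xx = np.linalg.inv(S_xx)
        diff = X - mu_x
        quad = np.einsum("ni,ij,nj->n", diff, inv_xx, diff)
        log_resp[:, k] = (np.log(gmm.weights_[k])
                          - 0.5 * (quad + np.linalg.slogdet(S_xx)[1] + dx * np.log(2 * np.pi)))
        cond_means[k] = mu_y + diff @ inv_xx @ S_yx.T  # component-wise conditional mean of y
    resp = np.exp(log_resp - log_resp.max(axis=1, keepdims=True))
    resp /= resp.sum(axis=1, keepdims=True)            # responsibilities given the source part
    return np.einsum("nk,knd->nd", resp, cond_means)
```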
EM viet-anh.tran@gipsa-lab.inpg.fr; gerard.bailly@gipsa-lab.inpg.fr; helene.loevenbruck@gipsa-lab.inpg.fr; tomoki@is.naist.jp CR Badin P, 2002, J PHONETICS, V30, P533, DOI 10.1006/jpho.2002.0166 BAILLY G, 2008, SPEAKING SMILE DISGU, P111 Bailly G., 2003, International Journal of Speech Technology, V6, DOI 10.1023/A:1025700715107 BETT BJ, 2005, SMALL VOCABULARY REC, P16 COLEMAN J, 2002, LARYNX MOVEMENTS INT Cootes TF, 2001, IEEE T PATTERN ANAL, V23, P681, DOI 10.1109/34.927467 CREVIERBUCHMAN L, 2008, ACT 64 C SOC FRAC PH HERACLEOUS P, 2005, INT C SMART OBJ AMB, P93 Higashikawa M, 1996, J VOICE, V10, P155, DOI 10.1016/S0892-1997(96)80042-7 Hueber T, 2007, INT CONF ACOUST SPEE, P1245 HUEBER T, 2008, PHONE RECOGNITION UL, P2032 HUEBER T, 2008, SEGMENTAL VOCODER DR, P2028 HUEBER T, 2007, INT C PHON SCI SAARB, P2193 HUEBER T, 2007, CONTINUOUS SPEECH PH, P658 INOUYE T, 1970, J NERV MENT DIS, V151, P415, DOI 10.1097/00005053-197012000-00007 Jorgensen C., 2005, P 38 ANN HAW INT C S, p294c, DOI 10.1109/HICSS.2005.683 JOU SC, 2006, CONTINUOUS SPEECH RE, P573 Kawahara H, 1999, SPEECH COMMUN, V27, P187, DOI 10.1016/S0167-6393(98)00085-5 Nakagiri M, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P2270 Nakajima Y., 2003, INT C AC SPEECH SIGN, P708 POTAMIANOS G, 2003, JOINT AUDIOVISUAL SP, P95 Reveret L., 2000, INT C SPEECH LANG PR, P755 Sodoyer D, 2004, SPEECH COMMUN, V44, P113, DOI 10.1016/j.specom.2004.10.002 SUMMERFIELD AQ, 1989, HDB RES FACE PROCESS, P223 SUMMERFIELD Q, 1979, PHONETICA, V36, P314 TODA T, 2005, NAM TO SPEECH CONVER, P1957 Toda T., 2009, P ICASSP TAIP TAIW, P3601 TODA T, 2005, SPEECH PARAMETER GEN, P2801 Tokuda K, 2000, INT CONF ACOUST SPEE, P1315, DOI 10.1109/ICASSP.2000.861820 TRAN VA, 2008, P SPEECH PROS CAMP B WALLICZEK M, 2006, SUB WORD UNIT BASED, P1487 Young S., 1999, HTK BOOK Zen H., 2007, SPEECH SYNTH WORKSH, P294 Zeroual C., 2005, P INTERSPEECH LISB, P1069 NR 34 TC 7 Z9 9 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2010 VL 52 IS 4 SI SI BP 314 EP 326 DI 10.1016/j.specom.2009.11.005 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 574XB UT WOS:000276026400005 ER PT J AU Patil, SA Hansen, JHL AF Patil, Sanjay A. Hansen, John H. L. TI The physiological microphone (PMIC): A competitive alternative for speaker assessment in stress detection and speaker verification SO SPEECH COMMUNICATION LA English DT Article DE Physiological sensor; Stress detection; Speaker verification; Non-acoustic sensor; PMIC ID AUTOMATIC SPEECH RECOGNITION; HEART-RATE; CLASSIFICATION; NOISE; COMPENSATION; FEATURES; SENSOR AB Interactive speech system scenarios exist which require the user to perform tasks which exert limitations on speech production, thereby causing speaker variability and reduced speech performance. In noisy stressful scenarios, even if noise could be completely eliminated, the production variability brought on by stress, including Lombard effect, has a more pronounced impact on speech system performance. Thus, in this study we focus on the use of a silent speech interface (PMIC), with a corresponding experimental assessment to illustrate its utility in the tasks of stress detection and speaker verification. 
This study focuses on the suitability of PMIC versus close-talk microphone (CTM), and reports that the PMIC achieves as good performance as CTM or better for a number of test conditions. PMIC reflects both stress-related information and speaker-dependent information to a far greater extent than the CTM. For stress detection performance (which is reported in % accuracy), PMIC performs at least on par or about 2% better than the CTM-based system. For a speaker verification application, the PMIC outperforms CTM for all matched stress conditions. The performance reported in terms of %EER is 0.91% (as compared to 1.69%), 0.45% (as compared to 1.49%), and 1.42% (as compared to 1.80%) for PMIC. This indicates that PMIC reflects speaker-dependent information. Also, another advantage of the PMIC is its ability to record the user physiology traits/state. Our experiments illustrate that PMIC can be an attractive alternative for stress detection as well as speaker verification tasks along with an advantage of its ability to record physiological information, in situations where the use of CTM may hinder operations (deep sea divers, fire-fighters in rescue operations, etc.). (C) 2009 Elsevier B.V. All rights reserved. C1 [Patil, Sanjay A.; Hansen, John H. L.] Univ Texas Dallas, Dept Elect Engn, CRSS, Erik Jonsson Sch Engn & Comp Sci, Richardson, TX 75080 USA. RP Hansen, JHL (reprint author), Univ Texas Dallas, Dept Elect Engn, CRSS, Erik Jonsson Sch Engn & Comp Sci, 2601 N Floyd Rd,EC33, Richardson, TX 75080 USA. EM john.hansen@utdallas.edu CR AKARGUN UC, 2007, IEEE SIGNAL PROCESS, P1 Baber C, 1996, SPEECH COMMUN, V20, P37, DOI 10.1016/S0167-6393(96)00043-X Benzeghiba M, 2007, SPEECH COMMUN, V49, P763, DOI 10.1016/j.specom.2007.02.006 Bimbot F, 2004, EURASIP J APPL SIG P, V2004, P430, DOI 10.1155/S1110865704310024 BOUGHAZALE S, 1996, THESIS DUKE U BOUGHAZALE SE, 1995, INT CONF ACOUST SPEE, P664, DOI 10.1109/ICASSP.1995.479685 Bou-Ghazale SE, 2000, IEEE T SPEECH AUDI P, V8, P429, DOI 10.1109/89.848224 BRADY K, 2004, IEEE INT C AC SPEECH BROUHA L, 1961, J APPL PHYSIOL, V16, P133 BROUHA L, 1963, J APPL PHYSIOL, V18, P1095 Brown DR, 2005, MEAS SCI TECHNOL, V16, P2381, DOI 10.1088/0957-0233/16/11/033 BURNETT G, 1999, THESIS U CALIFORNIA CHAN C, 2003, THESIS U NEW BRUNSWI CORRIGAN G, 1996, THESIS NW U WA COSMIDES L, 1983, J EXP PSYCHOL HUMAN, V9, P864, DOI 10.1037/0096-1523.9.6.864 Courteville A, 1998, IEEE T BIO-MED ENG, V45, P145, DOI 10.1109/10.661262 Denes PB, 1993, SPEECH CHAIN PHYS BI DEPAULA MH, 1992, REV SCI INSTRUM, V63, P3487, DOI 10.1063/1.1143753 Freund Y., 1999, Journal of Japanese Society for Artificial Intelligence, V14 GABLE TJ, 2000, THESIS U CALIFORNIA Goodie JL, 2000, J PSYCHOPHYSIOL, V14, P159, DOI 10.1027//0269-8803.14.3.159 Graciarena M, 2003, IEEE SIGNAL PROC LET, V10, P72, DOI [10.1109/LSP.2003.808549, 10.1109/LSP.2002.808549] Hansen JHL, 1994, IEEE T SPEECH AUDI P, V2, P598, DOI 10.1109/89.326618 Hansen JHL, 1996, SPEECH COMMUN, V20, P151, DOI 10.1016/S0167-6393(96)00050-7 Hansen J. H. L., 1993, ICASSP-93. 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing (Cat. 
No.92CH3252-4), DOI 10.1109/ICASSP.1993.319239 HEUBER T, 2007, CONTINUOUS SPEECH PH, P658 HOLZRICHTER JF, 2002, IEEE 10 DIG SIGN PRO, P35 Huang RQ, 2007, IEEE T AUDIO SPEECH, V15, P453, DOI 10.1109/TASL.2006.881695 Ikeno A., 2007, IEEE AER C 2007 BIG, P1 INGALLS R, 1987, J ACOUST SOC AM, V81, P809, DOI 10.1121/1.394659 JOU S, 2007, IEEE INT C AC SPEECH Junqua JC, 1996, SPEECH COMMUN, V20, P13, DOI 10.1016/S0167-6393(96)00041-6 Klabunde R. E., 2005, CARDIOVASCULAR PHYSL KNUDSEN S, 1994, 10 SPIE INT C OPT FI, V2360, P396 MAINARDI E, 2007, ANN INT C IEEE ENG M, P3035 MAXFIELD ME, 1963, J APPL PHYSIOL, V18, P1099 MOHAMAD N, 2007, P SOC PHOTO-OPT INS, V6800, P40 MULDER G, 1981, PSYCHOPHYSIOLOGY, V18, P392, DOI 10.1111/j.1469-8986.1981.tb02470.x Murray IR, 1996, SPEECH COMMUN, V20, P3, DOI 10.1016/S0167-6393(96)00040-4 Noma H, 2005, Ninth IEEE International Symposium on Wearable Computers, Proceedings, P210, DOI 10.1109/ISWC.2005.56 OTANI K, 1995, IEEE J SEL AREA COMM, V13, P42, DOI 10.1109/49.363147 PETERS RD, 1995, COMP MED SY, P204 Quatieri TF, 2006, IEEE T AUDIO SPEECH, V14, P533, DOI 10.1109/TSA.2005.855838 REYNOLDS DA, 1995, SPEECH COMMUN, V17, P91, DOI 10.1016/0167-6393(95)00009-D Roucos S., 1986, ICASSP 86 Proceedings. IEEE-IECEJ-ASJ International Conference on Acoustics, Speech and Signal Processing (Cat. No.86CH2243-4) SCANLON M, 2002, MULT SPEECH REC WORK, P1 SHAHINA A, 2005, INT C INT SENS INF P, P400 Ten Bosch Louis, 2003, SPEECH COMMUN, V40.1, P213 Titze IR, 2000, J ACOUST SOC AM, V107, P581, DOI 10.1121/1.428324 Tran VA, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P1465 VISWANATHAN V, 1984, IEEE INT C AC SPEECH, P57 WAND M, 2007, INTERSPEECH 2007 WANG H, 2003, P IEEE SENSORS, V2, P1096 Womack BD, 1996, SPEECH COMMUN, V20, P131, DOI 10.1016/S0167-6393(96)00049-0 WOMACK BD, 1966, IEEE INT C AC SPEECH, V1, P53 Zhou GJ, 2001, IEEE T SPEECH AUDI P, V9, P201, DOI 10.1109/89.905995 NR 56 TC 9 Z9 9 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2010 VL 52 IS 4 SI SI BP 327 EP 340 DI 10.1016/j.specom.2009.11.006 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 574XB UT WOS:000276026400006 ER PT J AU Schultz, T Wand, M AF Schultz, Tanja Wand, Michael TI Modeling coarticulation in EMG-based continuous speech recognition SO SPEECH COMMUNICATION LA English DT Article DE EMG-based speech recognition; Silent Speech Interfaces; Phonetic features ID SIGNALS AB This paper discusses the use of surface electromyography for automatic speech recognition. Electromyographic signals captured at the facial muscles record the activity of the human articulatory apparatus and thus allow to trace back a speech signal even if it is spoken silently. Since speech is captured before it gets airborne, the resulting signal is not masked by ambient noise. The resulting Silent Speech Interface has the potential to overcome major limitations of conventional speech-driven interfaces: it is not prone to any environmental noise, allows to silently transmit confidential information, and does not disturb bystanders. We describe our new approach of phonetic feature bundling for modeling coarticulation in EMG-based speech recognition and report results on the EMG-PIT corpus, a multiple speaker large vocabulary database of silent and audible EMG speech recordings, which we recently collected. 
Our results on speaker-dependent and speaker-independent setups show that modeling the interdependence of phonetic features reduces the word error rate of the baseline system by over 33% relative. Our final system achieves 10% word error rate for the best-recognized speaker on a 101-word vocabulary task, bringing EMG-based speech recognition within a useful range for the application of Silent Speech Interfaces. (C) 2009 Elsevier B.V. All rights reserved. C1 [Schultz, Tanja; Wand, Michael] Karlsruhe Inst Technol, Cognit Syst Lab, D-76131 Karlsruhe, Germany. RP Schultz, T (reprint author), Karlsruhe Inst Technol, Cognit Syst Lab, Adenauerring 4, D-76131 Karlsruhe, Germany. EM tanja.schultz@kit.edu FU School of Health and Rehabilitation Sciences, University of Pittsburgh. FX The authors would like to thank Szu-Chen (Stan) Jou for his in-depth support with the initial recognition system, his help with the EMG-PIT collection and the data scripts. We also thank Maria Dietrich for recruiting all subjects and carrying out major parts of the database collection. Her study was supported in part through funding received from the SHRS Research Development Fund, School of Health and Rehabilitation Sciences, University of Pittsburgh. CR Bahl L, 1991, P INT C AC SPEECH SI, P185, DOI 10.1109/ICASSP.1991.150308 BEYERLEIN P, 2000, THESIS RWTH AACHEN Chan ADC, 2001, MED BIOL ENG COMPUT, V39, P500, DOI 10.1007/BF02345373 Denby B, 2010, SPEECH COMMUN, V52, P270, DOI 10.1016/j.specom.2009.08.002 Dietrich M., 2008, THESIS U PITTSBURGH FRANKEL J, 2004, P INT C SPOK LANG PR, P1202, DOI DOI 10.1016/J.SPECOM.2008.05.004 International Phonetic Association, 1999, HDB INT PHON ASS JORGENSEN C, 2003, P INT JOINT C NEUR N, V4, P3128, DOI 10.1109/IJCNN.2003.1224072 JOU SC, 2006, P INT PITTSB PA, P573 JOU SCS, 2007, P IEEE INT C AC SPEE, P401 Kirchhoff K., 1999, THESIS U BIELEFELD Leveau B, 1992, SELECTED TOPICS SURF Maier-Hein L., 2005, IEEE WORKSH AUT SPEE, P331 METZE F, 2005, THESIS U KARLSRUHE Metze F., 2002, P INT C SPOK LANG PR, P2133 MORSE MS, 1989, P ANN INT C IEEE ENG, P1793 MORSE MS, 1991, PROCEEDINGS OF THE ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY, VOL 13, PTS 1-5, P1877, DOI 10.1109/IEMBS.1991.684800 MORSE MS, 1986, COMPUT BIOL MED, V16, P399, DOI 10.1016/0010-4825(86)90064-8 SCHUNKE M, 2006, KOPF NEUROANATOMIE, V3 SUGIE N, 1985, IEEE T BIO-MED ENG, V32, P485, DOI 10.1109/TBME.1985.325564 Ueda N, 2000, J VLSI SIG PROC SYST, V26, P133, DOI 10.1023/A:1008155703044 WALLICZEK M, 2006, P INT PITTSB US, P1487 WAND M, 2009, P BIOS PROT PROT, P155 WAND M, COMMUNICATI IN PRESS WAND M, 2009, P INT BRIGHT UK YU H, 2000, P INT C SPOK LANG PR, P353 YU H, 2003, P EUR GEN SWITZ, P1869 NR 27 TC 30 Z9 30 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2010 VL 52 IS 4 SI SI BP 341 EP 353 DI 10.1016/j.specom.2009.12.002 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 574XB UT WOS:000276026400007 ER PT J AU Jorgensen, C Dusan, S AF Jorgensen, Charles Dusan, Sorin TI Speech interfaces based upon surface electromyography SO SPEECH COMMUNICATION LA English DT Article DE Electromyography; Speech recognition; Speech synthesis; Bioelectric control; Articulatory synthesis; EMG ID TRACT AREA FUNCTION; ARTICULATORY MODEL; RECOGNITION AB This paper discusses the use of surface electromyography (EMG) to recognize and synthesize speech. 
The acoustic speech signal can be significantly corrupted by high noise in the environment or impeded by garments or masks. Such situations occur, for example, when firefighters wear pressurized suits with self-contained breathing apparatus (SCBA) or when astronauts perform operations in pressurized gear. In these conditions it is important to capture and transmit clear speech commands in spite of a corrupted or distorted acoustic speech signal. One way to mitigate this problem is to use surface electromyography to capture activity of speech articulators and then, either recognize spoken commands from EMG signals or use these signals to synthesize acoustic speech commands. We describe a set of experiments for both speech recognition and speech synthesis based on surface electromyography and discuss the lessons learned about the characteristics of the EMG signal for these domains. The experiments include speech recognition in high noise based on 15 commands for firefighters wearing self-contained breathing apparatus, a sub-vocal speech robotic platform control experiment based on five words, a speech recognition experiment testing recognition of vowels and consonants, and a speech synthesis experiment based on an articulatory speech synthesizer. Published by Elsevier B.V. C1 [Jorgensen, Charles] NASA, Ames Res Ctr, Computat Sci Div, Moffett Field, CA 94035 USA. [Dusan, Sorin] NASA, Ames Res Ctr, MCT Inc, Moffett Field, CA 94035 USA. RP Jorgensen, C (reprint author), NASA, Ames Res Ctr, Computat Sci Div, M-S 269-1, Moffett Field, CA 94035 USA. EM Charles.Jorgensen@nasa.gov FU NASA FX The authors would like to particularly thank the NASA Aeronautics Basic Research program's Extension of the Human Senses Initiative for support over the years during the development of these technologies and the collegial contributions of Drs. Bradley Betts, Kevin Wheeler, Kim Binstead, Shinji Maeda, Arturo Galvan, Jianwu Dang, and Mrs. Rebekah Kochavi and Mrs. Diana Lee. 
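For the surface-EMG command recognition experiments summarised in the Jorgensen and Dusan abstract above, a minimal feature-plus-classifier pipeline might look like the sketch below. The window lengths, the per-channel RMS and zero-crossing-rate features, and the SVM classifier are assumptions for illustration, not the recognizers actually used in the study.

```python
# Illustrative sketch (assumptions throughout, not the NASA system): window a multi-channel
# surface-EMG recording and derive simple per-channel features for small-vocabulary
# classification of silently mouthed commands.
import numpy as np
from sklearn.svm import SVC

def emg_features(emg, fs, win_s=0.2, hop_s=0.1):
    """emg: (n_samples, n_channels); returns mean RMS and zero-crossing rate per channel."""
    win, hop = int(win_s * fs), int(hop_s * fs)
    rms, zcr = [], []
    for start in range(0, emg.shape[0] - win + 1, hop):
        seg = emg[start:start + win]
        rms.append(np.sqrt(np.mean(seg ** 2, axis=0)))
        zcr.append(np.mean(np.abs(np.diff(np.sign(seg), axis=0)) > 0, axis=0))
    return np.concatenate([np.mean(rms, axis=0), np.mean(zcr, axis=0)])

def train_command_classifier(utterances, labels, fs=2000):
    """utterances: list of (n_samples, n_channels) arrays, one per recorded command."""
    X = np.vstack([emg_features(u, fs) for u in utterances])
    return SVC(kernel="rbf").fit(X, labels)
```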
CR Betts BJ, 2006, INTERACT COMPUT, V18, P1242, DOI 10.1016/j.intcom.2006.08.012 BIRKHOLZ P, 2007, 6 ISCA WORKSH SPEECH BRADY K, 2004, P 2004 IEEE INT C AC, V1, P477 *CARN MELL U ROB, PERS EXPL ROV Chan ADC, 2005, IEEE T BIO-MED ENG, V52, P121, DOI 10.1109/TBME.2004.836492 COKER CH, 1966, J ACOUST SOC AM, V40, P1271, DOI 10.1121/1.2143456 DELUCA CJ, 2005, IMAGING BEHAV MOTOR DUSAN S, 2000, P 5 SEM SPECH PROD M Faaborg-Anderson K., 1957, ACTA PHYSL SCAN S140, V41, P1 Gerdle B., 1999, MODERN TECHNIQUES NE, P705 Graciarena M, 2003, IEEE SIGNAL PROC LET, V10, P72, DOI [10.1109/LSP.2003.808549, 10.1109/LSP.2002.808549] JORGENSEN C, 2003, P INT JOINT C NEUR N, V4, P3128, DOI 10.1109/IJCNN.2003.1224072 JORGENSEN C, 2005, P 38 ANN HAW INT C S JOU SC, 2005, P IEEE INT C AC SPEE, P1009 JUNQUA JC, 1993, J AM SOC AM, V93, P512 Junqua J-C, 1999, P INT C AC SPEECH SI, P2083 Kingsbury N, 2001, APPL COMPUT HARMON A, V10, P234, DOI 10.1006/acha.2000.0343 LABOISSIERE R, 1995, P 13 INT C PHON SCI, V1, P358 MAEDA S, 1979, J ACOUST SOC AM, V65, pS22, DOI 10.1121/1.2017158 MAEDA S, 1982, SPEECH COMM MAEDA S, 1988, J ACOUST SOC AM, V84, pS146, DOI 10.1121/1.2025845 MCCOWAN I, 2005, 04 IDIAP RES, P73 McGowan RS, 1996, J ACOUST SOC AM, V99, P595, DOI 10.1121/1.415220 MENDES JAG, 2008, IEEE COMP SOC C IM S MERMELST.P, 1973, J ACOUST SOC AM, V53, P1070, DOI 10.1121/1.1913427 MEYER P, 1986, SIGNAL PROCESS, V3, P377 Ng L., 2000, P INT C AC SPEECH SI, V1, P229 PERRIER P, 1992, J SPEECH HEAR RES, V35, P53 RUBIN P, 1981, J ACOUST SOC AM, V70, P321, DOI 10.1121/1.386780 Saltzman E. L., 1989, ECOL PSYCHOL, V1, P333, DOI 10.1207/s15326969eco0104_2 Shah SA, 2005, AM J TRANSPLANT, V5, P400 STEINER I, 2008, 8 INT SEM SPEECH PRO Trejo LJ, 2003, IEEE T NEUR SYS REH, V11, P199, DOI 10.1109/TNSRE.2003.814426 Wheeler KR, 2003, IEEE PERVAS COMPUT, V2, P56, DOI 10.1109/MPRV.2003.1203754 NR 34 TC 11 Z9 11 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2010 VL 52 IS 4 SI SI BP 354 EP 366 DI 10.1016/j.specom.2009.11.003 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 574XB UT WOS:000276026400008 ER PT J AU Brumberg, JS Nieto-Castanon, A Kennedy, PR Guenther, FH AF Brumberg, Jonathan S. Nieto-Castanon, Alfonso Kennedy, Philip R. Guenther, Frank H. TI Brain-computer interfaces for speech communication SO SPEECH COMMUNICATION LA English DT Article DE Brain-computer interface; Neural prosthesis; Speech restoration ID THOUGHT-TRANSLATION DEVICE; INTRACORTICAL ELECTRODE ARRAY; VISUAL-EVOKED POTENTIALS; NEURAL-NETWORK MODEL; CEREBRAL-CORTEX; CONE ELECTRODE; ELECTROCORTICOGRAPHIC SIGNALS; MICROELECTRODE ARRAYS; MACHINE INTERFACE; CORTICAL-NEURONS AB This paper briefly reviews current silent speech methodologies for normal and disabled individuals. Current techniques utilizing electromyographic (EMG) recordings of vocal tract movements are useful for physically healthy individuals but fail for tetraplegic individuals who do not have accurate voluntary control over the speech articulators. Alternative methods utilizing EMG from other body parts (e.g., hand, arm, or facial muscles) or electroencephalography (EEG) can provide capable silent communication to severely paralyzed users, though current interfaces are extremely slow relative to normal conversation rates and require constant attention to a computer screen that provides visual feedback and/or cueing. 
We present a novel approach to the problem of silent speech via an intracortical microelectrode brain-computer interface (BCI) to predict intended speech information directly from the activity of neurons involved in speech production. The predicted speech is synthesized and acoustically fed back to the user with a delay under 50 ms. We demonstrate that the Neurotrophic Electrode used in the BCI is capable of providing useful neural recordings for over 4 years, a necessary property for BCIs that need to remain viable over the lifespan of the user. Other design considerations include neural decoding techniques based on previous research involving BCIs for computer cursor or robotic arm control via prediction of intended movement kinematics from motor cortical signals in monkeys and humans. Initial results from a study of continuous speech production with instantaneous acoustic feedback show the BCI user was able to improve his control over an artificial speech synthesizer both within and across recording sessions. The success of this initial trial validates the potential of the intracortical microelectrode-based approach for providing a speech prosthesis that can allow much more rapid communication rates. (C) 2010 Elsevier By. All rights reserved. C1 [Brumberg, Jonathan S.; Guenther, Frank H.] Boston Univ, Dept Cognit & Neural Syst, Boston, MA 02215 USA. [Guenther, Frank H.] Boston Univ, Sargent Coll Hlth & Rehabil Sci, Dept Speech Language & Hearing Sci, Boston, MA 02215 USA. [Guenther, Frank H.] Harvard Univ, MIT, Div Hlth Sci & Technol, Cambridge, MA 02139 USA. [Guenther, Frank H.] Massachusetts Gen Hosp, Athinoula A Martinos Ctr Biomed Imaging, Charlestown, MA 02129 USA. [Brumberg, Jonathan S.; Kennedy, Philip R.] Neural Signals Inc, Duluth, GA 30096 USA. [Nieto-Castanon, Alfonso] StatsANC LLC, Buenos Aires, DF, Argentina. RP Brumberg, JS (reprint author), Boston Univ, Dept Cognit & Neural Syst, 677 Beacon St, Boston, MA 02215 USA. EM brumberg@cns.bu.edu FU National Institute on Deafness and other Communication Disorders [R01 DC007683, R01 DC002852, R44 DC007050-02]; CELEST, an NSF Science of Learning Center [NSF SBE-0354378] FX This research was supported by the National Institute on Deafness and other Communication Disorders (R01 DC007683; R01 DC002852; R44 DC007050-02) and by CELEST, an NSF Science of Learning Center (NSF SBE-0354378). The authors thank the participant and his family for their dedication to this research project, Rob Law and Misha Panko for their assistance with the preparation of this manuscript and Tanja Schultz for her helpful comments. CR Allison BZ, 2008, CLIN NEUROPHYSIOL, V119, P399, DOI 10.1016/j.clinph.2007.09.121 Bartels J, 2008, J NEUROSCI METH, V174, P168, DOI 10.1016/j.jneumeth.2008.06.030 Betts BJ, 2006, INTERACT COMPUT, V18, P1242, DOI 10.1016/j.intcom.2006.08.012 Birbaumer N, 2000, IEEE T REHABIL ENG, V8, P190, DOI 10.1109/86.847812 Birbaumer N, 2003, IEEE T NEUR SYS REH, V11, P120, DOI 10.1109/TNSRE.2003.814439 Birbaumer N, 1999, NATURE, V398, P297, DOI 10.1038/18581 BROWN EN, 2004, COMPUTATIONAL NEUROS, V7, P253 Brumberg J. 
S., 2009, P 10 ANN C INT SPEEC Carmena JM, 2003, PLOS BIOL, V1, P193, DOI 10.1371/journal.pbio.0000042 Cheng M, 2002, IEEE T BIO-MED ENG, V49, P1181, DOI 10.1109/TBME.2002.803536 DaSalla CS, 2009, NEURAL NETWORKS, V22, P1334, DOI 10.1016/j.neunet.2009.05.008 Donchin E, 2000, IEEE T REHABIL ENG, V8, P174, DOI 10.1109/86.847808 Fagan MJ, 2008, MED ENG PHYS, V30, P419, DOI 10.1016/j.medengphy.2007.05.003 Gelb A., 1974, APPL OPTIMAL ESTIMAT GEORGOPOULOS AP, 1982, J NEUROSCI, V2, P1527 GUENTHER FH, 1994, BIOL CYBERN, V72, P43, DOI 10.1007/BF00206237 Guenther FH, 2009, PLOS ONE, V4, DOI 10.1371/journal.pone.0008218 Guenther FH, 2006, BRAIN LANG, V96, P280, DOI 10.1016/j.bandl.2005.06.001 GUENTHER FH, 1995, PSYCHOL REV, V102, P594 HAMALAINEN M, 1993, REV MOD PHYS, V65, P413, DOI 10.1103/RevModPhys.65.413 Hinterberger T, 2003, CLIN NEUROPHYSIOL, V114, P416, DOI 10.1016/S1388-2457(02)00411-X HOCHBERG LR, 2008, NEUR M PLANN 2008 SO Hochberg LR, 2006, NATURE, V442, P164, DOI 10.1038/nature04970 HOOGERWERF AC, 1994, IEEE T BIO-MED ENG, V41, P1136, DOI 10.1109/10.335862 JONES KE, 1992, ANN BIOMED ENG, V20, P423, DOI 10.1007/BF02368134 JORGENSEN C, 2003, P INT JOINT C NEUR N, V4, P3128, DOI 10.1109/IJCNN.2003.1224072 JOU SC, 2006, INTERSPEECH 2006 JOU SS, 2009, COMMUN COMPUT PHYS, V25, P305 Kalman R.E, 1960, J BASIC ENG, V82, P35, DOI DOI 10.1115/1.3662552 KENNEDY PR, 1989, J NEUROSCI METH, V29, P181, DOI 10.1016/0165-0270(89)90142-8 KENNEDY PR, 1992, NEUROREPORT, V3, P605, DOI 10.1097/00001756-199207000-00015 KENNEDY PR, 2006, ELECT ENG HDB SERIES, V1 KENNEDY PR, 1992, NEUROSCI LETT, V142, P89, DOI 10.1016/0304-3940(92)90627-J Kennedy PR, 2004, IEEE T NEUR SYS REH, V12, P339, DOI 10.1109/TNSRE.2004.834629 Kennedy PR, 2000, IEEE T REHABIL ENG, V8, P198, DOI 10.1109/86.847815 Kennedy PR, 1998, NEUROREPORT, V9, P1707, DOI 10.1097/00001756-199806010-00007 Kim S, 2007, 3 IEEE EMBS C NEUR E, P486 Kipke DR, 2003, IEEE T NEUR SYS REH, V11, P151, DOI 10.1109/TNSRE.2003.814443 Krusienski DJ, 2008, J NEUROSCI METH, V167, P15, DOI 10.1016/j.jneumeth.2007.07.017 Krusienski DJ, 2006, J NEURAL ENG, V3, P299, DOI 10.1088/1741-2560/3/4/007 Kubler A, 1999, EXP BRAIN RES, V124, P223, DOI 10.1007/s002210050617 Leuthardt Eric C, 2004, J Neural Eng, V1, P63, DOI 10.1088/1741-2560/1/2/001 MACKAY DG, 1968, J ACOUST SOC AM, V43, P811, DOI 10.1121/1.1910900 Maier-Hein L., 2005, IEEE WORKSH AUT SPEE, P331 MATTHEWS BA, 2008, NEUR M PLANN 2008 SO Maynard EM, 1997, ELECTROEN CLIN NEURO, V102, P228, DOI 10.1016/S0013-4694(96)95176-0 MENDES J, 2008, C IM SIGN PROC CISP, V1, P221 MILLER LE, 2007, NEUR M PLANN 2007 SA MITZDORF U, 1985, PHYSIOL REV, V65, P37 MOUNTCASTLE VB, 1975, J NEUROPHYSIOL, V38, P871 Nicolelis MAL, 2003, P NATL ACAD SCI USA, V100, P11041, DOI 10.1073/pnas.1934665100 Penfield W, 1959, SPEECH BRAIN MECH PORBADNIGK A, 2009, INT C BIOINSP SYST S, P376 Rousche PJ, 1998, J NEUROSCI METH, V82, P1, DOI 10.1016/S0165-0270(98)00031-4 Schalk G, 2007, J NEURAL ENG, V4, P264, DOI 10.1088/1741-2560/4/3/012 Schalk G, 2008, J NEURAL ENG, V5, P75, DOI 10.1088/1741-2560/5/1/008 SCHMIDT EM, 1976, EXP NEUROL, V52, P496, DOI 10.1016/0014-4886(76)90220-X Sellers EW, 2006, BIOL PSYCHOL, V73, P242, DOI 10.1016/j.biopsycho.2006.04.007 SIEBERT SA, 2008, NEUR M PLANN 2008 SO Suppes P, 1997, P NATL ACAD SCI USA, V94, P14965, DOI 10.1073/pnas.94.26.14965 Taylor DM, 2002, SCIENCE, V296, P1829, DOI 10.1126/science.1070291 Trejo LJ, 2006, IEEE T NEUR SYS REH, V14, P225, DOI 10.1109/TNSRE.2006.875578 Truccolo W, 2008, J NEUROSCI, V28, P1163, DOI 
10.1523/JNEUROSCI.4415-07.2008 Truccolo W, 2005, J NEUROPHYSIOL, V93, P1074, DOI 10.1152/jn.00697.2004 Vaughan TM, 2006, IEEE T NEUR SYS REH, V14, P229, DOI 10.1109/TNSRE.2006.875577 Velliste M, 2008, NATURE, V453, P1098, DOI 10.1038/nature06996 WALLICZEK M, 2006, INTERSPEECH 2006, P1596 Wand M., 2009, INT C BIOINSP SYST S Wessberg J, 2000, NATURE, V408, P361 Williams JC, 1999, BRAIN RES PROTOC, V4, P303, DOI 10.1016/S1385-299X(99)00034-3 Wise KD, 2004, P IEEE, V92, P76, DOI 10.1109/JPROC.2003.820544 WISE KD, 1970, IEEE T BIO-MED ENG, VBM17, P238, DOI 10.1109/TBME.1970.4502738 Wolpaw JR, 2000, IEEE T REHABIL ENG, V8, P222, DOI 10.1109/86.847823 Wolpaw JR, 2004, P NATL ACAD SCI USA, V101, P17849, DOI 10.1073/pnas.0403504101 WRIGHT EJ, 2007, NEUR M PLANN 2007 SA WRIGHT EJ, 2008, NEUR M PLANN 2008 SO NR 76 TC 27 Z9 28 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2010 VL 52 IS 4 SI SI BP 367 EP 379 DI 10.1016/j.specom.2010.01.001 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 574XB UT WOS:000276026400009 ER PT J AU Goldwater, S Jurafsky, D Manning, CD AF Goldwater, Sharon Jurafsky, Dan Manning, Christopher D. TI Which words are hard to recognize? Prosodic, lexical, and disfluency factors that increase speech recognition error rates SO SPEECH COMMUNICATION LA English DT Article DE Speech recognition; Conversational; Error analysis; Individual differences; Mixed-effects model ID NEIGHBORHOOD ACTIVATION; FREQUENCY; TRANSCRIPTION AB Despite years of speech recognition research, little is known about which words tend to be misrecognized and why. Previous work has shown that errors increase for infrequent words, short words, and very loud or fast speech, but many other presumed causes of error (e.g., nearby disfluencies, turn-initial words, phonetic neighborhood density) have never been carefully tested. The reasons for the huge differences found in error rates between speakers also remain largely mysterious. Using a mixed-effects regression model, we investigate these and other factors by analyzing the errors of two state-of-the-art recognizers on conversational speech. Words with higher error rates include those with extreme prosodic characteristics, those occurring turn-initially or as discourse markers, and doubly confusable pairs: acoustically similar words that also have similar language model probabilities. Words preceding disfluent interruption points (first repetition tokens and words before fragments) also have higher error rates. Finally, even after accounting for other factors, speaker differences cause enormous variance in error rates, suggesting that speaker error rate variance is not fully explained by differences in word choice, fluency, or prosodic characteristics. We also propose that doubly confusable pairs, rather than high neighborhood density, may better explain phonetic neighborhood errors in human speech processing. (C) 2009 Elsevier B.V. All rights reserved. C1 [Goldwater, Sharon] Univ Edinburgh, Sch Informat, Edinburgh EH8 9AB, Midlothian, Scotland. [Jurafsky, Dan] Stanford Univ, Dept Linguist, Stanford, CA 94305 USA. [Manning, Christopher D.] Stanford Univ, Dept Comp Sci, Stanford, CA 94305 USA. RP Goldwater, S (reprint author), Univ Edinburgh, Sch Informat, 10 Crichton St, Edinburgh EH8 9AB, Midlothian, Scotland. 
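The mixed-effects analysis described in the Goldwater, Jurafsky and Manning abstract above can be pictured with the following minimal sketch, which regresses a per-token error indicator on a few predictors with a random intercept per speaker. The column names and the use of a linear rather than logistic mixed model are simplifying assumptions, not the paper's actual model.

```python
# Minimal illustration (not the paper's analysis): per-word error indicator regressed on
# token-level predictors, with a random intercept per speaker to absorb speaker variance.
import pandas as pd
import statsmodels.formula.api as smf

def fit_error_model(tokens: pd.DataFrame):
    """tokens needs columns: error (0/1), log_freq, duration, speech_rate, speaker."""
    model = smf.mixedlm("error ~ log_freq + duration + speech_rate",
                        data=tokens, groups=tokens["speaker"])
    result = model.fit()
    print(result.summary())
    return result
```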
EM sgwater@inf.ed.ac.uk; jurafsky@stanford.edu; manning@stanford.edu FU Edinburgh-Stanford LINK a; ONR MURI [N000140510388] FX This work was supported by the Edinburgh-Stanford LINK and ONR MURI award N000140510388. We thank Andreas Stolcke for providing the SRI recognizer output, language model, and forced alignments; Phil Woodland for providing the Cambridge recognizer output and other evaluation data; and Katrin Kirchhoff and Raghunandan Kumaran for datasets used in preliminary work, useful scripts, and additional help. CR Adda-Decker M., 2005, P INTERSPEECH C INTE, P2205 Baayen RH, 2008, PRACTICAL INTRO STAT Bates D, 2007, IME4 LINEAR MIXED EF Bell A, 2003, J ACOUST SOC AM, V113, P1001, DOI 10.1121/1.1534836 Boersma P., 2007, PRAAT DOING PHONETIC Bradlow AR, 1996, SPEECH COMMUN, V20, P255, DOI 10.1016/S0167-6393(96)00063-5 BULYKO I, 2003, P C HUM LANG TECHN Dahan D, 2001, COGNITIVE PSYCHOL, V42, P317, DOI 10.1006/cogp.2001.0750 Diehl RL, 1996, J PHONETICS, V24, P187, DOI 10.1006/jpho.1996.0011 DODDINGTON GR, 1981, IEEE SPECTRUM, V18, P26 EVERMANN G, 2004, P FALL 2004 RICH TRA EVERMANN G, 2005, P ICASSP, P209 Everrnann G., 2004, P ICASSP FISCUS J, 2004, RT 04F WORKSH Fosler-Lussier E, 1999, SPEECH COMMUN, V29, P137, DOI 10.1016/S0167-6393(99)00035-7 Fougeron C, 1997, J ACOUST SOC AM, V101, P3728, DOI 10.1121/1.418332 GOLDWATER S, 2008, P ACL Good PI, 2004, PERMUTATION PARAMETR Hain T, 2005, IEEE T SPEECH AUDI P, V13, P1173, DOI 10.1109/TSA.2005.852999 Harrell FE, 2007, DESIGN PACKAGE R PAC HEIKE A, 1981, LANG SPEECH, V24, P147 Hirschberg J, 2004, SPEECH COMMUN, V43, P155, DOI 10.1016/j.specom.2004.01.006 HOWES D, 1954, J EXP PSYCHOL, V48, P106, DOI 10.1037/h0059478 Ingle J., 2005, J ACOUST SOC AM, V117, P2459 Keating P., 2003, PAPERS LAB PHONOLOGY, P143 Luce PA, 2000, PERCEPT PSYCHOPHYS, V62, P615, DOI 10.3758/BF03212113 Luce PA, 1998, EAR HEARING, V19, P1, DOI 10.1097/00003446-199802000-00001 Marcus M.P., 1999, LDC99T42 MARSLENWILSON WD, 1987, COGNITION, V25, P71, DOI 10.1016/0010-0277(87)90005-9 Nakamura M, 2008, COMPUT SPEECH LANG, V22, P171, DOI 10.1016/j.csl.2007.07.003 NUSBAUM H, 1995, 2002 APPL SPEECH TEC, pCH4 Nusbaum H. C., 1987, Computer Speech and Language, V2, DOI 10.1016/0885-2308(87)90002-7 PENNOCKSPECK B, 2005, ACT 28 C INT AEDEAN, P407 POVEY D, 2002, P IEEE ICASSP R Development Core Team, 2007, R LANG ENV STAT COMP Ratnaparkhi A, 1996, P C EMP METH NAT LAN, P133 SHINOZAKI T, 2001, P ASRU 2001 SHRIBERG E, 1995, P INT C PHON SCI ICP, V4, P384 SIEGLER M, 1995, P ICASSP Stolcke A, 2006, IEEE T AUDIO SPEECH, V14, P1729, DOI 10.1109/TASL.2006.879807 Vergyri D., 2003, P ICASSP HONG KONG A, pI Vitevitch MS, 1999, J MEM LANG, V40, P374, DOI 10.1006/jmla.1998.2618 Wang W, 2002, PROCEEDINGS OF THE 2002 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, P238 NR 43 TC 18 Z9 18 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD MAR PY 2010 VL 52 IS 3 BP 181 EP 200 DI 10.1016/j.specom.2009.10.001 PG 20 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 560GA UT WOS:000274888900001 ER PT J AU Torreira, F Adda-Decker, M Ernestus, M AF Torreira, Francisco Adda-Decker, Martine Ernestus, Mirjam TI The Nijmegen Corpus of Casual French SO SPEECH COMMUNICATION LA English DT Article DE Corpus; Casual speech; French ID SPONTANEOUS SPEECH; WORDS AB This article describes the preparation, recording and orthographic transcription of a new speech corpus, the Nijmegen Corpus of Casual French (NCCFr). The corpus contains a total of over 36 h of recordings of 46 French speakers engaged in conversations with friends. Casual speech was elicited during three different parts, which together provided around 90 min of speech from every pair of speakers. While Parts 1 and 2 did not require participants to perform any specific task, in Part 3 participants negotiated a common answer to general questions about society. Comparisons with the ESTER corpus of journalistic speech show that the two corpora contain speech of considerably different registers. A number of indicators of casualness, including swear words, casual words, verlan, disfluencies and word repetitions, are more frequent in the NCCFr than in the ESTER corpus, while the use of double negation, an indicator of formal speech, is less frequent. In general, these estimates of casualness are constant through the three parts of the recording sessions and across speakers. Based on these facts, we conclude that our corpus is a rich resource of highly casual speech, and that it call be effectively exploited by researchers in language science and technology. (C) 2009 Elsevier B.V. All rights reserved. C1 [Torreira, Francisco; Ernestus, Mirjam] Radboud Univ Nijmegen, CLS, NL-6525 XD Nijmegen, Netherlands. [Torreira, Francisco; Ernestus, Mirjam] Max Planck Inst Psycholinguist, NL-6525 XD Nijmegen, Netherlands. [Adda-Decker, Martine] LIMSI CNRS, Spoken Language Proc Grp, F-91403 Orsay, France. [Adda-Decker, Martine] LIMSI CNRS, Situated Percept Grp, F-91403 Orsay, France. RP Torreira, F (reprint author), Radboud Univ Nijmegen, CLS, Wundtlaan 1, NL-6525 XD Nijmegen, Netherlands. EM Francisco.Torreira@mpi.nl RI Ernestus, Mirjam /E-4344-2010 FU European Young Investigator Award FX Our thanks to Cecile Fougeron, Coralie Vincent, Christine Meunier, Ton Wempe, the staff at ILPGA and the participants for their help during the recording of the corpus in France. We also want to thank Lou Boves and Christopher Stewart for helpful comments and discussion. This work was funded by a European Young Investigator Award to the third author. It was presented at the 6th Journees d'Etudes Linguistiques of Nantes University in June 2009. CR Barras C, 2001, SPEECH COMMUN, V33, P5, DOI 10.1016/S0167-6393(00)00067-4 Blanche-Benveniste Claire, 1990, FRANCAIS PARLE ETUDE Boersma P., 2009, PRAAT DOING PHONETIC Clark H. 
H., 1996, USING LANGUAGE Clark HH, 1998, COGNITIVE PSYCHOL, V37, P201, DOI 10.1006/cogp.1998.0693 Coveney Aidan, 1996, VARIABILITY SPOKEN F Development Core Team, 2008, R LANG ENV STAT COMP Eggins Suzanne, 1997, ANAL CASUAL CONVERSA Ernestus M., 2000, VOICE ASSIMILATION S FAGYAL Z, 1998, VIVACITE DIVERSITE V, V3, P151 Tree JEF, 1995, J MEM LANG, V34, P709, DOI 10.1006/jmla.1995.1032 GALLIANO S, 2005, P INT 2005, P2453 GAUVAIN JL, 2005, P INT 2005 JOHNSON K., 2004, SPONTANEOUS SPEECH D, P29 JOUSSE V, 2008, CARACTERISATION DETE Laks B., 2005, LINGUISTIQUE CORPUS, P205 Lamel L, 2000, SPEECH COMMUN, V31, P339, DOI 10.1016/S0167-6393(99)00067-9 Local J., 2007, P 16 INT C PHON SCI, P6 Local J, 2003, J PHONETICS, V31, P321, DOI 10.1016/S0095-4470(03)00045-7 Moore R. K., 2003, P EUROSPEECH, P2581 MOORE RK, 2005, P 10 INT C SPEECH CO Plug L, 2005, PHONETICA, V62, P131, DOI 10.1159/000090094 Sarkar D, 2008, USE R, P1 Schegloff EA, 2000, LANG SOC, V29, P1 SERPOLLET N, 2007, P CORPUS LINGUISTICS Shriberg E., 2001, J INT PHON ASSOC, V31, P153 Smith Alan, 2002, J FRENCH LANGUAGE ST, V12, P23 Valdman A, 2000, FR REV, V73, P1179 2007, NP ROBERT 2008 GRAND NR 29 TC 9 Z9 9 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2010 VL 52 IS 3 BP 201 EP 212 DI 10.1016/j.specom.2009.10.004 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 560GA UT WOS:000274888900002 ER PT J AU Valente, F AF Valente, Fabio TI Multi-stream speech recognition based on Dempster-Shafer combination rule SO SPEECH COMMUNICATION LA English DT Article DE TANDEM features; Multi Layer Perceptron; Multi-stream speech recognition; Inverse-entropy combination AB This paper aims at investigating the use of Dempster-Shafer (DS) combination rule for multi-stream automatic speech recognition. The DS combination is based on a generalization of the conventional Bayesian framework. The main motivation for this work is the similarity between the DS combination and findings of Fletcher on human speech recognition. Experiments are based on the combination of several Multi Layer Perceptron (MLP) classifiers trained on different representations of the speech signal. The TANDEM framework is adopted in order to use the MLP outputs in conventional speech recognition systems. We exhaustively investigate several methods for applying the DS combination to multi-stream ASR. Experiments are run on small and large vocabulary speech recognition tasks and aim at comparing the proposed technique with other frame-based combination rules (e.g. inverse entropy). Results reveal that the proposed method outperforms conventional combination rules in both tasks. Furthermore we verify that the performance of the combined feature stream is never inferior to the performance of the best individual feature stream. We conclude the paper discussing other applications of the DS combination and possible extensions. (C) 2009 Elsevier B.V. All rights reserved. C1 IDIAP Res Inst, CH-1920 Martigny, Switzerland. RP Valente, F (reprint author), IDIAP Res Inst, CH-1920 Martigny, Switzerland. EM fabio.valente@idiap.ch FU Defense Advanced Research Projects Agency (DARPA) [HR0011-06-C-0023] FX This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-06-C-0023.
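The Dempster-Shafer combination named in the Valente record above can be illustrated with a minimal sketch, assuming two hypothetical MLP streams whose per-frame masses are spread over a few phone classes plus the full frame of discernment; the class labels, mass values, and the helper name dempster_combine are illustrative and are not taken from the paper.

```python
import numpy as np

def dempster_combine(m1, m2):
    """Combine two mass functions (dict: frozenset -> mass) with Dempster's rule."""
    combined = {}
    conflict = 0.0
    for a, ma in m1.items():
        for b, mb in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + ma * mb
            else:
                conflict += ma * mb
    if conflict >= 1.0:
        raise ValueError("total conflict: sources cannot be combined")
    return {a: v / (1.0 - conflict) for a, v in combined.items()}

# Two hypothetical MLP streams over three phone classes; part of each stream's
# mass is left on the full frame of discernment, i.e. declared ignorance.
theta = frozenset({"a", "b", "c"})
stream1 = {frozenset({"a"}): 0.6, frozenset({"b"}): 0.2, theta: 0.2}
stream2 = {frozenset({"a"}): 0.5, frozenset({"c"}): 0.3, theta: 0.2}

print(dempster_combine(stream1, stream2))
```

With singleton-only masses the rule reduces to a normalized product of the streams; leaving mass on the full set is what lets an uncertain stream defer to a more confident one.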
Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Defense Advanced Research Projects Agency (DARPA). The author thank Jithendra Vepa, Thomas Hain and AMI ASR team for their help with the meeting system. The author also thanks anonymous reviewers for their comments. CR Allen J., 2005, ARTICULATION INTELLI BOURLARD H, 1996, P ICSLP 96 Bourlard Ha, 1994, CONNECTIONIST SPEECH Fletcher H., 1953, SPEECH HEARING COMMU Galina L.R., 1994, NEURAL NETWORKS, V7, P777 HAIN T, 2005, NIST RT05 WORKSH ED Hermansky H., 1996, P ICSLP Hermansky H., 2005, P INT 2005 HERMANSKY H, 1998, P ICSLP 98 SYDN AUST Hermansky H., 2000, P ICASSP HWANG MY, 2007, P IEEE WORKSH AUT SP Kittler J., 1998, IEEE T PAMI, V20 Kleinschmidt M, 2002, ACTA ACUST UNITED AC, V88, P416 MANDLER EJ, 1988, PATTERN RECOGN, V10, P381 Misra H., 2003, P ICASSP MORGAN N, 2004, P ICASSP PLAHL C, 2009, P INT BRISB AUSTR Shafer G., 1976, MATH THEORY EVIDENCE Stolcke Andreas, 2006, P ICASSP THOMAS S, 2008, P INT Valente F., 2007, P ICASSP VALENTE F, 2007, P INT XU L, 1992, IEEE T SYST MAN CYB, V22, P418, DOI 10.1109/21.155943 Zhu Q., 2004, P ICSLP RICH TRANSCRIPTION E NR 25 TC 12 Z9 12 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2010 VL 52 IS 3 BP 213 EP 222 DI 10.1016/j.specom.2009.10.002 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 560GA UT WOS:000274888900003 ER PT J AU Chien, JT Chueh, CH AF Chien, Jen-Tzung Chueh, Chuang-Hua TI Joint acoustic and language modeling for speech recognition SO SPEECH COMMUNICATION LA English DT Article DE Hidden Markov model; n-Gram; Conditional random field; Maximum entropy; Discriminative training; Speech recognition ID MAXIMUM-ENTROPY APPROACH; RANDOM-FIELDS AB In a traditional model of speech recognition, acoustic and linguistic information sources are assumed independent of each other. Parameters of hidden Markov model (HMM) and n-gram are separately estimated for maximum a posteriori classification. However, the speech features and lexical words are inherently correlated in natural language. Lacking combination of these models leads to some inefficiencies. This paper reports on the joint acoustic and linguistic modeling for speech recognition by using the acoustic evidence in estimation of the linguistic model parameters, and vice versa, according to the maximum entropy (ME) principle. The discriminative ME (DME) models are exploited by using features from competing sentences. Moreover, a mutual ME (MME) model is built for sentence posterior probability, which is maximized to estimate the model parameters by characterizing the dependence between acoustic and linguistic features. The N-best Viterbi approximation is presented in implementing DME and MME models. Additionally, the new models are incorporated with the high-order feature statistics and word regularities. In the experiments, the proposed methods increase the sentence posterior probability or model separation. Recognition errors are significantly reduced in comparison with separate HMM and n-gram model estimations from 32.2% to 27.4% using the MATBN corpus and from 5.4% to 4.8% using the WSJ corpus (5K condition). (C) 2009 Elsevier B.V. All rights reserved. C1 [Chien, Jen-Tzung; Chueh, Chuang-Hua] Natl Cheng Kung Univ, Dept Comp Sci & Informat Engn, Tainan 70101, Taiwan. 
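As a rough illustration of the maximum-entropy flavour of the Chien and Chueh record above, the sketch below rescores a hypothetical N-best list with a log-linear combination of acoustic and language-model log scores; the feature names, weights, and sentences are invented, and the paper's actual DME/MME parameter estimation is not reproduced here.

```python
import math

def loglinear_rescore(nbest, weights):
    """Rescore an N-best list with a log-linear (maximum-entropy style) model.

    nbest: list of (sentence, feature dict); feature values are log-domain scores.
    weights: dict mapping feature name -> lambda.
    Returns sentences with normalized posterior probabilities.
    """
    scores = [sum(weights[k] * v for k, v in feats.items()) for _, feats in nbest]
    z = max(scores)
    exp_scores = [math.exp(s - z) for s in scores]  # subtract max for stability
    norm = sum(exp_scores)
    return [(sent, e / norm) for (sent, _), e in zip(nbest, exp_scores)]

# Hypothetical 3-best list with acoustic and language-model log scores.
nbest = [
    ("the cat sat", {"log_acoustic": -120.3, "log_lm": -14.2}),
    ("the cat sad", {"log_acoustic": -119.8, "log_lm": -19.6}),
    ("a cat sat",   {"log_acoustic": -123.1, "log_lm": -13.8}),
]
weights = {"log_acoustic": 1.0 / 12.0, "log_lm": 1.0}  # acoustic scale is a tuning knob
for sent, p in loglinear_rescore(nbest, weights):
    print(f"{sent}: {p:.3f}")
```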
RP Chien, JT (reprint author), Natl Cheng Kung Univ, Dept Comp Sci & Informat Engn, Tainan 70101, Taiwan. EM jtchien@mail.ncku.edu.tw FU National Science Council, Taiwan, ROC [NSC97-2221-E-006-230-MY3] FX The authors acknowledge Dr. Jean-Luc Gauvain and the anonymous reviewers for their valuable comments which improved the presentation of this paper. This work has been partially supported by the National Science Council, Taiwan, ROC, under contract NSC97-2221-E-006-230-MY3. CR Bahl L., 1986, P INT C AC SPEECH SI, V11, P49, DOI DOI 10.1109/ICASSP.1986.1169179 Berger AL, 1996, COMPUT LINGUIST, V22, P39 BEYERLEIN P, 1998, P IEEE INT C AC SPEE, V1, P481, DOI 10.1109/ICASSP.1998.674472 Chien JT, 2006, IEEE T AUDIO SPEECH, V14, P1719, DOI 10.1109/TSA.2005.858551 Chien JT, 2006, IEEE T AUDIO SPEECH, V14, P797, DOI 10.1109/TSA.2005.860847 CHIEN JT, 2008, P IEEE WORKSH SPOK L, P201 Chueh C. H., 2005, P INTERSPEECH, P721 CHUEH CH, 2006, P IEEE INT C AC SPEE, V1, P1061 DARROCH JN, 1972, ANN MATH STAT, V43, P1470, DOI 10.1214/aoms/1177692379 DellaPietra S, 1997, IEEE T PATTERN ANAL, V19, P380, DOI 10.1109/34.588021 DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1 FOSLERLUSSIER E, 2008, P IEEE INT C AC SPEE, P4049 Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278 Gillick L., 1989, P ICASSP, P532 Gunawardana A., 2005, P INTERSPEECH, P1117 HEIGOLD G, 2007, P EUR C SPEECH COMM, P1721 JAYNES ET, 1957, PHYS REV, V106, P620, DOI 10.1103/PhysRev.106.620 JUANG BH, 1992, IEEE T SIGNAL PROCES, V40, P3043, DOI 10.1109/78.175747 Khudanpur S, 2000, COMPUT SPEECH LANG, V14, P355, DOI 10.1006/csla.2000.0149 Kuo H. K. J., 2002, P ICASSP, V1, P325 KUO HKJ, 2004, P ICSLP JEJ ISL S KO, P681 Lafferty John D., 2001, ICML, P282 Lee LS, 1997, IEEE SIGNAL PROC MAG, V14, P63 Liu Y, 2006, IEEE T AUDIO SPEECH, V14, P1526, DOI 10.1109/TASL.2006.878255 MACHEREY W, 2003, P EUR C SPEECH COMM, V1, P493 MAHAJAN M, 2006, P ICASSP 2006, V1, P273 Malouf R., 2002, P 6 C NAT LANG LEARN, P49 McCallum A., 2000, P 17 INT C MACH LEAR, P591 MORRIS J, 2006, P INT C SPOK LANG PR, P579 Normandin Y, 1994, IEEE T SPEECH AUDI P, V2, P299, DOI 10.1109/89.279279 Quattoni A, 2007, IEEE T PATTERN ANAL, V29, P1848, DOI 10.1109/TPAMI.2007.1124 Riedmiller M., 1993, P IEEE INT C NEUR NE, V1, P586, DOI DOI 10.1109/ICNN.1993.298623 Rosenfeld R, 1996, COMPUT SPEECH LANG, V10, P187, DOI 10.1006/csla.1996.0011 Sha F., 2003, P C N AM CHAPT ASS C, P134 Stolcke A., 2002, P INT C SPOK LANG PR, V2, P901 Wang SJ, 2004, IEEE T NEURAL NETWOR, V15, P903, DOI 10.1109/TNN.2004.828755 NR 36 TC 15 Z9 15 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2010 VL 52 IS 3 BP 223 EP 235 DI 10.1016/j.specom.2009.10.003 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 560GA UT WOS:000274888900004 ER PT J AU Kolar, J Liu, Y Shriberg, E AF Kolar, Jachym Liu, Yang Shriberg, Elizabeth TI Speaker adaptation of language and prosodic models for automatic dialog act segmentation of speech SO SPEECH COMMUNICATION LA English DT Article DE Spoken language understanding; Dialog act segmentation; Speaker adaptation; Prosody modeling; Language modeling ID TO-TEXT; RECOGNITION AB Speaker-dependent modeling has a long history in speech recognition, but has received less attention in speech understanding.
This study explores speaker-specific modeling for the task of automatic segmentation of speech into dialog acts (DAs), using a linear combination of speaker-dependent and speaker-independent language and prosodic models. Data come from 20 frequent speakers in the ICSI meeting corpus; adaptation data per speaker ranges from 5 k to 115 k words. We compare performance for both reference transcripts and automatic speech recognition output. We find that: (1) speaker adaptation in this domain results both in a significant overall improvement and in improvements for many individual speakers, (2) the magnitude of improvement for individual speakers does not depend on the amount of adaptation data, and (3) language and prosodic models differ both in degree of improvement, and in relative benefit for specific DA classes. These results suggest important future directions for speaker-specific modeling in spoken language understanding tasks. (C) 2009 Elsevier B.V. All rights reserved. C1 [Kolar, Jachym] Univ W Bohemia, Dept Cybernet, Fac Sci Appl, Plzen 30614, Czech Republic. [Liu, Yang] Univ Texas Dallas, Dept Comp Sci, Richardson, TX 75083 USA. [Shriberg, Elizabeth] SRI Int, Speech Technol & Res Lab, Menlo Pk, CA 94025 USA. [Shriberg, Elizabeth] Int Comp Sci Inst, Berkeley, CA 94704 USA. RP Kolar, J (reprint author), Univ W Bohemia, Dept Cybernet, Fac Sci Appl, Univerzitni 8, Plzen 30614, Czech Republic. EM jachym@kky.zcu.cz; yangl@hlt.utdallas.edu; ees@speech.sri.com FU Ministry of Education of the Czech Republic [1M0567, 2C06020]; NSF [IIS-0544682, IIS-0845484] FX This work was supported by the Ministry of Education of the Czech Republic under projects 1M0567 and 2C06020 at UWB Pilsen, and the NSF grants IIS-0544682 at SRI International and IIS-0845484 at UT Dallas. The views are those of the authors and do not reflect the views of the funding agencies. 
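A minimal sketch of the linear model combination described in the Kolar et al. record above, assuming hypothetical speaker-dependent and speaker-independent probabilities for a few held-out events; the interpolation weight is chosen by a simple grid search, which is only one of several ways such a weight could be tuned.

```python
import math

def interpolate(p_sd, p_si, lam):
    """Linear interpolation of speaker-dependent and speaker-independent probabilities."""
    return lam * p_sd + (1.0 - lam) * p_si

def pick_lambda(held_out, grid=None):
    """Choose the interpolation weight that maximizes held-out log-likelihood.

    held_out: list of (p_sd, p_si) pairs for the events actually observed.
    """
    grid = grid or [i / 20.0 for i in range(21)]
    def loglik(lam):
        return sum(math.log(interpolate(p_sd, p_si, lam)) for p_sd, p_si in held_out)
    return max(grid, key=loglik)

# Hypothetical held-out events: probability of a dialog-act boundary under each model.
held_out = [(0.42, 0.17), (0.05, 0.12), (0.30, 0.28), (0.55, 0.20)]
print("best lambda:", pick_lambda(held_out))
```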
CR AKITA Y, 2004, P INTERSPEECH 2004 I AKITA Y, 2006, P INTERSPEECH 2006 I BESLING S, 1995, P EUROSPEECH MADR SP CUENDET S, 2006, P IEEE WORKSH SPOK L Dhillon R, 2004, TR04002 ICSI FAVRE B, 2008, P IEEE WORKSH SPOK L Furui S, 2004, IEEE T SPEECH AUDI P, V12, P401, DOI 10.1109/TSA.2004.828699 Gales MJF, 1998, COMPUT SPEECH LANG, V12, P75, DOI 10.1006/csla.1998.0043 Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278 HIRST A, 1998, INTONATION SYST Huang J., 2002, P ICSLP 2002 DENV CO JANIN A, 2003, P ICASSP HONG KONG Jones D., 2003, P EUROSPEECH GEN SWI KAHN JG, 2004, P HLT NAACL BOST MA Kim JH, 2003, SPEECH COMMUN, V41, P563, DOI 10.1016/S00167-6393(03)00049-9 KNESER R, 1995, P ICASSP DETR MI US KOLAR J, 2007, P INTERSPEECH 2007 A Kolar J, 2006, LECT NOTES ARTIF INT, V4188, P629 KOLAR J, 2006, P INTERSPEECH 2006 I Liu Y, 2006, COMPUT SPEECH LANG, V20, P468, DOI 10.1016/j.csl.2005.06.002 LIU Y, 2005, P ACL ANN ARB MI US LIU Y, 2004, P EMNLP BARC SPAIN MAGIMAIDOSS M, 2007, P ICASSP HON HI MATUSOV E, 2007, P INTERSPEECH 2007 A Olshen R., 1984, CLASSIFICATION REGRE, V1st Ostendorf M., 1994, Computational Linguistics, V20 ROARK B, 2006, P ICASSP TOUL FRANC Shriberg E, 2000, SPEECH COMMUN, V32, P127, DOI 10.1016/S0167-6393(00)00028-5 SONMEZ K, 1998, P ICSLP SYDN AUSTR SRIVASTAVA A, 2003, P EUROSPEECH GEN SWI STOLCKE A, 1998, P ICSLP SYDN AUSTR Stolcke A, 2006, IEEE T AUDIO SPEECH, V14, P1729, DOI 10.1109/TASL.2006.879807 STOLCKE A, 2002, P ICSLP DENV CO US TUR G, 2007, P ICASSP HON HI US WARNKE V, 1997, P EUROSPEECH RHOD GR ZIMMERMANN M, 2006, P INTERSPEECH 2006 I ZIMMERMANN M, 2009, P INTERSPEECH 2009 B NR 37 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2010 VL 52 IS 3 BP 236 EP 245 DI 10.1016/j.specom.2009.10.005 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 560GA UT WOS:000274888900005 ER PT J AU Boulenger, V Hoen, M Ferragne, E Pellegrino, F Meunier, F AF Boulenger, Veronique Hoen, Michel Ferragne, Emmanuel Pellegrino, Francois Meunier, Fanny TI Real-time lexical competitions during speech-in-speech comprehension SO SPEECH COMMUNICATION LA English DT Article DE Speech-in-noise; Informational masking; Lexical competition ID AUDITORY SCENE ANALYSIS; TOP-DOWN INFLUENCES; INFORMATIONAL MASKING; SIMULTANEOUS TALKERS; INTERFERING-SPEECH; WORD RECOGNITION; COCKTAIL PARTY; PERCEPTION; HEARING; IDENTIFICATION AB This study aimed at characterizing the cognitive processes that come into play during speech-in-speech comprehension by examining lexical competitions between target speech and concurrent multi-talker babble. We investigated the effects of number of simultaneous talkers (2, 4, 6 or 8) and of the token frequency of the words that compose the babble (high or low) on lexical decision to target words. Results revealed a decrease in performance as measured by reaction times to targets with increasing number of concurrent talkers. Crucially, the frequency of words in the babble significantly affected performance: high-frequency babble interfered more strongly (by lengthening reaction times) with word recognition than low-frequency babble. This informational masking was particularly salient when only two talkers were present in the babble due to the availability of identifiable lexical items from the background. 
Our findings suggest that speech comprehension in multi-talker babble can trigger competitions at the lexical level between target and background. They further highlight the importance of investigating speech-in-speech comprehension situations as they may provide crucial information on interactive and competitive mechanisms that occur in real-time during word recognition. (C) 2009 Elsevier B.V. All rights reserved. C1 [Boulenger, Veronique; Ferragne, Emmanuel; Pellegrino, Francois; Meunier, Fanny] Univ Lyon, Lab Dynam Langage, UMR CNRS 5596, Inst Sci Homme, F-69363 Lyon 07, France. [Hoen, Michel] Univ Lyon 1, Stem Cell & Brain Res Inst, INSERM U846, F-69675 Bron, France. RP Boulenger, V (reprint author), Univ Lyon, Lab Dynam Langage, UMR CNRS 5596, Inst Sci Homme, 14 Ave Berthelot, F-69363 Lyon 07, France. EM Veronique.Boulenger@ish-lyon.cnrs.fr RI Hoen, Michel/C-7721-2012 OI Hoen, Michel/0000-0003-2099-8130 FU European Research Council FX This project was carried out with financial support from the European Research Council (SpiN project to Fanny Meunier). We would like to thank Claire Grataloup for allowing us to use the materials from her PhD. We would also like to thank the anonymous Reviewer and the Editor for their very helpful comments. CR Alain C, 2005, J COGNITIVE NEUROSCI, V17, P811, DOI 10.1162/0898929053747621 Alain C, 2001, J EXP PSYCHOL HUMAN, V27, P1072, DOI 10.1037//0096-1523.27.5.1072 Alain C, 2003, J COGNITIVE NEUROSCI, V15, P1063, DOI 10.1162/089892903770007443 ANDREOBRECHT R, 1988, IEEE T ACOUST SPEECH, V36, P29, DOI 10.1109/29.1486 Boersma P., 2009, PRAAT DOING PHONETIC Bregman AS., 1990, AUDITORY SCENE ANAL BRONKHORST AW, 1992, J ACOUST SOC AM, V92, P3132, DOI 10.1121/1.404209 Bronkhorst AW, 2000, ACUSTICA, V86, P117 Brungart DS, 2001, J ACOUST SOC AM, V109, P1101, DOI 10.1121/1.1345696 Brungart DS, 2001, J ACOUST SOC AM, V110, P2527, DOI 10.1121/1.1408946 Brungart DS, 2006, J ACOUST SOC AM, V120, P4007, DOI 10.1121/1.2363929 CHERRY EC, 1953, J ACOUST SOC AM, V25, P975, DOI 10.1121/1.1907229 CONNINE CM, 1990, J EXP PSYCHOL LEARN, V16, P1084, DOI 10.1037/0278-7393.16.6.1084 Davis MH, 2007, HEARING RES, V229, P132, DOI 10.1016/j.heares.2007.01.014 Drullman R, 2000, J ACOUST SOC AM, V107, P2224, DOI 10.1121/1.428503 Dupoux E, 2003, J EXP PSYCHOL HUMAN, V29, P172, DOI 10.1037/0096-1523.29.1.172 FESTEN JM, 1990, J ACOUST SOC AM, V88, P1725, DOI 10.1121/1.400247 FORSTER KI, 1984, J EXP PSYCHOL LEARN, V10, P680, DOI 10.1037/0278-7393.10.4.680 Gaskell MG, 1997, PROCEEDINGS OF THE NINETEENTH ANNUAL CONFERENCE OF THE COGNITIVE SCIENCE SOCIETY, P247 Hawley ML, 2004, J ACOUST SOC AM, V115, P833, DOI 10.1121/1.1639908 Hoen M, 2007, SPEECH COMMUN, V49, P905, DOI 10.1016/j.specom.2007.05.008 Kouider S, 2005, PSYCHOL SCI, V16, P617, DOI 10.1111/j.1467-9280.2005.01584.x Lecumberri MLG, 2006, J ACOUST SOC AM, V119, P2445, DOI 10.1021/1.2180210 MacKay D. 
G., 1987, ORG PERCEPTION ACTIO MACKAY DG, 1982, PSYCHOL REV, V89, P483, DOI 10.1037/0033-295X.89.5.483 MarslenWilson W, 1996, J EXP PSYCHOL HUMAN, V22, P1376 MARSLENWILSON W, 1990, ACL MIT NAT, P148 McClelland J., 1986, COGNITIVE PSYCHOL, V8, P1, DOI [10.1016/0010-0285(86)90015-0, DOI 10.1016/0010-0285(86)90015-0] MCCLELLAND JL, 1981, PSYCHOL REV, V88, P375, DOI 10.1037/0033-295X.88.5.375 Mirman D., 2005, J MEM LANG, V52, P424 Monsell S., 1991, BASIC PROCESSES READ, P148 MOORE TE, 1995, CAN J BEHAV SCI, V27, P9, DOI 10.1037/008-400X .27.1.9 New B, 2004, BEHAV RES METH INS C, V36, P516, DOI 10.3758/BF03195598 PELLEGRINO F, 2004, INT C SPEECH PROS 20, P517 Pellegrino F, 2000, SIGNAL PROCESS, V80, P1231, DOI 10.1016/S0165-1684(00)00032-3 Plant D. C., 1996, PSYCHOL REV, V103, P56, DOI DOI 10.1037/0033-295 Rhebergen KS, 2005, J ACOUST SOC AM, V118, P1274, DOI 10.1121/1.2000751 RUBIN P, 1976, PERCEPT PSYCHOPHYS, V19, P394, DOI 10.3758/BF03199398 Samuel AG, 1997, COGNITIVE PSYCHOL, V32, P97, DOI 10.1006/cogp.1997.0646 Samuel AG, 2001, PSYCHOL SCI, V12, P348, DOI 10.1111/1467-9280.00364 Simpson SA, 2005, J ACOUST SOC AM, V118, P2775, DOI 10.1121/1.2062650 TAFT M, 1986, COGNITION, V22, P259, DOI 10.1016/0010-0277(86)90017-X Van Engen KJ, 2007, J ACOUST SOC AM, V121, P519, DOI 10.1121/1.2400666 NR 43 TC 10 Z9 10 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2010 VL 52 IS 3 BP 246 EP 253 DI 10.1016/j.specom.2009.11.002 PG 8 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 560GA UT WOS:000274888900006 ER PT J AU Arias, JP Yoma, NB Vivanco, H AF Pablo Arias, Juan Becerra Yoma, Nestor Vivanco, Hiram TI Automatic intonation assessment for computer aided language learning SO SPEECH COMMUNICATION LA English DT Article DE Intonation assessment; Computer aided language learning; Word stress assessment ID RECOGNITION AB In this paper the nature and relevance of the information provided by intonation is discussed in the framework of second language learning. As a consequence, an automatic intonation assessment system for second language learning is proposed based on a top-down scheme. A stress assessment system is also presented by combining intonation and energy contour estimation. The utterance pronounced by the student is directly compared with a reference one. The trend similarity of intonation and energy contours are compared frame-by-frame by using DTW alignment. Moreover the robustness of the alignment provided by the DTW algorithm to microphone, speaker and quality pronunciation mismatch is addressed. The intonation assessment system gives an averaged subjective-objective score correlation as high as 0.88. The stress assessment evaluation system gives an EER equal to 21.5%, which in turn is similar to the error observed in phonetic quality evaluation schemes. These results suggest that the proposed systems could be employed in real applications. Finally, the schemes presented here are text- and language-independent due to the fact that the reference utterance text-transcription and language are not required. (C) 2009 Elsevier B.V. All rights reserved. C1 [Pablo Arias, Juan; Becerra Yoma, Nestor] Univ Chile, Speech Proc & Transmiss Lab, Dept Elect Engn, Santiago, Chile. [Vivanco, Hiram] Univ Chile, Dept Linguist, Santiago, Chile. 
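The frame-by-frame contour comparison described in the Arias et al. record above can be sketched, under the assumption of synthetic F0 contours, as a plain DTW alignment followed by a correlation of the aligned frames; the contours and the similarity measure below are illustrative, not the paper's exact scoring.

```python
import numpy as np

def dtw_path(x, y):
    """Plain dynamic time warping between two 1-D contours; returns the alignment path."""
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    path, i, j = [], n, m              # backtrack from the end of both contours
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def contour_similarity(reference_f0, student_f0):
    """Correlate the two contours frame-by-frame after DTW alignment."""
    path = dtw_path(reference_f0, student_f0)
    ref = np.array([reference_f0[i] for i, _ in path])
    stu = np.array([student_f0[j] for _, j in path])
    return np.corrcoef(ref, stu)[0, 1]

# Hypothetical semitone-scale F0 contours: reference utterance vs. student imitation.
t = np.linspace(0, 1, 50)
reference = 5 * np.sin(2 * np.pi * t)
student = 4 * np.sin(2 * np.pi * (t ** 1.2))   # similar shape, different timing
print("trend similarity:", round(contour_similarity(reference, student), 3))
```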
RP Yoma, NB (reprint author), Univ Chile, Speech Proc & Transmiss Lab, Dept Elect Engn, Av Tupper 2007,POB 412-3, Santiago, Chile. EM nbecerra@ing.uchile.cl FU Conicyt-Chile [D051-10243, 1070382] FX This work was funded by Conicyt-Chile under Grants Fondef No. D051-10243 and Fondecyt No. 1070382. CR BAETENS H, 1982, BILINGUALISM BASIC P Bell ND, 2009, J PRAGMATICS, V41, P1825, DOI 10.1016/j.pragma.2008.10.010 BERNAT E, 2006, ASIAN EFL J, V8 Bernstein J., 1990, P INT C SPOK LANG PR, P1185 Boersma P., 2008, PRAAT DOING PHONETIC Bolinger D., 1986, INTONATION ITS PARTS Bolinger D., 1989, INTONATION ITS USES Botinis A, 2001, SPEECH COMMUN, V33, P263, DOI 10.1016/S0167-6393(00)00060-1 Carter R., 2001, CAMBRIDGE GUIDE TEAC Celce-Murcia M., 2000, DISCOURSE CONTEXT LA Chun Dorothy, 2002, DISCOURSE INTONATION Cruttenden A., 2008, GIMSONS PRONUNCIATIO Dalton C., 1994, PRONUNCIATION DELMONTE R, 1997, P ESCA EUR 97 RHOD, V2, P669 DONG B, 2008, 6 INT S CHIN SPOK LA DONG B, 2004, INT S CHIN SPOK LANG, P137 ELIMAM YA, 2005, IEICE T INFORM SYSTE ESKENAZI M, 1998, P STILL WORKSH SPEEC Eskenazi M., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607892 FACE T, 2006, J LANG LINGUIST, V15, P295 FLETCHER J, 2005, INTONATIONAL VARIATI Fonagy Ivan, 2001, LANGUAGES LANGUAGE E FRANCO H, 1997, ICASSP 97, V2, P1471 FURUI S, 1981, IEEE T ACOUST SPEECH, V29, P254, DOI 10.1109/TASSP.1981.1163530 GARNNUNN P, 1992, CALVERTS DESCRIPTIVE Gillick L., 1989, P ICASSP, P532 GRABE E, 2002, INTONATIONAL VARIATI GU L, 2003, INT S CIRC SYST ISCA, V2, P580 Guy G. R., 1984, AUSTR J LINGUISTICS, V4, P1, DOI 10.1080/07268608408599317 Hiller S., 1994, Computer Assisted Language Learning, V7, DOI 10.1080/0958822940070105 Holmes J., 2001, SPEECH SYNTHESIS REC Jenkins J., 2000, PHONOLOGY ENGLISH IN JIA H, 2008, P SPEECH PROS, P547 Jones R. H., 1997, SYSTEM, V25, P103, DOI 10.1016/S0346-251X(96)00064-4 Jurafsky Daniel, 2009, SPEECH LANGUAGE PROC, V2nd Kachru Yamuna, 1985, RELC J, V16, P1, DOI 10.1177/003368828501600201 KIM H, 2002, ICSLP 2002, P1225 LIANG W, 2005, P INT C COMM CIRC SY, V2, P857 Molina C, 2009, SPEECH COMMUN, V51, P485, DOI 10.1016/j.specom.2009.01.002 MORLEY J, 1991, TESOL QUART, V25, P481, DOI 10.2307/3586981 Moyer Alene, 2004, AGE ACCENT EXPERIENC NEUMEYER L, 1996, P ICSLP 96 OPPELSTRUP L, 2005, P FONETIK GOT PEABODY M, 2006, P 5 INT S CHIN SPOK Pennington M., 1989, RELC J, V20, P20, DOI 10.1177/003368828902000103 PETERS AM, 1977, LANGUAGE, V53, P560, DOI 10.2307/413177 PIERREHUMBERT J, 1990, INTENTIONS COMMUNICA RABINER LR, 1979, IEEE T ACOUST SPEECH, V27, P583, DOI 10.1109/TASSP.1979.1163323 RABINER LR, 1980, IEEE T ACOUST SPEECH, V28, P377, DOI 10.1109/TASSP.1980.1163422 RABINER LR, 1978, IEEE T ACOUST SPEECH, V26, P34, DOI 10.1109/TASSP.1978.1163037 RAMAN M, 2004, ENGLISH LANGUAGE TEA Ramirez Verdugo D., 2005, INTERCULT PRAGMAT, V2, P151 Roach P., 2008, ENGLISH PHONETICS PH Rypa M. E., 1999, CALICO Journal, V16 SAKOE H, 1978, IEEE T ACOUST SPEECH, V26 Saussure F. 
D., 2006, WRITINGS GEN LINGUIS SHIMIZU M, 2005, PHON TEACH LEARN C 2 STOUTEN F, 2006, P ICASSP SU P, 2006, P 5 INT C MACH LEARN Tao Hongyin, 1996, UNITS MANDARIN CONVE Teixeira C., 2000, P ICSLP TEPPERMAN J, 2007, IEEE T AUDIO SPEECH, V16, P8 Tepperman J., 2008, P INTERSPEECH ICSLP TRAYNOR PL, 2003, INSTRUCT PSYCHOL J, P137 van Santen JPH, 2009, SPEECH COMMUN, V51, P1082, DOI 10.1016/j.specom.2009.04.007 Wells John, 2006, ENGLISH INTONATION YOU K, 2004, INTERSPEECH 2004, P1857 ZHAO X, 2007, ISSSE 07, P59 NR 68 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2010 VL 52 IS 3 BP 254 EP 267 DI 10.1016/j.specom.2009.11.001 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 560GA UT WOS:000274888900007 ER PT J AU Wu, TY Duchateau, J Martens, JP Van Compernolle, D AF Wu, Tingyao Duchateau, Jacques Martens, Jean-Pierre Van Compernolle, Dirk TI Feature subset selection for improved native accent identification SO SPEECH COMMUNICATION LA English DT Article DE Accent identification; Language identification; Feature selection; Gaussian mixture model; Linear discriminant analysis; Support vector machine ID AUTOMATIC LANGUAGE IDENTIFICATION; SUPPORT VECTOR MACHINES; CANCER CLASSIFICATION; GENE SELECTION; SPEECH; RECOGNITION AB In this paper, we develop methods to identify accents of native speakers. Accent identification differs from other speaker classification tasks because accents may differ in a limited number of phonemes only and moreover the differences can be quite subtle. In this paper, it is shown that in such cases it is essential to select a small subset of discriminative features that can be reliably estimated and at the same time discard non-discriminative and noisy features. For identification purposes a speaker is modeled by a supervector containing the mean values for the features for all phonemes. Initial accent models are obtained as class means from the speaker supervectors. Then feature subset selection is performed by applying either ANOVA (analysis of variance), LDA (linear discriminant analysis), SVM-RFE (support vector machine-recursive feature elimination), or their hybrids, resulting in a reduced dimensionality of the speaker vector and more importantly a significantly enhanced recognition performance. We also compare the performance of GMM, LDA and SVM as classifiers on a full or a reduced feature subset. The methods are tested on a Flemish read speech database with speakers classified in five regions. The difficulty of the task is confirmed by a human listening experiment. We show that a relative improvement of more than 20% in accent recognition rate can be achieved with feature subset selection irrespective of the choice of classifier. We finally show that the construction of speaker-based supervectors significantly enhances results over a reference GMM system that uses the raw feature vectors directly as input, both in text dependent and independent conditions. (C) 2009 Elsevier B.V. All rights reserved. C1 [Wu, Tingyao; Duchateau, Jacques; Van Compernolle, Dirk] Katholieke Univ Leuven, PSI, ESAT, B-3001 Heverlee, Belgium. [Martens, Jean-Pierre] Univ Ghent, ELIS, Ghent, Belgium. RP Van Compernolle, D (reprint author), Katholieke Univ Leuven, PSI, ESAT, Kasteelpk Arenberg 10,B2441, B-3001 Heverlee, Belgium. 
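A minimal sketch of ANOVA-based selection and SVM-RFE on speaker supervectors as described in the Wu et al. record above, assuming synthetic data in which only a small block of features carries accent information; scikit-learn is used for convenience, and the dimensions, class counts, and hyperparameters are arbitrary.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Hypothetical speaker supervectors: 200 speakers x 600 phoneme-mean features,
# 5 accent classes; only the first 20 features are made accent-dependent.
X = rng.normal(size=(200, 600))
y = rng.integers(0, 5, size=200)
X[:, :20] += y[:, None] * 0.8

# ANOVA-based selection: keep the k features with the largest F statistic.
anova = SelectKBest(score_func=f_classif, k=30).fit(X, y)
print("ANOVA picks:", np.flatnonzero(anova.get_support())[:10], "...")

# SVM-RFE: recursively drop the features with the smallest SVM weights.
rfe = RFE(estimator=LinearSVC(dual=False, C=0.1, max_iter=5000),
          n_features_to_select=30, step=0.2).fit(X, y)
print("SVM-RFE picks:", np.flatnonzero(rfe.support_)[:10], "...")
```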
EM Tingyao.Wu@esat.kuleuven.be; Jacques.Duchateau@esat.kuleuven.be; Jean-Pierre.Martens@elis.ugent.be; Dirk.VanCompernolle@esat.kuleuven.be FU K.U. Leuven; Scientific Research Flanders [G.008.01, G.0260.07] FX This research was supported by the Research Fund of the K.U. Leuven, the Fund for Scientific Research Flanders (Projects G.008.01 and G.0260.07). CR Adda-Decker M, 1999, SPEECH COMMUN, V29, P83, DOI 10.1016/S0167-6393(99)00032-1 Cremelie N, 1999, SPEECH COMMUN, V29, P115, DOI 10.1016/S0167-6393(99)00034-5 BERKLING K, 1998, P ICSLP 98, V2, P89 Burges CJC, 1998, DATA MIN KNOWL DISC, V2, P121, DOI 10.1023/A:1009715923555 Burget L, 2006, P ICASSP 2006, V1, P209 CAMPBELL WM, 2008, P INT C AC SPEECH SI, P4141 Castaldo F., 2007, P INT, P346 CHEN T, 2001, P ASRU 2001 TREN IT, P343 DEMUYNCK K, 2008, P ICSLP, P495 Duan KB, 2005, IEEE T NANOBIOSCI, V4, P228, DOI 10.1109/TNB.2005.853657 GHESQUIERE P, 2002, P INT C AC SPEECH SI, V1, P749 Guyon I, 2002, MACH LEARN, V46, P389, DOI 10.1023/A:1012487302797 Guyon I., 2003, Journal of Machine Learning Research, V3, DOI 10.1162/153244303322753616 HANSEN JHL, 2004, P INT C SPOK LANG PR, P1569 HUANG RQ, 2006, P INT, P445 KNOPS U, 1984, TAAL TONGVAL, V25, P117 Kohavi R, 1997, ARTIF INTELL, V1, P273 LABOV W, 1996, P INT C SPOK LANG PR LAMEL LF, 1995, COMPUT SPEECH LANG, V9, P87, DOI 10.1006/csla.1995.0005 LINCOLN M, 1998, P INT C SPOK LANG PR, V2, P109 LIU MK, 2000, P IEEE INT C AC SPEE, V2, P1025 Matejka P., 2006, P OD 2006 SPEAK LANG, P57 Mertens P., 1998, FONILEX MANUAL Purnell T, 1999, J LANG SOC PSYCHOL, V18, P10, DOI 10.1177/0261927X99018001002 Rakotomamonjy A., 2003, Journal of Machine Learning Research, V3, DOI 10.1162/153244303322753706 SHEN W, 2008, P INT, P763 SHRIBERG E, 2008, P OD SPEAK LANG REC TENBOSCH L, 2000, P INT C SPOK LANG PR, V3, P1009 Thomas ER, 2000, J PHONETICS, V28, P1, DOI 10.1006/jpho.2000.0103 TORRESCARRASQUI.PA, 2002, P ICASSP, V1, P757 TORRESCARRASQUI.PA, 2004, P OD SPEAK LANG REC, P297 TORRESCARRASQUI.PA, 2008, P INT, P79 Tukey J.W., 1977, EXPLORATORY DATA ANA van Hout Roeland, 1999, ARTIKELEN DERDE SOCI, P183 WU T, 2009, THESIS KATHOLIEKE U Zissman MA, 1996, IEEE T SPEECH AUDI P, V4, P31, DOI 10.1109/TSA.1996.481450 Zissman MA, 2001, SPEECH COMMUN, V35, P115, DOI 10.1016/S0167-6393(00)00099-6 NR 37 TC 6 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2010 VL 52 IS 2 BP 83 EP 98 DI 10.1016/j.specom.2009.08.010 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 532QI UT WOS:000272764100001 ER PT J AU Hanson, EK Beukelman, DR Heidemann, JK Shutts-Johnson, E AF Hanson, Elizabeth K. Beukelman, David R. Heidemann, Jana Kahl Shutts-Johnson, Erin TI The impact of alphabet supplementation and word prediction on sentence intelligiblity of electronically distorted speech SO SPEECH COMMUNICATION LA English DT Article DE Alphabet supplementation; Word prediction; Language modeling; Speech intelligibility; Speech-generating device; SGD; Prototype; Dysarthria; AAC; Augmentative and alternative communication ID SEVERELY DYSARTHRIC SPEECH; STIMULUS COHESION; COMPRESSED SPEECH; LINGUISTIC CUES; ADAPTATION AB Alphabet supplementation is a low-tech augmentative and alternative communication (AAC) strategy that involves pointing to the first letter of each word spoken. 
Sentence intelligibility scores increased an average of 25% (Hanson et al., 2004) when speakers with moderate and severe dysarthria (a neurologic speech impairment) used alphabet supplementation strategies. This project investigated the impact of both alphabet supplementation and an electronic word prediction strategy, commonly used in augmentative and alternative communication technology, on the sentence intelligibility of normal natural speech that was electronically distorted to reduce intelligibility to the profound range of <30%. Results demonstrated large sentence intelligibility increases (average 80% increase) when distorted speech was supplemented with alphabet supplementation and word prediction. (C) 2009 Elsevier B.V. All rights reserved. C1 [Hanson, Elizabeth K.] Univ S Dakota, Dept Commun Disorders, Vermillion, SD 57069 USA. [Beukelman, David R.] Univ Nebraska, Barkley Mem Ctr 202, Lincoln, NE 68583 USA. RP Hanson, EK (reprint author), Univ S Dakota, Dept Commun Disorders, 414 E Clark St, Vermillion, SD 57069 USA. EM ekhanson@usd.edu; dbeukelma-n1@unl.edu FU National Institute of Disability and Rehabilitation Research [H113 980026]; US Department of Education FX This publication was produced in part under Grant #H113#980026 from the National Institute of Disability and Rehabilitation Research, US Department of Education. The opinions expressed in this publication are those of the grantee and do not necessarily reflect those of NID-RR or the Department of Education. CR Beliveau C., 1995, AUGMENTATIVE ALTERNA, V11, P176, DOI 10.1080/07434619512331277299 Berger K., 1967, J COMMUN DISORD, V1, P201, DOI 10.1016/0021-9924(68)90032-4 BEUKELMAN DR, 1977, J SPEECH HEAR DISORD, V42, P265 Beukelman DR, 2002, J MED SPEECH-LANG PA, V10, P237 Clarke CM, 2004, J ACOUST SOC AM, V116, P3647, DOI 10.1121/1.1815131 CROW E, 1989, RECENT ADV CLIN DYSA DARLEY FL, 1969, J SPEECH HEAR RES, V12, P462 DePaul R, 2000, AM J SPEECH-LANG PAT, V9, P230 Dowden P. A, 1997, AUGMENTATIVE ALTERNA, V13, P48, DOI DOI 10.1080/07434619712331277838 Dupoux E, 1997, J EXP PSYCHOL HUMAN, V23, P914, DOI 10.1037/0096-1523.23.3.914 Hanson EK, 2004, J MED SPEECH-LANG PA, V12, pIX HANSON EK, 2008, C MOT SPEECH MONT CA Hustad K. C, 2001, AUGMENTATIVE ALTERNA, V17, P213, DOI 10.1080/714043385 Hustad KC, 2001, J SPEECH LANG HEAR R, V44, P497, DOI 10.1044/1092-4388(2001/039) Hustad KC, 2008, J SPEECH LANG HEAR R, V51, P1438, DOI 10.1044/1092-4388(2008/07-0185) Hustad KC, 2003, J SPEECH LANG HEAR R, V46, P462, DOI 10.1044/1092-4388(2003/038) Hustad KC, 2002, J SPEECH LANG HEAR R, V45, P545, DOI 10.1044/1092-4388(2002/043) KING D, 2004, SPEAKING DYNAMICALLY LINDBLOM B, 1990, AAC (Augmentative and Alternative Communication), V6, P220, DOI 10.1080/07434619012331275504 NILSSON M, 1994, J ACOUST SOC AM, V95, P1085, DOI 10.1121/1.408469 Peelle JE, 2005, J EXP PSYCHOL HUMAN, V31, P1315, DOI 10.1037/0096-1523.31.6.1315 PISONI DB, 1985, SPEECH COMMUN, V4, P75, DOI 10.1016/0167-6393(85)90037-8 STUDEBAKER GA, 1985, J SPEECH HEAR RES, V28, P455 NR 23 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
PD FEB PY 2010 VL 52 IS 2 BP 99 EP 105 DI 10.1016/j.specom.2009.08.004 PG 7 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 532QI UT WOS:000272764100002 ER PT J AU Shue, YL Shattuck-Hufnagel, S Iseli, M Jun, SA Veilleux, N Alwan, A AF Shue, Yen-Liang Shattuck-Hufnagel, Stefanie Iseli, Markus Jun, Sun-Ah Veilleux, Nanette Alwan, Abeer TI On the acoustic correlates of high and low nuclear pitch accents in American English SO SPEECH COMMUNICATION LA English DT Article DE Pitch-accent correlates; Prosody; Voice quality; Tonal crowding ID MAXIMUM SPEED; REALIZATION; INTONATION AB Earlier findings in Shue et al. (2007, 2008) raised questions about the alignment of nuclear pitch accents in American English, which are addressed here by eliciting both high and low pitch accents in two different target words in several different positions in a single-phrase utterance (early, late but not final, and final) from 20 speakers (10 male, 10 female). Results show that the F-0 peak associated with a high nuclear pitch accent is systematically displaced to an earlier point in the target word if that word is final in the phrase and thus bears the boundary-related tones as well. This effect of tonal crowding holds across speakers, genders and target words, but was not observed for low accents, adding to the growing evidence that low targets behave differently from highs. Analysis of energy shows that, across target words and genders, the average energy level of a target word is greatest at the start of an utterance and decreases with increasing proximity to the utterance boundary. Duration measures confirm the findings of existing literature on main-stress-syllable lengthening, final syllable lengthening, and lengthening associated with pitch accents, and reveal that final syllable lengthening is further enhanced if the final word also carries a pitch accent. Individual speaker analyses found that while most speakers conformed to the general trends for pitch movements there were 2/10 male and 1/10 female speakers who did not. These results show the importance of taking into account prosodic contexts and speaker variability when interpreting correlates to prosodic events such as pitch accents. (C) 2009 Elsevier B.V. All rights reserved. C1 [Shue, Yen-Liang; Iseli, Markus; Alwan, Abeer] Univ Calif Los Angeles, Dept Elect Engn, Los Angeles, CA 90095 USA. [Shattuck-Hufnagel, Stefanie] MIT, Elect Res Lab, Cambridge, MA 02139 USA. [Jun, Sun-Ah] Univ Calif Los Angeles, Dept Linguist, Los Angeles, CA 90095 USA. [Veilleux, Nanette] Simmons Coll, Dept Comp Sci, Boston, MA 02115 USA. RP Shue, YL (reprint author), Univ Calif Los Angeles, Dept Elect Engn, 56-125B Engn 4 Bldg,Box 951594, Los Angeles, CA 90095 USA. EM yshue@ee.ucla.edu; stef@speech.mit.edu; iseli@ee.ucla.edu; jun@humnet.ucla.edu; nanette.veilleux@simmons.edu; alwan@ee.ucla.edu FU NSF FX We thank Dr. Patricia Keating for her helpful suggestions and advice during the preparation of this study. We also thank the speakers who participated in this experiment. This work was supported in part by the NSF. CR Arvaniti A, 2006, SPEECH COMMUN, V48, P667, DOI 10.1016/j.specom.2005.09.012 Arvaniti A., 2007, PAPERS LAB PHONOLOGY, V9, P547 Beckman M. E., 1994, PHONOLOGICAL STRUCTU, V3, P7 Beckman M.
E., 1986, PHONOLOGY YB, V3, P255, DOI 10.1017/S095267570000066X CHAFE WL, 1993, TALKING DATA TRANSCR, P3 Dilley L, 1996, J PHONETICS, V24, P423, DOI 10.1006/jpho.1996.0023 Fougeron C, 1998, J PHONETICS, V26, P45, DOI 10.1006/jpho.1997.0062 Grabe E, 2000, J PHONETICS, V28, P161, DOI 10.1006/jpho.2000.0111 Hirschberg J., 1986, P 24 ANN M ASS COMP, P136, DOI 10.3115/981131.981152 Iseli M, 2007, J ACOUST SOC AM, V121, P2283, DOI 10.1121/1.2697522 Jilka M., 2007, P INT 2007, P2621 KAWAHARA H, 1998, P ICSLP SYD AUSTR KLATT DH, 1976, J ACOUST SOC AM, V59, P1208, DOI 10.1121/1.380986 KLATT DH, 1976, IEEE T ACOUST SPEECH, V35, P445 Kochanski G, 2005, J ACOUST SOC AM, V118, P1038, DOI 10.1121/1.1923349 Ladd D. R., 1996, INTONATIONAL PHONOLO LEVI SV, INTONATION TUR UNPUB Mucke D, 2009, J PHONETICS, V37, P321, DOI 10.1016/j.wocn.2009.03.005 Ode C, 2005, SPEECH COMMUN, V47, P71, DOI 10.1016/j.specom.2005.06.004 OHALA JJ, 1973, J ACOUST SOC AM, V53, P345, DOI 10.1121/1.1982441 Pierrehumbert J, 1980, THESIS MIT PIERREHUMBERT JB, 1991, PAPERS LAB PHONOLOGY, V2, P90 ROSENBERG A, 2006, P INT PITTSB, P301 ShattuckHufnagel S, 1996, J PSYCHOLINGUIST RES, V25, P193, DOI 10.1007/BF01708572 SHUE YL, 2008, P INT BRISB AUSTR, P873 SHUE YL, 2007, P INT ANTW BELG, P2625 Silverman Kim E. A., 1990, PAPERS LABORATORY PH, P72 SLIFKA J, 2007, J VOICE, V20, P171 Sluijter A. M. C., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607440 Sluijter AMC, 1996, J ACOUST SOC AM, V100, P2471, DOI 10.1121/1.417955 SUNDBERG J, 1979, J PHONETICS, V7, P71 TITZE IR, 1989, J ACOUST SOC AM, V85, P1699, DOI 10.1121/1.397959 Turk AE, 2007, J PHONETICS, V35, P445, DOI 10.1016/j.wocn.2006.12.001 Turk AE, 1999, J PHONETICS, V27, P171, DOI 10.1006/jpho.1999.0093 Xu Y, 2002, J ACOUST SOC AM, V111, P1399, DOI 10.1121/1.1445789 NR 35 TC 0 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2010 VL 52 IS 2 BP 106 EP 122 DI 10.1016/j.specom.2009.08.005 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 532QI UT WOS:000272764100003 ER PT J AU Vicente-Pena, J Diaz-de-Maria, F Kleijn, WB AF Vicente-Pena, Jesus Diaz-de-Maria, Fernando Kleijn, W. Bastiaan TI The synergy between bounded-distance HMM and spectral subtraction for robust speech recognition SO SPEECH COMMUNICATION LA English DT Article DE Robust speech recognition; Spectral subtraction; Acoustic backing-off; Bounded-distance HMM; Missing features; Outliers ID NOISE; FEATURES AB Additive noise generates important losses in automatic speech recognition systems. In this paper, we show that one of the causes contributing to these losses is the fact that conventional recognisers take into consideration feature values that are outliers. The method that we call bounded-distance HMM is a suitable method to avoid that outliers contribute to the recogniser decision. However, this method just deals with outliers, leaving the remaining features unaltered. In contrast, spectral subtraction is able to correct all the features at the expense of introducing some artifacts that, as shown in the paper, cause a larger number of outliers. As a result, we find that bounded-distance HMM and spectral subtraction complement each other well. 
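The two ingredients discussed in the abstract above can each be sketched in a few lines, assuming a power-domain spectrum and a diagonal-Gaussian state model; the over-subtraction factor, spectral floor, and distance cap below are illustrative values, not the paper's settings.

```python
import numpy as np

def spectral_subtraction(noisy_power, noise_power, alpha=2.0, beta=0.01):
    """Power-spectral subtraction with over-subtraction (alpha) and a spectral floor (beta).

    The floor is what produces the residual artifacts mentioned in the abstract.
    """
    return np.maximum(noisy_power - alpha * noise_power, beta * noisy_power)

def bounded_log_likelihood(x, mean, var, tau=9.0):
    """Diagonal-Gaussian log-likelihood with each squared distance capped at tau.

    Capping the per-dimension Mahalanobis term keeps outlying feature values
    from dominating the state score, which is the spirit of a bounded distance.
    """
    d2 = np.minimum((x - mean) ** 2 / var, tau)
    return -0.5 * np.sum(d2 + np.log(2 * np.pi * var))

noisy = np.array([12.0, 3.0, 9.0, 1.5])       # hypothetical one-frame power spectrum
noise = np.array([2.0, 2.5, 1.0, 1.2])        # estimated noise power
print(spectral_subtraction(noisy, noise))

x = np.array([0.2, -0.1, 8.0])                # last dimension is an outlier
print(bounded_log_likelihood(x, np.zeros(3), np.ones(3)))
```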
A comprehensive experimental evaluation was conducted, considering several well-known ASR tasks (of different complexities) and numerous noise types and SNRs. The achieved results show that the suggested combination generally outperforms both the bounded-distance HMM and spectral subtraction individually. Furthermore, the obtained improvements, especially for low and medium SNRs, are larger than the sum of the improvements individually obtained by bounded-distance HMM and spectral subtraction. (C) 2009 Elsevier B.V. All rights reserved. C1 [Vicente-Pena, Jesus; Diaz-de-Maria, Fernando] Univ Carlos III Madrid, EPS, Dept Signal Proc & Commun, Madrid 28911, Spain. [Kleijn, W. Bastiaan] KTH Royal Inst Technol, Sound & Image Proc Lab, Stockholm, Sweden. RP Vicente-Pena, J (reprint author), Univ Carlos III Madrid, EPS, Dept Signal Proc & Commun, Avda Univ 30, Madrid 28911, Spain. EM jvicente@tsc.uc3m.es; fdiaz@tsc.uc3m.es; bastiaan.kleijn@ee.kth.se RI Diaz de Maria, Fernando/E-8048-2011 FU Spanish Regional [CCG06-UC3M/TIC-0812] FX This work has been partially supported by Spanish Regional Grant No. CCG06-UC3M/TIC-0812. CR BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 *CMU, 1998, CMU V 0 6 PRON DICT Cooke M, 2001, SPEECH COMMUN, V34, P267, DOI 10.1016/S0167-6393(00)00034-0 de la Torre A, 2005, IEEE T SPEECH AUDI P, V13, P355, DOI 10.1109/TSA.2005.845805 DENG L, 2000, P ICSLP, P806 de Veth J, 2001, SPEECH COMMUN, V34, P247, DOI 10.1016/S0167-6393(00)00037-6 de Veth J, 2001, SPEECH COMMUN, V34, P57, DOI 10.1016/S0167-6393(00)00046-7 DEVETH J, 1998, P INT C SPOK LANG PR, P1427 FLORES JAN, 1994, P ICASSP, V1, P409 FURUI S, 1981, IEEE T ACOUST SPEECH, V29, P254, DOI 10.1109/TASSP.1981.1163530 Gales MJF, 1996, COMPUT SPEECH LANG, V10, P249, DOI 10.1006/csla.1996.0013 Gales MJF, 1996, IEEE T SPEECH AUDI P, V4, P352, DOI 10.1109/89.536929 GONG YF, 1995, SPEECH COMMUN, V16, P261, DOI 10.1016/0167-6393(94)00059-J HAIN T, 1999, P ICASSP 99, V1, P57 Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616 HIRSCH G, 2002, AU41702 ETSI STQ DSR MACHO D, 2000, SPANISH SDC AURORA D Martin R, 2001, IEEE T SPEECH AUDI P, V9, P504, DOI 10.1109/89.928915 MATSUI T, 1992, P INT C AC SPEECH SI, V2, P157 Matsui T, 1991, P IEEE INT C AC SPEE, V1, P377 Moreno PJ, 1996, P IEEE INT C AC SPEE, V2, P733 NADEU C, 1995, P EUR 95, P1381 Nadeu C, 2001, SPEECH COMMUN, V34, P93, DOI 10.1016/S0167-6393(00)00048-0 *NIST, 1992, NIST RES MAN CORP RM Paliwal KK, 1999, P EUR C SPEECH COMM, P85 PAUL DB, 1992, HLT 91, P357 PUJOL P, 2004, P INT C SPOK LANG PR RABINER LR, 1989, P IEEE, V77, P257, DOI 10.1109/5.18626 Raj B., 2000, THESIS CARNEGIE MELL Raj B, 2005, IEEE SIGNAL PROC MAG, V22, P101, DOI 10.1109/MSP.2005.1511828 Raj B, 2004, SPEECH COMMUN, V43, P275, DOI 10.1016/j.specom.2004.03.007 Seltzer ML, 2004, SPEECH COMMUN, V43, P379, DOI 10.1016/j.specom.2004.03.006 SHOZAKAI M, 1997, IEEE WORKSH AUT SPEE, P450 VARGA AP, 1992, NOISEX 92 STUDY EFFE Viikki O, 1998, SPEECH COMMUN, V25, P133, DOI 10.1016/S0167-6393(98)00033-8 Young S., 2002, HTK BOOK HTK VERSION NR 36 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2010 VL 52 IS 2 BP 123 EP 133 DI 10.1016/j.specom.2009.09.002 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 532QI UT WOS:000272764100004 ER PT J AU Hansen, JHL Zhang, XX AF Hansen, John H. L. 
Zhang, Xianxian TI Analysis of CFA-BF: Novel combined fixed/adaptive beamforming for robust speech recognition in real car environments SO SPEECH COMMUNICATION LA English DT Article DE Array processing; Robust speech recognition; In-vehicle speech systems; Beamforming ID ADAPTIVE BEAMFORMER; MICROPHONE ARRAY; TIME-DELAY; ENHANCEMENT; NOISE; ALGORITHM; SPECTRUM; FILTERS AB Among a number of studies which have investigated various speech enhancement and processing schemes for in-vehicle speech systems, the delay-and-sum beamforming (DASB) and adaptive beamforming are two typical methods that both have their advantages and disadvantages. In this paper, we propose a novel combined fixed/adaptive beamforming solution (CFA-BF) based on previous work for speech enhancement and recognition in real moving car environments, which seeks to take advantage of both methods. The working scheme of CFA-BF consists of two steps: source location calibration and target signal enhancement. The first step is to pre-record the transfer functions between the speaker and microphone array from different potential source positions using adaptive beamforming under quiet environments; and the second step is to use this pre-recorded information to enhance the desired speech when the car is running on the road. An evaluation using extensive actual car speech data from the CU-Move Corpus shows that the method can decrease WER for speech recognition by up to 30% over a single channel scenario and improve speech quality via the SEGSNR measure by up to 1 dB on the average. (C) 2009 Elsevier B.V. All rights reserved. C1 [Hansen, John H. L.; Zhang, Xianxian] Univ Texas Dallas, CRSS, Dept Elect Engn, Erik Jonsson Sch Engn & Comp Sci, Richardson, TX 75083 USA. RP Hansen, JHL (reprint author), Univ Texas Dallas, CRSS, Dept Elect Engn, Erik Jonsson Sch Engn & Comp Sci, EC33,POB 830688, Richardson, TX 75083 USA. EM John.Hansen@utdallas.edu FU DARPA through SPAWAR [N66001-8906]; University of Texas at Dallas [EM-MITT] FX This project was supported by Grants from DARPA through SPAWAR under Grant No. N66001-8906, and by the University of Texas at Dallas under Project EM-MITT. Any opinions, findings and conclusions expressed in this material are those of the authors and do not necessarily reflect the views of DARPA. CR ABUT H, 2002, IEEE DL LECT JAP HON Brandstein M., 2001, MICROPHONE ARRAYS CAPON J, 1969, P IEEE, V57, P1408, DOI 10.1109/PROC.1969.7278 Compernolle D. 
V., 1990, SPEECH COMMUN, V9, P433 COMPERNOLLE DV, 1990, P IEEE ICASSP 90, V2, P833 DELLER JR, 2000, DISCRETE TIME PROCES, pCH8 FROST OL, 1972, PR INST ELECTR ELECT, V60, P926, DOI 10.1109/PROC.1972.8817 GALANENKO V, 2001, P IEEE ICASSP, V5, P3017 GAZOR S, 1995, IEEE T SPEECH AUDIO, V3, P94 GAZOR S, 1994, P IEEE ICASSP, V4, P557 GIULIANI D, 1996, P IEEE ICSLP 96, V3, P1329, DOI 10.1109/ICSLP.1996.607858 GOULDING MM, 1990, IEEE T VEH TECHNOL, V39, P316, DOI 10.1109/25.61353 Grenier Y., 1992, P IEEE ICASSP, V1, P305 GRIFFITHS LJ, 1982, IEEE T ANTENN PROPAG, V30, P27, DOI 10.1109/TAP.1982.1142739 HAAN JMD, 2003, IEEE T SPEECH AUDIO, V11, P14 Hansen JHL, 1994, IEEE T SPEECH AUDI P, V2, P598, DOI 10.1109/89.326618 HANSEN JHL, 1991, IEEE T SIGNAL PROCES, V39, P795, DOI 10.1109/78.80901 HANSEN JHL, 1995, J ACOUST SOC AM, V97, P609, DOI 10.1121/1.412283 Hansen JHL, 2000, P IEEE ICSLP 2000, V1, P524 HANSEN JHL, 2001, INTERSPEECH 01 EUROS, V3, P2023 Haykin S., 1985, ARRAY SIGNAL PROCESS Hoshuyama O, 1999, IEEE T SIGNAL PROCES, V47, P2677, DOI 10.1109/78.790650 Jelinek F., 1980, Pattern Recognition in Practice. Proceedings of an International Workshop Jensen J, 2001, IEEE T SPEECH AUDI P, V9, P731, DOI 10.1109/89.952491 Johnson D, 1993, ARRAY SIGNAL PROCESS Kaiser J.F., 1993, P INT C AC SPEECH SI, V3, P149 KNAPP CH, 1976, IEEE T ACOUST SPEECH, V24, P320, DOI 10.1109/TASSP.1976.1162830 KOMAROW S, 2000, USA TODAY 0912 KOROMPIS D, 1995, P IEEE ICASSP 95, V4, P2739 LANG SW, 1980, IEEE T ACOUST SPEECH, V28, P716, DOI 10.1109/TASSP.1980.1163467 Li WF, 2005, IEEE SIGNAL PROC LET, V12, P340, DOI 10.1109/LSP.2005.843761 MEYER J, 1997, P ICASSP 97, V2, P1167 NANDKUMAR S, 1995, IEEE T SPEECH AUDI P, V3, P22, DOI 10.1109/89.365384 Nordholm S, 1999, IEEE T SPEECH AUDI P, V7, P241, DOI 10.1109/89.759030 Oh S., 1992, P IEEE ICASSP 92, V1, P281 OMOLOGO M, 1996, P IEEE INT C AC SPEE, V2, P921 OMOLOGO M, 1994, P IEEE ICASSP 94, V2, P860 Pellom B, 2001, TRCSLR200101 U COL Pellom BL, 1998, IEEE T SPEECH AUDI P, V6, P573, DOI 10.1109/89.725324 Pillai S., 1989, ARRAY SIGNAL PROCESS PLUCIENKOWSKI J, 2001, P EUROSPEECH 01, V3, P1573 RABINER L, 1993, FUNDAMENTALS SPEECH, P447 REED FA, 1981, IEEE T ACOUST SPEECH, V29, P561, DOI 10.1109/TASSP.1981.1163614 SENADJI B, 1993, P IEEE ICASSP 93, V1, P321 SHINDE T, 2002, P INT 02 ICSLP 02 DE SVAIZER P, 1997, P IEEE ICASSP, V1, P231 VISSER E, 2002, P INT 02 ICSLP 02 DE WAHAB A, 1998, P 5 INT C CONTR AUT WAHAB A, 1997, P 1 INT C INF COMM S WALLACE RB, 1992, IEEE T CIRCUITS-II, V39, P239, DOI 10.1109/82.136574 Widrow B, 1985, ADAPTIVE SIGNAL PROC Yamada T, 2002, IEEE T SPEECH AUDI P, V10, P48, DOI 10.1109/89.985542 YAPANEL U, 2002, P ICSLP 2002, V2, P793 ZHANG XX, 2000, INT C SIGN PROC TECH ZHANG XX, 2003, P INT 03 EUR 03 GEN Zhang XX, 2003, IEEE T SPEECH AUDI P, V11, P733, DOI 10.1109/TSA.2003.818034 Zhou GJ, 2001, IEEE T SPEECH AUDI P, V9, P201, DOI 10.1109/89.905995 NR 57 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
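For the Hansen and Zhang record above, the fixed delay-and-sum half of the combined beamformer can be sketched as follows, assuming integer steering delays and synthetic microphone signals; the adaptive calibration stage of CFA-BF is not reproduced here.

```python
import numpy as np

def delay_and_sum(signals, delays_samples):
    """Fixed delay-and-sum beamformer: undo each channel's steering delay, then average.

    signals: (channels x samples) array; delays_samples: integer delay of the
    target source at each microphone relative to a reference microphone.
    """
    channels, n = signals.shape
    out = np.zeros(n)
    for ch in range(channels):
        out += np.roll(signals[ch], -delays_samples[ch])
    return out / channels

# Hypothetical 4-microphone capture of a 200 Hz tone arriving with known delays.
fs, n = 16000, 1600
t = np.arange(n) / fs
clean = np.sin(2 * np.pi * 200 * t)
delays = [0, 3, 6, 9]
rng = np.random.default_rng(0)
mics = np.stack([np.roll(clean, d) + 0.5 * rng.normal(size=n) for d in delays])

enhanced = delay_and_sum(mics, delays)
print("noise power per mic ~", round((mics[0] - np.roll(clean, 0)).var(), 3))
print("noise power after summing ~", round((enhanced - clean).var(), 3))
```

Averaging four time-aligned channels cuts the uncorrelated noise power by roughly a factor of four, which is the textbook gain a fixed beamformer trades against the steering flexibility of the adaptive stage.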
PD FEB PY 2010 VL 52 IS 2 BP 134 EP 149 DI 10.1016/j.specom.2009.09.001 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 532QI UT WOS:000272764100005 ER PT J AU Teoh, ABJ Chong, LY AF Teoh, Andrew Beng Jin Chong, Lee-Ying TI Secure speech template protection in speaker verification system SO SPEECH COMMUNICATION LA English DT Article DE Cancellable biometrics; Random projection; Speaker verification; 2D subspace projection methods; Gaussian mixture model ID DATA PERTURBATION; PRIVACY; MODELS; IDENTIFICATION AB Due to biometric template characteristics that are non-revocable and susceptible to privacy invasion, cancellable biometrics has been introduced to tackle these issues. In this paper, we present a two-factor cancellable formulation for speech biometrics, which we refer to as probabilistic random projection (PRP). PRP offers strong protection on the speech template by hiding the actual speech feature through the random subspace projection process. Besides, the speech template is replaceable and can be reissued when it is compromised. Our proposed method enables the generation of different speech templates from the same speech feature, which means linkability does not exist between the speech templates. The cancellable formulation retains the performance of the conventional biometric. Besides that, we also propose 2D subspace projection techniques for speech feature extraction, namely 2D Principle Component Analysis (2DPCA) and 2D CLAss-Featuring Information Compression (2DCLAFIC) to accommodate the requirements of the PRP formulation. (C) 2009 Elsevier B.V. All rights reserved. C1 [Teoh, Andrew Beng Jin] Yonsei Univ, Coll Engn, Seoul 120749, South Korea. [Chong, Lee-Ying] Multimedia Univ, Fac Informat Sci & Technol, Jalan Ayer Keroh Lama 75450, Melaka, Malaysia. RP Teoh, ABJ (reprint author), Yonsei Univ, Coll Engn, Seoul 120749, South Korea. EM bjteoh@yonsei.ac.kr; lychong@mmu.edu.my RI Chong, Lee-Ying/B-3506-2010; Teoh, Andrew Beng Jin/F-4422-2010 FU Korea Science and Engineering Foundation (KOSEF) through the Biometrics Engineering Research Center (BERC) at Yonsei University [112002105080020] FX This work was supported by the Korea Science and Engineering Foundation (KOSEF) through the Biometrics Engineering Research Center (BERC) at Yonsei University (Grant No. R 112002105080020 (2009)).
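The random subspace projection underlying the Teoh and Chong record above can be illustrated with a minimal sketch, assuming a Gaussian projection matrix seeded by a user-specific token; the feature dimensions, seeds, and the function name cancellable_template are hypothetical, and the paper's probabilistic two-factor formulation is richer than this.

```python
import numpy as np

def cancellable_template(features, user_token_seed, out_dim=20):
    """Project a speech feature matrix with a token-seeded random matrix.

    features: (frames x dim) array. The projection hides the raw features;
    re-issuing a template just means issuing a new seed (token).
    """
    rng = np.random.default_rng(user_token_seed)
    dim = features.shape[1]
    projection = rng.normal(size=(dim, out_dim)) / np.sqrt(out_dim)
    return features @ projection

rng = np.random.default_rng(123)
mfcc_like = rng.normal(size=(300, 39))            # hypothetical speech features

enrolled = cancellable_template(mfcc_like, user_token_seed=42)
reissued = cancellable_template(mfcc_like, user_token_seed=4242)

# Same speech, different tokens: the two templates are essentially uncorrelated,
# which is the unlinkability property claimed for cancellable templates.
corr = np.corrcoef(enrolled.ravel(), reissued.ravel())[0, 1]
print("correlation between old and re-issued template:", round(corr, 3))
```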
CR Ang R., 2005, P 10 AUSTR C INF SEC, P242 Ariki Y., 1996, Proceedings of the 13th International Conference on Pattern Recognition, DOI 10.1109/ICPR.1996.546989 BOULT T, 2006, 7 INT C AUT FAC GEST, P560, DOI 10.1109/FGR.2006.94 CHONG LY, 2007, 5 IEEE WORKSH AUT ID, P445 CHONG LY, 2007, P INT C ROB VIS INF, P525 Dasgupta S., 2000, P 16 C UNC ART INT, P143 DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1 Gersho A., 1992, VECTOR QUANTIZATION HARRAG A, 2005, 2005 ANN IEEE INDICO, P237, DOI 10.1109/INDCON.2005.1590163 Higgins A., 1991, Digital Signal Processing, V1, DOI 10.1016/1051-2004(91)90098-6 Kargupta H, 2005, KNOWL INF SYST, V7, P387, DOI 10.1007/s10115-004-0173-6 Laaksonen J., 1996, P INT C ART NEUR NET, P227 Liu K, 2006, IEEE T KNOWL DATA EN, V18, P92 MATSUMOTO T, 2002, OPT SECUR COUNTERFEI, V4, P4677 Ratha NK, 2007, IEEE T PATTERN ANAL, V29, P561, DOI 10.1109/TPAMI.2007.1004 Ratha NK, 2001, IBM SYST J, V40, P614 REYNOLDS DA, 1995, SPEECH COMMUN, V17, P91, DOI 10.1016/0167-6393(95)00009-D Reynolds DA, 2000, DIGIT SIGNAL PROCESS, V10, P19, DOI 10.1006/dspr.1999.0361 REYNOLDS DA, 1995, IEEE T SPEECH AUDI P, V3, P72, DOI 10.1109/89.365379 ROSCA J, 2003, P ICA 2003 NAR JAP A, P999 SAVVIDES M, 2004, INT C PATTERN RECOGN, V3, P922 Teoh ABJ, 2007, IEEE T SYST MAN CY B, V37, P1096, DOI 10.1109/TSMCB.2007.903538 Teoh B. J .A., 2004, PATTERN RECOGN, V37, P2245 TEOH BJA, 2006, IEEE T PATTERN ANAL, V28, P1892 TULYAKOV S, 2005, ICAPR, P30 Yang J, 2004, IEEE T PATTERN ANAL, V26, P131 ZHANG W, 2003, IEEE INT C SYST MAN, V5, P4147 NR 27 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2010 VL 52 IS 2 BP 150 EP 163 DI 10.1016/j.specom.2009.09.003 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 532QI UT WOS:000272764100006 ER PT J AU Pucher, M Schabus, D Yamagishi, J Neubarth, F Strom, V AF Pucher, Michael Schabus, Dietmar Yamagishi, Junichi Neubarth, Friedrich Strom, Volker TI Modeling and interpolation of Austrian German and Viennese dialect in HMM-based speech synthesis SO SPEECH COMMUNICATION LA English DT Article DE Speech synthesis; Hidden Markov model; Dialect; Sociolect; Austrian German ID SYNTHESIS SYSTEM; ALGORITHM AB An HMM-based speech synthesis framework is applied to both standard Austrian German and a Viennese dialectal variety and several training strategies for multi-dialect modeling such as dialect clustering and dialect-adaptive training are investigated. For bridging the gap between processing on the level of HMMs and on the linguistic level, we add phonological transformations to the HMM interpolation and apply them to dialect interpolation. The crucial steps are to employ several formalized phonological rules between Austrian German and Viennese dialect as constraints for the HMM interpolation. We verify the effectiveness of this strategy in a number of perceptual evaluations. Since the HMM space used is not articulatory but acoustic space, there are some variations in evaluation results between the phonological rules. However, in general we obtained good evaluation results which show that listeners can perceive both continuous and categorical changes of dialect varieties by using phonological transformations employed as switching rules in the HMM interpolation. (C) 2009 Elsevier B.V. All rights reserved. C1 [Pucher, Michael; Schabus, Dietmar] Telecommun Res Ctr Vienna Ftw, A-1220 Vienna, Austria.
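A minimal sketch of model interpolation constrained by a phonological rule, in the spirit of the Pucher et al. record above: Gaussian state statistics of the two varieties are linearly interpolated only for phones inside the rule's domain. The phone set, state statistics, and weighting below are invented for illustration.

```python
import numpy as np

def interpolate_state(mean_a, mean_b, var_a, var_b, alpha):
    """Linear interpolation of two Gaussian output distributions (means and variances)."""
    mean = (1.0 - alpha) * mean_a + alpha * mean_b
    var = (1.0 - alpha) * var_a + alpha * var_b
    return mean, var

def interpolate_voice(states_std, states_dia, alpha, rule_phones):
    """Interpolate only the phones covered by a phonological rule; copy the rest.

    states_*: dict phone -> (mean, var). rule_phones: phones the rule rewrites.
    Outside the rule's domain the standard-variety model is kept unchanged,
    so the rule acts as a switch on top of the numerical interpolation.
    """
    out = {}
    for phone, (m_std, v_std) in states_std.items():
        if phone in rule_phones and phone in states_dia:
            m_dia, v_dia = states_dia[phone]
            out[phone] = interpolate_state(m_std, m_dia, v_std, v_dia, alpha)
        else:
            out[phone] = (m_std, v_std)
    return out

# Hypothetical 3-dimensional spectral statistics for two phones in two varieties.
std = {"a": (np.zeros(3), np.ones(3)), "l": (np.ones(3), np.ones(3))}
dia = {"a": (np.full(3, 2.0), np.full(3, 1.5)), "l": (np.full(3, -1.0), np.ones(3))}
print(interpolate_voice(std, dia, alpha=0.5, rule_phones={"a"})["a"])
```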
[Yamagishi, Junichi; Strom, Volker] Univ Edinburgh, Ctr Speech Technol Res, Edinburgh EH8 9AB, Midlothian, Scotland. [Neubarth, Friedrich] Austrian Res Inst Artificial Intelligence OFAI, A-1010 Vienna, Austria. RP Pucher, M (reprint author), Telecommun Res Ctr Vienna Ftw, Donau City Str 1,3rd Floor, A-1220 Vienna, Austria. EM pucher@ftw.at FU Austrian Government; City of Vienna within the competence center; Austrian Federal Ministry for Transport, Innovation, and Technology; European Community's Seventh Framework [FP7/2007-2013, 213845] FX The project "Viennese Sociolect and Dialect Synthesis" is funded by the Vienna Science and Technology Fund (WWTF). The Telecommunications Research Center Vienna (FTW) is supported by the Austrian Government and the City of Vienna within the competence center program COMET. OFAI is supported by the Austrian Federal Ministry for Transport, Innovation, and Technology and by the Austrian Federal Ministry for Science and Research. Junichi Yamagishi is funded by the European Community's Seventh Framework Programme (FP7/2007-2013) under Grant Agreement No. 213845 (the EMIME project). We thank Dr. Simon King and Mr. Oliver Watts of the University of Edinburgh for their valuable comments and proofreading. We also thank the reviewers for their valuable suggestions. CR Anastasakos T., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607807 Black A., 2007, P ICASSP 2007, P1229 Cox T., 2001, MULTIDIMENSIONAL SCA CREER S, 2009, ADV CLIN NEUROSCI RE, V9, P16 FITT S, 1999, P EUR, V2, P823 FRASER M, 2007, P BLIZZ 2007 Fukada T., 1992, P ICASSP, P137, DOI 10.1109/ICASSP.1992.225953 Gales MJF, 1998, COMPUT SPEECH LANG, V12, P75, DOI 10.1006/csla.1998.0043 Garman M., 1990, PSYCHOLINGUISTICS Karaiskos V., 2008, P BLIZZ CHALL WORKSH Kawahara H., 2001, 2 MAVEBA Kawahara H, 1999, SPEECH COMMUN, V27, P187, DOI 10.1016/S0167-6393(98)00085-5 LIBERMAN AM, 1970, PERCEPT DISORD, V48 Ling Z.H., 2008, P INT, P573 Ling ZH, 2009, IEEE T AUDIO SPEECH, V17, P1171, DOI 10.1109/TASL.2009.2014796 Moosmuller Sylvia, 1987, SOZIOPHONOLOGISCHE V MOULINES E, 1990, SPEECH COMMUN, V9, P453, DOI 10.1016/0167-6393(90)90021-Z Muhr Rudolf, 2007, OSTERREICHISCHES AUS NEUBARTH F, 2008, P 9 ANN C INT SPEECH, P1877 Saussure F, 1916, COURSE GEN LINGUISTI SCHONLE PW, 1987, BRAIN LANG, V31, P26, DOI 10.1016/0093-934X(87)90058-7 Shinoda K., 2000, Journal of the Acoustical Society of Japan (E), V21, DOI 10.1250/ast.21.79 Stevens K. 
N., 1997, HDB PHONETIC SCI, P462 Tachibana M, 2005, IEICE T INF SYST, VE88D, P2484, DOI 10.1093/ietisy/e88-d.11.2484 Toda T, 2007, IEICE T INF SYST, VE90D, P816, DOI 10.1093/ietisy/e90-d.5.816 TOKUDA K, 1991, IEICE T FUND ELECTR, V74, P1240 Yamagishi J, 2009, IEEE T AUDIO SPEECH, V17, P66, DOI 10.1109/TASL.2008.2006647 Yamagishi J., 2008, P BLIZZ CHALL 2008 Yamagishi J, 2009, IEEE T AUDIO SPEECH, V17, P1208, DOI 10.1109/TASL.2009.2016394 Yoshimura T., 1999, P EUROSPEECH 99 SEPT, P2374 Yoshimura T., 2001, P EUROSPEECH, P2263 Yoshimura T., 2000, Journal of the Acoustical Society of Japan (E), V21, DOI 10.1250/ast.21.199 Young SJ, 1994, P ARPA HUM LANG TECH, P307, DOI 10.3115/1075812.1075885 Zen H, 2007, IEICE T INF SYST, VE90D, P825, DOI 10.1093/ietisy/e90-d.5.825 Zen H, 2009, SPEECH COMMUN, V51, P1039, DOI 10.1016/j.specom.2009.04.004 Zen HG, 2007, IEICE T INF SYST, VE90D, P325, DOI 10.1093/ietisy/e90-d.1.325 NR 36 TC 11 Z9 11 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2010 VL 52 IS 2 BP 164 EP 179 DI 10.1016/j.specom.2009.09.004 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 532QI UT WOS:000272764100007 ER PT J AU Lu, X Matsuda, S Unoki, M Nakamura, S AF Lu, X. Matsuda, S. Unoki, M. Nakamura, S. TI Temporal contrast normalization and edge-preserved smoothing of temporal modulation structures of speech for robust speech recognition SO SPEECH COMMUNICATION LA English DT Article DE Robust speech recognition; Temporal modulation; Mean and variance normalization; Edge-preserved smoothing; Modulation object ID BILATERAL FILTER; SPECTRUM; FEATURES AB Traditionally, noise reduction methods for additive noise have been quite different from those for reverberation. In this study, we investigated the effect of additive noise and reverberation on speech on the basis of the concept of temporal modulation transfer. We first analyzed the noise effect on the temporal modulation of speech. Then on the basis of this analysis, we proposed a two-stage processing algorithm that adaptively normalizes the temporal modulation of speech to extract robust speech features for automatic speech recognition. In the first stage of the proposed algorithm, the temporal modulation contrast of the cepstral time series for both clean and noisy speech is normalized. In the second stage, the contrast normalized temporal modulation spectrum is smoothed in order to reduce the artifacts due to noise while preserving the information in the speech modulation events (edges). We tested our algorithm in speech recognition experiments for additive noise condition, reverberant condition, and noisy condition (both additive noise and reverberation) using the AURORA-2J data corpus. Our results showed that as part of a uniform processing framework, the algorithm helped achieve the following: (1) for the additive noise condition, a 55.85% relative word error reduction (RWER) rate when clean conditional training was performed, and a 41.64% RWER rate when multi-conditional training was performed, (2) for the reverberant condition, a 51.28% RWER rate, and (3) for the noisy condition (both additive noise and reverberation), a 95.03% RWER rate. 
In addition, we evaluated the performance of each stage of the proposed algorithm in AURORA-2J and AURORA4 experiments, and compared the performance of our algorithm with the performances of two similar processing algorithms in the second stage. The evaluation results further confirmed the effectiveness of our proposed algorithm. (C) 2009 Elsevier B.V. All rights reserved. C1 [Lu, X.; Matsuda, S.; Nakamura, S.] Natl Inst Informat & Commun Technol, Tokyo, Japan. [Unoki, M.] Japan Adv Inst Sci & Technol, Kanazawa, Ishikawa, Japan. RP Lu, X (reprint author), Natl Inst Informat & Commun Technol, Tokyo, Japan. EM xuganglu@gmail.com FU Knowledge Creating Communication Research Center of NICT FX This study is supported by the MASTAR Project of the Knowledge Creating Communication Research Center of NICT. The authors would like to thank Dr. Yeung of HKUST for sharing the HTK based evaluation system on AURORA4. CR [Anonymous], 2002, HTK BOOK VERS 3 2 Atlas L, 2003, EURASIP J APPL SIG P, V2003, P668, DOI 10.1155/S1110865703305013 Chen CP, 2007, IEEE T AUDIO SPEECH, V15, P257, DOI 10.1109/TASL.2006.876717 Chen JD, 2006, IEEE T AUDIO SPEECH, V14, P1218, DOI 10.1109/TSA.2005.860851 Chen JD, 2003, SPEECH COMMUN, V41, P469, DOI 10.1016/S0167-6393(03)00016-5 DRULLMAN R, 1994, J ACOUST SOC AM, V95, P2670, DOI 10.1121/1.409836 Dudley H, 1939, J ACOUST SOC AM, V11, P169, DOI 10.1121/1.1916020 Elad M, 2002, IEEE T IMAGE PROCESS, V11, P1141, DOI 10.1109/TIP.2002.801126 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 ETSI, 2007, 202050V115 ETSI ES Greenberg S, 1997, INT CONF ACOUST SPEE, P1647, DOI 10.1109/ICASSP.1997.598826 Hermansky H., 1993, ICASSP-93. 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing (Cat. No.92CH3252-4), DOI 10.1109/ICASSP.1993.319236 HOUTGAST T, 1985, J ACOUST SOC AM, V77, P1069, DOI 10.1121/1.392224 HUANG BH, 1987, IEEE T ACOUST SPEECH, V35, P947 Hung JW, 2006, IEEE T AUDIO SPEECH, V14, P808, DOI 10.1109/TSA.2005.857801 Joris PX, 2004, PHYSIOL REV, V84, P541, DOI 10.1152/physrev.00029.2003 Kanedera N, 1999, SPEECH COMMUN, V28, P43, DOI 10.1016/S0167-6393(99)00002-3 Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst NEUMANN J, 2007, PERSONAL UBIQUITOUS RABINER L, 1993, FUNFAMENTALS SPEECH SHANNON RV, 1995, SCIENCE, V270, P303, DOI 10.1126/science.270.5234.303 Shen JL, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P881 Togneri R., 2006, P SST 2006, P94 Tomasi C., 1998, ICCV, P839, DOI DOI 10.1109/ICCV.1998.710815 Torre A., 2005, IEEE T SPEECH AUDIO, V13, P355 Turner RE, 2007, LECT NOTES COMPUT SC, V4666, P544 Vaseghi S. V., 2000, ADV DIGITAL SIGNAL P Xiao X, 2007, IEEE SIGNAL PROC LET, V14, P500, DOI 10.1109/LSP.2006.891341 Xiao X, 2008, IEEE T AUDIO SPEECH, V16, P1662, DOI 10.1109/TASL.2008.2002082 Zhang M, 2008, INT CONF ACOUST SPEE, P929 Zhu WZ, 2005, INT CONF ACOUST SPEE, P245 NR 31 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
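As a rough companion to the first stage of the algorithm described above, the sketch below mean- and variance-normalizes each cepstral coefficient trajectory (its temporal modulation contrast); the array shapes are assumed for illustration only, and the edge-preserved smoothing stage and the AURORA configuration of the paper are not reproduced.

import numpy as np

def contrast_normalize(cepstra, eps=1e-8):
    """Mean/variance-normalize each cepstral coefficient over time.

    cepstra: array of shape (num_frames, num_coeffs), e.g. an MFCC time series.
    Returns trajectories with zero mean and unit variance per coefficient, so
    clean and noisy speech share a comparable temporal modulation contrast.
    """
    mu = cepstra.mean(axis=0, keepdims=True)
    sigma = cepstra.std(axis=0, keepdims=True)
    return (cepstra - mu) / (sigma + eps)

# Illustrative usage on random "cepstra" (200 frames, 13 coefficients).
noisy = np.random.randn(200, 13) * 3.0 + 1.5
normalized = contrast_normalize(noisy)
print(normalized.mean(axis=0).round(3), normalized.std(axis=0).round(3))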
PD JAN PY 2010 VL 52 IS 1 BP 1 EP 11 DI 10.1016/j.specom.2009.08.006 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 522QX UT WOS:000272014500001 ER PT J AU Kinnunen, T Li, HZ AF Kinnunen, Tomi Li, Haizhou TI An overview of text-independent speaker recognition: From features to supervectors SO SPEECH COMMUNICATION LA English DT Review DE Speaker recognition; Text-independence; Feature extraction; Statistical models; Discriminative models; Supervectors; Intersession variability compensation ID GAUSSIAN MIXTURE-MODELS; COMBINING MULTIPLE CLASSIFIERS; SUPPORT VECTOR MACHINES; LANGUAGE RECOGNITION; SESSION VARIABILITY; LINEAR PREDICTION; SCORE NORMALIZATION; PATTERN-RECOGNITION; FEATURE-EXTRACTION; PROSODIC FEATURES AB This paper gives an overview of automatic speaker recognition technology, with an emphasis on text-independent recognition. Speaker recognition has been studied actively for several decades. We give an overview of both the classical and the state-of-the-art methods. We start with the fundamentals of automatic speaker recognition, concerning feature extraction and speaker modeling. We elaborate advanced computational techniques to address robustness and session variability. The recent progress from vectors towards supervectors opens up a new area of exploration and represents a technology trend. We also provide an overview of this recent development and discuss the evaluation methodology of speaker recognition systems. We conclude the paper with discussion on future directions. (C) 2009 Elsevier B.V. All rights reserved. C1 [Kinnunen, Tomi] Univ Joensuu, Dept Comp Sci & Stat, Speech & Image Proc Unit, FIN-80101 Joensuu, Finland. [Li, Haizhou] Inst Infocomm Res, Dept Human Language Technol, Singapore 138632, Singapore. RP Kinnunen, T (reprint author), Univ Joensuu, Dept Comp Sci & Stat, Speech & Image Proc Unit, POB 111, FIN-80101 Joensuu, Finland. EM tkinnu@cs.joensuu.fi; hli@i2r.a-star.edu.sg CR ADAMI A, 2003, P ICASSP, V4, P788 Adami AG, 2007, SPEECH COMMUN, V49, P277, DOI 10.1016/j.specom.2007.02.005 ALEXANDER A, 2004, FORENSIC SCI INT, V146, P95 Alku P, 1999, CLIN NEUROPHYSIOL, V110, P1329, DOI 10.1016/S1388-2457(99)00088-7 Altincay H, 2003, SPEECH COMMUN, V41, P531, DOI 10.1016/S0167-6393(03)00032-3 Ambikairajah E., 2007, P 6 INT IEEE C INF C, P1 ANDREWS W, 2002, P ICASSP, V1, P149 ANDREWS W, 2001, P EUROSPEECH, P2517 [Anonymous], 2002, FORENSIC SPEAKER IDE ARCIENEGA M, 2001, P 7 EUR C SPEECH COM, P2821 ASHOUR G, 1999, P 6 EUR C SPEECH COM, P1187 ATAL BS, 1972, J ACOUST SOC AM, V52, P1687, DOI 10.1121/1.1913303 ATAL BS, 1974, J ACOUST SOC AM, V55, P1304, DOI 10.1121/1.1914702 Atlas L, 2003, EURASIP J APPL SIG P, V2003, P668, DOI 10.1155/S1110865703305013 Auckenthaler R., 2001, P SPEAK OD SPEAK REC, P83 Auckenthaler R, 2000, DIGIT SIGNAL PROCESS, V10, P42, DOI 10.1006/dspr.1999.0360 Bartkova K, 2002, P INT C SPOK LANG PR, P1197 Benyassine A, 1997, IEEE COMMUN MAG, V35, P64, DOI 10.1109/35.620527 BENZEGHIBA M, 2003, P 8 EUR C SPEECH COM, P1361 BenZeghiba MF, 2006, SPEECH COMMUN, V48, P1200, DOI 10.1016/j.specom.2005.08.008 Besacier L, 2000, SIGNAL PROCESS, V80, P1245, DOI 10.1016/S0165-1684(00)00033-5 Besacier L, 2000, SPEECH COMMUN, V31, P89, DOI 10.1016/S0167-6393(99)00070-9 BIMBOT F, 1995, SPEECH COMMUN, V17, P177, DOI 10.1016/0167-6393(95)00013-E Bimbot Frederic, 2004, EURASIP J APPL SIG P, V4, P430, DOI [DOI 10.1155/S1110865704310024, 10.1155/S1110865704310024] Bishop C. 
M., 2006, PATTERN RECOGNITION BOCKLET T, 2009, P INT C AC SPEECH SI, P4525 Boersma P., 2009, PRAAT DOING PHONETIC Bonastre J.-F., 2007, P INT 2007 ICSLP ANT, P2053 Brummer N, 2007, IEEE T AUDIO SPEECH, V15, P2072, DOI 10.1109/TASL.2007.902870 Brummer N, 2006, COMPUT SPEECH LANG, V20, P230, DOI 10.1016/j.csl.2005.08.001 BURGET L, 2009, ROBUST SPEAKER RECOG Burget L, 2007, IEEE T AUDIO SPEECH, V15, P1979, DOI 10.1109/TASL.2007.902499 BURTON DK, 1987, IEEE T ACOUST SPEECH, V35, P133, DOI 10.1109/TASSP.1987.1165110 CAMPBELL J, 2004, ADV NEURAL INFORM PR, V16 CAMPBELL J, 2005, P INT C AC SPEECH SI, P637 CAMPBELL J, 2006, IEEE SIGNAL PROCESS, V13, P308 Campbell JP, 1997, P IEEE, V85, P1437, DOI 10.1109/5.628714 Campbell WM, 2002, IEEE T SPEECH AUDI P, V10, P205, DOI 10.1109/TSA.2002.1011533 Campbell WM, 2006, COMPUT SPEECH LANG, V20, P210, DOI 10.1016/j.csl.2005.06.003 Carey M. J., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607979 Castaldo F, 2007, IEEE T AUDIO SPEECH, V15, P1969, DOI 10.1109/TASL.2007.901823 Chan WN, 2007, IEEE T AUDIO SPEECH, V15, P1884, DOI 10.1109/TASL.2007.900103 CHARBUILLET C, 2006, P C INT IEEE C AC SP, V1, P673 Chaudhari UV, 2003, IEEE T SPEECH AUDI P, V11, P61, DOI 10.1109/TSA.2003.809121 Chen K, 1997, INT J PATTERN RECOGN, V11, P417, DOI 10.1142/S0218001497000196 CHEN ZH, 2004, P INT C SPOK LANG PR, P1421 Chetouani M, 2009, PATTERN RECOGN, V42, P487, DOI 10.1016/j.patcog.2008.08.008 CHEVEIGNE A, 2001, P EUROSP, P2451 Damper RI, 2003, PATTERN RECOGN LETT, V24, P2167, DOI 10.1016/S0167-8655(03)00082-5 DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 de Cheveigne A, 2002, J ACOUST SOC AM, V111, P1917, DOI 10.1121/1.1458024 Dehak N, 2006, P IEEE OD SPEAK LANG Dehak N, 2007, IEEE T AUDIO SPEECH, V15, P2095, DOI 10.1109/TASL.2007.902758 Dehak N., 2009, P INT C AC SPEECH SI, P4237 DEHAK N, 2008, SPEAK LANG REC WORKS Deller J., 2000, DISCRETE TIME PROCES Doddington G., 2001, P EUR, P2521 DUNN RB, 2001, 35 AS C SIGN SYST CO, V2, P1562 ESPYWILSON CY, 2006, P ICSLP 2006, P1475 Ezzaidi H., 2001, P EUROSPEECH, P2825 Faltlhauser R., 2001, P 7 EUR C SPEECH COM, P751 Farrell KR, 1994, IEEE T SPEECH AUDI P, V2, P194, DOI 10.1109/89.260362 FARRELL K, 1998, P 1998 IEEE INT C AC, V2, P1129, DOI 10.1109/ICASSP.1998.675468 FAUVE B, 2008, SPEAK LANG REC WORKS Fauve BGB, 2007, IEEE T AUDIO SPEECH, V15, P1960, DOI 10.1109/TASL.2007.902877 Ferrer L., 2007, P ICASSP HON APR, V4, P233 Ferrer L., 2008, P INT C AC SPEECH SI, P4853 FERRER L, 2008, SPEAK LANG REC WORKS Fredouille C, 2000, DIGIT SIGNAL PROCESS, V10, P172, DOI 10.1006/dspr.1999.0367 FURUI S, 1981, IEEE T ACOUST SPEECH, V29, P254, DOI 10.1109/TASSP.1981.1163530 Furui S, 1997, PATTERN RECOGN LETT, V18, P859, DOI 10.1016/S0167-8655(97)00073-1 GARCIAROMERO D, 2004, P SPEAK OD SPEAK REC, V4, P105 Gersho A., 1991, VECTOR QUANTIZATION GLEMBEK O, 2009, P INT C AC SPEECH SI, P4057 GONG WG, 2008, P IM SIGN PROC CISP, V5, P295 GONZALEZRODRIGU.J, 2003, P EUROSPEECH, P693 Gopalan K, 1999, IEEE T SPEECH AUDI P, V7, P289, DOI 10.1109/89.759036 GUDNASON J, 2008, P IEEE INT C AC SPEE, P4821 Gupta S. 
K., 1992, Digital Signal Processing, V2, DOI 10.1016/1051-2004(92)90027-V HANNANI A, 2004, P SPEAK OD SPEAK REC, P111 HANSEN EG, 2004, P OD 04 SPEAK LANG R, P179 Harrington J., 1999, TECHNIQUES SPEECH AC HARRIS FJ, 1978, P IEEE, V66, P51, DOI 10.1109/PROC.1978.10837 Hatch AO, 2005, 2005 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), P75 HATCH AO, 2006, P ICASSP, P585 HATCH AO, 2006, P INT, P1471 Hautamaki V, 2008, PATTERN RECOGN LETT, V29, P1427, DOI 10.1016/j.patrec.2008.02.021 Hautamaki V., 2007, P 12 INT C SPEECH CO, P645 Hautamaki V, 2008, IEEE SIGNAL PROC LET, V15, P162, DOI 10.1109/LSP.2007.914792 He JL, 1999, IEEE T SPEECH AUDI P, V7, P353 Hebert M., 2003, P 8 EUR C SPEECH COM, P1665 Hebert M., 2008, SPRINGER HDB SPEECH, P743, DOI 10.1007/978-3-540-49127-9_37 HECK L, 2002, P INT C SPOK LANG PR, P1369 HECK LP, 1997, P ICASSP, P1071 Heck LP, 2000, SPEECH COMMUN, V31, P181, DOI 10.1016/S0167-6393(99)00077-1 HEDGE RM, 2004, P IEEE INT C AC SPEE, V1, P517 Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616 Hermansky H, 1998, SPEECH COMMUN, V25, P3, DOI 10.1016/S0167-6393(98)00027-2 HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423 Hess W., 1983, PITCH DETERMINATION Higgins A., 1991, Digital Signal Processing, V1, DOI 10.1016/1051-2004(91)90098-6 Huang X., 2001, SPOKEN LANGUAGE PROC Imperl B, 1997, SPEECH COMMUN, V22, P385, DOI 10.1016/S0167-6393(97)00053-8 Jain AK, 2000, IEEE T PATTERN ANAL, V22, P4, DOI 10.1109/34.824819 Jang GJ, 2002, NEUROCOMPUTING, V49, P329, DOI 10.1016/S0925-2312(02)00527-1 JIN Q, 2002, P ICASSP ORL MAY, V1, P145 KAJAREKAR S, 2001, P SPEAK OD SPEAK REC, P201 KARAM ZN, 2007, P INT ANTW AUG, P290 KARPOV E, 2004, P 9 INT C SPEECH COM, P366 Kenny P, 2007, IEEE T AUDIO SPEECH, V15, P1448, DOI 10.1109/TASL.2007.894527 Kenny P, 2008, IEEE T AUDIO SPEECH, V16, P980, DOI 10.1109/TASL.2008.925147 Kenny P, 2006, CRIM060814 KINNUNEN T, 2009, P INT C AC SPEECH SI, P4545 Kinnunen T., 2002, P ICSLP 02, P2325 KINNUNEN T, 2007, P INT C BIOM ICB 200, P58 KINNUNEN T, 2006, P IEEE INT C AC SPEE, V1, P665 Kinnunen T, 2009, PATTERN RECOGN LETT, V30, P341, DOI 10.1016/j.patrec.2008.11.007 Kinnunen T., 2000, Proceedings of the IASTED International Conference. Signal Processing and Communications KINNUNEN T, 2006, P 5 INT S CHIN SPOK, P547 Kinnunen T., 2005, P 10 INT C SPEECH CO, P567 Kinnunen T., 2004, P 9 INT C SPEECH COM, P361 KINNUNEN T, 2004, THESIS U JOENSUU JOE KINNUNEN T, 2006, 5 INT S CHIN SPOK LA, P559 Kinnunen T., 2008, SPEAK LANG REC WORKS Kinnunen T, 2006, IEEE T AUDIO SPEECH, V14, P277, DOI 10.1109/TSA.2005.853206 KITAMURA T, 2008, P INT, P813 Kittler J, 1998, IEEE T PATTERN ANAL, V20, P226, DOI 10.1109/34.667881 Kolano G., 1999, P EUR C SPEECH COMM, P1203 KRYSZCZUK K, 2007, EURASIP J ADV SIG PR, V1, P86572 Lapidot I, 2002, IEEE T NEURAL NETWOR, V13, P877, DOI 10.1109/TNN.2002.1021888 LASKOWSKI K, 2009, P INT C AC SPEECH SI, P4541 Lee Hae-Lim, 2007, Plant Biology (Rockville), V2007, P294 LEE K, 2008, P 9 INT INT 2008 BRI, P1397 LEEUWEN D, 2006, COMPUT SPEECH LANG, V20, P128 LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010 Lei H., 2007, P INT 2007 ICSLP ANT, P746 Leung KY, 2006, SPEECH COMMUN, V48, P71, DOI 10.1016/j.specom.2005.05.013 Li H., 2009, P INT C AC SPEECH SI, P4201 Li K. 
P., 1998, P IEEE INT C AC SPEE, V1, P595 LINDE Y, 1980, IEEE T COMMUN, V28, P84, DOI 10.1109/TCOM.1980.1094577 LONGWORTH C, 2007, IEEE T AUDIO SPEECH, V6, P1 Louradour J., 2005, P 13 EUR C SIGN PROC LOURADOUR J, 2005, P ICASSP PHIL US MAR, P613 LU X, 2007, SPEECH COMMUN, V50, P312 MA B, 2006, P INT C AC SPEECH SI, V1, P1029 Ma B., 2006, P INT 2006 ICSLP PIT, P505 Ma B, 2007, IEEE T AUDIO SPEECH, V15, P2053, DOI 10.1109/TASL.2007.902861 Magrin-Chagnolleau I, 2002, IEEE T SPEECH AUDI P, V10, P371, DOI 10.1109/TSA.2002.800557 Mak M.W., 2006, P INT C AC SPEECH SI, V1, P929 Mak M.W., 2004, EURASIP J APPL SIG P, V4, P452 MAK MW, 2003, P INT C AC SPEECH SI, V2, P745 MAKHOUL J, 1975, P IEEE, V63, P561, DOI 10.1109/PROC.1975.9792 Malayath N, 2000, DIGIT SIGNAL PROCESS, V10, P55, DOI 10.1006/dspr.1999.0363 Mami Y, 2006, SPEECH COMMUN, V48, P127, DOI 10.1016/j.specom.2005.06.014 Mammone RJ, 1996, IEEE SIGNAL PROC MAG, V13, P58, DOI 10.1109/79.536825 Mariethoz J., 2002, P ICSLP, P581 MARKEL JD, 1977, IEEE T ACOUST SPEECH, V25, P330, DOI 10.1109/TASSP.1977.1162961 Martin A. F., 1997, P EUROSPEECH, P1895 MARY L, 2006, P INT 2006 ICSLP PIT, P917 Mary L, 2008, SPEECH COMMUN, V50, P782, DOI 10.1016/j.specom.2008.04.010 MASON M, 2005, P EUR LISB PORT SEP, P3109 McLaughlin J., 1999, P EUR, P1215 Misra H, 2003, SPEECH COMMUN, V39, P301, DOI 10.1016/S0167-6393(02)00046-8 Miyajima C, 2001, SPEECH COMMUN, V35, P203, DOI 10.1016/S0167-6393(00)00079-0 MOONASAR V, 2001, P INT JOINT C NEUR N, P2936 MULLER C, 2007, LECT NOTES COMPUTER, V4441 Muller Christian, 2007, LECT NOTES COMPUTER, V4343 Muller KR, 2001, IEEE T NEURAL NETWOR, V12, P181, DOI 10.1109/72.914517 Murty KR, 2006, IEEE SIGNAL PROC LET, V13, P52, DOI 10.1109/LSP.2005.860538 Naik J. M., 1989, P IEEE INT C AC SPEE, P524 NAKASONE H, 2004, P SPEAK OD SPEAK REC, P251 Ney H, 1997, TEXT SPEECH LANG TEC, V2, P174 NIEMILAITINEN T, 2005, P 2 BALT C HUM LANG, P317 *NIST, 2008, SRE RES PAG Nolan F, 1983, PHONETIC BASES SPEAK Oppenheim A. V., 1999, DISCRETE TIME SIGNAL ORMAN D, 2001, P SPEAK OD SPEAK REC, P219 Paliwal K. 
K., 2003, P EUR 2003, P2117 Park A., 2002, P INT C SPOK LANG PR, P1337 Pelecanos J., 2001, P SPEAK OD SPEAK REC, P213 PELECANOS J, 2000, P INT C PATT REC ICP, P3298 PELLOM BL, 1999, P ICASSP 99, P837 Pellom BL, 1998, IEEE SIGNAL PROC LET, V5, P281, DOI 10.1109/97.728467 PFISTER B, 2003, P EUROSPEECH 2003, P701 Plumpe MD, 1999, IEEE T SPEECH AUDI P, V7, P569, DOI 10.1109/89.784109 POH N, 2004, P IEEE INT C AC SPEE, V5, P893 Prasanna SRM, 2006, SPEECH COMMUN, V48, P1243, DOI 10.1016/j.specom.2006.06.002 Przybocki MA, 2007, IEEE T AUDIO SPEECH, V15, P1951, DOI 10.1109/TASL.2007.902489 Rabiner L, 1993, FUNDAMENTALS SPEECH Ramachandran RP, 2002, PATTERN RECOGN, V35, P2801, DOI 10.1016/S0031-3203(01)00235-7 Ramirez J, 2004, SPEECH COMMUN, V42, P271, DOI 10.1016/j.specom.2003.10.002 Ramos-Castro D, 2007, PATTERN RECOGN LETT, V28, P90, DOI 10.1016/j.patrec.2006.06.008 REYNOLDS DA, 1995, SPEECH COMMUN, V17, P91, DOI 10.1016/0167-6393(95)00009-D Reynolds DA, 2000, DIGIT SIGNAL PROCESS, V10, P19, DOI 10.1006/dspr.1999.0361 Reynolds D.A., 2005, P ICASSP, V1, P177, DOI 10.1109/ICASSP.2005.1415079 REYNOLDS DA, 1995, IEEE T SPEECH AUDI P, V3, P72, DOI 10.1109/89.365379 REYNOLDS DA, 2003, P ICASSP, P784 Reynolds D.A., 2003, P IEEE ICASSP, P53 Roch M, 2006, SPEECH COMMUN, V48, P85, DOI 10.1016/j.specom.2005.06.003 Rodriguez-Linares Leandro, 2003, Pattern Recognition, V36, P347 SAASTAMOINEN J, 2005, EURASIP J APPL SIG P, V17, P2816 Saeidi R, 2009, IEEE T AUDIO SPEECH, V17, P344, DOI 10.1109/TASL.2008.2010278 Shriberg E, 2005, SPEECH COMMUN, V46, P455, DOI 10.1016/j.specom.2005.02.018 Sivakumaran P., 2003, P EUROSPEECH INTERSP, P2669 Sivakumaran P, 2003, SPEECH COMMUN, V41, P485, DOI 10.1016/S0167-6393(03)00017-7 SLOMKA S, 1998, P INT C SPOK LANG PR, P225 SLYH RE, 2004, P ISCA TUT RES WORKS, P315 Solewicz YA, 2007, IEEE T AUDIO SPEECH, V15, P2063, DOI 10.1109/TASL.2007.903054 SOLOMONOFF A, 2005, P ICASSP, P629 SONMEZ M, 1998, P INT C SPOK LANG PR, P3189 SONMEZ MK, 1997, P EUR, P1391 SOONG FK, 1987, AT&T TECH J, V66, P14 SOONG FK, 1988, IEEE T ACOUST SPEECH, V36, P871, DOI 10.1109/29.1598 Stolcke A, 2007, IEEE T AUDIO SPEECH, V15, P1987, DOI 10.1109/TASL.2007.902859 STOLCKE A, 2008, P ICASSP, P1577 Sturim D., 2001, P ICASSP, V1, P429 STURIM D, 2005, P ICASSP, P741 Teunen R., 2000, P INT C SPOK LANG PR, V2, P495 THEVENAZ P, 1995, SPEECH COMMUN, V17, P145, DOI 10.1016/0167-6393(95)00010-L THIAN N, 2004, P 1 INT C BIOM AUTH, P631 THIRUVARAN T, 2008, P INT 2008 INC SST 2, P1497 Thiruvaran T., 2008, ELECT LETT, V44 TONG R, 2006, 5 INT S CHIN SPOK LA, P494 TORRESCARRASQUI.PA, 2002, P ICASSP, V1, P757 Tranter SE, 2006, IEEE T AUDIO SPEECH, V14, P1557, DOI 10.1109/TASL.2006.878256 TYDLITAT B, 2007, P INT C AC SPEECH SI, V4, P293 Viikki O, 1998, SPEECH COMMUN, V25, P133, DOI 10.1016/S0167-6393(98)00033-8 Vogt R., 2008, SPEAK LANG REC WORKS Vogt R., 2005, P INT, P3117 Vogt R, 2008, COMPUT SPEECH LANG, V22, P17, DOI 10.1016/j.csl.2007.05.003 Wan V, 2005, IEEE T SPEECH AUDI P, V13, P203, DOI 10.1109/TSA.2004.841042 WILDERMOTH B, 2008, P 8 AUSTR INT C SPEE, P324 WOLF JJ, 1972, J ACOUST SOC AM, V51, P2044, DOI 10.1121/1.1913065 Xiang B, 2003, IEEE SIGNAL PROC LET, V10, P141, DOI 10.1109/LSP.2003.810913 Xiang B, 2003, IEEE T SPEECH AUDI P, V11, P447, DOI 10.1109/TSA.2003.815822 XIANG B, 2002, P ICASSP, V1, P681 Xiong ZY, 2006, SPEECH COMMUN, V48, P1273, DOI 10.1016/j.specom.2606.06.011 Yegnanarayana B, 2002, NEURAL NETWORKS, V15, P459, DOI 10.1016/S0893-6080(02)00019-9 You CH, 2009, IEEE SIGNAL PROC LET, V16, P49, DOI 
10.1109/LSP.2008.2006711 Yuo KH, 1999, SPEECH COMMUN, V28, P227, DOI 10.1016/S0167-6393(99)00017-5 Zheng NH, 2007, IEEE SIGNAL PROC LET, V14, P181, DOI 10.1109/LSP.2006.884031 ZHU D, 2008, P INT 2008 BRISB AUS ZHU D, 2007, P INT C AC SPEECH SI, V4, P61 Zhu D., 2009, P ICASSP, P4045 Zilca RD, 2002, IEEE T SPEECH AUDI P, V10, P363, DOI 10.1109/TSA.2002.803419 Zilca RD, 2006, IEEE T AUDIO SPEECH, V14, P467, DOI [10.1109/TSA.2005.857809, 10.1109/FSA.2005.857809] Zissman MA, 1996, IEEE T SPEECH AUDI P, V4, P31, DOI 10.1109/TSA.1996.481450 NR 247 TC 185 Z9 198 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2010 VL 52 IS 1 BP 12 EP 40 DI 10.1016/j.specom.2009.08.009 PG 29 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 522QX UT WOS:000272014500002 ER PT J AU Ishizuka, K Nakatani, T Fujimoto, M Miyazaki, N AF Ishizuka, Kentaro Nakatani, Tomohiro Fujimoto, Masakiyo Miyazaki, Noboru TI Noise robust voice activity detection based on periodic to aperiodic component ratio SO SPEECH COMMUNICATION LA English DT Article DE Voice activity detection; Robustness; Periodicity; Aperiodicity; Noise robust front-end processing for automatic speech recognition ID FUNDAMENTAL-FREQUENCY ESTIMATION; SUBBAND-BASED PERIODICITY; HIGHER-ORDER STATISTICS; SPEECH RECOGNITION; SPECTRUM ESTIMATION; PITCH DETECTION; ALGORITHM; DECOMPOSITION; SIGNALS; MODEL AB This paper proposes a noise robust voice activity detection (VAD) technique called PARADE (PAR based Activity DEtection) that employs the periodic component to aperiodic component ratio (PAR). Conventional noise robust features for VAD are still sensitive to non-stationary noise, which yields variations in the signal-to-noise ratio, and sometimes requires a priori noise power estimations, although the characteristics of environmental noise change dynamically in the real world. To overcome this problem, we adopt the PAR, which is insensitive to both stationary and non-stationary noise, as an acoustic feature for VAD. By considering both periodic and aperiodic components simultaneously in the PAR, we can mitigate the effect of the non-stationarity of noise. PARADE first estimates the fundamental frequencies of the dominant periodic components of the observed signals, decomposes the power of the observed signals into the powers of its periodic and aperiodic components by taking account of the power of the aperiodic components at the frequencies where the periodic components exist, and calculates the PAR based on the decomposed powers. Then it detects the presence of target speech signals by estimating the voice activity likelihood defined in relation to the PAR. Comparisons of the VAD performance for noisy speech data confirmed that PARADE outperforms the conventional VAD algorithms even in the presence of non-stationary noise. In addition, PARADE is applied to a front-end processing technique for automatic speech recognition (ASR) that employs a robust feature extraction method called SPADE (Subband based Periodicity and Aperiodicity DEcomposition) as an application of PARADE. 
Comparisons of the ASR performance for noisy speech show that the SPADE front-end combined with PARADE achieves significantly higher word accuracies than those achieved by MFCC (Mel-frequency Cepstral Coefficient) based feature extraction, which is widely used for conventional ASR systems, the SPADE front-end without PARADE, and other standard noise robust front-end processing techniques (ETSI ES 202 050 and ETSI ES 202 212). This result confirmed that PARADE can improve the performance of front-end processing for ASR. (C) 2009 Elsevier B.V. All rights reserved. C1 [Ishizuka, Kentaro; Nakatani, Tomohiro; Fujimoto, Masakiyo] NTT Corp, NTT Commun Sci Labs, Kyoto 6190237, Japan. [Miyazaki, Noboru] NTT Corp, NTT Cyber Space Labs, Yokosuka, Kanagawa 2390847, Japan. RP Ishizuka, K (reprint author), NTT Corp, NTT Commun Sci Labs, Hikaridai 2-4, Kyoto 6190237, Japan. EM ishizuka@cslab.kecl.ntt.co.jp; nak@cslab.kecl.ntt.co.jp; masakiyo@cslab.kecl.ntt.co.jp; miyazaki.noboru@lab.ntt.co.jp CR Adami A, 2002, P ICSLP, P21 Agarwal A., 1999, P ASRU, P67 Ahmadi S, 1999, IEEE T SPEECH AUDI P, V7, P333, DOI 10.1109/89.759042 [Anonymous], 2002, 202050 ETSI ES [Anonymous], 2003, 202212 ETSI ES [Anonymous], 1999, 301708 ETSI EN ATAL BS, 1976, IEEE T ACOUST SPEECH, V24, P201, DOI 10.1109/TASSP.1976.1162800 Basu S., 2003, P ICASSP, V1, pI Benitez C, 2001, P EUR 2001, P429 Boersma P., 1993, P I PHONETIC SCI, V17, P97 Chang JH, 2006, IEEE T SIGNAL PROCES, V54, P1965, DOI 10.1109/TSP.2006.874403 Chen C.-P., 2002, P ICSLP, P241 Cho YD, 2001, IEEE SIGNAL PROC LET, V8, P276 Cournapeau D., 2007, P INT, P2945 Davis A, 2006, IEEE T AUDIO SPEECH, V14, P412, DOI 10.1109/TSA.2005.855842 DELATORRE A, 2006, P INT, P1954 Deshmukh O, 2005, IEEE T SPEECH AUDI P, V13, P776, DOI 10.1109/TSA.2005.851910 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 *ETSI, 2000, 101707 ETSI TS Evangelopoulos G, 2006, IEEE T AUDIO SPEECH, V14, P2024, DOI 10.1109/TASL.2006.872625 Fisher E, 2006, IEEE T AUDIO SPEECH, V14, P502, DOI 10.1109/TSA.2005.857806 Fujimoto M., 2008, P ICASSP 08 APR, P4441 Fujimoto M, 2008, IEICE T INF SYST, VE91D, P467, DOI [10.1093/ietisy/e91-d.3.467, 10.1093/ietisy/e9l-d.3.467] Gorriz JM, 2006, IEEE SIGNAL PROC LET, V13, P636, DOI 10.1109/LSP.2006.876340 GRIFFIN DW, 1988, IEEE T ACOUST SPEECH, V36, P1223, DOI 10.1109/29.1651 HAMADA M, 1990, P INT C SPEECH LANG, P893 Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616 Hess W., 1983, PITCH DETERMINATION HILLENBRAND J, 1987, J SPEECH HEAR RES, V30, P448 Hirsch H. 
G., 2000, P ISCA ITRW ASR2000, P181 Ishizuka K, 2006, J ACOUST SOC AM, V120, P443, DOI 10.1121/1.2205131 Ishizuka K, 2006, SPEECH COMMUN, V48, P1447, DOI 10.1016/j.specom.2006.06.008 ITAKURA F, 1968, REP INT C AC JACKSON PJB, 2003, P EUROSPEECH, P2321 Jackson PJB, 2001, IEEE T SPEECH AUDI P, V9, P713, DOI 10.1109/89.952489 Junqua JC, 1994, IEEE T SPEECH AUDI P, V2, P406, DOI 10.1109/89.294354 JUNQUA JC, 1993, J ACOUST SOC AM, V93, P510, DOI 10.1121/1.405631 Karray L, 2003, SPEECH COMMUN, V40, P261, DOI 10.1016/S0167-6393(02)00066-3 KINGSBURY B, 2002, P ICASSP, V1, P53 Kitaoka N., 2007, P IEEE WORKSH AUT SP, P607 KRISTIANSSON T., 2005, P INTERSPEECH, P369 Krom G., 1993, J SPEECH HEAR RES, V36, P254 LAMEL LF, 1981, IEEE T ACOUST SPEECH, V29, P777, DOI 10.1109/TASSP.1981.1163642 LAROCHE J, 1993, P ICASSP, V1, P550 LEBOUQUINJEANNES R, 1995, SPEECH COMMUN, V16, P245, DOI 10.1016/0167-6393(94)00056-G Lee A., 2004, P ICSLP, V1, P173 Li K, 2005, IEEE T SPEECH AUDI P, V13, P965, DOI 10.1109/TSA.2005.851955 Li Q., 2001, P 7 EUR C SPEECH COM, P619 Machiraju VR, 2002, J CARDIAC SURG, V17, P20 MAK B, 1992, P IEEE INT C AC SPEE, V1, P269 Marzinzik M, 2002, IEEE T SPEECH AUDI P, V10, P109, DOI 10.1109/89.985548 Mauuary L., 1998, P EUSPICO 98, V1, P359 MOUSSET E, 1996, P INT C SPOK LANG PR, V2, P1273, DOI 10.1109/ICSLP.1996.607842 NAKAMURA A, 1996, P ICSLP, V4, P2199, DOI 10.1109/ICSLP.1996.607241 NAKAMURA S, 2005, TEICE T INF SYST D, V88, P535 NAKAMURA S, 2003, P 8 IEEE WORKSH AUT, P619 Nakatani T, 2004, J ACOUST SOC AM, V116, P3690, DOI 10.1121/1.1787522 Nakatani T, 2008, SPEECH COMMUN, V50, P203, DOI 10.1016/j.specom.2007.09.003 Nemer E, 2001, IEEE T SPEECH AUDI P, V9, P217, DOI 10.1109/89.905996 NOE B, 2001, P EUR 2001 AALB DENM, P433 Pearce D., 2000, P ICSLP, V4, P29 RABINER LR, 1975, AT&T TECH J, V54, P297 RABINER LR, 1977, IEEE T ACOUST SPEECH, V25, P24, DOI 10.1109/TASSP.1977.1162905 Ramirez J, 2004, SPEECH COMMUN, V42, P271, DOI 10.1016/j.specom.2003.10.002 Ramirez J, 2005, IEEE SIGNAL PROC LET, V12, P689, DOI 10.1109/LSP.2005.855551 RAMIREZ J, 2007, P ICASSP, V4, P801 RICHARD G, 1996, PROGR TEXT TO SPEECH, P41 SAVOJI MH, 1989, SPEECH COMMUN, V8, P45, DOI 10.1016/0167-6393(89)90067-8 SERRA X, 1990, COMPUT MUSIC J, V14, P12, DOI 10.2307/3680788 SHEN JL, 1998, P ICSLP Sohn J, 1999, IEEE SIGNAL PROC LET, V6, P1 Solvang HK, 2008, SPEECH COMMUN, V50, P476, DOI 10.1016/j.specom.2008.02.003 Srinivasant K., 1993, P IEEE SPEECH COD WO, P85, DOI 10.1109/SCFT.1993.762351 Summers W V, 1988, J Acoust Soc Am, V84, P917, DOI 10.1121/1.396660 Tahmasbi R, 2007, IEEE T AUDIO SPEECH, V15, P1129, DOI 10.1109/TASL.2007.894521 TUCKER R, 1992, IEE PROC-I, V139, P377 Wilpon J. G., 1987, Computer Speech and Language, V2, DOI 10.1016/0885-2308(87)90015-5 Wu BF, 2005, IEEE T SPEECH AUDI P, V13, P762, DOI 10.1109/TSA.2005.851909 YANTORNO RE, 2001, P IEEE INT WORKSH IN Yegnanarayana B, 1998, IEEE T SPEECH AUDI P, V6, P1, DOI 10.1109/89.650304 NR 80 TC 11 Z9 13 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
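A highly simplified sketch of the periodic-to-aperiodic ratio (PAR) idea behind PARADE, assuming the dominant fundamental frequency of the frame has already been estimated; the crude harmonic-comb power decomposition, the number of harmonics, and the threshold are illustrative assumptions rather than the published algorithm.

import numpy as np

def periodic_aperiodic_ratio(frame, f0, sr, n_harmonics=8, bandwidth_hz=35.0):
    """Approximate the periodic-to-aperiodic power ratio (PAR) of one frame.

    Power in narrow bands around the first few harmonics of the estimated f0
    is counted as periodic; everything else is treated as aperiodic. A real
    implementation would also account for aperiodic power leaking into the
    harmonic bands.
    """
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    harmonics = f0 * np.arange(1, n_harmonics + 1)
    periodic_mask = np.zeros_like(freqs, dtype=bool)
    for h in harmonics[harmonics < sr / 2]:
        periodic_mask |= np.abs(freqs - h) <= bandwidth_hz
    periodic = spectrum[periodic_mask].sum()
    aperiodic = spectrum[~periodic_mask].sum() + 1e-12
    return periodic / aperiodic

def is_speech(frame, f0, sr, par_threshold=2.0):
    """Crude frame-level voice activity decision: a high PAR suggests voiced speech."""
    return periodic_aperiodic_ratio(frame, f0, sr) > par_threshold

# Illustrative usage: a synthetic 125 Hz voiced frame versus white noise.
sr = 16000
t = np.arange(int(0.032 * sr)) / sr
voiced = sum(np.sin(2 * np.pi * 125 * k * t) / k for k in range(1, 6))
noise = np.random.randn(len(t))
print(is_speech(voiced + 0.1 * noise, f0=125.0, sr=sr))   # expected True
print(is_speech(noise, f0=125.0, sr=sr))                  # expected False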
PD JAN PY 2010 VL 52 IS 1 BP 41 EP 60 DI 10.1016/j.specom.2009.08.003 PG 20 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 522QX UT WOS:000272014500003 ER PT J AU Misu, T Kawahara, T AF Misu, Teruhisa Kawahara, Tatsuya TI Bayes risk-based dialogue management for document retrieval system with speech interface SO SPEECH COMMUNICATION LA English DT Article DE Spoken dialogue system; Dialogue management; Document retrieval; Bayes risk AB We propose an efficient technique of dialogue management for an information navigation system based on a document knowledge base. The system can use ASR N-best hypotheses and contextual information to perform robustly for fragmental speech input and erroneous output of automatic speech recognition (ASR). It also has several choices in generating responses or confirmations. We formulate the optimization of these choices based on a Bayes risk criterion, which is defined based on a reward for correct information presentation and a penalty for redundant turns. The parameters for the dialogue management we propose can be adaptively tuned by online learning. We evaluated this strategy with our spoken dialogue system called "Dialogue Navigator for Kyoto City", which generates responses based on the document retrieval and also has question-answering capability. The effectiveness of the proposed framework was demonstrated by the increased success rate of dialogue and the reduced number of turns for information access through an experiment with a large number of utterances by real users. (C) 2009 Elsevier B.V. All rights reserved. C1 [Misu, Teruhisa; Kawahara, Tatsuya] Kyoto Univ, Sch Informat, Sakyo Ku, Kyoto 6068501, Japan. RP Misu, T (reprint author), Natl Inst Informat & Commun Technol, Kyoto, Japan. EM misu@ar.media.kyoto-u.ac.jp CR AKIBA T, 2005, P INT Boni M. D., 2005, NAT LANG ENG, V11, P343 BRONDSTED T, 2006, P WORKSH SPEECH MOB Chen B., 2005, P EUR C SPEECH COMM, P109 DOHSAKA K, 2003, P EUR HORVITZ E, 2006, USER MODEL USER-ADAP, V17, P159 Komatani K, 2005, USER MODEL USER-ADAP, V15, P169, DOI 10.1007/s11257-004-5659-0 KUDO T, 2003, P 42 ANN M ACL LAMEL L, 2002, SPEECH COMM, V38 LAMEL L, 1999, P ICASSP Lee A., 2004, P ICASSP QUEB, P793 Levin E, 2000, IEEE T SPEECH AUDI P, V8, P11, DOI 10.1109/89.817450 Levin E., 2006, P SPOK LANG TECHN WO, P198 LITMAN DJ, 2000, P 17 C COMP LING, P502 MATSUDA M, 2006, P 5 NTCIR WORKSH M E, P414 MISU T, 2007, P ICASSP MISU T, 2006, P INTERSPEECH, P9 Misu T, 2006, SPEECH COMMUN, V48, P1137, DOI 10.1016/j.specom.2006.04.001 MURATA M, 2006, P AS INF RETR S, P601 NIIMI Y, 1996, P ICSLP NISHIMURA R, 2005, P INT *NIST, 2003, NIST SPEC PUBL, P500 Pan Y. C., 2007, P AUT SPEECH REC UND, P544 POTAMIANOS A, 2000, P ICSLP Raux A., 2005, P INT RAVICHANDRAN D, 2002, P 40 ANN M ACL RAYMOND C, 2003, P AUT SPEECH REC UND Reithinger N., 2005, P INT Roy N., 2000, P 38 ANN M ASS COMP, P93, DOI DOI 10.3115/1075218.1075231 RUDNICKY A, 2000, P ICSLP, V2 SENEFF S, 2000, P ANLP NAACL 2000 SA Singh S, 2002, J ARTIF INTELL RES, V16, P105 Sturm J., 1999, P ESCA WORKSH INT DI YOUNG S, 2007, P ICASSP Zue V, 2000, IEEE T SPEECH AUDI P, V8, P85, DOI 10.1109/89.817460 NR 35 TC 6 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
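A toy sketch of the Bayes-risk decision described in the abstract above: each candidate action (present a document, confirm it, or reject and ask again) is scored by an expected loss that trades the cost of a wrong presentation against the penalty of extra turns, and the minimum-risk action is chosen. The loss values and the way hypothesis posteriors are obtained are illustrative assumptions only.

def bayes_risk_action(hypotheses, confirm_penalty=0.3, reject_penalty=1.0):
    """Pick the minimum-risk dialogue action given ASR/retrieval N-best hypotheses.

    hypotheses: list of (document_id, posterior_probability) pairs.
    Risk of presenting doc d = P(d is wrong)                  (wrong-answer cost)
    Risk of confirming doc d = confirm_penalty + residual risk (one extra turn)
    Risk of rejecting        = reject_penalty                  (ask the user again)
    """
    actions = []
    for doc, p in hypotheses:
        actions.append((("present", doc), 1.0 - p))
        actions.append((("confirm", doc), confirm_penalty + 0.2 * (1.0 - p)))
    actions.append((("reject", None), reject_penalty))
    return min(actions, key=lambda a: a[1])

# Illustrative usage: confident versus ambiguous recognition results.
print(bayes_risk_action([("temple_guide", 0.90), ("bus_schedule", 0.05)]))  # present
print(bayes_risk_action([("temple_guide", 0.45), ("bus_schedule", 0.40)]))  # confirm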
PD JAN PY 2010 VL 52 IS 1 BP 61 EP 71 DI 10.1016/j.specom.2009.08.007 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 522QX UT WOS:000272014500004 ER PT J AU Srinivasan, S Wang, DL AF Srinivasan, Soundararajan Wang, DeLiang TI Robust speech recognition by integrating speech separation and hypothesis testing SO SPEECH COMMUNICATION LA English DT Article DE Robust speech recognition; Missing-data recognizer; Ideal binary mask; Speech segregation; Top-down processing ID NOVELTY DETECTION; SOUNDS; NOISE AB Missing-data methods attempt to improve robust speech recognition by distinguishing between reliable and unreliable data in the time-frequency (T-F) domain. Such methods require a binary mask to label speech-dominant T-F regions of a noisy speech signal as reliable and the rest as unreliable. Current methods for computing the mask are based mainly on bottom-up cues such as harmonicity and produce labeling errors that degrade recognition performance. In this paper, we propose a two-stage recognition system that combines bottom-up and top-down cues in order to simultaneously improve both mask estimation and recognition accuracy. First, an n-best lattice consistent with a speech separation mask is generated. The lattice is then re-scored by expanding the mask using a model-based hypothesis test to determine the reliability of individual T-F units. Systematic evaluations of the proposed system show significant improvement in recognition performance compared to that using speech separation alone. (C) 2009 Elsevier B.V. All rights reserved. C1 [Srinivasan, Soundararajan] Ohio State Univ, Dept Biomed Engn, Columbus, OH 43210 USA. [Wang, DeLiang] Ohio State Univ, Dept Comp Sci & Engn, Columbus, OH 43210 USA. [Wang, DeLiang] Ohio State Univ, Ctr Cognit Sci, Columbus, OH 43210 USA. RP Srinivasan, S (reprint author), Robert Bosch LLC, Res & Technol Ctr N Amer, Pittsburgh, PA 15212 USA. EM srinivasan.36@osu.edu; dwang@cse.ohio-state.edu FU AFOSR [FA9550-08-1-0155]; NSF [IIS-0534707] FX This research was supported in part by an AFOSR grant (FA9550-08-1-0155), an NSF grant (IIS-0534707). We thank J. Barker for help with the speech fragment decoder and E. Fosler-Lussier for helpful discussions. A preliminary version of this work was presented in 2005 ICASSP (Srinivasan and Wang, 2005a). CR Barker JP, 2005, SPEECH COMMUN, V45, P5, DOI 10.1016/j.specom.2004.05.002 BISHOP CM, 1994, IEE P-VIS IMAGE SIGN, V141, P217, DOI 10.1049/ip-vis:19941330 Boersma P, 2002, PRAAT DOING PHONETIC BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 Bregman AS., 1990, AUDITORY SCENE ANAL BROWN GJ, 2001, P IJCNN 01, P2907 Cooke M, 2001, SPEECH COMMUN, V34, P267, DOI 10.1016/S0167-6393(00)00034-0 Darwin CJ, 2008, PHILOS T R SOC B, V363, P1011, DOI 10.1098/rstb.2007.2156 DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 DROPPO J, 2002, P INT C SPOK LANG PR, P1569 Drygajlo A., 1998, P ICASSP 98, V1, P121, DOI 10.1109/ICASSP.1998.674382 EPHRAIM Y, 1992, IEEE T SIGNAL PROCES, V40, P725, DOI 10.1109/78.127947 Gales Mark, 2007, Foundations and Trends in Signal Processing, V1, DOI 10.1561/2000000004 GONG YF, 1995, SPEECH COMMUN, V16, P261, DOI 10.1016/0167-6393(94)00059-J Hu GN, 2004, IEEE T NEURAL NETWOR, V15, P1135, DOI 10.1109/TNN.2004.832812 Huang X., 2001, SPOKEN LANGUAGE PROC Leonard R. 
G., 1984, P ICASSP 84, P111 Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst Markou M, 2003, SIGNAL PROCESS, V83, P2481, DOI [10.1016/j.sigpro.2003.07.018, 10.1016/j.sigpro.2003.018] McCowan IA, 2003, IEEE T SPEECH AUDI P, V11, P709, DOI 10.1109/TSA.2003.818212 McLachlan G.J., 1988, MIXTURE MODELS INFER Patterson R.D., 1988, 2341 APU Pearce D., 2000, P ICSLP, V4, P29 Renevey P., 2001, P CONS REL AC CUES S, P71 Roman N, 2003, J ACOUST SOC AM, V114, P2236, DOI 10.1121/1.1610463 SELTZER ML, 2000, P INT C SPOK LANG PR, P538 Srinivasan S., 2005, P IEEE INT C AC SPEE, V1, P89, DOI 10.1109/ICASSP.2005.1415057 Srinivasan S, 2006, SPEECH COMMUN, V48, P1486, DOI 10.1016/j.specom.2006.09.003 Srinivasan S, 2005, SPEECH COMMUN, V45, P63, DOI 10.1016/j.specom.2004.09.002 Srinivasan S., 2006, THESIS OHIO STATE U Stark H., 2002, PROBABILITY RANDOM P, V3rd Tax D, 1998, LECT NOTES COMPUTER, V1451, P593 Van hamme H, 2004, P IEEE ICASSP, V1, P213 Varga A.P., 1990, P ICASSP, P845 VARGA AP, 1992, NOISEX 92 STUDY EFFE Wang D., 2006, COMPUTATIONAL AUDITO Wang DLL, 1999, IEEE T NEURAL NETWOR, V10, P684, DOI 10.1109/72.761727 Williams G, 1999, COMPUT SPEECH LANG, V13, P395, DOI 10.1006/csla.1999.0129 Young S., 2000, HTK BOOK HTK VERSION NR 39 TC 12 Z9 12 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2010 VL 52 IS 1 BP 72 EP 81 DI 10.1016/j.specom.2009.08.008 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 522QX UT WOS:000272014500005 ER PT J AU Jansen, A Niyogi, P AF Jansen, Aren Niyogi, Partha TI Point process models for event-based speech recognition SO SPEECH COMMUNICATION LA English DT Article DE Event-based speech recognition; Speech processing; Point process models ID SPIKE; FRAMEWORK; FEATURES; NEURONS AB Several strands of research in the fields of linguistics, speech perception, and neuroethology suggest that modelling the temporal dynamics of an acoustic event landmark-based representation is a scientifically plausible approach to the automatic speech recognition (ASR) problem. Adopting a point process representation of the speech signal opens up ASR to a large class of statistical models that have seen wide application in the neuroscience community. In this paper, we formulate several point process models for application to speech recognition, designed to operate on sparse detector-based representations of the speech signal. We find that even with a noisy and extremely sparse phone-based point process representation, obstruent phones can be decoded at accuracy levels comparable to a basic hidden Markov model baseline and with improved robustness. We conclude by outlining various avenues for future development of our methodology. (C) 2009 Elsevier B.V. All rights reserved. C1 [Jansen, Aren; Niyogi, Partha] Univ Chicago, Dept Comp Sci, Chicago, IL 60637 USA. RP Jansen, A (reprint author), Univ Chicago, Dept Comp Sci, 1100 E 58th St, Chicago, IL 60637 USA. 
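To make the point-process representation above concrete, here is a small, assumption-laden sketch (not the authors' model) in which each class is described by an inhomogeneous Poisson rate over landmark-detector firing times, and decoding picks the class with the highest log-likelihood of the observed event times.

import numpy as np

def poisson_log_likelihood(event_times, rate_fn, duration, grid=500):
    """Log-likelihood of firing times under an inhomogeneous Poisson model:
    log L = sum_i log rate(t_i) - integral_0^T rate(t) dt (Riemann-sum integral)."""
    t = np.linspace(0.0, duration, grid, endpoint=False)
    integral = np.sum(rate_fn(t)) * (duration / grid)
    return float(np.sum(np.log(rate_fn(np.asarray(event_times)) + 1e-12)) - integral)

# Two illustrative class-conditional rate profiles over a 100 ms segment:
# class "A" expects detector firings early in the segment, class "B" late.
rate_a = lambda t: 5.0 + 60.0 * np.exp(-((t - 0.02) / 0.01) ** 2)
rate_b = lambda t: 5.0 + 60.0 * np.exp(-((t - 0.08) / 0.01) ** 2)

observed = [0.018, 0.021, 0.024]    # firings clustered early
scores = {"A": poisson_log_likelihood(observed, rate_a, 0.1),
          "B": poisson_log_likelihood(observed, rate_b, 0.1)}
print(max(scores, key=scores.get))  # expected "A"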
EM aren@cs.uchicago.edu; niyogi@cs.uchicago.edu CR Amarasingham A, 2006, J NEUROSCI, V26, P801, DOI 10.1523/JNEUROSCI.2948-05.2006 AMIT Y, 2005, J ACOUST SOC AM, V118 BOURLARD H, 1996, IDIAPRR9607 Brown EN, 2005, METHODS MODELS NEURO, P691 Chi ZY, 2007, J NEUROPHYSIOL, V97, P1221, DOI 10.1152/jn.00448.2006 DENG L, 1994, J ACOUST SOC AM, V95, P2702, DOI 10.1121/1.409839 Ellis D.P.W., 2005, PLP RASTA MFCC INVER Esser KH, 1997, P NATL ACAD SCI USA, V94, P14019, DOI 10.1073/pnas.94.25.14019 FRANGOULIS E, 1989, P ICASSP, P9 FUZESSERY ZM, 1983, J COMP PHYSIOL, V150, P333 Geiger D, 1999, INT J COMPUT VISION, V33, P139, DOI 10.1023/A:1008146126392 Glass JR, 2003, COMPUT SPEECH LANG, V17, P137, DOI 10.1016/S0885-2308(03)00006-8 Greenberg S, 2003, J PHONETICS, V31, P465, DOI 10.1016/j.wocn.2003.09.005 Gutig R, 2006, NAT NEUROSCI, V9, P420, DOI 10.1038/nn1643 HASEGAWAJOHNSON M, 2002, ACCUMU J ARTS TECHNO Jansen A, 2008, J ACOUST SOC AM, V124, P1739, DOI 10.1121/1.2956472 JUANG BH, 1985, P ICASSP Lee Chin-Hui, 2007, P INT, P1825 LEE KF, 1989, IEEE T ACOUST SPEECH, V37, P1641, DOI 10.1109/29.46546 Legenstein R, 2005, NEURAL COMPUT, V17, P2337, DOI 10.1162/0899766054796888 Levinson S. E., 1986, Computer Speech and Language, V1, DOI 10.1016/S0885-2308(86)80009-2 Li J., 2005, P INT LISB PORT SEP, P3365 Livescu K., 2004, P HLT NAACL MA C, 2006, P INT MAK B, 2000, P ICSLP, P149 MARGOLIASH D, 1992, J NEUROSCI, V12, P4309 Niyogi P, 2002, J ACOUST SOC AM, V111, P1063, DOI 10.1121/1.1427666 NIYOGI P, 1998, P ICSLP Nock HJ, 2003, COMPUT SPEECH LANG, V17, P233, DOI 10.1016/S0885-2308(03)00009-3 OLHAUSEN BA, 2003, P ICIP, P41 OSTENDORF M, 1992, P DARPA WORKSH CONT Ostendorf M., 1996, AUTOMATIC SPEECH SPE, P185 Parker Steve, 2002, THESIS U MASSACHUSET POEPPEL D, 2007, PHILOS T ROYAL SOC B Pruthi T, 2004, SPEECH COMMUN, V43, P225, DOI 10.1016/j.specom.2004.06.001 RAMESH P, 1992, P ICASSP RUSSELL MJ, 1987, P ICASSP Serre T, 2007, IEEE T PATTERN ANAL, V29, P411, DOI 10.1109/TPAMI.2007.56 Sha F., 2007, P ICASSP, P313 STEVENS K, 1992, P ICSLP Stevens K. N., 1981, PERSPECTIVES STUDY S, P1 Stevens KN, 2002, J ACOUST SOC AM, V111, P1872, DOI 10.1121/1.1458026 Suga N, 2006, LISTENING SPEECH AUD, P159 Sun JP, 2002, J ACOUST SOC AM, V111, P1086, DOI 10.1121/1.1420380 Truccolo W, 2005, J NEUROPHYSIOL, V93, P1074, DOI 10.1152/jn.00697.2004 WILLETT R, 2007, P ICASSP, P1249 XIE Z, 2006, P ICSLP Zhang Yongshun, 2003, Proceedings. 2003 IEEE International Conference on Robotics, Intelligent Systems and Signal Processing (IEEE Cat. No.03EX707) NR 48 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD DEC PY 2009 VL 51 IS 12 BP 1155 EP 1168 DI 10.1016/j.specom.2009.05.008 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 560FZ UT WOS:000274888800001 ER PT J AU Qian, Y Soong, FK AF Qian, Yao Soong, Frank K. TI A Multi-Space Distribution (MSD) and two-stream tone modeling approach to Mandarin speech recognition SO SPEECH COMMUNICATION LA English DT Article DE Tone model; Mandarin speech recognition; Multi-Space Distribution (MSD); Noisy digit recognition; LVCSR AB Tone plays an important role in recognizing spoken tonal languages like Chinese. However, the discontinuity of F0 between voiced and unvoiced transition has traditionally been a hurdle in creating a succinct statistical tone model for automatic speech recognition and synthesis.
Various heuristic approaches have been proposed before to get around the problem but with limited success. The Multi-Space Distribution (MSD) proposed by Tokuda et al. which models the two probability spaces, discrete for unvoiced region and continuous for voiced F0 contour, in a linearly weighted mixture, has been successfully applied to Hidden Markov Model (HMM)-based text-to-speech synthesis. We extend MSD to Chinese Mandarin tone modeling for speech recognition. The tone features and spectral features are further separated into two streams and corresponding stream-dependent models are trained. Finally two separated decision trees are constructed by clustering corresponding stream-dependent HMMs. The MSD and two-stream modeling approach is evaluated on large vocabulary, continuously read and spontaneous speech Mandarin databases and its robustness is further investigated in a noisy, continuous Mandarin digit database with eight types of noises at five different SNRs. Experimental results show that our MSD and two-stream based tone modeling approach can significantly improve the recognition performance over a toneless baseline system. The relative tonal syllable error rate (TSER) reductions are 21.0%, 8.4% and 17.4% for large vocabulary read and spontaneous and noisy digit speech recognition tasks, respectively. Comparing with the conventional system where F0 contours are interpolated in unvoiced segments, our approach improves the recognition performance by 9.8%, 7.4% and 13.3% in relative TSER reductions in the corresponding speech recognition tasks, respectively. (C) 2009 Elsevier B.V. All rights reserved. C1 [Qian, Yao; Soong, Frank K.] Microsoft Res Asia, Beijing 100190, Peoples R China. RP Qian, Y (reprint author), Microsoft Res Asia, Beijing 100190, Peoples R China. EM yaoqian@microsoft.com; frankkps@microsoft.com CR Chang E., 2000, P ICSLP 2000, P983 Chen C.J., 1997, P EUROSPEECH, P1543 Fosler-Lussier E, 1999, SPEECH COMMUN, V29, P137, DOI 10.1016/S0167-6393(99)00035-7 Freij G., 1988, P ICASSP 1988, P135 Hirsch H. G., 2000, ISCA ITRW ASR Hirst D. J., 1993, TRAVAUX I PHONETIQUE, V15, P71 Ho T.-H., 1999, P EUROSPEECH 1999, P883 LEI X, 2006, P INT C SPOK LANG PR, P1237 Lin CH, 1996, SPEECH COMMUN, V18, P175, DOI 10.1016/0167-6393(95)00043-7 Peng G, 2005, SPEECH COMMUN, V45, P49, DOI 10.1016/j.specom.2004.09.004 Qian Y., 2006, P ICASSP, V1, P133 Qian Y, 2007, J ACOUST SOC AM, V121, P2936, DOI 10.1121/1.2717413 QIANG S, 2007, P INTERSPEECH 2007, P1801 Seide F., 2000, P ICSLP 2000, P495 SHINOZAKI T, 2001, P EUROSPEECH AALB, V1, P491 Talkin A. D., 1995, SPEECH CODING SYNTHE, P495 TIAN Y, 2004, P ICASSP, V1, P105 Tokuda K, 2002, IEICE T INF SYST, VE85D, P455 Wang HL, 2006, LECT NOTES COMPUT SC, V4274, P445 WANG HL, 2006, P ICSLP 2006, P1047 Wang HM, 1997, IEEE T SPEECH AUDI P, V5, P195 Xu Y, 2006, ITALIAN J LINGUISTIC, V18, P125 Zhang JS, 2005, SPEECH COMMUN, V46, P440, DOI 10.1016/j.specom.2005.03.010 Zhang L, 2006, LECT NOTES COMPUT SC, V4274, P590 Zhou J., 2004, P ICASSP 2004, P997 NR 25 TC 3 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
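The Multi-Space Distribution referred to above can be sketched for a single state as follows: an observation is either the discrete symbol for an unvoiced frame or a real-valued log-F0, and the state likelihood mixes a discrete unvoiced weight with a weighted Gaussian density over the voiced space. The single-state, single-mixture setting and the parameter values are simplifying, illustrative assumptions.

import math

def msd_likelihood(obs, w_unvoiced, mean_logf0, var_logf0):
    """Likelihood of one F0 observation under a two-space MSD state.

    obs: None for an unvoiced frame, or a log-F0 value for a voiced frame.
    The unvoiced space carries probability mass w_unvoiced; the voiced space
    carries (1 - w_unvoiced) times a Gaussian density over log-F0.
    """
    if obs is None:                      # zero-dimensional (unvoiced) space
        return w_unvoiced
    diff = obs - mean_logf0
    gauss = math.exp(-0.5 * diff * diff / var_logf0) / math.sqrt(2 * math.pi * var_logf0)
    return (1.0 - w_unvoiced) * gauss    # one-dimensional (voiced) space

# Illustrative usage: a high-tone state with mean log-F0 near log(220 Hz).
state = dict(w_unvoiced=0.2, mean_logf0=math.log(220.0), var_logf0=0.02)
print(msd_likelihood(math.log(225.0), **state))  # voiced frame near the mean
print(msd_likelihood(None, **state))             # unvoiced frame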
PD DEC PY 2009 VL 51 IS 12 BP 1169 EP 1179 DI 10.1016/j.specom.2009.08.001 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 560FZ UT WOS:000274888800002 ER PT J AU Chen, JF Phua, K Shue, L Sun, HW AF Chen, Jianfeng Phua, Koksoon Shue, Louis Sun, Hanwu TI Performance evaluation of adaptive dual microphone systems SO SPEECH COMMUNICATION LA English DT Article DE Microphone array; Adaptive beamforming; Noise reduction ID ARRAY HEARING-AIDS; BINAURAL OUTPUT AB In this paper, the performance of the adaptive noise cancellation method is evaluated on several possible dual microphone system (DMS) configurations. Two groups of DMS are taken into consideration with one consisting of two omnidirectional microphones and another involving directional microphones. The properties of these methods are theoretically analyzed under incoherent, coherent and diffuse noise respectively. To further investigate their achievable noise reduction performance in real situations, a series of experiments in simulated and real office environments are carried out. Some recommendations are given at the end for designing and choosing the suitable methods in real applications. (C) 2009 Elsevier B.V. All rights reserved. C1 [Chen, Jianfeng; Phua, Koksoon; Shue, Louis; Sun, Hanwu] Inst Infocomm Res, Singapore 138632, Singapore. RP Phua, K (reprint author), Inst Infocomm Res, 1 Fusionopolis Way,21-01 Connexis S Tower, Singapore 138632, Singapore. EM jfchen@i2r.a-star.edu.sg; ksphua@i2r.a-star.edu.sg; lshue@i2r.a-star.edu.sg; hwsun@i2r.a-star.edu.sg CR ALLEN JB, 1979, J ACOUST SOC AM, V65, P943, DOI 10.1121/1.382599 *AM NAT STAND, 2004, ANSIS3 American National Standard, 1997, METH CALC SPEECH INT Berghe J. V., 1998, Journal of the Acoustical Society of America, V103, DOI 10.1121/1.423066 Bitzer J., 1998, P EUR SIGN PROC C RH, P105 BITZER J, 1999, P IEEE ASSP WORKSH A, V1, P7 BRANDSTEIN M, 2001, MICROPHONE ARRAYS, pCH2 COMPERNOLLE DV, 1990, IEEE T ACOUST SPEECH, V1, P833 COX H, 1986, IEEE T ACOUST SPEECH, V34, P393, DOI 10.1109/TASSP.1986.1164847 Csermak B., 2000, HEARING REV, V7, P56 Desloge JG, 1997, IEEE T SPEECH AUDI P, V5, P529, DOI 10.1109/89.641298 Elko G. W., 1995, P IEEE WORKSH APPL S, P169, DOI 10.1109/ASPAA.1995.482983 Elko G. W., 2000, ACOUSTIC SIGNAL PROC, P181 GREENBERG JE, 1993, J ACOUST SOC AM, V94, P3009, DOI 10.1121/1.407334 GRIFFITHS LJ, 1982, IEEE T ANTENN PROPAG, V30, P27, DOI 10.1109/TAP.1982.1142739 Haykin S., 1996, ADAPTIVE FILTER THEO Hoshuyama O, 1999, IEEE T SIGNAL PROCES, V47, P2677, DOI 10.1109/78.790650 Luo FL, 2002, IEEE T SIGNAL PROCES, V50, P1583 MAJ JB, 2003, P INT WORKSH ACOUST, V1, P171 Maj J.B, 2004, THESIS KATHOLIEKE U Phua KS, 2005, SIGNAL PROCESS, V85, P809, DOI 10.1016/j.sigpro.2004.12.004 Ricketts T, 2002, INT J AUDIOL, V41, P100, DOI 10.3109/14992020209090400 SASAKI T, 1995, MICROPHONE APPARATUS TOMPSON SC, 1999, HEARING REV, V3, P31 Welker DP, 1997, IEEE T SPEECH AUDI P, V5, P543, DOI 10.1109/89.641299 WIDROW B, 1975, P IEEE, V63, P1692, DOI 10.1109/PROC.1975.10036 NR 26 TC 6 Z9 10 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
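As a rough illustration of the adaptive noise cancellation evaluated in the dual-microphone study above, the sketch below implements a basic normalized LMS canceller: the reference (noise) microphone signal is adaptively filtered and subtracted from the primary microphone. The filter length, step size, and synthetic signals are assumptions made only for this example.

import numpy as np

def nlms_cancel(primary, reference, taps=32, mu=0.5, eps=1e-8):
    """Normalized LMS adaptive noise canceller.

    primary:   front-microphone signal (speech plus leaked noise).
    reference: rear-microphone signal (correlated noise, little speech).
    Returns the error signal, i.e. the noise-reduced output.
    """
    w = np.zeros(taps)
    out = np.zeros_like(primary)
    for n in range(taps, len(primary)):
        x = reference[n - taps + 1:n + 1][::-1]  # current and previous reference samples
        y = w @ x                                # estimate of the noise in primary[n]
        e = primary[n] - y                       # cancel it
        w += mu * e * x / (x @ x + eps)          # NLMS weight update
        out[n] = e
    return out

# Illustrative usage: noise reaches the primary mic through a short filter.
rng = np.random.default_rng(0)
noise = rng.standard_normal(8000)
speech = np.sin(2 * np.pi * 0.01 * np.arange(8000))
leak = np.convolve(noise, [0.6, 0.3, 0.1])[:8000]
primary = speech + leak
cleaned = nlms_cancel(primary, noise)
print(np.mean((primary - speech) ** 2), np.mean((cleaned[1000:] - speech[1000:]) ** 2))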
PD DEC PY 2009 VL 51 IS 12 BP 1180 EP 1193 DI 10.1016/j.specom.2009.06.002 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 560FZ UT WOS:000274888800003 ER PT J AU Stouten, V Van Hamme, H AF Stouten, Veronique Van Hamme, Hugo TI Automatic voice onset time estimation from reassignment spectra SO SPEECH COMMUNICATION LA English DT Article DE Voice Onset Time; Speech attributes; Estimation; Reassignment spectrum; Lattice rescoring ID FREQUENCY; PLOSIVES; FEATURES AB We describe an algorithm to automatically estimate the voice onset time (VOT) of plosives. The VOT is the time delay between the burst onset and the start of periodicity when it is followed by a voiced sound. Since the VOT is affected by factors like place of articulation and voicing it can be used for inference of these factors. The algorithm uses the reassignment spectrum of the speech signal, a high resolution time-frequency representation which simplifies the detection of the acoustic events in a plosive. The performance of our algorithm is evaluated on a subset of the TIMIT database by comparison with manual VOT measurements. On average, the difference is smaller than 10 ms for 76.1% and smaller than 20 ms for 91.4% of the plosive segments. We also provide analysis statistics of the VOT of /b/, /d/, /g/, /p/, /t/ and /k/ and experimentally verify some sources of variability. Finally, to illustrate possible applications, we integrate the automatic VOT estimates as an additional feature in an HMM-based speech recognition system and show a small but statistically significant improvement in phone recognition rate. (C) 2009 Elsevier B.V. All rights reserved. C1 [Stouten, Veronique; Van Hamme, Hugo] Katholieke Univ Leuven, ESAT Dept, B-3001 Louvain, Belgium. RP Van Hamme, H (reprint author), Katholieke Univ Leuven, ESAT Dept, Kasteelpk Arenberg 10,PO 2441, B-3001 Louvain, Belgium. EM Hugo.Vanhamme@esat.kuleuven.be RI Van hamme, Hugo/D-6581-2012 FU IWT - SBO [040102]; European Commission [FP6-034362] FX This research was funded by the IWT - SBO project 'SPACE' (Project no. 040102) and by the European Commission under Contract FP6-034362 (ACORNS). CR AUGER F, 1995, IEEE T SIGNAL PROCES, V43, P1068, DOI 10.1109/78.382394 BEYERLEIN P, 1998, P IEEE INT C AC SPEE, V1, P481, DOI 10.1109/ICASSP.1998.674472 Bilmes JA, 2005, IEEE SIGNAL PROC MAG, V22, P89 Borden G. J., 1984, SPEECH SCI PRIMER PH DEMUYNCK K, 2006, P INT 2006 ICSLP 9 I, P1622 Demuynck K., 2001, THESIS K U LEUVEN GAROFOLO J, 1990, SPEECH DISC 1 1 1 Hainsworth S. 
W., 2003, CUEDFINFENGTR459 Jiang JT, 2006, J ACOUST SOC AM, V119, P1092, DOI 10.1121/1.2149841 KAZEMZADEH A, 2006, P ICSLP PITTSB PA US King S, 2000, COMPUT SPEECH LANG, V14, P333, DOI 10.1006/csla.2000.0148 Lee Chin-Hui, 2007, P INT, P1825 LEFEBVRE C, 1990, P ICSLP KOB JAP, P1073 McCrea CR, 2005, J SPEECH LANG HEAR R, V48, P1013, DOI 10.1044/1092-4388(2005/069) Niyogi P, 1998, INT CONF ACOUST SPEE, P13, DOI 10.1109/ICASSP.1998.674355 OBRIEN SM, 1993, INT J MAN MACH STUD, V38, P97, DOI 10.1006/imms.1993.1006 Plante F, 1998, IEEE T SPEECH AUDI P, V6, P282, DOI 10.1109/89.668821 Ramesh P., 1998, P ICSLP SYDN AUSTR Seppi D., 2007, P INTERSPEECH ANTW B, P1805 SONMEZ K, 2000, P ICSLP BEIJ CHIN STOUTEN F, 2006, P INT 2006, P357 Whiteside SP, 2004, J ACOUST SOC AM, V116, P1179, DOI 10.1121/1.1768256 WITTEN IH, 1991, IEEE T INFORM THEORY, V37, P1085, DOI 10.1109/18.87000 Xiao J, 2007, IEEE T SIGNAL PROCES, V55, P2851, DOI 10.1109/TSP.2007.893961 NR 24 TC 12 Z9 12 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD DEC PY 2009 VL 51 IS 12 BP 1194 EP 1205 DI 10.1016/j.specom.2009.06.003 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 560FZ UT WOS:000274888800004 ER PT J AU Pitsikalis, V Maragos, P AF Pitsikalis, Vassilis Maragos, Petros TI Analysis and classification of speech signals by generalized fractal dimension features SO SPEECH COMMUNICATION LA English DT Article DE Feature extraction; Generalized fractal dimensions; Broad class phoneme classification ID MULTIFRACTAL NATURE; STRANGE ATTRACTORS; CHAOTIC SYSTEMS; TIME-SERIES; RECOGNITION; TURBULENCE; DYNAMICS; MODELS AB We explore nonlinear signal processing methods inspired by dynamical systems and fractal theory in order to analyze and characterize speech sounds. A speech signal is at first embedded in a multidimensional phase-space and further employed for the estimation of measurements related to the fractal dimensions. Our goals are to compute these raw measurements in the practical cases of speech signals, to further utilize them for the extraction of simple descriptive features and to address issues on the efficacy of the proposed features to characterize speech sounds. We observe that distinct feature vector elements obtain values or show statistical trends that on average depend on general characteristics such as the voicing, the manner and the place of articulation of broad phoneme classes. Moreover the way that the statistical parameters of the features are altered as an effect of the variation of phonetic characteristics seem to follow some roughly formed patterns. We also discuss some qualitative aspects concerning the linear phoneme-wise correlation between the fractal features and the commonly employed mel-frequency cepstral coefficients (MFCCs) demonstrating phonetic cases of maximal and minimal correlation. In the same context we also investigate the fractal features' spectral content, in terms of the most and least correlated components with the MFCC. Further the proposed methods are examined under the light of indicative phoneme classification experiments. These quantify the efficacy of the features to characterize broad classes of speech sounds. The results are shown to be comparable for some classification scenarios with the corresponding ones of the MFCC features. (C) 2009 Elsevier B.V. All rights reserved. 
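A small sketch of how a correlation-type generalized fractal dimension can be estimated from a time-delay embedding of a speech frame, in the spirit of the analysis above; the embedding parameters, radius range, and restriction to the correlation dimension D2 are simplifying assumptions rather than the paper's feature set.

import numpy as np

def embed(signal, dim=3, delay=5):
    """Time-delay embedding of a 1-D signal into a dim-dimensional phase space."""
    n = len(signal) - (dim - 1) * delay
    return np.stack([signal[i * delay:i * delay + n] for i in range(dim)], axis=1)

def correlation_dimension(signal, dim=3, delay=5, n_radii=10):
    """Estimate D2 as the slope of log C(r) versus log r (Grassberger-Procaccia style)."""
    pts = embed(np.asarray(signal, dtype=float), dim, delay)
    diffs = pts[:, None, :] - pts[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(-1))[np.triu_indices(len(pts), k=1)]
    radii = np.logspace(np.log10(np.percentile(dists, 5)),
                        np.log10(np.percentile(dists, 50)), n_radii)
    corr = np.array([np.mean(dists < r) for r in radii])
    slope = np.polyfit(np.log(radii), np.log(corr + 1e-12), 1)[0]
    return slope

# Illustrative usage: a pure tone traces a low-dimensional orbit, noise does not.
t = np.arange(1000)
print(correlation_dimension(np.sin(0.05 * t)))       # close to 1 (a closed orbit)
print(correlation_dimension(np.random.randn(1000)))  # closer to the embedding dimension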
C1 [Pitsikalis, Vassilis; Maragos, Petros] Natl Tech Univ Athens, Sch Elect & Comp Engn, GR-15773 Athens, Greece. RP Pitsikalis, V (reprint author), Natl Tech Univ Athens, Sch Elect & Comp Engn, Iroon Polytexneiou Str, GR-15773 Athens, Greece. EM vpitsik@cs.ntua.gr; maragos@cs.ntua.gr FU FP6 European research programs HIWIRE; Network of Excellence MUSCLE; 'Protagoras' NTUA research program FX This work was supported in part by the FP6 European research programs HIWIRE and the Network of Excellence MUSCLE and by the 'Protagoras' NTUA research program. CR Abarbanel H. D. I., 1996, ANAL OBSERVED CHAOTI ADEYEMI O, 1997, IEEE T ACOUST SPEECH, P2377 Anderson TW, 2003, INTRO MULTIVARIATE S Kumar A, 1996, J ACOUST SOC AM, V100, P615, DOI 10.1121/1.415886 BADII R, 1985, J STAT PHYS, V40, P725 Banbrook M, 1999, IEEE T SPEECH AUDI P, V7, P1, DOI 10.1109/89.736326 BENZI R, 1984, J PHYS A-MATH GEN, V17, P3521, DOI 10.1088/0305-4470/17/18/021 BERNHARD HP, 1991, 12 INT C PHON SCI AI, P19 DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 Deng L, 1997, SPEECH COMMUN, V22, P93, DOI 10.1016/S0167-6393(97)00018-6 Evertsz C., 1992, CHAOS FRACTALS NEW F GAROFOLO J, 1993, TIMIT ACOUST PHONETI GRASSBERGER P, 1983, PHYSICA D, V9, P189, DOI 10.1016/0167-2789(83)90298-1 Greenwood GW, 1997, BIOSYSTEMS, V44, P161, DOI 10.1016/S0303-2647(97)00056-7 HENTSCHEL HGE, 1983, PHYSICA D, V8, P435, DOI 10.1016/0167-2789(83)90235-X HENTSCHEL HGE, 1983, PHYS REV A, V27, P1266, DOI 10.1103/PhysRevA.27.1266 HERZEL H, 1993, NCVS STATUS PROGR RE, V4, P177 Hirschberg A., 1992, B COMMUNICATION PARL, V2, P7 Howe MS, 2005, P ROY SOC A-MATH PHY, V461, P1005, DOI 10.1098/rspa.2004.1405 HUNT F, 1986, DIMENSIONS ENTROPIES Johnson MT, 2005, IEEE T SPEECH AUDI P, V13, P458, DOI 10.1109/TSA.2005.848885 Kaiser J. F., 1983, VOCAL FOLD PHYSL BIO, P358 Kantz H., 1997, NONLINEAR TIME SERIE Kokkinos I, 2005, IEEE T SPEECH AUDI P, V13, P1098, DOI 10.1109/TSA.2005.852982 Kubin G., 1996, P INT C AC SPEECH SI, V1, P267 LIVESCU K, 2003, 8 EUR C SPEECH COMM, P2529 Mandelbrot B. B., 1982, FRACTAL GEOMETRY NAT MARAGOS P, 1993, IEEE T SIGNAL PROCES, V41, P3024, DOI 10.1109/78.277799 Maragos P, 1999, J ACOUST SOC AM, V105, P1925, DOI 10.1121/1.426738 Maragos P., 1991, P IEEE ICASSP, P417, DOI 10.1109/ICASSP.1991.150365 MENEVEAU C, 1991, J FLUID MECH, V224, P429, DOI 10.1017/S0022112091001830 NARAYANAN SS, 1995, J ACOUST SOC AM, V97, P2511, DOI 10.1121/1.411971 PACKARD NH, 1980, PHYS REV LETT, V45, P712, DOI 10.1103/PhysRevLett.45.712 Papandreou G, 2009, IEEE T AUDIO SPEECH, V17, P423, DOI 10.1109/TASL.2008.2011515 Pitsikalis V., 2003, P EUROSPEECH 2003 GE, P817 Pitsikalis V, 2006, IEEE SIGNAL PROC LET, V13, P711, DOI 10.1109/LSP.2006.879424 PITSIKALIS V, 2002, IEEE T ACOUST SPEECH, P533 Potamianos G, 2004, ISSUES VISUAL AUDIO QUATIERI TF, 1990, P ICASSP 1990 ALB NM, V3, P1551 SAUER T, 1991, J STAT PHYS, V65, P579, DOI 10.1007/BF01053745 Takens F., 1981, DYNAMICAL SYSTEMS TU, V898, P366, DOI DOI 10.1007/BFB0091924 TEAGER HM, 1989, SPEECH PRODUCTION D, V55 TEMAM R, 1993, APPL MATH SCI, V68 Thomas T. J., 1986, Computer Speech and Language, V1, DOI 10.1016/S0885-2308(86)80019-5 Tokuda I, 2001, J ACOUST SOC AM, V110, P3207, DOI 10.1121/1.1413749 TOWNSHEND B, 1991, P INT C AC SPEECH SI, P425, DOI 10.1109/ICASSP.1991.150367 Tritton D. 
J., 1988, PHYS FLUID DYNAMICS, V1st Young S., 2002, HTK BOOK, P3 NR 48 TC 9 Z9 11 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD DEC PY 2009 VL 51 IS 12 BP 1206 EP 1223 DI 10.1016/j.specom.2009.06.005 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 560FZ UT WOS:000274888800005 ER PT J AU Deshmukh, OD Verma, A AF Deshmukh, Om D. Verma, Ashish TI Nucleus-level clustering for word-independent syllable stress classification SO SPEECH COMMUNICATION LA English DT Article DE Syllable stress; Speech analysis; Language learning; Acoustic-phonetics AB This paper presents a word-independent technique for classifying the syllable stress of spoken English words. The proposed technique improves upon the existing word-independent techniques by utilizing the acoustic differences of various syllabic nuclei. Syllables with acoustically similar nuclei are grouped together and a separate stress classifier is trained for each such group. The performance of the proposed group-specific classifiers is analyzed as the number of groups is increased and is also compared with an alternative data-driven clustering based approach. The proposed technique improves the syllable-level accuracy by 5.2% and the word-level accuracy by 1.1%. The corresponding improvements using the data-driven clustering based approach are 0.12% and 0.02%, respectively. (C) 2009 Elsevier B.V. All rights reserved. C1 [Deshmukh, Om D.; Verma, Ashish] IBM India Res Lab, Vasant Kunj Inst Area, New Delhi 110070, India. RP Deshmukh, OD (reprint author), IBM India Res Lab, Vasant Kunj Inst Area, Block C,Plot 4, New Delhi 110070, India. EM odeshmuk@in.ibm.com; va-shish@in.ibm.com CR CHANDEL A, 2007, IEEE AUT SPEECH REC, P711 Duda R. O., 2001, PATTERN CLASSIFICATI FOSLERLUSSIER E, 1999, IEEE AUT SPEECH REC, P16 Garner S. R., 1995, P NZ COMP SCI RES ST, P57 GREENBERG S, 2001, P ISCA WORKSH PROS S, P51 HARRIS KS, 1988, ANN B RES I LOGOPEDI, V22, P53 Imoto K., 2002, P ICSLP, P749 Jenkin K. L., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607466 Kahn D., 1976, THESIS MIT CAMBRIDGE KOTHARI R, 2003, PATTERN RECOGN, V24, P1215 OPPELSTRUP L, 2005, P FONETIK GOT, P51 Ramabhadran B., 2007, IEEE AUT SPEECH REC, P472 SILIPO R, 2000, AUTOMATIC DETECTION Sluijter AMC, 1997, J ACOUST SOC AM, V101, P503, DOI 10.1121/1.417994 Sluijter AMC, 1996, INT C SPOK LANG PROC, V2, P630 Stevens K.N., 1999, ACOUSTIC PHONETICS TEPPERMAN J, 2005, P ICASSP PHIL MAR, P733 VERMA A, 2006, P INT C ACOUST SPEEC, pI1237 Ying G. S., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607932 You K, 2005, IEEE ICCE, P267 NR 20 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD DEC PY 2009 VL 51 IS 12 BP 1224 EP 1233 DI 10.1016/j.specom.2009.06.006 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 560FZ UT WOS:000274888800006 ER PT J AU Engelbrecht, KP Quade, M Moller, S AF Engelbrecht, Klaus-Peter Quade, Michael Moeller, Sebastian TI Analysis of a new simulation approach to dialog system evaluation SO SPEECH COMMUNICATION LA English DT Article DE Evaluation; User simulation; Spoken dialog system; Usability; Prediction model; Optimization AB The evaluation of spoken dialog systems still relies on subjective interaction experiments for quantifying interaction behavior and user-perceived quality. In this paper, we present a simulation approach replacing subjective tests in early system design and evaluation phases. The simulation is based on a model of the system, and a probabilistic model of user behavior. Probabilities for the next user action vary in dependence of system features and user characteristics, as defined by rules. This way, simulations can be conducted before data have been acquired. In order to evaluate the simulation approach, characteristics of simulated interactions are compared to interaction corpora obtained in subjective experiments. As was previously proposed in the literature, we compare interaction parameters for both corpora and calculate recall and precision of user utterances. The results are compared to those from a comparison of real user corpora. While the real corpora are not equal, they are more similar than the simulation is to the real data. However, the simulations can predict differences between system versions and user groups quite well on a relative level. In order to derive further requirements for the model, we conclude with a detailed analysis of utterances missing in the simulated corpus and consider the believability of entire dialogs. (C) 2009 Elsevier B.V. All rights reserved. C1 [Engelbrecht, Klaus-Peter; Moeller, Sebastian] TU Berlin, Deutsch Telekom Labs, Qual & Usabil Lab, D-10587 Berlin, Germany. [Quade, Michael] TU Berlin, DAI Labor, D-10587 Berlin, Germany. RP Engelbrecht, KP (reprint author), TU Berlin, Deutsch Telekom Labs, Qual & Usabil Lab, Ernst Reuter Pl 7, D-10587 Berlin, Germany. EM klaus-peter.engelbrecht@telekom.de; michael.quade@dai-labor.de; sebastian.moeller@telekom.de CR AI H, 2008, P 9 SIGDIAL WORKSH D AI H, 2008, P 46 ANN M ASS COMP Ai H., 2006, P AAAI WORKSH STAT E Anderson JR, 2004, PSYCHOL REV, V111, P1036, DOI 10.1037/0033-295x.111.4.1036 [Anonymous], 2003, P851 ITUT ARAKI M, 1997, ECAI 96, P183 BOHUS D, 2005, P 6 SIGDIAL WORKSH D Card S. K., 1983, PSYCHOL HUMAN COMPUT CHUNG G, 2004, P 42 ANN M ASS COMP Eckert W., 1997, P IEEE WORKSH AUT SP ENGELBRECHT KP, 2008, P INT 2008 BRISB AUS Fraser N. 
M., 1991, Computer Speech and Language, V5, DOI 10.1016/0885-2308(91)90019-M HERMANN F, 2007, P HCI INT 2007 ITO A, 2006, P INT 2006 PITTSB PA *ITU T, 2005, PAR DESCR INT SP P S, V24 IVORY MY, 2000, UCBCSD001105 EECS DE JANARTHANAM S, 2008, P SEMDIAL 2008 LONDI John BE, 2005, IEEE PERVAS COMPUT, V4, P27, DOI 10.1109/MPRV.2005.80 Kieras D.E., 2003, HUMAN COMPUTER INTER, P1191 Lopez-Cozar R, 2003, SPEECH COMMUN, V40, P387, DOI [10.1016/S0167-6393(02)00126-7, 10.1016/S0167-6393902)00126-7] MOLLER S, 2007, P 12 INT C SPEECH CO MOLLER S, 2006, P INT 2006 PITTSB PA Moller S, 2007, COMPUT SPEECH LANG, V21, P26, DOI 10.1016/j.csl.2005.11.003 MOLLER S, 2007, P INT 2007 ANTW BELG Newell A., 1990, UNIFIED THEORIES COG Nielsen J., 1993, USABILITY ENG NORMAN DA, 1981, PSYCHOL REV, V88, P1, DOI 10.1037/0033-295X.88.1.1 Norman D. A., 1983, MENTAL MODELS, P7 PIETQUIN O, 2006, P IEEE INT C MULT EX PIETQUIN O, 2002, P IEEE INT C ACOUST PIETQUIN O, 2009, MACH LEARN, P167 RIESER V, 2006, P INT 2006 PITTSB PA SCHATZMANN J, 2007, P HLT NAACL ROCH NY SCHATZMANN J, 2007, P ASRU KYOT JAP SCHATZMANN J, 2007, P 8 SIGDIAL WORKSH D SCHATZMANN J, 2005, P 6 SIGDIAL WORKSH D SCHEFFLER K, 2001, P NAACL 2001 WORKSH Seneff S, 2002, COMPUT SPEECH LANG, V16, P283, DOI 10.1016/S0885-2308(02)00011-6 Walker M. A., 2000, NAT LANG ENG, V6, P363, DOI 10.1017/S1351324900002503 WALKER MA, 1997, P ACL EACL 35 ANN M NR 40 TC 6 Z9 7 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD DEC PY 2009 VL 51 IS 12 BP 1234 EP 1252 DI 10.1016/j.specom.2009.06.007 PG 19 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 560FZ UT WOS:000274888800007 ER PT J AU Lu, YY Cooke, M AF Lu, Youyi Cooke, Martin TI The contribution of changes in F0 and spectral tilt to increased intelligibility of speech produced in noise SO SPEECH COMMUNICATION LA English DT Article DE Intelligibility; Noise; Speech production; Spectral tilt ID FUNDAMENTAL-FREQUENCY; SPEAKER INTELLIGIBILITY; NORMAL-HEARING; CLEAR SPEECH; RECOGNITION; LISTENERS; ENHANCEMENT; PERCEPTION; ENVIRONMENTS; CHILDREN AB Talkers modify the way they speak in the presence of noise. As well as increases in voice level and fundamental frequency (F0), a flattening of spectral tilt is observed. The resulting "Lombard speech" is typically more intelligible than speech produced in quiet, even when level differences are removed. What is the cause of the enhanced intelligibility of Lombard speech? The current study explored the relative contributions to intelligibility of changes in mean F0 and spectral tilt. The roles of F0 and spectral tilt were assessed by measuring the intelligibility gain of non-Lombard speech whose mean F0 and spectrum were manipulated, both independently and in concert, to simulate those of natural Lombard speech. In the presence of speech-shaped noise, flattening of spectral tilt contributed greatly to the intelligibility gain of noise-induced speech over speech produced in quiet, while an increase in F0 did not have a significant influence. The perceptual effect of spectrum flattening was attributed to its ability to increase the amount of the speech time-frequency plane "glimpsed" in the presence of noise. However, spectral tilt changes alone could not fully account for the intelligibility of Lombard speech. Other changes observed in Lombard speech such as durational modifications may well contribute to intelligibility. (C) 2009 Elsevier B.V.
All rights reserved. C1 [Lu, Youyi] Univ Sheffield, Dept Comp Sci, Sheffield S1 4DP, S Yorkshire, England. [Cooke, Martin] Univ Basque Country, Fac Letras, Language & Speech Lab, Vitoria, Spain. RP Lu, YY (reprint author), Univ Sheffield, Dept Comp Sci, 211 Portobello St, Sheffield S1 4DP, S Yorkshire, England. EM y.lu@dcs.shef.ac.uk; acq05yl@shef.ac.uk FU EU FX The second author acknowledges support from the EU Marie Curie Network "Sound to Sense". The authors would like to thank Hideki Kawahara for providing the Matlab implementation of STRAIGHT v40. CR Assmann PF, 2005, J ACOUST SOC AM, V117, P886, DOI 10.1121/1.1852549 Assmann P.F., 2002, INT C SPOK LANG PROC, P425 Assmann PF, 2008, J ACOUST SOC AM, V124, P3203, DOI 10.1121/1.2980456 Barker J, 2007, SPEECH COMMUN, V49, P402, DOI 10.1016/j.specom.2006.11.003 Boersma P., 1993, P I PHONETIC SCI, V17, P97 BOND ZS, 1994, SPEECH COMMUN, V14, P325, DOI 10.1016/0167-6393(94)90026-4 Cooke M, 2006, J ACOUST SOC AM, V120, P2421, DOI 10.1121/1.2229005 Cooke M, 2006, J ACOUST SOC AM, V119, P1562, DOI 10.1121/1.2166600 COX RM, 1987, J ACOUST SOC AM, V81, P1598, DOI 10.1121/1.394512 DREHER JJ, 1957, J ACOUST SOC AM, V29, P1320, DOI 10.1121/1.1908780 Ferguson SH, 2002, J ACOUST SOC AM, V112, P259, DOI 10.1121/1.1482078 Garnier M, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P2246 GORDONSALANT S, 1986, J ACOUST SOC AM, V80, P1599, DOI 10.1121/1.394324 Hansen JHL, 1996, SPEECH COMMUN, V20, P151, DOI 10.1016/S0167-6393(96)00050-7 Hazan V, 1998, SPEECH COMMUN, V24, P211, DOI 10.1016/S0167-6393(98)00011-9 Hazan V, 2004, J ACOUST SOC AM, V116, P3108, DOI 10.1121/1.1806826 Jones C, 2007, COMPUT SPEECH LANG, V21, P641, DOI 10.1016/j.csl.2007.03.001 JUNQUA JC, 1993, J ACOUST SOC AM, V93, P510, DOI 10.1121/1.405631 Kawahara H, 1997, INT CONF ACOUST SPEE, P1303, DOI 10.1109/ICASSP.1997.596185 KAWAHARA H, 1998, P135 M ACOUST SOC AM, V103, P2776 Kawahara H, 1999, SPEECH COMMUN, V27, P187, DOI 10.1016/S0167-6393(98)00085-5 Krause JC, 2004, J ACOUST SOC AM, V115, P362, DOI 10.1121/1.1635842 Laures JS, 2003, J COMMUN DISORD, V36, P449, DOI 10.1016/S0021-9924(03)00032-7 Lee SH, 2007, Proceedings of the Fourth IASTED International Conference on Signal Processing, Pattern Recognition, and Applications, P287 LU Y, 2009, J ACOUST SOC AM, V126 Lu YY, 2008, J ACOUST SOC AM, V124, P3261, DOI 10.1121/1.2990705 MCLOUGHLIN IV, 1997, P 13 INT C DSP, V2, P591, DOI 10.1109/ICDSP.1997.628419 NIEDERJOHN RJ, 1976, IEEE T ACOUST SPEECH, V24, P277, DOI 10.1109/TASSP.1976.1162824 Pisoni D. B., 1985, ICASSP 85. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (Cat. No. 
85CH2118-8) Pittman AL, 2001, J SPEECH LANG HEAR R, V44, P487, DOI 10.1044/1092-4388(2001/038) RYALLS JH, 1982, J ACOUST SOC AM, V72, P1631, DOI 10.1121/1.388499 Skowronski MD, 2006, SPEECH COMMUN, V48, P549, DOI 10.1016/j.specom.2005.09.003 Sommers MS, 1997, J ACOUST SOC AM, V101, P2278, DOI 10.1121/1.418208 Steeneken HJM, 1999, INT CONF ACOUST SPEE, P2079 Summers W V, 1988, J Acoust Soc Am, V84, P917, DOI 10.1121/1.396660 Tallal P, 1996, SCIENCE, V271, P81, DOI 10.1126/science.271.5245.81 TARTTER VC, 1993, J ACOUST SOC AM, V94, P2437, DOI 10.1121/1.408234 Thomas I.B., 1967, P NUT EL C, V23, P544 THOMAS IB, 1968, J AUDIO ENG SOC, V16, P182 Uchanski RM, 2002, J SPEECH LANG HEAR R, V45, P1027, DOI 10.1044/1092-4388(2002/083) Watson PJ, 2008, AM J SPEECH-LANG PAT, V17, P348, DOI 10.1044/1058-0360(2008/07-0048) Womack BD, 1996, SPEECH COMMUN, V20, P131, DOI 10.1016/S0167-6393(96)00049-0 NR 42 TC 25 Z9 27 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD DEC PY 2009 VL 51 IS 12 BP 1253 EP 1262 DI 10.1016/j.specom.2009.07.002 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 560FZ UT WOS:000274888800008 ER PT J AU Rao, KS Yegnanarayana, B AF Rao, K. Sreenivasa Yegnanarayana, B. TI Duration modification using glottal closure instants and vowel onset points SO SPEECH COMMUNICATION LA English DT Article DE Instants of significant excitation; Group delay function; Hilbert envelope; Linear prediction residual; Vowel onset point; Time scale modification; Duration modification ID TIME-SCALE MODIFICATION; SIGNIFICANT EXCITATION; SPEECH AB This paper proposes a method for duration (time scale) modification using glottal closure instants (GCI, also known as instants of significant excitation) and vowel onset points (VOP). In general, most of the time scale modification methods attempt to vary the duration of speech segments uniformly over all regions. But it is observed that consonant regions and transition regions between a consonant and the following vowel, and between two consonant regions do not vary appreciably with speaking rate. The proposed method implements the duration modification without changing the durations of the transition and consonant regions. Vowel onset points are used to identify the transition and consonant regions. A VOP is the instant at which the onset of the vowel takes place, which corresponds to the transition from a consonant to the following vowel in most cases. The VOPs are computed using the Hilbert envelope of linear prediction (LP) residual. The instants of significant excitation correspond to the instants of glottal closure (epochs) in the case of voiced speech, and to some random excitations, like the onset of burst, in the case of nonvoiced speech. Manipulation of duration is achieved by modifying the duration of the LP residual with the help of instants of significant excitation as pitch markers. The modified residual is used to excite the time-varying filter whose parameters are derived from the original speech signal. Perceptual quality of the synthesized speech is found to be natural. Performance of the proposed method is compared with the method where the duration of speech is modified uniformly over all regions. Samples of speech signals for different modification factors are available for listening at http://sit.iitkgp.ernet.in/~ksrao/result.html. (C) 2009 Elsevier B.V. All rights reserved. C1 [Rao, K.
Sreenivasa] Indian Inst Technol, Sch Informat Technol, Kharagpur 721302, W Bengal, India. [Yegnanarayana, B.] Int Inst Informat Technol, Hyderabad 500032, Andhra Pradesh, India. RP Rao, KS (reprint author), Indian Inst Technol, Sch Informat Technol, Kharagpur 721302, W Bengal, India. EM ksrao@iitkgp.ac.in; yegna@iiit.ac.in CR Deller J. R., 1993, DISCRETE TIME PROCES DIMARINO J, 2001, P IEEE INT C ACOUST Donnellan Olivia, 2003, P 3 IEEE INT C ADV L Flanagan J., 1972, SPEECH ANAL SYNTHESI Gabor D., 1946, Journal of the Institution of Electrical Engineers. III. Radio and Communication Engineering, V93 Gangashetty SV, 2004, PROCEEDINGS OF INTERNATIONAL CONFERENCE ON INTELLIGENT SENSING AND INFORMATION PROCESSING, P159 Hogg RV, 1987, ENG STAT Ilk HG, 2006, SIGNAL PROCESS, V86, P127, DOI 10.1016/j.sigpro.2005.05.006 MAKHOUL J, 1975, P IEEE, V63, P561, DOI 10.1109/PROC.1975.9792 MOULINES E, 1995, SPEECH COMMUN, V16, P175, DOI 10.1016/0167-6393(94)00054-E MOULINES E, 1990, SPEECH COMMUN, V9, P453, DOI 10.1016/0167-6393(90)90021-Z Oppenheim A. V., 1975, DIGITAL SIGNAL PROCE PORTNOFF MR, 1981, IEEE T ACOUST SPEECH, V29, P374, DOI 10.1109/TASSP.1981.1163581 Prasanna S.R.M., 2004, THESIS INDIAN I TECH PRASANNA SRM, 2002, P IEEE INT C ACOUST QUATIERI TF, 1992, IEEE T SIGNAL PROCES, V40, P497, DOI 10.1109/78.120793 Rao KS, 2006, IEEE T AUDIO SPEECH, V14, P972, DOI 10.1109/TSA.2005.858051 SLANEY M, 1996, P IEEE INT C ACOUST SMITS R, 1995, IEEE T SPEECH AUDI P, V3, P325, DOI 10.1109/89.466662 Stevens K.N., 1999, ACOUSTIC PHONETICS NR 20 TC 12 Z9 12 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD DEC PY 2009 VL 51 IS 12 BP 1263 EP 1269 DI 10.1016/j.specom.2009.06.004 PG 7 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 560FZ UT WOS:000274888800009 ER PT J AU Zen, H Tokuda, K Black, AW AF Zen, Heiga Tokuda, Keiichi Black, Alan W. TI Statistical parametric speech synthesis SO SPEECH COMMUNICATION LA English DT Review DE Speech synthesis; Unit selection; Hidden Markov models ID HIDDEN MARKOV-MODELS; SPEAKER ADAPTATION; MAXIMUM-LIKELIHOOD; SYNTHESIS SYSTEM; VOICE CONVERSION; COVARIANCE MATRICES; HMM; RECOGNITION; GENERATION; ALGORITHM AB This review gives a general overview of techniques used in statistical parametric speech synthesis. One instance of these techniques, called hidden Markov model (HMM)-based speech synthesis, has recently been demonstrated to be very effective in synthesizing acceptable speech. This review also contrasts these techniques with the more conventional technique of unit-selection synthesis that has dominated speech synthesis over the last decade. The advantages and drawbacks of statistical parametric synthesis are highlighted and we identify where we expect key developments to appear in the immediate future. (C) 2009 Elsevier B.V. All rights reserved. C1 [Zen, Heiga; Tokuda, Keiichi] Nagoya Inst Technol, Dept Comp Sci & Engn, Showa Ku, Nagoya, Aichi 4668555, Japan. [Zen, Heiga] Toshiba Res Europe Ltd, Cambridge Res Lab, Cambridge CB4 0GZ, England. [Black, Alan W.] Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA 15213 USA. RP Zen, H (reprint author), Nagoya Inst Technol, Dept Comp Sci & Engn, Showa Ku, Gokiso Cho, Nagoya, Aichi 4668555, Japan. 
EM heiga.zen@crl.toshiba.co.uk; tokuda@nitech.ac.jp; awb@cs.cmu.edu FU Ministry of Education, Culture, Sports, Science and Technology (MEXT); Hori information science promotion foundation; JSPS [1880009]; European Community's Seventh Framework Programme [FP7/2007-2013]; US National Science Foundation [0415021] FX The authors would like to thank Drs. Tomoki Toda of the Nara Institute of Science and Technology, Junichi Yamagishi of the University of Edinburgh, and Ranniery Maia of the ATR Spoken Language Communication Research Laboratories for their helpful comments and discussions. We are also grateful to many researchers who provided us with useful information that enabled us to write this review. This work was partly supported by the Ministry of Education, Culture, Sports, Science and Technology (MEXT) e-Society project, the Hori information science promotion foundation, a Grant-in-Aid for Scientific Research (No. 1880009) by JSPS, and the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement 213845 (the EMIME Project). This work was also partly supported by the US National Science Foundation under Grant No. 0415021 "SPICE: Speech Processing Interactive Creation and Evaluation Toolkit for new Languages." Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. CR ABDELHAMID O, 2006, P INT, P1332 Acero A., 1999, P EUROSPEECH, P1047 AKAIKE H, 1974, IEEE T AUTOMAT CONTR, VAC19, P716, DOI 10.1109/TAC.1974.1100705 AKAMINE M, 1998, P ICSLP, P139 ALLAUZEN C, 2004, P 42 M ACL Anastasakos T., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607807 Aylett M., 2008, P LANGTECH BAI Q, 2007, COMMUNICATION Banos E., 2008, 5 JORN TECN HABL, P145 BARROS MJ, 2005, P INTERSPEECH 05 EUR, P2581 Beal M. J., 2003, THESIS U LONDON Bennett C. L., 2005, P INT EUR, P105 Bennett C. L., 2006, P BLIZZ CHALL WORKSH BERRY J, 2008, P AR LINGUISTICSCIRC Beutnagel B, 1999, P JOINT ASA EAA DAEA, P15 Bilmes JA, 2003, COMPUT SPEECH LANG, V17, P213, DOI 10.1016/S0885-2308(03)00010-X Black A., 2003, P EUROSPEECH GEN SWI, P1649 Black A., 2006, P ISCA ITRW MULTILIN BLACK A, 2006, P INT, P1762 Black A., 2000, P ICSLP BEIJ CHIN, P411 BLACK A, 2002, P IEEE SPEECH SYNTH Black A. W., 1997, P EUR, P601 BONAFONTE A, 2008, P LREC Breen A, 1998, P ICSLP, P2735 BULYKO I, 2002, P ICASSP, P461 CABRAL J, 2008, P INT, P1829 Cabral J., 2007, P 6 ISCA SPEECH SYNT, P113 CHOMPHAN S, 2007, P INTERSPEECH 2007, P2849 Clark R. A. J., 2007, P BLIZZ CHALL WORKSH Coorman G, 2000, P INT C SPOK LANG BE, P395 DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1 DENG L, 1992, SIGNAL PROCESS, V27, P65, DOI 10.1016/0165-1684(92)90112-A Deng L, 2006, IEEE T AUDIO SPEECH, V14, P1492, DOI 10.1109/TASL.2006.878265 DINES J, 2001, P ICASSP, P833 Donovan R., 1995, P EUR, P573 Donovan RE, 1998, P ICSLP, P1703 DRUGMAN T, 2009, P ICASSP, P3793 DRUGMAN T, 2008, P BEN EICHNER M, 2001, P IEEE INT C AC SPEE, P829 EICHNER M, 2000, P INT C SPOK LANG PR, P701 EIDE E, 2004, P ISCA SSW5 FARES T, 2008, P CATA, P93 Ferguson J. 
D., 1980, P S APPL HIDD MARK M, P143 Frankel J, 2007, IEEE T AUDIO SPEECH, V15, P246, DOI 10.1109/TASL.2006.876766 Frankel J, 2007, COMPUT SPEECH LANG, V21, P620, DOI 10.1016/j.csl.2007.03.002 Freij G., 1988, P ICASSP 1988, P135 FUJINAGA K, 2001, P ICASSP, P513 Fukada T., 1992, P ICASSP, P137, DOI 10.1109/ICASSP.1992.225953 Gales M. J. F., 1996, CUEDFINFENGTR263 Gales MJF, 1998, COMPUT SPEECH LANG, V12, P75, DOI 10.1006/csla.1998.0043 Gales MJF, 1999, IEEE T SPEECH AUDI P, V7, P272, DOI 10.1109/89.759034 Gales MJF, 2000, IEEE T SPEECH AUDI P, V8, P417, DOI 10.1109/89.848223 GAO BH, 2008, P INT, P2266 Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278 GISH H, 1993, P INT C AC SPEECH SI, P447 GONZALVO X, 2007, P NOLISP, P7 Gonzalvo X., 2007, P 6 ISCA SPEECH SYNT, P362 HASHIMOTO K, 2008, P AUT M ASJ, P251 HEMPTINNE C, 2006, THESIS IDIAP RES I HILL DR, 1995, P AVIOS95 S SAN JOS, P27 HIRAI T, 2004, P ISCA SSW5 Hirose K, 2005, SPEECH COMMUN, V46, P385, DOI 10.1016/j.specom.2005.03.014 Hiroya S, 2004, IEEE T SPEECH AUDI P, V12, P175, DOI 10.1109/TSA.2003.822636 HOMAYOUNPOUR M, 2004, CSI J COMPUT SCI ENG, V2 Hon H, 1998, INT CONF ACOUST SPEE, P293, DOI 10.1109/ICASSP.1998.674425 HUANG X, 1996, P ICSLP 96 PHIL, P2387 Hunt A. J., 1996, P ICASSP 96, P373 Imai S., 1983, ELECTR COMMUN JPN, V66, P10, DOI 10.1002/ecja.4400660203 IRINO T, 2002, P ICSLP, P2545 ISHIMATSU Y, 2001, SP200181 IEICE, V101, P57 ITAKURA F, 1975, IEEE T ACOUST SPEECH, VAS23, P67, DOI 10.1109/TASSP.1975.1162641 IWAHASHI N, 1995, SPEECH COMMUN, V16, P139, DOI 10.1016/0167-6393(94)00051-B IWANO K, 2002, SP200273 IEICE, P11 JENSEN U, 1994, COMPUT SPEECH LANG, V8, P247, DOI 10.1006/csla.1994.1013 Juang BH, 1997, IEEE T SPEECH AUDI P, V5, P257 KARABETSOS S, 2008, P TSD, P349 Karaiskos V., 2008, P BLIZZ CHALL WORKSH Katagiri S., 1991, P IEEE WORKSH NEUR N, P299 KATAOKA S, 2004, P INT, P1205 Kawahara H, 1999, SPEECH COMMUN, V27, P187, DOI 10.1016/S0167-6393(98)00085-5 KAWAI H, 2002, P IEEE SPEECH SYNTH KAWAI H, 2004, P ISCA SSW5 *KDDI R D LAB, 2008, DEV DOWNL SPEECH SYN Kim SJ, 2006, IEICE T INF SYST, VE89D, P1116, DOI 10.1093/ietisy/e89-d.3.1116 Kim SJ, 2006, IEEE T CONSUM ELECTR, V52, P1384, DOI 10.1109/TCE.2006.273160 Kim SJ, 2007, IEICE T INF SYST, VE90D, P378, DOI 10.1093/ietisy/e90-d.1.378 King S., 2008, P INT, P1869 KISHIMOTO Y, 2003, P SPRING M ASJ, P243 Kishore S. P., 2003, P EUROSPEECH 2003 GE, P1317 Koishida K, 2001, IEICE T INF SYST, VE84D, P1427 KOMINEK J, 2006, P BLIZZ CHALL WORKSH Kominek J., 2003, CMULTI03177 KRSTULOVIC S, 2008, P SPEECH PROS, P67 Krstulovic S., 2007, P INT 2007 ANTW BELG, P1897 Kuhn R, 2000, IEEE T SPEECH AUDI P, V8, P695, DOI 10.1109/89.876308 Latorre J, 2006, SPEECH COMMUN, V48, P1227, DOI 10.1016/j.specom.2006.05.003 LATORRE J, 2007, P ICASSP, P1241 LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010 Levinson S. 
E., 1986, Computer Speech and Language, V1, DOI 10.1016/S0885-2308(86)80009-2 LIANG H, 2008, P ICASSP, P4641 Ling Z.H., 2008, P INT, P573 Ling Z.-H, 2007, P BLIZZ CHALL WORKSH LING ZH, 2006, P INT 2006 SEP, P2034 LING ZH, 2008, P ISCSLP, P5 LING ZH, 2008, P IEEE INT C AC SPEE, P3949 LING ZH, 2007, P ICASSP, P1245 Ling Z.-H., 2006, P BLIZZ CHALL WORKSH LU H, 2009, P ICASSP, P4033 Lundgren A., 2005, THESIS ROYAL I TECHN MAIA R, 2008, P ICASSP, P3965 Maia R., 2007, P ISCA SSW6, P131 MAIA R, 2009, P SPRING M ASJ, P311 Maia R., 2003, P EUROSPEECH, P2465 Martincic-Ipsic S., 2006, Journal of Computing and Information Technology - CIT, V14, DOI 10.2498/cit.2006.04.06 MARUME M, 2006, P AUT M ASJ, P185 MASUKO T, 2003, P AUT M ASJ, P209 Masuko T., 1997, P ICASSP, P1611 MATSUDA S, 2003, IEICE T INFORM SY D2, V86, P741 Miyanaga K., 2004, P INTERSPEECH 2004 I, P1437 Mizutani N., 2002, P AUT M ASJ, P241 Morioka Y., 2004, P AUT M ASJ, P325 MOULINES E, 1990, SPEECH COMMUN, V9, P453, DOI 10.1016/0167-6393(90)90021-Z Nakamura K., 2006, P ICASSP, P93 NAKAMURA K, 2007, THESIS NAGOYA I TECH Nakatani N., 2006, P SPECOM, P261 NANKAKU Y, 2003, TECH REP IEICE, V103, P19 Nose T., 2007, P INTERSPEECH 2007 A, P2285 Nose T, 2007, IEICE T INF SYST, VE90D, P1406, DOI 10.1093/ietisy/e90-d.9.1406 NOSE T, 2009, TEICE T INFORM SYS D, V92, P489 Odell J.J., 1995, THESIS U CAMBRIDGE Ogata K., 2006, P INTERSPEECH 2006 I, P1328 OJALA T, 2006, THESIS HELSINKI U TE Okubo T, 2006, IEICE T INF SYST, VE89D, P2775, DOI 10.1093/ietisy/e89-d.11.2775 Olsen PA, 2004, IEEE T SPEECH AUDI P, V12, P37, DOI 10.1109/TSA.2003.819943 OURA K, 2008, P AUT M ASJ, P421 OURA K, 2008, P ISCSLP, P1 OURA K, 2007, P AUT M ASJ, P367 PENNY W, 1998, H MARKOV MODELS EXTE PLUMPE M, 1998, P ICSLP, P2751 POLLET V, 2008, P INT, P1825 Povey D., 2003, THESIS U CAMBRIDGE QIAN Y, 2006, P ISCSLP, P223 QIAN Y, 2008, P ISCSLP, P13 Qian Y., 2008, P INT, P2126 QIAN Y, 2009, P ICASSP, P3781 QIN L, 2006, P INT, P2250 QIN L, 2008, P ICASSP MAR, P3953 RAITIO T, 2008, P INT, P1881 RICHARDS H, 1999, P ICASSP, V1, P357 Rissanen J., 1980, STOCHASTIC COMPLEXIT Ross K., 1994, P ESCA IEEE WORKSH S, P131 ROSTI A, 2003, CUEDFTNFENGTR461 U C Rosti AVI, 2004, COMPUT SPEECH LANG, V18, P181, DOI 10.1016/j.csl.2003.09.004 Rouibia S., 2005, P INT, P2565 Russell M. 
J., 1985, P IEEE INT C AC SPEE, P5 Sagisaka Y., 1992, P ICSLP, P483 Sakai S., 2005, P INT 2005 LISB PORT, P81 Sakti S., 2008, P OR COCOSDA, P215 SCHWARZ G, 1978, ANN STAT, V6, P461, DOI 10.1214/aos/1176344136 Segi H., 2004, P 5 ISCA SPEECH SYNT, P115 SHERPA U, 2008, P OR COCOSDA, P150 SHI Y, 2002, P ICSLP, P2369 Shichiri K., 2002, P ICSLP, P1269 Shinoda K, 2001, IEEE T SPEECH AUDI P, V9, P276, DOI 10.1109/89.906001 Shinoda K., 2000, Journal of the Acoustical Society of Japan (E), V21, DOI 10.1250/ast.21.79 SHINTA Y, 2005, P RES WORKSH HOK AR SILEN H, 2008, P INT, P1853 SONDHI M, 2002, P IEEE SPEECH SYNTH STYLIANOU Y, 1999, P ICASSP, P377 Stylianou Y, 1998, IEEE T SPEECH AUDI P, V6, P131, DOI 10.1109/89.661472 SUN JW, 2009, P ICASSP, P4021 *SVOX AG, 2007, SVOX ANN SVOX PIC RE *SVOX AG, 2008, SVOX REL PIC HIGH QU SYKORA T, 2006, THESIS SLOVAK U TECH Tachibana M, 2006, IEICE T INF SYST, VE89D, P1092, DOI 10.1093/ietisy/e89-d.3.1092 TACHIBANA M, 2008, P ICASSP 2008 APR, P4633 Tachibana M, 2005, IEICE T INF SYST, VE88D, P2484, DOI 10.1093/ietisy/e88-d.11.2484 TACHIWA W, 1999, P SPRING M AC SOC JA, P239 TAKAHASHI JI, 1995, P ICASSP 95, P696 TAKAHASHI T, 2001, P AUT M ASJ OCT, V1, P5 Tamura M., 2001, P ICASSP 2001, P805 TAMURA M, 2005, P ICASSP, P351 TAYLOR P, 2006, P INT, P1758 TAYLOR P, 1999, P EUR, P1531 TIOMKIN S, 2008, P INT, P1841 Toda T., 2004, P 5 ISCA SPEECH SYNT, P31 TODA T, 2009, P ICASSP, P4025 Toda T, 2007, IEEE T AUDIO SPEECH, V15, P2222, DOI 10.1109/TASL.2007.907344 TODA T, 2008, P ICASSP, P3925 Toda T, 2007, IEICE T INF SYST, VE90D, P816, DOI 10.1093/ietisy/e90-d.5.816 Tokuda K., 2008, HMM BASED SPEECH SYN TOKUDA K, 2005, P INT EUR, P77 TOKUDA K, 1995, P ICASSP, P660 Tokuda K, 2002, IEICE T INF SYST, VE85D, P455 Tokuda K., 2002, P IEEE SPEECH SYNTH TOKUDA K, 2000, P ICASSP, V3, P1315 Toth B., 2008, INFOCOMMUNICATIONS, VLXIII, P30 VAINIO M, 2005, P 2 BALT C HLT, P201 VESNICER B, 2004, P TSD2004, P513 Wang C., 2008, P ISCSLP, P129 Watanabe S, 2004, IEEE T SPEECH AUDI P, V12, P365, DOI 10.1109/TSA.2004.828640 Watanabe S., 2007, P IEEE S FDN COMP IN, P383 WATANABE T, 2007, P AUT M ASJ, P209 WEISS C, 2005, P ESSP WOUTERS J, 2000, P ICSLP, P302 Wu Y., 2008, P ICASSP LAS VEG US, P4621 Wu Y. 
J., 2008, P ISCSLP, P9 WU YJ, 2006, P INT, P2046 Wu Y.-J., 2008, P INT, P577 WU YJ, 2006, P ICASSP, P89 WU YJ, 2009, P ICASSP, P4013 [吴义坚 WU Yijian], 2006, [中文信息学报, Journal of Chinese Information Processing], V20, P75 YAMAGISHI J, 2008, P ICASSP, P3957 Yamagishi J., 2008, P BLIZZ CHALL WORKSH Yamagishi J., 2008, P INT 08 BRISB AUSTR, P581 Yamagishi J, 2006, THESIS TOKYO I TECHN Yamagishi J, 2009, IEEE T AUDIO SPEECH, V17, P66, DOI 10.1109/TASL.2008.2006647 Yamagishi J, 2007, IEICE T INF SYST, VE90D, P533, DOI 10.1093/ietisy/e90-d.2.533 Yang J.-H., 2006, P BLIZZ CHALL WORKSH Yoshimura T., 1997, P EUR, P2523 Yoshimura T., 2001, P EUROSPEECH, P2263 Yoshimura T, 1998, P ICSLP, P29 Yoshimura T, 1999, P EUR, P2347 YOUNG S, 2006, H MARKOV MODEL TOOLK YU J, 2007, P ICASSP, P709 YU K, 2009, P ICASSP, P3773 YU ZP, 2008, P ICSP Zen H., 2007, P INT, P2065 ZEN H, 2003, TRSLT0032 ATRSLT ZEN H, 2008, P INT, P1068 Zen H, 2007, P 6 ISCA WORKSH SPEE, P294 Zen H., 2006, P BLIZZ CHALL WORKSH Zen H, 2007, IEICE T INF SYST, VE90D, P825, DOI 10.1093/ietisy/e90-d.5.825 Zen H, 2007, COMPUT SPEECH LANG, V21, P153, DOI 10.1016/j.csl.2006.01.002 Zen H., 2003, P EUR, P3189 Zen H., 2006, P INT, P2274 Zen HG, 2007, IEICE T INF SYST, VE90D, P325, DOI 10.1093/ietisy/e90-d.1.325 ZHANG L, 2009, THESIS U EDINBURGH NR 238 TC 184 Z9 190 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 2009 VL 51 IS 11 BP 1039 EP 1064 DI 10.1016/j.specom.2009.04.004 PG 26 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 498SR UT WOS:000270164000001 ER PT J AU Gonon, G Bimbot, F Gribonval, R AF Gonon, Gilles Bimbot, Frederic Gribonval, Remi TI Probabilistic scoring using decision trees for fast and scalable speaker recognition SO SPEECH COMMUNICATION LA English DT Article DE Speaker recognition; Decision trees; Embedded systems; Resource constraint; Biometric authentication ID VERIFICATION; ALGORITHM AB In the context of fast and low cost speaker recognition, this article investigates several techniques based on decision trees. A new approach is introduced where the trees are used to estimate a score function rather than returning a decision among classes. This technique is developed to approximate the GMM log-likelihood ratio (LLR) score function. On top of this approach, different solutions are derived to improve the accuracy of the proposed trees. The first one studies the quantization of the LLR function to create classification trees on the LLR values. The second one makes use of knowledge on the GMM distribution of the acoustic features in order to build oblique trees. A third extension consists in using a low-complexity score function in each of the tree leaves. Series of comparative experiments are performed on the NIST 2005 speaker recognition evaluation data in order to evaluate the impact of the proposed improvements in terms of efficiency, execution time and algorithmic complexity. Considering a baseline system with an Equal Error Rate (EER) of 9.6% on the NIST 2005 evaluation, the best tree-based configuration achieves an EER of 12.9%, with a computational cost adapted to embedded devices and an execution time suitable for real-time speaker identification. (C) 2009 Elsevier B.V. All rights reserved. C1 [Bimbot, Frederic] IRISA METISS, CNRS, F-35042 Rennes, France. IRISA METISS, INRIA, F-35042 Rennes, France. 
RP Bimbot, F (reprint author), IRISA METISS, CNRS, Campus Univ Beaulieu, F-35042 Rennes, France. EM Gilles.Gonon@irisa.fr; Frederic.Bimbot@irisa.fr; Remi.Gribonval@irisa.fr CR AUCKENTHALER R, 2001, ODYSSEY, P83 BARRAS C, 2003, IEEE INT C AC SPEECH, V2, P49 Bengio S., 2004, ODYSSEY 2004 SPEAK L, P237 Bengio S., 2001, P IEEE INT C AC SPEE, V1, P425 BLOUET R, 2002, THESIS U RENNES 1 BLOUET R, 2001, WORKSH SPEAK OD, P223 Bocchieri E., 1993, ICASSP, V2, P692, DOI 10.1109/ICASSP.1993.319405 Campbell WM, 2002, INT CONF ACOUST SPEE, P161 CAMPBELL WM, 2004, ODYSS SPEAK LANG REC, P41 DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1 Farrell KR, 1994, IEEE T SPEECH AUDI P, V2, P194, DOI 10.1109/89.260362 FRITSCH J, 1996, IEEE INT C AC SPEECH, V2, P837 FURUI S, 1981, IEEE T ACOUST SPEECH, V29, P254, DOI 10.1109/TASSP.1981.1163530 Gales MJF, 1999, IEEE T SPEECH AUDI P, V7, P152, DOI 10.1109/89.748120 Gauvain J., 1994, IEEE T SPEECH AUDIO, V2 GONON G, 2005, 9 EUR C SPEECH COMM, V4, P2661 Lim TS, 2000, MACH LEARN, V40, P203, DOI 10.1023/A:1007608224229 MCLAUGHLIN J, 1999, EUROSPEECH 99, V3, P1215 MOON YS, 2003, WBMA 03, P53 MURTHY SK, 1981, J ARTIF INTELL RES, V2, P1 MURVEIT H, 1994, HLT 94, P393 *NIST, 2005, NIST YEAR 2005 SPEAK Olshen R., 1984, CLASSIFICATION REGRE, V1st Pellom BL, 1998, IEEE SIGNAL PROC LET, V5, P281, DOI 10.1109/97.728467 Reynolds D. A., 2000, DIGITAL SIGNAL PROCE, V10 Schapire RE, 2003, LECT NOTES STAT, V171, P149 WAN V, 2003, INT C AC SPEECH SIGN, V2, P221 Xiang B, 2003, IEEE T SPEECH AUDI P, V11, P447, DOI 10.1109/TSA.2003.815822 XIANG B, 2002, P ICASSP, V1, P681 XIE Y, 2006, WORLD C INT CONTR AU, V2, P9463 XIONG Z, 2005, IEEE ICASSP 05, V1, P625 NR 31 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 2009 VL 51 IS 11 BP 1065 EP 1081 DI 10.1016/j.specom.2009.02.007 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 498SR UT WOS:000270164000002 ER PT J AU van Santen, JPH Prud'hommeaux, ET Black, LM AF van Santen, Jan P. H. Prud'hommeaux, Emily Tucker Black, Lois M. TI Automated assessment of prosody production SO SPEECH COMMUNICATION LA English DT Article DE Prosody; Automated assessment; Acoustic analysis; Speech pathology; Language pathology ID DIAGNOSTIC-OBSERVATION-SCHEDULE; HIGH-FUNCTIONING AUTISM; REVISED ALGORITHMS; NORMAL SPEAKERS; SPEECH; CHILDREN; ENGLISH; COMMUNICATION; REPLICATION; DURATION AB Assessment of prosody is important for diagnosis and remediation of speech and language disorders, for diagnosis of neurological conditions, and for foreign language instruction. Current assessment is largely auditory-perceptual, which has obvious drawbacks; however, automation of assessment faces numerous obstacles. We propose methods for automatically assessing production of lexical stress, focus, phrasing, pragmatic style, and vocal affect. Speech was analyzed from children in six tasks designed to elicit specific prosodic contrasts. The methods involve dynamic and global features, using spectral, fundamental frequency, and temporal information. The automatically computed scores were validated against mean scores from judges who, in all but one task, listened to "prosodic minimal pairs" of recordings, each pair containing two utterances from the same child with approximately the same phonemic material but differing on a specific prosodic dimension, such as stress. 
The judges identified the prosodic categories of the two utterances and rated the strength of their contrast. For almost all tasks, we found that the automated scores correlated with the mean scores approximately as well as the judges' individual scores. Real-time scores assigned during examination - as is fairly typical in speech assessment - correlated substantially less than the automated scores with the mean scores. (C) 2009 Elsevier B.V. All rights reserved. C1 [van Santen, Jan P. H.; Prud'hommeaux, Emily Tucker; Black, Lois M.] Oregon Hlth & Sci Univ, CSLU, Div Biomed Comp Sci BMCS, Sch Med, Beaverton, OR 97006 USA. RP van Santen, JPH (reprint author), Oregon Hlth & Sci Univ, CSLU, Div Biomed Comp Sci BMCS, Sch Med, 20000 NW Walker Rd, Beaverton, OR 97006 USA. EM vansanten@cslu.ogi.edu FU National Institute on Deafness and Other Communicative Disorders [NIDCD 1R01DC007129-01]; National Science Foundation [IIS-0205731]; AutismSpeaks FX We thank Sue Peppe for making available the pictorial stimuli and for granting permission to us for creating new versions of several PEPS-C tasks; Lawrence Shriberg for helpful comments on an earlier draft of the paper, in particular on the LSR section; Rhea Paul for helpful comments on an earlier draft of the paper and for suggesting the Lexical Stress and Pragmatic Style Tasks; the clinical staff at CSLU (Beth Langhorst, Rachel Coulston, and Robbyn Sanger Hahn) and at Yale University (Nancy Fredine, Moira Lewis, Allyson Lee) for data collection; senior programmer Jacques de Villiers for the data collection software and data management architecture; Meg Mitchell and Justin Holguin for speech transcription; and the parents and children for participating in the study. This research was supported by grants from the National Institute on Deafness and Other Communicative Disorders, NIDCD 1R01DC007129-01 and from the National Science Foundation, IIS-0205731, both to van Santen; by a Student Fellowship from AutismSpeaks to Emily Tucker Prud'hommeaux; and by an Innovative Technology for Autism grant from AutismSpeaks to Brian Roark. The views herein are those of the authors and reflect the views neither of the funding agencies nor of any of the individuals acknowledged. 
CR ADAMI AG, 2003, P ICASSP 03 Barlow R., 1972, STAT INFERENCE ORDER BERK S, 1983, J COMMUN DISORD, V16, P49, DOI 10.1016/0021-9924(83)90026-6 Berument SK, 1999, BRIT J PSYCHIAT, V175, P444, DOI 10.1192/bjp.175.5.444 CARDOZO BL, 1968, IEEE T ACOUST SPEECH, VAU16, P159, DOI 10.1109/TAU.1968.1161978 Cohen J, 1960, EDUC PSYCHOL MEAS, V20, P46 DARLEY FL, 1969, J SPEECH HEAR RES, V12, P462 DARLEY FL, 1969, J SPEECH HEAR RES, V12, P246 Dollaghan C, 1998, J SPEECH LANG HEAR R, V41, P1136 Ekman P., 1976, PICTURES FACIAL AFFE *EL, 1993, MULT VOIC PROGR MDVP *ENTR RES LAB INC, 1996, ESPS PROGR A L GILBERG C, 1998, ASPERGER SYNDROME HI, P79 Gotham K, 2007, J AUTISM DEV DISORD, V37, P613, DOI 10.1007/s10803-006-0280-1 Gotham K, 2008, J AM ACAD CHILD PSY, V47, P642, DOI 10.1097/CHI.0b013e31816bffb7 Grabe E., 2002, P SPEECH PROS 2002 C, P343 Grabe E, 2000, J PHONETICS, V28, P161, DOI 10.1006/jpho.2000.0111 HIRSCHBERG J, 2000, FESTSCHRIFT HONOR G Hirschberg J., 1995, P 13 INT C PHON SCI, V2, P36 HOUSE A, 1987, J NEUROL NEUROSUR PS, V50, P910, DOI 10.1136/jnnp.50.7.910 Kent RD, 1996, AM J SPEECH-LANG PAT, V5, P7, DOI DOI 10.1044/1058-0360.0503.07 Klabbers E., 2007, P 6 ISCA WORKSH SPEE, P339 KLATT DH, 1976, J ACOUST SOC AM, V59, P1208, DOI 10.1121/1.380986 Le Dorze G, 1998, FOLIA PHONIATR LOGO, V50, P1, DOI 10.1159/000021444 Lee S, 1999, J ACOUST SOC AM, V105, P1455, DOI 10.1121/1.426686 Lord C, 2000, J AUTISM DEV DISORD, V30, P205, DOI 10.1023/A:1005592401947 Mackey LS, 1997, J SPEECH LANG HEAR R, V40, P349 Marasek K., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607920 McCann J, 2003, INT J LANG COMM DIS, V38, P325, DOI 10.1080/1368282031000154204 MIAO Q, 2006, EFFECTS PROSODIC FAC MILROY L, 1978, SOCIOLINGUISTIC PATT, P19 MURRY T, 1980, J SPEECH HEAR RES, V23, P361 Ofuka E, 2000, SPEECH COMMUN, V32, P199, DOI 10.1016/S0167-6393(00)00009-1 OFUKA E, 1994, ICSLP 1994, P1447 Paul R., 2005, J AUTISM DEV DISORD, V35, P201 Peppe S, 2003, CLIN LINGUIST PHONET, V17, P345, DOI 10.1080/0269920031000079994 Peppe S, 2006, J PRAGMATICS, V38, P1776, DOI 10.1016/j.pragma.2005.07.004 Peppe S, 2007, J SPEECH LANG HEAR R, V50, P1015, DOI 10.1044/1092-4388(2007/071) PLANT G, 1986, STL QPSR, V27, P65 PRUDHOMMEAUX ET, 2008, INT M AUT RES 2008 L ROARK R, 2007, P 2 INT C TECHN AG I *ROYAL I TECHN, 2006, SNACK SOUND TOOLK Scherer KR, 2003, SPEECH COMMUN, V40, P227, DOI 10.1016/S0167-6393(02)00084-5 Shriberg LD, 2001, PEPPER PROGRAMS EXAM Shriberg LD, 2006, J SPEECH LANG HEAR R, V49, P500, DOI 10.1044/1092-4388(2006/038) Shriberg LD, 2003, CLIN LINGUIST PHONET, V17, P549, DOI 10.1080/0269920031000138123 Siegel DJ, 1996, J AUTISM DEV DISORD, V26, P389, DOI 10.1007/BF02172825 SILVERMAN K, 1993, P 1993 EUR, V3, P2169 Sluijter AMC, 1996, J ACOUST SOC AM, V100, P2471, DOI 10.1121/1.417955 SONMEZ K, 1998, P ICSLP 1998 Sutton S, 1998, P INT C SPOK LANG PR, P3221 TVERSKY A, 1977, PSYCHOL REV, V84, P327, DOI 10.1037/0033-295X.84.4.327 van Santen J., 2006, ITALIAN J LINGUISTIC, V18, P161 Van Santen J.P.H., 1994, P INT C SPOK LANG PR, P719 VANSANTEN J, 1999, P EUR 99 BUD HUNG VANSANTEN J, 2002, 4 IEEE WORKSH SPEECH VANSANTEN J, 2008, INT M AUT RES 2008 L VANSANTEN J, 2007, INT M AUT RES 2007 S VANSANTEN J, 2000, INTONATION ANAL MODE, P69 VANSANTEN JPH, 1992, SPEECH COMMUN, V11, P513, DOI 10.1016/0167-6393(92)90027-5 van Santen JPH, 2000, J ACOUST SOC AM, V107, P1012, DOI 10.1121/1.428281 van Santen J. P. 
H., 1993, Computer Speech and Language, V7, DOI 10.1006/csla.1993.1004 WINGFIELD A, 1984, J SPEECH HEAR RES, V7, P128 Zhang Y, 2008, J VOICE, V22, P1, DOI 10.1016/j.jvoice.2006.08.003 2002, DSM 4 TR DIAGNOSTIC NR 65 TC 11 Z9 11 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 2009 VL 51 IS 11 BP 1082 EP 1097 DI 10.1016/j.specom.2009.04.007 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 498SR UT WOS:000270164000003 ER PT J AU Derakhshan, N Akbari, A Ayatollahi, A AF Derakhshan, Nima Akbari, Ahmad Ayatollahi, Ahmad TI Noise power spectrum estimation using constrained variance spectral smoothing and minima tracking SO SPEECH COMMUNICATION LA English DT Article DE Adaptive smoothing; Noise power spectrum estimation; Speech enhancement ID SPEECH ENHANCEMENT; AMPLITUDE ESTIMATOR; DENSITY-ESTIMATION; ENVIRONMENTS; STATISTICS AB In this paper, we propose a new noise estimation algorithm based on tracking the minima of an adaptively smoothed noisy short-time power spectrum (STPS). The heart of the proposed algorithm is a constrained variance smoothing (CVS) filter, which smoothes the noisy STPS independently of the noise level. The proposed smoothing procedure is capable of tracking the non-stationary behavior of the noisy STPS while reducing its variance. The minima of the smoothed STPS are tracked with a low delay and are used to construct voice activity detectors (VAD) in frequency bins. Finally, the noise power spectrum is estimated by averaging the noisy STPS on the noise-only regions. Experiments show that the proposed noise estimation algorithm possesses a very short delay in tracking the non-stationary behavior of the noise. When the proposed algorithm is utilized in a noise reduction system, it exhibits superior performance over the other recently proposed noise estimation algorithms. (C) 2009 Elsevier B.V. All rights reserved. C1 [Derakhshan, Nima; Akbari, Ahmad] Iran Univ Sci & Technol, Res Ctr Informat Technol, Dept Comp Engn, Tehran 16844, Iran. [Derakhshan, Nima; Ayatollahi, Ahmad] Iran Univ Sci & Technol, Dept Elect Engn, Tehran 16844, Iran. RP Derakhshan, N (reprint author), Iran Univ Sci & Technol, Res Ctr Informat Technol, Dept Comp Engn, Tehran 16844, Iran. EM nima_derakhshan@ee.iust.ac.ir CR [Anonymous], 1993, P56 ITUT [Anonymous], 2001, P862 ITUT Cohen I, 2003, IEEE T SPEECH AUDI P, V11, P466, DOI 10.1109/TSA.2003.811544 Cohen I, 2002, IEEE SIGNAL PROC LET, V9, P12, DOI 10.1109/97.988717 DERAKHSHAN N, 2007, P 12 INT C SPEECH CO, P542 Doblinger G., 1995, P 4 EUR C SPEECH COM, P1513 EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550 EPHRAIM Y, 1995, IEEE T SPEECH AUDI P, V3, P251, DOI 10.1109/89.397090 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 Lynch J. F. Jr., 1987, Proceedings: ICASSP 87. 1987 International Conference on Acoustics, Speech, and Signal Processing (Cat.
No.87CH2396-0) Martin R, 2001, IEEE T SPEECH AUDI P, V9, P504, DOI 10.1109/89.928915 Martin R., 1994, P 7 EUR SIGN PROC C, P1182 Martin R, 2006, SIGNAL PROCESS, V86, P1215, DOI 10.1016/j.sigpro.2005.07.037 Paajanen E., 2002, P IEEE WORKSH SPEECH, P77 Rangachari S, 2006, SPEECH COMMUN, V48, P220, DOI 10.1016/j.specom.2005.08.005 Rohdenburg T., 2005, P 9 INT WORKSH AC EC, P169 VARGA A, 1993, SPEECH COMMUN, V12, P247, DOI 10.1016/0167-6393(93)90095-3 NR 17 TC 3 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 2009 VL 51 IS 11 BP 1098 EP 1113 DI 10.1016/j.specom.2009.04.008 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 498SR UT WOS:000270164000004 ER PT J AU Taft, DA Grayden, DB Burkitt, AN AF Taft, D. A. Grayden, D. B. Burkitt, A. N. TI Speech coding with traveling wave delays: Desynchronizing cochlear implant frequency bands with cochlea-like group delays SO SPEECH COMMUNICATION LA English DT Article DE Cochlear implant; Phase; Envelope; Traveling wave delay ID NORMAL-HEARING; RECOGNITION; LISTENERS; STRATEGIES; HUMANS; NOISE; EAR; MAP AB Traveling wave delays are the frequency-dependent delays for sounds along the cochlear partition. In this study, a set of suitable delays was calibrated to cochlear implant users' pitch perception along the implanted electrode array. These delays were then explicitly coded in a cochlear implant speech processing strategy as frequency specific group delays. The envelopes of low frequency filter bands were delayed relative to high frequencies, with amplitude and fine structure unmodified. Incorporating such delays into subjects' own processing strategies in this way produced a significant improvement in speech perception scores in noise. A subsequent investigation indicated that perceptual sensitivity to changes in delay size was low and so accurate delay calibration may not be necessary. It is proposed that contention between broadband envelope cues is reduced when frequency bands are de-synchronized. (C) 2009 Elsevier B.V. All rights reserved. C1 [Taft, D. A.; Grayden, D. B.; Burkitt, A. N.] Bion Ear Inst, Melbourne, Vic 3002, Australia. [Taft, D. A.; Grayden, D. B.; Burkitt, A. N.] Univ Melbourne, Dept Elect & Elect Engn, Melbourne, Vic 3010, Australia. [Taft, D. A.] Univ Melbourne, Dept Otolaryngol, Melbourne, Vic 3010, Australia. RP Taft, DA (reprint author), Bion Ear Inst, 384-388 Albert St, Melbourne, Vic 3002, Australia. EM dtaft@bionicear.org; grayden@unimelb.edu.au; aburkitt@unimelb.edu.au RI Burkitt, Anthony/N-9077-2013 OI Burkitt, Anthony/0000-0001-5672-2772 FU Harold Mitchell Foundation FX This research was supported by The Harold Mitchell Foundation. 
CR ARAI T, 1998, P 1998 IEEE INT C AC, V2, P933, DOI 10.1109/ICASSP.1998.675419 Baumann U, 2006, HEARING RES, V213, P34, DOI 10.1016/j.heares.2005.12.010 BEKESY GV, 1960, WAVE MOTION COCHLEA Blarney P.J., 1996, HEARING RES, V99, P139 Boex C, 2006, JARO-J ASSOC RES OTO, V7, P110, DOI 10.1007/s10162-005-0027-2 Cainer KE, 2008, HEARING RES, V238, P155, DOI 10.1016/j.heares.2007.10.001 DALLOS P, 1992, J NEUROSCI, V12, P4575 Dawson PW, 2007, EAR HEARING, V28, P163, DOI 10.1097/AUD.0b013e3180312651 DONALDSON GS, 1993, J ACOUST SOC AM, V93, P940, DOI 10.1121/1.405454 Dorman MF, 2007, JARO-J ASSOC RES OTO, V8, P234, DOI 10.1007/s10162-007-0071-1 Elberling C, 2007, J ACOUST SOC AM, V122, P2772, DOI 10.1121/1.2783985 Friesen LM, 2001, J ACOUST SOC AM, V110, P1150, DOI 10.1121/1.1381538 Fu QJ, 2001, J ACOUST SOC AM, V109, P1166, DOI 10.1121/1.1344158 GREENWOOD DD, 1990, J ACOUST SOC AM, V87, P2592, DOI 10.1121/1.399052 Healy EW, 2002, J SPEECH LANG HEAR R, V45, P1262, DOI 10.1044/1092-4388(2002/101) Javel E, 2000, HEARING RES, V140, P45, DOI 10.1016/S0378-5955(99)00186-0 Loizou PC, 1998, IEEE SIGNAL PROC MAG, V15, P101, DOI 10.1109/79.708543 MCDERMOTT H, 2009, AUDIOL NEUROTOL S1, V14 MILLER GA, 1955, J ACOUST SOC AM, V27, P338, DOI 10.1121/1.1907526 Plant Kerrie L, 2002, Cochlear Implants Int, V3, P104, DOI 10.1002/cii.56 Rubinstein Jay T, 2004, Curr Opin Otolaryngol Head Neck Surg, V12, P444, DOI 10.1097/01.moo.0000134452.24819.c0 Skinner MW, 2002, EAR HEARING, V23, P207, DOI 10.1097/00003446-200206000-00005 Sridhar Divya, 2006, Audiol Neurootol, V11 Suppl 1, P16, DOI 10.1159/000095609 Stakhovskaya O, 2007, JARO-J ASSOC RES OTO, V8, P220, DOI 10.1007/s10162-007-0076-9 Vandali AE, 2005, J ACOUST SOC AM, V117, P3126, DOI 10.1121/1.1874632 Wilson RH, 2007, J SPEECH LANG HEAR R, V50, P844, DOI 10.1044/1092-4388(2007/059) NR 26 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 2009 VL 51 IS 11 BP 1114 EP 1123 DI 10.1016/j.specom.2009.05.002 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 498SR UT WOS:000270164000005 ER PT J AU Van Segbroeck, M Van Hamme, H AF Van Segbroeck, Maarten Van Hamme, Hugo TI Unsupervised learning of time-frequency patches as a noise-robust representation of speech SO SPEECH COMMUNICATION LA English DT Article DE Acoustic signal analysis; Language acquisition; Matrix factorization; Automatic speech recognition; Noise robustness ID NONNEGATIVE MATRIX FACTORIZATION; REASSIGNMENT; SPECTROGRAM; SEPARATION AB We present a self-learning algorithm using a bottom-up based approach to automatically discover, acquire and recognize the words of a language. First, an unsupervised technique using non-negative matrix factorization (NMF) discovers phone-sized time-frequency patches into which speech can be decomposed. The input matrix for the NMF is constructed for static and dynamic speech features using a spectral representation of both short and long acoustic events. By describing speech in terms of the discovered time-frequency patches, patch activations are obtained which express to what extent each patch is present across time. We then show that speaker-independent patterns appear to recur in these patch activations and how they can be discovered by applying a second NMF-based algorithm on the co-occurrence counts of activation events. 
By providing information about the word identity to the learning algorithm, the retrieved patterns can be associated with meaningful objects of the language. In case of a small vocabulary task, the system is able to learn patterns corresponding to words and subsequently detects the presence of these words in speech utterances. Without the prior requirement of expert knowledge about the speech as is the case in conventional automatic speech recognition, we illustrate that the learning algorithm achieves a promising accuracy and noise robustness. (C) 2009 Elsevier B.V. All rights reserved. C1 [Van Segbroeck, Maarten; Van Hamme, Hugo] Katholieke Univ Leuven, Dept ESAT, B-3001 Louvain, Belgium. RP Van Segbroeck, M (reprint author), Katholieke Univ Leuven, Dept ESAT, Kasteelpk Arenberg 10, B-3001 Louvain, Belgium. EM maarten.vansegbroeck@esat.kuleuven.be; hugo.vanhamme@esat.kuleuven.be RI Van hamme, Hugo/D-6581-2012 FU Institute for the Promotion of Innovation through Science and Technology in Flanders, Belgium; European Commission [FP6-034362] FX This research was funded by the Institute for the Promotion of Innovation through Science and Technology in Flanders, Belgium (I.W.T.-Vlaanderen) and by the European Commission under contract FP6-034362 (ACORNS). CR [Anonymous], 2000, 202050 ETSI ES [Anonymous], 2000, 201108 ETSI ES AUGER F, 1995, IEEE T SIGNAL PROCES, V43, P1068, DOI 10.1109/78.382394 Aversano G., 2001, P 44 IEEE MIDW S CIR, V2, P516, DOI 10.1109/MWSCAS.2001.986241 BAKER JM, 2006, MINDS HIST DEV FUTUR BRUGNARA F, 1993, SPEECH COMMUN, V12, P357, DOI 10.1016/0167-6393(93)90083-W Chomsky N, 2000, NEW HORIZONS STUDY L Ding C., 2006, LBNL60428 Ezzat T., 2007, P INT C SPOK LANG PR, P506 Hainsworth S. W., 2003, CUEDFINFENGTR459 Hermansky H., 2005, P INT, P361 Hermansky H., 1998, P INT C SPOK LANG PR, P1003 HERMANSKY H, 1997, P INT C AC SPEECH SI, V1, P289 Hirsch H. G., 2000, P ISCA ITRW ASR2000, P18 Hoyer PO, 2004, J MACH LEARN RES, V5, P1457 Kingsbury BED, 1998, SPEECH COMMUN, V25, P117, DOI 10.1016/S0167-6393(98)00032-6 Kleinschmidt M., 2003, P EUR, P2573 KODERA K, 1978, IEEE T ACOUST SPEECH, V26, P64, DOI 10.1109/TASSP.1978.1163047 Lee DD, 2001, ADV NEUR IN, V13, P556 Machiraju VR, 2002, J CARDIAC SURG, V17, P20 Martin A. F., 1997, P EUROSPEECH, P1895 MEYER BT, 2008, P INT, P906 O'Grady PD, 2008, NEUROCOMPUTING, V72, P88, DOI 10.1016/j.neucom.2008.01.033 Park A., 2005, P ASRU SAN JUAN PUER, P53 Plante F, 1998, IEEE T SPEECH AUDI P, V6, P282, DOI 10.1109/89.668821 QIAO Y, 2008, P SPRING M AC SOC JA Scharenborg O., 2007, P INT C SPOK LANG PR, P1953 Scharenborg O, 2005, COGNITIVE SCI, V29, P867, DOI 10.1207/s15516709cog0000_37 SIIVOLA V, 2003, P 8 EUR C SPEECH COM, P2293 Smaragdis P, 2007, IEEE T AUDIO SPEECH, V15, P1, DOI 10.1109/TASL.2006.876726 Stouten V, 2008, IEEE SIGNAL PROC LET, V15, P131, DOI 10.1109/LSP.2007.911723 Tyagi V., 2003, P ASRU 03, P399 Van hamme H., 2008, P INT C SPOK LANG PR, P2554 VANHAMME H, 2008, ISCA ITRW WORKSH SPE Virtanen T, 2007, IEEE T AUDIO SPEECH, V15, P1066, DOI 10.1109/TASL.2006.885253 YOUNG S, 1999, HTK BOOK VERSION2 2 NR 36 TC 7 Z9 7 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD NOV PY 2009 VL 51 IS 11 BP 1124 EP 1138 DI 10.1016/j.specom.2009.05.003 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 498SR UT WOS:000270164000006 ER PT J AU Siniscalchi, SM Lee, CH AF Siniscalchi, Sabato Marco Lee, Chin-Hui TI A study on integrating acoustic-phonetic information into lattice rescoring for automatic speech recognition SO SPEECH COMMUNICATION LA English DT Article DE Knowledge based system; Lattice rescoring; Continuous phone recognition; Large vocabulary continuous speech recognition ID HIDDEN MARKOV-MODELS; VERIFICATION AB In this paper, a lattice rescoring approach to integrating acoustic-phonetic information into automatic speech recognition (ASR) is described. Additional information over what is used in conventional log-likelihood based decoding is provided by a bank of speech event detectors that score manner and place of articulation events with log-likelihood ratios that are treated as confidence levels. An artificial neural network (ANN) is then used to transform raw log-likelihood ratio scores into manageable terms for easy incorporation. We refer to the union of the event detectors and the ANN as knowledge module. A goal of this study is to design a generic framework which makes it easier to incorporate other sources of information into an existing ASR system. Another aim is to start investigating the possibility of building a generic knowledge module that can be plugged into an ASR system without being trained on specific data for the given task. To this end, the proposed approach is evaluated on three diverse ASR tasks: continuous phone recognition, connected digit recognition, and large vocabulary continuous speech recognition, but the data-driven knowledge module is trained with a single corpus and used in all three evaluation tasks without further training. Experimental results indicate that in all three cases the proposed rescoring framework achieves better results than those obtained without incorporating the confidence scores provided by the knowledge module. It is interesting to note that the rescoring process is especially effective in correcting utterances with errors in large vocabulary continuous speech recognition, where constraints imposed by the lexical and language models sometimes produce recognition results not strictly observing the underlying acoustic-phonetic properties. (C) 2009 Elsevier B.V. All rights reserved. C1 [Siniscalchi, Sabato Marco] Norwegian Univ Sci & Technol, Dept Elect & Telecommun, NO-7491 Trondheim, Norway. [Lee, Chin-Hui] Georgia Inst Technol, Sch Elect & Comp Engn, Atlanta, GA 30332 USA. RP Siniscalchi, SM (reprint author), Norwegian Univ Sci & Technol, Dept Elect & Telecommun, NO-7491 Trondheim, Norway. EM marco77@iet.ntnu.no; chl@ece.gatech.edu RI Siniscalchi, Sabato/I-3423-2012 FU NSF [IIS-04-27113]; IBM Faculty Award FX This study was partially supported by the NSF Grant IIS-04-27113 and an IBM Faculty Award. The first author would also like to thank Jinyu Li of Georgia Institute of Technology for stimulating discussions. CR BAHL LR, 1986, P ICASSP TOK JAP Bazzi I., 2000, P ICSLP BEIJ CHIN, P401 Bitar N., 1996, P INT C AC SPEECH SI, P29 Bourlard Ha, 1994, CONNECTIONIST SPEECH CHEN B, 2004, P INT S SPOK LANG PR, P925 CHENG I, 2006, P 8 IEEE INT S MULT, P533 DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 EIDE E, 2001, P EUR AALB DENM, P1613 EVERMANN G, 2000, P ICASSP, P1655 Fant G., 1960, ACOUSTIC THEORY SPEE Fiscus J. 
G., 1997, P IEEE WORKSH AUT SP, P347 Frankel J, 2007, COMPUT SPEECH LANG, V21, P620, DOI 10.1016/j.csl.2007.03.002 FU Q, 2006, P INT, P681 Garofolo J. S., 1993, DARPA TIMIT ACOUSTIC Goel V, 2000, COMPUT SPEECH LANG, V14, P115, DOI 10.1006/csla.2000.0138 Hacioglu K., 2004, P ICASSP04 MONTR CAN, P925 HASEGAWA M, 2005, P ICASSP PHIL US Haykin S., 1998, NEURAL NETWORKS COMP, V2nd Hermansky H., 2005, P INT, P361 International Phonetic Association, 1999, HDB INT PHON ASS GUI Jaakkola TS, 1999, ADV NEUR IN, V11, P487 Jiang H, 2006, IEEE T AUDIO SPEECH, V14, P1584, DOI 10.1109/TASL.2006.879805 JUANG BH, 1986, IEEE T INFORM THEORY, V32, P307 Kawahara T, 1998, IEEE T SPEECH AUDI P, V6, P558, DOI 10.1109/89.725322 Kay S. M., 1998, FUNDAMENTALS STAT SI, V2 Kirchhoff K., 1999, THESIS U BIELEFELD G KIRCHHOFF K., 1998, P ICSLP, P891 Koo MW, 2001, IEEE T SPEECH AUDI P, V9, P821 LAUNARY B, 2002, P ICASSP02 ORL, P817 Lee C.-H., 1997, P COST WORKSH SPEECH, P62 Lee C.-H., 2004, P ICSLP, P109 LEE KF, 1989, IEEE T ACOUST SPEECH, V37, P1641, DOI 10.1109/29.46546 LEONARD RG, 1984, P ICSLP SAN DIEG US LEVINSON SE, 1985, P IEEE, V73, P1625, DOI 10.1109/PROC.1985.13344 LI J, 2006, P INT, P2422 Li J, 2005, P ICASSP PHIL US, P837 Li J., 2005, P INT LISB PORT SEP, P3365 Li JY, 2007, IEEE T AUDIO SPEECH, V15, P2393, DOI 10.1109/TASL.2007.906178 Macherey W., 2005, P INT LISB PORT, P2133 MAK B, 2006, P ICASSP, P222 Mangu L., 1999, P EUR C SPEECH COMM, P495 Metze F., 2002, P ICSLP DENV US SEPT, P16 Metze F., 2005, THESIS U KARLSRUHE G Morris J., 2006, P INT, P597 Niyogi P, 2002, J ACOUST SOC AM, V111, P1063, DOI 10.1121/1.1427666 RABINER LR, 1989, P IEEE, V77, P257, DOI 10.1109/5.18626 Richardson M, 2003, SPEECH COMMUN, V41, P511, DOI 10.1016/S0167-6393(03)00031-1 SAKTI S, 2007, P INTERSPEECH, P2117 SCHWARTZ R, 1996, AUTOMATIC SPEECH SPE, P29 Schwarz P., 2006, P ICASSP, P325 Siniscalchi SM, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P517 Stolcke A., 1997, P EUROSPEECH, P163 Stuker S., 2003, P ICASSP HONG KONG C, P144 Sukkar R.A., 1993, P IEEE ICASSP 93, P451 Tang M., 2003, P EUR GEN SWITZ, P2585 Tsao Y., 2005, P INT 05, P1109 Wessel F, 2001, IEEE T SPEECH AUDI P, V9, P288, DOI 10.1109/89.906002 WOODLAND PC, 1994, P IEEE INT C AC SPEE, V2, P125 YOUNG S, 2007, HTK BOOK HTK VERSION NR 59 TC 21 Z9 23 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 2009 VL 51 IS 11 BP 1139 EP 1153 DI 10.1016/j.specom.2009.05.004 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 498SR UT WOS:000270164000007 ER PT J AU Eskenazi, M AF Eskenazi, Maxine TI Special Issue: Spoken Language Technology for Education SO SPEECH COMMUNICATION LA English DT Editorial Material C1 Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA 15213 USA. RP Eskenazi, M (reprint author), Carnegie Mellon Univ, Language Technol Inst, 4619 Newell Simon Hall,5000 Forbes Ave, Pittsburgh, PA 15213 USA. EM max@cmu.edu NR 0 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
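The Siniscalchi and Lee record above describes rescoring lattice arcs by adding confidences derived from attribute-detector log-likelihood ratios. The sketch below is a toy stand-in for that idea: a logistic squash replaces their trained ANN mapping, and the arc structure, weights, and numbers are invented for illustration.

```python
# Minimal sketch, assuming a toy lattice: rescore each arc by adding a weighted
# confidence term derived from attribute-detector log-likelihood ratios (LLRs).
# The sigmoid stands in for the trained ANN mapping described in the abstract.
import math

def confidence(llrs):
    """Map raw detector LLRs to a (0, 1) confidence via a logistic squash."""
    return sum(1.0 / (1.0 + math.exp(-x)) for x in llrs) / len(llrs)

def rescore(arcs, lm_weight=10.0, conf_weight=5.0):
    """arcs: list of dicts with acoustic/LM log scores and detector LLRs."""
    for arc in arcs:
        arc["score"] = (arc["acoustic"]
                        + lm_weight * arc["lm"]
                        + conf_weight * math.log(confidence(arc["llrs"]) + 1e-10))
    return max(arcs, key=lambda a: a["score"])

arcs = [
    {"word": "ship",  "acoustic": -120.0, "lm": -2.1, "llrs": [1.8, 0.9, 2.3]},
    {"word": "sheep", "acoustic": -118.5, "lm": -2.4, "llrs": [-0.7, 0.2, -1.1]},
]
print(rescore(arcs)["word"])
```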
PD OCT PY 2009 VL 51 IS 10 SI SI BP 831 EP 831 DI 10.1016/j.specom.2009.07.001 PG 1 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 485AK UT WOS:000269092500001 ER PT J AU Eskenazi, M AF Eskenazi, Maxine TI An overview of spoken language technology for education SO SPEECH COMMUNICATION LA English DT Article DE Language learning; Automatic speech processing; Speech recognition; Speech synthesis; Spoken dialogue systems; Speech technology for education ID 2ND-LANGUAGE LEARNERS FLUENCY; QUANTITATIVE ASSESSMENT; SPEECH RECOGNITION; JAPANESE; SYSTEM; TUTORS AB This paper reviews research in spoken language technology for education and more specifically for language learning. It traces the history of the domain and then groups main issues in the interaction with the student. It addresses the modalities of interaction and their implementation issues and algorithms. Then it discusses one user population - children - and an application for them. Finally it has a discussion of overall systems. It can be used as an introduction to the field and a source of reference materials. (C) 2009 Elsevier B.V. All rights reserved. C1 Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA 15213 USA. RP Eskenazi, M (reprint author), Carnegie Mellon Univ, Language Technol Inst, 4619 Newell Simon Hall,5000 Forbes Ave, Pittsburgh, PA 15213 USA. EM max@cmu.edu CR Akahane-Yamada R., 1997, Journal of the Acoustical Society of Japan (E), V18 AKAHANEYAMADA R, 1998, P STILL 98 ESCA MARH, P111 AKAHANEYAMADA R, 2004, 18 INT C AC P ICA 20, V3, P2319 ALWAN A, 2007, P IEEE MULT SIGN PRO BACHMAN L, 1990, OXFORD APPL LINGUIST, P216 BADIN P, 1998, P ESCA TUT RES WORKS, P167 Beck J. E., 2004, TECHNOLOGY INSTRUCTI, V2, P61 BERNSTEIN J, 1977, P IEEE ICASSP 77, P244 BERNSTEIN J, 1989, J ACOUST SOC AM S1, V86, pS77, DOI 10.1121/1.2027647 BERNSTEIN J, 1984, P IEEE ICASSP 84 BERNSTEIN J, 1986, EIF SPEC SESS SPEECH BERNSTEIN J, 1996, SPEECH RECOGNITION C Bernstein J., 1990, P INT C SPOK LANG PR, P1185 Bernstein J., 1999, CALICO Journal, V16 BESKOW J, 2000, P ESCA ETRW INSTIL D, P138 BIANCHI D, 2004, P ISCA ITRW INSTIL04, P203 Black A, 2007, P ISCA ITRW SLATE WO BONAVENTURA P, 2000, KONVENS 2000, P225 Bradlow AR, 1997, J ACOUST SOC AM, V101, P2299, DOI 10.1121/1.418276 CHAO C, 2007, P ISCA ITRW SLATE FA COLE R, 1999, P ESCA SOCRATES ETRW COLE RA, 1998, P ESCA ETRW STILL MA, P163 COSI P, 2004, P INSTD ICALL S VER, P207 Cucchiarini C, 2002, J ACOUST SOC AM, V111, P2862, DOI 10.1121/1.1471894 CUCCHIARINI C, 1998, P 5 INT C SPOK LANG, P2619 Cucchiarini C., 2007, P INT 2007 ANTW BELG, P2181 Cucchiarini C, 2000, J ACOUST SOC AM, V107, P989, DOI 10.1121/1.428279 DARCY SM, 2005, P ISCA INT 2005 LISB Delcloque P., 2000, HIST CALL DELMONTE R, 1998, P ESCA ETRW STILL MA, P57 Delmonte R, 2000, SPEECH COMMUN, V30, P145, DOI 10.1016/S0167-6393(99)00043-6 DESTOMBES F, 1993, NATO ASI SERIES COMP, P187 Ehsani F, 2000, SPEECH COMMUN, V30, P167, DOI 10.1016/S0167-6393(99)00042-4 EHSANI F, 1997, P EUR 1997 RHOD GREE, P681 ELLIS NC, 2007, P ISCA ITRW SLATE FA Ellis R., 1997, 2 LANGUAGE ACQUISITI ESKENAZI M, 1996, P INT C SPOK LANG PR FLEGE JE, 1988, LANG LEARN, V38, P365, DOI 10.1111/j.1467-1770.1988.tb00417.x FORBESRILEY K, 2007, P HUM LANG TECHN ANN FORBESRILEY K, 2007, P AFF COMP INT INT A Forbes-Riley K., 2008, P 9 INT C INT TUT SY Franco H., 2000, P INSTILL 2000 DUND, P123 GEROSA M, 2006, P INT C AC SPEECH SI, V1, P393 GEROSA M, 2004, P NLP SPEECH TECHN A, P9 GRANSTROM B, 2004, P ISCA ITRW INSTIL04 
HACKER C, 2007, P IEEE ICASSP HAW Hagen A, 2007, SPEECH COMMUN, V49, P861, DOI 10.1016/j.specom.2007.05.004 Handley Z, 2005, LANG LEARN TECHNOL, V9, P99 HAZAN V, 2005, SPEECH COMM Hazan Valerie L., 1998, P ESCA ETRW STILL98, P119 HILLER S, 1993, SPEECH COMMUN, V13, P463, DOI 10.1016/0167-6393(93)90045-M Hincks R., 2002, TMH QPSR, V44, P153 HINCKS R, 2004, P ISCA ITRW INSTIL V, P63 HOLLINGSED T, 2007, P ISCA ITRW SPEECH L HOUSE D, 1999, P ESCA EUR, V99, P1843 HUANG X, 2004, SPOKEN LANGUAGE PROC ISHI CT, 2000, P ESCA ETRW INSTIL20, P106 Ito A., 2005, P EUROSPEECH, P173 Johnson W. L., 2008, P IAAI 2008 JOHNSON WL, 2004, P I ITSEC 2004 KAWAHARA H, 2006, 4 JOINT M ASA ASJ DE KAWAI G, 1998, P INT C SPOK LANG PR Kim Y., 1997, P EUR, P645 LANGLAIS P, 1998, P ESCA WORKSH SPEECH, P41 LaRocca S., 2000, P ISCA ITRW INSTIL20, P26 LISCOMBE J, 2006, P ISCA INT 2006 PITT LIU Y, 2007, P ISCA ITRW SLATE FA Mak B., 2003, P HLT NAACL, P23 MARTONY J, 1968, AM ANN DEAF, V113, P195 MASSARO, 1998, P SPEECH TECHN LANG, P169 Massaro D. W., 2006, NATURAL INTELLIGENT, P183 Massaro D. W., 2003, P 5 INT C MULT INT V, P172, DOI 10.1145/958432.958466 MASSARO DW, 2006, P 9 INT C SPOK LANG, P825 MCGRAW I, 2007, P ISCA ITRW SLATE FA MERCIER G, 2000, P INSTIL, P145 MICH O, 2004, P ISCA ITRW NLP SPEE, P169 MINEMATSU N, 2007, P ISCA ITRW SPEECH L MOSTOW J, 1994, PROCEEDINGS OF THE TWELFTH NATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOLS 1 AND 2, P785 MOTE N, 2004, P ISCA ITRW INSTIL04, P47 Moustroufas N, 2007, COMPUT SPEECH LANG, V21, P219, DOI 10.1016/j.csl.2006.04.001 NERI A, 2006, P ISCA INT PITTSB PA, P1982 NEUMEYER L, 1996, P INT C SPOK LANG PR NICKERSON RS, 1972, P C SPEECH COMM PROC PEABODY M, 2004, P ISCA ITRW INSTIL04, P173 Precoda K., 2000, P ISCA ITRW INSTIL20, P102 Probst K, 2002, SPEECH COMMUN, V37, P161, DOI 10.1016/S0167-6393(01)00009-7 PRUITT J, 1998, P ESCA ETRW STILL98, P107 QUN L, 2001, P ESCA EUROSPEECH 20, P2671 Raux A., 2002, P ICSLP, P737 Raux Antoine, 2004, P INSTIL ICALL 2004, P147 RHEE SC, 2004, P ISCA INTERSPEECH 2, P1681 Russell M, 1996, P INT C SPOK LANG PR Russell M, 2000, COMPUT SPEECH LANG, V14, P161, DOI 10.1006/csla.2000.0139 Rypa M. E., 1999, CALICO Journal, V16 SENEFF S, 2007, P ISCA ITRW SLATE07, P9 Seneff S., 2007, P NAACL HLT07 ROCH N SENEFF S, 2004, P INSTIL ICALL 2004, P151 STRIK H, 2007, P INT 2007 ANTW NETH Sundstrom A., 1998, P ISCA WORKSH SPEECH, P49 SVENSTER B, 1998, P ESCA ETRW STILL MA, P91 TOWNSHEND B, 1998, P ESCA WORKSH SPEECH, P179 TRONG K, 2005, P ISCA INT LISB, P1345 Tsubota Y., 2002, P ICSLP, P1205 TSUBOTA Y, 2004, P ESCA ETRW INSTIL04, P139 VANLEHN K, 2007, SLATE WORKSH SPEECH WEIGELT LF, 1990, J ACOUST SOC AM, V87, P2729, DOI 10.1121/1.399063 WIK P, 2007, P ISCA ITRW SLATE FA WISE B, INTERACTIVE LITERACY WITT S, 1997, 5 EUR C SPEECH COMM, P22 WITT SM, 1999, THESIS U CAMBRIDGE, P445 Yamashita Y, 2005, IEICE T INF SYST, VE88D, P496, DOI 10.1093/ietisy/e88-d.3.496 YI J, 1998, P ICSLP SYDN AUSTR N YOSHIMURA Y, 2007, P ISCA ITRW SLATE FA Zechner K., 2007, P ISCA ITRW SPEECH L NR 114 TC 26 Z9 26 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD OCT PY 2009 VL 51 IS 10 SI SI BP 832 EP 844 DI 10.1016/j.specom.2009.04.005 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 485AK UT WOS:000269092500002 ER PT J AU Strik, H Truong, K de Wet, F Cucchiarini, C AF Strik, Helmer Truong, Khiet de Wet, Febe Cucchiarini, Catia TI Comparing different approaches for automatic pronunciation error detection SO SPEECH COMMUNICATION LA English DT Article DE Pronunciation error detection; Acoustic-phonetic classification; Computer assisted pronunciation training; Computer assisted language learning AB One of the biggest challenges in designing computer assisted language learning (CALL) applications that provide automatic feedback on pronunciation errors consists in reliably detecting the pronunciation errors at such a detailed level that the information provided can be useful to learners. In our research we investigate pronunciation errors frequently made by foreigners learning Dutch as a second language. In the present paper we focus on the velar fricative /x/ and the velar plosive /k/. We compare four types of classifiers that can be used to detect erroneous pronunciations of these phones: two acoustic-phonetic classifiers (one of which employs Linear Discriminant Analysis (LDA)), a classifier based on cepstral coefficients in combination with LDA, and one based on confidence measures (the so-called Goodness Of Pronunciation score). The best results were obtained for the two LDA classifiers which produced accuracy levels of about 85-93%. (C) 2009 Elsevier B.V. All rights reserved. C1 [Strik, Helmer; Cucchiarini, Catia] Radboud Univ Nijmegen, Ctr Language & Speech Technol, NL-6500 HD Nijmegen, Netherlands. [Truong, Khiet] Univ Twente, Dept Comp Sci, Human Media Interact Grp, NL-7500 AE Enschede, Netherlands. [de Wet, Febe] Univ Stellenbosch, Ctr Language & Speech Technol, ZA-7602 Matieland, South Africa. RP Strik, H (reprint author), Radboud Univ Nijmegen, Ctr Language & Speech Technol, POB 9103, NL-6500 HD Nijmegen, Netherlands. EM h.strik@let.ru.nl; truongkp@ewi.utwen-te.nl; fdw@sun.ac.za; c.cucchiarini@let.ru.nl CR Boersma P., 2001, GLOT INT, V5, P341 Cucchiarini C, 1996, CLIN LINGUIST PHONET, V10, P131, DOI 10.3109/02699209608985167 Cucchiarini C, 2000, J ACOUST SOC AM, V107, P989, DOI 10.1121/1.428279 DEN OS, 1995, P EUR, P825 Mak B., 2003, P HLT NAACL, P23 NERI A, 2006, APPL LINGUIST LANG T, V44, P357 NERI A, 2004, P INSTIL ICALL S, P13 NERI A, 2006, P ISCA INT PITTSB PA, P1982 Neumeyer L, 2000, SPEECH COMMUN, V30, P83, DOI 10.1016/S0167-6393(99)00046-1 TRUONG KP, 2004, THESIS UTRECHT U WEIGELT LF, 1990, J ACOUST SOC AM, V87, P2729, DOI 10.1121/1.399063 Witt S. M., 1999, THESIS U CAMBRIDGE Witt SM, 2000, SPEECH COMMUN, V30, P95, DOI 10.1016/S0167-6393(99)00044-8 Young S., 2000, HTK BOOK VERSION 3 0 NR 14 TC 18 Z9 22 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
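The Strik et al. record above, and several records that follow, rely on the Goodness Of Pronunciation (GOP) confidence measure. As a reminder of the measure rather than a statement of any record's exact implementation, a common formulation (after Witt and Young, 2000, cited in these records) is:

```latex
% Common formulation of the Goodness Of Pronunciation (GOP) score
% (after Witt & Young, 2000); the notation is illustrative only.
\[
\mathrm{GOP}(p) \;=\; \frac{1}{N_p}\,
\left|\, \log \frac{P\left(O^{(p)} \mid p\right)}
                   {\max_{q \in Q} P\left(O^{(p)} \mid q\right)} \right|
\]
% O^(p): acoustic frames aligned to phone p;  Q: phone inventory;
% N_p: number of frames in the segment (duration normalization).
```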
PD OCT PY 2009 VL 51 IS 10 SI SI BP 845 EP 852 DI 10.1016/j.specom.2009.05.007 PG 8 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 485AK UT WOS:000269092500003 ER PT J AU Cucchiarini, C Neri, A Strik, H AF Cucchiarini, Catia Neri, Ambra Strik, Helmer TI Oral proficiency training in Dutch L2: The contribution of ASR-based corrective feedback SO SPEECH COMMUNICATION LA English DT Article DE Computer assisted pronunciation training (CAPT); Corrective feedback; Pronunciation error detection; Goodness of pronunciation; Accent reduction ID FLUENCY AB In this paper, we introduce a system for providing automatically generated corrective feedback on pronunciation errors in Dutch, Dutch-CAPT We describe the architecture of the system paying particular attention to the rationale behind it, to the performance of the error detection algorithm and its relationship to the effectiveness of the corrective feedback provided. It appears that although the system does not achieve 100% accuracy in error detection, learners enjoy using it and the feedback provided is still effective in improving pronunciation errors after only a few hours of use over a period of one month. We discuss which factors may have led to these positive results and argue that it is worthwhile studying how ASR technology could be applied to the training of other speaking skills. (C) 2009 Elsevier B.V. All rights reserved. C1 [Cucchiarini, Catia; Neri, Ambra; Strik, Helmer] Radboud Univ Nijmegen, Dept Linguist, CLST, NL-6525 HT Nijmegen, Netherlands. RP Cucchiarini, C (reprint author), Radboud Univ Nijmegen, Dept Linguist, CLST, Erasmuspl 1, NL-6525 HT Nijmegen, Netherlands. EM C.Cucchiarini@let.ru.nl; ambra.neri@gmail.com; h.strik@let.ru.nl CR CAREY M, 2002, THESIS MACQUARIE U A Cucchiarini C, 2000, J ACOUST SOC AM, V107, P989, DOI 10.1121/1.428279 DeKeyser RM, 2005, LANG LEARN, V55, P1, DOI 10.1111/j.0023-8333.2005.00294.x Egan K. B., 1999, CALICO Journal, V16 Ehsani F., 1998, LANGUAGE LEARNING TE, V2, P45 ELTATAWY M, 2002, WORKING PAPERS TESOL Giuliani D., 2003, Proceedings 3rd IEEE International Conference on Advanced Technologies, DOI 10.1109/ICALT.2003.1215131 HERRON D, 1999, P EUR 99 SEPT, P855 LENNON P, 1990, LANG LEARN, V40, P387, DOI 10.1111/j.1467-1770.1990.tb00669.x Mak B., 2003, P HLT NAACL, P23 Neri A., 2006, IRAL-INT REV APPL LI, V44, P357, DOI 10.1515/IRAL.2006.016 Neri A., 2002, Computer Assisted Language Learning, V15, DOI 10.1076/call.15.5.441.13473 Oostdijk N, 2002, LANG COMPUT, P105 REESER TW, 2001, CALICO REV SCHMIDT RW, 1990, APPL LINGUIST, V11, P129, DOI 10.1093/applin/11.2.129 Van Bael C., 2003, P EUR GEN SWITZ, P1545 Witt S., 1999, THESIS U CAMBRIDGE C Witt SM, 2000, SPEECH COMMUN, V30, P95, DOI 10.1016/S0167-6393(99)00044-8 Young S., 2000, HTK BOOK VERSION 3 0 ZHENG T, 2002, CALICO REV NR 20 TC 8 Z9 8 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2009 VL 51 IS 10 SI SI BP 853 EP 863 DI 10.1016/j.specom.2009.03.003 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 485AK UT WOS:000269092500004 ER PT J AU de Wet, F Van der Walt, C Niesler, TR AF de Wet, F. Van der Walt, C. Niesler, T. R. 
TI Automatic assessment of oral language proficiency and listening comprehension SO SPEECH COMMUNICATION LA English DT Article DE Automatic language proficiency assessment; Large-scale assessment; Rate of speech; Goodness of pronunciation; CALL ID 2ND-LANGUAGE LEARNERS FLUENCY; QUANTITATIVE ASSESSMENT; PRONUNCIATION QUALITY; WORKING-MEMORY; SPEECH AB This paper describes an attempt to automate the large-scale assessment of oral language proficiency and listening comprehension for fairly advanced students of English as a second language. The automatic test is implemented as a spoken dialogue system and consists of a reading as well as a repeating task. Two experiments are described in which different rating criteria were used by human judges. In the first experiment, proficiency was scored globally for each of the two test components. In the second experiment, various aspects of proficiency were evaluated for each section of the test. In both experiments, rate of speech (ROS), goodness of pronunciation (GOP) and repeat accuracy were calculated for the spoken utterances. The correlation between scores assigned by human raters and these three automatically derived measures was determined to assess their suitability as proficiency indicators. Results show that the more specific rating instructions used in the second experiment improved intra-rater agreement, but made little difference to inter-rater agreement. In addition, the more specific rating criteria resulted in a better correlation between the human and the automatic scores for the repeating task, but had almost no impact in the reading task. Overall, the results indicate that, even for the narrow range of proficiency levels observed in the test population, the automatically derived ROS and accuracy scores give a fair indication of oral proficiency. (C) 2009 Elsevier B.V. All rights reserved. C1 [de Wet, F.] Univ Stellenbosch, Ctr Language & Speech Technol, ZA-7600 Stellenbosch, South Africa. [Van der Walt, C.] Univ Stellenbosch, Dept Curriculum Studies, ZA-7600 Stellenbosch, South Africa. [Niesler, T. R.] Univ Stellenbosch, Dept Elect & Elect Engn, ZA-7600 Stellenbosch, South Africa. RP de Wet, F (reprint author), Univ Stellenbosch, Ctr Language & Speech Technol, ZA-7600 Stellenbosch, South Africa. EM fdw@sun.ac.za; cvdwalt@sun.ac.za; trn@dsp.sun.ac.za CR Bejar I. 
I., 2006, P HUM LANG TECHN C N, P216, DOI 10.3115/1220835.1220863 Bernstein J., 2000, P INSTIL2000 INT SPE, P57 Chalhoub-Deville M., 2001, LANGUAGE LEARNING TE, V5, P95 CINCAREK T, 2004, THESIS FRIEDRICHALEX Cucchiarini C, 2002, J ACOUST SOC AM, V111, P2862, DOI 10.1121/1.1471894 Cucchiarini C, 2000, SPEECH COMMUN, V30, P109, DOI 10.1016/S0167-6393(99)00040-0 Cucchiarini C, 2000, J ACOUST SOC AM, V107, P989, DOI 10.1121/1.428279 DANEMAN M, 1991, J PSYCHOLINGUIST RES, V20, P445, DOI 10.1007/BF01067637 DEWET F, 2007, P INT ANTW BELG, P218 ELLIS NC, 1996, Q J EXP PSYCHOL, V49, P34250 Franco H., 2000, P INSTILL 2000 DUND, P123 Fulcher G., 1996, LANG TEST, V13, P208, DOI DOI 10.1177/026553229601300205 FULCHER G, 1997, ENCY LANGUAGE ED LAN, V17, P75 KAWAI G, 1998, P ICSLP SYDN AUSTR Kenyon DM, 2000, MOD LANG J, V84, P85, DOI 10.1111/0026-7902.00054 LISCOMBE JJ, 2007, THESIS COLUMBIA U CO Moustroufas N, 2007, COMPUT SPEECH LANG, V21, P219, DOI 10.1016/j.csl.2006.04.001 NERI A, 2006, P ISCA INT PITTSB PA, P1982 Neumeyer L, 2000, SPEECH COMMUN, V30, P83, DOI 10.1016/S0167-6393(99)00046-1 Payne JS, 2005, LANG LEARN TECHNOL, V9, P35 ROUX JC, 2004, P 4 INT C LANG RES E, V1, P93 Spolsky B., 1995, MEASURED WORDS *STATSOFT, 2008, STAT 8 0 SUNDH S, 2003, THESIS UPPSALA U UPP Upshur J., 1995, ELT J, V49, P3, DOI 10.1093/elt/49.1.3 van der Walt C, 2008, SO AFR LINGUIST APPL, V26, P135, DOI 10.2989/SALALS.2008.26.1.11.426 Wigglesworth G., 1997, LANG TEST, V14, P85, DOI DOI 10.1177/026553229701400105 Witt S., 1999, THESIS U CAMBRIDGE C Witt SM, 2000, SPEECH COMMUN, V30, P95, DOI 10.1016/S0167-6393(99)00044-8 XI X, 2006, ANN M NAT COUNC MEAS Young Steve, 2002, HTK BOOK VERSION 3 2 NR 31 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2009 VL 51 IS 10 SI SI BP 864 EP 874 DI 10.1016/j.specom.2009.03.002 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 485AK UT WOS:000269092500005 ER PT J AU Ohkawa, Y Suzuki, M Ogasawara, H Ito, A Makino, S AF Ohkawa, Yuichi Suzuki, Motoyuki Ogasawara, Hirokazu Ito, Akinori Makino, Shozo TI A speaker adaptation method for non-native speech using learners' native utterances for computer-assisted language learning systems SO SPEECH COMMUNICATION LA English DT Article DE Speaker adaptation; Non-native speech; Computer-assisted language learning; Bilingual speaker AB In recent years, various CALL systems which can evaluate a learner's pronunciation using speech recognition technology have been proposed. In order to evaluate a learner's utterances and point out problems with higher accuracy, speaker adaptation is a promising technology. However, many learners who use the CALL system often have very poor speaking ability in the target language (L2), so conventional speaker adaptation methods have problems because they require the learners' correctly-pronounced L2 utterances for adaptation. In this paper, we propose two new types of speaker adaptation methods for the CALL system. The new methods only require the learners' utterances in their native language (L1) for adapting the acoustic model for L2. The first method is an algorithm to adapt acoustic models using a bilingual speaker's utterances. The speaker-independent acoustic models of L1 and L2 are adapted to the bilingual speaker once, then they are adapted to the learner again using the learner's L1 utterances. 
Using this method, we obtained about 5-point higher phoneme recognition accuracy than the baseline method. The second method is a training algorithm of a set of acoustic models based on speaker adaptive training. It can robustly train bilinguals' models using a few utterances in L1 and L2 uttered by bilingual speakers. Using this method, we obtained about 10-point higher phoneme recognition accuracy than the baseline method. (C) 2009 Elsevier B.V. All rights reserved. C1 [Ohkawa, Yuichi] Tohoku Univ, Grad Sch Educ Informat, Aoba Ku, Sendai, Miyagi 9808576, Japan. [Suzuki, Motoyuki] Univ Tokushima, Fac Engn, Tokushima 7708506, Japan. [Ogasawara, Hirokazu; Ito, Akinori; Makino, Shozo] Tohoku Univ, Grad Sch Engn, Aoba Ku, Sendai, Miyagi 9808579, Japan. RP Ohkawa, Y (reprint author), Tohoku Univ, Grad Sch Educ Informat, Aoba Ku, 27-1 Kawauchi, Sendai, Miyagi 9808576, Japan. EM kuri@ei.tohoku.ac.jp; moto@m.ieice.org; ogasawara@makino.ecei.tohoku.ac.jp; aito@fw.ipsj.or.jp; makino@makino.ecei.tohoku.ac.jp CR Anastasakos T., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607807 Chun D. M., 1995, Journal of Educational Multimedia and Hypermedia, V4 DUNKEL P, 1991, MOD LANG J, V75, P64, DOI 10.2307/329835 HILLER S, 1993, P EUR 199O BERL GERM, P1343 ITO A, 2006, ED TECHNOLOGY RES, V29, P13 Kawai G, 2000, SPEECH COMMUN, V30, P131, DOI 10.1016/S0167-6393(99)00041-2 KWEON O, 2004, ED TECHNOL RES, V27, P9 Lee C. H., 1993, P ICASSP, VII-558 LEE KK, 1994, T INFORMATION PROCES, V35, P1223 LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010 MASHIMO M, 2002, P ICSLP, P293 NAKAMURA N, 2002, SP200220 IEICE OGASAWARA H, 2003, SP2003127 IEICE OHKURA K, 1992, P ICSLP 92, P369 STEIDL S, 2004, P ICSLP, P314 WITT S, 1998, P ICSLP, P1010 WITT S, 1997, LANGUAGE TEACHING LA, P25 Young Steve, 2002, HTK BOOK VERSION 3 2 NR 18 TC 7 Z9 7 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2009 VL 51 IS 10 SI SI BP 875 EP 882 DI 10.1016/j.specom.2009.05.005 PG 8 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 485AK UT WOS:000269092500006 ER PT J AU Zechner, K Higgins, D Xi, XM Williamson, DM AF Zechner, Klaus Higgins, Derrick Xi, Xiaoming Williamson, David M. TI Automatic scoring of non-native spontaneous speech in tests of spoken English SO SPEECH COMMUNICATION LA English DT Article DE Speech scoring; Automatic scoring; Spoken language scoring; Scoring of spontaneous speech; Speaking assessment ID 2ND-LANGUAGE LEARNERS FLUENCY; QUANTITATIVE ASSESSMENT; PRONUNCIATION QUALITY; ALGORITHMS; SCORES AB This paper presents the first version of the SpeechRater(SM) system for automatically scoring non-native spontaneous high-entropy speech in the context of an online practice test for prospective takers of the Test of English as a Foreign Language (R) internet-based test (TOEFL (R) iBT). The system consists of a speech recognizer trained on non-native English speech data, a feature computation module, using speech recognizer output to compute a set of mostly fluency based features, and a multiple regression scoring model which predicts a speaking proficiency score for every test item response, using a subset of the features generated by the previous component. Experiments with classification and regression trees (CART) complement those performed with multiple regression. 
We evaluate the system both on TOEFL Practice data [TOEFL Practice Online (TPO)] as well as on Field Study data collected before the introduction of the TOEFL iBT. Features are selected by test development experts based on both their empirical correlations with human scores as well as on their coverage of the concept of communicative competence. We conclude that while the correlation between machine scores and human scores on TPO (of 0.57) still differs by 0.17 from the inter-human correlation (of 0.74) on complete sets of six items (Pearson r correlation coefficients), the correlation of 0.57 is still high enough to warrant the deployment of the system in a low-stakes practice environment, given its coverage of several important aspects of communicative competence such as fluency, vocabulary diversity, grammar, and pronunciation. Another reason why the deployment of the system in a low-stakes practice environment is warranted is that this system is an initial version of a long-term research and development program where features related to vocabulary, grammar, and content will be added in a later stage when automatic speech recognition performance improves, which can then be easily achieved without a re-design of the system. Exact agreement on single TPO items between our system and human scores was 57.8%, essentially at par with inter-human agreement of 57.2%. Our system has been in operational use to score TOEFL Practice Online Speaking tests since the Fall of 2006 and has since scored tens of thousands of tests. (C) 2009 Elsevier B.V. All rights reserved. C1 [Zechner, Klaus; Higgins, Derrick; Xi, Xiaoming; Williamson, David M.] Educ Testing Serv, Automated Scoring & NLP, Princeton, NJ 08541 USA. RP Zechner, K (reprint author), Educ Testing Serv, Automated Scoring & NLP, Rosedale Rd,MS 11-R, Princeton, NJ 08541 USA. EM kzechner@ets.org; dhiggins@ets.org; xxi@ets.org; dmwilliamson@ets.org CR ATTALI Y, 2005, RR0445 ETS ATTALI Y, 2004, ANN M INT ASS ED ASS Bachman L., 1996, LANGUAGE TESTING PRA Bachman L. F., 1990, FUNDAMENTAL CONSIDER BERNSTEIN J, 2000, P INSTILL2000 DUND S Bernstein J., 1999, PHONEPASS TESTING ST Brieman L, 1984, CLASSIFICATION REGRE Burstein J., 1998, RR9815 ETS CHODOROW M, 2004, RR73 TOEFL ED TEST S Clauser BE, 1997, J EDUC MEAS, V34, P141, DOI 10.1111/j.1745-3984.1997.tb00511.x COHEN J, 1968, PSYCHOL BULL, V70, P213, DOI 10.1037/h0026256 Cucchiarini C, 2002, J ACOUST SOC AM, V111, P2862, DOI 10.1121/1.1471894 Cucchiarini C, 2000, SPEECH COMMUN, V30, P109, DOI 10.1016/S0167-6393(99)00040-0 CUCCHIARINI C, 1997, IEEE AUT SPEECH REC CUCCHIARINI C, 1997, 3 INT S ACQ 2 LANG S Cucchiarini C, 2000, J ACOUST SOC AM, V107, P989, DOI 10.1121/1.428279 FRANCO H, 2000, P INSTILL 2000 INT S Franco H, 2000, SPEECH COMMUN, V30, P121, DOI 10.1016/S0167-6393(99)00045-X Landauer TK, 1997, PSYCHOL REV, V104, P211, DOI 10.1037/0033-295X.104.2.211 Leacock C, 2003, COMPUT HUMANITIES, V37, P389, DOI 10.1023/A:1025779619903 Leacock C., 2004, EXAMENS, V1 North B., 2000, DEV COMMON FRAMEWORK PAGE EB, 1966, PHI DELTA KAPPAN, V47, P238 Rudner L., 2006, J TECHNOLOGY LEARNIN, V4 Steinberg D., 1995, CART TREE STRUCTURED Williamson DM, 1999, J EDUC MEAS, V36, P158, DOI 10.1111/j.1745-3984.1999.tb00552.x Xi X., 2006, TOEFLIBT01 ZECHNER K, 2006, P 2006 C HUM LANG TE 1997, HUB 4 BROADCAST NEWS NR 29 TC 21 Z9 21 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
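The Zechner et al. record above scores responses with a multiple regression model over recognizer-derived (mostly fluency) features and evaluates it by correlation with human raters. The sketch below illustrates that scoring stage on synthetic data; the feature names, dimensionality, and model choice are assumptions, not the operational SpeechRater feature set.

```python
# Minimal sketch, assuming synthetic data: predict a holistic speaking score
# from fluency-style features with multiple regression and report the Pearson r
# between machine and human scores, mirroring the evaluation described above.
import numpy as np
from sklearn.linear_model import LinearRegression
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
n = 200
# Hypothetical features: speaking rate, mean pause length, type-token ratio, etc.
X = rng.normal(size=(n, 4))
human = 3.0 + 0.8 * X[:, 0] - 0.5 * X[:, 1] + 0.3 * X[:, 2] + rng.normal(scale=0.4, size=n)

train, test = slice(0, 150), slice(150, None)
model = LinearRegression().fit(X[train], human[train])
machine = model.predict(X[test])

r, _ = pearsonr(machine, human[test])
print(f"machine-human Pearson r = {r:.2f}")
```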
PD OCT PY 2009 VL 51 IS 10 SI SI BP 883 EP 895 DI 10.1016/j.specom.2009.04.009 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 485AK UT WOS:000269092500007 ER PT J AU Wei, S Hu, GP Hu, Y Wang, RH AF Wei, Si Hu, Guoping Hu, Yu Wang, Ren-Hua TI A new method for mispronunciation detection using Support Vector Machine based on Pronunciation Space Models SO SPEECH COMMUNICATION LA English DT Article DE Automatic speech recognition; Mispronunciation detection; Support Vector Machine; Pronunciation Space Models ID SPEECH RECOGNITION; CONFIDENCE MEASURES AB This paper presents two new ideas for text dependent mispronunciation detection. Firstly, mispronunciation detection is formulated as a classification problem to integrate various predictive features. A Support Vector Machine (SVM) is used as the classifier and the log-likelihood ratios between all the acoustic models and the model corresponding to the given text are employed as features for the classifier. Secondly, Pronunciation Space Models (PSMs) are proposed to enhance the discriminative capability of the acoustic models for pronunciation variations. In PSMs, each phone is modeled with several parallel acoustic models to represent pronunciation variations of that phone at different proficiency levels, and an unsupervised method is proposed for the construction of the PSMs. Experiments on a database consisting of more than 500,000 Mandarin syllables collected from 1335 Chinese speakers show that the proposed methods can significantly outperform the traditional posterior probability based method. The overall recall rates for the 13 most frequently mispronounced phones increase from 17.2%, 7.6% and 0% to 58.3%, 44.3% and 29.5% at three precision levels of 60%, 70%, and 80%, respectively. The improvement is also demonstrated by a subjective experiment with 30 subjects, in which 53.3% of the subjects think the proposed method is better than the traditional one and 23.3% of them think that the two methods are comparable. (C) 2009 Elsevier B.V. All rights reserved. C1 [Wei, Si; Wang, Ren-Hua] Univ Sci & Technol China, Hefei 230026, Anhui, Peoples R China. [Wei, Si; Hu, Guoping; Hu, Yu] iFLYTEK Res, Hefei, Anhui, Peoples R China. RP Wei, S (reprint author), Univ Sci & Technol China, 96 Jinzhai Rd, Hefei 230026, Anhui, Peoples R China. 
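The Wei et al. record above casts mispronunciation detection as classification with an SVM over log-likelihood ratios between all acoustic models and the canonical one. The snippet below is a hedged stand-in for that setup on synthetic features; the data, dimensionality, and kernel settings are invented for illustration and do not reproduce the Pronunciation Space Models.

```python
# Minimal sketch, assuming synthetic LLR features: an SVM separates correct from
# mispronounced tokens, and precision/recall are reported as in the record above.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(2)
n, d = 400, 20                       # tokens x (LLRs against d competing models)
X = rng.normal(size=(n, d))
y = (X[:, 0] + 0.7 * X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_tr, y_tr)
pred = clf.predict(X_te)

print(f"precision = {precision_score(y_te, pred):.2f}, "
      f"recall = {recall_score(y_te, pred):.2f}")
```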
EM siwei@iflytek.com; gphu@iflytek.com; yuhu@iflytek.com; rhw@ustc.edu.cn CR ATAL BS, 1994, J ACOUST SOC AM, P1304 Cucchiarini C., 1998, P 1998 INT C SPOK LA, P1739 DONG B, 2006, P INT S CHIN SPOK LA, P580 Franco H., 1999, P EUROSPEECH, P851 Franco H, 2000, SPEECH COMMUN, V30, P121, DOI 10.1016/S0167-6393(99)00045-X Ito A., 2005, P EUROSPEECH, P173 Jiang H, 2005, SPEECH COMMUN, V45, P455, DOI 10.1016/j.specrom.2004.12.004 Lee L, 1998, IEEE T SPEECH AUDI P, V6, P49, DOI 10.1109/89.650310 Lee L., 1996, P ICASSP LIU Y, 2002, THESIS HONG KONG U S Liu Y, 2003, COMPUT SPEECH LANG, V17, P357, DOI 10.1016/S0885-2308(03)00008-1 Minematsu N., 2004, P ICSLP, P1317 Neumeyer L., 1999, SPEECH COMMUN, V30, P83 ROSE RC, 1995, P INT C AC SPEECH SI, P281 Saraclar M., 2000, THESIS J HOPKINS U Strik H., 2007, P INT 07, P1837 TRUONG K, 2004, THESIS ULTRECHT U NE Vapnik V., 1995, NATURE STAT LEARNING WEGMANN S, 1996, P ICASSP, P339 Wessel F, 2001, IEEE T SPEECH AUDI P, V9, P288, DOI 10.1109/89.906002 Witt S., 1999, THESIS CAMBRIDGE U E Witt SM, 2000, SPEECH COMMUN, V30, P95, DOI 10.1016/S0167-6393(99)00044-8 YOUNG K, 2000, HTK BOOK HTK VERSION Zhang F., 2008, P INT C AC SPEECH SI, P2077 Zhang R., 2001, P 7 EUR C SPEECH COM, P2105 NR 25 TC 20 Z9 25 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2009 VL 51 IS 10 SI SI BP 896 EP 905 DI 10.1016/j.specom.2009.03.004 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 485AK UT WOS:000269092500008 ER PT J AU Handley, Z AF Handley, Zoee TI Is text-to-speech synthesis ready for use in computer-assisted language learning? SO SPEECH COMMUNICATION LA English DT Article DE CALL; Speech synthesis; TTS synthesis; Evaluation AB Text-to-speech (TTS) synthesis, the generation of speech from text input, offers another means of providing spoken language input to learners in Computer-Assisted Language Learning (CALL) environments. Indeed, many potential benefits (ease of creation and editing of speech models, generation of speech models and feedback on demand, etc.) and uses (talking dictionaries, talking texts, dictation, pronunciation training, dialogue partner, etc.) of TTS synthesis in CALL have been put forward. Yet, the use of TTS synthesis in CALL is not widely accepted and only a few applications have found their way onto the market. One potential reason for this is that TTS synthesis has not been adequately evaluated for this purpose. Previous evaluations of TTS synthesis for use in CALL, have only addressed the comprehensibility of TTS synthesis. Yet, CALL places demands on the comprehensibility, naturalness, accuracy, register and expressiveness of the output of TTS synthesis. In this paper, the aforementioned aspects of the quality of the output of four state-of-the-art French TTS synthesis systems are evaluated with respect to their use in the three different roles that TTS synthesis systems may assume within CALL applications, namely: (1) reading machine, (2) pronunciation model and (3) conversational partner [Handley, Z., Hamel, M.-J., 2005. Establishing a methodology for benchmarking speech synthesis for computer-assisted language learning (CALL). Language Learning and Technology Journal 9(3), 99-119. Retrieved from: http://llt.msu.edu/vol9num3/handley/default.html.]. 
The results of this evaluation suggest that the best TTS synthesis systems are ready for use in applications in which they 'add value' to CALL, i.e. exploit the unique capacity of TTS synthesis to generate speech models on demand. An example of such an application is a dialogue partner. In order to fully meet the requirements of CALL, further attention needs to be paid to accuracy and naturalness, in particular at the prosodic level, and expressiveness. (C) 2008 Elsevier B.V. All rights reserved. C1 [Handley, Zoee] Univ Manchester, Sch Comp Sci, Manchester M13 9EP, Lancs, England. RP Handley, Z (reprint author), Univ Nottingham, Learning Sci Res Inst, Exchange Bldg,Jubilee Campus,Wollaton Rd, Nottingham NG7 1BB, England. EM zoe.handley@nottingham.ac.uk CR *AUR, 2002, TALK ME CONV METH VE *BAB TECHN, 2003, BRIGHTSPEECH Bailly G., 2003, P EUR 2003 GEN, P37 Bennett C. L., 2005, P INT EUR, P105 Beutnagel M., 1999, P JOINT M ASA EAA DA Black A. B., 2005, P INT 2005 LISB PORT, P77 BLACK AW, 2000, P ICSLP BEIJ CHIN BLACK AW, 1994, P C COMP LING KYOT J, P983 Campbell N., 1997, PROGR SPEECH SYNTHES, P279 CAMPBELL N, 2006, IEEE T AUDIO SPEECH, V14 CHAPELLE C, 2001, RECALL, V23, P3 Chapelle C., 1998, LANGUAGE LEARNING TE, V2, P22 Chapelle C.A., 2001, COMPUTER APPL 2 LANG COHEN R, 1993, COMPUT EDUC, V21, P25, DOI 10.1016/0360-1315(93)90044-J CONKIE A, 1999, P JOINT M ASA EAA DA DEPIJPER JR, 1997, PROGR SPEECH SYNTHES, P575 DESAINTEXUPERY A, 1999, PETIT PRINCE Dutoit T., 1997, INTRO TEXT TO SPEECH EDGINGTON M, 1997, P EUR 97 RHOD GREEC, P1 EGAN BK, 2000, P INSTIL 2000, P4 Ehsani B.K., 1998, LANGUAGE LEARNING TE, V2, P45 *ELSE, 1999, D11 ELSE FRANCIS AL, 1999, HUMAN FACTORS VOICE, P63 Galliers J. R., 1996, EVALUATING NATURAL L HAMEL MJ, 2003, THESIS UMIST MANCHES HAMEL MJ, 2003, P MICTE2003, V3, P1661 Hamel M.-J., 1998, RECALL, V10, P79 Handley Z, 2005, LANG LEARN TECHNOL, V9, P99 HANDLEY Z, 2006, THESIS U MANCHESTER Henton C., 2002, International Journal of Speech Technology, V5, DOI 10.1023/A:1015416013198 Hincks R., 2002, TMH QPSR, V44, P153 Huang X., 2001, SPOKEN LANGUAGE PROC JOHNSON WL, 2002, P 7 INT C SPOK LANG KELLER E, 2000, P INSTIL U AB DUND D, P109 MERCIER G, 2000, P INSTIL, P145 *MULT, 2005, ELITE DOC PISONI DB, 1987, TEXT SPEECH MITALK S, P151 Polkosky M. D., 2003, International Journal of Speech Technology, V6, DOI 10.1023/A:1022390615396 Raux Antoine, 2004, P INSTIL ICALL 2004, P147 RODMAN RD, 1999, COMPUTER SPEECH TECH Santagiustina M, 1999, J OPT B-QUANTUM S O, V1, P191, DOI 10.1088/1464-4266/1/1/033 SCHMIDTNIELSEN A, 1995, APPL SPEECH TECHNOLO, P195 Schroeter J, 2002, PROCEEDINGS OF THE 2002 IEEE WORKSHOP ON SPEECH SYNTHESIS, P211, DOI 10.1109/WSS.2002.1224411 SENEFF S, 2004, P INSTIL ICALL 2004, P151 SHERWOOD B, 1981, STUDIES LANGUAGE LER, V3, P175 SOBKOWIAK W, 1998, MULTIMEDIA CALL THEO STEVENS V, 1989, TEACHING LANGUAGES C, P31 STRATIL M, 1987, PROGRAM LEARN EDUC T, V24, P309 Stratil M., 1987, Literary & Linguistic Computing, V2, DOI 10.1093/llc/2.2.116 *TMA ASS, 2003, NUANC US ENGL van Bezooijen Renee, 1997, HDB STANDARDS RESOUR, P481 NR 51 TC 9 Z9 9 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD OCT PY 2009 VL 51 IS 10 SI SI BP 906 EP 919 DI 10.1016/j.specom.2008.12.004 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 485AK UT WOS:000269092500009 ER PT J AU Felps, D Bortfeld, H Gutierrez-Osuna, R AF Felps, Daniel Bortfeld, Heather Gutierrez-Osuna, Ricardo TI Foreign accent conversion in computer assisted pronunciation training SO SPEECH COMMUNICATION LA English DT Article DE Voice conversion; Foreign accent; Speaker identity; Computer assisted pronunciation training; Implicit feedback ID PROCESSING TECHNIQUES; VOICE CONVERSION; SPEECH; ENGLISH; SPEAKER; IDENTIFICATION; RECOGNITION; FRAMEWORK; FEATURES; VOWELS AB Learners of a second language practice their pronunciation by listening to and imitating utterances from native speakers. Recent research has shown that choosing a well-matched native speaker to imitate can have a positive impact on pronunciation training. Here we propose a voice-transformation technique that can be used to generate the (arguably) ideal voice to imitate: the own voice of the learner with a native accent. Our work extends previous research, which suggests that providing learners with prosodically corrected versions of their utterances can be a suitable form of feedback in computer assisted pronunciation training. Our technique provides a conversion of both prosodic and segmental characteristics by means of a pitch-synchronous decomposition of speech into glottal excitation and spectral envelope. We apply the technique to a corpus containing parallel recordings of foreign-accented and native-accented utterances, and validate the resulting accent conversions through a series of perceptual experiments. Our results indicate that the technique can reduce foreign accentedness without significantly altering the voice quality properties of the foreign speaker. Finally, we propose a pedagogical strategy for integrating accent conversion as a form of behavioral shaping in computer assisted pronunciation training. (C) 2008 Elsevier B.V. All rights reserved. C1 [Felps, Daniel; Gutierrez-Osuna, Ricardo] Texas A&M Univ, Dept Comp Sci, College Stn, TX 77843 USA. [Bortfeld, Heather] Texas A&M Univ, Dept Psychol, College Stn, TX 77843 USA. RP Gutierrez-Osuna, R (reprint author), Texas A&M Univ, Dept Comp Sci, 3112 TAMU, College Stn, TX 77843 USA. EM dlfelps@cs.tamu.edu; bortfeld@psyc.ta-mu.edu; rgutier@cs.tamu.edu CR Abe M., 1988, P ICASSP, P655 ANISFELD M, 1962, J ABNORM PSYCHOL, V65, P223, DOI 10.1037/h0045060 Arslan LM, 1997, J ACOUST SOC AM, V102, P28, DOI 10.1121/1.419608 ARSLAN LM, 1997, VOICE CONVERSION COD, P1347 ARTHUR B, 1974, LANG SPEECH, V17, P255 *AUR, 2002, TALK ME Bissiri M. P., 2006, P 11 AUSTR INT C SPE, P24 Boersma P., 2007, PRAAT DOING PHONETIC Bongaerts T, 1999, SEC LANG ACQ RES, P133 RYAN EB, 1975, J PERS SOC PSYCHOL, V31, P855, DOI 10.1037/h0076704 CELCEMURCIA M, 1996, TEACHING PRONUNCIATI, V12 CHILDERS DG, 1989, SPEECH COMMUN, V8, P147, DOI 10.1016/0167-6393(89)90041-1 Chun D., 1998, LANGUAGE LEARNING TE, V2, P61 COMPTON AJ, 1963, J ACOUST SOC AM, V35, P1748, DOI 10.1121/1.1918810 Derwing TM, 1998, LANG LEARN, V48, P393, DOI 10.1111/0023-8333.00047 Dijkstra E. W., 1959, NUMER MATH, V1, P269, DOI DOI 10.1007/BF01386390 ESKENAZI M, 1998, P STILL WORKSH SPEEC Eskenazi M., 1999, LANGUAGE LEARNING TE, V2, P62 Fant G., 1960, ACOUSTIC THEORY SPEE GRIFFIN DW, 1984, IEEE T ACOUST SPEECH, V32, P236, DOI 10.1109/TASSP.1984.1164317 Hansen T. 
K., 2006, 4 INT C MULT INF COM, P342 Hincks R., 2003, ReCALL, V15, DOI 10.1017/S0958344003000211 Huckvale M., 2007, P ISCA SPEECH SYNTH, P64 Jilka M., 1998, P ESCA WORKSH SPEECH, P115 Kain A, 1998, INT CONF ACOUST SPEE, P285, DOI 10.1109/ICASSP.1998.674423 KENNY OP, 1998, P 1998 IEEE INT C AC, V1, P573, DOI 10.1109/ICASSP.1998.674495 KEWLEYPORT D, 1994, APPL SPEECH TECHNOLO, P565 Kominek J., 2003, CMU ARCTIC DATABASES KOUNOUDES A, 2002, P INT C AC SPEECH SI, P349 KREIMAN J, 1991, SPEECH COMMUN, V10, P265, DOI 10.1016/0167-6393(91)90016-M LENNEBERG EH, 1967, BIOL FDN LANGUAGE, V16 LEVY M, 1997, COMPUTER ASSISTED LA, V15 Lippi-Green Rosina, 1997, ENGLISH ACCENT LANGU Lyster R, 2001, LANG LEARN, V51, P265, DOI 10.1111/j.1467-1770.2001.tb00019.x MAJOR RC, 2001, FOREIGN ONTOGENY PHY, V9 Makhoul J., 1979, P IEEE INT C AC SPEE, P428 MARKHAM D, 1997, TRAVAUX I LINGUISTIQ MARTIN P, 2004, WINPITCH LTL 2 MULTI MATSUMOT.H, 1973, IEEE T ACOUST SPEECH, VAU21, P428, DOI 10.1109/TAU.1973.1162507 MCALLISTER R, 1998, P SPEECH TECHN LANG, P155 MENZEL W, 2000, AUTOMATIC DETECTION, P49 MOULINES E, 1995, SPEECH COMMUN, V16, P175, DOI 10.1016/0167-6393(94)00054-E MOULINES E, 1990, SPEECH COMMUN, V9, P453, DOI 10.1016/0167-6393(90)90021-Z Munro M. J., 1995, STUDIES 2 LANGUAGE A, V17, P17, DOI 10.1017/S0272263100013735 Munro M. J., 1994, LANG TEST, V11, P253, DOI 10.1177/026553229401100302 MUNRO MJ, 1995, LANG LEARN, V45, P73, DOI 10.1111/j.1467-1770.1995.tb00963.x Murray G., 1999, SYSTEM, V27, P295, DOI 10.1016/S0346-251X(99)00026-3 Nagano K, 1990, 1 INT C SPOK LANG PR, P1169 Neri A., 2003, P 15 INT C PHON SCI, P1157 Neri A., 2002, Computer Assisted Language Learning, V15, DOI 10.1076/call.15.5.441.13473 PAUL DB, 1981, IEEE T ACOUST SPEECH, V29, P786, DOI 10.1109/TASSP.1981.1163643 Peabody M, 2006, LECT NOTES COMPUT SC, V4274, P602 Pelham B. W., 2007, CONDUCTING RES PSYCH Penfield W, 1959, SPEECH BRAIN MECH Pennington M. C., 1999, Computer Assisted Language Learning, V12, DOI 10.1076/call.12.5.427.5693 Probst K, 2002, SPEECH COMMUN, V37, P161, DOI 10.1016/S0167-6393(01)00009-7 REPP BH, 1987, SPEECH COMMUN, V6, P1, DOI 10.1016/0167-6393(87)90065-3 Rogers C.L., 1996, J ACOUST SOC AM, V100, P2725, DOI 10.1121/1.416179 SAMBUR MR, 1975, IEEE T ACOUST SPEECH, VAS23, P176, DOI 10.1109/TASSP.1975.1162664 SCHAIRER KE, 1992, MOD LANG J, V76, P309, DOI 10.2307/330161 SCOVEL T, 1988, ISSUES 2 LANGUAGE RE, P206 Sheffert SM, 2002, J EXP PSYCHOL HUMAN, V28, P1447, DOI 10.1037//0096-1523.28.6.1447 *SPEEDLINGU, 2007, GENEVALOGIC *SPHINX, 2001, SPHINXTRAIN BUILD AC Sundermann D., 2003, P IEEE WORKSH AUT SP, P676 Sundstrom A., 1998, P ISCA WORKSH SPEECH, P49 TANG M, 2001, VOICE TRANSFORMATION Tenenbaum JB, 2000, SCIENCE, V290, P2319, DOI 10.1126/science.290.5500.2319 TRAUNMULLER H, 1994, PHONETICA, V51, P170 TURK O, 2005, DONOR SELECTION VOIC Turk O, 2006, COMPUT SPEECH LANG, V20, P441, DOI 10.1016/j.csl.2005.06.001 VANLANCKER D, 1985, J PHONETICS, V13, P19 VIERUDIMULESCU B, 2005, P ISCA WORKSH PLAST, P66 Wachowicz K. A., 1999, CALICO Journal, V16 WATSON CS, 1989, VOLTA REV, V91, P29 YAN Q, 2004, P IEEE INT C AC SPEE, P637 YOUNG SJ, 1993, HTK HIDDEN MARKOV MO, P153 NR 77 TC 14 Z9 14 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
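The Felps, Bortfeld and Gutierrez-Osuna record above rests on decomposing speech into a spectral envelope and an excitation so the two can be recombined across speakers. The sketch below shows a plain frame-based LPC version of that separation on a synthetic signal; the frame sizes, LPC order, and librosa-based pipeline are assumptions and not the authors' pitch-synchronous method.

```python
# Minimal sketch, assuming a synthetic mono signal: per-frame LPC spectral
# envelopes are estimated and divided out of the short-time spectrum, leaving an
# excitation-like residual, a stand-in for the decomposition described above.
import numpy as np
import librosa
from scipy.signal import freqz

sr = 16000
t = np.arange(sr) / sr                                     # one second of audio
y = np.sin(2 * np.pi * 220 * t) + 0.3 * np.sin(2 * np.pi * 440 * t)
y += 0.01 * np.random.default_rng(0).standard_normal(sr)   # avoid ill-conditioned LPC

n_fft, hop, order = 512, 160, 18
S = librosa.stft(y, n_fft=n_fft, hop_length=hop)           # complex short-time spectrum
frames = librosa.util.frame(y, frame_length=n_fft, hop_length=hop)

envelopes = []
for i in range(min(frames.shape[1], S.shape[1])):
    a = librosa.lpc(frames[:, i] * np.hanning(n_fft), order=order)
    _, h = freqz([1.0], a, worN=n_fft // 2 + 1)             # 1/A(z): LPC envelope
    envelopes.append(np.abs(h))
E = np.array(envelopes).T                                   # freq bins x frames

excitation = np.abs(S[:, :E.shape[1]]) / np.maximum(E, 1e-8)
print(E.shape, excitation.shape)
```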
PD OCT PY 2009 VL 51 IS 10 SI SI BP 920 EP 932 DI 10.1016/j.specom.2008.11.004 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 485AK UT WOS:000269092500010 ER PT J AU Bissiri, MP Pfitzinger, HR AF Bissiri, Maria Paola Pfitzinger, Hartmut R. TI Italian speakers learn lexical stress of German morphologically complex words SO SPEECH COMMUNICATION LA English DT Article DE CALL; Computer Assisted Language Learning; Prosody modification; Intonation; Local speech rate; Intensity; Performance assessment; Lexical stress ID FEEDBACK; RECASTS AB Italian speakers tend to stress the second component of German morphologically complex words such as compounds and prefix verbs even if the first component is lexically stressed. To improve their prosodic phrasing an automatic pronunciation teaching method was developed based on auditory feedback of prosodically corrected utterances in the learners' own voices. Basically, the method copies contours of F0, local speech rate, and intensity from reference utterances of a German native speaker to the learners' speech signals. It also adds emphasis to the stress position in order to help the learners better recognise the correct pronunciation and identify their errors. A perception test with German native speakers revealed that manipulated utterances significantly better reflect lexical stress than the corresponding original utterances. Thus, two groups of Italian learners of German were provided with different feedback during a training session, one group with manipulated utterances in their individual voices and the other with correctly pronounced original utterances in the teacher's voice. Afterwards, both groups produced the same sentences again and German native speakers judged the resulting utterances. Resynthesised stimuli, especially with emphasised stress, were found to be a more effective feedback than natural stimuli to learn the correct stress position. Since resynthesis was obtained without previous segmentation of the learners' speech signals, this technology could be effectively included in Computer Assisted Language Learning software. (C) 2009 Elsevier B.V. All rights reserved. C1 [Bissiri, Maria Paola] Univ Munich, Inst Phonet & Speech Proc IPS, D-80799 Munich, Germany. [Bissiri, Maria Paola] Univ Sassari, Dipartimento Sci Linguaggi, I-07100 Sassari, Italy. [Pfitzinger, Hartmut R.] Univ Kiel, Inst Phonet & Digital Speech Proc IPDS, D-24118 Kiel, Germany. RP Bissiri, MP (reprint author), Univ Munich, Inst Phonet & Speech Proc IPS, Schellingstr 3, D-80799 Munich, Germany. EM mariapa@phonetik.uni-muenchen.de; hpt@ipds.uni-kiel.de CR Anderson-Hsieh J., 1994, CALICO Journal, V11 Anderson-Hseih J., 1992, SYSTEM, V20, P51, DOI 10.1016/0346-251X(92)90007-P Bertinetto Pier Marco, 1981, STRUTTURE PROSODICHE BERTINETTO PM, 1980, J PHONETICS, V8, P385 Bissiri M. P., 2006, P 11 AUSTR INT C SPE, P24 BISSIRI MP, 2008, THESIS LUDWIGMAXIMIL BISSIRI MP, 2008, P 4 C SPEECH PROS CO, P639 BISSIRI MP, 2007, PERSPEKTIVEN, V2, P353 Chapelle C., 1998, LANGUAGE LEARNING TE, V2, P22 Chun D., 1998, LANGUAGE LEARNING TE, V2, P61 Chun D. 
M., 1989, CALICO Journal, V7 CRANEN B, 1984, SYSTEM, V12, P25, DOI 10.1016/0346-251X(84)90044-7 DEBOT K, 1983, LANG SPEECH, V26, P331 Delmonte R, 2000, SPEECH COMMUN, V30, P145, DOI 10.1016/S0167-6393(99)00043-6 DELMONTE R, 2003, ATT 13 GIORN STUD GR, P169 DELMONTE R, 1981, STUDI GRAMMATICA ITA, P69 DELMONTE R, 1997, P ESCA EUR 97 RHOD, V2, P669 DIMPERIO M, 2000, OHIO STATE U WORKING, V54, P59 Dupoux E, 2001, J ACOUST SOC AM, V110, P1606, DOI 10.1121/1.1380437 Eskenazi M., 2000, P INSTIL 2000 INT SP, P73 Eskenazi M., 1999, CALICO Journal, V16 Eskenazi M, 1998, P SPEECH TECHN LANG, P77 Gass S., 1994, 2 LANGUAGE ACQUISITI Gass Susan M., 1997, INPUT INTERACTION 2 GERMAINRUTHERFO.A, 2000, COMM ALSIC, V3, P61 Hardison DM, 2004, LANG LEARN TECHNOL, V8, P34 HIROSE K, 2003, P EUROSPEECH GEN, V4, P3149 Hirose K, 2004, P INT S TON ASP LANG, P77 JAMES E, 1976, LANG TEACHING, V14, P227 Jessen M., 1995, P 13 INT C PHON SCI, V4, P428 Kohler Klaus, 1995, EINFUHRUNG PHONETIK, VSecond KOMMISSARCHIK J, 2000, P INSTILL WORKSH SPE, P86 Ladd D. R., 1996, INTONATIONAL PHONOLO LANE H, 1969, LANG TEACHING, P159 Lehiste I., 1970, SUPRASEGMENTALS Lyster R, 1998, LANG LEARN, V48, P183, DOI 10.1111/1467-9922.00039 MARTIN P, 2004, P INSTIL ICALL2004 S, P177 Mennen I., 2007, NONNATIVE PROSODY PH, P53 NAGANO K, 1990, P ICSLP KOB, V2, P1169 Neri A., 2002, Computer Assisted Language Learning, V15, DOI 10.1076/call.15.5.441.13473 Nicholas H, 2001, LANG LEARN, V51, P719, DOI 10.1111/0023-8333.00172 PFITZINGER HR, 2009, AIPUK, V38, P21 Probst K, 2002, SPEECH COMMUN, V37, P161, DOI 10.1016/S0167-6393(01)00009-7 Sluijter AMC, 1996, J ACOUST SOC AM, V100, P2471, DOI 10.1121/1.417955 TILLMANN HG, 2004, P INSTIL ICALL2004 S, P17 van der Hulst H., 1999, WORD PROSODIC SYSTEM, P273 VARDANIAN RM, 1964, LANG LEARN, P109 Wang Q., 2008, P 4 C SPEECH PROS CA, P635 NR 48 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2009 VL 51 IS 10 SI SI BP 933 EP 947 DI 10.1016/j.specom.2009.03.001 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 485AK UT WOS:000269092500011 ER PT J AU Saz, O Yin, SC Lleida, E Rose, R Vaquero, C Rodriguez, WR AF Saz, Oscar Yin, Shou-Chun Lleida, Eduardo Rose, Richard Vaquero, Carlos Rodriguez, William R. TI Tools and Technologies for Computer-Aided Speech and Language Therapy SO SPEECH COMMUNICATION LA English DT Article DE Spoken language learning; Speech disorders; Speech corpora; Automatic speech recognition; Pronunciation verification ID RECOGNITION AB This paper addresses the problem of Computer-Aided Speech and Language Therapy (CASLT). The goal of the work described in the paper is to develop and evaluate a semi-automated system for providing interactive speech therapy to the increasing population of impaired individuals and help professional speech therapists. A discussion on the development and evaluation of a set of interactive therapy tools, along with the underlying speech technologies that support these tools is provided. The interactive tools are designed to facilitate the acquisition of language skills in the areas of basic phonatory skills, phonetic articulation and language understanding primarily for children with neuromuscular disorders like dysarthria. 
Human-machine interaction for all of these areas requires the existence of speech analysis, speech recognition, and speech verification algorithms that are robust with respect to the sources of speech variability that are characteristic of this population of speakers. The paper will present an experimental study that demonstrates the effectiveness of an interactive system for eliciting speech from a population of impaired children and young speakers ranging in age from 11 to 21 years. The performance of automatic speech recognition (ASR) systems and subword-based pronunciation verification (PV) on this domain are also presented. The results indicate that ASR and PV systems configured from speech utterances taken from the impaired speech domain can provide adequate performance, similar to the experts' agreement rate, for supporting the presented CASLT applications. (C) 2009 Elsevier B.V. All rights reserved. C1 [Saz, Oscar; Lleida, Eduardo; Vaquero, Carlos; Rodriguez, William R.] Univ Zaragoza, GTC, Aragon Inst Engn Res 13A, Zaragoza, Spain. [Yin, Shou-Chun; Rose, Richard] McGill Univ, Dept Elect & Comp Engn, Montreal, PQ H3A 2A7, Canada. RP Saz, O (reprint author), Univ Zaragoza, GTC, Aragon Inst Engn Res 13A, Maria de Luna 1, Zaragoza, Spain. EM oskarsaz@unizar.es; shou-chun.yin@-mail.mcgill.ca; lleida@unizar.es; rose@ece.mcgill.ca; cvaquero@unizar.es; wricardo@unizar.es RI Lleida, Eduardo/K-8974-2014; Saz Torralba, Oscar/L-7329-2014 OI Lleida, Eduardo/0000-0001-9137-4013; CR ACEROVILLAN P, 2005, TRATAMIENTO VOZ MANU Aguinaga G., 2004, PRUEBA LENGUAJE ORAL Alarcos E., 1950, FONOLOGIA ESPANOLA ALBOR JC, 1991, ELA EXAMEN LOGOPEDIC Bengio S., 2004, P OD SPEAK LANG REC, P237 Coorman G, 2000, P INT C SPOK LANG BE, P395 Cucchiarini C., 2007, P INT 2007 ANTW BELG, P2181 DELLER JR, 1991, COMPUT METH PROG BIO, V35, P125, DOI 10.1016/0169-2607(91)90071-Z DELONG ER, 1988, J BIOMETR, V3, P837 DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1 DUCHATEAU J, 2007, P 10 EUR C SPEECH CO, P1210 ESCARTIN A, 2008, COMUNICA FRAMEWORK GARCIAGOMEZ R, 1999, P 6 EUR C SPEECH COM, P1067 Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278 GEROSA M, 2008, P 2008 INT C AC SPEE, P5057 Goronzy S., 1999, P SON RES FOR 99, V1, P9 GRANSTROM B, 2005, P INT LISB, P449 HATZIS A, 1999, THESIS U SHEFFIELD S HAWLEY M, 2003, P 7 C ASS ADV ASS TE ITO A, 2008, P 11 INT C SPEECH CO, P2819 JUSTO R, 2008, P 1 INT C AMB MED SY KIM H, 2008, P INT C SPOK LANG PR, P1741 Koehn P., 2005, P 10 MACH TRANSL SUM Kornilov A.-U., 2004, P 9 INT C SPEECH COM LEFEVRE JP, 1996, 1060 TIDE Legetter C., 1995, COMPUTER SPEECH LANG, V9, P171 Lleida E, 2000, IEEE T SPEECH AUDI P, V8, P126, DOI 10.1109/89.824697 Mangu L., 1999, P EUR C SPEECH COMM, P495 MARTINEZ B, 2007, P 3 C NAC U DISC ZAR Menendez-Pidal X., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. 
No.96TH8206), DOI 10.1109/ICSLP.1996.608020 Monfort M., 2001, MENTE SOPORTE GRAFIC Monfort M., 1989, REGISTRO FONOLOGICO Moreno A., 1993, P EUR SEPT, P653 MORENO A, 2000, P LREC ATH GREEC, P895 NAVARROMESA JL, 2005, ORAL CORPUS PROJECT OESTER AM, 2002, P 15 SWED PHON C FON, P45 Patel R, 2002, ALTERNATIVE AUGMENTA, V18, P2, DOI 10.1080/714043392 PRATT SR, 1993, J SPEECH HEAR RES, V36, P1063 Rabiner L., 1978, SIGNAL PROCESSING SE RODRIGUEZ V, 2008, THESIS U ANTONIO NEB RODRIGUEZ WR, 2008, P 4 KUAL LUMP INT C Sanders E., 2002, P 7 INT C SPOK LANG, P661 SAZ O, 2008, P 1 WORKSH CHILD COM VAQUERO C, 2008, P IEEE INT C AC SPEE, P4509 VICSI K, 1999, P 6 EUR C SPEECH COM, P859 YIN SC, 2009, P 2009 INT C AC SPEE ZHANG F, 2008, P ICASSP, P5077 NR 47 TC 5 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2009 VL 51 IS 10 SI SI BP 948 EP 967 DI 10.1016/j.specom.2009.04.006 PG 20 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 485AK UT WOS:000269092500012 ER PT J AU Price, P Tepperman, J Iseli, M Duong, T Black, M Wang, S Boscardin, CK Heritage, M Pearson, PD Narayanan, S Alwan, A AF Price, Patti Tepperman, Joseph Iseli, Markus Duong, Thao Black, Matthew Wang, Shizhen Boscardin, Christy Kim Heritage, Margaret Pearson, P. David Narayanan, Shrikanth Alwan, Abeer TI Assessment of emerging reading skills in young native speakers and language learners SO SPEECH COMMUNICATION LA English DT Article DE Children's speech recognition; Reading assessment; Language learning; Accented English; Speaker adaptation ID CHILDRENS SPEECH; RECOGNITION; VARIABILITY; INFORMATION; INTERFACE; LITERACY; READERS AB To automate assessments of beginning readers, especially those still learning English, we have investigated the types of knowledge sources that teachers use and have tried to incorporate them into an automated system. We describe a set of speech recognition and verification experiments and compare teacher scores with automatic scores in order to decide when a novel pronunciation is best viewed as a reading error or as dialect variation. Since no one classroom teacher is expected to be familiar with as many dialect systems as might occur in an urban classroom, making progress in automated assessments in this area can improve the consistency and fairness of reading assessment. We found that automatic methods performed best when the acoustic models were trained on both native and non-native speech, and argue that this training condition is necessary for automatic reading assessment since a child's reading ability is not directly observable in one utterance. We also found assessment of emerging reading skills in young children to be an area ripe for more research! (C) 2009 Elsevier B.V. All rights reserved. C1 [Price, Patti] PPrice Speech & Language Technol, Menlo Pk, CA 94025 USA. [Tepperman, Joseph; Black, Matthew; Narayanan, Shrikanth] Univ So Calif, Dept Elect Engn, Los Angeles, CA 90089 USA. [Iseli, Markus] Univ Calif Los Angeles, Dept Elect Engn, Henry Samueli Sch Engn & Appl Sci Engr 63 134 4, Los Angeles, CA 90095 USA. [Duong, Thao; Pearson, P. David] Univ Calif Berkeley, Grad Sch Educ, Berkeley, CA 94720 USA. [Boscardin, Christy Kim] Univ Calif San Francisco, Sch Med, Off Med Educ, San Francisco, CA 94143 USA. [Heritage, Margaret] Univ Calif Los Angeles, CRESST, Los Angeles, CA 90095 USA. 
[Alwan, Abeer] Univ Calif Los Angeles, Dept Elect Engn, Henry Samueli Sch Engn & Appl Sci Engr 66 147G 4, Los Angeles, CA 90095 USA. RP Price, P (reprint author), PPrice Speech & Language Technol, 420 Shirley Way, Menlo Pk, CA 94025 USA. EM pjp@pprice.com; tepperma@usc.edu; iseli@ee.ucla.edu; thaod@berkeley.edu; mattthepb@usc.edu; szwang@ee.ucla.edu; BoscardinCK@medsch.ucsf.edu; mheritag@ucla.edu; ppearson@berkeley.edu; shri@sipi.us-c.edu; alwan@ee.ucla.edu RI Narayanan, Shrikanth/D-5676-2012 CR Alwan A., 2007, P IEEE INT WORKSH MU, P26 August D., 2006, DEV LITERACY 2 LANGU BARKER TA, 1995, J EDUC COMPUT RES, V13, P89 BARRON RW, 1986, COGNITION, V24, P93, DOI 10.1016/0010-0277(86)90006-5 BLACK M, 2008, P INTERSPEECH ICSLP, P2783 BUNTSCHUH B, 1998, P ICSLP, P2863 Cassell J., 2001, PERSONAL TECHNOLOGIE, V5, P203 COHEN PR, 1998, P INT C SPOK LANG PR, V2, P249 Coulston R., 2002, P 7 INT C SPOK LANG, V4, P2689 Dalby J., 1999, CALICO Journal, V16 DARVES C, 2002, P 7 INT C SPOK LANG, P16 DIFABRIZZIO G, 1999, P INT DIAL MULT SYST, P9 Eguchi S, 1969, Acta Otolaryngol Suppl, V257, P1 Eskenazi M., 1999, CALICO Journal, V16 FARMER ME, 1992, REM SPEC EDUC, V13, P50 Gerosa M, 2007, SPEECH COMMUN, V49, P847, DOI 10.1016/j.specom.2007.01.002 Goldstein UG., 1980, THESIS MIT CAMBRIDGE Gorin AL, 1997, SPEECH COMMUN, V23, P113, DOI 10.1016/S0167-6393(97)00040-X Hagen A, 2007, SPEECH COMMUN, V49, P861, DOI 10.1016/j.specom.2007.05.004 Harris A., 1982, BASIC READING VOCABU JONES B, 2007, AM ED RES ASS AERA C Kazemzadeh A., 2005, P EUR, P1581 KENT RD, 1976, J SPEECH HEAR RES, V19, P421 Lamel LF, 1997, SPEECH COMMUN, V23, P67, DOI 10.1016/S0167-6393(97)00037-X Lee S, 1999, J ACOUST SOC AM, V105, P1455, DOI 10.1121/1.426686 LOVETT MW, 1994, BRAIN LANG, V47, P117, DOI 10.1006/brln.1994.1045 MA J, 2002, INT C SPOK LANG PROC, V1, P197 MCCULLOUGH CS, 1995, SCHOOL PSYCHOL REV, V24, P426 Mostow J., 1995, P ACM S US INT SOFTW, P77, DOI 10.1145/215585.215665 Narayanan S, 2002, IEEE T SPEECH AUDI P, V10, P65, DOI 10.1109/89.985544 National Reading Panel, 2000, NIH PUBL, V00-4769 Potamianos A, 2003, IEEE T SPEECH AUDI P, V11, P603, DOI 10.1109/TSA.2003.818026 PRICE P, 2007, WORKSH MULT SIGN PRO Russell M., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607069 Sharma R, 1998, P IEEE, V86, P853, DOI 10.1109/5.664275 Khalili A., 1994, Journal of Research on Computing in Education, V27 Shefelbine J., 1996, BPST BEGINNING PHONI SMITH BL, 1992, J ACOUST SOC AM, V91, P2165, DOI 10.1121/1.403675 Oviatt S, 2000, EMBODIED CONVERSATIONAL AGENTS, P319 TAKEZAWA T, 1998, P ICSLP SYDN AUSTR TEPPERMAN J, 2007, P INTERSPEECH ANTW B, P2185 VANDUSEN L, 1993, COMPUTER BASED INTEG, P35 WANG S, 2007, P SLATE FARM PENNS, P120 Whitehurst GJ, 1998, CHILD DEV, V69, P848, DOI 10.1111/j.1467-8624.1998.00848.x WIBURG K, 1995, COMPUTING TEACHER, V22, P7 Williams SM, 2000, PROCEEDINGS OF ICLS 2000 INTERNATIONAL CONFERENCE OF THE LEARNING SCIENCES, P115 XIAO B, 2002, P 7 INT C SPOK LANG, V1, P629 You H., 2005, P EUR LISB PORT, P749 Zue V, 2000, IEEE T SPEECH AUDI P, V8, P85, DOI 10.1109/89.817460 WATCH ME READ PROJECT LISTEN NR 51 TC 6 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
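The Price et al. record above rests on comparing teacher scores with automatic scores for beginning readers. As a minimal illustration of that comparison step, and only as a sketch, the snippet below computes Cohen's kappa between two sets of per-word accept/reject labels; the label arrays are hypothetical placeholders, not data from the study.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in set(labels_a) | set(labels_b)) / (n * n)
    return (observed - expected) / (1.0 - expected)

# Hypothetical per-word judgements: 1 = acceptable reading, 0 = reading error.
teacher   = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
automatic = [1, 1, 0, 1, 1, 1, 0, 0, 1, 1]
print(f"kappa = {cohen_kappa(teacher, automatic):.2f}")
```

A chance-corrected measure such as kappa is one common way to report teacher/machine agreement of the kind the record discusses; the record itself does not prescribe this particular statistic.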
PD OCT PY 2009 VL 51 IS 10 SI SI BP 968 EP 984 DI 10.1016/j.specom.2009.05.001 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 485AK UT WOS:000269092500013 ER PT J AU Duchateau, J Kong, YO Cleuren, L Latacz, L Roelens, J Samir, A Demuynck, K Ghesquiere, P Verhelst, W Van Hamme, H AF Duchateau, Jacques Kong, Yuk On Cleuren, Leen Latacz, Lukas Roelens, Jan Samir, Abdurrahman Demuynck, Kris Ghesquiere, Pol Verhelst, Werner Van Hamme, Hugo TI Developing a reading tutor: Design and evaluation of dedicated speech recognition and synthesis modules SO SPEECH COMMUNICATION LA English DT Article DE Reading tutor; Computer-assisted language learning (CALL); Speech technology for education ID LEARNING-DISABILITIES; CORRECTIVE FEEDBACK; CHILDREN; SYSTEM AB When a child learns to read, the learning process can be enhanced by significant reading practice with individual support from a tutor. But in reality, the availability of teachers or clinicians is limited, so the additional use of a fully automated reading tutor would be beneficial for the child. This paper discusses our efforts to develop an automated reading tutor for Dutch. First, the dedicated speech recognition and synthesis modules in the reading tutor are described. Then, three diagnostic and remedial reading tutor tools are evaluated in practice and improved based on these evaluations: (1) automatic assessment of a child's reading level, (2) oral feedback to a child at the phoneme, syllable or word level, and (3) tracking where a child is reading, for automated screen advancement or for direct feedback to the child. In general, the presented tools work in a satisfactory way, including for children with known reading disabilities. (C) 2009 Elsevier B.V. All rights reserved. C1 [Duchateau, Jacques; Roelens, Jan; Samir, Abdurrahman; Demuynck, Kris; Van Hamme, Hugo] Katholieke Univ Leuven, ESAT Dept, B-3001 Louvain, Belgium. [Kong, Yuk On; Latacz, Lukas; Verhelst, Werner] Vrije Univ Brussel, ETRO Dept, B-1050 Brussels, Belgium. [Cleuren, Leen; Ghesquiere, Pol] Katholieke Univ Leuven, Ctr Parenting Child Welfare & Disabil, B-3000 Louvain, Belgium. RP Duchateau, J (reprint author), Katholieke Univ Leuven, ESAT Dept, Kasteelpk Arenberg 10,POB 2441, B-3001 Louvain, Belgium. EM Jacques.Duchateau@esat.kuleuven.be RI Ghesquiere, Pol/B-9226-2009; Van hamme, Hugo/D-6581-2012 OI Ghesquiere, Pol/0000-0001-9056-7550; CR ABDOU SM, 2006, P INT 2006 ICSLP 9 I, P849 Adams M. J., 2006, INT HDB LITERACY TEC, V2, P109 Banerjee S., 2003, P 8 EUR C SPEECH COM, P3165 BLACK AW, 2001, P 4 ISCA SPEECH S WO, P63 BLACK M, 2007, P INTERSPEECH ICSLP, P206 CHU M, 2003, P INT C AC SPEECH SI, V1, P264 Clark RAJ, 2007, SPEECH COMMUN, V49, P317, DOI 10.1016/j.specom.2007.01.014 Cleuren L., 2008, P 6 INT C LANG RES E Coltheart M., 1978, STRATEGIES INFORMATI, P151 Daelemans Walter, 2005, MEMORY BASED LANGUAG DEMUYNCK K, 2006, P INT 2006 ICSLP 9 I, P1622 DEMUYNCK K, 2004, P 4 INT C LANG RES E, V1, P61 DMELLO SK, 2007, P SLATE WORKSH SPEEC, P49 DUCHATEA J, 2007, EUR 10 EUR C SPEECH, P1210 DUCHATEAU J, 2002, P IEEE INT C AC SPEE, V1, P221 Duchateau J., 2006, P ITRW SPEECH REC IN, P59 ESKENAZI M, 2002, PMLA, P48 Hagen A, 2007, SPEECH COMMUN, V49, P861, DOI 10.1016/j.specom.2007.05.004 Heiner C., 2004, P INSTIL ICALL S NLP, P195 Hunt A. 
J., 1996, P ICASSP 96, P373 KERKHOFF J, 2002, P 13 M COMP LING NET LATACZ L, 2007, P 6 ISCA WORKSH SPEE, P270 LATACZ L, 2006, P PRORISC IEEE BEN W MacArthur CA, 2001, ELEM SCHOOL J, V101, P273, DOI 10.1086/499669 MCCOY KM, 1986, READ TEACH, V39, P548 NERI A, 2006, P ISCA INT PITTSB PA, P1982 PANY D, 1988, J LEARN DISABIL, V21, P546 PERKINS VL, 1988, J LEARN DISABIL, V21, P244 Probst K, 2002, SPEECH COMMUN, V37, P161, DOI 10.1016/S0167-6393(01)00009-7 Russell M, 2000, COMPUT SPEECH LANG, V14, P161, DOI 10.1006/csla.2000.0139 SPAAI GWG, 1991, J EDUC RES, V84, P204 Wise BW, 1998, READING AND SPELLING, P473 NR 32 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2009 VL 51 IS 10 SI SI BP 985 EP 994 DI 10.1016/j.specom.2009.04.010 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 485AK UT WOS:000269092500014 ER PT J AU Wang, HC Waple, CJ Kawahara, T AF Wang, Hongcui Waple, Christopher J. Kawahara, Tatsuya TI Computer Assisted Language Learning system based on dynamic question generation and error prediction for automatic speech recognition SO SPEECH COMMUNICATION LA English DT Article DE Computer Assisted Language Learning (CALL); Second language learning; Automatic speech recognition; Error prediction AB We have developed a new Computer Assisted Language Learning (CALL) system to aid students learning Japanese as a second language. The system offers students the chance to practice elementary Japanese by creating their own sentences based on visual prompts, before receiving feedback on their mistakes. It is designed to detect lexical and grammatical errors in the input sentence as well as pronunciation errors in the speech input. Questions are dynamically generated along with sentence patterns of the lesson point, to realize variety and flexibility of the lesson. Students can give their answers with either text input or speech input. To enhance speech recognition performance, a decision tree-based method is incorporated to predict possible errors made by non-native speakers for each generated sentence on the fly. Trials of the system were conducted by foreign university students, and positive feedback was reported. (C) 2009 Elsevier B.V. All rights reserved. C1 [Wang, Hongcui; Waple, Christopher J.; Kawahara, Tatsuya] Kyoto Univ, Sch Informat, Sakyo Ku, Kyoto 6068501, Japan. RP Wang, HC (reprint author), Kyoto Univ, Sch Informat, Sakyo Ku, Kyoto 6068501, Japan. EM wang@ar.media.kyoto-u.ac.jp CR ABDOU SM, 2006, P ICSLP Bernstein J., 1999, CALICO Journal, V16 ESKENAZI M, 1998, STILL FRANCO H, 2000, P INSTIL INT SPEECH Kawai G, 2000, SPEECH COMMUN, V30, P131, DOI 10.1016/S0167-6393(99)00041-2 LEVIE WH, 1982, ECTJ-EDUC COMMUN TEC, V30, P195 NAGATA N, 2002, COMPUTER ASSISTED SY NELSON DL, 1976, HUMAN LEARNING MEMOR, V2, P523 NEUMEYER L, 1998, STILL SMITH MC, 1980, J EXP PSYCHOL GEN, V109, P373, DOI 10.1037/0096-3445.109.4.373 Tsubota Y., 2002, P ICSLP, P1205 Tsubota Y., 2004, P ICSLP, P1689 WANG H, 2008, P ICASSP Witt S.M., 1999, THESIS ZINOVJEVA N, 2005, SPEECH TECHNOLOGY NR 15 TC 4 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD OCT PY 2009 VL 51 IS 10 SI SI BP 995 EP 1005 DI 10.1016/j.specom.2009.03.006 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 485AK UT WOS:000269092500015 ER PT J AU McGraw, I Yoshimoto, B Seneff, S AF McGraw, Ian Yoshimoto, Brandon Seneff, Stephanie TI Speech-enabled card games for incidental vocabulary acquisition in a foreign language SO SPEECH COMMUNICATION LA English DT Article DE Speech recognition; Intelligent computer assisted language learning; Computer aided vocabulary acquisition ID WORD MEANINGS AB In this paper, we present a novel application for speech technology to aid students with vocabulary acquisition in a foreign language through interactive card games. We describe a generic platform for card game development and then introduce a particular prototype card game called Word War, designed for learning Mandarin Chinese. We assess the feasibility of deploying Word War via the Internet by conducting our first user study remotely and evaluating the performance of the speech recognition component. It was found that the three central concepts in our system were recognized with an error rate of 16.02%. We then turn to assessing the effects of the Word War game on vocabulary retention in a controlled environment. To this end, we performed a user study using two variants of the Word War game: a speaking mode, in which the user issues spoken commands to manipulate the game cards, and a listening mode, in which the computer gives spoken directions that the students must follow by manipulating the cards manually with the mouse. These two modes of learning were compared against a more traditional computer assisted vocabulary learning system: an on-line flash cards program. To assess long-term learning gains as a function of time-on-task, we had the students interact with each system twice over a period of three weeks. We found that all three systems were competitive in terms of the vocabulary words learned as measured by pre-tests and post-tests, with less than a 5% difference among the systems' average overall learning gains. We also conducted surveys, which indicated that the students enjoyed the speaking mode of Word War more than the other two systems. (C) 2009 Elsevier B.V. All rights reserved. C1 [McGraw, Ian; Yoshimoto, Brandon; Seneff, Stephanie] MIT Comp Sci & Artificial Intelligence Lab, Cambridge, MA 02139 USA. RP McGraw, I (reprint author), MIT Comp Sci & Artificial Intelligence Lab, 32 Vassar St, Cambridge, MA 02139 USA. EM imcgraw@csail.mit.edu; yoshimoto@alum.mit.edu; seneff@csail.mit.edu CR Anderson R. C., 1992, AM EDUC, V16, P44 ATWELL E, 1999, RECOGNITION LEARNER BERNSTEIN J, 1999, COMPUTER ASSISTED LA, V16 BRINDLY G, 1988, STUDIES 2 LANGUAGE A, V10, P217 BROWN TS, 1991, TESOL QUART, V25, P655, DOI 10.2307/3587081 Carey Susan, 1978, LINGUISTIC THEORY PS, P264 Cooley R.
E, 2001, AIED 2001 WORKSH PAP, P17 Dalby J., 1999, COMPUTER ASSISTED LA, V16 EHSANI F, 1998, LANGUAGE LEARNING TE ELLIS R, 1994, LANG LEARN, V44, P449, DOI 10.1111/j.1467-1770.1994.tb01114.x Ellis R, 1995, APPL LINGUIST, V16, P409, DOI 10.1093/applin/16.4.409 Ellis R., 1999, STUDIES 2 LANGUAGE A, V21, P285, DOI DOI 10.1017/S0272263199002077 Eskenazi M., 2007, SLATE WORKSH SPEECH, P124 Eskenazi M., 1999, LANGUAGE LEARNING TE, V2, P62 GAMPER J, 2002, P 6 WORLD MULT SYST GAMPER J, 2002, COMPUTER ASSISTED LA GLASS J, 2003, COMPUTER SPEECH LANG GRUENSTEIN A, 2008, IMCI 08 P 10 INT C M, P141 GRUNEBERG MM, 1991, LANGUAGE LEARNING J, V4, P60, DOI 10.1080/09571739185200511 Harless WG, 2003, IEEE COMPUT GRAPH, V23, P46, DOI 10.1109/MCG.2003.1231177 HERMAN PA, 1987, NATURE VOCABULARY AC, P19 Hincks R., 2003, ReCALL, V15, DOI 10.1017/S0958344003000211 HOLLAND MM, 1999, COMPUTER ASSISTED LA, V16 Johnson WL, 2004, LECT NOTES COMPUT SC, V3220, P336 KRASHEN S, 1982, INPUT HYPOTHESIS ISS Krashen S., 1994, IMPLICIT EXPLICIT LE, P45 Krashen S., 2004, LANGUAGE TEACHER, V28/7, P3 Krashen S., 1982, PRINCIPLES PRACTICE LONG MH, 1981, INPUT INTERACTION 2, P259 MCGRAW I, P AAAI MCGRAW I, SLATE WORKSH SPEECH Menzel W., 2001, ReCALL, V13 NATION ISP, 2001, LEARN VOC AN LANG NERBONNE J, 1998, COMPUTER ASSISTED LA, P543 NERI A, 2006, INTERSPEECH Oxford R., 1990, TESL CANADA J, V7, P9 Pavlik PI, 2005, COGNITIVE SCI, V29, P559, DOI 10.1207/s15516709cog0000_14 Peabody M, 2006, LECT NOTES COMPUT SC, V4274, P602 PIMSLEUR P, 1980, LEARN FOREIGN LANGUA Swain M., 1985, INPUT 2 LANGUAGE ACQ, P235 TAO JH, 2008, P BLIZZ CHALL *TAYL FRANC, CHIN ESS GRAMM ESS G von Ahn L, 2006, COMPUTER, V39, P92, DOI 10.1109/MC.2006.196 YI J, 2000, ICSLP TALK ME NR 45 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2009 VL 51 IS 10 SI SI BP 1006 EP 1023 DI 10.1016/j.specom.2009.04.011 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 485AK UT WOS:000269092500016 ER PT J AU Wik, P Hjalmarsson, A AF Wik, Preben Hjalmarsson, Anna TI Embodied conversational agents in computer assisted language learning SO SPEECH COMMUNICATION LA English DT Article DE Second langue learning; Dialogue systems; Embodied conversational agents; Pronunciation training; CALL; CAPT AB This paper describes two systems using embodied conversational agents (ECAs) for language learning. The first system, called Ville, is a virtual language teacher for vocabulary and pronunciation training. The second system, a dialogue system called DEAL, is a role-playing game for practicing conversational skills. Whereas DEAL acts as a conversational partner with the objective of creating and keeping an interesting dialogue, Ville takes the role of a teacher who guides, encourages and gives feedback to the students. (C) 2009 Elsevier B.V. All rights reserved. C1 [Wik, Preben; Hjalmarsson, Anna] KTH, Ctr Speech Technol, SE-10044 Stockholm, Sweden. RP Wik, P (reprint author), KTH, Ctr Speech Technol, Lindstedtsvagen 24, SE-10044 Stockholm, Sweden. 
EM preben@speech.kth.se; annah@speech.kth.se CR AIST G, 2006, P INT PITTSB PA US, P1922 Engwall Olov, 2007, Computer Assisted Language Learning, V20, DOI 10.1080/09588220701489507 BANNERT R, 2004, VAG MOT SVENSKT UTTA Beskow J., 2003, THESIS KTH STOCKHOLM Bosseler A, 2003, J AUTISM DEV DISORD, V33, P653, DOI 10.1023/B:JADD.0000006002.82367.4f BRENNAN SE, 2000, P 38 ANN M ASS COMP Brusk J., 2007, P ACM FUT PLAY TOR C, P137, DOI 10.1145/1328202.1328227 BURNHAM D, 1999, AVSP, P80 CARLSON R, 2002, P FON STOCKH SWED MA, P65 Ellis Rod, 1994, STUDY 2 LANGUAGE ACQ Engwall O., 2004, P ICSLP 2004 JEJ ISL, P1693 ENGWALL O, 2008, P INT 2008 BRISB AUS, P2631 Eskenazi M., 1999, LANGUAGE LEARNING TE, V2, P62 FLEGE JE, 1998, STILL SPEECH TECHNOL, P1 Gee JP, 2003, WHAT VIDEO GAMES HAVE TO TEACH US ABOUT LEARNING AND LITERACY, P1 Granstrom B., 1999, P INT C PHON SCI ICP, P655 GUSTAFSON J, 2004, P SIGDIAL Hjalmarsson A., 2007, P SIGDIAL ANTW BELG, P132 Hjalmarsson A., 2008, P SIGDIAL 2008 COL O Iuppa Nicholas, 2007, STORY SIMULATIONS SE JOHNSON WL, 2004, P I ITSEC KODA T, 1996, P HCI 96 LOND UK, P98 Lester J. C., 1997, Proceedings of the First International Conference on Autonomous Agents, DOI 10.1145/267658.269943 Massaro DW, 2004, J SPEECH LANG HEAR R, V47, P304, DOI 10.1044/1092-4388(2004/025) Mateas M., 2003, GAM DEV C GAM DES TR MCALLISTER R, 1997, 2 LANGUAGE SPEECH Meng H., 2007, AUTOMATIC SPEECH REC, P437 NERI A, 2002, ICSLP, P1209 Neri A., 2002, Computer Assisted Language Learning, V15, DOI 10.1076/call.15.5.441.13473 Prensky M., 2002, HORIZON, V10, P5, DOI DOI 10.1108/10748120210431349 Prensky M., 2001, DIGITAL GAME BASED L Sjolander K., 2003, P FON 2003 UM U DEP, V9, P93 SKANTZE G, 2005, P SIGDIAL LISB PORT, P178 SLUIJTER AMC, 1995, THESIS SUMBY WH, 1954, J ACOUST SOC AM, V26, P212, DOI 10.1121/1.1907309 van Mulken S, 1998, PEOPLE AND COMPUTER XIII, PROCEEDINGS, P53 von Ahn L, 2006, COMPUTER, V39, P92, DOI 10.1109/MC.2006.196 Walker J. H., 1994, P SIGCHI C HUM FACT, P85, DOI DOI 10.1145/191666.191708 WIK P, 2007, P SLATE 2007 WIK P, 2004, P 17 SWED PHON C FON, P136 NR 40 TC 18 Z9 18 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2009 VL 51 IS 10 SI SI BP 1024 EP 1037 DI 10.1016/j.specom.2009.05.006 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 485AK UT WOS:000269092500017 ER PT J AU Chetouani, M Faundez-Zanuy, M Hussain, A Gas, B Zarader, JL Paliwal, K AF Chetouani, Mohamed Faundez-Zanuy, Marcos Hussain, Amir Gas, Bruno Zarader, Jean-Luc Paliwal, Kuldip TI Special issue on non-linear and non-conventional speech processing SO SPEECH COMMUNICATION LA English DT Editorial Material C1 [Chetouani, Mohamed; Gas, Bruno; Zarader, Jean-Luc] Univ Paris 06, CNRS, UMR 7222, ISIR, F-75252 Paris, France. [Faundez-Zanuy, Marcos] Escola Univ Politecn Mataro, Dept Telecommun, Barcelona 08303, Spain. [Hussain, Amir] Univ Stirling, Dept Math & Comp Sci, Stirling FK9 4LA, Scotland. [Paliwal, Kuldip] Griffith Univ, Sch Microelect Engn, Brisbane, Qld 4111, Australia. RP Chetouani, M (reprint author), Univ Paris 06, CNRS, UMR 7222, ISIR, 4 Pl Jussieu, F-75252 Paris, France. 
EM mohamed.chetouani@upmc.fr; faundez@eupmt.es; ahu@cs.stir.ac.uk; bruno.gas@upmc.fr; jean-luc.zarader@upmc.fr; k.paliwal@griffith.edu.au RI CHETOUANI, Mohamed/F-5854-2010; Faundez-Zanuy, Marcos/F-6503-2012 OI Faundez-Zanuy, Marcos/0000-0003-0605-1282 NR 0 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2009 VL 51 IS 9 SI SI BP 713 EP 713 DI 10.1016/j.specom.2009.06.001 PG 1 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 475OZ UT WOS:000268370700001 ER PT J AU Huang, HY Lin, FH AF Huang, Heyun Lin, Fuhuei TI A speech feature extraction method using complexity measure for voice activity detection in WGN SO SPEECH COMMUNICATION LA English DT Article DE Kolmogorov complexity; Feature extraction; Speech model; Voice activity detection; Complexity analysis ID HIGHER-ORDER STATISTICS; MODEL AB A novel speech extraction algorithm is proposed in this paper for Voice Activity Detection (VAD). Signal complexity analysis with definition of Kolmogorov complexity is adopted, which explores model characteristics of speech production to differentiate speech and white Gaussian noise (WGN). In the view of speech signal processing, properties of speech's source and vocal tract are explored by complexity analysis. Also, some interesting properties of signal complexity are presented with experimental study, including complexity analysis of general noise-corrupted signal. Moreover, some enhanced features with complexity and a feature incorporation method are presented. These features incorporate some unique characteristics of speech, like pitch information, vocal organ information, and so on. With a large database of speech signals and synthetic/real Gaussian noise, distributions of novel features and receiver operating characteristics (ROC) curves are shown, which are proved as potential features for voice activity detection. (C) 2009 Elsevier B.V. All rights reserved. C1 [Huang, Heyun; Lin, Fuhuei] Spreadtrum Commun Inc, Shanghai, Peoples R China. RP Huang, HY (reprint author), Spreadtrum Commun Inc, Zuchongzhi Rd 2288, Shanghai, Peoples R China. EM heyun.huang@spreadtrum.com; fuhuei.lin@spreadtrum.com CR CHI Z, 1998, P INT C SIGN PROC, P1185 Gazor S, 2003, IEEE T SPEECH AUDI P, V11, P498, DOI 10.1109/TSA.2003.815518 KASPAR F, 1987, PHYS REV A, V36, P842, DOI 10.1103/PhysRevA.36.842 LEMPEL A, 1976, IEEE T INFORM THEORY, V22, P75, DOI 10.1109/TIT.1976.1055501 Li K, 2005, IEEE T SPEECH AUDI P, V13, P965, DOI 10.1109/TSA.2005.851955 Nemer E, 2001, IEEE T SPEECH AUDI P, V9, P217, DOI 10.1109/89.905996 Ramirez J, 2005, IEEE SIGNAL PROC LET, V12, P689, DOI 10.1109/LSP.2005.855551 Ramirez J, 2007, IEEE T AUDIO SPEECH, V15, P2177, DOI 10.1109/TASL.2007.903937 Shin JW, 2008, IEEE SIGNAL PROC LET, V15, P257, DOI 10.1109/LSP.2008.917027 Sohn J, 1999, IEEE SIGNAL PROC LET, V6, P1 Tahmasbi R, 2007, IEEE T AUDIO SPEECH, V15, P1129, DOI 10.1109/TASL.2007.894521 Tong QY, 1996, CHAOS SOLITON FRACT, V7, P371, DOI 10.1016/0960-0779(95)00070-4 TUCKER R, 1992, IEE PROC-I, V139, P377 NR 13 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
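The Huang and Lin record above derives VAD features from signal complexity in the Kolmogorov sense, which in practice is usually approximated by a Lempel-Ziv phrase count. A minimal sketch of one such approximation, assuming a simple median-threshold binarization per frame, is given below; the random noise and tone frames are placeholders, not material from the paper.

```python
import numpy as np

def lempel_ziv_complexity(bits):
    """Count the phrases in a Lempel-Ziv (1976) style parsing of a 0/1 sequence:
    each phrase is grown until it is no longer reproducible from the history."""
    s = "".join("1" if b else "0" for b in bits)
    phrases, i, n = 0, 0, len(s)
    while i < n:
        length = 1
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        phrases += 1
        i += length
    return phrases

def frame_complexity(frame):
    # Coarse binarization: sample above / below the frame median.
    return lempel_ziv_complexity(frame > np.median(frame))

rng = np.random.default_rng(0)
noise = rng.standard_normal(256)                        # WGN-like frame
tone = np.sin(2 * np.pi * 5 * np.arange(256) / 256)     # structured, speech-like frame
print(frame_complexity(noise), frame_complexity(tone))
```

The white-noise frame should yield a higher phrase count than the periodic frame, which is the contrast such complexity features exploit when separating speech from WGN.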
PD SEP PY 2009 VL 51 IS 9 SI SI BP 714 EP 723 DI 10.1016/j.specom.2009.02.004 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 475OZ UT WOS:000268370700002 ER PT J AU Charbuillet, C Gas, B Chetouani, M Zarader, JL AF Charbuillet, C. Gas, B. Chetouani, M. Zarader, J. L. TI Optimizing feature complementarity by evolution strategy: Application to automatic speaker verification SO SPEECH COMMUNICATION LA English DT Article DE Feature extraction; Evolution strategy; Speaker verification ID FEATURE-EXTRACTION; RECOGNITION AB Conventional automatic speaker verification systems are based on cepstral features like Mel-scale frequency cepstrum coefficient (MFCC), or linear predictive cepstrum coefficient (LPCC). Recently published works showed that the use of complementary features can significantly improve the system performances. In this paper, we propose to use an evolution strategy to optimize the complementarity of two filter bank based feature extractors. Experiments we made with a state of the art speaker verification system show that significant improvement can be obtained. Compared to the standard MFCC, an equal error rate (EER) improvement of 11.48% and 21.56% was obtained on the 2005 Nist SRE and Ntimit databases, respectively. Furthermore, the obtained filter banks picture out the importance of some specific spectral information for automatic speaker verification. (C) 2009 Elsevier B.V. All rights reserved. C1 [Charbuillet, C.; Gas, B.; Chetouani, M.; Zarader, J. L.] Univ Paris 06, CNRS, UMR 7222, ISIR, F-94200 Ivry, France. RP Charbuillet, C (reprint author), Univ Paris 06, CNRS, UMR 7222, ISIR, F-94200 Ivry, France. EM christophe.charbuillet@lis.jussieu.fr; bruno.gas@upmc.fr; mohamed.chetouani@upmc.fr; jean-luc.zarader@upmc.fr RI CHETOUANI, Mohamed/F-5854-2010 CR BEYER HG, 2002, NAT COMPUT, V1, P2 CAMPBELL WM, 2004, SPEAK LANG REC WORKS, P41 CAMPBELL WM, 2007, IEEE INT C AC SPEECH, V4, P217 Chetouani M, 2005, LECT NOTES ARTIF INT, V3445, P344 Chin-Teng Lin, 2000, IEEE Transactions on Speech and Audio Processing, V8, DOI 10.1109/89.876300 CIERI C, 2006, LREC 2006 FARRELL K, 1998, P 1998 IEEE INT C AC, V2, P1129, DOI 10.1109/ICASSP.1998.675468 Katagiri S, 1998, P IEEE, V86, P2345, DOI 10.1109/5.726793 Mitchell T.M., 1997, MACHINE LEARNING Miyajima C, 2001, SPEECH COMMUN, V35, P203, DOI 10.1016/S0167-6393(00)00079-0 PARIS G, 2004, LECT NOTES COMPUTER, P267 PELECANOS J, 2001, SPEAK LANG REC WORKH PRZYBOCKI M, 2006, SPEAK LANG REC WORKS, P1 Reynolds D., 2003, P ICASSP 03, VIV, P784 Reynolds D.A., 2002, P IEEE INT C AC SPEE, V4, P4072 ROSS B, 2000, GECCO, P443 Thian NPH, 2004, LECT NOTES COMPUT SC, V3072, P631 Torkkola K., 2003, Journal of Machine Learning Research, V3, DOI 10.1162/153244303322753742 Vair C., 2006, SPEAK LANG REC WORKS YI L, 2004, P 8 IEEE INT S HIGH YIN SC, 2006, SPEAK LANG REC WORKS ZAMALLOA M, 2006, SPEAK LANG REC WORKS, V1, P1 ZHIYOU M, 2003, IEEE INT C SYST MAN, V5, P4153 NR 23 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
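The Charbuillet et al. record above optimizes filter-bank feature extractors with an evolution strategy. The sketch below runs a generic (1+lambda) evolution strategy over filter centre frequencies; the fitness function is only a stand-in (in the paper the objective would come from the EER or complementarity of a full verification system), and the mutation scheme is a textbook variant rather than the authors' exact operator set.

```python
import numpy as np

rng = np.random.default_rng(1)

def fitness(centres):
    """Placeholder objective: reward an ordered, evenly spread bank on 0-4 kHz
    so the loop has something to optimize. A real run would plug in -EER."""
    c = np.sort(centres)
    spacing = np.diff(np.concatenate(([0.0], c, [4000.0])))
    return -np.var(spacing)                              # higher is better

def evolve(n_filters=20, generations=50, offspring=10, sigma=100.0):
    parent = rng.uniform(0.0, 4000.0, n_filters)         # random initial centres (Hz)
    best = fitness(parent)
    for _ in range(generations):
        for _ in range(offspring):                       # (1+lambda) selection
            child = np.clip(parent + rng.normal(0.0, sigma, n_filters), 0.0, 4000.0)
            f = fitness(child)
            if f > best:
                parent, best = child, f
    return np.sort(parent), best

print(evolve()[0][:5])
```

Swapping the placeholder objective for a verification-system EER is what makes this kind of loop expensive in practice, which is why the paper's experimental setting matters.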
PD SEP PY 2009 VL 51 IS 9 SI SI BP 724 EP 731 DI 10.1016/j.specom.2009.01.005 PG 8 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 475OZ UT WOS:000268370700003 ER PT J AU Zouari, L Chollet, G AF Zouari, Leila Chollet, Gerard TI Efficient codebooks for fast and accurate low resource ASR systems SO SPEECH COMMUNICATION LA English DT Article DE Speech recognition; Gaussian selection; Codebook ID SPEECH RECOGNITION; HMMS AB Today, speech interfaces have become widely employed in mobile devices, thus recognition speed and resource consumption are becoming new metrics of Automatic Speech Recognition (ASR) performance. For ASR systems using continuous Hidden Markov Models (HMMs), the computation of the state likelihood is one of the most time consuming parts. In this paper, we propose novel multi-level Gaussian selection techniques to reduce the cost of state likelihood computation. These methods are based on original and efficient codebooks. The proposed algorithms are evaluated within the framework of a large vocabulary continuous speech recognition task. (C) 2009 Elsevier B.V. All rights reserved. C1 [Zouari, Leila; Chollet, Gerard] GET ENST CNRS LTCI, Dept Traitement Signal & Images, F-75634 Paris, France. RP Zouari, L (reprint author), GET ENST CNRS LTCI, Dept Traitement Signal & Images, 46 Rue Barrault, F-75634 Paris, France. EM zouari@enst.fr CR Aiyer A, 2000, INT CONF ACOUST SPEE, P1519, DOI 10.1109/ICASSP.2000.861939 Bocchieri E., 1993, ICASSP-93. 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing (Cat. No.92CH3252-4), DOI 10.1109/ICASSP.1993.319405 CHAN A, 2004, INT C SPOK LANG PROC CHAN A, 2005, EUR C SPEECH COMM TE, P565 Digalakis V, 2000, COMPUT SPEECH LANG, V14, P33, DOI 10.1006/csla.1999.0134 Digalakis VV, 1996, IEEE T SPEECH AUDI P, V4, P281, DOI 10.1109/89.506931 FILALI K, 2002, INT C SPOK LANG PROC GALES MJF, 1999, IEEE T SPEECH AUDIO, P470 GALES MJF, 1996, INT C SPOK LANG PROC, P470 Galliano S, 2005, EUR C SPEECH COMM TE Herman SM, 1998, INT CONF ACOUST SPEE, P485, DOI 10.1109/ICASSP.1998.674473 JURGEN F, 1996, INT C AC SPEECH SIGN, P837 JURGEN F, 1996, EUR C SPEECH COMM TE, P1091 LEE A, 2001, INT C AC SPEECH SIGN, P1269 LEE A, 1997, IEEE INT C AC SPEECH LEPPANEN J, 2006, INT C AC SPEECH SIGN LI X, 2006, IEEE T AUD SPEECH LA MAK B, 2001, IEEE T SPEECH AUDIO, P264 Mokbel C, 2001, IEEE T SPEECH AUDI P, V9, P342, DOI 10.1109/89.917680 MOSUR RM, 1997, EUR C SPEECH COMM TE OLSEN J, 2000, IEEE NORD PROC S ORTMANNS S, 1997, EUR C SPEECH COMM TE, P139 PADMANABLAN M, 1997, IEEE WORKSH AUT SPEE, P325 PADMANABLAN M, 1999, IEEE T SPEECH AUDIO, P282 Pellom BL, 2001, IEEE SIGNAL PROC LET, V8, P221, DOI 10.1109/97.935736 SAGAYAMA S, 1995, SPEECH SIGNAL PROCES, V2, P213 SANKAR A, 1999, EUR C SPEECH COMM TE Sankar A, 2002, SPEECH COMMUN, V37, P133, DOI 10.1016/S0167-6393(01)00063-2 SUONTAUSTS J, 1999, WORKSH AUT SPEECH RE TAKAHASHI S, 1995, INT C AC SPEECH SIGN, V1, P520 Tsakalidis S, 1999, INT CONF ACOUST SPEE, P569 WOSZCZYNA M, 1998, THESIS KARLSRUHE U NR 32 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
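The Zouari and Chollet record above reduces state likelihood computation through codebook-based Gaussian selection. The sketch below shows the generic idea only: cluster the Gaussian means into a small codebook offline, then score just the Gaussians whose codeword lies near the observation at runtime. It is not the paper's multi-level codebook design, and the Gaussian pool is synthetic.

```python
import numpy as np

rng = np.random.default_rng(2)
dim, n_gauss, n_clusters = 13, 512, 16

# Toy diagonal-covariance Gaussian pool, standing in for the HMM state mixtures.
means = rng.standard_normal((n_gauss, dim))
variances = np.full((n_gauss, dim), 1.0)

# Offline: a few k-means iterations over the Gaussian means build the codebook.
centroids = means[rng.choice(n_gauss, n_clusters, replace=False)]
for _ in range(10):
    assign = np.argmin(((means[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)
    for k in range(n_clusters):
        if np.any(assign == k):
            centroids[k] = means[assign == k].mean(axis=0)

def log_gauss(x, mu, var):
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var).sum(-1)

def selected_log_likelihoods(x, shortlist_size=3):
    """Runtime: evaluate only Gaussians whose codeword is among the closest centroids."""
    nearest = np.argsort(((centroids - x) ** 2).sum(-1))[:shortlist_size]
    active = np.flatnonzero(np.isin(assign, nearest))
    return active, log_gauss(x, means[active], variances[active])

x = rng.standard_normal(dim)
active, scores = selected_log_likelihoods(x)
print(f"evaluated {active.size} of {n_gauss} Gaussians, best = {scores.max():.2f}")
```

The trade-off the record studies is exactly the one visible here: a smaller shortlist cuts likelihood computations but risks skipping the true best Gaussian.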
PD SEP PY 2009 VL 51 IS 9 SI SI BP 732 EP 743 DI 10.1016/j.specom.2009.01.010 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 475OZ UT WOS:000268370700004 ER PT J AU Iriondo, I Planet, S Socoro, JC Martinez, E Alias, F Monzo, C AF Iriondo, Ignasi Planet, Santiago Socoro, Joan-Claudi Martinez, Elisa Alias, Francesc Monzo, Carlos TI Automatic refinement of an expressive speech corpus assembling subjective perception and automatic classification SO SPEECH COMMUNICATION LA English DT Article DE Expressive speech databases; Expression of emotion; Speech technology; Expressive speech synthesis ID EMOTIONAL SPEECH; RECOGNITION; DATABASES; FEATURES AB This paper presents an automatic system able to enhance expressiveness in speech corpora recorded from acted or stimulated speech. The system is trained with the results of a subjective evaluation carried out on a reduced set of the original corpus. Once the system has been trained, it is able to check the complete corpus and perform an automatic pruning of the unclear utterances, i.e. with expressive styles which are different from the intended corpus. The content which most closely matches the subjective classification remains in the resulting corpus. An expressive speech corpus in Spanish, designed and recorded for speech synthesis purposes, has been used to test the presented proposal. The automatic refinement has been applied to the whole corpus and the result has been validated with a second subjective test. (C) 2008 Elsevier B.V. All rights reserved. C1 [Iriondo, Ignasi; Planet, Santiago; Socoro, Joan-Claudi; Martinez, Elisa; Alias, Francesc; Monzo, Carlos] Univ Ramon Llull, GPMM Grp Recerca Processament Multimodal Enginyer, Barcelona 08022, Spain. RP Iriondo, I (reprint author), Univ Ramon Llull, GPMM Grp Recerca Processament Multimodal Enginyer, C Quatre Camins 2, Barcelona 08022, Spain. EM iriondo@salle.url.edu RI Planet, Santiago/N-8400-2013; Alias, Francesc/L-1088-2014; Iriondo, Ignasi/L-1664-2014 OI Planet, Santiago/0000-0003-4573-3462; Alias, Francesc/0000-0002-1921-2375; Iriondo, Ignasi/0000-0003-2467-4192 FU European Commission [FP6 IST-4-027122-1P]; Spanish Government [TEC2006-08043/TCM] FX This work has been partially Supported by the European Commission, Project SALERO (FP6 IST-4-027122-1P) and the Spanish Government, Project SAVE (TEC2006-08043/TCM). CR Alias F, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P1698 Pitrelli JF, 2006, IEEE T AUDIO SPEECH, V14, P1099, DOI 10.1109/TASL.2006.876123 BOZKURT B, 2003, 8 EUR C SPEECH COMM, P277 Campbell N, 2005, IEICE T INF SYST, VE88D, P376, DOI 10.1093/ietisy/e88-d.3.376 Campbell N., 2000, P ISCA WORKSH SPEECH, P34 CAMPBELL NW, 2002, P 3 INT C LANG RES E CAMPBELL WN, 1991, J PHONETICS, V19, P37 Cowie R., 2001, IEEE SIGNAL PROCESSI, V18, P33 Cowie R, 2005, NEURAL NETWORKS, V18, P371, DOI 10.1016/j.neunet.2005.03.002 Devillers L, 2005, NEURAL NETWORKS, V18, P407, DOI 10.1016/j.neunet.2005.03.007 Douglas-Cowie E, 2003, SPEECH COMMUN, V40, P33, DOI 10.1016/S0167-6393(02)00070-5 DRIOLI C, 2003, VOQUAL 03, P127 Duda R. O., 2001, PATTERN CLASSIFICATI FRANCOIS H, 2002, P 3 INT C LANG RES E Goldberg D. 
E, 1989, GENETIC ALGORITHMS S HOZJAN V, 2002, P 3 INT C LANG RES E Iriondo I, 2007, LECT NOTES ARTIF INT, V4885, P86 IRIONDO I, 2007, 32 IEEE I C AC SPEEC, V4, P821 Iriondo T, 2007, LECT NOTES COMPUT SC, V4507, P646 Krstulovic S., 2007, P INT 2007 ANTW BELG, P1897 Michaelis D, 1997, ACUSTICA, V83, P700 MONTERO JM, 1998, 5 INT C SPOK LANG PR, P923 Montoya N., 1998, ZER REV ESTUDIOS SPA, P161 Monzo C., 2007, P 16 INT C PHON SCI, P2081 Morrison D, 2007, SPEECH COMMUN, V49, P98, DOI 10.1016/j.specom.2006.11.004 MURRAY IR, 1993, J ACOUST SOC AM, V93, P1097, DOI 10.1121/1.405558 Navas E, 2006, IEEE T AUDIO SPEECH, V14, P1117, DOI 10.1109/TASL.2006.876121 Nogueiras A., 2001, P EUROSPEECH, P2679 Oudeyer PY, 2003, INT J HUMAN COMPUTER, V59, P157, DOI DOI 10.1016/S1071-581(02)00141-6 PEREZ EH, 2003, FRECUENCIA FONEMAS Planet S., 2008, P 2 INT WORKSH EMOTI Scherer KR, 2003, SPEECH COMMUN, V40, P227, DOI 10.1016/S0167-6393(02)00084-5 Schroder M, 2004, THESIS SAARLAND U SCHWEITZER A, 2003, 15 INT C PHON SCI BA, P1301 Shami M, 2007, SPEECH COMMUN, V49, P201, DOI 10.1016/j.specom.2007.01.006 Theune M, 2006, IEEE T AUDIO SPEECH, V14, P1137, DOI 10.1109/TASL.2006.876129 Ververidis D, 2006, SPEECH COMMUN, V48, P1162, DOI 10.1016/j.specom.2006.04.003 WELLS J, 1993, SAMPA COMPUTER READA Witten I.H., 2005, DATA MINING PRACTICA NR 39 TC 12 Z9 12 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2009 VL 51 IS 9 SI SI BP 744 EP 758 DI 10.1016/j.specom.2008.12.001 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 475OZ UT WOS:000268370700005 ER PT J AU Gomez-Vilda, P Fernandez-Baillo, R Rodellar-Biarge, V Lluis, VN Alvarez-Marquina, A Mazaira-Fernandez, LM Martinez-Olalla, R Godino-Llorente, JI AF Gomez-Vilda, Pedro Fernandez-Baillo, Roberto Rodellar-Biarge, Victoria Nieto Lluis, Victor Alvarez-Marquina, Agustin Miguel Mazaira-Fernandez, Luis Martinez-Olalla, Rafael Ignacio Godino-Llorente, Juan TI Glottal Source biometrical signature for voice pathology detection SO SPEECH COMMUNICATION LA English DT Article DE Voice biometry; Speaker's identification; Speaker biometrical characterization; Voice pathology detection; Glottal Source ID ACOUSTIC ANALYSIS; VOCAL FOLDS; PARAMETERS AB The Glottal Source is an important component of voice as it can be considered as the excitation signal to the voice apparatus. The use of the Glottal Source for pathology detection or the biometric characterization of the speaker are important objectives in the acoustic study of the voice nowadays. Through the present work a biometric signature based on the speaker's power spectral density of the Glottal Source is presented. It may be shown that this spectral density is related to the vocal fold cover biomechanics, and from literature it is well-known that certain speaker's features such as gender, age or pathologic condition leave changes in it. The paper describes the methodology to estimate the biometric signature from the power spectral density of the mucosal wave correlate, which after normalization can be used in pathology detection experiments. Linear Discriminant Analysis is used to confront the detection capability of the parameters defined on this glottal signature among themselves and compared to classical perturbation parameters.
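The Linear Discriminant Analysis step just mentioned can be sketched as follows, assuming a matrix of glottal-signature parameters per subject; the synthetic feature values below merely stand in for the 100 normal and 100 pathologic voices described in the record and are not data from the study.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)

# Synthetic stand-ins: 100 normal and 100 pathologic voices, 8 glottal-signature
# parameters each, with a small class shift so the example is non-trivial.
normal = rng.standard_normal((100, 8))
pathologic = rng.standard_normal((100, 8)) + 0.8
X = np.vstack([normal, pathologic])
y = np.array([0] * 100 + [1] * 100)

lda = LinearDiscriminantAnalysis()
acc = cross_val_score(lda, X, y, cv=5)        # detection accuracy per fold
print(f"mean cross-validated detection accuracy: {acc.mean():.2f}")
```

With real glottal and perturbation parameters, the same cross-validated loop is one way to compare "parameter cocktails" for detection capability, in the spirit of the comparison the abstract describes.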
A database of 100 normal and 100 pathologic subjects equally balanced in gender and age is used to derive the best parameter cocktails for pathology detection and quantification purposes to validate this methodology in voice evaluation tests. In a study case presented to illustrate the detection capability of the methodology exposed a control Subset of 24 + 24 subjects is used to determine a subject's voice condition in a pre- and post-surgical evaluation. Possible applications of the study can be found in pathology detection and grading and in rehabilitation assessment after treatment. (C) 2008 Elsevier B.V. All rights reserved. C1 [Gomez-Vilda, Pedro; Fernandez-Baillo, Roberto; Rodellar-Biarge, Victoria; Nieto Lluis, Victor; Alvarez-Marquina, Agustin; Miguel Mazaira-Fernandez, Luis; Martinez-Olalla, Rafael] Univ Politecn Madrid, Fac Informat, E-28660 Madrid, Spain. [Ignacio Godino-Llorente, Juan] Univ Politecn Madrid, Escuela Univ Ingn Tecn Telecomunicac, Madrid 28031, Spain. RP Gomez-Vilda, P (reprint author), Univ Politecn Madrid, Fac Informat, Campus Montegancedo S-N, E-28660 Madrid, Spain. EM pedro@pino.datsi.fi.upm.es FU Plan Nacional de I + D + i [TIC2003-08756, TEC2006-12887-CO2-01/02]; Ministry of Education and Science; CAM/UPM [CCG06-UPM/TIC-0028]; Project HESPERIA; Programme CENIT; Centro para el Desarrollo Tecnologico Industrial; Ministry of Industry, Spain FX This work is being funded by Grants TIC2003-08756 and TEC2006-12887-CO2-01/02 from Plan Nacional de I + D + i, Ministry of Education and Science, by Grant CCG06-UPM/TIC-0028 from CAM/UPM, and by Project HESPERIA (http://www.proyecto-hesperia.org) from the Programme CENIT, Centro para el Desarrollo Tecnologico Industrial, Ministry of Industry, Spain. The authors want to express their most thanks to the anonymous reviewers helping to produce a better conceptualized and understandable manuscript. CR AKANDE OO, 2005, SPEECH COMMUN, V46, P1 Alku P., 1992, P IEEE INT C AC SPEE, V2, P29 Alku P., 2003, P VOQUAL 03, P81 ARROABARREN I, 2003, P ISCA TUTORIAL RES, P29 Berry DA, 2001, J PHONETICS, V29, P431, DOI 10.1006/jpho.2001.0148 Bimbot Frederic, 2004, EURASIP J APPL SIG P, V4, P430, DOI [DOI 10.1155/S1110865704310024, 10.1155/S1110865704310024] Boyanov B, 1997, IEEE ENG MED BIOL, V16, P74, DOI 10.1109/51.603651 Deller J. R., 1993, DISCRETE TIME PROCES Doval B, 2003, P VOQUAL 03, P16 Fant G, 1960, THEORY SPEECH PRODUC FANT G, 2004, STLQSPR, V4, P1 FERNANDEZBAILLO R, 2007, 7 PAN EUR VOIC C GRO, P94 FERNANDEZBAILLO R, 2007, MAVEBA 07, P65 GODINO JI, 2001, P IEEE ENG MED BIOL, P4253 GODINO JI, 2004, IEEE T BIOMED ENG, V5, P1380 Godino-Llorente JI, 2006, IEEE T BIO-MED ENG, V53, P1943, DOI 10.1109/TBME.2006.871883 GOMEZ P, 2006, LECT NOTES COMPUTER, V3817, P242 GOMEZ P, 2005, P EUROSPEECH 05, P645 GOMEZ P, 2007, P MABEVA 07, P183 GOMEZ P, 2004, P ICSLP 04, P842 Gomez-Vilda P, 2007, J VOICE, V21, P450, DOI 10.1016/j.jvoice.2006.01.008 Hadjitodorov S, 2000, IEEE T INF TECHNOL B, V4, P68, DOI 10.1109/4233.826861 HIRANO M, 1988, ACTA OTO-LARYNGOL, V105, P432, DOI 10.3109/00016488809119497 HOLMBERG EB, 1988, J ACOUST SOC AM, V84, P511, DOI 10.1121/1.396829 JACKSON LB, 1989, IEEE T ACOUST SPEECH, V10, P1606 Johnson RA, 2002, APPL MULTIVARIATE ST, V5th KUO J, 1999, P ICASSP 99 15 19 MA, V1, P77 Nickel R. 
M., 2006, IEEE Circuits and Systems Magazine, V6 ORR R, 2003, P ISCA WORKSH VOIC Q, P35 Parsa V, 2000, J SPEECH LANG HEAR R, V43, P469 PRICE PJ, 1989, SPEECH COMMUN, V8, P261, DOI 10.1016/0167-6393(89)90005-8 Ritchings RT, 2002, MED ENG PHYS, V24, P561, DOI 10.1016/S1350-4533(02)00064-4 Rodellar V., 1993, Simulation Practice and Theory, V1, DOI 10.1016/0928-4869(93)90008-E Rosa MD, 2000, IEEE T BIO-MED ENG, V47, P96 Ruiz MT, 1997, J EPIDEMIOL COMMUN H, V51, P106, DOI 10.1136/jech.51.2.106 SHALVI O, 1990, IEEE T INFORM THEORY, V36, P312, DOI 10.1109/18.52478 STORY BH, 1995, J ACOUST SOC AM, V97, P1249, DOI 10.1121/1.412234 Svec JG, 2000, J ACOUST SOC AM, V108, P1397, DOI 10.1121/1.1289205 TITZE IR, 1994, WORKSH AC VOIC AN NA Whiteside SP, 2001, J ACOUST SOC AM, V110, P464, DOI 10.1121/1.1379087 NR 40 TC 14 Z9 14 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2009 VL 51 IS 9 SI SI BP 759 EP 781 DI 10.1016/j.specom.2008.09.005 PG 23 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 475OZ UT WOS:000268370700006 ER PT J AU Bouzid, A Ellouze, N AF Bouzid, A. Ellouze, N. TI Voice source parameter measurement based on multi-scale analysis of electroglottographic signal SO SPEECH COMMUNICATION LA English DT Article DE Electroglottographic signal; Multi-scale product; Voicing decision; Voice source parameters; Fundamental frequency; Open quotient ID EDGE-DETECTION; SCALE MULTIPLICATION; SPEECH; CONTACT; DOMAIN; MODEL; FLOW AB This paper deals with glottal parameter measurement from electroglottographic signal (EGG). The proposed approach is based on GCI and GOI determined by the multi-scale analysis of the EGG signal. Wavelet transform of EGG signal is done with a quadratic spline function. Wavelet coefficients calculated on different dyadic scales, show modulus maxima at localized discontinuities of the EGG signal. The detected maxima and minima correspond to the so-called GOIs and GCIs. To improve the GCI and GOI localization precision, the product of wavelet transform coefficients of three successive dyadic scales, called multi-scale product (MP), is operated. This process enhances edges and reduces noise and spurious peaks. Applying the cubic root amplitude on the multi-scale product improves the detection of weak GOI maximum and avoids the GCI misses. Applied on the Keele University database, the method brings about a good detection of GCI and GOI. Based on the GCI and GOI, voicing classification, pitch frequency and open quotient measurements are processed. The proposed voicing classification approach is evaluated with additive noise. For clean signal the performance is of 96.4%, and at SNR level of 5 dB, the performance is of 93%. For the fundamental frequency and the open quotient measurement, the comparison of the MP with the DEGG, Howard (3/7), the threshold (35% and 50%), and the DECOM methods show that this new proposed approach is similar to the major methods with an improvement displayed by its lowest deviation. (C) 2008 Elsevier B.V. All rights reserved. C1 [Bouzid, A.] ISECS, Sfax 3018, Tunisia. [Ellouze, N.] ENIT, Tunis 1002, Tunisia. RP Bouzid, A (reprint author), ISECS, PB 868, Sfax 3018, Tunisia. 
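The Bouzid and Ellouze record above localizes GCIs and GOIs from the multi-scale product of wavelet coefficients of the EGG signal, with a cube root applied to lift weak GOI extrema. The sketch below follows that outline using PyWavelets' stationary wavelet transform with a Daubechies wavelet as a stand-in for the quadratic spline analysis of the paper; the sawtooth test signal and the peak threshold are placeholders.

```python
import numpy as np
import pywt

def multiscale_product(x, wavelet="db2", levels=3):
    """Element-wise product of the SWT detail coefficients of the first dyadic
    scales, followed by a signed cube root to strengthen weak extrema."""
    pad = (-len(x)) % (2 ** levels)               # swt needs a multiple of 2**levels
    xp = np.pad(x, (0, pad))
    details = [d for _, d in pywt.swt(xp, wavelet, level=levels)]
    return np.cbrt(np.prod(details, axis=0)[: len(x)])

def pick_instants(mp, threshold=0.3):
    """Local maxima above a relative threshold ~ candidate GCIs;
    local minima below the mirrored threshold ~ candidate GOIs."""
    m = np.max(np.abs(mp)) + 1e-12
    peaks = [i for i in range(1, len(mp) - 1)
             if mp[i] > mp[i - 1] and mp[i] > mp[i + 1] and mp[i] > threshold * m]
    troughs = [i for i in range(1, len(mp) - 1)
               if mp[i] < mp[i - 1] and mp[i] < mp[i + 1] and mp[i] < -threshold * m]
    return peaks, troughs

# Sawtooth-like stand-in for a voiced EGG segment (120 Hz at 8 kHz sampling).
t = np.arange(0, 0.05, 1 / 8000.0)
egg = (t * 120.0) % 1.0
gci, goi = pick_instants(multiscale_product(egg))
print(len(gci), len(goi))
```

On real EGG data the picked instants would then feed the voicing decision, fundamental frequency and open quotient measurements that the record evaluates on the Keele database.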
EM bouzidacha@yahoo.fr CR ANASTAPLO S, 1988, J ACOUST SOC AM, V83, P1883, DOI 10.1121/1.396472 Bao P, 2005, IEEE T PATTERN ANAL, V27, P1485, DOI 10.1109/TPAMI.2005.173 BOUZID A, 2003, P EUR 2003 GEN, P2837 Bouzid A., 2004, P EUR SIGN PROC C EU, P729 BROOKES M, 2008, SPEECH PROCESSING TO CHILDERS DG, 1985, CRIT REV BIOMED ENG, V12, P131 CHILDERS DG, 1984, FOLIA PHONIATR, V36, P105 CHILDERS DG, 1991, J ACOUST SOC AM, V90, P2394, DOI 10.1121/1.402044 Fisher E, 2006, IEEE T AUDIO SPEECH, V14, P502, DOI 10.1109/TSA.2005.857806 Henrich N, 2004, J ACOUST SOC AM, V115, P1321, DOI 10.1121/1.1646401 HERBST C, 2004, THESIS DEP SPEECH MU HOWARD DM, 1995, J VOICE, V9, P1212 HOWARD DM, 1990, J VOICE, V4, P205, DOI 10.1016/S0892-1997(05)80015-3 Kadambe S., 1991, P ICASSP, P449, DOI 10.1109/ICASSP.1991.150373 KRISHNAMURTHY AK, 1986, IEEE T ACOUST SPEECH, V34, P730, DOI 10.1109/TASSP.1986.1164909 Mallat S., 1999, WAVELET TOUR SIGNAL MALLAT S, 1992, IEEE T INFORM THEORY, V38, P617, DOI 10.1109/18.119727 MCKENNA J, 1999, P EUR 99 BUD, P2793 Naylor PA, 2007, IEEE T AUDIO SPEECH, V15, P34, DOI 10.1109/TASL.2006.876878 NOGOC TV, 1999, P EUR 1999, P2805 PEREZ J, 2005, P ICSLP 2005, P1065 Plante F., 1995, P EUR, P837 Plumpe MD, 1999, IEEE T SPEECH AUDI P, V7, P569, DOI 10.1109/89.784109 ROSENFEL.A, 1970, PR INST ELECTR ELECT, V58, P814, DOI 10.1109/PROC.1970.7756 ROTHENBERG M, 1988, J SPEECH HEAR RES, V31, P338 Sadler BM, 1999, IEEE T INFORM THEORY, V45, P1043, DOI 10.1109/18.761341 Sadler BM, 1998, J ACOUST SOC AM, V104, P955, DOI 10.1121/1.423312 Sapienza CM, 1998, J VOICE, V12, P31, DOI 10.1016/S0892-1997(98)80073-8 Sohn J, 1999, IEEE SIGNAL PROC LET, V6, P1 VEENEMAN DE, 1985, IEEE T ACOUST SPEECH, V33, P369, DOI 10.1109/TASSP.1985.1164544 XU YS, 1994, IEEE T IMAGE PROCESS, V3, P747 Zhang L, 2002, PATTERN RECOGN LETT, V23, P1771, DOI 10.1016/S0167-8655(02)00151-4 NR 32 TC 5 Z9 7 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2009 VL 51 IS 9 SI SI BP 782 EP 792 DI 10.1016/j.specom.2008.08.004 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 475OZ UT WOS:000268370700007 ER PT J AU Kroger, BJ Kannampuzha, J Neuschaefer-Rube, C AF Kroeger, Bernd J. Kannampuzha, Jim Neuschaefer-Rube, Christiane TI Towards a neurocomputational model of speech production and perception SO SPEECH COMMUNICATION LA English DT Review DE Speech; Speech production; Speech perception; Neurocomputational model; Artificial neural networks; Self-organizing networks ID NEURAL-NETWORK MODEL; TEMPORAL-LOBE; CATEGORICAL PERCEPTION; ACTION REPRESENTATION; SENSORIMOTOR CONTROL; LANGUAGE PRODUCTION; SPANISH VOWELS; LEXICAL ACCESS; BRAIN; FMRI AB The limitation in performance of current speech synthesis and speech recognition systems may result from the fact that these systems are not designed with respect to the human neural processes of speech production and perception. A neurocomputational model of speech production and perception is introduced which is organized with respect to human neural processes of speech production and perception. The production-perception model comprises an artificial computer-implemented vocal tract as a front-end module, which is capable of generating articulatory speech movements and acoustic speech signals. The structure of the production-perception model comprises motor and sensory processing pathways.
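The neurocomputational model described in this record stores the sensorimotor knowledge gathered during its training stages in self-organizing maps, as the record goes on to note. As a minimal sketch of one Kohonen-style training pass, under the assumption of toy two-dimensional input vectors standing in for auditory states, consider the following; it is a generic SOM, not the model's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(4)

def train_som(data, grid=(10, 10), epochs=20, lr0=0.5, sigma0=3.0):
    """Plain Kohonen training: pull the best-matching unit and its map
    neighbours towards each input, shrinking radius and learning rate."""
    rows, cols = grid
    weights = rng.standard_normal((rows, cols, data.shape[1]))
    coords = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)
    n_steps, step = epochs * len(data), 0
    for _ in range(epochs):
        for x in rng.permutation(data):
            lr = lr0 * (1.0 - step / n_steps)
            sigma = sigma0 * (1.0 - step / n_steps) + 0.5
            dist = ((weights - x) ** 2).sum(-1)
            bmu = np.unravel_index(np.argmin(dist), dist.shape)
            grid_dist = ((coords - np.array(bmu)) ** 2).sum(-1)
            h = np.exp(-grid_dist / (2.0 * sigma ** 2))      # neighbourhood function
            weights += lr * h[..., None] * (x - weights)
            step += 1
    return weights

# Toy "auditory state" vectors: two clusters standing in for two vowel categories.
data = np.vstack([rng.normal(0.0, 0.1, (200, 2)), rng.normal(1.0, 0.1, (200, 2))])
som = train_som(data)
print(som.shape)
```

After training, nearby map units respond to similar inputs, which is the property such models rely on when linking motor, auditory and phonemic representations.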
Speech knowledge is collected during training stages which imitate early stages of speech acquisition. This knowledge is stored in artificial self-organizing maps. The current neurocomputational model is capable of producing and perceiving vowels, VC-, and CV-syllables (V = vowels and C = voiced plosives). Basic features of natural speech production and perception are predicted from this model in a straight forward way: Production of speech items is feedforward and feedback controlled and phoneme realizations vary within perceptually defined regions. Perception is less categorical in the case of vowels in comparison to consonants. Due to its human-like production-perception processing the model should be discussed as a basic module for more technical relevant approaches for high-quality speech synthesis and for high performance speech recognition. (C) 2008 Elsevier B.V. All rights reserved. C1 [Kroeger, Bernd J.] Univ Hosp Aachen, Dept Phoniatr Pedaudiol & Commun Disorders, Aachen, Germany. Univ Aachen, D-5100 Aachen, Germany. RP Kroger, BJ (reprint author), Univ Hosp Aachen, Dept Phoniatr Pedaudiol & Commun Disorders, Aachen, Germany. EM bkroeger@ukaachen.de; jkannampuzha@ukaachen.de; cneuschaefer@ukaachen.de RI Kroger, Bernd/A-4435-2009 OI Kroger, Bernd/0000-0002-4727-2957 FU German Research Council [KR 1439/13-1] FX This work was supported in part by the German Research Council Grant No. KR 1439/13-1. CR Ackermann H, 2004, BRAIN LANG, V89, P320, DOI 10.1016/S0093-934X(03)00347-X Arbib MA, 2005, BEHAV BRAIN SCI, V28, P105, DOI 10.1017/S0140525X05000038 Bailly G, 1997, SPEECH COMMUN, V22, P251, DOI 10.1016/S0167-6393(97)00025-3 Batchelder EO, 2002, COGNITION, V83, P167, DOI 10.1016/S0010-0277(02)00002-1 Benson RR, 2001, BRAIN LANG, V78, P364, DOI 10.1006/brln.2001.2484 Benzeghiba M, 2007, SPEECH COMMUN, V49, P763, DOI 10.1016/j.specom.2007.02.006 Binder JR, 2000, CEREB CORTEX, V10, P512, DOI 10.1093/cercor/10.5.512 Birkholz P., 2006, P 7 INT SEM SPEECH P, P493 BIRKHOLZ P, 2007, P 16 INT C PHON SCI, P377 Birkholz P., 2004, P INT 2004 ICSLP JEJ, P1125 BIRKHOLZ P, 2006, P INT C AC SPEECH SI, P873 Birkhoz P, 2007, IEEE T AUDIO SPEECH, V15, P1218, DOI 10.1109/TASL.2006.889731 Blank SC, 2002, BRAIN, V125, P1829, DOI 10.1093/brain/awf191 Boatman D, 2004, COGNITION, V92, P47, DOI 10.1016/j.cognition.2003.09.010 Bookheimer SY, 2000, NEUROLOGY, V55, P1151 BRADLOW AR, 1995, J ACOUST SOC AM, V97, P1916, DOI 10.1121/1.412064 Brent MR, 1999, TRENDS COGN SCI, V3, P294, DOI 10.1016/S1364-6613(99)01350-9 BROWMAN CP, 1992, PHONETICA, V49, P155 Browman CP, 1989, PHONOLOGY, V6, P201, DOI 10.1017/S0952675700001019 BULLOCK D, 1993, J COGNITIVE NEUROSCI, V5, P408, DOI 10.1162/jocn.1993.5.4.408 Callan DE, 2006, NEUROIMAGE, V31, P1327, DOI 10.1016/j.neuroimage.2006.01.036 Cervera T, 2001, J SPEECH LANG HEAR R, V44, P988, DOI 10.1044/1092-4388(2001/077) Clark RAJ, 2007, SPEECH COMMUN, V49, P317, DOI 10.1016/j.specom.2007.01.014 Damper RI, 2000, PERCEPT PSYCHOPHYS, V62, P843, DOI 10.3758/BF03206927 Dell GS, 1999, COGNITIVE SCI, V23, P517 EIMAS PD, 1963, LANG SPEECH, V6, P206 Fadiga L, 2004, J CLIN NEUROPHYSIOL, V21, P157, DOI 10.1097/00004691-200405000-00004 Fadiga L, 2002, EUR J NEUROSCI, V15, P399, DOI 10.1046/j.0953-816x.2001.01874.x FOWLER CA, 1986, J PHONETICS, V14, P3 FRY DB, 1962, LANG SPEECH, V5, P171 Gaskell MG, 1997, LANG COGNITIVE PROC, V12, P613 Goldstein L, 2006, ACTION TO LANGUAGE VIA THE MIRROR NEURON SYSTEM, P215, DOI 10.1017/CBO9780511541599.008 Goldstein L, 2007, COGNITION, V103, P386, DOI 
10.1016/j.cognition.2006.05.010 Grossberg S, 2003, J PHONETICS, V31, P423, DOI 10.1016/S0095-4470(03)00051-2 GUENTHER FH, 1994, BIOL CYBERN, V72, P43, DOI 10.1007/BF00206237 Guenther FH, 2006, J COMMUN DISORD, V39, P350, DOI 10.1016/j.jcomdis.2006.06.013 Guenther FH, 2006, BRAIN LANG, V96, P280, DOI 10.1016/j.bandl.2005.06.001 GUENTHER FH, 1995, PSYCHOL REV, V102, P594 Hartsuiker RJ, 2001, COGNITIVE PSYCHOL, V42, P113, DOI 10.1006/cogp.2000.0744 Heim S, 2003, COGNITIVE BRAIN RES, V16, P285, DOI 10.1016/S0926-6410(02)00284-7 Hickok G, 2004, COGNITION, V92, P67, DOI 10.1016/j.cognition.2003.10.011 Hickok G, 2000, TRENDS COGN SCI, V4, P131, DOI 10.1016/S1364-6613(00)01463-7 Hillis AE, 2004, BRAIN, V127, P1479, DOI 10.1093/brain/awh172 Huang J, 2001, HUM BRAIN MAPP, V15, P39 Iacoboni M, 2005, CURR OPIN NEUROBIOL, V15, P632, DOI 10.1016/j.conb.2005.10.010 Indefrey P, 2004, COGNITION, V92, P101, DOI 10.1016/j.cognition.2002.06.001 Ito T, 2004, BIOL CYBERN, V91, P275, DOI 10.1007/s00422-004-0510-6 Jardri R, 2007, NEUROIMAGE, V35, P1645, DOI 10.1016/j.neuroimage.2007.02.002 Jusczyk PW, 1999, TRENDS COGN SCI, V3, P323, DOI 10.1016/S1364-6613(99)01363-7 Kandel E. R., 2000, PRINCIPLES NEURAL SC Kemeny S, 2005, HUM BRAIN MAPP, V24, P173, DOI 10.1002/hbm.20078 Kohler E, 2002, SCIENCE, V297, P846, DOI 10.1126/science.1070311 Kohonen T., 2001, SELF ORG MAPS Kroger BJ, 2007, LECT NOTES COMPUT SC, V4775, P174 KROGER BJ, 1993, PHONETICA, V50, P213 KROGER BJ, 2006, P 9 INT C SPOK LANG, P565 KROGER BJ, 2006, DAGA P ANN M GERM AC, P561 Kroger BJ, 2008, LECT NOTES ARTIF INT, V5042, P121 KROGER BJ, 2006, P 7 INT SEM SPEECH P, P67 Kuriki S, 1999, NEUROREPORT, V10, P765, DOI 10.1097/00001756-199903170-00019 Latorre J, 2006, SPEECH COMMUN, V48, P1227, DOI 10.1016/j.specom.2006.05.003 LEVELT WJM, 1994, COGNITION, V50, P239, DOI 10.1016/0010-0277(94)90030-2 Levelt WJM, 1999, BEHAV BRAIN SCI, V22, P1 Li P, 2004, NEURAL NETWORKS, V17, P1345, DOI 10.1016/j.neunet.2004.07.004 LIBERMAN AM, 1957, J EXP PSYCHOL, V54, P358, DOI 10.1037/h0044417 LIBERMAN AM, 1985, COGNITION, V21, P1, DOI 10.1016/0010-0277(85)90021-6 LIBERMAN AM, 1967, PSYCHOL REV, V74, P431, DOI 10.1037/h0020279 Liebenthal E, 2005, CEREB CORTEX, V15, P1621, DOI 10.1093/cercor/bhi040 Luce PA, 2000, PERCEPT PSYCHOPHYS, V62, P615, DOI 10.3758/BF03212113 Maass W, 1999, INFORM COMPUT, V153, P26, DOI 10.1006/inco.1999.2806 MCCLELLAND JL, 1986, COGNITIVE PSYCHOL, V18, P1, DOI 10.1016/0010-0285(86)90015-0 Murphy K, 1997, J APPL PHYSIOL, V83, P1438 Nasir SM, 2006, CURR BIOL, V16, P1918, DOI 10.1016/j.cub.2006.07.069 Norris D, 2006, COGNITIVE PSYCHOL, V53, P146, DOI 10.1016/j.cogpsych.2006.03.001 Obleser J, 2006, HUM BRAIN MAPP, V27, P562, DOI 10.1002/hbm.20201 Obleser J, 2007, J NEUROSCI, V27, P2283, DOI 10.1523/JNEUROSCI.4663-06.2007 Okada K, 2006, BRAIN LANG, V98, P112, DOI 10.1016/j.bandl.2006.04.006 Oller DK, 1999, J COMMUN DISORD, V32, P223, DOI 10.1016/S0021-9924(99)00013-1 PERKELL JS, 1993, J ACOUST SOC AM, V93, P2948, DOI 10.1121/1.405814 Poeppel D, 2004, NEUROPSYCHOLOGIA, V42, P183, DOI 10.1016/j.neuropsychologia.2003.07.010 Postma A, 2000, COGNITION, V77, P97, DOI 10.1016/S0010-0277(00)00090-1 Raphael L.J., 2007, SPEECH SCI PRIMER PH Riecker A, 2006, NEUROIMAGE, V29, P46, DOI 10.1016/j.neuroimage.2005.03.046 Rimol LM, 2005, NEUROIMAGE, V26, P1059, DOI 10.1016/j.neuroimage.2005.03.028 RITTER H, 1989, BIOL CYBERN, V61, P241, DOI 10.1007/BF00203171 Rizzolatti G, 2004, ANNU REV NEUROSCI, V27, P169, DOI 10.1146/annurev.neuro.27.070203.144230 Rizzolatti G, 1998, 
TRENDS NEUROSCI, V21, P188, DOI 10.1016/S0166-2236(98)01260-0 Rosen HJ, 2000, BRAIN COGNITION, V42, P201, DOI 10.1006/brcg.1999.1100 Saltzman E, 2000, HUM MOVEMENT SCI, V19, P499, DOI 10.1016/S0167-9457(00)00030-0 SALTZMAN E, 1979, J MATH PSYCHOL, V20, P91, DOI 10.1016/0022-2496(79)90020-8 Saltzman E. L., 1989, ECOL PSYCHOL, V1, P333, DOI 10.1207/s15326969eco0104_2 Sanguineti V, 1997, BIOL CYBERN, V77, P11, DOI 10.1007/s004220050362 Scharenborg O, 2007, SPEECH COMMUN, V49, P336, DOI 10.1016/j.specom.2007.01.009 Scott SK, 2000, BRAIN, V123, P2400, DOI 10.1093/brain/123.12.2400 SHADMEHR R, 1994, J NEUROSCI, V14, P3208 Shuster LI, 2005, BRAIN LANG, V93, P20, DOI 10.1016/j.bandl.2004.07.007 Sober SJ, 2003, J NEUROSCI, V23, P6982 Soros P, 2006, NEUROIMAGE, V32, P376, DOI 10.1016/j.neuroimage.2006.02.046 Studdert-Kennedy M, 2002, MIRROR NEURONS EVOLU, P207 Tani J, 2004, NEURAL NETWORKS, V17, P1273, DOI 10.1016/j.neunet.2004.05.007 Todorov E, 2004, NAT NEUROSCI, V7, P907, DOI 10.1038/nn1309 Tremblay S, 2003, NATURE, V423, P866, DOI 10.1038/nature01710 Ullman MT, 2001, NAT REV NEUROSCI, V2, P717, DOI 10.1038/35094573 Uppenkamp S, 2006, NEUROIMAGE, V31, P1284, DOI 10.1016/j.neuroimage.2006.01.004 Vanlancker-Sidtis D, 2003, BRAIN LANG, V85, P245, DOI 10.1016/S0093-934X(02)00596-5 Varley R, 2001, APHASIOLOGY, V15, P39 Werker JF, 2005, TRENDS COGN SCI, V9, P519, DOI 10.1016/j.tics.2005.09.003 Westermann G, 2004, BRAIN LANG, V89, P393, DOI 10.1016/S0093-934X(03)00345-6 Wilson M, 2005, PSYCHOL BULL, V131, P460, DOI 10.1037/0033-2909.131.3.460 Wilson SM, 2004, NAT NEUROSCI, V7, P701, DOI 10.1038/nn1263 Wise RJS, 1999, LANCET, V353, P1057, DOI 10.1016/S0140-6736(98)07491-1 Zekveld AA, 2006, NEUROIMAGE, V32, P1826, DOI 10.1016/j.neuroimage.2006.04.199 Zell A., 2003, SIMULATION NEURONALE NR 113 TC 27 Z9 27 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2009 VL 51 IS 9 SI SI BP 793 EP 809 DI 10.1016/j.specom.2008.08.002 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 475OZ UT WOS:000268370700008 ER PT J AU Hsieh, CH Feng, TY Huang, PC AF Hsieh, Cheng-Hsiung Feng, Ting-Yu Huang, Po-Chin TI Energy-based VAD with grey magnitude spectral subtraction SO SPEECH COMMUNICATION LA English DT Article DE Voice activity detection; Grey system; Magnitude spectral subtraction; G.729; GSM AMR ID VOICE ACTIVITY DETECTION; ROBUST SPEECH RECOGNITION; MODEL; ENHANCEMENT; NOISE AB In this paper, we propose a novel voice activity detection (VAD) scheme for low SNR conditions with additive white noise. The proposed approach consists of two parts. First, a grey magnitude spectral subtraction (GMSS) is applied to remove additive noise from a given noisy speech. By this doing, an estimated clean speech is obtained. Second, the enhanced speech by the GMSS is segmented and put into an energy-based VAD to determine whether it is a speech or non-speech segment. The approach presented in this paper is called the GMSS/EVAD. Simulation results indicate that the proposed GMSS/EVAD outperforms VAD in G.729 and GSM AMR for the given low SNR examples. To investigate the performance of the GMSS/EVAD for real-life background noises, the babble and volvo noises in the NOISEX-92 database are under consideration. 
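The GMSS/EVAD scheme described above can be caricatured as magnitude spectral subtraction followed by a frame-energy threshold, which is what the sketch below does. The grey-model (GM(1,1)) noise prediction of the paper is replaced here by a simple average over leading frames assumed to be non-speech, and the test signal, over-subtraction factor and threshold are placeholders.

```python
import numpy as np

def stft_frames(x, frame_len=256, hop=128):
    n = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n)[:, None]
    return np.fft.rfft(x[idx] * np.hanning(frame_len), axis=1)

def energy_vad(x, n_noise_frames=10, over_sub=1.5, energy_factor=3.0):
    mag = np.abs(stft_frames(x))
    noise_mag = mag[:n_noise_frames].mean(axis=0)     # leading frames assumed non-speech
    clean_mag = np.maximum(mag - over_sub * noise_mag, 0.0)
    energy = (clean_mag ** 2).sum(axis=1)
    threshold = energy_factor * energy[:n_noise_frames].mean()
    return energy > threshold                          # True = speech frame

rng = np.random.default_rng(5)
signal = 0.1 * rng.standard_normal(8000)
signal[3000:5000] += np.sin(2 * np.pi * 200 * np.arange(2000) / 8000.0)  # "speech" burst
print(energy_vad(signal).astype(int))
```

The point of subtracting the noise magnitude before thresholding is that the residual energy of non-speech frames collapses towards zero, so a fixed threshold separates the burst even at low SNR; the paper's grey prediction refines the noise estimate that this sketch takes as a plain average.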
The simulation results for the given examples indicate that the GMSS/EVAD is able to handle appropriately for the cases of the babble noise with the SNR above 10 dB and the cases of the volvo noise with SNR 15 dB and up. (C) 2008 Elsevier B.V. All rights reserved. C1 [Hsieh, Cheng-Hsiung; Feng, Ting-Yu; Huang, Po-Chin] Chaoyang Univ Technol, Dept Comp Sci & Informat Engn, Wufong 413, Taiwan. RP Hsieh, CH (reprint author), Chaoyang Univ Technol, Dept Comp Sci & Informat Engn, Wufong 413, Taiwan. EM chhsieh@cyut.edu.tw FU National Science Council of Republic of China [NSC 95-2221-E-324040, NSC 96-2221-E-324-044] FX This work was supported by National Science Council of Republic of China under Grants NSC 95-2221-E-324040 and NSC 96-2221-E-324-044. CR Chang JH, 2001, IEICE T INF SYST, VE84D, P1231 Chang JH, 2006, IEEE T SIGNAL PROCES, V54, P1965, DOI 10.1109/TSP.2006.874403 Chang JH, 2003, ELECTRON LETT, V39, P632, DOI 10.1049/el:20030392 Childers D. G., 1999, SPEECH PROCESSING SY Davis A, 2006, IEEE T AUDIO SPEECH, V14, P412, DOI 10.1109/TSA.2005.855842 Deng Julong, 1989, Journal of Grey Systems, V1 Deng JL, 1982, SYSTEMS CONTROL LETT, V5, P288 ESTEVEZ PA, 2005, ELECT LETT, V41 *ETSI, 1999, 301 708 ETSI EN Goorriz JM, 2006, SPEECH COMMUN, V48, P1638, DOI 10.1016/j.specom.2006.07.006 Gorriz JM, 2006, J ACOUST SOC AM, V120, P470, DOI 10.1121/1.2208450 Hsieh CH, 2003, IEICE T INF SYST, VE86D, P522 ITU-T, 1996, G729 ITUT Kim DK, 2007, IEEE SIGNAL PROC LET, V14, P891, DOI 10.1109/LSP.2007.900225 Kim HI, 2004, ELECTRON LETT, V40, P1454, DOI 10.1049/el:20046064 LONGBOTHAM HG, 1989, IEEE T ACOUST SPEECH, V37, P275, DOI 10.1109/29.21690 Ramirez J, 2004, IEEE SIGNAL PROC LET, V11, P266, DOI 10.1109/LSP.2003.821762 Ramirez J, 2005, IEEE T SPEECH AUDI P, V13, P1119, DOI 10.1109/TSA.2005.853212 Ramirez J, 2004, SPEECH COMMUN, V42, P271, DOI 10.1016/j.specom.2003.10.002 Ramirez J, 2007, IEEE T AUDIO SPEECH, V15, P2177, DOI 10.1109/TASL.2007.903937 Sohn J, 1999, IEEE SIGNAL PROC LET, V6, P1 Tahmasbi R, 2007, IEEE T AUDIO SPEECH, V15, P1129, DOI 10.1109/TASL.2007.894521 NR 22 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2009 VL 51 IS 9 SI SI BP 810 EP 819 DI 10.1016/j.specom.2008.08.005 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 475OZ UT WOS:000268370700009 ER PT J AU Monte-Moreno, E Chetouani, M Faundez-Zanuy, M Sole-Casals, J AF Monte-Moreno, Enric Chetouani, Mohamed Faundez-Zanuy, Marcos Sole-Casals, Jordi TI Maximum likelihood linear programming data fusion for speaker recognition SO SPEECH COMMUNICATION LA English DT Article DE Speaker recognition; Data fusion; GMM; Maximum likelihood; Linear programming; Simplex ID WIENER SYSTEMS; IDENTIFICATION; SPEECH; INFORMATION; PREDICTION; INVERSION; MIXTURES AB Biometric system performance can be improved by means of data fusion. Several kinds of information can be fused in order to obtain a more accurate classification (identification or verification) of an input sample. In this paper we present a method for computing the weights in a weighted sum fusion for score combinations, by means of a likelihood model. The maximum likelihood estimation is set as a linear programming problem. The scores are derived from a GMM classifier working on different feature extraction techniques. 
Our experimental results assessed the robustness of the system in the face of changes over time (different sessions) and robustness in the face of changes of microphone. The improvements obtained were significantly better (error bars of two standard deviations) than a uniform weighted sum or a uniform weighted product or the best single classifier. The proposed method scales computationally with the number of scores to be fused as the simplex method for linear programming. (C) 2008 Elsevier B.V. All rights reserved. C1 [Monte-Moreno, Enric] UPC Barcelona, TALP Res Ctr, Barcelona, Spain. [Chetouani, Mohamed] Univ Paris 06, F-75252 Paris 05, France. [Faundez-Zanuy, Marcos] UPC Barcelona, Escola Univ Politecn Mataro, Barcelona, Spain. [Sole-Casals, Jordi] Univ Vic, Barcelona, Spain. RP Monte-Moreno, E (reprint author), UPC Barcelona, TALP Res Ctr, Barcelona, Spain. EM enric@gps.tsc.upc.edu RI CHETOUANI, Mohamed/F-5854-2010; Faundez-Zanuy, Marcos/F-6503-2012; Monte Moreno, Enrique/F-8218-2013; Sole-Casals, Jordi/B-7754-2008 OI Faundez-Zanuy, Marcos/0000-0003-0605-1282; Monte Moreno, Enrique/0000-0002-4907-0494; Sole-Casals, Jordi/0000-0002-6534-1979 FU FEDER; MEC [TEC2006-13141-C03-02/TCM, TIN2005-08852, TEC2007-61535/TCM] FX This work has been partially supported by FEDER and MEC, TEC2006-13141-C03-02/TCM, TIN2005-08852, and the TEC2007-61535/TCM. CR ATAL BS, 1971, J ACOUST SOC AM, V50, P637, DOI 10.1121/1.1912679 BELLINGS SA, 1978, P IEEE, V66, P691 Bertsimas D., 1997, INTRO LINEAR OPTIMIZ Bishop C. M., 1995, NEURAL NETWORKS PATT BOER ED, 1976, P IEEE, V64, P1443 Boyd S., 2004, CONVEX OPTIMIZATION COVER T. M., 1991, WILEY SERIES TELECOM FAUNDEZ M, 1998, INT C SPOK LANG PROC, V2, P121 Faundez-Zanuy M, 2004, IEEE AERO EL SYS MAG, V19, P3, DOI 10.1109/MAES.2004.1308819 Faundez-Zanuy M, 2005, IEEE AERO EL SYS MAG, V20, P7, DOI 10.1109/MAES.2005.1432568 Faundez-Zanuy M, 2005, IEEE AERO EL SYS MAG, V20, P25, DOI 10.1109/MAES.2005.1499300 Faundez-Zanuy M, 2005, IEEE AERO EL SYS MAG, V20, P34, DOI 10.1109/MAES.2005.1396793 Faundez-Zanuy M, 2006, IEEE AERO EL SYS MAG, V21, P15, DOI 10.1109/MAES.2006.1662038 Faundez-Zanuy M., 2002, Control and Intelligent Systems, V30 Hayakawa S, 1997, LECT NOTES COMPUT SC, V1206, P253 HE J, 1996, P IEEE ICASSP 96, V1, P5 JACOVITTI G, 1987, IEEE T ACOUST SPEECH, V35, P1126, DOI 10.1109/TASSP.1987.1165253 Kubin G., 1995, SPEECH CODING SYNTHE, P557 Kuncheva L.I., 2004, COMBINING PATTERN CL MARY L, 2004, P ISCA TUT RES WORKS, P323 Nikias C., 1993, HIGHER ORDER SPECTRA NIKIAS CL, 1987, P IEEE, V75, P869, DOI 10.1109/PROC.1987.13824 Ortega-Garcia J, 2000, SPEECH COMMUN, V31, P255, DOI 10.1016/S0167-6393(99)00081-3 PRAKRIYA S, 1985, BIOL CYBERN, V55, P135 Prasanna SRM, 2006, SPEECH COMMUN, V48, P1243, DOI 10.1016/j.specom.2006.06.002 REYNOLDS DA, 1995, IEEE T SPEECH AUDI P, V3, P72, DOI 10.1109/89.365379 SATUE A, 1999, EUROSPEECH, V3, P1231 Sole-Casals J, 2006, NEUROCOMPUTING, V69, P1467, DOI 10.1016/j.neucom.2005.12.023 Sole J, 2002, NEUROCOMPUTING, V48, P339, DOI 10.1016/S0925-2312(01)00651-8 Sole-Casals J, 2005, SIGNAL PROCESS, V85, P1780, DOI 10.1016/j.sigpro.2004.11.030 Taleb A, 1999, IEEE T SIGNAL PROCES, V47, P2807, DOI 10.1109/78.790661 Taleb A, 2001, IEEE T SIGNAL PROCES, V49, P917, DOI 10.1109/78.917796 THEVENAZ P, 1995, SPEECH COMMUN, V17, P145, DOI 10.1016/0167-6393(95)00010-L YEGNANARAYANA B, 2001, P IEEE INT C AC SPEE, P409 ZHENG N, 2006, IEEE SIGNAL PROCESS, V14, P181 NR 35 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2009 VL 51 IS 9 SI SI BP 820 EP 830 DI 10.1016/j.specom.2008.05.009 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 475OZ UT WOS:000268370700010 ER PT J AU Shao, Y Wang, DL AF Shao, Yang Wang, DeLiang TI Sequential organization of speech in computational auditory scene analysis SO SPEECH COMMUNICATION LA English DT Article DE Sequential organization; Computational auditory scene analysis; Speaker quantization; Binary time-frequency mask ID MODELS; RECOGNITION; INTERFERENCE; SEGREGATION; TRACKING AB A human listener has the ability to follow a speaker's voice over time in the presence of other talkers and non-speech interference. This paper proposes a general system for sequential organization of speech based on speaker models. By training a general background model, the proposed system is shown to function well with both interfering talkers and non-speech intrusions. To deal with situations where prior information about specific speakers is not available, a speaker quantization method is employed to extract representative models from a large speaker space and obtained generic models are used to perform sequential grouping. Our systematic evaluations show that grouping performance using generic models is only moderately lower than the performance level achieved with known speaker models. (C) 2009 Elsevier B.V. All rights reserved. C1 [Shao, Yang; Wang, DeLiang] Ohio State Univ, Dept Comp Sci & Engn, Columbus, OH 43210 USA. [Wang, DeLiang] Ohio State Univ, Ctr Cognit Sci, Columbus, OH 43210 USA. RP Shao, Y (reprint author), Ohio State Univ, Dept Comp Sci & Engn, 2015 Neil Ave, Columbus, OH 43210 USA. EM shao.19@osu.edu; dwang@cse.ohio-state.edu FU AFOSR [FA9550-04-1-0117]; NSF [IIS-0081058]; AFRL [FA8750-04-1-0093] FX This research was supported in part by an AFOSR Grant (FA9550-04-1-0117), an NSF Grant (IIS-0081058), and an AFRL Grant (FA8750-04-1-0093). CR Barker JP, 2005, SPEECH COMMUN, V45, P5, DOI 10.1016/j.specom.2004.05.002 BEN M, 2002, P INT C AC SPEECH SI, V1, P689 Bimbot Frederic, 2004, EURASIP J APPL SIG P, V4, P430, DOI [DOI 10.1155/S1110865704310024, 10.1155/S1110865704310024] Bregman AS., 1990, AUDITORY SCENE ANAL Brungart DS, 2001, J ACOUST SOC AM, V109, P1101, DOI 10.1121/1.1345696 CHERRY EC, 1953, J ACOUST SOC AM, V25, P975, DOI 10.1121/1.1907229 Cooke M., 2006, SPEECH SEPARATION RE Deng L, 2005, IEEE T SPEECH AUDI P, V13, P412, DOI 10.1109/TSA.2005.845814 Duda R. O., 2001, PATTERN CLASSIFICATI Dunn RB, 2000, DIGIT SIGNAL PROCESS, V10, P93, DOI 10.1006/dspr.1999.0359 Ellis D.P.W., 2006, COMPUTATIONAL AUDITO, P115 Furui S., 2001, DIGITAL SPEECH PROCE HELMHOLTZ H, 1863, SENSATION TONE Hershey J. R., 2007, P ICASSP, V4, P317 Hu G., 2006, THESIS OHIO STATE U Hu G., 2006, TOPICS ACOUSTIC ECHO, P485 Hu GN, 2008, J ACOUST SOC AM, V124, P1306, DOI 10.1121/1.2939132 Hu GN, 2004, IEEE T NEURAL NETWOR, V15, P1135, DOI 10.1109/TNN.2004.832812 Huang X., 2001, SPOKEN LANGUAGE PROC Kullback S., 1968, INFORM THEORY STAT Kwon S., 2004, P ICSLP, P1517 Kwon S, 2005, IEEE T SPEECH AUDI P, V13, P1004, DOI 10.1109/TSA.2005.851981 Moore BC., 2003, INTRO PSYCHOL HEARIN Patterson R. 
D., 1988, 2341 APU MRC APPL PS PRZYBOCKI MA, 2004, P OD 2004 QUATIERI TF, 1990, IEEE T ACOUST SPEECH, V38, P56, DOI 10.1109/29.45618 Raj B, 2004, SPEECH COMMUN, V43, P275, DOI 10.1016/j.specom.2004.03.007 REYNOLDS DA, 1995, SPEECH COMMUN, V17, P91, DOI 10.1016/0167-6393(95)00009-D Reynolds DA, 2000, DIGIT SIGNAL PROCESS, V10, P19, DOI 10.1006/dspr.1999.0361 Rice J, 1995, MATH STAT DATA ANAL Russell S., 2003, ARTIFICIAL INTELLIGE, V2nd SHAO Y, 2007, P ICASSP, V4, P277 Shao Y., 2007, THESIS OHIO STATE U Shao Y, 2006, IEEE T AUDIO SPEECH, V14, P289, DOI 10.1109/TSA.2005.854106 Silva J, 2006, IEEE T AUDIO SPEECH, V14, P890, DOI 10.1109/TSA.2005.858059 Srinivasan S, 2007, IEEE T AUDIO SPEECH, V15, P2130, DOI 10.1109/TASL.2007.901836 VARGA A, 1993, SPEECH COMMUN, V12, P247, DOI 10.1016/0167-6393(93)90095-3 Vasconcelos N, 2004, IEEE T INFORM THEORY, V50, P1482, DOI 10.1109/TIT.2004.830760 Wang D., 2006, COMPUTATIONAL AUDITO Wang DL, 2005, SPEECH SEPARATION BY HUMANS AND MACHINES, P181, DOI 10.1007/0-387-22794-6_12 Wang D. L., 2006, COMPUTATIONAL AUDITO, P81 NR 41 TC 9 Z9 9 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 2009 VL 51 IS 8 BP 657 EP 667 DI 10.1016/j.specom.2009.02.003 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 465HH UT WOS:000267572900001 ER PT J AU Messing, DP Delhorne, L Bruckert, E Braida, LD Ghitza, O AF Messing, David P. Delhorne, Lorraine Bruckert, Ed Braida, Louis D. Ghitza, Oded TI A non-linear efferent-inspired model of the auditory system; matching human confusions in stationary noise SO SPEECH COMMUNICATION LA English DT Article DE MOC efferent; Speech recognition; Noise robustness; Human confusions; MBPNL model ID CROSSED-OLIVOCOCHLEAR-BUNDLE; MODULATION TRANSFER-FUNCTION; RATE RESPONSES; SPEECH; STIMULATION; RECOGNITION; NEURONS; TONES; INTELLIGIBILITY; DISCRIMINATION AB Current predictors of speech intelligibility are inadequate for understanding and predicting speech confusions caused by acoustic interference. We develop a model of auditory speech processing that includes a phenomenological representation of the action of the Medial Olivocochlear efferent pathway and that is capable of predicting consonant confusions made by normal hearing listeners in speech-shaped Gaussian noise. We then use this model to predict human error patterns of initial consonants in consonant-vowel-consonant words in the context of a Dynamic Rhyme Test. In the process we demonstrate its potential for speech discrimination in noise. Our results produced performance that was robust to varying levels of stationary additive speech-shaped noise and which mimicked human performance in discrimination of synthetic speech as measured by the Chi-squared test. (C) 2009 Elsevier B.V. All rights reserved. C1 [Messing, David P.; Delhorne, Lorraine; Braida, Louis D.] MIT, Dept Elect Engn & Comp Sci, Cambridge, MA 02139 USA. [Bruckert, Ed; Ghitza, Oded] Sensimetrics Corp, Malden, MA USA. RP Messing, DP (reprint author), 11208 Vista Sorrento,Pkwy J306, San Diego, CA 92103 USA. EM dpmessing@gmail.com; delhome@MIT.EDU; Ebruckert@fonix.com; ldbraida@mit.edu; oded.ghitza@gmail.com FU US Air Force Office of Scientific Research [F49620-03-C-0051, FA9550-05-C-0032]; NIH [R01-DC7152] FX This work was sponsored by the US Air Force Office of Scientific Research (contracts F49620-03-C-0051 and FA9550-05-C-0032) and by NIH Grant R01-DC7152. 
CR AINSWORTH W, 2001, EFFECTS NOISE ADAPTA, P371 AINSWORTH WA, 1994, J ACOUST SOC AM, V96, P687, DOI 10.1121/1.410306 *ANSI, 1997, S361997 ANSI ANSI, 1969, S351969 ANSI Cervera T, 2007, ACTA ACUST UNITED AC, V93, P1036 DEWSON JH, 1968, J NEUROPHYSIOL, V31, P122 DOLAN DF, 1988, J ACOUST SOC AM, V83, P1081, DOI 10.1121/1.396052 Dunn HK, 1940, J ACOUST SOC AM, V11, P278, DOI 10.1121/1.1916034 Ferry RT, 2007, J ACOUST SOC AM, V122, P3519, DOI 10.1121/1.2799914 FRENCH NR, 1947, J ACOUST SOC AM, V19, P90, DOI 10.1121/1.1916407 GHITZA O, 2007, HEARING SENSORY PROC GHITZA O, 2004, J ACOUST SOC AM 2, V115 GIFFORD ML, 1983, J ACOUST SOC AM, V74, P115, DOI 10.1121/1.389728 Giraud AL, 1997, NEUROREPORT, V8, P1779 GLASBERG BR, 1990, HEARING RES, V47, P103, DOI 10.1016/0378-5955(90)90170-T GOLDSTEIN JL, 1990, HEARING RES, V49, P39, DOI 10.1016/0378-5955(90)90094-6 Guinan Jr J.J., 1996, COCHLEA, P435 GUMMER M, 1988, HEARING RES, V36, P41, DOI 10.1016/0378-5955(88)90136-0 Hant JJ, 2003, SPEECH COMMUN, V40, P291, DOI 10.1016/S0167-6393(02)00068-7 HOUTGAST T, 1980, ACUSTICA, V46, P60 JOHNSON DH, 1980, J ACOUST SOC AM, V68, P1115, DOI 10.1121/1.384982 KAWASE T, 1993, J NEUROPHYSIOL, V70, P2519 KIANG NYS, 1987, INT COCHL IMPL S DUR LIBERMAN MC, 1986, HEARING RES, V24, P17, DOI 10.1016/0378-5955(86)90003-1 LIBERMAN MC, 1988, J NEUROPHYSIOL, V60, P1779 Lippmann RP, 1997, SPEECH COMMUN, V22, P1, DOI 10.1016/S0167-6393(97)00021-6 MAY BJ, 1992, J NEUROPHYSIOL, V68, P1589 PATTERSON RD, 1995, J ACOUST SOC AM, V98, P1890, DOI 10.1121/1.414456 Scharf B, 1997, HEARING RES, V103, P101, DOI 10.1016/S0378-5955(96)00168-2 Slaney M., 1993, 35 APPL COMP Sroka JJ, 2005, SPEECH COMMUN, V45, P401, DOI 10.1016/j.specom.2004.11.009 Voiers W. D., 1983, Speech Technology, V1 Warr WB, 1978, EVOKED ELECTRICAL AC, P43 WINSLOW RL, 1988, HEARING RES, V35, P165, DOI 10.1016/0378-5955(88)90116-5 Zar J. H., 1999, BIOSTATISTICAL ANAL, V4th Zeng FG, 2000, HEARING RES, V142, P102, DOI 10.1016/S0378-5955(00)00011-3 NR 36 TC 15 Z9 15 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 2009 VL 51 IS 8 BP 668 EP 683 DI 10.1016/j.specom.2009.02.002 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 465HH UT WOS:000267572900002 ER PT J AU Lee, S Iverson, GK AF Lee, Soyoung Iverson, Gregory K. TI Vowel development in English and Korean: Similarities and differences in linguistic and non-linguistic factors SO SPEECH COMMUNICATION LA English DT Article DE Vowel development; Fundamental and formant frequency; Vocal tract length; Korean; English; Cross-linguistic study ID VOCAL-TRACT; VOICES; FREQUENCIES; PATTERNS; SPEAKERS; SPEECH AB This study presents an acoustic examination of vowels produced by English and Korean-speaking male and female children in two age groups, 5 and 10 years. In response to picture cards, a total of 80 children (10 each in eight groups categorized according to age, sex and language) produced tokens of seven vowels which are typically transcribed using the same IPA symbols in both languages. Fundamental as well as first and second formant frequencies were measured. In addition, vocal tract length was estimated on the basis of the F-3 values Of /Lambda/. As expected, these properties differed between the two age groups. 
However, the two non-linguistic elements (fundamental frequency and vocal tract length) were similar between sexes and between languages whereas the linguistic factors (frequencies of the first and second formants) varied as a function of sex and language. Moreover, the F-2 differences between English and Korean-speaking children were similar to those between English and Korean-speaking adults as reported by Yang [Yang, B., 1996. A comparative study of American English and Korean vowels produced by male and female speakers. J. Phonet. 24, 245-261], with children as young as 5 years thus reflecting the formant frequencies of their ambient languages. The results of this study suggest that formant frequency differences between sexes during the pre- and peri-puberty periods can be attributed chiefly to behavioral rather than to anatomical differences. (C) 2009 Elsevier B.V. All rights reserved. C1 [Lee, Soyoung] Univ Wisconsin Milwaukee, Dept Commun Sci & Disorders, Milwaukee, WI 53201 USA. [Iverson, Gregory K.] Univ Wisconsin Milwaukee, Univ Maryland Ctr Adv Study Language, Milwaukee, WI 53201 USA. RP Lee, S (reprint author), Univ Wisconsin Milwaukee, Dept Commun Sci & Disorders, POB 413, Milwaukee, WI 53201 USA. EM lees59@uwm.edu CR Assmann PF, 2000, J ACOUST SOC AM, V108, P1856, DOI 10.1121/1.1289363 Baker W, 2005, LANG SPEECH, V48, P1 BENNETT S, 1981, J ACOUST SOC AM, V69, P231, DOI 10.1121/1.385343 BENNETT S, 1983, J SPEECH HEAR RES, V26, P137 BUSBY PA, 1995, J ACOUST SOC AM, V97, P2603, DOI 10.1121/1.412975 Diehl RL, 1996, J PHONETICS, V24, P187, DOI 10.1006/jpho.1996.0011 Eguchi S, 1969, Acta Otolaryngol Suppl, V257, P1 Fant G, 1975, STL QPSR, V2-3, P1 Fant G., 1970, ACOUSTIC THEORY SPEE Fant G., 1973, SPEECH SOUNDS FEATUR Fitch WT, 1999, J ACOUST SOC AM, V106, P1511, DOI 10.1121/1.427148 Goldstein UG, 1980, ARTICULATORY MODEL V HA H, 2005, THESIS YONSEI U KORE HASEK CS, 1980, J ACOUST SOC AM, V68, P1262, DOI 10.1121/1.385118 Hillenbrand JM, 2001, J ACOUST SOC AM, V109, P748, DOI 10.1121/1.1337959 Kent R., 1997, SPEECH SCI Lee S, 2008, CLIN LINGUIST PHONET, V22, P523, DOI 10.1080/02699200801945120 Lee S, 1999, J ACOUST SOC AM, V105, P1455, DOI 10.1121/1.426686 LINDBLOM B, 1989, J PHONETICS, V17, P107 LINDBLOM B, 1990, NATO ADV SCI I D-BEH, V55, P403 MATTINGLY IG, 1966, 71 M AC SOC AM BOST Nordstrom P.E., 1977, J PHONETICS, V4, P81 NORDSTROM PE, 1975, INT C PHON SCI LEEDS Perry TL, 2001, J ACOUST SOC AM, V109, P2988, DOI 10.1121/1.1370525 PETERSON GE, 1952, J ACOUST SOC AM, V24, P175, DOI 10.1121/1.1906875 Rvachew S, 2006, J ACOUST SOC AM, V120, P2250, DOI 10.1121/1.2266460 Sachs J., 1973, LANGUAGE ATTITUDES C, P74 Sagart L., 1989, J CHILD LANG, V16, P1 Sohn H. -M. -M., 1999, KOREAN LANGUAGE TITZE IR, 1989, J ACOUST SOC AM, V85, P1699, DOI 10.1121/1.397959 TRAUNMULLER H, 1988, PHONETICA, V45, P1 Vorperian HK, 2005, J ACOUST SOC AM, V117, P338, DOI 10.1121/1.1835958 WANG S, 1996, J KOREAN OTOLARYNGOL, V39, P12 Whiteside SP, 2001, J ACOUST SOC AM, V110, P464, DOI 10.1121/1.1379087 Whittam LR, 2000, CLIN EXP DERMATOL, V25, P122 Xue SA, 2006, CLIN LINGUIST PHONET, V20, P691, DOI 10.1080/02699200500297716 Yang BG, 1996, J PHONETICS, V24, P245, DOI 10.1006/jpho.1996.0013 YOON S, 1998, J SPEECH HEAR DISORD, V7, P67 NR 38 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD AUG PY 2009 VL 51 IS 8 BP 684 EP 694 DI 10.1016/j.specom.2009.03.005 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 465HH UT WOS:000267572900003 ER PT J AU Jackson, PJB Singampalli, VD AF Jackson, Philip J. B. Singampalli, Veena D. TI Statistical identification of articulation constraints in the production of speech SO SPEECH COMMUNICATION LA English DT Article DE Critical articulator; Speech production model; Articulatory gesture; Coarticulation ID MODEL; COARTICULATION; RECOGNITION; MOVEMENTS; FEATURES; TONGUE; MRI; HMM AB We present a statistical technique for identifying critical, dependent and redundant roles played by the articulators during production of English phonemes using articulatory (EMA) data. It identifies a list of critical articulators for each phone based on changes in the distribution of articulator positions. The effect of critical articulation on dependent articulators is derived from inter-articulator correlation. Articulators unaffected or not correlated with the critical articulators are regarded as redundant. The technique was implemented on 1D and 2D distributions of midsagittal articulator coordinates, and the results of this data-driven approach are analyzed in comparison with the phonetic descriptions from the IPA chart. The results using the proposed method gave a closer fit to measured data than those estimated from IPA information alone and highlighted significant factors in the phoneme-to-phone transformation. The proposed algorithm was evaluated against an exhaustive search of critical articulators, and found to be as effective as the exhaustive search in modeling phone distributions with the added advantage of faster execution times. The efficiency of the approach in generating a parsimonious yet accurate representation of the observed articulatory constraints is described, and its potential for applications in speech science and technology discussed. (C) 2009 Elsevier B.V. All rights reserved. C1 [Jackson, Philip J. B.; Singampalli, Veena D.] Univ Surrey, Ctr Vis Speech & Signal Proc, Guildford GU2 7XH, Surrey, England. RP Jackson, PJB (reprint author), Univ Surrey, Ctr Vis Speech & Signal Proc, Guildford GU2 7XH, Surrey, England. EM p.jackson@surrey.ac.uk RI Jackson, Philip/E-8422-2013 FU UK by EPSRC [GR/S85511/01] FX This research formed part of the DANSA project funded in the UK by EPSRC (GR/S85511/01), http://personal.ce.surrey.ac.uk/Personal/P.Jackson/Dansa/. CR Anderson TW, 1984, INTRO MULTIVARIATE S Badin P, 2002, J PHONETICS, V30, P533, DOI 10.1006/jpho.2002.0166 Badin P., 2006, P 7 INT SEM SPEECH P, P395 BAKIS R, 1991, P IEEE WORKSH AUT SP, P20 BLACKBURN S, 2000, J ACOUST SOC AM, V103, P1659 BLADON RA, 1976, J PHONETICS, V4, P135 Browman Catherine, 1986, PHONOLOGY YB, V3, P219 Chomsky N., 1968, SOUND PATTERN ENGLIS Cohen M. M., 1993, Models and Techniques in Computer Animation COKER CH, 1976, P IEEE, V64, P452, DOI 10.1109/PROC.1976.10154 Coppin B., 2004, ARTIFICIAL INTELLIGE DANG J, 2004, ACOUSTICAL SCI TECHN, V25, P318, DOI 10.1250/ast.25.318 Dang J.W., 2005, P INT LISB, P1025 Dang JW, 2004, J ACOUST SOC AM, V115, P853, DOI 10.1121/1.1639325 Daniloff R.
G., 1973, J PHONETICS, V1, P239 DENG L, 1994, J ACOUST SOC AM, V95, P2702, DOI 10.1121/1.409839 EIDE E, 2001, P EUR AALB DENM, P1613 Erler K, 1996, J ACOUST SOC AM, V100, P2500, DOI 10.1121/1.417358 FANT G, 1969, DISTINCTIVE FEATURES FRANKEL J, 2001, P EUR AALB DENM, P599, DOI DOI 10.1109/TSA.2005.851910 Frankel J., 2003, THESIS U EDINBURGH Frankel J., 2000, P ICSLP, V4, P254 FRANKEL J, 2004, P INT C SPOK LANG P, P1477 HENKE WL, 1965, THESIS MIT CAMBRIDGE Hershey J. R., 2007, P ICASSP, V4, P317 HOOLE P, 2008, P INT SEM SPEECH PRO, P157 Jackson P., 2008, J ACOUST SOC AM 2, V123, P3321, DOI 10.1121/1.2933798 JACKSON PJB, 2004, GRS8551101 CVSSP EPS JACKSON PJB, 2008, P INT SEM SPEECH PRO, P377 Johnson R. A., 1998, APPL MULTIVARIATE ST Keating P. A., 1988, UCLA WORKING PAPERS, V69, P3 King S, 2007, J ACOUST SOC AM, V121, P723, DOI 10.1121/1.2404622 Kirchhoff K., 1999, THESIS U BIELEFELD KOREMAN J, 1998, P INT C SPOK LANG P, V3, P1035 Kullback S., 1968, INFORM THEORY STAT LIBERMAN AM, 1970, COGNITIVE PSYCHOL, V1, P301, DOI 10.1016/0010-0285(70)90018-6 LINDBLOM B, 1963, J ACOUST SOC AM, V35, P1773, DOI 10.1121/1.1918816 LOFQVIST A, 1990, NATO ADV SCI I D-BEH, V55, P289 MACNEILA.PF, 1970, PSYCHOL REV, V77, P182, DOI 10.1037/h0029070 MAEDA S, 1990, NATO ADV SCI I D-BEH, V55, P131 Meister IG, 2007, CURR BIOL, V17, P1692, DOI 10.1016/j.cub.2007.08.064 MERMELSTEIN P, 1973, J ACOUST SOC AM, V53, P1072 Metze F., 2002, P INT C SPOK LANG PR, P2133 MOLL KL, 1971, J ACOUST SOC AM, V50, P678, DOI 10.1121/1.1912683 OHMAN SEG, 1967, J ACOUST SOC AM, V41, P310 OHMAN SEG, 1966, J ACOUST SOC AM, V39, P151 Ostry DJ, 1996, J NEUROSCI, V16, P1570 PAPCUN G, 1992, J ACOUST SOC AM, V92, P688, DOI 10.1121/1.403994 Parthasarathy V, 2007, J ACOUST SOC AM, V121, P491, DOI 10.1121/1.2363926 QIN C, 2008, P INT BRISB AUSTR, P2306 Recasens D, 1999, J PHONETICS, V27, P143, DOI 10.1006/jpho.1999.0092 Recasens D, 1997, J ACOUST SOC AM, V102, P544, DOI 10.1121/1.419727 RICHARDS H, 1999, INT C ACOUST SPEECH, V1, P357 RICHARDSON M, 2000, P INT C SPOK LANG PR, V3, P131 Richmond K, 2007, P INT ANTW BELG, P2465 RICHMOND K, 2006, P INT PITTSB PA RUSSELL MJ, 2002, P ICSLP DENV CO, P1253 Saltzman E. L., 1989, ECOL PSYCHOL, V1, P333, DOI 10.1207/s15326969eco0104_2 Schroeter J, 1994, IEEE T SPEECH AUDI P, V2, P133, DOI 10.1109/89.260356 Shadle C., 1985, THESIS MIT CAMBRIDGE Singampalli V. D., 2007, P INT ANTW, P70 SOQUET A, 1999, P ICPHS 99 SAN FRANC, P1645 Westbury J.R., 1994, XRAY MICROBEAM SPEEC Wilson SM, 2004, NAT NEUROSCI, V7, P701, DOI 10.1038/nn1263 WRENCH AA, 2001, P I ACOUST, V23, P207 Zen H, 2007, COMPUT SPEECH LANG, V21, P153, DOI 10.1016/j.csl.2006.01.002 NR 66 TC 7 Z9 8 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 2009 VL 51 IS 8 BP 695 EP 710 DI 10.1016/j.specom.2009.03.007 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 465HH UT WOS:000267572900004 ER PT J AU Wutiwiwatchai, C Furui, S AF Wutiwiwatchai, Chai Furui, Sadaoki TI Thai speech processing technology: A review (vol 49, pg 8, 2007) SO SPEECH COMMUNICATION LA English DT Correction C1 [Wutiwiwatchai, Chai] Natl Elect & Comp Technol Ctr, Informat R&D Div, Speech Technol Res Sect, Klongluang 12120, Pathumthani, Thailand. [Furui, Sadaoki] Tokyo Inst Technol, Dept Comp Sci, Meguro Ku, Tokyo 1528552, Japan. 
RP Wutiwiwatchai, C (reprint author), Natl Elect & Comp Technol Ctr, Informat R&D Div, Speech Technol Res Sect, 112 Thailand Sci Pk,Pahonyothin Rd, Klongluang 12120, Pathumthani, Thailand. EM chai@nectec.or.th RI Wutiwiwatchai, Chai/G-5010-2012 CR Wutiwiwatchsi C, 2007, SPEECH COMMUN, V49, P8, DOI 10.1016/j.specom.2006.10.004 NR 1 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 2009 VL 51 IS 8 BP 711 EP 711 DI 10.1016/j.specom.2009.04.003 PG 1 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 465HH UT WOS:000267572900005 ER PT J AU Blomberg, M Elenius, K House, D Karlsson, I AF Blomberg, Mats Elenius, Kjell House, David Karlsson, Inger TI Research Challenges in Speech Technology: A Special Issue in Honour of Rolf Carlson and Bjorn Granstrom SO SPEECH COMMUNICATION LA English DT Editorial Material C1 [Blomberg, Mats; Elenius, Kjell; House, David; Karlsson, Inger] KTH, Stockholm, Sweden. RP Blomberg, M (reprint author), KTH, Stockholm, Sweden. NR 0 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2009 VL 51 IS 7 SI SI BP 563 EP 563 DI 10.1016/j.specom.2009.04.002 PG 1 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 465HG UT WOS:000267572800001 ER PT J AU Fant, G AF Fant, Gunnar TI A personal note from Gunnar Fant SO SPEECH COMMUNICATION LA English DT Editorial Material ID SYNTHESIZER C1 KTH, Dept Speech Mus & Hearing, S-10040 Stockholm, Sweden. RP Fant, G (reprint author), KTH, Dept Speech Mus & Hearing, S-10040 Stockholm, Sweden. EM gunnar@speech.kth.se CR BESKOW J, 2003, THESIS DEP SPEECH MU BLOMBERG M, 1993, P EUR C SPEECH COMM, P1867 Carlson R., 1975, AUDITORY ANAL PERCEP, P55 CARLSON R, 1981, 41981 KTH SPEECH TRA, P29 Carlson R., 1970, 231970 STLQPSR, V2-3, P19 Carlson R, 1989, ICASSP, V1, P223 CARLSON R, 1973, 231973 KTH SPEECH TR, P31 CARLSON R, 1982, REPRESENTATION SPEEC, P109 CARLSON R, 1975, 1 SPEECH TRANSM LAB, P17 CARLSON R, 1990, ADV SPEECH HEARING L FANT G, 1978, RIV ITALIANA ACUSTIC, V2, P69 FANT G, 2004, SPEECH ACOUSTICS PHO FANT G, 1974, P SPEECH COMM SEM ST, V3, P117 FANT G, 1953, IVA ROYAL SWEDISH AC, V2, P331 FANT G, 1962, 21962 STLQPSR, P18 FANT G, 1986, INVARIANCE VARIABILI, P480 FANT G, 1990, SPEECH COMMUN, V9, P171, DOI 10.1016/0167-6393(90)90054-D HOLMES J, 1961, 11961 KTH SPEECH TRA, P10 HUNNICUTT S, 1986, BLISSTALK J AM VOICE, V3 LILJENCRANTS J, 1969, 41969 KTH SPEECH TRA, P43 LILJENCR.JC, 1968, IEEE T ACOUST SPEECH, VAU16, P137, DOI 10.1109/TAU.1968.1161961 STEVENS KN, 1991, J PHONETICS, V19, P161 NR 22 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD JUL PY 2009 VL 51 IS 7 SI SI BP 564 EP 568 DI 10.1016/j.specom.2009.04.001 PG 5 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 465HG UT WOS:000267572800002 ER PT J AU Mariani, J AF Mariani, Joseph TI Research infrastructures for Human Language Technologies: A vision from France SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Seminar on Research Challenges in Speech Technology CY OCT, 2005 CL Stockholm, SWEDEN SP KTH, Dept Speech, Music & Hearing, Ctr Speech Technol DE Research funding; International cooperation; Language technologies; Language resources; Language technology evaluation AB The size of the effort in language science and technology has increased over the years, reflecting the complexity of this research area. The role of Language Resources and Language Technology Evaluation is now recognized as being crucial for the development of written and spoken language processing systems. Several initiatives have been taken in this framework, with a pioneering activity in the US starting in the mid 80s. In France, several programs may be reported, such as the Aupelf-Uref Francil network and the Techno-Langue national program. Despite the size of the effort and the success of those actions, they did not provide yet a permanent infrastructure. To be properly covered, the development of Language Technologies would require a better share of efforts between the European Commission and the Member States to address the major challenge of multilingualism in Europe, and worldwide. (C) 2008 Elsevier B.V. All rights reserved. C1 [Mariani, Joseph] LIMSI, CNRS, F-91403 Orsay, France. [Mariani, Joseph] French Minist Res, F-75005 Paris, France. RP Mariani, J (reprint author), LIMSI, CNRS, BP 133, F-91403 Orsay, France. 
EM joseph.mariani@limsi.fr CR ADDA G, 1998, 1 INT C LANG RES EV ANTOINE JY, 1998, 1 INT C LANG RES EV AYACHE C, 2006, 5 INT C LANG RES EV BONHOMME P, 1997, JST FRANC 1997 P AUP BONNEAUMAYNARD H, 2006, 5 INT C LANG RES EV BUEHLER D, 2006, 5 INT C LANG RES EV *BUILD LANG RES EV, 2004, JOINT COCOSDA WRITE CAMARA E, 1997, JST FRANC 1997 P AUP CARRE R, 1984, IEEE ICASSP C SAN DI CHAUDIRON S, 2006, 5 INT C LANG RES EV CHIAO Y, 2006, 5 INT C LANG RES EV CIERI C, 2006, 5 INT C LANG RES EV COLE R, 1997, SURVEY STATE ART HUM, pCH12 DAILLE B, 1997, JST FRANC 1997 P AUP DEBILI F, 1997, JST FRANC 1997 P AUP DEMAREUIL PB, 1998, 1 INT C LANG RES EV DEMAREUIL PB, 2006, 5 INT C LANG RES EV DENIS A, 2006, 5 INT C LANG RES EV DEVILLERS L, 2004, P 4 INT C LANG RES E DOLMAZON JM, 1997, JST FRANC 1997 P AUP ELHADI WM, 2006, 5 INT C LANG RES EV ELHADI WM, 2004, P 4 INT C LANG RES E ELHADI WM, 2004, COLING 29004 GARCIA M, 2006, 5 INT C LANG RES EV GILLARD L, 2006, 5 INT C LANG RES EV GRAU B, 2006, 5 INT C LANG RES EV GRAVIER G, 2004, P 4 INT C LANG RES E HAMON O, 2006, 5 INT C LANG RES EV HIRSCHMAN L, 1997, SURVEY STATE OF THE HOVY E, 1999, MULTILINGUAL INFORM JARDINO M, 1998, 1 INT C LANG RES EV LABED L, 1997, JST FRANC 1997 P AUP LANDI B, 1998, 1 INT C LANG RES EV LANGLAIS P, 1998, 1 INT C LANG RES EV LAZZARI G, 2006, HUMAN LANGUAGE TECHN MAPELLI V, 2004, P 4 INT C LANG RES E MARIANI J, 2002, ARE WE LOOSING GROUN MARIANI J, 1997, SURVEY STATE ART HUM, pCH13 MARIANI J, 1995, COC WORKSH SEPT 22 1 MAUCLAIR J, 2006, 5 INT C LANG RES EV MCTAIT K, 2004, TALN JEP RECITAL C 2 MOSTEFA D, 2006, 5 INT C LANG RES EV MUSTAFA W, 1998, 1 INT C LANG RES EV NAVA M, 2004, TALN JEP RECITAL 200 PAPINENI K, 2002, P 3 INT C LANG RES E PAROUBEK P, 2006, 5 INT C LANG RES EV PAROUBEK P, 2000, 2 INT C LANG RES EV PIERRE J, 2007, LANGUE COEUR NUMERIQ ROSSET S, 1997, JST FRANC 1997 P AUP SABATIER P, 1997, JST FRANC 1997 P AUP VANRULLEN T, 2006, 5 INT C LANG RES EV VILNAT A, 2004, P 4 INT C LANG RES E WHITE J, 1999, MULTILINGUAL INFORM WITTENBURG P, 2006, 5 INT C LANG RES EV, V6 ZAMPOLLI A, 1999, MULTILINGUAL INFORM NR 55 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2009 VL 51 IS 7 SI SI BP 569 EP 584 DI 10.1016/j.specom.2007.12.001 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 465HG UT WOS:000267572800003 ER PT J AU Greenberg, Y Shibuya, N Tsuzaki, M Kato, H Sagisaka, Y AF Greenberg, Yoko Shibuya, Nagisa Tsuzaki, Minoru Kato, Hiroaki Sagisaka, Yoshinori TI Analysis on paralinguistic prosody control in perceptual impression space using multiple dimensional scaling SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Seminar on Research Challenges in Speech Technology CY OCT, 2005 CL Stockholm, SWEDEN SP KTH, Dept Speech, Music & Hearing, Ctr Speech Technol DE Paralinguistic prosody; Nonverbal information; Fundamental frequency control; Communicative speech synthesis AB A multi-dimensional perceptual space for communicative speech prosodies was derived using a psychometric method from multidimensional expressions of impressions to characterize paralinguistic information conveyed by prosody in communication. Single word utterances of "n" were employed to allow freedom from lexical effects and to cover communicative prosodic variations as much as possible. 
The analysis of daily conversations showed that conversational speech impressions were manifested in the global F0 control of "n" as differences of average height (high-low) and dynamic patterns (rise, fall, gradual fall, and rise&fall). Using controlled single utterances of "n", multiple dimensional scaling analysis was applied to a mutual distance matrix obtained by 26 dimensional vectors expressing perceptual impressions. The result showed the three-dimensional structure of a perceptual impression space, and each dimension corresponded to different F0 control characteristics. The positive-negative impression can be controlled by average F0 height while confident-doubtful or allowable-unacceptable impressions can be controlled by F0 dynamic patterns. Unlike conventional categorical classification of prosodic patterns frequently observed in studies of emotional prosody, this control characterization enables us to flexibly and quantitatively describe prosodic impressions. These experimental results allow the possibility of input specifications for communicative prosody generation using impression vectors and control through average F0 height and F0 dynamic patterns. Instead of the generation of speech with categorical prototypical prosody, more adequate communicative speech synthesis can be approached through input specification and its correspondence with control characteristics. (C) 2007 Elsevier B.V. All rights reserved. C1 [Greenberg, Yoko; Shibuya, Nagisa; Sagisaka, Yoshinori] Waseda Univ, GITI, Shinjuku Ku, Tokyo 1690051, Japan. [Tsuzaki, Minoru] Kyoto City Univ Arts, Nishikyo Ku, Kyoto 6101197, Japan. [Kato, Hiroaki] Natl Inst Informat & Commun Technol, ATR Cognit Informat Sci Lab, Seika, Kyoto 6190288, Japan. RP Greenberg, Y (reprint author), Waseda Univ, GITI, Shinjuku Ku, 1-3-10 Nishi Waseda, Tokyo 1690051, Japan. EM Yoko.Kokenawa@toki.waseda.jp CR AUDIBERT N, 2005, P 9 EUR C SPEECH COM, P525 Campbell N., 2004, J PHONET SOC JPN, V8, P9 Campbell Nick, 2004, P 2 INT C SPEECH PRO, P217 *CTT, 2005, WAVESURFER Fujisaka H., 1984, Journal of the Acoustical Society of Japan (E), V5 GREENBERG Y, 2006, P SPEECH PROS 2006 ISHI TC, 2005, P SPEECH PROS 2005 Maekawa K., 2004, P SPEECH PROS 2004, V2004, P367 TORGERSON WS, 1952, PSYCHOMETRIKA, V17, P401 NR 9 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2009 VL 51 IS 7 SI SI BP 585 EP 593 DI 10.1016/j.specom.2007.10.006 PG 9 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 465HG UT WOS:000267572800004 ER PT J AU Gronnum, N AF Gronnum, Nina TI A Danish phonetically annotated spontaneous speech corpus (DanPASS) SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Seminar on Research Challenges in Speech Technology CY OCT, 2005 CL Stockholm, SWEDEN SP KTH, Dept Speech, Music & Hearing, Ctr Speech Technol DE Monologue; Dialogue; Spontaneous speech; Corpus; Phonetic notation; Prosodic labeling ID MAP TASK AB A corpus is described consisting of non-scripted monologues and dialogues, recorded by 27 speakers, comprising a total of 73,227 running words, corresponding to 9 h and 46 min of speech. 
The monologues were recorded as one-way communication with an unseen partner where the speaker performed three different tasks: (s)he described a network consisting of various geometrical shapes in various colours, (s)he guided the listener through four different routes in a virtual city map, and (s)he instructed the listener how to build a house from its individual pieces. The dialogues are replicas of the HCRC map tasks. Annotation is performed in Praat. The sound files are segmented into prosodic phrases, words, and syllables. The files are supplied, in separate interval tiers, with an orthographical representation, detailed part-of-speech tags, simplified part-of-speech tags, a phonemic notation, a semi-narrow phonetic notation, a symbolic representation of the pitch relation between each stressed and post-tonic syllable, and a symbolic representation of the phrasal intonation. (C) 2008 Elsevier B.V. All rights reserved. C1 Univ Copenhagen, Dept Scandinavian Studies & Linguist, Linguist Lab, DK-2300 Copenhagen, Denmark. RP Gronnum, N (reprint author), Univ Copenhagen, Dept Scandinavian Studies & Linguist, Linguist Lab, 120 Njalsgade, DK-2300 Copenhagen, Denmark. EM ninag@hum.ku.dk CR ANDERSON AH, 1991, LANG SPEECH, V34, P351 Boersma P., 2001, GLOT INT, V5, P341 Boersma P., 2006, PRAAT DOING PHONETIC BROWN G, 1984, TEACHING TALK DAU T, 2007, CARLSBERGFONDET ARSS, P44 DYRBY M, 2005, M PROGRAM CIRC NOV 1 Fletcher J, 2002, LANG SPEECH, V45, P229 Horiuchi Y., 1999, Journal of Japanese Society for Artificial Intelligence, V14 GRONNUM N, 1995, STOCKHOLM 1995, V2, P124 GRONNUM N, 2006, P 5 INT C LANG RES E GRONNUM N, 2007, P 16 INT C PHON SCI, P1229 GRONNUM N, 2005, FONETIK FONOLOGI GRONNUM N, 1985, J ACOUST SOC AM, V77, P1205 GRONNUM N, 1990, PHONETICA, V47, P182 GRONNUM N, 1986, J ACOUST SOC AM, V80, P1040 HELGASON P, 2006, WORKING PAPERS, V52, P57 HENRICHSEN P, 2002, LAMBDA I DATALINGVIS, V27 Jensen C., 2005, P INT 2005 LISB, P2385 Kohler K., 2007, EXPT APPROACHES PHON, P41 KOHLER KJ, 2006, METHODS EMPIRICAL PR, V3, P123 Kohler KJ, 2005, PHONETICA, V62, P88, DOI 10.1159/000090091 PAGGIO P, 2006, P 5 INT C LANG RES E Silverman K., 1992, P INT C SPOK LANG PR, P867 Sole, 2007, EXPT APPROACHES PHON, P192 SWERTS M, 1994, PROSODIC FEATURES DI SWERTS M, 1992, SPEECH COMMUN, V121, P463 TERKEN JMB, 1984, LANG SPEECH, V27, P269 TONDERING J, 2008, THESIS U COPENHAGEN NR 28 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2009 VL 51 IS 7 SI SI BP 594 EP 603 DI 10.1016/j.specom.2008.11.002 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 465HG UT WOS:000267572800005 ER PT J AU Massaro, DW Jesse, A AF Massaro, Dominic W. Jesse, Alexandra TI Read my lips: speech distortions in musical lyrics can be overcome (slightly) by facial information SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Seminar on Research Challenges in Speech Technology CY OCT, 2005 CL Stockholm, SWEDEN SP KTH, Dept Speech, Music & Hearing, Ctr Speech Technol DE Perception of speech; Perception of music; Audiovisual speech perception; Singing ID VOICE ONSET TIME; SUNG VOWELS; VISUAL-PERCEPTION; FUNDAMENTAL-FREQUENCY; PHRASE STRUCTURE; PERFORMANCE; PROSODY; PITCH; ENGLISH; SINGERS AB Understanding the lyrics of many contemporary songs is difficult, and an earlier study [Hidalgo-Barnes, M., Massaro, D.W., 2007. 
Read my lips: an animated face helps communicate musical lyrics. Psychomusicology 19, 3-12] showed a benefit for lyrics recognition when seeing a computer-animated talking head (Baldi (R)) mouthing the lyrics along with hearing the singer. However, the contribution of visual information was relatively small compared to what is usually found for speech. In the current experiments, our goal was to determine why the face appears to contribute less when aligned with sung lyrics than when aligned with normal speech presented in noise. The first experiment compared the contribution of the talking head with the originally sung lyrics versus the case when it was aligned with the Festival text-to-speech synthesis (TtS) spoken at the original duration of the song's lyrics. A small and similar influence of the face was found in both conditions. In the three experiments, we compared the presence of the face when the durations of the TtS were equated with the duration of the original musical lyrics to the case when the lyrics were read with typical TtS durations and this speech embedded in noise. The results indicated that the unusual temporally distorted durations of musical lyrics decreases the contribution of the visible speech from the face. (C) 2008 Elsevier B.V. All rights reserved. C1 [Massaro, Dominic W.; Jesse, Alexandra] Univ Calif Santa Cruz, Dept Psychol, Santa Cruz, CA 95064 USA. RP Jesse, A (reprint author), Max Planck Inst Psycholinguist, POB 310, NL-6500 AH Nijmegen, Netherlands. EM massaro@ucsc.edu; Alexandra.Jesse@mpi.nl CR AUER ET, 2004, J ACOUST SOC AM, V116, P2644 Austin SF, 2007, J VOICE, V21, P72, DOI 10.1016/j.jvoice.2005.08.013 BENOLKEN MS, 1990, J ACOUST SOC AM, V87, P1781, DOI 10.1121/1.399426 BERNSTEIN LE, 1989, J ACOUST SOC AM, V85, P397, DOI 10.1121/1.397690 Burnham D., 2001, P INT C AUD VIS SPEE, P155 Calvert GA, 1997, SCIENCE, V276, P593, DOI 10.1126/science.276.5312.593 Cave C., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. 
No.96TH8206), DOI 10.1109/ICSLP.1996.607235 Chiappe P, 1997, PSYCHON B REV, V4, P254, DOI 10.3758/BF03209402 Clarke E., 1987, PSYCHOL MUSIC, V15, P58, DOI 10.1177/0305735687151005 CLARKE EF, 1993, MUSIC PERCEPT, V10, P317 Clarke EF, 1985, MUSICAL STRUCTURE CO, P209 Stone RE, 2003, J VOICE, V17, P283, DOI 10.1067/S0892-1997(03)00074-2 Cleveland TE, 2001, J VOICE, V15, P54, DOI 10.1016/S0892-1997(01)00006-6 CLEVELAND TF, 1994, J VOICE, V8, P18, DOI 10.1016/S0892-1997(05)80315-7 CUDDY LL, 1981, J EXP PSYCHOL HUMAN, V7, P869, DOI 10.1037//0096-1523.7.4.869 Dahl S, 2007, MUSIC PERCEPT, V24, P433, DOI 10.1525/MP.2007.24.5.433 de Gelder B, 2000, COGNITION EMOTION, V14, P289 DICARLO NS, 2004, SEMIOTICA, V149, P47 DICARLO NS, 1985, PHONETICA, V42, P188 Dohen M, 2005, P AVSP, P115 Dohen M, 2004, SPEECH COMMUN, V44, P155, DOI 10.1016/j.specom.2004.10.009 Ellison JW, 1997, J EXP PSYCHOL HUMAN, V23, P213, DOI 10.1037/0096-1523.23.1.213 FISHER CG, 1969, J SPEECH HEAR RES, V12, P379 Fougeron C, 1997, J ACOUST SOC AM, V101, P3728, DOI 10.1121/1.418332 FROMKIN VA, 1971, LANGUAGE, V47, P27, DOI 10.2307/412187 FRY DB, 1955, J ACOUST SOC AM, V27, P765, DOI 10.1121/1.1908022 Granstrom B., 1999, P INT C PHON SCI ICP, P655 Granstrom B, 2005, SPEECH COMMUN, V46, P473, DOI 10.1016/j.specom.2005.02.017 Gregg JW, 2006, J VOICE, V20, P198, DOI 10.1016/j.jvoice.2005.01.007 GREGORY AH, 1978, PERCEPT PSYCHOPHYS, V24, P171, DOI 10.3758/BF03199545 Hasegawa T, 2004, COGNITIVE BRAIN RES, V20, P510, DOI 10.1016/j.cogbrainres.2004.04.005 HIDALGO-BARNES M., 2007, PSYCHOMUSICOLOGY, V19, P3, DOI [10.1037/h0094037, DOI 10.1037/H0094037] HNATHCHISOLM T, 1988, EAR HEARING, V9, P329, DOI 10.1097/00003446-198812000-00009 Hollien H, 2000, J VOICE, V14, P287, DOI 10.1016/S0892-1997(00)80038-7 House D., 2001, P EUR 2001, P387 HOUSE D, 2002, TMH QPSR FONETIK, V44, P41 Huron D, 2003, MUSIC PERCEPT, V21, P267, DOI 10.1525/mp.2003.21.2.267 Jackendoff R, 2006, COGNITION, V100, P33, DOI 10.1016/j.cognition.2005.11.005 Jesse A., 2000, INTERPRETING, V5, P95, DOI 10.1075/intp.5.2.04jes JUSCZYK PW, 1993, J EXP PSYCHOL HUMAN, V19, P627, DOI 10.1037/0096-1523.19.3.627 Juslin PN, 2003, PSYCHOL BULL, V129, P770, DOI 10.1037/0033-2909.129.5.770 Keating P., 2003, P 16 INT C PHON SCI, P2071 Krumhansl C. L., 1997, MUSIC SCI, V1, P63 KRUMHANSL CL, 1990, PSYCHOL SCI, V1, P70, DOI 10.1111/j.1467-9280.1990.tb00070.x Lansing CR, 1999, J SPEECH LANG HEAR R, V42, P526 LARGE EW, 1995, COGNITIVE SCI, V19, P53, DOI 10.1207/s15516709cog1901_2 Lerdahl F., 1983, GENERATIVE THEORY TO LISKER L, 1986, LANG SPEECH, V29, P3 Lundy DS, 2000, J VOICE, V14, P490, DOI 10.1016/S0892-1997(00)80006-5 Massaro D. W., 1998, PERCEIVING TALKING F Massaro D. 
W., 1987, SPEECH PERCEPTION EA Massaro DW, 2002, TEXT SPEECH LANG TEC, V19, P45 Massaro DW, 1998, AM SCI, V86, P236, DOI 10.1511/1998.25.861 Massaro DW, 1996, PSYCHON B REV, V3, P215, DOI 10.3758/BF03212421 Massaro DW, 1999, J SPEECH LANG HEAR R, V42, P21 McCrea CR, 2005, J SPEECH LANG HEAR R, V48, P1013, DOI 10.1044/1092-4388(2005/069) McCrea CR, 2007, J VOICE, V21, P54, DOI 10.1016/j.jvoice.2005.05.002 McCrea CR, 2005, J VOICE, V19, P420, DOI 10.1016/j.jvoice.2004.08.002 MILLER GA, 1955, J ACOUST SOC AM, V27, P338, DOI 10.1121/1.1907526 Mixdorff H., 2005, P AUD VIS SPEECH PRO, P3 Munhall KG, 2004, PSYCHOL SCI, V15, P133, DOI 10.1111/j.0963-7214.2004.01502010.x Neuhaus C, 2006, J COGNITIVE NEUROSCI, V18, P472, DOI 10.1162/089892906775990642 Nicholson KG, 2003, BRAIN COGNITION, V52, P382, DOI 10.1016/S0278-2626(03)00182-9 Omori K, 1996, J VOICE, V10, P228, DOI 10.1016/S0892-1997(96)80003-8 Ouni S, 2007, EURASIP J AUDIO SPEE, DOI 10.1155/2007/47891 PALMER C, 1995, J EXP PSYCHOL HUMAN, V21, P947, DOI 10.1037//0096-1523.21.5.947 PALMER C, 1987, J EXP PSYCHOL HUMAN, V13, P116, DOI 10.1037//0096-1523.13.1.116 PALMER C, 1992, J MEM LANG, V31, P525, DOI 10.1016/0749-596X(92)90027-U PALMER C, 1990, J EXP PSYCHOL HUMAN, V16, P728, DOI 10.1037//0096-1523.16.4.728 PALMER C, 1992, COGNITIVE BASES OF MUSICAL COMMUNICATION, P249, DOI 10.1037/10104-014 Palmer C, 2006, PSYCHOL LEARN MOTIV, V46, P245, DOI 10.1016/S0079-7421(06)46007-2 PALMER C, 1989, J EXP PSYCHOL HUMAN, V15, P331, DOI 10.1037/0096-1523.15.2.331 Patel AD, 2003, COGNITION, V87, pB35, DOI 10.1016/S0010-0277(02)00187-7 Patel AD, 2003, MUSIC PERCEPT, V21, P273, DOI 10.1525/mp.2003.21.2.273 Patel AD, 2006, J ACOUST SOC AM, V119, P3034, DOI 10.1121/1.2179657 Penel A, 2004, PERCEPT PSYCHOPHYS, V66, P545, DOI 10.3758/BF03194900 *PRIM, 1993, PRESSM ALB PORK SOD REPP BH, 1995, PERCEPT PSYCHOPHYS, V57, P1217, DOI 10.3758/BF03208378 Repp BH, 1998, J EXP PSYCHOL HUMAN, V24, P791, DOI 10.1037//0096-1523.24.3.791 REPP BH, 1992, COGNITION, V44, P241, DOI 10.1016/0010-0277(92)90003-Z Risberg A., 1978, STL QPSR, V4, P1 ROSSING TD, 1986, J ACOUST SOC AM, V79, P1975, DOI 10.1121/1.393205 SALDANA HM, 1993, PERCEPT PSYCHOPHYS, V54, P406, DOI 10.3758/BF03205276 SCHMUCKLER MA, 1989, MUSIC PERCEPT, V7, P109 SLOBODA JA, 1980, CAN J PSYCHOL, V34, P274, DOI 10.1037/h0081052 SLOBODA JA, 1983, Q J EXP PSYCHOL-A, V35, P377 SMITH GP, 2003, ELT J, V57, P113, DOI 10.1093/elt/57.2.113 SMITH LA, 1980, J ACOUST SOC AM, V67, P1795, DOI 10.1121/1.384308 Srinivasan RJ, 2003, LANG SPEECH, V46, P1 SUMBY WH, 1954, J ACOUST SOC AM, V26, P212, DOI 10.1121/1.1907309 Summerfield Q, 1987, HEARING EYE PSYCHOL, P3 Sundberg J, 2003, TMH QPSR SPEECH MUSI, V45, P11 Sundberg J, 1997, J VOICE, V11, P301, DOI 10.1016/S0892-1997(97)80008-2 SUNDBERG J, 1974, J ACOUST SOC AM, V55, P838, DOI 10.1121/1.1914609 SUNDBERG J, 1982, PSYCHOL MUSIC, P59 Swerts M, 2005, J MEM LANG, V53, P81, DOI 10.1016/j.jml.2005.02.003 SWERTS M, 2004, P SPEECH PROS TAN N, 1981, MEM COGNITION, V9, P533, DOI 10.3758/BF03202347 Thompson DM, 1934, J GEN PSYCHOL, V11, P160 TITZE IR, 1992, J ACOUST SOC AM, V91, P2936, DOI 10.1121/1.402929 TODD NPM, 1995, J ACOUST SOC AM, V97, P1940, DOI 10.1121/1.412067 Trainor LJ, 2000, PERCEPT PSYCHOPHYS, V62, P333, DOI 10.3758/BF03205553 Vatakis A, 2006, NEUROSCI LETT, V393, P40, DOI 10.1016/j.neulet.2005.09.032 Vines BW, 2006, COGNITION, V101, P80, DOI 10.1016/j.cognition.2005.09.003 Yehia HC, 2002, J PHONETICS, V30, P555, DOI 10.1006/jpho.2002.0165 NR 105 TC 1 Z9 1 PU ELSEVIER SCIENCE BV 
PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2009 VL 51 IS 7 SI SI BP 604 EP 621 DI 10.1016/j.specom.2008.05.013 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 465HG UT WOS:000267572800006 ER PT J AU Lindblom, B Diehl, R Creeger, C AF Lindblom, Bjorn Diehl, Randy Creeger, Carl TI Do 'Dominant Frequencies' explain the listener's response to formant and spectrum shape variations? SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Seminar on Research Challenges in Speech Technology CY OCT, 2005 CL Stockholm, SWEDEN SP KTH, Dept Speech, Music & Hearing, Ctr Speech Technol DE Vowel quality perception; Auditory representation; Dominant Frequency AB Psychoacoustic experimentation shows that formant frequency shifts can give rise to more significant changes in phonetic vowel timber than differences in overall level, bandwidth, spectral tilt, and formant amplitudes. Carlson and Granstrom's perceptual and computational findings suggest that, in addition to spectral representations, the human ear uses temporal information on formant periodicities ('Dominant Frequencies') in building vowel timber percepts. The availability of such temporal coding in the cat's auditory nerve fibers has been demonstrated in numerous physiological investigations undertaken during recent decades. In this paper we explore, and provide further support for, the Dominant Frequency hypothesis using KONVERT, a computational auditory model. KONVERT provides auditory excitation patterns for vowels by performing a critical-band analysis. It simulates phase locking in auditory neurons and outputs DF histograms. The modeling supports the assumption that listeners judge phonetic distance among vowels on the basis of formant frequency differences as determined primarily by a time-based analysis. However, when instructed to judge psychophysical distance among vowels, they can also use spectral differences such as formant bandwidth, formant amplitudes and spectral tilt. Although there has been considerable debate among psychoacousticians about the functional role of phase locking in monaural hearing, the present research suggests that detailed temporal information may nonetheless play a significant role in speech perception. (C) 2008 Elsevier B.V. All rights reserved. C1 [Lindblom, Bjorn] Stockholm Univ, Dept Linguist, S-10691 Stockholm, Sweden. [Diehl, Randy; Creeger, Carl] Univ Texas Austin, Dept Psychol, Austin, TX 78712 USA. RP Lindblom, B (reprint author), Stockholm Univ, Dept Linguist, S-10691 Stockholm, Sweden. EM lindblom@ling.su.se; diehl@psy.utexas.edu; creeger@psy.utexas.edu CR ASSMANN PF, 1989, J ACOUST SOC AM, V85, P327, DOI 10.1121/1.397684 BLADON RAW, 1981, J ACOUST SOC AM, V69, P1414, DOI 10.1121/1.385824 BLOMBERG M, 1984, ACOUSTICS SPEECH SIG, P33 Carlson R., 1970, STL QPSR, V2, P19 CARLSON R, 1975, AUDITORY ANAL PERCEP CARLSON R, 1979, STL QPSR, V3, P3 CARLSON R, 1979, STL QPSR, V20, P84 CARLSON R, 1982, REPRESENTATION SPEEC, P109 CARLSON R, 1976, STL QPSR, V17, P1 DELGUTTE B, 1984, J ACOUST SOC AM, V75, P866, DOI 10.1121/1.390596 Diehl RL, 1996, J PHONETICS, V24, P187, DOI 10.1006/jpho.1996.0011 Fant G., 1960, ACOUSTIC THEORY SPEE Fant G, 1972, STL QPSR, V2, P28 FANT G, 1963, J ACOUST SOC AM, V35, P1753, DOI 10.1121/1.1918812 Fletcher H, 1933, J ACOUST SOC AM, V5, P82, DOI 10.1121/1.1915637 GREENBERG S, 1998, J PHONETICS, V16, P3 Klatt D.
H., 1982, Proceedings of ICASSP 82. IEEE International Conference on Acoustics, Speech and Signal Processing Klatt D. H, 1982, REPRESENTATION SPEEC, P181 LILJENCR.J, 1972, LANGUAGE, V48, P839, DOI 10.2307/411991 Lindblom B., 1986, EXPT PHONOLOGY, P13 MOORE BCJ, 1983, J ACOUST SOC AM, V74, P750, DOI 10.1121/1.389861 PLOMP R, 1970, FREQUENCY ANAL PERIO ROSE JE, 1967, J NEUROPHYSIOL, V32, P402 SACHS MB, 1982, REPRESENTATION SPEEC, P115 Schroeder M. R., 1979, FRONTIERS SPEECH COM, P217 VIEMEISTER NF, 2002, GENETICS FUNCTION AU, P273 Zwicker E., 1967, OHR ALS NACHRICHTENE NR 27 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2009 VL 51 IS 7 SI SI BP 622 EP 629 DI 10.1016/j.specom.2008.12.003 PG 8 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 465HG UT WOS:000267572800007 ER PT J AU Pelachaud, C AF Pelachaud, Catherine TI Studies on gesture expressivity for a virtual agent SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Seminar on Research Challenges in Speech Technology CY OCT, 2005 CL Stockholm, SWEDEN SP KTH, Dept Speech, Music & Hearing, Ctr Speech Technol DE Nonverbal behavior; Embodied conversational agents; Expressivity ID EMOTION; FACE AB Our aim is to create an affective embodied conversational agent (ECA); that is an ECA able to display communicative and emotional signals. Nonverbal communication is done through certain facial expressions, gesture shapes, gaze direction, etc. But it can also carry a qualitative aspect through behavior expressivity: how a facial expression, a gesture is executed. In this paper we describe some of the work we have conducted on behavior expressivity, more particularly on gesture expressivity. We have developed a model of behavior expressivity using a set of six parameters that act as modulation of behavior animation. Expressivity may act at different levels of the behavior: on a particular phase of the behavior, on the whole behavior and on a sequence of behaviors. When applied at these different levels, expressivity may convey different functions. (C) 2008 Published by Elsevier B.V. C1 Univ Paris 08, INRIA Paris Rocquencourt, IUT Montreuil, Paris, France. RP Pelachaud, C (reprint author), Univ Paris 08, INRIA Paris Rocquencourt, IUT Montreuil, Paris, France. 
EM pelachaud@iut-univ.paris8.fr CR Banse R, 1996, J PERS SOC PSYCHOL, V70, P614, DOI 10.1037/0022-3514.70.3.614 BANZIGER T, 2006, WORKSH CORP RES EM A BASSILI JN, 1979, J PERS SOC PSYCHOL, V37, P2049, DOI 10.1037//0022-3514.37.11.2049 Bavelas JB, 2000, J LANG SOC PSYCHOL, V19, P163, DOI 10.1177/0261927X00019002001 BECKER C, 2006, INT WORKSH EM COMP C, P31 Bui T, 2004, COMPUTER GRAPHICS INTERNATIONAL, PROCEEDINGS, P284 BUISINE S, 2006, 6 INT C INT VIRT AG BUISINE S, 2004, BROWS TRUST EVALUATI CARIDAKIS G, 2006, WORKSH MULT CORP MUL Cassell J, 2007, CONVERSATIONAL INFOR, P133 CASSELL J, 2001, ANN C SERIES Cassell J., 1999, CHI 99, P520 CHAFAI NE, 2007, INT J LANGUAGE RESOU CHAFAI NE, 2006, LREC 2006 WORKSH MUL CHAFAI NE, 2006, 6 INT C INT VIRT AG Chi D, 2000, COMP GRAPH, P173 DECAROLIS B, 2004, LIFE LIKE CHARACTERS, P65 DEVILLERS L, 2005, 1 INT C AFF COMP INT Ekman P, 2003, EMOTIONS REVEALED Ekman P, 1975, UNMASKING FACE GUIDE FEYEREISEN P, 1997, GESTE COGNITION COMM, P20 GALLAHER PE, 1992, J PERS SOC PSYCHOL, V63, P133, DOI 10.1037/0022-3514.63.1.133 HARTMANN B, 2005, 3 INT JOINT C AUT AG HARTMANN B, 2006, LNCS LNAI KAISER S, 2006, P 19 INT C COMP AN S Kendon A., 2004, GESTURE VISIBLE ACTI Kendon A., 1990, CONDUCTING INTERACTI Kopp S, 2004, COMPUT ANIMAT VIRT W, V15, P39, DOI 10.1002/cav.6 KRAHMER E, 2004, BROWS TRUST EVALUATI LIU K, 2005, ANN C SERIES LUNDEBERG M, 1999, P ESCA WORKSH AUD VI MANCINI M, 2007, GEST WORKSH LISB MAY MARTIN JC, 2005, 1 INT C AFF COMP INT MARTIN JC, 2006, INT J HUM ROBOT, V20, P477 McNeill D., 1992, HAND MIND WHAT GESTU Menardais S., 2004, P ACM SIGGRAPH EUR S, P325, DOI 10.1145/1028523.1028567 NEFF M, 2004, P 2004 ACM SIGGRAPH, P49, DOI 10.1145/1028523.1028531 NIEWIADOMSKI R, 2007, 2 INT C AFF COMP INT Pandzic I.S., 2002, MPEG 4 FACIAL ANIMAT PELACHAUD C, 2005, ACM MULTIMEDIA BRAVE Peters C, 2006, INT J HUM ROBOT, V3, P321, DOI 10.1142/S0219843606000783 POGGI I, 1998, P 6 INT PRAGM C REIM Poggi I, 2002, GESTURE, V2, P71, DOI 10.1075/gest.2.1.05pog POGGI I, 2003, GESTURES MEANING USE POSNER R, 2003, GESTURES MEANING USE PRILLWITZ S, 1989, INT STUDIES SIGN LAN, V5 Raouzaiou A, 2002, EURASIP J APPL SIG P, V2002, P1021, DOI 10.1155/S1110865702206149 RAPANTZIKOS K, 2005, INT WORKSH CONT BAS Ruttkay Z, 2003, COMPUT GRAPH FORUM, V22, P49, DOI 10.1111/1467-8659.t01-1-00645 RUTTKAY Z, 2003, P AAMAS03 WS EMB CON Sidner C. L., 2004, IUI 04, P78, DOI DOI 10.1145/964442.964458 STONE M, 2004, ANN C SERIES STONE M, 2003, COMPUTER ANIMATION S TERZOPOULOS D, 1993, IEEE T PATTERN ANAL, V15, P569, DOI 10.1109/34.216726 Thomas F, 1981, DISNEY ANIMATION ILL Wallbott HG, 1998, EUR J SOC PSYCHOL, V28, P879, DOI 10.1002/(SICI)1099-0992(1998110)28:6<879::AID-EJSP901>3.0.CO;2-W WALLBOTT HG, 1986, J PERSONALITY SOCIAL, V24 WALLBOTT HG, 1985, J CLIN PSYCHOL, V41 NR 58 TC 17 Z9 17 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD JUL PY 2009 VL 51 IS 7 SI SI BP 630 EP 639 DI 10.1016/j.specom.2008.04.009 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 465HG UT WOS:000267572800008 ER PT J AU Rosenberg, A Hirschberg, J AF Rosenberg, Andrew Hirschberg, Julia TI Charisma perception from text and speech SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Seminar on Research Challenges in Speech Technology CY OCT, 2005 CL Stockholm, SWEDEN SP KTH, Dept Speech, Music & Hearing, Ctr Speech Technol DE Charismatic speech; Paralinguistic analysis; Political speech AB Charisma, the ability to attract and retain followers without benefit of formal authority, is more difficult to define than to identify. While we each seem able to identify charismatic individuals - and non-charismatic individuals - it is not clear what it is about an individual that influences our judgment. This paper describes the results of experiments designed to discover potential correlates of such judgments, in what speakers say and the way that they say it. We present results of two parallel experiments in which subjective judgments of charisma in spoken and in transcribed American political speech were analyzed with respect to the acoustic and prosodic (where applicable) and lexico-syntactic characteristics of the speech being assessed. While we find that there is considerable disagreement among subjects on how the speakers of each token are ranked, we also find that subjects appear to share a functional definition of charisma, in terms of other personal characteristics we asked them to rank speakers by. We also find certain acoustic, prosodic, and lexico-syntactic characteristics that correlate significantly with perceptions of charisma. Finally, by comparing the responses to spoken vs. transcribed stimuli, we attempt to distinguish between the contributions of "what is said" and "how it is said" with respect to charisma judgments. (C) 2008 Elsevier B.V. All rights reserved. C1 [Rosenberg, Andrew; Hirschberg, Julia] Columbia Univ, New York, NY 10027 USA. RP Rosenberg, A (reprint author), Columbia Univ, 2960 Broadway, New York, NY 10027 USA. EM amaxwell@cs.columbia.edu; julia@cs.columbia.edu CR BARD EG, 1996, P ICSLP 96, P1876 BIRD FB, 1993, HDB CULTS SECTS AM, VB, P75 Boersma P., 2001, GLOT INT, V5, P341 BOSS P, 1976, S SPEECH COMM J, V41, P300 Brill E, 1995, COMPUT LINGUIST, V21, P543 COHEN J, 1968, PSYCHOL BULL, V70, P213, DOI 10.1037/h0026256 Cohen R., 1987, Computational Linguistics, V13 Dahan D, 2002, J MEM LANG, V47, P292, DOI 10.1016/S0749-596X(02)00001-3 DAVIES J, 1954, AM POLIT SCI REV, V48 DOWIS R, 2000, LOST ART GREAT SPEEC Hamilton M. A., 1993, COMMUNICATION Q, V41, P231 Ladd D. R., 1996, INTONATIONAL PHONOLO LARSON K, 2003, SCI WORD RECOGNITION MARCUS J, 1967, W POLIT Q, V14, P237 PIERREHUMBERT J, 1990, INTENTIONS COMMUNICA Silverman K., 1992, P INT C SPOK LANG PR, V2, P867 Touati P., 1993, ESCA WORKSH PROS, P168 TUPPEN CJS, 1974, SPEECH MONOGR, V41, P253 Weber M., 1947, THEORY SOCIAL EC ORG NR 19 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD JUL PY 2009 VL 51 IS 7 SI SI BP 640 EP 655 DI 10.1016/j.specom.2008.11.001 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 465HG UT WOS:000267572800009 ER PT J AU Molina, C Yoma, NB Wuth, J Vivanco, H AF Molina, Carlos Becerra Yoma, Nestor Wuth, Jorge Vivanco, Hiram TI ASR based pronunciation evaluation with automatically generated competing vocabulary and classifier fusion SO SPEECH COMMUNICATION LA English DT Article DE Computer aided pronunciation training; Speech recognition; Automatic generation of competitive vocabulary; Multi-classifier fusion; Computer aided language learning ID RECOGNITION; SYSTEMS AB In this paper, the application of automatic speech recognition (ASR) technology in computer aided pronunciation training (CAPT) is addressed. A method to automatically generate the competitive lexicon, required by an ASR engine to compare the pronunciation of a target word with its correct and wrong phonetic realizations, is proposed. In order to enable the efficient deployment of CAPT applications, the generation of this competitive lexicon does not require any human assistance or a priori information of mother language dependent error rules. Moreover, a Bayes based multi-classifier fusion approach to map ASR objective confidence scores to subjective evaluations in pronunciation assessment is presented. The method proposed here to generate a competitive lexicon given a target word leads to averaged subjective-objective score correlation equal to 0.67 and 0.82 with five and two levels of pronunciation quality, respectively. Finally, multi-classifier systems (MCS) provide a promising formal framework to combine poorly correlated scores in CAPT. When applied to ASR confidence metrics, MCS can lead to an increase of 2.4% and a reduction of 10.2% in subjective-objective score correlation and classification error, respectively, with two pronunciation quality levels. (c) 2009 Elsevier B.V. All rights reserved. C1 [Molina, Carlos; Becerra Yoma, Nestor; Wuth, Jorge] Univ Chile, Speech Proc & Transmiss Lab, Dept Elect Engn, Santiago, Chile. [Vivanco, Hiram] Univ Chile, Dept Linguist, Santiago, Chile. RP Yoma, NB (reprint author), Univ Chile, Speech Proc & Transmiss Lab, Dept Elect Engn, Santiago, Chile. EM nbecerra@ing.uchile.cl FU Conicyt-Chile [1070382]; Fondef [D051-10243, AT 24080121] FX This work was funded by Conicyt-Chile under grants Fondecyt No. 1070382, Fondef No. D051-10243, and AT 24080121. CR ABDOU S, 2006, P INTERSPEECH ICSLP BONAVENTURA P, 2000, P KONV 2000 C NAT LA, P225 CUCCHIARINI C, 1997, IEEE WORKSH ASRU, P622 Duda R. 
O., 1973, PATTERN CLASSIFICATI Fassinut-Mombot B., 2004, Information Fusion, V5, DOI 10.1016/j.inffus.2003.06.001 FRANCO H, 1997, ICASSP 97, V2, P1471 Fumera G, 2005, IEEE T PATTERN ANAL, V27, P942, DOI 10.1109/TPAMI.2005.109 Gales MJF, 1998, COMPUT SPEECH LANG, V12, P75, DOI 10.1006/csla.1998.0043 Garofalo J., 1993, CONTINUOUS SPEECH RE HAMID S, 2004, P NEMLAR C AR LANG R Kittler J, 1998, IEEE T PATTERN ANAL, V20, P226, DOI 10.1109/34.667881 Kittler J, 2003, IEEE T PATTERN ANAL, V25, P110, DOI 10.1109/TPAMI.2003.1159950 Kuncheva LI, 2002, IEEE T PATTERN ANAL, V24, P281, DOI 10.1109/34.982906 Kuncheva LI, 2001, PATTERN RECOGN, V34, P299, DOI 10.1016/S0031-3203(99)00223-X Kwan K.Y., 2002, P ICSLP, P69 *LDC, 1995, LAT 40 DAT PROV LING Moustroufas N, 2007, COMPUT SPEECH LANG, V21, P219, DOI 10.1016/j.csl.2006.04.001 NAKAGAWA S, 2007, P INTERSPEECH ICSLP Neri A., 2003, P 15 INT C PHON SCI, P1157 NEUMEYER L, 1996, P ICSLP 96 Sooful J., 2002, P 7 INT C SPOK LANG, P521 TAX DMJ, 1997, P 1 IAPR WORKSH STAT, P165 TEPPERMAN J, 2007, P INTERSPEECH ICSLP XU L, 1992, IEEE T SYST MAN CYB, V22, P418, DOI 10.1109/21.155943 NR 24 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2009 VL 51 IS 6 BP 485 EP 498 DI 10.1016/j.specom.2009.01.002 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 444NX UT WOS:000265988800001 ER PT J AU Gerosa, M Giuliani, D Brugnara, F AF Gerosa, Matteo Giuliani, Diego Brugnara, Fabio TI Towards age-independent acoustic modeling SO SPEECH COMMUNICATION LA English DT Article DE Speaker adaptive acoustic modeling; Speaker normalization; Vocal tract length normalization; Children's speech recognition ID AUTOMATIC SPEECH RECOGNITION; CHILDRENS SPEECH; VOCAL-TRACT; NORMALIZATION AB In automatic speech recognition applications, due to significant differences in voice characteristics, adults and children are usually treated as two population groups, for which different acoustic models are trained. In this paper, age-independent acoustic modeling is investigated in the context of large vocabulary speech recognition. Exploiting a small amount (9 h) of children's speech and a more significant amount (57 h) of adult speech, age-independent acoustic models are trained using several methods for speaker adaptive acoustic modeling. Recognition results achieved using these models are compared with those achieved using age-dependent acoustic models for children and adults, respectively. Recognition experiments are performed on four Italian speech corpora, two consisting of children's speech and two of adult speech, using 64k word and 11k word trigram language models. Methods for speaker adaptive acoustic modeling prove to be effective for training age-independent acoustic models ensuring recognition results at least as good as those achieved with age-dependent acoustic models for adults and children. (c) 2009 Elsevier B.V. All rights reserved. C1 [Gerosa, Matteo; Giuliani, Diego; Brugnara, Fabio] FBK, I-38100 Trento, Italy. RP Gerosa, M (reprint author), FBK, I-38100 Trento, Italy. EM gerosa@fbk.eu; giuliani@fbk.eu; brugnara@fbk.eu FU European Union [IST-2001-37599]; TC-STAR [FP6-506738] FX This work was partially financed by the European Union under the Projects PF-STAR (Grant IST-2001-37599, http://pfstar.itc.it) and TC-STAR (Grant FP6-506738, http://www.tc-star.org). 
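One of the speaker-normalisation techniques named in the keywords of the record above, vocal tract length normalisation (VTLN), is commonly realised as a grid search over warp factors. The Python sketch below shows a generic piecewise-linear variant; the warp shape, the 0.88-1.12 search grid and the loglik_fn callback are illustrative assumptions, not the authors' exact recipe.

```python
import numpy as np

def warp_frequency(freq_hz, alpha, f_max=8000.0, cut_ratio=0.85):
    # Piecewise-linear VTLN warp: scale by alpha below a cut-off frequency,
    # then interpolate linearly so that f_max maps onto itself.
    f = np.asarray(freq_hz, dtype=float)
    f_cut = cut_ratio * f_max
    upper = alpha * f_cut + (f_max - alpha * f_cut) * (f - f_cut) / (f_max - f_cut)
    return np.where(f <= f_cut, alpha * f, upper)

def select_warp_factor(utterance, loglik_fn, alphas=np.arange(0.88, 1.13, 0.02)):
    # Grid-search VTLN: keep the warp factor whose warped features score best
    # under the current acoustic model (loglik_fn recomputes the features for
    # a given alpha and returns their log-likelihood).
    scores = [loglik_fn(utterance, a) for a in alphas]
    return alphas[int(np.argmax(scores))]
```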
CR ACKERMANN U, 1997, P EUROSPEECH, P1807 Anastasakos T., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607807 ANGELINI B, 1994, P ICSLP YOK, P1391 Batliner A., 2005, P INT, P2761 BERTOLDI N, 2001, P ICASSP SALT LAK CI, V1, P37 BRUGNARA F, 2002, P ICSLP DENV CO, P1441 BURNETT DC, 1996, P ICSLP PHIL PA, V2, P1145, DOI 10.1109/ICSLP.1996.607809 Claes T, 1998, IEEE T SPEECH AUDI P, V6, P549, DOI 10.1109/89.725321 DARCY S, 2005, P INT 2005 LISB PORT, P2197 Das S., 1998, P ICASSP SEATTL US M, V1, P433, DOI 10.1109/ICASSP.1998.674460 EIDE E, 1996, P ICASSP, P346 Fitch WT, 1999, J ACOUST SOC AM, V106, P1511, DOI 10.1121/1.427148 Gales MJF, 1998, COMPUT SPEECH LANG, V12, P75, DOI 10.1006/csla.1998.0043 GEROSA M, 2006, P ICASSP TOUL FRANC, P393 Gerosa M, 2007, SPEECH COMMUN, V49, P847, DOI 10.1016/j.specom.2007.01.002 GEROSA M, 2006, P INTERSPEECH PITTSB Gillick L., 1989, P ICASSP, P532 GIULIANI D, 2004, P ICSLP, V4, P2893 GIULIANI D, 2003, P ICASSP, V2, P137 Giuliani D, 2006, COMPUT SPEECH LANG, V20, P107, DOI 10.1016/j.csl.2005.05.002 Goldstein UG., 1980, THESIS MIT CAMBRIDGE HAGEN A, 2003, P IEEE AUT SPEECH RE Huber JE, 1999, J ACOUST SOC AM, V106, P1532, DOI 10.1121/1.427150 Lee L., 1996, P ICASSP Lee S, 1999, J ACOUST SOC AM, V105, P1455, DOI 10.1121/1.426686 LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010 LI Q, 2002, P ICSLP DENV CO, P2337 MARTINEZ F, 1998, P ICASSP, V2, P725, DOI 10.1109/ICASSP.1998.675367 MCGOWAN RS, 1988, J ACOUST SOC AM, V83, P229, DOI 10.1121/1.396425 MIRGHAFORI N, 1996, P IEEE INT C AC SPEE, P335 Narayanan S, 2002, IEEE T SPEECH AUDI P, V10, P65, DOI 10.1109/89.985544 Nisimura R., 2004, P ICASSP, V1, P433 NITTROUER S, 1989, J ACOUST SOC AM, V86, P1266, DOI 10.1121/1.398741 PALLETT DS, 1992, P 1992 ARPAS CONT SP POTAMIANOS A, 1997, P EUR C SPEECH COMM, P2371 Potamianos A, 2003, IEEE T SPEECH AUDI P, V11, P603, DOI 10.1109/TSA.2003.818026 Steidl S, 2003, LECT NOTES COMPUT SC, V2781, P600 Stemmer G., 2003, P EUR C SPEECH COMM, P1313 Stemmer G., 2005, P IEEE INT C AC SPEE, V1, P997, DOI 10.1109/ICASSP.2005.1415284 WEGMANN S, 1996, P ICASSP, P339 Welling L., 1999, P IEEE INT C AC SPEE, P761 WILPON JG, 1996, P ICASSP, P349 NR 42 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2009 VL 51 IS 6 BP 499 EP 509 DI 10.1016/j.specom.2009.01.006 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 444NX UT WOS:000265988800002 ER PT J AU Amano, S Kondo, T Kato, K Nakatani, T AF Amano, Shigeaki Kondo, Tadahisa Kato, Kazumi Nakatani, Tomohiro TI Development of Japanese infant speech database from longitudinal recordings SO SPEECH COMMUNICATION LA English DT Article DE Infant utterance; Speech database; Speech development ID VOCAL FUNDAMENTAL-FREQUENCY; UTTERANCES; FEATURES; CHILDREN; FATHER; PITCH AB Developmental research on speech production requires both a cross-sectional and a longitudinal speech database. Previous longitudinal speech databases are limited in terms of recording period or number of utterances. An infant speech database was developed from 5 years of recordings containing a large number of daily life utterances of five Japanese infants and their parents. 
The resulting database contains 269,467 utterances with various types of information including a transcription, an F0 value, and a phoneme label. This database can be used in future research on the development of speech production. (c) 2009 Elsevier B.V. All rights reserved. C1 [Amano, Shigeaki; Kondo, Tadahisa; Kato, Kazumi; Nakatani, Tomohiro] NTT Corp, NTT Commun Sci Labs, Seika, Kyoto 6190237, Japan. RP Amano, S (reprint author), NTT Corp, NTT Commun Sci Labs, 2-4 Hikari Dai, Seika, Kyoto 6190237, Japan. EM amano@cslab.kecl.ntt.co.jp CR Amano S, 2006, J ACOUST SOC AM, V119, P1636, DOI 10.1121/1.2161443 BENNETT S, 1983, J SPEECH HEAR RES, V26, P137 Eguchi S, 1969, Acta Otolaryngol Suppl, V257, P1 Fairbanks G, 1942, CHILD DEV, V13, P227 HOLLIEN H, 1994, J ACOUST SOC AM, V96, P2646, DOI 10.1121/1.411275 INUI T, 2003, COGNITIVE STUDIES, V10, P304 Ishizuka K, 2007, J ACOUST SOC AM, V121, P2272, DOI 10.1121/1.2535806 Kajikawa S, 2004, J CHILD LANG, V31, P215, DOI 10.1017/S0305000903005968 KAJIKAWA S, 2004, PAFOUMANSU KYOUIKU, V3, P61 KEATING P, 1978, J ACOUST SOC AM, V63, P567, DOI 10.1121/1.381755 KENT RD, 1976, J SPEECH HEAR RES, V19, P421 KENT RD, 1982, J ACOUST SOC AM, V72, P353, DOI 10.1121/1.388089 MacWhinney B., 2000, CHILDES PROJECT TOOL McRoberts GW, 1997, J CHILD LANG, V24, P719, DOI 10.1017/S030500099700322X MUGITANI R, 2006, J PHONET SOC JPN, V10, P96 Nakatani T, 2004, J ACOUST SOC AM, V116, P3690, DOI 10.1121/1.1787522 Nakatani T., 2003, P EUROSPEECH, P2313 NAKATANI T, 2002, P ICSLP 2002, V3, P1733 Reissland N, 1998, INFANT BEHAV DEV, V21, P793, DOI 10.1016/S0163-6383(98)90046-7 ROBB MP, 1989, J ACOUST SOC AM, V85, P1708, DOI 10.1121/1.397960 ROBB MP, 1985, J SPEECH HEAR RES, V28, P421 SHEPPARD WC, 1968, J SPEECH HEAR RES, V11, P94 NR 22 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2009 VL 51 IS 6 BP 510 EP 520 DI 10.1016/j.specom.2009.01.009 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 444NX UT WOS:000265988800003 ER PT J AU Kim, K Baran, RH Ko, H AF Kim, Kihyeon Baran, Robert H. Ko, Hanseok TI Extension of two-channel transfer function based generalized sidelobe canceller for dealing with both background and point-source noise SO SPEECH COMMUNICATION LA English DT Article DE Speech enhancement; Adaptive signal processing; Non-stationary noise; Transfer function ratio; Generalized sidelobe canceller ID SPEECH ENHANCEMENT; MICROPHONE ARRAYS; ENVIRONMENTS; RECOGNITION; REDUCTION; ESTIMATOR; AID AB This paper describes an algorithm to suppress non-stationary noise as well as stationary noise in a speech enhancement system that employs a two-channel generalized sidelobe canceller (GSC). Our approach builds on recent advances in GSC design involving a transfer function ratio (TFR). The proposed system has four stages. The first stage estimates a new TFR along the acoustic paths from the non-stationary noise source to the microphones and the power of the stationary noise components. Second, the estimated power of the stationary noise components is used to execute spectral subtraction (SS) with respect to the input signals. Thirdly, the optimal gain is estimated for speech enhancement on the primary channel. In the final stage, an adaptive filter reduces the residual correlated noise components of the signal. 
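The spectral-subtraction stage mentioned in the second step of the algorithm above can be written as a simple per-bin gain. The sketch below (Python, illustrative parameter values) shows the generic operation only; the paper's estimator for the stationary noise power and its exact gain rule are not reproduced here.

```python
import numpy as np

def spectral_subtraction_gain(noisy_psd, noise_psd, over_sub=1.0, floor=0.05):
    # Power spectral subtraction expressed as a per-bin gain with a spectral
    # floor to limit musical noise; the gain multiplies the noisy spectrum
    # before any later adaptive noise-cancellation stage.
    noisy_psd = np.asarray(noisy_psd, dtype=float)
    noise_psd = np.asarray(noise_psd, dtype=float)
    gain = 1.0 - over_sub * noise_psd / np.maximum(noisy_psd, 1e-12)
    return np.clip(gain, floor, 1.0)

# usage: enhanced_spectrum = spectral_subtraction_gain(p_noisy, p_noise) * noisy_spectrum
```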
These algorithmic improvements consistently give a better performance than a transfer function based GSC (TF-GSC) alone or a GSC with SS post-filtering under various noise conditions while slightly increasing the computational complexity. (c) 2009 Elsevier B.V. All rights reserved. C1 [Kim, Kihyeon; Baran, Robert H.; Ko, Hanseok] Korea Univ, ISPL, Dept Elect & Comp Engn, Seoul, South Korea. RP Ko, H (reprint author), Korea Univ, ISPL, Dept Elect & Comp Engn, 5-1 Anam Dong, Seoul, South Korea. EM khkim@ispl.korea.ac.kr; rhbaran@yahoo.com; hsko@korea.ac.kr FU MIC (Ministry of Information and Communication), Korea FX This research was supported by the MIC (Ministry of Information and Communication), Korea, Under the ITFSIP (IT Foreign Specialist Inviting Program) supervised by the IITA (Institute of Information Technology Advancement). CR Bitzer J., 1998, P EUR SIGN PROC C RH, P105 Bitzer J, 2001, SPEECH COMMUN, V34, P3, DOI 10.1016/S0167-6393(00)00042-X BITZER J, 1999, P IEEE INT C AC SPEE, V5, P2965 Cohen I, 2001, SIGNAL PROCESS, V81, P2403, DOI 10.1016/S0165-1684(01)00128-1 Cohen I, 2004, IEEE T SIGNAL PROCES, V52, P1149, DOI 10.1109/TSP.2004.826166 Cohen I, 2003, IEEE T SPEECH AUDI P, V11, P466, DOI 10.1109/TSA.2003.811544 EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550 FISCHER S, 1995, P 4 INT WORKSH AC EC, P44 Fischer S, 1996, SPEECH COMMUN, V20, P215, DOI 10.1016/S0167-6393(96)00054-4 FROST OL, 1972, PR INST ELECTR ELECT, V60, P926, DOI 10.1109/PROC.1972.8817 Gannot S, 2001, IEEE T SIGNAL PROCES, V49, P1614, DOI 10.1109/78.934132 GANNOT S, 2002, EE PUB, V1319 Gannot S, 2004, IEEE T SPEECH AUDI P, V12, P561, DOI 10.1109/TSA.2004.834599 GRENIER Y, 1992, P ICASSP 92, P305, DOI 10.1109/ICASSP.1992.225911 GRIFFITHS LJ, 1982, IEEE T ANTENN PROPAG, V30, P27, DOI 10.1109/TAP.1982.1142739 HOSHUYAMA O, 2001, MICROPHONE ARRAYS SI, P87 Jeong S, 2008, ELECTRON LETT, V44, P253, DOI 10.1049/el:20083327 Kim G, 2007, ELECTRON LETT, V43, P783, DOI 10.1049/el:20070780 LeBouquinJeannes R, 1997, IEEE T SPEECH AUDI P, V5, P484 McCowan IA, 2003, IEEE T SPEECH AUDI P, V11, P709, DOI 10.1109/TSA.2003.818212 MEYER J, 1997, P ICASSP 97, V2, P1167 NORDHOLM S, 1993, IEEE T VEH TECHNOL, V42, P514, DOI 10.1109/25.260760 Shalvi O, 1996, IEEE T SIGNAL PROCES, V44, P2055, DOI 10.1109/78.533725 Spriet A, 2005, IEEE T SPEECH AUDI P, V13, P487, DOI 10.1109/TSA.2005.845821 Van Veen B. D., 1988, IEEE ASSP Magazine, V5, DOI 10.1109/53.665 VANCOMPERNOLLE D, 1995, P COST, V229, P107 VANCOMPERNOLLE D, 1990, SPEECH COMMUN, V9, P433, DOI 10.1016/0167-6393(90)90019-6 Vaseghi S. V., 2002, ADV DIGITAL SIGNAL P NR 28 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2009 VL 51 IS 6 BP 521 EP 533 DI 10.1016/j.specom.2009.02.001 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 444NX UT WOS:000265988800004 ER PT J AU Zhang, SX Mak, MW AF Zhang, Shi-Xiong Mak, Man-Wai TI A new adaptation approach to high-level speaker-model creation in speaker verification SO SPEECH COMMUNICATION LA English DT Article DE Speaker verification; High-level features; Model adaptation; Maximum-a-posterior (MAP) adaptation ID TRANSFORMATION AB Research has shown that speaker verification based on high-level speaker features requires long enrollment utterances to guarantee low error rate during verification. 
However, in practical speaker verification, it is common to model speakers based on a limited amount of enrollment data, which will make the speaker models unreliable. This paper proposes four new adaptation methods for creating high-level speaker models to alleviate this undesirable effect. Unlike conventional methods in which only the phoneme-dependent background model is adapted, the proposed adaptation methods also adapts the phoneme-independent speaker model to fully utilize all the information available in the training data. A proportional factor, which is derived from the ratio between the phoneme-dependent background model and the phoneme-independent background model, is used to adjust the phoneme-independent speaker models during adaptation. The proposed method was evaluated under the NIST 2000 and NIST 2002 SRE frameworks. Experimental results show that the proposed adaptation method can alleviate the data-sparseness problem effectively and achieves a better performance when compared with traditional MAP adaptation. (c) 2009 Elsevier B.V. All rights reserved. C1 [Zhang, Shi-Xiong; Mak, Man-Wai] Hong Kong Polytech Univ, Ctr Multimedia Signal Proc, Elect & Informat Engn Dept, Kowloon, Hong Kong, Peoples R China. RP Mak, MW (reprint author), Hong Kong Polytech Univ, Ctr Multimedia Signal Proc, Elect & Informat Engn Dept, Kowloon, Hong Kong, Peoples R China. EM zhang.sx@alumni.polyu.edu.hk; enmwmak@polyu.edu.hk CR ADAMI A, 2003, P ICASSP, V4, P788 Andrews W., 2002, P ICASSP BAKER B, 2004, P IEEE WORKSH SPEAK, P91 BLAAUW E, 1994, SPEECH COMMUN, V14, P359, DOI 10.1016/0167-6393(94)90028-0 CAMPBELL JP, 1999, P INT C AC SPEECH SI, V2, P829 Campbell J.P., 2003, P EUR, P2665 Campbell WM, 2006, IEEE SIGNAL PROC LET, V13, P308, DOI 10.1109/LSP.2006.870086 CHAPPELL D, 1998, P ICASSP, V1, P885 CHEN KT, 2000, P ICSLP, V3, P742 Dahan D, 1996, LANG SPEECH, V39, P341 Doddington G., 2001, P EUR, P2521 FEDERICO M, 1996, ICSLP, P279 Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278 Gillick L., 1989, P ICASSP, P532 HANSEN EG, 2004, P OD 04 SPEAK LANG R, P179 JIN Q, 2003, P ICASSP KAJAREKAR SS, 2005, P IEEE INT C AC SPEE, V1, P173, DOI 10.1109/ICASSP.2005.1415078 KIMBALL O, 1997, EUROSPEECH 97, P967 Kittler J, 1998, IEEE T PATTERN ANAL, V20, P226, DOI 10.1109/34.667881 KLUSACEK D, 2003, P ICASSP, V4, P804 KOSAKA T, 1996, J COMPUT SPEECH LANG, V10, P54 Kuehn David P., 1976, J PHONETICS, V4, P303 Kuhn R, 2000, IEEE T SPEECH AUDI P, V8, P695, DOI 10.1109/89.876308 LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010 Leung KY, 2006, SPEECH COMMUN, V48, P71, DOI 10.1016/j.specom.2005.05.013 Mak B, 2005, IEEE T SPEECH AUDI P, V13, P984, DOI 10.1109/TSA.2005.851971 Mak BKW, 2006, IEEE T AUDIO SPEECH, V14, P1267, DOI 10.1109/TSA.2005.860836 Mak MW, 2007, NEUROCOMPUTING, V71, P137, DOI 10.1016/j.neucom.2007.08.003 Mak MW, 2006, INT CONF ACOUST SPEE, P929 Marithoz J., 2002, INT C SPOK LANG PROC, P581 MATSUI T, 1993, P ICASSP 93, V2, P391 NAVRATIL J, 2003, P ICASSP 2003, V4, P796 Pelecanos J., 2001, P SPEAK OD SPEAK REC, P213 PESKIN B, 2003, P ICASSP, V4, P792 Reynolds D., 2003, P ICASSP 03, VIV, P784 Reynolds D. 
A., 1997, P EUR, P963 REYNOLDS DA, 1997, P IEEE INT C AC SPEE, V2, P1535 Reynolds DA, 2000, DIGIT SIGNAL PROCESS, V10, P19, DOI 10.1006/dspr.1999.0361 Shlens J., 2005, TUTORIAL PRINCIPAL C SHRIBERG E, 2007, SPEAKER CLASSIFICATI, V1, P241 Siohan O, 2001, IEEE T SPEECH AUDI P, V9, P417, DOI 10.1109/89.917687 SONMEZ K, 1998, ICSLP, V4, P3189 Sussman HM, 1998, PHONETICA, V55, P204, DOI 10.1159/000028433 Thyes O., 2000, P INT C SPOK LANG PR, V2, P242 WEBER F, 2002, P IEEE INT C AC SPEE, V1, P141 XIANG B, 2002, P ICASSP, V1, P681 Zhang SX, 2007, IEEE T COMPUT, V56, P1189, DOI 10.1109/TC.2007.1081 Zhang SX, 2007, LECT NOTES COMPUT SC, V4810, P325 NR 48 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2009 VL 51 IS 6 BP 534 EP 550 DI 10.1016/j.specom.2009.02.005 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 444NX UT WOS:000265988800005 ER PT J AU Wolfel, M AF Woelfel, Matthias TI Signal adaptive spectral envelope estimation for robust speech recognition SO SPEECH COMMUNICATION LA English DT Article DE Adaptive feature extraction; Spectral estimation; Minimum variance distortionless response; Automatic speech recognition; Bilinear transformation; Time vs. frequency domain ID HIDDEN MARKOV-MODELS; LINEAR PREDICTION; SCALE AB This paper describes a novel spectral envelope estimation technique which adapts to the characteristics of the observed signal. This is possible via the introduction of a second bilinear transformation into warped minimum variance distortionless response (MVDR) spectral envelope estimation. As opposed to the first bilinear transformation, however, which is applied in the time domain, the second bilinear transformation must be applied in the frequency domain. This extension enables the resolution of the spectral envelope estimate to be steered to lower or higher frequencies, while keeping the overall resolution of the estimate and the frequency axis fixed. When embedded in the feature extraction process of an automatic speech recognition system, it provides for the emphasis of the characteristics of speech features that are relevant for robust classification, while simultaneously suppressing characteristics that are irrelevant for classification. The change in resolution may be steered, for each observation window, by the normalized first autocorrelation coefficient. To evaluate the proposed adaptive spectral envelope technique, dubbed warped-twice MVDR, we use two objective functions: class separability and word error rate. Our test set consists of development and evaluation data as provided by NIST for the Rich Transcription 2005 Spring Meeting Recognition Evaluation. For both measures, we observed consistent improvements for several speaker-to-microphone distances. In average, over all distances, the proposed front-end reduces the word error rate by 4% relative compared to the widely used mel-frequency cepstral coefficients as well as perceptual linear prediction. (c) 2009 Elsevier B.V. All rights reserved. C1 Univ Karlsruhe TH, Inst Theoret Informat, D-76131 Karlsruhe, Germany. RP Wolfel, M (reprint author), Univ Karlsruhe TH, Inst Theoret Informat, Fasanengarten 5, D-76131 Karlsruhe, Germany. 
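The second bilinear transformation described in the abstract above amounts to an all-pass frequency mapping whose parameter is chosen per analysis frame. A minimal Python sketch of the two ingredients the abstract names (the normalized first autocorrelation coefficient and a bilinear warp) follows; the linear steering rule and rho_max are assumptions for illustration only, not the paper's exact mapping.

```python
import numpy as np

def normalized_first_autocorr(frame):
    # r(1)/r(0) of a zero-mean frame: close to +1 for strongly low-pass
    # (e.g. voiced) frames, near zero or negative for flat or high-pass frames.
    x = np.asarray(frame, dtype=float)
    x = x - x.mean()
    r0 = np.dot(x, x)
    return float(np.dot(x[:-1], x[1:]) / r0) if r0 > 0 else 0.0

def bilinear_warp(omega, rho):
    # Phase response of a first-order all-pass filter: the standard bilinear
    # frequency mapping; rho > 0 stretches the low-frequency region.
    return omega + 2.0 * np.arctan(rho * np.sin(omega) / (1.0 - rho * np.cos(omega)))

def steered_rho(frame, rho_max=0.2):
    # Illustrative steering rule: let the frame's normalized first
    # autocorrelation coefficient choose the second warp parameter.
    return rho_max * normalized_first_autocorr(frame)
```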
EM wolfel@ira.uka.de CR Acero A., 1990, THESIS CARNEGIE MELL BRACCINI C, 1974, IEEE T ACOUST SPEECH, VAS22, P236, DOI 10.1109/TASSP.1974.1162582 CAPON J, 1969, P IEEE, V57, P1408, DOI 10.1109/PROC.1969.7278 DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 DHARANIPRAGADA S, 2001, P ICASSP, V1, P309 Dharanipragada S, 2007, IEEE T AUDIO SPEECH, V15, P224, DOI 10.1109/TASL.2006.876776 Driaunys K., 2005, Information Technology and Control, V34 Gales MJF, 1998, COMPUT SPEECH LANG, V12, P75, DOI 10.1006/csla.1998.0043 Gales MJF, 1999, IEEE T SPEECH AUDI P, V7, P272, DOI 10.1109/89.759034 HAEBUMBACH R, 1999, P ICASSP, P397 Harma A, 2001, IEEE T SPEECH AUDI P, V9, P579, DOI 10.1109/89.928922 Haykin S., 1991, ADAPTIVE FILTER THEO HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423 *LDC, TRANS ENGL DAT LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010 MAKHOUL J, 1975, P IEEE, V63, P561, DOI 10.1109/PROC.1975.9792 MATSUMOTO H, 2001, P ICASSP, P117 MATSUMOTO M, 1998, P ICSLP, P1051 MESGARANI N, 2007, P ICASSP, P765 MURTHI M, 1997, P IEEE INT C AC SPEE, P1687 Murthi MN, 2000, IEEE T SPEECH AUDI P, V8, P221, DOI 10.1109/89.841206 MUSICUS BR, 1985, IEEE T ACOUST SPEECH, V33, P1333, DOI 10.1109/TASSP.1985.1164696 NAKATOH Y, 2004, P ICSLP *NIST, 2005, RICH TRANSC 2005 SPR NOCERINO N, 1985, P ICASSP, P25 Olive J. P., 1993, ACOUSTICS AM ENGLISH OPPENHEIM A, 1971, IEEE P LETT, V59, P229 Oppenheim A. V., 1989, DISCRETE TIME SIGNAL Smith JO, 1999, IEEE T SPEECH AUDI P, V7, P697, DOI 10.1109/89.799695 Stevens SS, 1937, J ACOUST SOC AM, V8, P185, DOI 10.1121/1.1915893 STRUBE HW, 1980, J ACOUST SOC AM, V68, P1071, DOI 10.1121/1.384992 Welling L, 2002, IEEE T SPEECH AUDI P, V10, P415, DOI 10.1109/TSA.2002.803435 WOLFEL M, 2003, P ESSV, P22 WOLFEL M, 2006, P EUSIPCO Wolfel M, 2005, IEEE SIGNAL PROC MAG, V22, P117, DOI 10.1109/MSP.2005.1511829 WOLFEL M, 2003, P EUR, P1021 Yule GU, 1927, PHILOS T R SOC LOND, V226, P267, DOI 10.1098/rsta.1927.0007 COMPUTERS HUMAN INTE NR 38 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2009 VL 51 IS 6 BP 551 EP 561 DI 10.1016/j.specom.2009.02.006 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 444NX UT WOS:000265988800006 ER PT J AU Magi, C Pohjalainen, J Backstrom, T Alku, P AF Magi, Carlo Pohjalainen, Jouni Backstrom, Tom Alku, Paavo TI Stabilised weighted linear prediction SO SPEECH COMMUNICATION LA English DT Article DE Linear prediction; All-pole modelling; Spectral estimation ID SPEECH; RECOGNITION; EXTRACTION; SPECTRUM AB Weighted linear prediction (WLP) is a method to compute all-pole models of speech by applying temporal weighting of the square of the residual signal. By using short-time energy (STE) as a weighting function, this algorithm was originally proposed as an improved linear predictive (LP) method based oil emphasising those samples that fit the underlying speech production model well. The original formulation of WLP, however, did not guarantee stability of all-pole models. Therefore, the current work revisits the concept of WLP by introducing a modified short-time energy function leading always to stable all-pole models. 
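As a concrete reference for the weighting idea just described, here is a minimal Python sketch of plain WLP with a short-time-energy weight (window length and regularisation constant are illustrative). As noted above, this basic form does not guarantee stability; the stabilised variant introduced in the paper modifies the weight function itself and is not reproduced here.

```python
import numpy as np

def wlp_coefficients(x, order=10, ste_win=12):
    # Weighted linear prediction: minimise sum_n w_n * (x_n - sum_k a_k x_{n-k})^2
    # with a short-time-energy (STE) weight w_n.
    x = np.asarray(x, dtype=float)
    n = len(x)
    # STE weight: energy of the ste_win samples preceding sample n
    w = np.array([np.sum(x[max(0, t - ste_win):t] ** 2) + 1e-9 for t in range(n)])
    R = np.zeros((order, order))
    r = np.zeros(order)
    for t in range(order, n):
        past = x[t - order:t][::-1]          # x[t-1], ..., x[t-order]
        R += w[t] * np.outer(past, past)
        r += w[t] * x[t] * past
    a = np.linalg.solve(R, r)                 # weighted normal equations
    return np.concatenate(([1.0], -a))        # coefficients of A(z)
```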
This new method, stabilised weighted linear prediction (SWLP), is shown to yield all-pole models whose general performance can be adjusted by properly choosing the length of the STE window, a parameter denoted by M. The study compares the performances of SWLP, minimum variance distortionless response (MVDR), and conventional LP in spectral modelling of speech corrupted by additive noise. The comparisons were performed by computing, for each method, the logarithmic spectral differences between the all-pole spectra extracted from clean and noisy speech in different segmental signal-to-noise ratio (SNR) categories. The results showed that the proposed SWLP algorithm was the most robust method against zero-mean Gaussian noise and the robustness was largest for SWLP with a small M-value. These findings were corroborated by a small listening test in which the majority of the listeners assessed the quality of impulse-train-excited SWLP filters, extracted from noisy speech, to be perceptually closer to original clean speech than the corresponding all-pole responses computed by MVDR. Finally, SWLP was compared to other short-time spectral estimation methods (FFT, LP, MVDR) in isolated word recognition experiments. Recognition accuracy obtained by SWLP, in comparison to other short-time spectral estimation methods, improved already at moderate segmental SNR values for sounds corrupted by zero-mean Gaussian noise. For realistic factory noise of low-pass characteristics, the SWLP method improved the recognition results at segmental SNR levels below 0 dB. (C) 2009 Published by Elsevier B.V. C1 [Magi, Carlo; Pohjalainen, Jouni; Backstrom, Tom; Alku, Paavo] Aalto Univ, Lab Acoust & Audio Signal Proc, FI-02015 Helsinki, Finland. RP Alku, P (reprint author), Aalto Univ, Lab Acoust & Audio Signal Proc, POB 3000, FI-02015 Helsinki, Finland. EM jpohjala@acoustics.hut.fi; tom.backstrom@tkk.fi; paavo.alku@tkk.fi RI Backstrom, Tom/E-2121-2011; Alku, Paavo/E-2400-2012 FU Academy of Finland [107494] FX Supported by Academy of Finland (Project No. 107494). CR ALKU P, 1992, SPEECH COMMUN, V11, P109, DOI 10.1016/0167-6393(92)90005-R BACKSTROM T, 2004, THESIS HELSINKI U TE Bazaraa MS, 1993, NONLINEAR PROGRAMMIN CHILDERS DG, 1994, IEEE T BIO-MED ENG, V41, P663, DOI 10.1109/10.301733 DELSARTE P, 1982, IEEE T INFORM THEORY, V33, P412 DEWET F, 2001, P EUR 2001 AALB DENM Dharanipragada S, 2007, IEEE T AUDIO SPEECH, V15, P224, DOI 10.1109/TASL.2006.876776 ELJAROUDI A, 1991, IEEE T SIGNAL PROCES, V39, P411, DOI 10.1109/78.80824 Garofolo J. S., 1993, DARPA TIMIT ACOUSTIC GRAY AH, 1976, IEEE T ACOUST SPEECH, V24, P380, DOI 10.1109/TASSP.1976.1162849 Huiqun Deng, 2006, IEEE Transactions on Audio, Speech and Language Processing, V14, DOI 10.1109/TSA.2005.857811 Karaev MT, 2004, P AM MATH SOC, V132, P2321, DOI 10.1090/S0002-9939-04-07391-5 Kleijn W.
B., 1995, SPEECH CODING SYNTHE KRISHNAMURTHY AK, 1986, IEEE T ACOUST SPEECH, V34, P730, DOI 10.1109/TASSP.1986.1164909 LIM JS, 1978, IEEE T ACOUST SPEECH, V26, P197 MA CX, 1993, SPEECH COMMUN, V12, P69, DOI 10.1016/0167-6393(93)90019-H MAGI C, 2006, CD P 7 NORD SIGN PRO MAKHOUL J, 1975, P IEEE, V63, P561, DOI 10.1109/PROC.1975.9792 Markel JD, 1976, LINEAR PREDICTION SP Murthi MN, 2000, IEEE T SPEECH AUDI P, V8, P221, DOI 10.1109/89.841206 O'Shaughnessy D, 2000, SPEECH COMMUNICATION, V2nd Rabiner L, 1993, FUNDAMENTALS SPEECH SAMBUR MR, 1976, IEEE T ACOUST SPEECH, V24, P488, DOI 10.1109/TASSP.1976.1162870 SHIMAMURA T, 2004, CD P 6 NORD SIGN PRO Theodoridis S., 2003, PATTERN RECOGNITION VARGA A, 1992, NOISEX 92 DATABASE Wolfel M, 2005, IEEE SIGNAL PROC MAG, V22, P117, DOI 10.1109/MSP.2005.1511829 WOLFEL M, 2003, P IEEE AUT SPEECH RE, V387, P387 WONG DY, 1979, IEEE T ACOUST SPEECH, V27, P350, DOI 10.1109/TASSP.1979.1163260 WU JX, 1993, IEEE T PATTERN ANAL, V15, P1174 YAPANEL U, 2003, EUROSPEECH 2003 Yegnanarayana B, 1998, IEEE T SPEECH AUDI P, V6, P313, DOI 10.1109/89.701359 ZHAO Q, 1997, COMMUN COMPUT PHYS, V2, P585 NR 33 TC 18 Z9 18 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2009 VL 51 IS 5 BP 401 EP 411 DI 10.1016/j.specom.2008.12.005 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 429TI UT WOS:000264942600001 ER PT J AU Jeong, M Lee, GG AF Jeong, Minwoo Lee, Gary Geunbae TI Multi-domain spoken language understanding with transfer learning SO SPEECH COMMUNICATION LA English DT Article DE Spoken language understanding; Multi-domain dialog system; Transfer learning; Triangular-chain structure model ID SEMANTIC ROLES; DIALOGUE; SYSTEM AB This paper addresses the problem of multi-domain spoken language understanding (SLU) where domain detection and domain-dependent semantic tagging problems are combined. We present a transfer learning approach to the multi-domain SLU problem in which multiple domain-specific data sources can be incorporated. To implement multi-domain SLU with transfer learning, we introduce a triangular-chain structured model. This model effectively learns multiple domains in parallel, and allows use of domain-independent patterns among domains to create a better model for the target domain. We demonstrate that the proposed method outperforms baseline models on dialog data for multi-domain SLU problems. (C) 2009 Elsevier B.V. All rights reserved. C1 [Jeong, Minwoo; Lee, Gary Geunbae] Pohang Univ Sci & Technol, Dept Comp Sci & Engn, Pohang 790784, South Korea. RP Jeong, M (reprint author), Pohang Univ Sci & Technol, Dept Comp Sci & Engn, San 31, Pohang 790784, South Korea. EM stardust@postech.ac.kr; gblee@poste-ch.ac.kr FU Korea Science and Engineering Foundation (KOSEF), Korea Goverment (MEST) [R01-2008-000-20651-0] FX We thank thin anonymous reviewers for their valuable comments. We would also like to thank Donghyun Lee for his preparation of speech recognition results, and Derek Lactin for his proof-reading of the paper. This work was supported by the Korea Science and Engineering Foundation (KOSEF) grant funded by the Korea Goverment (MEST) (No. R01-2008-000-20651-0). 
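To make the problem decomposition described in the abstract above concrete, the sketch below (Python, with hypothetical component names) shows the two coupled decisions, domain detection and semantic tagging, as a pipeline over shared features. The paper's triangular-chain model learns the two decisions jointly rather than sequentially; this fragment only illustrates the inputs and outputs involved.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Features = List[Dict[str, float]]

@dataclass
class MultiDomainSLU:
    # Toy interface: one shared feature extractor, a domain detector, and
    # per-domain slot taggers over the same domain-independent features.
    featurize: Callable[[List[str]], Features]
    detect_domain: Callable[[Features], str]
    taggers: Dict[str, Callable[[Features], List[str]]]

    def parse(self, tokens: List[str]) -> Tuple[str, List[str]]:
        feats = self.featurize(tokens)        # domain-independent features
        domain = self.detect_domain(feats)    # e.g. "navigation", "weather"
        return domain, self.taggers[domain](feats)  # BIO slot tags
```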
CR Ammicht E, 1999, P EUR C SPEECH COMM, P1375 Caruana R, 1997, MACH LEARN, V28, P41, DOI 10.1023/A:1007379606734 Chung G., 1999, P EUR C SPEECH COMM, P2655 COHN T, 2006, P EUR C MACH LEARN E, P606 Collins M, 2002, PROCEEDINGS OF THE 2002 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, P1 Crammer K, 2006, J MACH LEARN RES, V7, P551 Daume H, 2006, J ARTIF INTELL RES, V26, P101 Daume H., 2007, P 45 ANN M ASS COMP, P256 Daume III H., 2005, P 22 INT C MACH LEAR, P169, DOI 10.1145/1102351.1102373 De Mori R., 2008, SIGNAL PROCESS MAGAZ, V25, P50 Dredze M., 2008, P 25 INT C MACH LEAR, P264, DOI 10.1145/1390156.1390190 Gildea D, 2002, COMPUT LINGUIST, V28, P245, DOI 10.1162/089120102760275983 Gillick L., 1989, P ICASSP, P532 Gupta N, 2006, IEEE T AUDIO SPEECH, V14, P213, DOI 10.1109/TSA.2005.854085 Hardy H, 2006, SPEECH COMMUN, V48, P354, DOI 10.1016/j.specom.2005.07.006 Jeong M, 2008, IEEE T AUDIO SPEECH, V16, P1287, DOI 10.1109/TASL.2008.925143 Jeong M, 2006, P JOINT INT C COMP L, P412, DOI 10.3115/1273073.1273127 Komatani K, 2008, SPEECH COMMUN, V50, P863, DOI 10.1016/j.specom.2008.05.010 Lafferty John D., 2001, ICML, P282 LEE CJ, 2006, P IEEE SPOK LANG TEC, P194 Moschitti A, 2007, P IEEE WORKSH AUT SP Nocedal J., 1999, NUMERICAL OPTIMIZATI Palmer M, 2005, COMPUT LINGUIST, V31, P71, DOI 10.1162/0891201053630264 Peckham J., 1991, P WORKSH SPEECH NAT, P14, DOI 10.3115/112405.112408 Price P., 1990, P DARPA SPEECH NAT L, P91, DOI 10.3115/116580.116612 Ramshaw L. A., 1995, P 3 WORKSH VER LARG, P82 RAYMOND C, 2007, P INT ANTW BEG Sha F., 2003, P C N AM CHAPT ASS C, P134 Sutton C., 2007, P 24 INT C MACH LEAR, P863, DOI 10.1145/1273496.1273605 Taskar B., 2003, P ADV NEUR INF PROC Tsochantaridis I, 2005, J MACH LEARN RES, V6, P1453 TUR G, 2006, P IEEE INT C AC SPEE Walker M. A., 2002, P INT C SPOK LANG PR, P269 Wang YY, 2005, IEEE SIGNAL PROC MAG, V22, P16 NR 34 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2009 VL 51 IS 5 BP 412 EP 424 DI 10.1016/j.specom.2009.01.001 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 429TI UT WOS:000264942600002 ER PT J AU Maier, A Haderlein, T Eysholdt, U Rosanowski, F Batliner, A Schuster, M Noth, E AF Maier, A. Haderlein, T. Eysholdt, U. Rosanowski, F. Batliner, A. Schuster, M. Noeth, E. TI PEAKS - A system for the automatic evaluation of voice and speech disorders SO SPEECH COMMUNICATION LA English DT Article DE Speech intelligibility; Speech and voice disorders; Automatic evaluation of speech and voice pathologies ID FOREARM FREE-FLAP; OROPHARYNGEAL CANCER; PARTIAL GLOSSECTOMY; PHARYNGEAL FLAP; CLEFT-PALATE; ORAL CAVITY; RECONSTRUCTION; INTELLIGIBILITY; REHABILITATION; ESOPHAGEAL AB We present a novel system for the automatic evaluation of speech and voice disorders. The system can be accessed via the internet platform-independently. The patient reads a text or names pictures. His or her speech is then analyzed by automatic speech recognition and prosodic analysis. For patients who had their larynx removed due to cancer and for children with cleft lip and palate we show that we can achieve significant correlations between the automatic analysis and the judgment of human experts in it leave-one-out experiment (p < .001). A correlation of .90 for the evaluation of the laryngectomees and .87 for the evaluation of the children's data was obtained. 
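The leave-one-out correlations quoted above correspond to the generic protocol sketched below in Python; the plain least-squares regressor is a stand-in assumption, since the system's actual mapping from recognition and prosodic features to expert scores is more elaborate.

```python
import numpy as np

def loo_correlation(X, y):
    # Leave-one-out protocol: fit a simple least-squares regressor on all
    # speakers but one, predict the held-out speaker, and report the Pearson
    # correlation between predictions and expert scores.
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        A = np.column_stack([X[mask], np.ones(mask.sum())])   # add a bias term
        w, *_ = np.linalg.lstsq(A, y[mask], rcond=None)
        preds[i] = np.append(X[i], 1.0) @ w
    return float(np.corrcoef(preds, y)[0, 1])
```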
This is comparable to human inter-rater correlations. (C) 2009 Elsevier B.V. All rights reserved. C1 [Maier, A.; Haderlein, T.; Eysholdt, U.; Rosanowski, F.; Schuster, M.] Univ Erlangen Nurnberg, Abt Phoniatrie & Padaudiol, D-91054 Erlangen, Germany. [Maier, A.; Haderlein, T.; Batliner, A.; Noeth, E.] Univ Erlangen Nurnberg, Lehrstuhl Mustererkennung, D-91508 Erlangen, Germany. RP Maier, A (reprint author), Univ Erlangen Nurnberg, Abt Phoniatrie & Padaudiol, Bohlenpl 21, D-91054 Erlangen, Germany. EM Andreas.Maier@informatik.uni-erlangen.de FU Deutsche Krebshilfe [106266]; Deutsche Forschungsgemeinschaft [SCHU2320/1-1] FX This work was funded by the German Cancer Aid (Deutsche Krebshilfe) under Grant 106266 and the German Research Foundation (Deutsche Forschungsgemeinschaft) under Grant SCHU2320/1-1. The responsibility for the content of this paper lies with the authors. The authors would like to thank both anonymous reviewers of this document for their beneficial suggestions and comments. CR ADELHARDT J, 2003, LECT NOTES COMPUTER, P591 Bagshaw P., 1993, P EUR C SPEECH COMM, P1003 Batliner A, 2003, SPEECH COMMUN, V40, P117, DOI 10.1016/S0167-6393(02)00079-1 BATLINER A, 1995, NATO ASI SERIES F, P325 BATLINER A, 2001, P EUR C SPEECH COMM, V4, P2781 BATLINER A, 2000, VERBMOBIL FDN SPEECH, P106 BATLINER A, 2003, P EUR C SPEECH COMM, V1, P733 Batliner A., 1999, P 14 INT C PHON SCI, V3, P2315 Bellandese MH, 2001, J SPEECH LANG HEAR R, V44, P1315, DOI 10.1044/1092-4388(2001/102) BODIN IKH, 1994, CLIN OTOLARYNGOL, V19, P28, DOI 10.1111/j.1365-2273.1994.tb01143.x Bressmann T, 2004, J ORAL MAXIL SURG, V62, P298, DOI 10.1016/j.joms.2003.04.017 Brown DH, 2003, WORLD J SURG, V27, P824, DOI 10.1007/s00268-003-7107-4 Brown JS, 1997, HEAD NECK-J SCI SPEC, V19, P524, DOI 10.1002/(SICI)1097-0347(199709)19:6<524::AID-HED10>3.0.CO;2-5 Cohen J., 1983, APPL MULTIPLE REGRES, V2nd Courrieu P., 2005, NEURAL INFORM PROCES, V8, P25 ENDERBY PM, 2004, FRENCHAY DYSRARTHRIE Flannery B. P., 1992, NUMERICAL RECIPES C FOX AV, 2002, PLAKSS PSYCHOLINGUIS Furia CLB, 2001, ARCH OTOLARYNGOL, V127, P877 Gales M. J. F., 1996, P ICSLP 96 PHIL US, V3, P1832, DOI 10.1109/ICSLP.1996.607987 GALLWITZ F, 2002, STUDIEN MUSTERERKENN, V6 Hacker C, 2006, LECT NOTES ARTIF INT, V4188, P581 Hall M.A., 1998, THESIS U WAIKATO HAM Harding A, 1998, INT J LANG COMM DIS, V33, P329 Haughey BH, 2002, ARCH OTOLARYNGOL, V128, P1388 Henningsson G, 2008, CLEFT PALATE-CRAN J, V45, P1, DOI 10.1597/06-086.1 HUBER R, 2002, STUDIEN MUSTERERKENN, V8 Keuning KHD, 1999, CLEFT PALATE-CRAN J, V36, P328, DOI 10.1597/1545-1569(1999)036<0328:TIROTP>2.3.CO;2 KIESSLING A, 1997, EXTRAKTION KLASSIFKA Knuuttila H, 1999, ACTA OTO-LARYNGOL, V119, P621 Kuttner C, 2003, HNO, V51, P151, DOI 10.1007/s00106-002-0708-7 Mady K, 2003, CLIN LINGUIST PHONET, V17, P411, DOI 10.1080/0269920031000079921 MAHANNA GK, 1998, PROSTHET DENT, V79, P310 Maier A, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P1757 Maier A, 2006, LECT NOTES ARTIF INT, V4188, P431 Markkanen-Leppanen M, 2006, ORAL ONCOL, V42, P646, DOI 10.1016/j.oraloncology.2005.11.004 Millard T, 2001, CLEFT PALATE-CRAN J, V38, P68, DOI 10.1597/1545-1569(2001)038<0068:DCCFAA>2.0.CO;2 Moore E. 
H, 1920, B AM MATH SOC, V26, P394 Paal Sonja, 2005, J Orofac Orthop, V66, P270, DOI 10.1007/s00056-005-0427-2 Panchal J, 1996, BRIT J PLAST SURG, V49, P363, DOI 10.1016/S0007-1226(96)90004-1 Pauloski BR, 1998, OTOLARYNG HEAD NECK, V118, P616, DOI 10.1177/019459989811800509 Pauloski BR, 1998, LARYNGOSCOPE, V108, P908, DOI 10.1097/00005537-199806000-00022 Penrose R., 1955, P CAMBRIDGE PHILOS S, P406, DOI DOI 10.1017/S0305004100030401 Riedhammer K., 2007, P AUT SPEECH REC UND, P717 ROBBINS J, 1984, J SPEECH HEAR DISORD, V49, P202 ROBBINS KT, 1987, ARCH OTOLARYNGOL, V113, P1214 Rosanowski Frank, 2002, Facial Plast Surg, V18, P197, DOI 10.1055/s-2002-33066 Ruben RJ, 2000, LARYNGOSCOPE, V110, P241, DOI 10.1097/00005537-200002010-00010 Scholkopf B., 1997, THESIS TU BERLIN SCHONWEILER R, 1994, HNO, V42, P691 Schonweiler R, 1999, INT J PEDIATR OTORHI, V50, P205, DOI 10.1016/S0165-5876(99)00243-8 SCHUKATTALAMAZZ.E, 1993, P EUR C SPEECH COMM, V1, P129 Schutte HK, 2002, FOLIA PHONIATR LOGO, V54, P8, DOI 10.1159/000048592 Seikaly H, 2003, LARYNGOSCOPE, V113, P897, DOI 10.1097/00005537-200305000-00023 Smola A., 1998, NC2TR1998030 ROYAL H Stemmer G., 2003, P EUR C SPEECH COMM, P1313 Stemmer G., 2005, STUDIEN MUSTERERKENN, V19 Su WF, 2003, OTOLARYNG HEAD NECK, V128, P412, DOI 10.1067/mhn.2003.38 Terai H, 2004, BRIT J ORAL MAX SURG, V42, P190, DOI 10.1016/j.bjoms.2004.02.007 V. Clark SAS Institute, 2004, SAS STAT 9 1 USERS G Wahlster W., 2000, VERBMOBIL FDN SPEECH Wantia Nina, 2002, Facial Plast Surg, V18, P147, DOI 10.1055/s-2002-33061 Witten I.H., 2005, DATA MINING PRACTICA ZECEVIC A, 2002, THESIS U MANNHEIM GE NR 64 TC 34 Z9 34 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2009 VL 51 IS 5 BP 425 EP 437 DI 10.1016/j.specom.2009.01.004 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 429TI UT WOS:000264942600003 ER PT J AU Jancovic, P Kokuer, M AF Jancovic, Peter Koekueer, Muenevver TI Incorporating the voicing information into HMM-based automatic speech recognition in noisy environments SO SPEECH COMMUNICATION LA English DT Article DE Source-filter model; Voicing estimation; HMM; Automatic speech recognition; Voicing modelling; Noise robustness; Missing-feature; Aurora 2; Phoneme recognition AB In this paper, we propose a model for the incorporation of voicing information into a speech recognition system in noisy environments. The employed voicing information is estimated by a novel method that can provide this information for each filter-bank channel and does not require information about the fundamental frequency. The voicing information is modelled by employing the Bernoulli distribution. The voicing model is obtained for each HMM state and mixture by a Viterbi-style training procedure. The proposed voicing incorporation is evaluated both within a standard model and two other models that had compensated for the noise effect, the missing-feature and the multi-conditional training model. Experiments are first performed on noisy speech data from the Aurora 2 database. Significant performance improvements are achieved when the voicing information is incorporated within the standard model as well as the noise-compensated models. The employment of voicing information is also demonstrated on a phoneme recognition task on the noise-corrupted TIMIT database and considerable improvements are observed. (C) 2009 Elsevier B.V. All rights reserved. 
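A minimal sketch of the scoring idea described in the abstract above: per-channel binary voicing decisions evaluated under a per-state Bernoulli model and added to the acoustic score. The stream weight and parameter names are illustrative, not the paper's notation.

```python
import numpy as np

def voicing_log_likelihood(voicing_bits, bernoulli_p, stream_weight=1.0):
    # Log-likelihood of per-channel binary voicing decisions under the Bernoulli
    # parameters attached to one HMM state/mixture; added, optionally weighted,
    # to the usual cepstral log-likelihood of that mixture.
    v = np.asarray(voicing_bits, dtype=float)          # 1 = channel judged voiced
    p = np.clip(np.asarray(bernoulli_p, dtype=float), 1e-6, 1.0 - 1e-6)
    return stream_weight * float(np.sum(v * np.log(p) + (1.0 - v) * np.log(1.0 - p)))
```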
C1 [Jancovic, Peter; Koekueer, Muenevver] Univ Birmingham, Sch Elect Engn & Comp Engn, Birmingham B15 2TT, W Midlands, England. RP Jancovic, P (reprint author), Univ Birmingham, Sch Elect Engn & Comp Engn, Pritchatts Rd, Birmingham B15 2TT, W Midlands, England. EM p.jancovic@bham.ac.uk; m.kokuer@bham.ac.uk FU UK EPSRC [EP/D033659/1, EP/F036132/1] FX This work was supported by UK EPSRC Grants EP/D033659/1 and EP/F036132/1. CR BEAUFAYS F, 2003, USING SPEECH NONSPEE, P424 Cooke M, 2001, SPEECH COMMUN, V34, P267, DOI 10.1016/S0167-6393(00)00034-0 DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 Fant G., 1960, ACOUSTIC THEORY SPEE Garofolo J. S., 1993, DARPA TIMIT ACOUSTIC GRACIARENA M, 2004, P ICASSP MONTR, V1, P921 HIRSCH HG, 2000, AURORA EXPT FRAMEWOR HUANG HCH, 2000, PITCH TRACKING TONE, P1523 Ishizuka K, 2006, J ACOUST SOC AM, V120, P443, DOI 10.1121/1.2205131 Russell MJ, 2005, COMPUT SPEECH LANG, V19, P205, DOI 10.1016/j.csl.2004.08.001 JACKSON PJB, 2003, COVARIATION WEIGHTIN, P2321 Jancovic P, 2007, IEEE SIGNAL PROC LET, V14, P66, DOI 10.1109/LSP.2006.881517 JANCOVIC P, 2002, COMBINING UNION MODE, P69 JANCOVIC P, 2007, IEEE WORKSH AUT SPEE, P42 KITAOKA N, 2002, SPEAKER INDEPENDENT, P2125 LARSON M, 2001, P WORKSH LANG MOD IN LJOLJE A, 2002, SPEECH RECOGNITION U, P2137 Nadeu C, 2001, SPEECH COMMUN, V34, P93, DOI 10.1016/S0167-6393(00)00048-0 Niyogi P, 2003, SPEECH COMMUN, V41, P349, DOI 10.1016/S0167-6393(02)00151-6 OSHAUGHNESSY D, 1999, ROBUST FAST CONTINUO, P413 RABINER LR, 1976, IEEE T ACOUST SPEECH, V24, P170, DOI 10.1109/TASSP.1976.1162794 Thomson DL, 2002, SPEECH COMMUN, V37, P197, DOI 10.1016/S0167-6393(01)00011-5 Young S., 1999, HTK BOOK V2 2 Zissman MA, 2001, SPEECH COMMUN, V35, P115, DOI 10.1016/S0167-6393(00)00099-6 ZOLNAY A, 2003, EUROSPEECH GENEVA SW, P497 NR 25 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2009 VL 51 IS 5 BP 438 EP 451 DI 10.1016/j.specom.2009.01.003 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 429TI UT WOS:000264942600004 ER PT J AU Diaz, FC van Santen, J Banga, ER AF Campillo Diaz, Francisco van Santen, Jan Rodriguez Banga, Eduardo TI Integrating phrasing and intonation modelling using syntactic and morphosyntactic information SO SPEECH COMMUNICATION LA English DT Article DE Intonation modelling; Unit selection; Corpus-based; Syntax; POS; Phrasing ID SPEECH AB This paper focuses on the relationship between intonation and syntactic and morphosyntactic information. Although intonation and syntax are both related to dependencies between the different parts of a sentence, and are therefore related to meaning, the precise influence of grammar on intonation is not clear. We describe a novel method that uses syntactic and part-of-speech features in the framework of corpus-based intonation modelling, and which integrates part of the phrasing algorithm in the unit selection stage. Subjective tests confirm an improvement in the quality of the resulting synthetic intonation: 75% of the sentences synthesised with the new intonation model were considered to be better or much better than the sentences synthesised using the old model, while only 7.5% of sentences were rated as worse or much worse. (C) 2009 Elsevier B.V. All rights reserved. 
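As one way to picture how syntactic and part-of-speech features enter corpus-based intonation selection, the Python fragment below scores candidate contours with a simple mismatch cost; the feature names and weights are purely illustrative and do not reproduce the cost function used in the paper.

```python
def intonation_target_cost(candidate, target, weights):
    # Toy target cost for intonation unit selection: each mismatch in a
    # symbolic feature (POS, syntactic context, position in phrase, ...)
    # adds its weight to the cost of reusing the candidate contour.
    return sum(w for feat, w in weights.items()
               if candidate.get(feat) != target.get(feat))

# e.g. weights = {"pos": 1.0, "syntactic_function": 1.5, "position_in_phrase": 0.5}
```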
C1 [Campillo Diaz, Francisco; Rodriguez Banga, Eduardo] Univ Vigo, Dpto Teoria Serial & Comunicac, ETSI Telecomunicac, Vigo 36200, Pontevedra, Spain. [van Santen, Jan] Oregon Hlth & Sci Univ, Ctr Spoken Language Understanding, OGI Sch Sci & Engn, Beaverton, OR 97006 USA. RP Diaz, FC (reprint author), Univ Vigo, Dpto Teoria Serial & Comunicac, ETSI Telecomunicac, Campus Univ, Vigo 36200, Pontevedra, Spain. EM campillo@gts.tsc.uvigo; vansanten@ogi.cslu.edu; erbanga@gts.tsc.uvigo.es RI Rodriguez Banga, Eduardo/C-4296-2011 FU NSF [0205731]; MEC [TEC200613694-C03-03] FX The work reported here was carried out while the first author was a visiting post-doctoral researcher at the Center for Spoken Language Understanding, with funding from the Xunta de Galicia "Isidro Parga Pondal" research programme and PGIDIT05TIC32202-PR programme, and with support from NSF Grant 0205731, "ITR: Prosody Generation for Child Oriented Speech Synthesis" (PI Jan van Santen), and MEC under the Project TEC200613694-C03-03. Thanks also to Paul Hosom and Raychel Moldover, for their comments and suggestions on the paper. CR ABNEY S, 1992, P SPEECH NAT LANG WO, P425, DOI 10.3115/1075527.1075629 BLACK A, 1995, P EUR MADR SPAIN, V1, P581 Black A. W., 1999, FESTIVAL SPEECH SYNT Campillo F, 2008, ELECTRON LETT, V44, P501, DOI 10.1049/el:20083276 CAMPILLO F, 2006, JORNADAS TECNOLOGIAS, P167 CAMPILLO F, 2006, P ICSLP PITTSB, P2362 Campillo F. D., 2006, SPEECH COMMUN, V48, P941, DOI 10.1016/j.specom.2005.12.004 DIMPERIO M, 2003, PROSODIES ESCUDERO D, 2002, THESIS U VALLADOLID GARRIDO JM, 1996, THESIS U BARCELONA E Grosz B. J., 1986, Computational Linguistics, V12 HERNAEZ I, 2001, P 4 ISCA TUT RES WOR, P151 Hirschberg J, 1996, SPEECH COMMUN, V18, P281, DOI 10.1016/0167-6393(96)00017-9 Hunt A., 1996, P INT C AC SPEECH SI, V1, P373 Koehn P., 2000, P IEEE INT C AC SPEE, V3, P1289 Ladd D. R., 1986, PHONOLOGY YB, V3, P311, DOI 10.1017/S0952675700000671 Ladd R., 1996, INTONATIONAL PHONOLO Ladd R.D., 1988, J ACOUST SOC AM, V84, P530 MENDEZ F, 2003, PROCESAMIENTO LENGUA, V31, P159 MOEBIUS B, 1999, COMPUT SPEECH LANG, V13, P319 Navarro T, 1977, MANUAL PRONUNCIACION NAVAS E, 2003, THESIS U PAIS VASCO Ostendorf M., 1994, Computational Linguistics, V20 PIERREHUMBERT J, 1990, SYS DEV FDN, P271 Pierrehumbert J. B., 1986, PHONOLOGY YB, V3, P15 PREVOST S, 1993, P 6 C EUR CHAPT ASS, P332, DOI 10.3115/976744.976783 RAUX A, 2003, ASRU STEEDMAN M, 1990, M ASS COMP LING, P9 TAYLOR P, 2000, CONCEPT TO SPEECH SY Taylor P, 1998, COMPUT SPEECH LANG, V12, P99, DOI 10.1006/csla.1998.0041 van Santen J., 1999, INTONATION ANAL MODE, P269 NR 31 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2009 VL 51 IS 5 BP 452 EP 465 DI 10.1016/j.specom.2009.01.007 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 429TI UT WOS:000264942600005 ER PT J AU Lee, C Jung, S Kim, S Lee, GG AF Lee, Cheongjae Jung, Sangkeun Kim, Seokhwan Lee, Gary Geunbae TI Example-based dialog modeling for practical multi-domain dialog system SO SPEECH COMMUNICATION LA English DT Article DE Example-based dialog modeling; Generic dialog modeling; Multi-domain dialog system; Domain identification ID STRATEGIES; SIMULATION AB This paper proposes a generic dialog modeling framework for a multi-domain dialog system to simultaneously manage goal-oriented and chat dialogs for both information access and entertainment. 
We developed a dialog modeling technique using an example-based approach to implement multiple applications such as car navigation, weather information, TV program guidance, and chatbot. Example-based dialog modeling (EBDM) is a simple and effective method for prototyping and deploying of various dialog systems. This paper also introduces the system architecture of multi-domain dialog systems using the EBDM framework and the domain spotting technique. In our experiments, we evaluate our system using both simulated and real users. We expect that our approach can support flexible management of multi-domain dialogs on the same framework. (C) 2009 Elsevier B.V. All rights reserved. C1 [Lee, Cheongjae; Jung, Sangkeun; Kim, Seokhwan; Lee, Gary Geunbae] Pohang Univ Sci & Technol, Dept Comp Sci & Engn, Pohang 790784, South Korea. RP Lee, C (reprint author), Pohang Univ Sci & Technol, Dept Comp Sci & Engn, San 31, Pohang 790784, South Korea. EM lcj80@postech.ac.kr FU Ministry of Knowledge Economy (MKE) [RTI04-02-06] FX This work-was supported by Grant No. RTI04-02-06 from the Regional Technology Innovation Program and by the Intelligent Robotics Development Program, one of the 21st Century Frontier R&D Programs funded by the Ministry of Knowledge Economy (MKE). CR Allen J., 2000, NAT LANG ENG, V6, P1 Berger AL, 1996, COMPUT LINGUIST, V22, P39 BOHUS B, 2003, P EUR C SPEECH COMM, P597 BUI TH, 2004, P TSD 2004 BRNO CZEC, P579 CHELBA C, 2003, P IEEE INT C AC SPEE, P69 Dietterich TG, 1998, NEURAL COMPUT, V10, P1895, DOI 10.1162/089976698300017197 EUN J, 2005, P EUROSPEECH, P3441 Georgila K., 2005, P 9 EUR C SPEECH COM, P893 Gorin AL, 1997, SPEECH COMMUN, V23, P113, DOI 10.1016/S0167-6393(97)00040-X HENDERSON J, 2005, P WORKSH KNOWL REAS Hurtado L. F., 2005, P IEEE WORKSH AUT SP, P226 Inui M., 2001, P IEEE INT C SYST MA, P193 JENKINS MC, 2007, P INT C HUM COMP INT, P76 JUNG S, 2008, P WORKSH SPEECH PROC, P9 KOMATANI K, 2006, P 7 SIGDIAL WORKSH D, P9, DOI 10.3115/1654595.1654598 LAMEL L, 1999, P IEEE INT C AC SPEE, P501 Larsson S., 2000, NAT LANG ENG, V6, P323, DOI [DOI 10.1017/S1351324900002539, 10.1017 S1351324900002539] LARSSON S, 2002, DEMO ABSTRACT ASS CO, P104 LEE CJ, 2006, P IEEE INT C AC SPEE, P69 Lemon O., 2002, P 3 SIGDIAL WORKSH D, P113 Lemon O., 2006, P 11 C EUR CHAPT ASS, P119, DOI 10.3115/1608974.1608986 LESH N, 2001, P 9 INT C US MOD, P63 Levin E, 2000, IEEE T SPEECH AUDI P, V8, P11, DOI 10.1109/89.817450 Lopez-Cozar R, 2003, SPEECH COMMUN, V40, P387, DOI [10.1016/S0167-6393(02)00126-7, 10.1016/S0167-6393902)00126-7] McTear M., 1998, P INT C SPOK LANG PR, V4, P1223 Minker W, 2004, SPEECH COMMUN, V43, P89, DOI 10.1016/j.specom.2004.01.005 Murao H., 2003, P SIGDIAL WORKSH DIS, P140 Nagao M., 1984, P INT NATO S ART HUM, P173 O'Neill I, 2005, SCI COMPUT PROGRAM, V54, P99, DOI 10.1016/j.scico.2004.05.006 PAEK T, 2006, P WORKSH DIAL INT C PAKUCS B, 2003, P 8 EUR C SPEECH COM, P741 Papineni K., 2002, P 40 ANN M ASS COMP, P311 Peckham J., 1993, P 3 EUR C SPEECH COM, P33 Polifroni J., 2001, P EUR C SPEECH COMM, P1371 RICH C, 1998, J USER MODEL USER AD, V8, P315 SALTON G, 1973, J DOC, V29, P351, DOI 10.1108/eb026562 Schatzmann J, 2005, P 6 SIGDIAL WORKSH D, P45 SHIN J, 2002, P INT C SPOK LANG PR, P2069 THOMSON B, 2008, P IEEE INT C AC SPEE, P4937 Walker Marilyn A, 1997, P 35 ANN M ASS COMP, P271 Watanabe T, 1998, IEICE T INF SYST, VE81D, P1025 WENG F, 2006, P INT C SPOK LANG PR, P1061 WILLIAMS JD, 2005, P IEEE WORKSH AUT SP, P250 Williams JD, 2007, COMPUT SPEECH LANG, V21, P393, DOI 
10.1016/j.csl.2006.06.008 YOUNG S, 2007, P IEEE INT C AC SPEE, P149 Zue V, 2000, IEEE T SPEECH AUDI P, V8, P85, DOI 10.1109/89.817460 NR 46 TC 14 Z9 14 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2009 VL 51 IS 5 BP 466 EP 484 DI 10.1016/j.specom.2009.01.008 PG 19 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 429TI UT WOS:000264942600006 ER PT J AU Ozimek, E Kutzner, D Sek, A Wicher, A AF Ozimek, Edward Kutzner, Dariusz Sek, Aleksander Wicher, Andrzej TI Development and evaluation of Polish digit triplet test for auditory screening SO SPEECH COMMUNICATION LA English DT Article DE Speech intelligibility; Intelligibility function; Digit triplet; Speech-reception-threshold; Auditory screening ID SPEECH RECEPTION THRESHOLDS; SENTENCE MATERIALS; NOISE; HEARING; INTELLIGIBILITY; RECOGNITION AB The objective of this study was to develop and evaluate the Polish digit triplet test for speech intelligibility screening. The first part of the paper deals with the preparation of the speech material, the recording procedure and a listening experiment. In this part, triplet-specific intelligibility functions for 160 different digit complexes were determined and 100 'optimal' triplets were selected. Subsequently, four statistically balanced lists, each containing 25 different digit triplets, were developed. The speech material was phonemically equalized across the lists. The mean SRT and mean list-specific slope S-50 for the Polish test are -9.4 dB and 19.4%/dB, respectively, and are very similar to the data characterizing the German digit triplet test. The second part describes the results of the verification experiments in which reliability of the developed test was analyzed. The retest measurements were carried out by means of the standard constant stimuli paradigm and the adaptive procedure. It was found that mean SRT obtained with retest study was within the limits of standard deviation, in agreement with those obtained in the basic experiment. (C) 2008 Elsevier B.V. All rights reserved. C1 [Ozimek, Edward; Kutzner, Dariusz; Sek, Aleksander; Wicher, Andrzej] Adam Mickiewicz Univ, Inst Acoust, PL-61614 Poznan, Poland. RP Ozimek, E (reprint author), Adam Mickiewicz Univ, Inst Acoust, Ul Umultowska 85, PL-61614 Poznan, Poland. EM ozimaku@amu.edu.pl FU European Union FP6 [004171]; State Ministry of Education and Science FX This work was supported by the grant from the European Union FP6, Project 004171 HearCom and the State Ministry of Education and Science. 
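For reference, the mean SRT and slope values quoted in the abstract above can be placed on a standard logistic intelligibility function, sketched below in Python; the logistic form and the 4-times-slope parameterisation are a common modelling assumption, not necessarily the exact psychometric model fitted in the study.

```python
import numpy as np

def intelligibility(snr_db, srt_db=-9.4, slope_at_srt=0.194):
    # Logistic intelligibility function: 50% correct at the SRT, with the given
    # slope (fraction per dB) at that point. The defaults are the mean values
    # reported for the Polish digit triplet test.
    return 1.0 / (1.0 + np.exp(-4.0 * slope_at_srt * (np.asarray(snr_db) - srt_db)))

# example: predicted proportion of correct triplets at -6 dB SNR
p_correct = intelligibility(-6.0)
```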
CR Bellis TJ, 1996, ASSESSMENT MANAGEMEN BRACHMANSKI S., 1999, SPEECH LANGUAGE TECH, V3, P71 BROADBENT DE, 1954, J EXP PSYCHOL, V47, P191, DOI 10.1037/h0054182 ELBERLING C, 1989, SCAND AUDIOL, V18, P175 Fletcher H., 1929, SPEECH HEARING GROCHOLEWSKI S, 2001, STATYSTYCZNE PODSTAW Hall S, 2006, THESIS U SOUTHAMPTON Kaernbach C, 2001, PERCEPT PSYCHOPHYS, V63, P1389, DOI 10.3758/BF03194550 KALIKOW DN, 1977, J ACOUST SOC AM, V61, P1337, DOI 10.1121/1.381436 Klein SA, 2001, PERCEPT PSYCHOPHYS, V63, P1421, DOI 10.3758/BF03194552 Kollmeier B., 1997, J ACOUST SOC AM, V102, P1085 Kollmeier B., 1990, MESSMETODIK MODELLIE MILLER GA, 1951, J EXP PSYCHOL, V41, P329, DOI 10.1037/h0062491 NILSSON M, 1994, J ACOUST SOC AM, V95, P1085, DOI 10.1121/1.408469 Ozimek E., 2006, ARCH ACOUST, V31, P431 PLOMP R, 1979, AUDIOLOGY, V18, P43 Pruszewicz A, 1994, Otolaryngol Pol, V48, P50 Pruszewicz A, 1994, Otolaryngol Pol, V48, P56 Ramkissoon Ishara, 2002, Am J Audiol, V11, P23, DOI 10.1044/1059-0889(2002/005) RUDMIN F, 1987, Journal of Auditory Research, V27, P15 SCHMIDTNIELSEN A, 1989, J ACOUST SOC AM, V86, pS76, DOI 10.1121/1.2027645 Smits C, 2006, EAR HEARING, V27, P538, DOI 10.1097/01.aud.0000233917.72551.cf Smits C, 2004, INT J AUDIOL, V43, P15, DOI 10.1080/14992020400050004 Smits C, 2007, INT J AUDIOL, V46, P134, DOI 10.1080/14992020601102170 STROUSE A, 2000, J REHABIL RES DEV, V37, P599 Versfeld NJ, 2000, J ACOUST SOC AM, V107, P1671, DOI 10.1121/1.428451 Wagener K, 1999, Z AUDIOL, V38, P44 Wagener K, 2003, INT J AUDIOL, V42, P10, DOI 10.3109/14992020309056080 Wagener K, 1999, Z AUDIOL, V38, P4 Wagener K., 1999, Z AUDIOL, V38, P86 WAGENER K, 2005, ZIFFER TRIPEL TEST S Wilson Richard H., 2004, Seminars in Hearing, V25, P93 Wilson RH, 2005, J REHABIL RES DEV, V42, P499, DOI 10.1682/JRRD.2004.10.0134 NR 33 TC 10 Z9 10 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2009 VL 51 IS 4 BP 307 EP 316 DI 10.1016/j.specom.2008.09.007 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 421KB UT WOS:000264356700001 ER PT J AU Keshet, J Grangier, D Bengio, S AF Keshet, Joseph Grangier, David Bengio, Samy TI Discriminative keyword spotting SO SPEECH COMMUNICATION LA English DT Article DE Keyword spotting; Spoken term detection; Speech recognition; Large margin and kernel methods; Support vector machines; Discriminative models AB This paper proposes a new approach for keyword spotting, which is based on large margin and kernel methods rather than on HMMs. Unlike previous approaches, the proposed method employs a discriminative learning procedure, in which the learning phase aims at achieving a high area under the ROC curve, as this quantity is the most common measure to evaluate keyword spotters. The keyword spotter we devise is based on mapping the input acoustic representation of the speech utterance along with the target keyword into a vector-space. Building on techniques used for large margin and kernel methods for predicting whole sequences, our keyword spotter distills to a classifier in this vector-space, which separates speech utterances in which the keyword is uttered from speech utterances in which the keyword is not uttered. We describe a simple iterative algorithm for training the keyword spotter and discuss its formal properties, showing theoretically that it attains high area under the ROC curve.
Experiments on read speech with the TIMIT corpus show that the resulting discriminative system outperforms the conventional context-independent HMM-based system. Further experiments using the TIMIT trained model, but tested on both read (HTIMIT, WSJ) and spontaneous speech (OGI Stories), show that without further training or adaptation to the new corpus our discriminative system outperforms the conventional context-independent HMM-based system. (C) 2008 Elsevier B.V. All rights reserved. C1 [Keshet, Joseph] IDIAP Res Inst, CH-1920 Martigny, Switzerland. [Grangier, David] NEC Labs Amer, Princeton, NJ 08540 USA. [Bengio, Samy] Google Inc, Mountain View, CA 94043 USA. RP Keshet, J (reprint author), IDIAP Res Inst, Rue Marconi 19, CH-1920 Martigny, Switzerland. EM jkeshet@idiap.ch; dgrangier@nec-labs.com; bengio@google.com CR Bahl L., 1986, P INT C AC SPEECH SI, V11, P49, DOI 10.1109/ICASSP.1986.1169179 Benayed Y., 2004, P INT C AUD SPEECH S, P588 BENGIO S, 2005, P 22 INT C MACH LEAR BOURLARD H, 1994, P IEEE INT C AC SPEE, P373 Cardillo P. S., 2002, International Journal of Speech Technology, V5, DOI 10.1023/A:1013670312989 Cesa-Bianchi N, 2004, IEEE T INFORM THEORY, V50, P2050, DOI 10.1109/TIT.2004.833339 COLLOBERT R, 2002, 46 IDIAPRR CORTES C, 1995, MACH LEARN, V20, P273, DOI 10.1023/A:1022627411411 Cortes C., 2004, ADV NEURAL INFORM PR, V17, P2004 CRAMMER K, 2006, J MACHINE LEARN RES, V7 Cristianini N., 2000, INTRO SUPPORT VECTOR Dekel O, 2004, WORKSH MULT INT REL, P146 Fu Q., 2007, IEEE WORKSH AUT SPEE, P278 Juang BH, 1997, IEEE T SPEECH AUDI P, V5, P257 Junkawitsch J., 1997, P EUR C SPEECH COMM, P259 KESHET J, 2006, P INT Keshet J, 2007, IEEE T AUDIO SPEECH, V15, P2373, DOI 10.1109/TASL.2007.903928 KESHET J, 2001, P 7 EUR C SPEECH COM, P1637 KETABDAR H, 2006, P INT MANOS AS, 1997, P ICASSP 97, P899 Paul D. B., 1992, P INT C SPOK LANG PR Platt J., 1998, ADV KERNEL METHODS S Rabiner L, 1993, FUNDAMENTALS SPEECH RAHIM M, 1997, IEEE T SPEECH AUDIO, P266 REYNOLDS DA, 1997, P INT C AC SPEECH SI, P1535 Rohlicek J., 1989, P IEEE INT C AC SPEE, P627 Rohlicek J.R., 1993, P 1993 IEEE INT C AC, pII459 Rosevear R. D., 1990, Power Technology International Salomon J., 2002, P 7 INT C SPOK LANG, P2645 SHALEVSHWARTZ S, 2004, P 5 INT C MUS INF RE Silaghi MC, 1999, P IEEE AUT SPEECH RE, P213 Szoke I., 2005, P JOINT WORKSH MULT TASKAR B, 2003, ADV NEURAL INFORM PR, V17 Vapnik V, 1998, STAT LEARNING THEORY Weintraub M., 1995, P INT C AUD SPEECH S, P129 NR 35 TC 28 Z9 28 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2009 VL 51 IS 4 BP 317 EP 329 DI 10.1016/j.specom.2008.10.002 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 421KB UT WOS:000264356700002 ER PT J AU Chomphan, S Kobayashi, T AF Chomphan, Suphattharachal Kobayashi, Takao TI Tone correctness improvement in speaker-independent average-voice-based Thai speech synthesis SO SPEECH COMMUNICATION LA English DT Article DE Tone correctness; Tone neutralization; Average-voice; Hidden Markov models; Speech synthesis ID MANDARIN; MODEL AB A novel approach to the context-clustering process in a speaker-independent HMM-based Thai speech synthesis is addressed in this paper. Improvements to the tone correctness (i.e., tone intelligibility) of the average-voice and also the speaker-adapted voice were our main objectives.
To treat the problem of tone neutralization, we incorporated a number of tonal features called tone-geometrical and phrase-intonation features into the context-clustering process of the HMM training stage. We carried out subjective and objective evaluations of both the average voice and adapted voice in terms of the intelligibility of tone and the logarithmic fundamental frequency (F0) error in our experiments. The effects on the decision trees of the extracted features were also evaluated. Several speech-model scenarios including male/female and gender-dependent/gender-independent were implemented to confirm the effectiveness of the proposed approach. The results of subjective tests revealed that the proposed tonal features could improve the intelligibility of tones for all speech-model scenarios. The objective tests also yielded results corresponding to those of the subjective tests. The experimental results from both the subjective and objective evaluations confirmed that the proposed tonal features could alleviate the problem of tone neutralization; as a result, the tone correctness of synthesized speech was significantly improved. Crown Copyright (C) 2008 Published by Elsevier B.V. All rights reserved. C1 [Chomphan, Suphattharachal] Kasetsart Univ, Fac Engn Si Racha, Elect Engn Div, Si Racha 20230, Chonburi, Thailand. [Chomphan, Suphattharachal; Kobayashi, Takao] Tokyo Inst Technol, Interdisciplinary Grad Sch Sci & Engn, Midori Ku, Yokohama, Kanagawa 2268502, Japan. RP Chomphan, S (reprint author), Kasetsart Univ, Fac Engn Si Racha, Elect Engn Div, Si Racha 20230, Chonburi, Thailand. EM suphattharachai@ip.titech.ac.jp; takao.kobayashi@ip.titech.ac.jp CR ABRAMSON AS, 1979, INT C PHON SCI, P380 Anastasakos T, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P1137 Chen YQ, 2002, J INTELL INF SYST, V19, P95, DOI 10.1023/A:1015568521453 CHOMPHAN S, 2007, P INTERSPEECH 2007, P2849 Chomphan S, 2008, SPEECH COMMUN, V50, P392, DOI 10.1016/j.specom.2007.12.002 CHOMPHAN S, 2007, 6 ISCA WORKSH SPEECH, P160 Fujisaki H., 1971, J ACOUST SOC JPN, V57, P445 Fujisaki H., 1984, J ACOUST SOC JPN ASJ, V5, P133 Fujisaki H, 1998, ICSP '98: 1998 FOURTH INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING, PROCEEDINGS, VOLS I AND II, P714, DOI 10.1109/ICOSP.1998.770311 FUJISAKI H, 1990, INT C SPOK LANG PROC, P841 GANDOUR J, 1994, J PHONETICS, V22, P477 HANSAKUNBUNTHEU.C, 2005, INT S NAT LANG PROC, P127 Iwasaki Shoichi, 2005, REFERENCE GRAMMAR TH KASURIYA S, 2003, JOINT INT C SNLP OR, P54 LI Y, 2004, SPEECH PROSODY, P467 LUKSANEEYANAWIN S, 1993, INT S NAT LANG PROC, P276 LUKSANEEYANAWIN S, 1992, INT S LANG LING, P75 Masuko T, 1996, INT CONF ACOUST SPEE, P389, DOI 10.1109/ICASSP.1996.541114 Mixdorff H., 2002, INT C SPOK LANG PROC, P753 Moren B, 2006, NAT LANG LINGUIST TH, V24, P113, DOI 10.1007/s11049-004-5454-y Ni JF, 2006, SPEECH COMMUN, V48, P989, DOI 10.1016/j.specom.2006.01.002 PALMER A, 1969, LANG LEARN, V19, P287, DOI 10.1111/j.1467-1770.1969.tb00469.x Riley M., 1992, TALKING MACHINES THE, P265 Russell M. J., 1985, ICASSP 85. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (Cat. No.
85CH2118-8) SAITO T, 2002, INT C SPOK LANG PROC, P165 Seresangtakul P, 2003, IEICE T INF SYST, VE86D, P2223 Shinoda K., 2000, Journal of the Acoustical Society of Japan (E), V21, DOI 10.1250/ast.21.79 SORNLERTLAMVANI.V, 1998, INT C SPEECH DAT ASS, P131 TAO J, 2006, TC STAR WORKSH SPEEC, P171 THATHONG U, 2000, INT C SPOK LANG PROC, P47 TRAN DD, 2006, INT S TON ASP LANG Yamagishi J, 2003, IEICE T FUND ELECTR, VE86A, P1956 Yamagishi J, 2003, IEEE INT C AC SPEECH, P716 YAMAGISHI J, 2002, INT C SPOK LANG PROC, P133 YAMAGISHI J, 2004, INT C SPOK LANG PROC, P1213 Yoshida T, 1998, INTERNATIONAL ELECTRON DEVICES MEETING 1998 - TECHNICAL DIGEST, P29, DOI 10.1109/IEDM.1998.746239 Zen H, 2004, INT C SPOK LANG PROC, P1393 NR 37 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2009 VL 51 IS 4 BP 330 EP 343 DI 10.1016/j.specom.2008.10.003 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 421KB UT WOS:000264356700003 ER PT J AU Sciamarella, D Artana, G AF Sciamarella, D. Artana, G. TI A water hammer analysis of pressure and flow in the voice production system SO SPEECH COMMUNICATION LA English DT Article DE Voice production; Water hammer; Speech synthesis ID VOCAL-FOLD MODEL; 2-MASS MODEL; KOROTKOFF SOUNDS; SEPARATION; DYNAMICS AB The sudden pressure rise produced by glottal closure in the subglottal tract during vocal fold oscillation causes a flow transient which can be computed as a water hammer effect in engineering. In this article, we present a basic water hammer analysis for the trachea and the supralaryngeal tract under conditions which are analogue to those operating during voice production. This approach allows predicting both, the intra-oral and intra-tracheal pressure fluctuations induced by vocal fold motion, as well as the airflow evolution throughout the phonatory system. (C) 2008 Elsevier B.V. All rights reserved. C1 [Sciamarella, D.] LIMSI, CNRS, F-91403 Orsay, France. [Artana, G.] Univ Buenos Aires, CONICET, Fac Ingn, LFD, RA-1053 Buenos Aires, DF, Argentina. RP Sciamarella, D (reprint author), LIMSI, CNRS, BP 133, F-91403 Orsay, France. EM sciamarella@limsi.fr CR Alipour F, 2004, J ACOUST SOC AM, V116, P1710, DOI 10.1121/1.1779274 Allen J, 2004, PHYSIOL MEAS, V25, P107, DOI 10.1088/0967-3334/25/1/010 BLADE EJ, 1962, D1216 NASA BURNETT GC, 2002, Patent No. 20020099541 Chang HS, 2003, J NEUROL NEUROSUR PS, V74, P344, DOI 10.1136/jnnp.74.3.344 CHUNGCHAROEN D, 1964, AM J PHYSIOL, V207, P190 Dang JW, 2004, J ACOUST SOC AM, V115, P853, DOI 10.1121/1.1639325 D'Souza A.F., 1964, ASME, V86, P589 FLETCHER NH, 1993, J ACOUST SOC AM, V93, P2172, DOI 10.1121/1.406857 Ghidaoui M. 
S., 2005, Applied Mechanics Review, V58, DOI 10.1115/1.1828050 HERMAWAN V, 2004, INT MECH ENG C EXP B HIRSCHBERG A, 2001, LECT SERIES VONKARMA, V2 ISHIZAKA K, 1976, J ACOUST SOC AM, V60, P190, DOI 10.1121/1.381064 ISHIZAKA K, 1972, AT&T TECH J, V51, P1233 JOUKOWSKY N, 1900, MEMOIRES ACAD IMPERI, V8, P5 Lous NJC, 1998, ACUSTICA, V84, P1135 Lucero JC, 1999, J ACOUST SOC AM, V105, P423, DOI 10.1121/1.424572 Miller TL, 2007, J BIOMECH, V40, P1615, DOI 10.1016/j.jbiomech.2006.07.022 Mongeau L, 1997, J ACOUST SOC AM, V102, P1121, DOI 10.1121/1.419864 PELORSON X, 1994, J ACOUST SOC AM, V96, P3416, DOI 10.1121/1.411449 Sciamarella D, 2004, ACTA ACUST UNITED AC, V90, P746 SCIAMARELLA D, 2007, EUR J MECH B-FLUID, V27, P42 Skalak R., 1956, T ASME, V78, P105 SPURK JH, 1997, FLUID MECH, P275 Streeter V. L., 1967, HYDRAULIC TRANSIENTS STREETER VL, 1974, ANNU REV FLUID MECH, V6, P57, DOI 10.1146/annurev.fl.06.010174.000421 Tijsseling AS, 1996, J FLUID STRUCT, V10, P109, DOI 10.1006/jfls.1996.0009 TITZE IR, 1993, PRINCIPLES VOICE PRO WOO P, 1996, LARYNGOSCOPE, V106 WOOD DJ, 1968, ASME J BASIC ENG, V90, P532 Wylie EB, 1993, FLUID TRANSIENTS SYS Zhao M, 2003, J HYDRAUL ENG-ASCE, V129, P1007, DOI 10.1061/(ASCE)0733-9429(2003)129:12(1007) Zhao M, 2007, J FLUID MECH, V570, P129, DOI [10.1017/S0022112006003193, 10.1017/s0022112006003193] NR 33 TC 7 Z9 7 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2009 VL 51 IS 4 BP 344 EP 351 DI 10.1016/j.specom.2008.10.004 PG 8 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 421KB UT WOS:000264356700004 ER PT J AU Hosom, JP AF Hosom, John-Paul TI Speaker-independent phoneme alignment using transition-dependent states SO SPEECH COMMUNICATION LA English DT Article DE Forced alignment; Phoneme alignment; Automatic phoneme alignment; Hidden Markov models ID HIDDEN MARKOV-MODELS; CHILDHOOD APRAXIA; DIAGNOSTIC MARKER; SPEECH; RATIO AB Determining the location of phonemes is important to a number of speech applications, including training of automatic speech recognition systems, building text-to-speech systems, and research on human speech processing. Agreement of humans on the location of phonemes is, on average, 93.78% within 20 ms on a variety of corpora, and 93.49% within 20 ms on the TIMIT corpus. We describe a baseline forced-alignment system and a proposed system with several modifications to this baseline. Modifications include the addition of energy-based features to the standard cepstral feature set, the use of probabilities of a state transition given an observation, and the computation of probabilities of distinctive phonetic features instead of phoneme-level probabilities. Performance of the baseline system on the test partition of the TIMIT corpus is 91.48% within 20 ms, and performance of the proposed system on this corpus is 93.36% within 20 ms. The results of the proposed system are a 22% relative reduction in error over the baseline system, and a 14% reduction in error over results from a non-HMM alignment system. This result of 93.36% agreement is the best known reported result on the TIMIT corpus. (C) 2008 Elsevier B.V. All rights reserved. C1 Oregon Hlth & Sci Univ, Sch Sci & Engn, Ctr Spoken Language Understanding, Beaverton, OR 97006 USA. RP Hosom, JP (reprint author), Oregon Hlth & Sci Univ, Sch Sci & Engn, Ctr Spoken Language Understanding, 20000 NW Walker Rd, Beaverton, OR 97006 USA.
EM hosom@cslu.ogi.edu FU National Institutes of Health NIDCD [R21-DC06722]; National Institutes of Health NIA [AG08017, AG024978]; National Science Foundation [GER-9354959, IRI-9614217] FX This work was supported in part by the National Institutes of Health NIDCD Grant R21-DC06722, the National Institutes of Health NIA Grants AG08017 and AG024978, and the National Science Foundation Grants GER-9354959 and IRI-9614217. The views expressed here do not necessarily represent the views of the NTH or NSF. CR BOURLARD H, 1992, SPEECH COMMUN, V11, P237, DOI 10.1016/0167-6393(92)90018-3 BRUGNARA F, 1993, SPEECH COMMUN, V12, P357, DOI 10.1016/0167-6393(93)90083-W Campbell N., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607292 Cole R., 1995, P EUR C SPEECH COMM, P821 COLE R, 1994, P INT C SPOK LANG PR, P2131 Cosi P., 1991, P EUROSPEECH 91, P693 COSI P, 2000, P ICSLP 2000 BEIJ CH, V2, P527 COX S, 1998, P INT C SPOK LANG PR, V5, P1947 Duc DN, 2003, LECT NOTES ARTIF INT, V2718, P481 FURUI S, 1986, J ACOUST SOC AM, V80, P1016, DOI 10.1121/1.393842 Garofolo J., 1990, DARPA TIMIT ACOUSTIC GILLICK L, 1993, P ICASSP 89 GLASG SC, P532 GONG Y, 1993, P EUR 93 BERL GERM, P1759 Gordon-Salant S, 2006, J ACOUST SOC AM, V119, P2455, DOI 10.1121/1.2171527 Greenberg Steven, 1996, P ESCA WORKSH AUD BA, P1 Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616 Hieronymus J. L., 1994, ASCII PHONETIC SYMBO Hosom J.-P., 1998, Australian Journal of Intelligent Information Processing Systems, V5 HOSOM JP, 2000, P ICSLP 2000 BEIJ, V4, P564 Huang X., 2001, SPOKEN LANGUAGE PROC HUOPANIEMI J, 1997, P 102 AUD ENG SOC AE Kain AB, 2007, SPEECH COMMUN, V49, P743, DOI 10.1016/j.specom.2007.05.001 Keshet J., 2005, P INT 05, P2961 KVALE K, 1994, P ICSLP 94 YOK JAP, V3, P1667 Ladefoged Peter, 1993, COURSE PHONETICS LEUNG HC, 1984, P ICASSP 84 SAN DIEG Levinson S. E., 1986, Computer Speech and Language, V1, DOI 10.1016/S0885-2308(86)80009-2 LJOLJE A, 1997, PROGR SPEECH SYNTHES Ljolje A., 1991, P INT C AC SPEECH SI, P473, DOI 10.1109/ICASSP.1991.150379 MALFRRE F, 1998, P ICSLP, P1571 Moore BCJ, 1997, INTRO PSYCHOL HEARIN PELLOM BL, 1998, THESIS DUKE U DURHAM Rabiner L, 1993, FUNDAMENTALS SPEECH Rapp S., 1995, P ELSNET GOES E IMAC Richard M. D., 1991, Neural Computation, V3, DOI 10.1162/neco.1991.3.4.461 SHOBAKI K, P ICSLP 2000 BEIJ CH, V4, P258 Shriberg LD, 2003, CLIN LINGUIST PHONET, V17, P575, DOI 10.1080/0269920031000138141 Shriberg LD, 2003, CLIN LINGUIST PHONET, V17, P549, DOI 10.1080/0269920031000138123 Singh S, 2001, APHASIOLOGY, V15, P571, DOI 10.1080/02687040143000041 Svendsen T., 1987, Proceedings: ICASSP 87. 1987 International Conference on Acoustics, Speech, and Signal Processing (Cat. No.87CH2396-0) SVENDSEN T, 1990, P ICSLP 90 KOB JAP, P997 TORKKOLA K, 1988, P IEEE INT C AC SPEE, P611 WAGNER M, 1981, P ICASSP81 ATLANTA, P1156 WEI W, 1998, P INT C AC SPEECH SI, V1, P497, DOI 10.1109/ICASSP.1998.674476 Wesenick M.-B., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607054 Wheatley B., 1992, P ICASSP 92 SAN FRAN, V1, P533 WIGHTMAN CW, 1997, PROGR SPEECH SYNTHES WOODLAND PC, 1995, P ICASSP, V1, P73 NR 48 TC 19 Z9 19 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD APR PY 2009 VL 51 IS 4 BP 352 EP 368 DI 10.1016/j.specom.2008.11.003 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 421KB UT WOS:000264356700005 ER PT J AU Garcia-Sierra, A Diehl, RL Champlin, C AF Garcia-Sierra, Adrian Diehl, Randy L. Champlin, Craig TI Testing the double phonemic boundary in bilinguals SO SPEECH COMMUNICATION LA English DT Article DE Double phonemic boundary; Spanish-English bilinguals; Perceptual switching; Language contexts ID SPANISH-ENGLISH BILINGUALS; CROSS-LANGUAGE; PERCEPTION; CONTRAST; INFANTS; DISCRIMINATION; SPEAKERS; ADULTS; STOPS; RANGE AB It is widely known that language influences the way speech sounds are categorized. However, categorization of speech sounds by bilinguals is not well understood. There is evidence that bilinguals have different category boundaries than monolinguals, and there is evidence suggesting that bilinguals' phonemic boundaries can shift with language context. This phenomenon has been referred as the double phonemic boundary. In this investigation, the double phonemic boundary is tested in Spanish-English bilinguals (N = 18) and English monolinguals (N = 16). Participants were asked to categorize speech stimuli from a continuum ranging from /ga/ to /ka/ in two language contexts. The results showed phonemic boundary shifts in bilinguals and monolinguals which did not differ across language contexts. However, the magnitude of the phoneme boundary shift was significantly correlated with the level of confidence in using English and Spanish (reading, writing, speaking, and comprehension) for bilinguals, but not for monolinguals. The challenges of testing the double phonemic boundary are discussed, along with the limitations of the methodology used in this study. (C) 2008 Elsevier B.V. All rights reserved. C1 [Garcia-Sierra, Adrian; Diehl, Randy L.; Champlin, Craig] Univ Texas Austin, Dept Commun Sci & Disorders, Coll Commun, Austin, TX 78712 USA. RP Garcia-Sierra, A (reprint author), Univ Washington, Inst Learning & Brain Sci, Fisheries Ctr Bldg,Box 357988, Seattle, WA 98195 USA. EM gasa@u.washington.edu; diehl@psy.utexas.edu; champlin@austin.utexas.edu FU The Department of Communications Sciences and Disorders at The University of Texas Austin FX The present work was supported by The Department of Communications Sciences and Disorders at The University of Texas Austin. I thank the valuable help I received from Nairan Ramirez-Esparza, Denise Padden, Marco A. Jurado and Hayley Austin. CR Abramson A. S., 1970, P 6 INT C PHON SCI P, P569 Abramson A. S., 1972, J PHON, V1, P1 BEST CT, 1988, J EXP PSYCHOL HUMAN, V14, P345, DOI 10.1037/0096-1523.14.3.345 BRADY SA, 1978, J ACOUST SOC AM, V63, P1556, DOI 10.1121/1.381849 CARAMAZZ.A, 1973, J ACOUST SOC AM, V54, P421, DOI 10.1121/1.1913594 DIEHL RL, 1978, J EXP PSYCHOL HUMAN, V4, P599, DOI 10.1037//0096-1523.4.4.599 EIMAS PD, 1973, COGNITIVE PSYCHOL, V4, P99, DOI 10.1016/0010-0285(73)90006-6 ELMAN JL, 1977, J ACOUST SOC AM, V62, P971, DOI 10.1121/1.381591 Finney D. 
J., 1971, PROBIT ANAL, V3rd BOHN OS, 1993, J PHONETICS, V21, P267 FLEGE JE, 1987, J PHONETICS, V15, P67 Flege JE, 2002, APPL PSYCHOLINGUIST, V23, P567, DOI 10.1017/S0142716402004046 FLEGE JE, 1987, SPEECH COMMUN, V6, P185, DOI 10.1016/0167-6393(87)90025-2 Grosjean F, 1982, LIFE 2 LANGUAGES HAZAN VL, 1993, LANG SPEECH, V36, P17 Holt LL, 2005, PSYCHOL SCI, V16, P305, DOI 10.1111/j.0956-7976.2005.01532.x Holt LL, 2002, HEARING RES, V167, P156, DOI 10.1016/S0378-5955(02)00383-0 KEATING PA, 1981, J ACOUST SOC AM, V70, P1261, DOI 10.1121/1.387139 KLATT DH, 1980, J ACOUST SOC AM, V67, P971, DOI 10.1121/1.383940 Kohnert KJ, 1999, J SPEECH LANG HEAR R, V42, P1400 KUHL PK, 1992, SCIENCE, V255, P606, DOI 10.1126/science.1736364 Lisker L., 1970, P 6 INT C PHON SCI P, P563 Naatanen R., 1992, ATTENTION BRAIN FUNC Polka L, 2001, J ACOUST SOC AM, V109, P2190, DOI 10.1121/1.1362689 Sundara M, 2008, COGNITION, V108, P232, DOI 10.1016/j.cognition.2007.12.013 Sundara M, 2008, COGNITION, V106, P234, DOI 10.1016/j.cognition.2007.01.011 WILLIAMS L, 1977, PERCEPT PSYCHOPHYS, V21, P289, DOI 10.3758/BF03199477 NR 27 TC 9 Z9 9 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2009 VL 51 IS 4 BP 369 EP 378 DI 10.1016/j.specom.2008.11.005 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 421KB UT WOS:000264356700006 ER PT J AU Jongtaveesataporn, M Thienlikit, I Wutiwiwatchai, C Furui, S AF Jongtaveesataporn, Markpong Thienlikit, Issara Wutiwiwatchai, Chai Furui, Sadaoki TI Lexical units for Thai LVCSR SO SPEECH COMMUNICATION LA English DT Article DE Thai LVCSR; Thai language model; Lexical unit; Pseudo-morpheme; Compound pseudo-morpheme; Word segmentation AB Traditional language models rely on lexical units that are defined as entities separated from each other by word boundary markers. Since there are no such boundaries in Thai, alternative definitions of lexical units have to be pursued. The problem is to find the optimal set of lexical units that constitutes the vocabulary of the language model and yields the best final result. The word is a traditional lexical unit recognized by Thai people and is used by most of the natural language processing systems, including an automatic speech recognition system. This paper discusses problems with using words as a lexical unit and investigates other lexical units for the Thai large vocabulary continuous speech recognition (LVCSR) system. The pseudo-morpheme is introduced in the paper and shown to be unsuitable for use as a lexical unit directly. A technique using pseudo-morphemes to improve the system based on the traditional word model is introduced and some improvements can be gained by this technique. Then, a new lexical unit for Thai, the compound pseudo-morpheme, and an algorithm to build compound pseudo-morphemes are presented. The experimental results show that the system using compound pseudo-morphemes outperforms other systems. Thus, the compound pseudo-morpheme is the most suitable lexical unit for Thai LVCSR system. (C) 2008 Elsevier B.V. All rights reserved. C1 [Jongtaveesataporn, Markpong; Thienlikit, Issara; Furui, Sadaoki] Tokyo Inst Technol, Dept Comp Sci, Meguro Ku, Tokyo 1528552, Japan. [Wutiwiwatchai, Chai] Natl Elect & Comp Technol Ctr, Pathum Thani 12120, Thailand. RP Jongtaveesataporn, M (reprint author), Tokyo Inst Technol, Dept Comp Sci, Meguro Ku, 2-12-1 Ookayma, Tokyo 1528552, Japan. 
EM marky@furui.cs.titech.ac.jp RI Wutiwiwatchai, Chai/G-5010-2012 FU METI Project "Development of Fundamental Speech Recognition Technology" FX The speech corpus used for training the acoustic model was funded by the METI Project "Development of Fundamental Speech Recognition Technology". The authors would like to thank Dr. Wirote Aroonmanakun for allowing us to use his PM segmentation tool and NECTEC for providing useful Thai speech resources. We also would like to thank the anonymous reviewers for their insightful remarks on a previous version of this manuscript and Si-Phya Publishing Co., Ltd. which publishes Daily News newspaper for supplying us useful newspaper text. CR AROONMANAKUN W, 2002, P 5 S NAT LANG PROC, P68 Hacioglu K., 2003, P 8 EUR C SPEECH COM, P1165 JONGTAVEESATAPO.M, 2008, P INT C LARG RES EV Kasuriya S., 2003, Proceedings of the Oriental COCOSDA 2003. International Coordinating Committee on Speech Databases and Speech I/O System Assessment Lee A., 2001, P EUR C SPEECH COMM, P1691 *NECTEC, SWATH SMART WORD AN ROSENFELD R, 1995, P ARPA SPOK LANG TEC Saon G, 2001, IEEE T SPEECH AUDI P, V9, P327, DOI 10.1109/89.917678 Tarsaku P., 2001, P EUR C SPEECH COMM, P1057 Wutiwiwatchsi C, 2007, SPEECH COMMUN, V49, P8, DOI 10.1016/j.specom.2006.10.004 Young Steve, 2002, HTK BOOK VERSION 3 2 NR 11 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2009 VL 51 IS 4 BP 379 EP 389 DI 10.1016/j.specom.2008.11.006 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 421KB UT WOS:000264356700007 ER PT J AU Gomez, AM Peinado, AM Sanchez, V Carmona, JL AF Gomez, Angel M. Peinado, Antonio M. Sanchez, Victoria Carmona, Jose L. TI A robust scheme for distributed speech recognition over loss-prone packet channels SO SPEECH COMMUNICATION LA English DT Article DE Distributed speech recognition; Media-specific FEC; Interleaving; Weighted Viterbi decoding ID IP NETWORKS AB In this paper, we propose a whole recovery scheme designed to improve robustness against packet losses in distributed speech recognition systems. This scheme integrates two sender-driven techniques, namely, media-specific forward error correction (FEC) and frame interleaving, along with a receiver-based error concealment (EC) technique, the weighted Viterbi algorithm (WVA). Although these techniques have already been tested separately, providing a significant increase of performance in clean acoustic environments, in this paper they are jointly applied and their performance in adverse acoustic conditions is evaluated. In particular, a noisy speech database and the ETSI Advanced Front-end are used, while the dynamic features, which play an important role in adverse acoustic environments, and their confidences for the WVA algorithm are examined. In order to solve the issue of mixing two sender-driven techniques (both causing a delay) whose direct composition causes an increase of the global latency, we propose a double stream scheme which limits the latency to the maximum delay of both techniques. As a result, with very few overhead bits and a very limited delay, the integrated scheme achieves a significant improvement in the performance of a DSR system over a degraded transmission channel, both in clean and noisy acoustic conditions. (C) 2009 Elsevier B.V. All rights reserved. C1 [Gomez, Angel M.; Peinado, Antonio M.; Sanchez, Victoria; Carmona, Jose L.]
Univ Granada, Fac Ciencias, Dept Teoria Senal Telemat & Comunicac, E-18071 Granada, Spain. RP Gomez, AM (reprint author), Univ Granada, Fac Ciencias, Dept Teoria Senal Telemat & Comunicac, Campus Fuentenueva S-N, E-18071 Granada, Spain. EM amgg@ugr.es; amp@ugr.es; victoria@ugr.es; maqueda@ugr.es RI Sanchez , Victoria /C-2411-2012; Peinado, Antonio/C-2401-2012; Gomez Garcia, Angel Manuel/C-6856-2012 OI Gomez Garcia, Angel Manuel/0000-0002-9995-3068 FU Spanish MEC [TEC 2007-66600] FX This paper has been supported by the Spanish MEC, project TEC 2007-66600. CR Andrews K., 1997, THEORY INTERLEAVERS Bernard A, 2002, IEEE T SPEECH AUDI P, V10, P570, DOI 10.1109/TSA.2002.808141 Bolot J. C., 1993, ACM SIGCOMM, P289 Cardenal-Lopez A, 2006, SPEECH COMMUN, V48, P1422, DOI 10.1016/j.specom.2006.01.006 CARDENALLOPEZ A, 2004, P IEEE INT C AC SPEE, P49 ENDO T, 2003, P EUR 03 *ETSI, 2005, ETSIES202212 *ETSI, 2003, ETSIES202211 *ETSI, 2000, ETSIES201108 *ETSI, 2002, ETSIES202050 FURUI S, 1986, IEEE T ACOUST SPEECH, V34, P52, DOI 10.1109/TASSP.1986.1164788 GOMEZ A, 2003, P EUROSPEECH 03 GOMEZ A, 2007, IEEE INT C AC SPEECH, V4 GOMEZ A, 2004, STAT BASED RECONSTRU Gomez AM, 2007, IEEE T AUDIO SPEECH, V15, P1496, DOI 10.1109/TASL.2006.889800 Gomez AM, 2006, IEEE T MULTIMEDIA, V8, P1228, DOI 10.1109/TMM.2006.884611 Ion V, 2006, SPEECH COMMUN, V48, P1435, DOI 10.1016/j.specom.2006.03.007 JAMES A, 2004, P INT C SPOK LANG PR JAMES A, 2004, P IEEE INT C AC SPEE JAMES A, 2005, P IEEE INT C AC SPEE, P345 KOODLI R, 1999, 3357 RFC MACHO D, 2000, SPANISH SDC AURORA D Milner B, 2006, IEEE T AUDIO SPEECH, V14, P223, DOI 10.1109/TSA.2005.852997 MILNER B, 2004, PACKET LOSS MODELLIN MILNER B, 2000, P ICASSP, V3, P1791 MILNER B, 2003, P EUR 03 PEINADO A, 2006, P INTERSPEECH 06 PEINADO A, 2005, P IEEE INT C AC SPEE, P329 PEINADO A, 2006, ROBUSTNESS STANDARDS PERKINS C, 1998, IEEE NETWORK MAGAZIN Postel J., 1980, 768 RFC POTAMIANOS A, 2001, P IEEE INT C AC SPEE RAMSEY JL, 1970, IEEE T INFORM THEORY, V16, P338, DOI 10.1109/TIT.1970.1054443 Schulzrinne H., 1996, 1889 RFC XIE Q, 2002, 3357 RFC YOMA N, 1998, P IEEE INT C AC SPEE NR 36 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2009 VL 51 IS 4 BP 390 EP 400 DI 10.1016/j.specom.2008.12.002 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 421KB UT WOS:000264356700008 ER PT J AU Kjellstrom, H Engwall, O AF Kjellstrom, Hedvig Engwall, Olov TI Audiovisual-to-articulatory inversion SO SPEECH COMMUNICATION LA English DT Article DE Speech inversion; Articulatory inversion; Computer vision ID SPEECH RECOGNITION AB It has been shown that acoustic-to-articulatory inversion, i.e. estimation of the articulatory configuration from the corresponding acoustic signal, can be greatly improved by adding visual features extracted from the speaker's face. In order to make the inversion method usable in a realistic application, these features should be possible to obtain from a monocular frontal face video, where the speaker is not required to wear any special markers. In this study, we investigate the importance of visual cues for inversion. Experiments with motion capture data of the face show that important articulatory information can be extracted using only a few face measures that mimic the information that could be gained from a video-based method. 
We also show that the depth cue for these measures is not critical, which means that the relevant information can be extracted from a frontal video. A real video-based face feature extraction method is further presented, leading to similar improvements in inversion quality. Rather than tracking points on the face, it represents the appearance of the mouth area using independent component images. These findings are important for applications that need a simple audiovisual-to-articulatory inversion technique, e.g. articulatory phonetics training for second language learners or hearing-impaired persons. (c) 2008 Elsevier B.V. All rights reserved. C1 [Kjellstrom, Hedvig] KTH Royal Inst Technol, Sch Comp Sci & Commun, Comp Vis & Act Percept Lab, SE-10044 Stockholm, Sweden. [Engwall, Olov] KTH Royal Inst Technol, Sch Comp Sci & Commun, Ctr Speech Technol CTT, SE-10044 Stockholm, Sweden. RP Kjellstrom, H (reprint author), KTH Royal Inst Technol, Sch Comp Sci & Commun, Comp Vis & Act Percept Lab, SE-10044 Stockholm, Sweden. EM hedvig@kth.se; engwall@kth.se FU Swedish Research Council; ASPI; Audiovisual SPeech Inversion; Future and Emerging Technologies (FET) programme within the Sixth Framework Programme for Research of the European Commission [021324] FX This research is part of the two projects ARTUR, funded by the Swedish Research Council, and ASPI, Audiovisual SPeech Inversion. The authors acknowledge the financial support of the Future and Emerging Technologies (FET) programme within the Sixth Framework Programme for Research of the European Commission, under FET-Open Contract No. 021324 CR Ahlberg J., 2001, LITHISYR2326 LINKP U Bailly G., 2002, INT C SPEECH LANG PR, P1913 BESKOW J, 2003, RESYNTHESIS FACIAL I, P431 BRANDERUD P, 1985, P FRENCH SWED S SPEE, P113 BREGLER C, 1994, INT CONF ACOUST SPEE, P669 Cootes TF, 2001, IEEE T PATTERN ANAL, V23, P681, DOI 10.1109/34.927467 Cristianini N., 2000, INTRO SUPPORT VECTOR Doucet A., 2001, SEQUENTIAL MONTE CAR Dupont S, 2000, IEEE T MULTIMEDIA, V2, P141, DOI 10.1109/6046.865479 ENGWALL O, 2006, INT SEM SPEECH PROD, P469 Engwall O, 2006, BEHAV INFORM TECHNOL, V25, P353, DOI 10.1080/01449290600636702 ENGWALL O, 2005, INTERSPEECH, P3205 ENGWALL O, 2003, RESYNTHESIS 3D TONGU, P2261 Engwall O, 2003, SPEECH COMMUN, V41, P303, DOI [10.1016/S0167-6393(02)00132-2, 10.1016/S0167-6393(03)00132-2] Hyvarinen A, 2001, INDEPENDENT COMPONEN Hyvarinen A., 2005, FASTICA PACKAGE MATL Isard M, 1998, INT J COMPUT VISION, V29, P5, DOI 10.1023/A:1008078328650 Jiang JT, 2002, EURASIP J APPL SIG P, V2002, P1174, DOI 10.1155/S1110865702206046 KATSAMANIS A, 2008, IEEE INT C AC SPEECH KAUCIC R, 1998, IEEE INT C COMP VIS, P370 KJELLSTROM H, 2006, RECONSTRUCTING TONGU, P2238 MACLEOD A, 1990, British Journal of Audiology, V24, P29, DOI 10.3109/03005369009077840 MAEDA S, 1994, SPEECH MAPS WP2 SPEE, V3 Matthews I, 2002, IEEE T PATTERN ANAL, V24, P198, DOI 10.1109/34.982900 OUNI S, 2002, INT C SPOK LANG PROC, P2301 PETERSEN M, 1997, THESIS TU DENMARK DE Saenko K, 2005, IEEE I CONF COMP VIS, P1424, DOI 10.1109/ICCV.2005.251 SEYMOUR R, 2005, NEW POSTERIOR BASED, P1229 SHDAIFAT I, 2005, SYSTEM AUDIO VISUAL, P1221 SUGAMURA N, 1986, SPEECH COMMUN, V5, P199, DOI 10.1016/0167-6393(86)90008-7 Tipping ME, 2001, J MACH LEARN RES, V1, P211, DOI 10.1162/15324430152748236 TORRESANI L, 2004, ECCV, P299 Yang MH, 2002, IEEE T PATTERN ANAL, V24, P34 Yehia H, 1998, SPEECH COMMUN, V26, P23, DOI 10.1016/S0167-6393(98)00048-X NR 34 TC 9 Z9 9 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE 
AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2009 VL 51 IS 3 BP 195 EP 209 DI 10.1016/j.specom.2008.07.005 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 405DW UT WOS:000263203900001 ER PT J AU Knoll, MA Uther, M Costall, A AF Knoll, Monja A. Uther, Maria Costall, Alan TI Effects of low-pass filtering on the judgment of vocal affect in speech directed to infants, adults and foreigners SO SPEECH COMMUNICATION LA English DT Article DE Low-pass filtering; Infant-directed speech; Foreign-directed speech; Vocal affect ID MOTHERS SPEECH; EMOTIONS; COMMUNICATION; MASKING; CUES; RECOGNITION; FREQUENCIES; INTONATION; LANGUAGE; PROSODY AB Low-pass filtering has been used in emotional research to remove the semantic content from speech on the assumption that the relevant acoustic cues for vocal affect remain intact. This method has also been adapted by recent investigations into the function of infant-directed speech (IDS). Similar to other emotion-related studies that have utilised various levels of low-pass filtering, these IDS investigations have used different frequency cut-offs. However, the effects of applying these different low-pass filters to speech samples on perceptual ratings of vocal affect are not well understood. Samples of natural IDS, foreigner- (FDS) and British adult-directed (ADS) speech were low-pass filtered at four different cut-offs (1200, 1000, 700, and 400 Hz), and affective ratings of these were compared to those of the original samples. The samples were also analyzed for mean fundamental frequency (F-0) and F-0 range. Whilst IDS received consistently higher affective ratings for all filters, the results of the adult conditions were more complex. ADS received significantly higher ratings of positive vocal affect than FDS with the lower cut-offs (1000-400 Hz), whereas no significant difference between the adult conditions was found in the original and 1200 Hz conditions. No difference between the adult conditions was found for encouragement of attention. These findings show that low-pass filtering leaves sufficient vocal affect for detection by raters between IDS and the adult conditions, but that residual semantic information in filters above 1000 Hz may have a confounding affect on raters' perception. (c) 2008 Elsevier B.V. All rights reserved. C1 [Knoll, Monja A.; Costall, Alan] Univ Portsmouth, Dept Psychol, Portsmouth PO1 2DY, Hants, England. [Uther, Maria] Brunel Univ, Sch Social Sci, Ctr Cognit & Neuroimaging, Uxbridge UB8 3PH, Middx, England. RP Knoll, MA (reprint author), Univ Portsmouth, Dept Psychol, King Henry Bldg,King Henry 1 St, Portsmouth PO1 2DY, Hants, England. EM Monja.Knoll@port.ac.uk; Maria.Uther@brunel.ac.uk; Alan.Costall@port.ac.uk FU ESRC FX This research was supported by a grant from the ESRC to Monja Knoll. We thank Pushpendra Singh (Newcastle University, UK), Paul Marshman (Portsmouth University, UK), James Uther (Sydney University, Australia), Axelle Philippon (Portsmouth University, UK) and Stig Walsh (NHM, UK) for technical support and discussion. Two anonymous reviewers are thanked for greatly improving this manuscript. 
CR Baer T, 2002, J ACOUST SOC AM, V112, P1133, DOI 10.1121/1.1498853 Biersack S., 2005, P 9 EUR C SPEECH COM, P2401 Boersma P., 2005, PRAAT DOING PHONETIC Burnham D., 2002, SCIENCE, V296, P1095 CAPORAEL LR, 1981, J PERS SOC PSYCHOL, V40, P876, DOI 10.1037//0022-3514.40.5.876 COHEN A, 1961, AM J PSYCHOL, V74, P90, DOI 10.2307/1419829 DAVITZ JR, 1961, J COMMUN, V81, P81 Dubno JR, 2005, J ACOUST SOC AM, V118, P923, DOI 10.1121/1.1953127 FERNALD A, 1991, DEV PSYCHOL, V27, P209, DOI 10.1037/0012-1649.27.2.209 FERNALD A, 1984, DEV PSYCHOL, V20, P104, DOI 10.1037//0012-1649.20.1.104 FRENCH NR, 1947, J ACOUST SOC AM, V19, P90, DOI 10.1121/1.1916407 FRIEND M, 1994, J ACOUST SOC AM, V96, P1283, DOI 10.1121/1.410276 Hogan CA, 1998, J ACOUST SOC AM, V104, P432, DOI 10.1121/1.423247 Kitamura C, 2003, INFANCY, V4, P85, DOI 10.1207/S15327078IN0401_5 KNOLL MA, 2008, BPS ANN C 2008 DUBL, P232 Knoll MA, 2007, SYST ASSOC SPEC VOL, V74, P299 KNOLL MA, 2004, J ACOUST SOC AM, V116, P2522 Knower FH, 1941, J SOC PSYCHOL, V14, P369 KRAMER E, 1964, J ABNORM SOC PSYCH, V68, P390, DOI 10.1037/h0042473 KRAMER E, 1963, PSYCHOL BULL, V60, P408, DOI 10.1037/h0044890 MILMOE S, 1967, J ABNORM PSYCHOL, V72, P78, DOI 10.1037/h0024219 Morton JB, 2001, CHILD DEV, V72, P834, DOI 10.1111/1467-8624.00318 PAPOUSEK M, 1991, APPL PSYCHOLINGUIST, V12, P481, DOI 10.1017/S0142716400005889 POLLACK I, 1948, J ACOUST SOC AM, V20, P259, DOI 10.1121/1.1906369 ROGERS PL, 1971, BEHAV RES METH INSTR, V3, P16, DOI 10.3758/BF03208115 ROSS M, 1973, AM ANN DEAF, V118, P37 SCHERER KR, 1971, J EXP RES PERS, V5, P155 SCHERER KR, 1972, J PSYCHOLINGUIST RES, V1, P269, DOI 10.1007/BF01074443 Scherer KR, 2003, SPEECH COMMUN, V40, P227, DOI 10.1016/S0167-6393(02)00084-5 Soskin W.F., 1961, J COMMUN, V11, P73, DOI 10.1111/j.1460-2466.1961.tb00331.x STARKWEATHER JA, 1956, AM J PSYCHOL, V69, P121, DOI 10.2307/1418129 Starkweather J.A., 1967, RES VERBAL BEHAV SOM, P253 Trainor LJ, 2000, PSYCHOL SCI, V11, P188, DOI 10.1111/1467-9280.00240 Uther M, 2007, SPEECH COMMUN, V49, P2, DOI 10.1016/j.specom.2006.10.003 VANBEZOOIJEN R, 1986, J PSYCHOLINGUIST RES, V15, P103 WHITE GM, 1975, J ACOUST SOC AM, V58, P106 NR 36 TC 6 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2009 VL 51 IS 3 BP 210 EP 216 DI 10.1016/j.specom.2008.08.001 PG 7 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 405DW UT WOS:000263203900002 ER PT J AU Caballero, M Moreno, A Nogueiras, A AF Caballero, Monica Moreno, Asuncion Nogueiras, Albino TI Multidialectal Spanish acoustic modeling for speech recognition SO SPEECH COMMUNICATION LA English DT Article DE Multidialectal ASR system; Dialect-independent ASR system; Spanish acoustic modeling; Spanish dialects AB During the last years, language resources for speech recognition have been collected for many languages and specifically, for global languages. One of the characteristics of global languages is their wide geographical dispersion, and consequently, their wide phonetic, lexical, and semantic dialectal variability. Even if the collected data is huge, it is difficult to represent dialectal variants accurately. This paper deals with multidialectal acoustic modeling for Spanish. The goal is to create a set of multidialectal acoustic models that represents the sounds of the Spanish language as spoken in Latin America and Spain. 
A comparative study of different methods for combining data between dialects is presented. The developed approaches are based on decision tree clustering algorithms. They differ on whether a multidialectal phone set is defined, and in the decision tree structure applied. Besides, a common overall phonetic transcription for all dialects is proposed. This transcription can be used in combination with all the proposed acoustic modeling approaches. Overall transcription combined with approaches based on defining a multidialectal phone set leads to a full dialect-independent recognizer, capable of recognizing any dialect even with a total absence of training data from such dialect. Multidialectal systems are evaluated over data collected in five different countries: Spain, Colombia, Venezuela, Argentina and Mexico. The best results given by multidialectal systems show a relative improvement of 13% over the results obtained with monodialectal systems. Experiments with dialect-independent systems have been conducted to recognize speech from Chile, a dialect not seen in the training process. The recognition results obtained for this dialect are similar to the ones obtained for other dialects. (c) 2008 Elsevier B.V. All rights reserved. C1 [Caballero, Monica; Moreno, Asuncion; Nogueiras, Albino] Univ Politecn Cataluna, Talp Res Ctr, E-08028 Barcelona, Spain. RP Caballero, M (reprint author), Univ Politecn Cataluna, Talp Res Ctr, E-08028 Barcelona, Spain. EM monica@gps.tsc.upc.edu; asuncion@gps.tsc.upc.edu; albino@gps.tsc.upc.edu RI Nogueiras Rodriguez, Albino/G-1418-2013 OI Nogueiras Rodriguez, Albino/0000-0002-3159-1718 FU Spanish Government [TEC2006-13694-C03] FX This work was granted by Spanish Government TEC2006-13694-C03. CR AALBURG S, 2003, P EUROSPEECH 2003, P1489 BAUM M, 2001, ISCA WORKSH AD METH, P135 BERINGER N, 1998, P ICSLP SYDN AUSTR, V2, P85 Billa J., 1997, P EUR, P363 Brousseau J., 1992, P INT C SPOK LANG PR, P1003 BYRNE W, 2000, P ICASSP, V2, P1029 CABALLERO M, 2004, P ICSLP JEJ ISL KOR, P837 Chengalvarayan R, 2001, P EUR AALB DENM, P2733 DELATORRE C, 1996, P ICSLP PHIL, V4, P2032, DOI 10.1109/ICSLP.1996.607198 DIAKOLUKAS D, 1997, P ICASSP MUN GERM, P1455 DUCHATEAU J, 1997, P EUR 97, V3, P1183 Ferreiros J, 1999, SPEECH COMMUN, V29, P65, DOI 10.1016/S0167-6393(99)00013-8 FISCHER V, 1998, P ICSLP SYDN AUSTR FOLDVIK AK, 1998, P ICSLP SYDN AUSTR Gibbon D, 1997, HDB STANDARDS RESOUR Heeringa W, 2003, COMPUT HUMANITIES, V37, P293, DOI 10.1023/A:1025087115665 HUERTA JM, 1998, DARPA BN TRANSCR UND Imperl B, 2003, SPEECH COMMUN, V39, P353, DOI 10.1016/S0167-6393(02)00048-1 Kirchhoff K, 2005, SPEECH COMMUN, V46, P37, DOI 10.1016/j.specom.2005.01.004 Kohler J, 2001, SPEECH COMMUN, V35, P21, DOI 10.1016/S0167-6393(00)00093-5 Kudo I., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607195 Lipski J. M., 1994, LATIN AM SPANISH LLISTERRI J, 1993, SAMAUPC001VI MARINO JB, 2000, P INT WORKSH VER LAR, P57 MARINO JB, 1998, P ICSLP SYDN AUSTR, V1, P477 MORENO A, 1998, P ICSLP SIDN AUSTR MORENO A., 1998, P INT C LANG RES EV, VI, P367 NOGUEIRAS A, 2002, P ICASSP ORL US SALVI G, 2003, P 15 INT C PHON SCI Schultz T, 2001, SPEECH COMMUN, V35, P31, DOI 10.1016/S0167-6393(00)00094-7 YU H, 2003, P EUR GEN SWITZ, P1869 ZISSMANM MA, 1996, P ICASSP ATL GA US, P777 NR 32 TC 6 Z9 7 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
PD MAR PY 2009 VL 51 IS 3 BP 217 EP 229 DI 10.1016/j.specom.2008.08.003 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 405DW UT WOS:000263203900003 ER PT J AU Li, YP Wang, DL AF Li, Yipeng Wang, DeLiang TI On the optimality of ideal binary time-frequency masks SO SPEECH COMMUNICATION LA English DT Article DE Ideal binary mask; Ideal ratio mask; Optimality; Sound separation; Wiener filter ID AUDITORY SCENE ANALYSIS; SPEECH RECOGNITION; SOURCE SEPARATION; MONAURAL SPEECH; SEGREGATION; ENHANCEMENT AB The concept of ideal binary time-frequency masks has received attention recently in monaural and binaural sound separation. Although often assumed, the optimality of ideal binary masks in terms of signal-to-noise ratio has not been rigorously addressed. In this paper we give a formal treatment on this issue and clarify the conditions for ideal binary masks to be optimal. We also experimentally compare the performance of ideal binary masks to that of ideal ratio masks on a speech mixture database and a music database. The results show that ideal binary masks are close in performance to ideal ratio masks which are closely related to the Wiener filter, the theoretically optimal linear filter. (c) 2008 Elsevier B.V. All rights reserved. C1 [Li, Yipeng; Wang, DeLiang] Ohio State Univ, Dept Comp Sci & Engn, Columbus, OH 43210 USA. [Wang, DeLiang] Ohio State Univ, Ctr Cognit Sci, Columbus, OH 43210 USA. RP Li, YP (reprint author), Ohio State Univ, Dept Comp Sci & Engn, Columbus, OH 43210 USA. EM li.434@osu.edu FU AFOSR [F49620-04-1-0027]; AFRL [FA8750-04-1-0093] FX We wish to thank the three anonymous reviewers for their constructive suggestions/criticisms. This research was supported in part by an AFOSR Grant (F49620-04-1-0027) and an AFRL Grant (FA8750-04-1-0093). CR Bregman AS., 1990, AUDITORY SCENE ANAL BROWN GJ, 1994, COMPUT SPEECH LANG, V8, P297, DOI 10.1006/csla.1994.1016 Brungart DS, 2006, J ACOUST SOC AM, V120, P4007, DOI 10.1121/1.2363929 Cooke M.P., 1993, MODELING AUDITORY PR Deshmukh OD, 2007, J ACOUST SOC AM, V121, P3886, DOI 10.1121/1.2714913 Ellis D.P.W., 2006, COMPUTATIONAL AUDITO, P115 Goto M., 2003, INT C MUS INF RETR Harding S, 2006, IEEE T AUDIO SPEECH, V14, P58, DOI 10.1109/TSA.2005.860354 Hu G. N., 2001, IEEE WORKSH APPL SIG Hu GN, 2004, IEEE T NEURAL NETWOR, V15, P1135, DOI 10.1109/TNN.2004.832812 Hubbard TL, 2001, AM J PSYCHOL, V114, P569, DOI 10.2307/1423611 KIM YI, 2006, NEURAL INFORM PROCES, V10, P125 Li N, 2008, J ACOUST SOC AM, V123, P1673, DOI 10.1121/1.2832617 Li P, 2006, IEEE T AUDIO SPEECH, V14, P2014, DOI 10.1109/TASL.2006.883258 LI Y, 2007, IEEE INT C AC SPEECH, P481 LIM JS, 1979, P IEEE, V67, P1586, DOI 10.1109/PROC.1979.11540 Oppenheim A. 
V., 1999, DISCRETE TIME SIGNAL PRINCEN JP, 1986, IEEE T ACOUST SPEECH, V34, P1153, DOI 10.1109/TASSP.1986.1164954 Radfar MH, 2007, EURASIP J AUDIO SPEE, DOI 10.1155/2007/84186 REDDY AM, 2007, IEEE T AUDIO SPEECH, V25, P1766 Roman N, 2003, J ACOUST SOC AM, V114, P2236, DOI 10.1121/1.1610463 Srinivasan S, 2006, SPEECH COMMUN, V48, P1486, DOI 10.1016/j.specom.2006.09.003 Strang G., 1996, WAVELETS FILTER BANK Van Trees H., 1968, DETECTION ESTIMATION, V1st Vincent E, 2006, IEEE T AUDIO SPEECH, V14, P1462, DOI 10.1109/TSA.2005.858005 Vincent E, 2007, SIGNAL PROCESS, V87, P1933, DOI 10.1016/j.sigpro.2007.01.016 Wang D., 2006, COMPUTATIONAL AUDITO Wang DL, 2005, SPEECH SEPARATION BY HUMANS AND MACHINES, P181, DOI 10.1007/0-387-22794-6_12 Wang DLL, 1999, IEEE T NEURAL NETWOR, V10, P684, DOI 10.1109/72.761727 Weintraub M., 1985, THESIS STANFORD U Wiener N., 1949, EXTRAPOLATION INTERP NR 31 TC 42 Z9 44 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2009 VL 51 IS 3 BP 230 EP 239 DI 10.1016/j.specom.2008.09.001 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 405DW UT WOS:000263203900004 ER PT J AU Recasens, D Espinosa, A AF Recasens, Daniel Espinosa, Aina TI Dispersion and variability in Catalan five and six peripheral vowel systems SO SPEECH COMMUNICATION LA English DT Article DE Vowel space; Catalan; Contextual variability; Near-mergers; Mid vowel neutralization; Acoustic analysis ID MODERN GREEK; ENGLISH; LANGUAGE; PATTERNS; COARTICULATION; CONTRAST; CONTEXT; SPEECH AB This study compares F1 and F2 for the vowels of the five and six peripheral vowel systems of four minor dialects of Catalan (Felanitxer, Gironi, Sitgeta, Rossellones), with those of the seven peripheral vowel systems of the major dialects those minor dialects belong to (Majorcan, Eastern). Results indicate that most mid vowel pairs subjected to neutralization may be characterized as near-mergers. Merging appears to have proceeded through two stages: in the first place, one of the two mid vowel pairs undergoes neutralization yielding a relatively close mid vowel in the resulting six vowel system; then, the members of the second vowel pair approach each other until they cease to be contrastive, and the front and back mid vowels of the resulting five vowel system tend to occupy a fairly equidistant position with respect to the mid high and mid low cognates. Moreover, in six vowel systems with a single mid vowel pair, the contrasting members of this pair approach each other if belonging to the back series but not if belonging to the front series. These findings are in support of two hypotheses: vowel systems tend to be symmetrical; reparation of six vowel systems is most prone to occur if the system is unoptimal. Predictions of the adaptive dispersion theory were not supported by the data. Thus, smaller vowel systems turned out not to be less disperse than larger ones, and mid vowels were not clearly more variable in five or six vowel systems than in seven vowel systems. It appears that for these predictions to come into play, the systems being compared need to differ considerably in number of vowels. (c) 2008 Elsevier B.V. All rights reserved. C1 [Recasens, Daniel] Univ Autonoma Barcelona, Dept Catalan Philol, E-08193 Barcelona, Spain. Inst Estudis Catalans, Phonet Lab, Barcelona 08001, Spain. 
RP Recasens, D (reprint author), Univ Autonoma Barcelona, Dept Catalan Philol, E-08193 Barcelona, Spain. EM daniel.recasens@uab.es FU Spanish Ministry of Education and Science [HUM2006-03743]; FEDER [2005SGR864] FX This research was funded by project HUM2006-03743 of the Spanish Ministry of Education and Science and FEDER, and by project 2005SGR864 of the Generalitat de Catalunya. We thank two anonymous reviewers for comments on a previous manuscript version. CR ADANK P, 2003, VOWEL NORMALIZATION Adank P, 2004, J ACOUST SOC AM, V116, P3099, DOI 10.1121/1.1795335 ALTAMIMI JE, 2005, DOES VOWEL SYSTEM SI, P2465 Beddor PS, 2002, J PHONETICS, V30, P591, DOI 10.1006/jpho.2002.0177 BLADON A, 1983, SPEECH COMMUN, V2, P305, DOI 10.1016/0167-6393(83)90047-X Boersma P, 1998, FUNCTIONAL PHONOLOGY BOHN O, 2004, J INT PHON ASSOC, V34, P161, DOI 10.1017/S002510030400180X Bradlow AR, 1996, SPEECH COMMUN, V20, P255, DOI 10.1016/S0167-6393(96)00063-5 BRADLOW AR, 1995, J ACOUST SOC AM, V97, P1916, DOI 10.1121/1.412064 CALAMAI S, 2002, SCUOLA NORMALE SUPER, V3, P40 CHAPTER J, 2004, CAMBRIDGE OCCASIONAL, V2, P100 de Boer B., 2001, ORIGINS VOWEL SYSTEM de Boer B, 2000, J PHONETICS, V28, P441, DOI 10.1006/jpho.2000.0125 DEGRAAF T, 1984, P I PHONETIC SCI, V8, P41 Disner S. F., 1983, UCLA WORKING PAPERS Fant G., 1973, SPEECH SOUNDS FEATUR Ferrero F., 1978, J ITALIAN LINGUISTIC, V3, P87 FLEGE JE, 1989, LANG SPEECH, V32, P123 Flemming E., 2004, PHONETICALLY BASED P, P232, DOI 10.1017/CBO9780511486401.008 Flemming Edward S., 2002, AUDITORY REPRESENTAT Fourakis M, 1999, PHONETICA, V56, P28, DOI 10.1159/000028439 Goldstein L., 1983, 10TH INT C PHON SCI, P267 Guion SG, 2003, PHONETICA, V60, P98, DOI 10.1159/000071449 Hawks JW, 1995, LANG SPEECH, V38, P237 Hillenbrand JM, 2001, J ACOUST SOC AM, V109, P748, DOI 10.1121/1.1337959 JONGMAN A, 1989, LANG SPEECH, V32, P221 KEATING PA, 1994, J PHONETICS, V22, P407 KEATING PA, 1984, PHONETICA, V41, P191 KOOPMANSVANBEIN.FJ, 1973, J PHONETICS, V1, P249 Labov W., 1994, PRINCIPLES LINGUISTI LADEFOGED P, 1967, NATURE VOWEL QUALITY, P50 Lass Roger, 1992, CAMBRIDGE HIST ENGLI, V2, p[1066, 23] Lass Roger, 1999, CAMBRIDGE HIST ENGLI, VIII, P56 Lass Roger, 1994, OLD ENGLISH HIST LIN LILJENCR.J, 1972, LANGUAGE, V48, P839, DOI 10.2307/411991 Lindblom B., 1986, EXPT PHONOLOGY, P13 LIVJN P, 2000, PERILUS, V23, P93 MANUEL SY, 1990, J ACOUST SOC AM, V88, P1286, DOI 10.1121/1.399705 Martinet Andre, 1970, EC CHANGEMENTS PHONE MARTINS MRD, 1964, B FILOLOGIA, V22, P303 Max L, 1999, J SPEECH LANG HEAR R, V42, P261 MEUNIER C, 2003, P 15 INT C PHON SCI, V1, P723 Mok P, 2006, THESIS U CAMBRIDGE MOON SJ, 1994, J ACOUST SOC AM, V96, P40, DOI 10.1121/1.410492 Most T, 2000, LANG SPEECH, V43, P295 Munson B, 2004, J SPEECH LANG HEAR R, V47, P1048, DOI 10.1044/1092-4388(2004/078) Nearey Terrance Michael, 1978, PHONETIC FEATURE SYS NOBRE MA, 1987, HONOR ILSE LEHISTE, P195 PAPCUN G, 1976, UCLA WORKING PAPERS, V31, P38 PETERSON GE, 1952, J ACOUST SOC AM, V24, P175, DOI 10.1121/1.1906875 Quilis A., 1981, FONETICA ACUSTICA LE Recasens D., 1996, FONETICA DESCRIPTIVA Recasens D, 2006, SPEECH COMMUN, V48, P645, DOI 10.1016/j.specom.2005.09.011 RECASENS D, 1985, LANG SPEECH, V28, P97 Schwartz JL, 1997, J PHONETICS, V25, P255, DOI 10.1006/jpho.1997.0043 Schwartz JL, 1997, J PHONETICS, V25, P233, DOI 10.1006/jpho.1997.0044 Sole M. 
J., 2007, EXPT APPROACHES PHON, P104 Stevens K.N., 1998, ACOUSTIC PHONETICS STEVENS KN, 1963, J SPEECH HEAR RES, V6, P111 VENY J, 1983, ELS PARLARS CATALANS NR 60 TC 8 Z9 8 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2009 VL 51 IS 3 BP 240 EP 258 DI 10.1016/j.specom.2008.09.002 PG 19 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 405DW UT WOS:000263203900005 ER PT J AU Ding, HJ Soon, IY Koh, SN Yeo, CK AF Ding, Huijun Soon, Ing Yann Koh, Soo Nee Yeo, Chai Kiat TI A spectral filtering method based on hybrid wiener filters for speech enhancement SO SPEECH COMMUNICATION LA English DT Article DE Speech enhancement; Wiener filter; Spectrogram filtering; Noise reduction ID NOISE SUPPRESSION FILTER; SUBTRACTION; ESTIMATOR AB It is well known that speech enhancement using spectral filtering will result in residual noise. Residual noise which is musical in nature is very annoying to human listeners. Many speech enhancement approaches assume that the transform coefficients are independent of one another and can thus be attenuated separately, thereby ignoring the correlations that exist between different time frames and within each frame. This paper proposes a single channel speech enhancement system which exploits such correlations between the different time frames to further reduce residual noise. Unlike other 2D speech enhancement techniques which apply a post-processor after some classical algorithms such as spectral subtraction, the proposed approach uses a hybrid Wiener spectrogram filter (HWSF) for effective noise reduction, followed by a multi-blade post-processor which exploits the 2D features of the spectrogram to preserve the speech quality and to further reduce the residual noise. This results in pleasant-sounding speech for human listeners. Spectrogram comparisons show that in the proposed scheme, musical noise is significantly reduced. The effectiveness of the proposed algorithm is further confirmed through objective assessments and informal subjective listening tests. (c) 2008 Elsevier B.V. All rights reserved. C1 [Ding, Huijun; Soon, Ing Yann; Koh, Soo Nee] Nanyang Technol Univ, Sch Elect & Elect Engn, Singapore 639798, Singapore. [Yeo, Chai Kiat] Nanyang Technol Univ, Sch Comp Engn, Singapore 639798, Singapore. RP Ding, HJ (reprint author), Nanyang Technol Univ, Sch Elect & Elect Engn, Singapore 639798, Singapore. EM ding0032@ntu.edu.sg; eiysoon@ntu.edu.sg; esnkoh@ntu.edu.sg; asckyeo@ntu.edu.sg RI KOH, Soo Ngee/A-5081-2011; Soon, Ing Yann/A-5173-2011 CR BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 Cappe O, 1994, IEEE T SPEECH AUDI P, V2, P345, DOI 10.1109/89.279283 CHENG YM, 1991, IEEE T SIGNAL PROCES, V39, P1943, DOI 10.1109/78.134427 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 Evans N. W. D., 2002, Proceedings of the Fourth IASTED International Conference Signal and Image Processing Goh Z, 1998, IEEE T SPEECH AUDI P, V6, P287 Hu Y., 2006, P INTERSPEECH 2006 P Jensen J, 2001, IEEE T SPEECH AUDI P, V9, P731, DOI 10.1109/89.952491 LIN Z, 2003, P 2 IEEE INT WORKSH, P61 MCAULAY RJ, 1980, IEEE T ACOUST SPEECH, V28, P137, DOI 10.1109/TASSP.1980.1163394 Plapous C, 2006, IEEE T AUDIO SPEECH, V14, P2098, DOI 10.1109/TASL.2006.872621 Quatieri T.
F., 2001, DISCRETE TIME SPEECH Soon IY, 2003, IEEE T SPEECH AUDI P, V11, P717, DOI 10.1109/TSA.2003.816063 Soon IY, 1999, SIGNAL PROCESS, V75, P151, DOI 10.1016/S0165-1684(98)00230-8 WANG SH, 1992, IEEE J SEL AREA COMM, V10, P819, DOI 10.1109/49.138987 WHIPPLE G, 1994, P ICASSP, V1 NR 16 TC 13 Z9 13 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2009 VL 51 IS 3 BP 259 EP 267 DI 10.1016/j.specom.2008.09.003 PG 9 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 405DW UT WOS:000263203900006 ER PT J AU Inanoglu, Z Young, S AF Inanoglu, Zeynep Young, Steve TI Data-driven emotion conversion in spoken English SO SPEECH COMMUNICATION LA English DT Article DE Emotion conversion; Expressive speech synthesis; Prosody modeling ID VOICE CONVERSION; SPEECH SYNTHESIS AB This paper describes an emotion conversion system that combines independent parameter transformation techniques to endow a neutral utterance with a desired target emotion. A set of prosody conversion methods have been developed which utilise a small amount of expressive training data (similar to 15 min) and which have been evaluated for three target emotions: anger, surprise and sadness. The system performs F0 conversion at the syllable level while duration conversion takes place at the phone level using a set of linguistic regression trees. Two alternative methods are presented as a means to predict F0 contours for unseen utterances. Firstly, an HMM-based approach uses syllables as linguistic building blocks to model and generate F0 contours. Secondly, an F0 segment selection approach expresses F0 conversion as a search problem, where syllable-based F0 contour segments from a target speech corpus are spliced together under contextual constraints. To complement the prosody modules, a GMM-based spectral conversion function is used to transform the voice quality. Each independent module and the combined emotion conversion framework were evaluated through a perceptual study. Preference tests demonstrated that each module contributes a measurable improvement in the perception of the target emotion. Furthermore, an emotion classification test showed that converted utterances with either F0 generation technique were able to convey the desired emotion above chance level. However, F0 segment selection outperforms the HMM-based F0 generation method both in terms of emotion recognition rates as well as intonation quality scores, particularly in the case of anger and surprise. Using segment selection, the emotion recognition rates for the converted neutral utterances were comparable to the same utterances spoken directly in the target emotion. (c) 2008 Elsevier B.V. All rights reserved. C1 [Inanoglu, Zeynep; Young, Steve] Univ Cambridge, Dept Engn, Cambridge CB2 1PZ, England. RP Young, S (reprint author), Univ Cambridge, Dept Engn, Trumpington St, Cambridge CB2 1PZ, England. 
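The F0 segment selection approach described in the abstract above treats prosody conversion as a search over syllable-sized contour segments. The sketch below is a minimal, hypothetical rendering of that idea rather than the authors' system: each syllable is given a hand-made list of candidate F0 segments with a precomputed target cost, the join cost is simply the F0 jump across the syllable boundary, and the best sequence is found by dynamic programming.

# Minimal sketch (hypothetical data and costs): choose one F0 segment per
# syllable so that the summed target cost plus join cost is minimised.
def select_f0_segments(candidates, join_weight=1.0):
    """candidates: list over syllables; each entry is a list of
    (target_cost, f0_contour) pairs, with f0_contour a list of Hz values."""
    n = len(candidates)
    # best[i][k] = (cumulative cost, best predecessor index) for candidate k of syllable i
    best = [[(tc, -1) for tc, _ in candidates[0]]]
    for i in range(1, n):
        row = []
        for tc, contour in candidates[i]:
            costs = [
                prev_cost + tc + join_weight * abs(candidates[i - 1][j][1][-1] - contour[0])
                for j, (prev_cost, _) in enumerate(best[i - 1])
            ]
            j_best = min(range(len(costs)), key=costs.__getitem__)
            row.append((costs[j_best], j_best))
        best.append(row)
    # backtrack from the cheapest final candidate
    k = min(range(len(best[-1])), key=lambda j: best[-1][j][0])
    path = [k]
    for i in range(n - 1, 0, -1):
        k = best[i][k][1]
        path.append(k)
    path.reverse()
    return [candidates[i][k][1] for i, k in enumerate(path)]

# toy usage: two syllables, two candidate contours each
cands = [
    [(0.2, [180, 190, 200]), (0.5, [160, 165, 170])],
    [(0.1, [205, 210, 200]), (0.3, [150, 140, 130])],
]
print(select_f0_segments(cands))

In the paper the candidate segments, contextual constraints and costs are derived from the expressive target corpus; the values above are placeholders chosen only to make the search runnable.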
EM zeynep@gatesscholar.org; sjy@eng.cam.ac.uk CR Banziger T, 2005, SPEECH COMMUN, V46, P252, DOI 10.1016/j.specom.2005.02.016 BARRA R, 2007, P INT Boersma P., 2005, PRAAT DOING PHONETIC BULUT M, 2007, P ICASSP FALLSIDE F, 1987, SPEECH LANG, V2, P27 GILLETT B, 2003, P EUR GOUBANOVA O, 2003, P INT C PHON SCI, V3, P2349 HELANDER E, 2007, P ICASSP, V4, P509 Iida A, 2003, SPEECH COMMUN, V40, P161, DOI 10.1016/S0167-6393(02)00081-X INANOGLU Z, 2005, P AFF COMP INT INT Inanoglu Z., 2003, THESIS U CAMBRIDGE INANOGLU Z, 2008, THESIS U CAMBRIDGE JENSEN U, 1994, COMPUT SPEECH LANG, V8, P227 JONES D, 1996, CAMBRIDGE ENGLISH PR Kain A., 1998, P ICASSP, V1, P285, DOI 10.1109/ICASSP.1998.674423 Kawanami H., 1999, IEEE T SPEECH AUDIO, V7, P697 ROSS K, 1994, P ESCA IEEE WORKSH S SCHERER K, 2004, P SPEECH PROS SCHRODER M, 1999, P EUROSPEECH, V1, P561 SILVERMAN K, 1992, P ICSLP, V40, P862 Stylianou Y, 1998, IEEE T SPEECH AUDI P, V6, P131, DOI 10.1109/89.661472 Tao JH, 2006, IEEE T AUDIO SPEECH, V14, P1145, DOI 10.1109/TASL.2006.876113 TIAN J, 2007, P INT TODA T, 2005, P INT Tokuda K., 2002, IEEE SPEECH SYNTH WO Tokuda K, 2002, IEICE T INF SYST, VE85D, P455 TOKUDA K, 2000, P ICASSP, V3, P1315 TSUZUKI H, 2004, P ICSLP, V2, P1185 Vroomen J., 1993, P EUR, V1, P577 Wu CH, 2006, IEEE T AUDIO SPEECH, V14, P1109, DOI 10.1109/TASL.2006.876112 YAMAGISHI J, 2003, P EUROSPEECH, V3, P2461 YE H, 2005, THESIS CAMBRIDGE U YILDIRIM S, 2004, P ICSLP YOSHIMURA T, 1998, P ICSLP Young S. J., 2006, HTK BOOK VERSION 3 4 NR 35 TC 11 Z9 11 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2009 VL 51 IS 3 BP 268 EP 283 DI 10.1016/j.specom.2008.09.006 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 405DW UT WOS:000263203900007 ER PT J AU Breslin, C Gales, MJF AF Breslin, C. Gales, M. J. F. TI Directed decision trees for generating complementary systems SO SPEECH COMMUNICATION LA English DT Article DE Speech recognition; Complementary systems; System combination ID NEWS TRANSCRIPTION SYSTEM; MODELS AB Many large vocabulary continuous speech recognition systems use a combination of multiple systems to obtain the final hypothesis. These complementary systems are typically found in an ad-hoc manner, by testing combinations of diverse systems and selecting the best. This paper presents a new algorithm for generating complementary systems by altering the decision tree generation, and a divergence measure for comparing decision trees. In this paper, the decision tree is biased against clustering states which have previously led to confusions. This leads to a system which concentrates states in contexts that were previously confusable. Thus these systems tend to make different errors. Results are presented on two broadcast news tasks - Mandarin and Arabic. The results show that combining multiple systems built from directed decision trees give gains in performance when confusion network combination is used as the method of combination. The results also show that the gains achieved using the directed tree algorithm are additive to the gains achieved using other techniques that have been empirically shown as complementary. (c) 2008 Elsevier B.V. All rights reserved. C1 [Breslin, C.; Gales, M. J. F.] Univ Cambridge, Dept Engn, Cambridge CB2 1PZ, England. RP Breslin, C (reprint author), Univ Cambridge, Dept Engn, Trumpington St, Cambridge CB2 1PZ, England. 
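The abstract above mentions a divergence measure for comparing decision trees. One simple, illustrative way to quantify how differently two state-clustering trees partition the same contexts is a Rand-style pair count: the fraction of context pairs that one tree places in the same leaf while the other separates them. The sketch below assumes each tree is available as a plain mapping from context to leaf id; it is not necessarily the measure used in the paper.

# Minimal sketch: a Rand-style divergence between two state-clustering trees,
# each given as a dict mapping a (triphone) context to the id of the leaf the
# tree assigns it to. Illustrative only; the paper's measure may differ.
from itertools import combinations

def tree_divergence(leaf_a, leaf_b):
    contexts = sorted(set(leaf_a) & set(leaf_b))
    disagreements, pairs = 0, 0
    for c1, c2 in combinations(contexts, 2):
        together_a = leaf_a[c1] == leaf_a[c2]
        together_b = leaf_b[c1] == leaf_b[c2]
        disagreements += together_a != together_b
        pairs += 1
    return disagreements / pairs if pairs else 0.0

# toy usage with made-up contexts: identical trees give 0, different ones > 0
t1 = {"a-b+c": 0, "a-b+d": 0, "x-b+c": 1}
t2 = {"a-b+c": 0, "a-b+d": 1, "x-b+c": 1}
print(tree_divergence(t1, t1), tree_divergence(t1, t2))

A value of 0 means the two trees induce identical clusterings of the shared contexts; larger values indicate increasingly complementary partitions.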
EM catherine.breslin@crl.toshiba.co.uk; mjfg@eng.cam.ac.uk CR Arslan LM, 1999, IEEE T SPEECH AUDI P, V7, P46, DOI 10.1109/89.736330 Breiman L, 1996, MACH LEARN, V24, P123, DOI 10.1023/A:1018054314350 Breiman L, 2001, MACH LEARN, V45, P5, DOI 10.1023/A:1010933404324 BRESLIN C, 2006, P ICSLP BRESLIN C, 2007, P ICASSP BRESLIN C, 2007, P INT BUCKWATER T, 2004, LDC2004L02 Cincarek T, 2006, IEICE T INF SYST, VE89D, P962, DOI 10.1093/ietisy/e89-d.3.962 DIETTERICH T, 1999, MACH LEARN, V12, P1 Dietterich TG, 2000, LECT NOTES COMPUT SC, V1857, P1 DIMITRAKAKIS C, 2004, P ICASSP Evermann G., 2000, P SPEECH TRANSCR WOR FISCUS JG, 1997, P IEEE ASRU WORKSH Freund Y., 1996, P 13 INT C MACH LEAR Gales M.J.F, 2007, P ASRU Gales MJF, 2006, IEEE T AUDIO SPEECH, V14, P1513, DOI 10.1109/TASL.2006.878264 Gauvain JL, 2002, SPEECH COMMUN, V37, P89, DOI 10.1016/S0167-6393(01)00061-9 GILLICK L, 1989, SOME STAT ISSUES COM HAIN T, 2007, AMI SYSTEM TRANSCRIP Hoffmeister B., 2006, P ICSLP HU R, 2007, P ICASSP HUANG J, 2007, P INT HWANG M, 2007, ADV MANDARIN BROADCA Jiang H, 2005, SPEECH COMMUN, V45, P455, DOI 10.1016/j.specrom.2004.12.004 KAMM T, 2003, P IEEE WORKSH AUT SP KULLBACK S, 1951, ANN MATH STAT, V22, P79, DOI 10.1214/aoms/1177729694 LAMEL L, 2002, SPEECH LANG Mangu L., 1999, P EUR MEYER C, 2002, P ICASSP NGUYEN L, 2002, SPEECH LANG Nock H.J., 1997, P EUR Odell J.J., 1995, THESIS U CAMBRIDGE POVEY D, 2005, THESIS U CAMBRIDGE RAMABHADRAN B, 2006, IBM 2006 SPEECH TRAN RAND WM, 1971, J AM STAT ASSOC, V66, P846, DOI 10.2307/2284239 SCHWENK H, 1999, P ICASSP Sinha R., 2006, P ICASSP Siohan O., 2005, P ICASSP STUKER S, 2006, P ICSLP XUE J, 2007, P ICASSP Zhang R., 2004, P ICSLP ZHANG R, 2003, P ICASSP ZWEIG G, 2000, P ICASSP NR 43 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2009 VL 51 IS 3 BP 284 EP 295 DI 10.1016/j.specom.2008.09.004 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 405DW UT WOS:000263203900008 ER PT J AU Knoll, M Scharrer, L Costall, A AF Knoll, Monja Scharrer, Lisa Costall, Alan TI Are actresses better simulators than female students? The effects of simulation on prosodic modifications of infant- and foreigner-directed speech SO SPEECH COMMUNICATION LA English DT Article DE IDS; Simulated speech; Actresses; Hyperarticulation ID MOTHERS SPEECH; VOCAL EXPRESSION; MATERNAL SPEECH; EMOTION; LANGUAGE; COMMUNICATION; INTONATION; PREFERENCE; CLEAR; CUES AB Previous research has used simulated interactions to investigate emotional and linguistic speech phenomena. Here, we evaluate the use of these simulated interactions by comparing speech addressed to imaginary speech partners produced by psychology students and actresses, to an existing study of natural speech addressed to genuine interaction partners. Simulated infant-(IDS), foreigner-(FDS) and adult-directed speech (ADS) was obtained from 10 female students and 10 female actresses. These samples were acoustically analysed and rated for positive vocal affect. Our results for affect for actresses and student speakers are consistent with previous findings using natural interactions, with IDS rated higher in positive affect than ADS/FDS, and ADS rated higher than FDS. In contrast to natural speech, acoustic analyses of IDS produced by student speakers revealed a smaller vowel space than ADS/FDS, with no significant difference between those adult conditions. 
In contrast to natural speech (IDS > ADS/FDS), the mean F-0 of IDS was significantly higher than ADS, but not than FDS. Acoustic analyses of actress speech were more similar to natural speech, with IDS vowel space significantly larger than ADS vowel space, and with the mean FDS vowel space positioned between these two conditions. IDS mean F-0 of the actress speakers was significantly higher than both ADS and FDS. These results indicate that training plays an important role in eliciting natural-like speech in simulated interactions, and that participants without training are less successful in reproducing such speech. Speech obtained from simulated interactions should therefore be used with caution, and level of experience and training of the speakers should be taken into account. (c) 2008 Elsevier B.V. All rights reserved. C1 [Knoll, Monja; Scharrer, Lisa; Costall, Alan] Univ Portsmouth, Dept Psychol, Portsmouth PO1 2DY, Hants, England. [Scharrer, Lisa] Heidelberg Univ, Inst Psychol, D-69117 Heidelberg, Germany. RP Knoll, M (reprint author), Univ Portsmouth, Dept Psychol, King Henry Bldg,King Henry 1 St, Portsmouth PO1 2DY, Hants, England. EM monja.knoll@port.ac.uk; lisa.scharrer@port.ac.uk; alan.costall@port.ac.uk FU Economic and Social Research Council (ESRC) FX We thank Dr Maria Uther (Brunel University, UK), Dr Stig Walsh (NHM London) and Dr Darren Van Laar (University of Portsmouth) for useful comments on the project. Our particular thanks go to David Bauckham and the 'The Bridge Theatre Training Company' for their invaluable help in recruiting the actresses used in this study. This research was supported by a grant from the Economic and Social Research Council (ESRC) to Monja Knoll. CR Andruski J. E., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607913 Banse R, 1996, J PERS SOC PSYCHOL, V70, P614, DOI 10.1037/0022-3514.70.3.614 Biersack S., 2005, P 9 EUR C SPEECH COM, P2401 Boersma P., 2006, PRAAT DOING PHONETIC Burnham D, 2002, SCIENCE, V296, P1435, DOI 10.1126/science.1069587 Englund KT, 2005, J PSYCHOLINGUIST RES, V34, P259, DOI 10.1007/s10936-005-3640-7 FERNALD A, 1984, DEV PSYCHOL, V20, P104, DOI 10.1037//0012-1649.20.1.104 FERNALD A, 1987, INFANT BEHAV DEV, V10, P279, DOI 10.1016/0163-6383(87)90017-8 GRIESER DL, 1988, DEV PSYCHOL, V24, P14, DOI 10.1037/0012-1649.24.1.14 Kitamura C, 2003, INFANCY, V4, P85, DOI 10.1207/S15327078IN0401_5 Kitamura C., 1998, ADV INFANCY RES, V12, P221 Knoll MA, 2007, SYST ASSOC SPEC VOL, V74, P299 KRAMER E, 1964, J ABNORM SOC PSYCH, V68, P390, DOI 10.1037/h0042473 Kuhl Patricia K., 1999, J ACOUST SOC AM, V105.2, P1095, DOI 10.1121/1.425135 Kuhl PK, 2004, NAT REV NEUROSCI, V5, P831, DOI 10.1038/nrn1533 Kuhl PK, 1997, SCIENCE, V277, P684, DOI 10.1126/science.277.5326.684 Liu HM, 2003, DEVELOPMENTAL SCI, V6, pF1, DOI 10.1111/1467-7687.00275 PAPOUSEK M, 1991, INFANT BEHAV DEV, V14, P415, DOI 10.1016/0163-6383(91)90031-M PAPOUSEK M, 1991, APPL PSYCHOLINGUIST, V12, P481, DOI 10.1017/S0142716400005889 PICHENY MA, 1985, J SPEECH HEAR RES, V28, P96 Schaeffler F., 2006, 3 SPEECH PROS C DRES Scherer KR, 2007, EMOTION, V7, P158, DOI 10.1037/1528-3542.7.1.158 Scherer KR, 2001, J CROSS CULT PSYCHOL, V32, P76, DOI 10.1177/0022022101032001009 Scherer KR, 2003, SPEECH COMMUN, V40, P227, DOI 10.1016/S0167-6393(02)00084-5 Smiljanic R, 2005, J ACOUST SOC AM, V118, P1677, DOI 10.1121/1.2000788 STERN DN, 1983, J CHILD LANG, V10, P1 Stewart M. 
A., 1982, J LANG SOC PSYCHOL, V1, P91, DOI 10.1177/0261927X8200100201 Street R. L., 1983, J LANG SOC PSYCHOL, V2, P37, DOI 10.1177/0261927X8300200103 Trainor LJ, 2000, PSYCHOL SCI, V11, P188, DOI 10.1111/1467-9280.00240 Uther M, 2007, SPEECH COMMUN, V49, P2, DOI 10.1016/j.specom.2006.10.003 VISCOVICH N, 2003, MOTOR SKILLS, V96, P759 WALLBOTT HG, 1986, J PERS SOC PSYCHOL, V51, P690, DOI 10.1037//0022-3514.51.4.690 WERKER JF, 1994, INFANT BEHAV DEV, V17, P323, DOI 10.1016/0163-6383(94)90012-4 NR 33 TC 7 Z9 7 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2009 VL 51 IS 3 BP 296 EP 305 DI 10.1016/j.specom.2008.10.001 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 405DW UT WOS:000263203900009 ER PT J AU Kim, W Hansen, JHL AF Kim, Wooil Hansen, John H. L. TI Feature compensation in the cepstral domain employing model combination SO SPEECH COMMUNICATION LA English DT Article DE Speech recognition; Feature compensation; Model combination; Multiple models; Mixture sharing ID HIDDEN MARKOV-MODELS; SPEECH RECOGNITION; SPEAKER ADAPTATION; NOISE; ENHANCEMENT; CLASSIFICATION; STRESS AB In this paper, we present an effective cepstral feature compensation scheme which leverages knowledge of the speech model in order to achieve robust speech recognition. In the proposed scheme, the requirement for a prior noisy speech database in off-line training is eliminated by employing parallel model combination for the noise-corrupted speech model. Gaussian mixture models of clean speech and noise are used for the model combination. The adaptation of the noisy speech model is possible only by updating the noise model. This method has the advantage of reduced computational expenses and improved accuracy for model estimation since it is applied in the cepstral domain. In order to cope with time-varying background noise, a novel interpolation method of multiple models is employed. By sequentially calculating the posterior probability of each environmental model, the compensation procedure can be applied on a frame-by-frame basis. In order to reduce the computational expense due to the multiple-model method, a technique of sharing similar Gaussian components is proposed. Acoustically similar components across an inventory of environmental models are selected by the proposed sub-optimal algorithm which employs the Kullback-Leibler similarity distance. The combined hybrid model, which consists of the selected Gaussian components, is used for noisy speech model sharing. The performance is examined using Aurora2 and speech data for an in-vehicle environment. The proposed feature compensation algorithm is compared with standard methods in the field (e.g., CMN, spectral subtraction, RATZ). The experimental results demonstrate that the proposed feature compensation schemes are very effective in realizing robust speech recognition in adverse noisy environments. The proposed model combination-based feature compensation method is superior to existing model-based feature compensation methods. Of particular interest is that the proposed method shows up to an 11.59% relative WER reduction compared to the ETSI AFE front-end method. The multi-model approach is effective at coping with changing noise conditions for input speech, producing comparable performance to the matched model condition.
Applying the mixture sharing method brings a significant reduction in computational overhead, while maintaining recognition performance at a reasonable level with near real-time operation. (C) 2008 Elsevier B.V. All rights reserved. C1 [Hansen, John H. L.] Univ Texas Dallas, Dept Elect Engn, Erik Jonsson Sch Engn & Comp Sci, CRSS, Richardson, TX 75080 USA. RP Hansen, JHL (reprint author), Univ Texas Dallas, Dept Elect Engn, Erik Jonsson Sch Engn & Comp Sci, CRSS, 2601 N Floyd Rd,EC33, Richardson, TX 75080 USA. EM john.hansen@utdallas.edu CR Acero A., 1993, ACOUSTIC ENV ROBUSTN Akbacak M, 2007, IEEE T AUDIO SPEECH, V15, P465, DOI 10.1109/TASL.2006.881694 Angkititrakul P., 2007, IEEE INT VEH S, P566 BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 Droppo J., 2001, EUROSPEECH2001, P217 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 *ETSI, 2000, 201108V112200004 ETS *ETSI, 2002, 202050V111200210 ETS Gales MJF, 1996, IEEE T SPEECH AUDI P, V4, P352, DOI 10.1109/89.536929 Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278 Hansen JH, 2005, IEEE T SPEECH AUDI P, V13, P712, DOI 10.1109/TSA.2005.852088 Hansen JHL, 1994, IEEE T SPEECH AUDI P, V2, P598, DOI 10.1109/89.326618 HANSEN JHL, 1991, IEEE T SIGNAL PROCES, V39, P795, DOI 10.1109/78.80901 Hansen JHL, 1996, IEEE T SPEECH AUDI P, V4, P307, DOI 10.1109/89.506935 Hansen JHL, 1996, SPEECH COMMUN, V20, P151, DOI 10.1016/S0167-6393(96)00050-7 HANSEN JHL, 1995, IEEE T SPEECH AUDI P, V3, P169, DOI 10.1109/89.388143 Hansen JHL, 2004, DSP IN VEHICLE MOBIL HIRSCH HG, 2000, AURORA EXPT FRAMEWOR KAWAGUCHI N, 2004, DSP IN VEHICLE MOBIL, pCH1 Kim NS, 2002, SPEECH COMMUN, V37, P231, DOI 10.1016/S0167-6393(01)00013-9 KIM W, 2003, EUROSPEECH 2003, P677 KIM W, 2004, ICASSP2004, P989 LEE CH, 1991, IEEE T SIGNAL PROCES, V39, P806, DOI 10.1109/78.80902 Lee K.-F., 1989, AUTOMATIC SPEECH REC LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010 Martin R., 1994, EUSIPCO 94, P1182 Morales N, 2006, INT CONF ACOUST SPEE, P533 Moreno P.J., 1996, THESIS CARNEGIE MELL Moreno PJ, 1998, SPEECH COMMUN, V24, P267, DOI 10.1016/S0167-6393(98)00025-9 RAJ B, 2005, IEEE SIGNAL PROCESS, V22 SASOU A, 2003, EUROSPEECH2003, P29 Sasou A., 2004, ICSLP2004, P121 Schulte-Fortkamp B., 2007, ACOUST TODAY, V3, P7, DOI 10.1121/1.2961148 SEGURA JC, 2001, EUROSPEECH2001, P221 SINGH R, 2002, CHAPTER CRC HDB NOIS Stouten V., 2004, ICASSP2004, P949 VARGA AP, 1990, INT CONF ACOUST SPEE, P845, DOI 10.1109/ICASSP.1990.115970 Westphal M, 2001, INT CONF ACOUST SPEE, P221, DOI 10.1109/ICASSP.2001.940807 WOMACK BD, 1996, IEEE P INT C AC SPEE, V1, P53 Womack BD, 1999, IEEE T SPEECH AUDI P, V7, P668, DOI 10.1109/89.799692 NR 40 TC 22 Z9 26 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
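The component-sharing step described in the Kim and Hansen abstract above relies on a Kullback-Leibler similarity distance between Gaussian components. The sketch below shows the closed-form KL divergence between two diagonal-covariance Gaussians, symmetrised and applied to a toy inventory with an arbitrary sharing threshold; NumPy is assumed, the means, variances and threshold are invented for illustration, and the paper's sub-optimal selection algorithm is more elaborate than this pairwise loop.

# Minimal sketch: symmetrised KL divergence between diagonal-covariance
# Gaussians, used here only to flag component pairs similar enough to share.
import numpy as np

def kl_diag(mu_p, var_p, mu_q, var_q):
    """KL(p || q) for diagonal-covariance Gaussians."""
    mu_p, var_p, mu_q, var_q = map(np.asarray, (mu_p, var_p, mu_q, var_q))
    return 0.5 * np.sum(
        np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0
    )

def symmetric_kl(a, b):
    return kl_diag(a[0], a[1], b[0], b[1]) + kl_diag(b[0], b[1], a[0], a[1])

# toy inventory: (mean, variance) pairs in a 3-dimensional cepstral space
components = [
    (np.array([0.0, 1.0, -0.5]), np.array([1.0, 0.5, 0.8])),
    (np.array([0.1, 1.1, -0.4]), np.array([1.1, 0.5, 0.7])),
    (np.array([3.0, -2.0, 1.5]), np.array([0.4, 0.9, 1.2])),
]

threshold = 1.0  # arbitrary similarity cut-off for this illustration
for i in range(len(components)):
    for j in range(i + 1, len(components)):
        d = symmetric_kl(components[i], components[j])
        if d < threshold:
            print(f"components {i} and {j} could be shared (sym-KL = {d:.3f})")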
PD FEB PY 2009 VL 51 IS 2 BP 83 EP 96 DI 10.1016/j.specom.2008.06.004 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 391AC UT WOS:000262203800001 ER PT J AU Jallon, JF Berthommier, F AF Jallon, Julie Fontecave Berthommier, Frederic TI A semi-automatic method for extracting vocal tract movements from X-ray films SO SPEECH COMMUNICATION LA English DT Article DE Cineradiography; Contour extraction; Low-frequency DCT components; Vocal tract movements ID SPEECH PRODUCTION AB Despite the development of new imaging techniques, existing X-ray data remain an appropriate tool to study speech production phenomena. However, to exploit these images, the shapes of the vocal tract articulators must first be extracted. This task, usually performed manually, is long and laborious. This paper describes a semi-automatic technique for facilitating the extraction of vocal tract contours from complete sequences of large existing cineradiographic databases in the context of continuous speech production. The proposed method efficiently combines the human expertise required for marking a small number of key images and an automatic indexing of the video data to infer dynamic 2D data. Manually acquired geometrical data are associated with each image of the sequence via a similarity measure based on the low-frequency Discrete Cosine Transform (DCT) components of the images. Moreover, to reduce the reconstruction error and improve the geometrical contour estimation, we perform post-processing treatments, such as neighborhood averaging and temporal filtering. The method is applied independently for each articulator (tongue, velum, lips, and mandible). Then the acquired contours are combined to reconstruct the movements of the entire vocal tract. We carry out evaluations, including comparisons with manual markings and with another semi-automatic method. (C) 2008 Elsevier B.V. All rights reserved. C1 [Jallon, Julie Fontecave; Berthommier, Frederic] Domaine Univ, Dept Speech & Cognit, GIPSA Lab, F-38402 St Martin Dheres, France. RP Berthommier, F (reprint author), Domaine Univ, Dept Speech & Cognit, GIPSA Lab, BP 46, F-38402 St Martin Dheres, France. EM Julie.Fontecave@gipsa-lab.inpg.fr; Frederic.Berthommier@gipsa-lab.inpg.fr RI Fontecave-Jallon, Julie/M-5807-2014 CR Akgul YS, 1999, IEEE T MED IMAGING, V18, P1035, DOI 10.1109/42.811315 ARNAL A, 2000, P 23 JOURN ET PAR AU BADIN P, 1995, P INT C AC TRONDH NO BADIN P, 1998, P INT C SPOK LANG PR BERTHOMMIER F, 2004, P INT C AC SPEECH SI Bothorel A., 1986, CINERADIOGRAPHIE VOY DEPAULA H, 2006, SPEECH PRODUCTION MO Fant G., 1960, ACOUSTIC THEORY SPEE FONTECAVE J, 2005, P EUR C SPEECH COMM GUIARDMARIGNY T, 1996, PROGR SPEECH SYNTHES HARDCAST.WJ, 1972, PHONETICA, V25, P197 HECKMANN M, 2000, P INT C SPOK LANG PR HECKMANN M, 2003, P AUD VIS SPEECH PRO HEINZ JM, 1964, J ACOUST SOC AM, V36 KASS M, 1987, INT J COMPUT VISION, V4, P321 Laprie Y., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607097 MAEDA S, 1979, 20 JOURN ET PAR GREN, P152 MERMELST.P, 1973, J ACOUST SOC AM, V53, P1070, DOI 10.1121/1.1913427 MUNHALL KG, 1995, J ACOUST SOC AM, V98, P1222, DOI 10.1121/1.413621 Narayanan S, 2004, J ACOUST SOC AM, V115, P1771, DOI 10.1121/1.1652588 PERKELL J, 1972, J ACOUST SOC AM, V92, P3078 Perkell JS, 1969, PHYSL SPEECH PRODUCT, V53 Potamianos G, 1998, P IEEE INT C IM PROC, V3, P173, DOI 10.1109/ICIP.1998.999008 Rao K.
R., 1990, DISCRETE COSINE TRAN ROY JP, 2003, INTRIC INTERFACE TRA, P163 RUBIN P, 1996, P 4 INT SEM SPEECH P STEVENS KN, 1963, STL QPSR, V4, P11 Thimm G., 1999, P EUR C SPEECH COMM, P157 THIMM G, 1998, ILLUMINATION ROBUST TIEDE MK, 1994, P INT C SPOK LANG PR WOOD S, 1979, J PHONETICS, V7, P25 NR 31 TC 5 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2009 VL 51 IS 2 BP 97 EP 115 DI 10.1016/j.specom.2008.06.005 PG 19 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 391AC UT WOS:000262203800002 ER PT J AU den Ouden, H Noordman, L Terken, J AF den Ouden, Hanny Noordman, Leo Terken, Jacques TI Prosodic realizations of global and local structure and rhetorical relations in read aloud news reports SO SPEECH COMMUNICATION LA English DT Article DE Discourse; Text structure; Prosody; Global structure; Local structure; Rhetorical relations ID COHERENCE RELATIONS; DISCOURSE AB The aim of this research is to study effects of global and local structure of texts and of rhetorical relations between sentences on the prosodic realization of sentences in read aloud text. Twenty texts were analyzed using Rhetorical Structure Theory. Based on these analyses, the global structure in terms of hierarchical level, the local structure in terms of the relative importance of text segments and the rhetorical relations between text segments were identified. The texts were read aloud. Pause durations preceding segments, F0-maxima and articulation rates of the segments were measured. It was found that speakers give prosodic indications about hierarchical level by means of variations in pause duration and pitch range: the higher the segments are connected in the text structure, the longer the preceding pauses and the higher the F0-maxima are realized. Also, it was found that speakers articulate important segments more slowly than unimportant segments, and that they read aloud causally related segments with shorter in-between pauses and at a faster rate than non-causally related segments. We conclude that variation in pause duration and F0-maximum is a robust means for speakers to express the global structure of texts, although this does not apply to all speakers. Speakers also vary pause duration and articulation rate to indicate importance of sentences and meaning relations between sentences. (C) 2008 Elsevier B.V. All rights reserved. C1 [den Ouden, Hanny] Univ Utrecht, Fac Humanities, NL-3512 JK Utrecht, Netherlands. [Noordman, Leo] Tilburg Univ, Fac Arts, NL-5000 LE Tilburg, Netherlands. [Terken, Jacques] Eindhoven Univ Technol, Dept Ind Design, NL-5600 MB Eindhoven, Netherlands. RP den Ouden, H (reprint author), Univ Utrecht, Fac Humanities, Trans 10, NL-3512 JK Utrecht, Netherlands. EM Hanny.denOuden@let.uu.nl; noordman@uvt.nl; j.m.b.terken@tue.nl FU SOBU (Cooperation of Brabant Universities) FX The research reported in this paper was funded by SOBU (Cooperation of Brabant Universities). Our thanks go to Leo Vogten and Jan Roelof de Pijper for making available their programming software. CR Bateman JA, 1997, DISCOURSE PROCESS, V24, P3 BRUBAKER RS, 1972, J PSYCHOLINGUIST RES, V1, P141, DOI 10.1007/BF01068103 Cooper W.
E., 1980, SYNTAX SPEECH den Ouden H., 2004, PROSODIC REALIZATION DENOUDEN H, 2001, P 7 EUR C SPEECH COM, P91 DENOUDEN H, 1998, IPO ANN PROGR REPORT, V33, P129 HERMES DJ, 1988, J ACOUST SOC AM, V83, P257, DOI 10.1121/1.396427 Hirschberg J., 1992, P SPEECH NAT LANG WO, P441, DOI 10.3115/1075527.1075632 Hirschberg JG, 1996, NATO ADV SCI I A-LIF, V286, P293 Lehiste I., 1975, STRUCTURE PROCESS SP, P195 MANNS MP, 1988, CLIN LAB MED, V8, P281 Noordman L, 1999, AMST STUD THEORY HIS, V176, P133 POTTER A, 2007, THESIS NOVA SE U Prince Ellen, 1981, RADICAL PRAGMATICS, P223 SANDERS TJM, 1992, DISCOURSE PROCESS, V15, P1 Sanders TJM, 2000, DISCOURSE PROCESS, V29, P37, DOI 10.1207/S15326950dp2901_3 SCHILPEROORD J, 1996, THESIS UTRECHT U SILVERMAN T, 1987, THESIS CAMBRIDGE U C Swerts M, 1997, J ACOUST SOC AM, V101, P514, DOI 10.1121/1.418114 THORNDYKE PW, 1977, COGNITIVE PSYCHOL, V9, P77, DOI 10.1016/0010-0285(77)90005-6 THORSEN N, 1985, J ACOUST SOC AM, V80, P1205 Van Donzel M., 1999, THESIS U AMSTERDAM Wennerstrom A, 2001, MUSIC EVERYDAY SPEEC Wichmann A., 2000, INTONATION TEXT DISC YULE G, 1980, LINGUA, V52, P33, DOI 10.1016/0024-3841(80)90016-9 NR 25 TC 14 Z9 14 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2009 VL 51 IS 2 BP 116 EP 129 DI 10.1016/j.specom.2008.06.003 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 391AC UT WOS:000262203800003 ER EF